
Online Bayesian Passive-Aggressive Learning
Tianlin Shi and Jun Zhu.
Journal of Machine Learning Research 2017 (in press)
Documents: [Abstract] [PDF] [Slides at ICML] [ICML Version] [Supp] [Slides at TNList Lab] [Talk]
We present online Bayesian Passive-Aggressive (BayesPA) learning, a generic online learning framework for hierarchical Bayesian models with max-margin posterior regularization. We provide provable Bayesian regret bounds for both averaging classifiers and Gibbs classifiers. We show that BayesPA subsumes standard online Passive-Aggressive (PA) learning and, more importantly, extends naturally to incorporate latent variables for both parametric and nonparametric Bayesian inference, therefore providing great flexibility for explorative analysis. As an important example, we apply BayesPA to topic modeling and derive efficient online learning algorithms for max-margin topic models. We further develop nonparametric BayesPA topic models to resolve the unknown number of topics. Experimental results on 20newsgroups and a large Wikipedia multi-label data set (with 1.1 million training documents and 0.9 million unique terms in the vocabulary) show that our approaches significantly improve time efficiency while maintaining comparable results with their batch counterparts.


Improving Survey Aggregation with Sparsely Represented Signals
Tianlin Shi, Forest Agostinelli, Matthew Staib, David Wipf, Thomas Moscibroda
KDD 2016: The 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
Documents: [Abstract]
[PDF] [Video]
Errata: In theorem 4.3, M should be f, the global mean.
In this paper, we develop a new aggregation technique to reduce the cost of surveying. Our method aims to jointly estimate a vector of target quantities, such as public opinion or voter intent across time, and to maintain good estimates when using only a fraction of the data. Inspired by the James-Stein estimator, we resolve this challenge by shrinking the estimates toward a global mean which is assumed to have a sparse representation in some known basis. This assumption has led to two different methods for estimating the global mean: orthogonal matching pursuit and deep learning. Both significantly reduce the number of samples needed to achieve good estimates of the true means of the data; in the case of the 2012 United States presidential election, they can estimate the outcome while saving hundreds of thousands of samples and maintaining accuracy.
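The shrinkage idea can be sketched in a few lines; a minimal stand-in assuming scalar per-question sample means and known variances (the function and parameter names are illustrative, not from the paper):

```python
def shrink_toward_global_mean(sample_means, global_mean, noise_var, prior_var):
    """Shrink each noisy per-question sample mean toward a global mean.

    A James-Stein-style estimator: the shrinkage weight grows as the
    sampling noise dominates the prior spread around the global mean.
    In the paper the global mean itself is estimated via a sparse
    representation; here it is simply passed in.
    """
    w = noise_var / (noise_var + prior_var)  # shrinkage weight in [0, 1]
    return [w * global_mean + (1.0 - w) * m for m in sample_means]
```

With equal noise and prior variances, each estimate moves halfway toward the global mean, which is the classic bias-variance trade the estimator exploits.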


Max-Margin Deep Generative Models
Chongxuan Li, Jun Zhu, Tianlin Shi and Bo Zhang.
NIPS 2015: The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS).
Documents: [Abstract]
[PDF]
Deep generative models (DGMs) are effective at learning multilayered representations of complex data and performing inference on input data by exploiting their generative ability. However, little work has been done on examining or empowering the discriminative ability of DGMs in making accurate predictions. This paper presents max-margin deep generative models (mmDGMs), which explore the strongly discriminative principle of max-margin learning to improve the discriminative power of DGMs while retaining their generative capability. We develop an efficient doubly stochastic subgradient algorithm for the piecewise-linear objective. Empirical results on the MNIST and SVHN datasets demonstrate that (1) max-margin learning can significantly improve the prediction performance of DGMs while retaining their generative ability; and (2) mmDGMs are competitive with state-of-the-art fully discriminative networks when deep convolutional neural networks (CNNs) are employed as both recognition and generative models.
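The piecewise-linear objective rests on the hinge loss, whose subgradient is what a doubly stochastic subgradient method samples; a minimal binary-classification sketch with a linear score (illustrative, not the mmDGM recognition network):

```python
def hinge_loss(w, x, y):
    # y in {-1, +1}; the score is a linear model <w, x>.
    score = sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - y * score)

def hinge_subgradient(w, x, y):
    # Piecewise linear: the subgradient is -y*x inside the margin
    # (y * score < 1) and zero outside it.
    score = sum(wi * xi for wi, xi in zip(w, x))
    if y * score < 1.0:
        return [-y * xi for xi in x]
    return [0.0 for _ in x]
```

A stochastic update then subtracts a step size times this subgradient for a randomly drawn example, with the second source of stochasticity in mmDGMs coming from sampling the latent variables.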


Learning Where to Sample in Structured Prediction
Tianlin Shi, Jacob Steinhardt and Percy Liang
AISTATS 2015: Accepted as Oral (6.1%).
The 18th International Conference on Artificial Intelligence and Statistics, San Diego, California, USA.
Documents: [Abstract]
[PDF] [Slides]
In structured prediction, most inference algorithms allocate a homogeneous amount of computation to all parts of the output, which can be wasteful when different parts vary widely in difficulty. In this paper, we propose a heterogeneous approach that dynamically allocates computation to the different parts. Given a pre-trained model, we tune its inference algorithm (a sampler) to increase test-time throughput. The inference algorithm is parametrized by a meta-model and trained via reinforcement learning, where actions correspond to sampling candidate parts of the output, and rewards are log-likelihood improvements. The meta-model is based on a set of domain-general meta-features capturing the progress of the sampler. We test our approach on five datasets and show that it attains the same accuracy as Gibbs sampling but is 2 to 5 times faster.
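The core idea of spending computation where it helps most can be caricatured in a few lines; a toy scheduler in which per-position difficulty scores stand in for the learned meta-model's predictions (names are illustrative):

```python
def heterogeneous_schedule(scores, budget):
    """Pick which output positions to resample next, given per-position
    'difficulty' scores (stand-ins for the meta-model's assessment of
    sampler progress). Plain Gibbs sampling sweeps every position
    uniformly; here the budget goes to the highest-scoring positions.
    """
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return ranked[:budget]
```

In the paper the scores come from a trained meta-model over meta-features and the policy is learned with reinforcement learning; this sketch only shows the allocation step.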


Correlated Compressive Sensing for Networked Data
Tianlin Shi, Da Tang, Liwen Xu and Thomas Moscibroda
UAI 2014: 30th Conference on Uncertainty in Artificial Intelligence, Quebec City, Canada, July 2014.
Documents: [Abstract] [PDF]
This paper originates from the course project of Network Science, taught by Prof. Thomas Moscibroda. See our project report.
We consider the problem of recovering sparse, correlated data on networks. To improve accuracy and reduce costs, it is strongly desirable to take the potentially useful side-information of network structure into consideration. In this paper we present a novel correlated compressive sensing method called CorrCS for networked data. By naturally extending Bayesian compressive sensing, we extract correlations from the network topology and encode them into a graphical model as a prior. We then derive posterior inference algorithms for the recovery of jointly sparse and correlated networked data. First, we design algorithms to recover the data based on pairwise correlations between neighboring nodes in the network. Next, we generalize this model through a diffusion process to capture higher-order correlations. Both real-valued and binary data are considered. Our models are extensively tested on several real datasets from social and sensor networks and are shown to outperform baseline compressive sensing models in terms of recovery performance.
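The pairwise model's preference for correlated neighbors can be illustrated by the kind of quadratic smoothness penalty such a prior induces over network edges; a toy sketch, not the paper's exact graphical model:

```python
def pairwise_smoothness(x, edges):
    """Quadratic penalty encouraging neighboring nodes in the network to
    take similar values. A Gaussian pairwise prior over edges penalizes
    recoveries in proportion to this sum; the form here is illustrative.
    """
    return sum((x[i] - x[j]) ** 2 for i, j in edges)
```

A recovery that respects the network (neighbors agree) incurs a small penalty, while one that ignores it incurs a large penalty, which is how the side-information of topology enters the posterior.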


A Reverse Hierarchy Model for Predicting Eye Fixations
Tianlin Shi, Liang Ming and Xiaolin Hu.
CVPR 2014: IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, USA, 2014.
Documents: [Abstract] [Arxiv]
Psychological and physiological evidence suggests that early visual attention works in a coarse-to-fine way, which lays a basis for the reverse hierarchy theory (RHT). This theory states that attention propagates from the top level of the visual hierarchy, which processes the gist and abstract information of the input, to the bottom level, which processes local details. Inspired by the theory, we develop a computational model for saliency detection in images. First, the original image is downsampled to different scales to constitute a pyramid. Then, saliency on each layer is obtained by image super-resolution reconstruction from the layer above, and is defined as unpredictability under this coarse-to-fine reconstruction. Finally, saliency on each layer of the pyramid is fused into stochastic fixations through a probabilistic model, where attention initiates from the top layer and propagates downward through the pyramid. Extensive experiments on two standard eye-tracking datasets show that the proposed method achieves results competitive with state-of-the-art models.
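The pyramid step can be shown on a 1-D toy signal, with nearest-neighbor upsampling standing in for the paper's super-resolution reconstruction (illustrative only):

```python
def downsample(signal):
    # Average adjacent pairs (assumes even length): one pyramid level up.
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]

def upsample(coarse):
    # Nearest-neighbor reconstruction back down to the finer level.
    out = []
    for v in coarse:
        out.extend([v, v])
    return out

def saliency(signal):
    # Unpredictability: how much the fine signal deviates from its
    # reconstruction from the coarser level above it.
    recon = upsample(downsample(signal))
    return [abs(s - r) for s, r in zip(signal, recon)]
```

Smooth regions reconstruct well from the coarse level and get low saliency, while abrupt local structure is unpredictable from above and pops out.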


A Fully Polynomial-Time Approximation Scheme for Approximating a Sum of Random Variables
Jian Li* and Tianlin Shi*.
ORL: Operations Research Letters 42.3 (2014): 197–202.
Documents: [Abstract] [PDF] [Elsevier] [Arxiv] [Software].
*Author names follow alphabetic order. This paper is based on a pset problem from the course Algorithm Design, taught by Prof. Jian Li.
Given $n$ independent random variables $X_1, X_2, \ldots, X_n$ and an integer $C$, we study the fundamental problem of computing the probability that the sum $X = X_1 + X_2 + \cdots + X_n$ is at most $C$. We assume that each random variable $X_i$ is implicitly given by an oracle which, given an input value $k$, returns the probability $\Pr[X_i \leq k]$. We give the first deterministic fully polynomial-time approximation scheme (FPTAS) that estimates this probability up to a multiplicative factor of $1 \pm \epsilon$. Our algorithm is based on the idea developed for approximately counting knapsack solutions in [Gopalan et al., FOCS'11].
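For small discrete supports, the quantity being approximated can be computed exactly by the brute-force dynamic program whose exponential blow-up the FPTAS avoids; a sketch:

```python
def cdf_of_sum(dists, C):
    """Exact P(X_1 + ... + X_n <= C) for independent discrete X_i, each
    given as a dict {value: probability}. The support of the running sum
    can grow exponentially in general, which is what the paper's FPTAS
    sidesteps with relative-error guarantees; here supports are tiny, so
    exact convolution is fine.
    """
    sums = {0: 1.0}  # distribution of the running partial sum
    for dist in dists:
        new = {}
        for s, p in sums.items():
            for v, q in dist.items():
                new[s + v] = new.get(s + v, 0.0) + p * q
        sums = new
    return sum(p for s, p in sums.items() if s <= C)
```

For two fair coins and $C = 1$, the exact answer is $3/4$, which an $\epsilon$-approximation must match within a $1 \pm \epsilon$ factor.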


Gradient-based inference for higher-order probabilistic programming languages
Tianlin Shi, Alexey Radul and Vikash Mansinghka
NEML 2014: Workshop at New England Machine Learning Day
Documents: [Abstract] [Poster].
Higher-order probabilistic programming languages make it easy to represent models that combine discrete latent variables, stochastic control flow, and combinatorial stochastic processes with high-dimensional continuous latent spaces. These kinds of probabilistic programs present new inference challenges that existing probabilistic programming systems have not addressed: not only can the energy function's curvature vary unpredictably, but the number of continuous dimensions can grow or shrink dynamically, and energy functions and/or their gradients may be unavailable due to the presence of computationally intensive simulations.
We describe adaptive hybrid Monte Carlo transition operators and accelerated gradient-based MAP schemes for inference in higher-order probabilistic programs that overcome these limitations. We evaluate our transition operators on posterior simulation in models that are difficult for most general-purpose inference engines, such as a high-dimensional multivariate Gaussian, spike-and-slab regression, and an uncollapsed Dirichlet process mixture of Gaussians. We demonstrate correctness by comparison with custom implementations. We show how our approach resolves limitations of previous hybrid Monte Carlo schemes for probabilistic programs and makes it straightforward to express hybrid machine learning models based on novel combinations of point estimation and approximate Bayesian inference.
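The basic building block of any hybrid Monte Carlo operator is the leapfrog integrator; a minimal 1-D sketch (illustrative, not the adaptive implementation described above):

```python
def leapfrog(q, p, grad_U, step, n_steps):
    """One leapfrog trajectory for hybrid (Hamiltonian) Monte Carlo on a
    1-D target with potential energy U; grad_U is dU/dq. Approximately
    conserves the Hamiltonian H = U(q) + p^2/2, which is why proposals
    are accepted with high probability.
    """
    p = p - 0.5 * step * grad_U(q)   # initial half step for momentum
    for _ in range(n_steps - 1):
        q = q + step * p             # full step for position
        p = p - step * grad_U(q)     # full step for momentum
    q = q + step * p
    p = p - 0.5 * step * grad_U(q)   # final half step for momentum
    return q, -p                     # negate momentum for reversibility
```

For a standard Gaussian target, U(q) = q^2/2 and grad_U(q) = q; the adaptive operators in the paper must additionally cope with curvature that varies, dimensions that appear and vanish, and gradients that may be unavailable.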
