# PMOHR Summary

Machine learning is a field of computer science that aims to discover statistical patterns in large datasets and to make predictions using these patterns. While there is growing interest in methods that provide accurate predictions on unseen data, comparatively less effort is invested in designing interpretable machine learning models. Interpretable models have the benefit of providing human-understandable patterns that domain experts can interpret. This enables knowledge discovery in the scientific disciplines, and it also allows reasoning about causality and making counterfactual predictions.

In probabilistic machine learning, we first encode our assumptions about the data structure in the form of a model that has latent variables, which represent the hidden patterns. We then learn the latent variables using an inference algorithm. The results of the inference procedure can be used to make predictions and explore the data collection. Crucially, we can impose an interpretable structure in the model design phase. Finally, in probabilistic machine learning we can also apply model testing approaches to verify the properties of the posited model and check whether it fails to capture certain properties of the data.
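As a toy illustration of this workflow (posit a model with a latent variable, infer it, then predict), the sketch below uses a Beta-Bernoulli model. It is not one of the PMOHR models; the function names, the uniform `Beta(1, 1)` prior, and the data are assumptions made for the example.

```python
import numpy as np

def beta_bernoulli_posterior(x, a=1.0, b=1.0):
    """Infer the latent success probability under a Beta(a, b) prior.

    Returns the parameters of the conjugate Beta posterior.
    """
    x = np.asarray(x)
    successes = x.sum()
    return a + successes, b + x.size - successes

def predictive_prob(a_post, b_post):
    """Posterior predictive probability that the next observation is 1."""
    return a_post / (a_post + b_post)

# Hypothetical binary observations (6 ones, 2 zeros).
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])
a_post, b_post = beta_bernoulli_posterior(data)   # inference step
p_next = predictive_prob(a_post, b_post)          # prediction step
```

Here the latent variable has a closed-form posterior; for the models discussed on this page the posterior is intractable, which is why approximate inference algorithms are needed.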

PMOHR is an interdisciplinary project focused on the design of interpretable models through probabilistic machine learning, with the ultimate goal of modelling Electronic Health Records (EHRs). Applying machine learning tools to EHR data can not only help design clinical support systems, but may also uncover unknown patterns in the data and even help form causal theories.

However, medical data, and EHRs in particular, present several challenges that prevent us from applying probabilistic modelling tools, because the datasets are large and heterogeneous. The objectives of PMOHR are to develop both probabilistic models and inference algorithms that are suitable for modelling EHR data. This new set of tools can then be applied to make predictions and to analyse medical datasets.

# News

## Jul 2019: Talk at EMS, Special Session on "Recent advances in simulation-based methods for numerical integration and inference" (Palermo, Italy)

A Contrastive Divergence for Combining Variational Inference and MCMC [slides]

## Jun 2019: Talk at MSCA Monitoring Meeting (Brussels, Belgium)

Probabilistic Modelling of Electronic Health Records [slides]

## May 2019: Invited Seminar at Linköping University (Linköping, Sweden)

Beyond the Mean-Field Family: Variational Inference with Implicit Distributions [slides]

## May 2019: Paper Accepted at ICML

**A Contrastive Divergence for Combining Variational Inference and MCMC** [arxiv]

We develop a method to combine Markov chain Monte Carlo (MCMC) and variational inference (VI), leveraging the advantages of both inference approaches. Specifically, we improve the variational distribution by running a few MCMC steps. To make inference tractable, we introduce the variational contrastive divergence (VCD), a new divergence that replaces the standard Kullback-Leibler (KL) divergence used in VI. The VCD captures a notion of discrepancy between the initial variational distribution and its improved version (obtained after running the MCMC steps), and it converges asymptotically to the symmetrized KL divergence between the variational distribution and the posterior of interest. The VCD objective can be optimized efficiently with respect to the variational parameters via stochastic optimization. We show experimentally that optimizing the VCD leads to better predictive performance on two latent variable models: logistic matrix factorization and variational autoencoders (VAEs).
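The step of "improving the variational distribution by running a few MCMC steps" can be sketched as follows. This is only an illustration, not the paper's implementation: it swaps in random-walk Metropolis as the MCMC kernel, uses a one-dimensional Gaussian as a stand-in for the posterior, and does not compute the VCD objective itself. All names and constants are assumptions.

```python
import numpy as np

def log_target(z):
    # Unnormalized log density of a stand-in "posterior", N(3, 1).
    return -0.5 * (z - 3.0) ** 2

def improve_with_mcmc(z0, n_steps, step_size, rng):
    """Run a few random-walk Metropolis steps on each initial sample."""
    z = z0.copy()
    for _ in range(n_steps):
        proposal = z + step_size * rng.standard_normal(z.shape)
        log_accept = log_target(proposal) - log_target(z)
        accept = np.log(rng.uniform(size=z.shape)) < log_accept
        z = np.where(accept, proposal, z)
    return z

rng = np.random.default_rng(0)
# Samples from an initial variational distribution q = N(0, 1).
z_init = rng.normal(loc=0.0, scale=1.0, size=2000)
# Samples from the improved distribution after 20 MCMC steps.
z_improved = improve_with_mcmc(z_init, n_steps=20, step_size=1.0, rng=rng)
```

The improved samples sit closer to the target than the initial ones, which is the discrepancy that a divergence between the two distributions can exploit.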

## May 2019: Code Released

**A Contrastive Divergence for Combining Variational Inference and MCMC** [link]

The code for our *ICML* paper has been released.

## Apr 2019: Paper Submitted to TACL

**Topic Modeling in Embedding Spaces**

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the embedded topic model (ETM), a generative model of documents that marries traditional topic models with word embeddings. In particular, it models each word with a categorical distribution whose natural parameter is the inner product between a word embedding and an embedding of its assigned topic. To fit the ETM, we develop an efficient amortized variational inference algorithm. The ETM discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation, in terms of both topic quality and predictive performance.
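To make the word distribution concrete, here is a minimal sketch of a categorical distribution whose natural parameter is the inner product between each word embedding and a topic embedding. The variable names (`rho`, `alpha_k`, `beta_k`), the dimensions, and the random embeddings are assumptions for illustration; the inference algorithm is not shown.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - x.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def topic_word_distribution(word_embeddings, topic_embedding):
    """Distribution over the vocabulary for one topic.

    The natural parameter for each word is the inner product between
    its embedding and the topic embedding.
    """
    return softmax(word_embeddings @ topic_embedding)

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 3
rho = rng.standard_normal((vocab_size, embed_dim))  # hypothetical word embeddings
alpha_k = rng.standard_normal(embed_dim)            # hypothetical topic embedding
beta_k = topic_word_distribution(rho, alpha_k)
```

Words whose embeddings align with the topic embedding receive the highest probability under that topic.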

## Mar 2019: Poster Presentation at the NY Annual ML Symposium

Title: "A Contrastive Divergence for Combining Variational Inference and MCMC"

## Feb 2019: Code Released

**De novo Gene Signature Identification from Single-Cell RNA-Seq with Hierarchical Poisson Factorization** [link]

The code for our *Molecular Systems Biology* paper has been released.

## Feb 2019: Paper Accepted at Molecular Systems Biology

**De novo Gene Signature Identification from Single-Cell RNA-Seq with Hierarchical Poisson Factorization** [link]

Common approaches to gene signature discovery in single cell RNA-sequencing (scRNA-seq) depend upon predefined structures like clusters or pseudo-temporal order, require prior normalization, or do not account for the sparsity of single cell data. We present single cell Hierarchical Poisson Factorization (scHPF), a Bayesian factorization method that adapts Hierarchical Poisson Factorization for de novo discovery of both continuous and discrete expression patterns from scRNA-seq. scHPF does not require prior normalization and captures statistical properties of single cell data better than other methods in benchmark datasets. Applied to scRNA-seq of the core and margin of a high-grade glioma, scHPF uncovers marked differences in the abundance of glioma subpopulations across tumor regions and subtle, regionally-associated expression biases within glioma subpopulations. scHPF revealed an expression signature that was spatially biased towards the glioma-infiltrated margins and associated with inferior survival in glioblastoma.
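For readers unfamiliar with Hierarchical Poisson Factorization, the sketch below samples from a generic HPF generative process: gamma-distributed per-cell and per-gene scale variables, gamma-distributed factor weights, and Poisson counts. It assumes the standard HPF structure with placeholder hyperparameters; scHPF may parameterize the model differently, and inference is not shown.

```python
import numpy as np

def sample_hpf(n_cells, n_genes, n_factors, rng,
               a=0.3, a_prime=1.0, b_prime=1.0,
               c=0.3, c_prime=1.0, d_prime=1.0):
    """Draw a count matrix from a hierarchical Poisson factorization.

    Gamma distributions are written as Gamma(shape, rate); NumPy expects
    a scale, so every rate is converted with scale = 1 / rate.
    """
    # Per-cell scale: xi ~ Gamma(a', a'/b')
    xi = rng.gamma(a_prime, b_prime / a_prime, size=n_cells)
    # Cell factor weights: theta ~ Gamma(a, xi)
    theta = rng.gamma(a, 1.0 / xi[:, None], size=(n_cells, n_factors))
    # Per-gene scale: eta ~ Gamma(c', c'/d')
    eta = rng.gamma(c_prime, d_prime / c_prime, size=n_genes)
    # Gene factor weights: beta ~ Gamma(c, eta)
    beta = rng.gamma(c, 1.0 / eta[:, None], size=(n_genes, n_factors))
    # Observed counts: Poisson with rate theta_cell . beta_gene
    counts = rng.poisson(theta @ beta.T)
    return counts, theta, beta

rng = np.random.default_rng(0)
counts, theta, beta = sample_hpf(n_cells=20, n_genes=50, n_factors=4, rng=rng)
```

Small gamma shape parameters encourage sparse factor weights, which matches the sparsity of single-cell count data.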

## Jan 2019: Code Released

**Unbiased Implicit Variational Inference** [link]

The code for our AISTATS paper has been released.

## Dec 2018: Paper Accepted at AISTATS 2019

**Unbiased Implicit Variational Inference** [arxiv]

We develop unbiased implicit variational inference (UIVI), a method that expands the applicability of variational inference by defining an expressive variational family. UIVI considers an implicit variational distribution obtained in a hierarchical manner using a simple reparameterizable distribution whose variational parameters are defined by arbitrarily flexible deep neural networks. Unlike previous works, UIVI directly optimizes the evidence lower bound (ELBO) rather than an approximation to the ELBO. We demonstrate UIVI on several models, including Bayesian multinomial logistic regression and variational autoencoders, and show that UIVI achieves both tighter ELBO and better predictive performance than existing approaches at a similar computational cost.
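A hierarchical construction of this kind can be sketched as follows: noise is pushed through a neural network to produce the parameters of a simple reparameterizable (here Gaussian) conditional, and the resulting marginal over `z` has no closed-form density. The network architecture, the fixed `sigma`, and all names are assumptions; the UIVI gradient estimator itself is not shown.

```python
import numpy as np

def mlp(eps, w1, b1, w2, b2):
    """A small feed-forward network mapping noise to a conditional mean."""
    hidden = np.tanh(eps @ w1 + b1)
    return hidden @ w2 + b2

def sample_implicit(n, params, sigma, rng):
    """Sample z by first drawing eps, then z | eps ~ N(mlp(eps), sigma^2 I)."""
    w1, b1, w2, b2 = params
    eps = rng.standard_normal((n, w1.shape[0]))   # simple noise distribution
    mean = mlp(eps, w1, b1, w2, b2)               # parameters defined by the network
    u = rng.standard_normal(mean.shape)
    return mean + sigma * u                       # reparameterized Gaussian draw

rng = np.random.default_rng(0)
noise_dim, hidden_dim, latent_dim = 4, 16, 2
params = (
    rng.standard_normal((noise_dim, hidden_dim)),
    np.zeros(hidden_dim),
    rng.standard_normal((hidden_dim, latent_dim)),
    np.zeros(latent_dim),
)
z = sample_implicit(1000, params, sigma=0.1, rng=rng)
```

Sampling is easy, but evaluating the marginal density of `z` requires integrating over `eps`, which is what makes the distribution implicit.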

## Dec 2018: Organized 1st Symposium on Advances in Approximate Bayesian Inference

**1st Symposium on Advances in Approximate Bayesian Inference** [link]

Probabilistic modeling is a useful tool to analyze and understand real-world data. Central to the success of Bayesian modeling is posterior inference, for which approximate inference algorithms are needed in most problems of interest. The two pillars of approximate Bayesian inference are variational and Monte Carlo methods. In recent years, there have been numerous advances in both methods, which have enabled Bayesian inference in increasingly challenging scenarios involving complex probabilistic models and large datasets.

In this symposium, besides recent advances in approximate inference, we will discuss the impact of Bayesian inference, connecting approximate inference methods with other fields. In particular, we encourage submissions that relate Bayesian inference to the fields of reinforcement learning, causal inference, decision processes, Bayesian compression, or differential privacy, among others. We also encourage submissions that contribute to connecting different approximate inference methods, such as variational inference and Monte Carlo.

## Dec 2018: Invited Talk at Universidad Carlos III de Madrid (Madrid, Spain)

Unbiased Implicit Variational Inference [slides]

## Nov 2018: Invited Talk at Amazon Research (Cambridge, UK)

Shopper: A Probabilistic Model of Consumer Choice with Substitutes and Complements [slides]

## Jul 2018: Paper Submitted to BioRxiv

**De novo Gene Signature Identification from Single-Cell RNA-Seq with Hierarchical Poisson Factorization** [link]

Common approaches to gene signature discovery in single cell RNA-sequencing (scRNA-seq) depend upon predefined structures like clusters or pseudo-temporal order, require prior normalization, or do not account for the sparsity of single cell data. We present single cell Hierarchical Poisson Factorization (scHPF), a Bayesian factorization method that adapts Hierarchical Poisson Factorization for de novo discovery of both continuous and discrete expression patterns from scRNA-seq. scHPF does not require prior normalization and captures statistical properties of single cell data better than other methods in benchmark datasets. Applied to scRNA-seq of the core and margin of a high-grade glioma, scHPF uncovers marked differences in the abundance of glioma subpopulations across tumor regions and subtle, regionally-associated expression biases within glioma subpopulations. scHPF revealed an expression signature that was spatially biased towards the glioma-infiltrated margins and associated with inferior survival in glioblastoma.

## Jul 2018: Paper Accepted at ICML 2018

**Augment and Reduce: Stochastic Inference for Large Categorical Distributions** [link]

Categorical distributions are ubiquitous in machine learning, e.g., in classification, language models, and recommendation systems. However, when the number of possible outcomes is very large, using categorical distributions becomes computationally expensive, as the complexity scales linearly with the number of outcomes. To address this problem, we propose augment and reduce (A&R), a method to alleviate the computational complexity. A&R uses two ideas: latent variable augmentation and stochastic variational inference. It maximizes a lower bound on the marginal likelihood of the data. Unlike existing methods which are specific to softmax, A&R is more general and is amenable to other categorical models, such as multinomial probit. On several large-scale classification problems, we show that A&R provides a tighter bound on the marginal likelihood and has better predictive performance than existing approaches.

## Jun 2018: Invited Talk at ISBA Workshop

Shopper: A Probabilistic Model of Consumer Choice with Substitutes and Complements [link] [slides]

## Jun 2018: Invited Talk at Universitat de Barcelona

Shopper: A Probabilistic Model of Consumer Choice with Substitutes and Complements [slides]

## Jun 2018: Invited Talk at Barcelona GSE Summer Forum

Shopper: A Probabilistic Model of Consumer Choice with Substitutes and Complements [link] [slides]

## May 2018: Code Released

**Augment and Reduce** [link]

The code for our ICML paper has been released.

## May 2018: Code Released

**Shopper** [link]

The code for our ArXiv paper has been released.

## May 2018: Paper Accepted at American Economic Association Papers and Proceedings

**Estimating Heterogeneous Consumer Preferences for Restaurants and Travel Time Using Mobile Location Data** [link]

We estimate a model of consumer choices over restaurants using data from several thousand anonymous mobile phone users. Restaurants have latent characteristics (whose distribution may depend on restaurant observables) that affect consumers' mean utility as well as willingness to travel to the restaurant, while each user has distinct preferences for these latent characteristics. We analyze how consumers reallocate their demand after a restaurant closes to nearby restaurants versus more distant restaurants, comparing our predictions to actual outcomes. We also address counterfactual questions such as what type of restaurant would attract the most consumers in a given location.

## Apr 2018: Workflow Chair at AISTATS 2018

**21st International Conference on Artificial Intelligence and Statistics** [link]

The 21st International Conference on Artificial Intelligence and Statistics (AISTATS 2018) will be held in Playa Blanca, Lanzarote, Canary Islands from Monday, 9 April 2018 to Wednesday, 11 April 2018 at the Hotel H10 Rubicon Palace.

Since its inception in 1985, AISTATS has been an interdisciplinary gathering of researchers at the intersection of artificial intelligence, machine learning, statistics, and related areas.

## Apr 2018: Invited Talk at Stony Brook University

Augment and Reduce: Stochastic Inference for Large Categorical Distributions [slides]

## Apr 2018: Poster Presentation at DALI

Augment and Reduce: Stochastic Inference for Large Categorical Distributions [link]

## Mar 2018: Poster Presentation at the NY Annual ML Symposium

Title: "Augment and Reduce: Stochastic Inference for Large Categorical Distributions"

## Mar 2018: Interview in El País Retina

Interview for the Spanish newspaper *El País Retina* [link]

## Feb 2018: Code Released

**Structured Embeddings** [link]

The code for our NIPS paper has been released.

## Jan 2018: Paper Accepted at IEEE Transactions on Cognitive Communications and Networking

**Infinite Factorial Finite State Machine for Blind Multiuser Channel Estimation** [link]

New communication standards need to deal with machine-to-machine communications, in which users may start or stop transmitting at any time in an asynchronous manner. Thus, the number of users is an unknown and time-varying parameter that needs to be accurately estimated in order to properly recover the symbols transmitted by all users in the system. In this paper, we address the problem of joint channel parameter and data estimation in a multiuser communication channel in which the number of transmitters is not known. For that purpose, we develop the infinite factorial finite state machine model, a Bayesian nonparametric model based on the Markov Indian buffet that allows for an unbounded number of transmitters with arbitrary channel length. We propose an inference algorithm that makes use of slice sampling and particle Gibbs with ancestor sampling. Our approach is fully blind as it does not require a prior channel estimation step, prior knowledge of the number of transmitters, or any signaling information. Our experimental results, loosely based on the LTE random access channel, show that the proposed approach can effectively recover the data-generating process for a wide range of scenarios, with varying number of transmitters, number of receivers, constellation order, channel length, and signal-to-noise ratio.

## Dec 2017: Organized NIPS 2017 Workshop

**Advances in Approximate Bayesian Inference** [link]

Approximate inference is key to modern probabilistic modeling. Thanks to the availability of big data, significant computational power, and sophisticated models, machine learning has achieved many breakthroughs in multiple application domains. At the same time, approximate inference becomes critical since exact inference is intractable for most models of interest. Within the field of approximate Bayesian inference, variational and Monte Carlo methods are currently the mainstay techniques. For both methods, there has been considerable progress in both efficiency and performance.

In this workshop, we encourage submissions advancing approximate inference methods. We are open to a broad scope of methods within the field of Bayesian inference. In addition, we also encourage applications of approximate inference in many domains, such as computational biology, recommender systems, differential privacy, and industry applications.

## Dec 2017: Paper Accepted at NIPS 2017

**Structured Embedding Models for Grouped Data** [link]

Word embeddings are a powerful approach for analyzing language, and exponential family embeddings (EFE) extend them to other types of data. Here we develop structured exponential family embeddings (S-EFE), a method for discovering embeddings that vary across related groups of data. We study how the word usage of U.S. Congressional speeches varies across states and party affiliation, how words are used differently across sections of the ArXiv, and how the co-purchase patterns of groceries can vary across seasons. Key to the success of our method is that the groups share statistical information. We develop two sharing strategies: hierarchical modeling and amortization. We demonstrate the benefits of this approach in empirical studies of speeches, abstracts, and shopping baskets. We show how S-EFE enables group-specific interpretation of word usage, and outperforms EFE in predicting held-out data.

## Dec 2017: Paper Accepted at NIPS 2017

**Context Selection for Embedding Models** [link]

Word embeddings are an effective tool to analyze language. They have been recently extended to model other types of data beyond text, such as items in recommendation systems. Embedding models consider the probability of a target observation (a word or an item) conditioned on the elements in the context (other words or items). In this paper, we show that conditioning on all the elements in the context is not optimal. Instead, we model the probability of the target conditioned on a learned subset of the elements in the context. We use amortized variational inference to automatically choose this subset. Compared to standard embedding models, this method improves predictions and the quality of the embeddings.

## Dec 2017: Paper in NIPS Workshop on Approximate Bayesian Inference

**Scalable Large-Scale Classification with Latent Variable Augmentation** [link]

Categorical distributions are ubiquitous in machine learning, e.g., in classification, language models, and recommendation systems. However, when the number of possible outcomes is very large, using categorical distributions becomes computationally expensive, as the complexity scales linearly with the number of outcomes. To address this problem, we propose a method to alleviate the computational complexity. We use two ideas: latent variable augmentation and stochastic variational inference, and we maximize a lower bound on the marginal likelihood of the data. Unlike existing methods which are specific to softmax, our method is more general and is amenable to other categorical models, such as multinomial probit.

## Dec 2017: Paper in NIPS Workshop on Bayesian Deep Learning

**Word2Net: Deep Representations of Language** [link]

Word embeddings extract semantic features of words from large datasets of text. Most embedding methods rely on a log-bilinear model to predict the occurrence of a word in a context of other words. Here we propose word2net, a method that replaces their linear parametrization with neural networks. For each term in the vocabulary, word2net posits a neural network that takes the context as input and outputs a probability of occurrence. Further, word2net can use the hierarchical organization of its word networks to incorporate additional meta-data, such as syntactic features, into the embedding model. For example, we show how to share parameters across word networks to develop an embedding model that includes part-of-speech information. We study word2net with two datasets, a collection of Wikipedia articles and a corpus of U.S. Senate speeches. Quantitatively, we found that word2net outperforms popular embedding methods on predicting held-out words and that sharing parameters based on part of speech further boosts performance. Qualitatively, word2net learns interpretable semantic representations and, compared to vector-based methods, better incorporates syntactic information.

## Nov 2017: Code Released

**Context Selection for Embedding Models** [link]

The code for our NIPS paper has been released.

## Nov 2017: Paper Submitted to ArXiv

**Shopper: A Probabilistic Model of Consumer Choice with Substitutes and Complements** [link]

We develop SHOPPER, a sequential probabilistic model of shopping data. SHOPPER uses interpretable components to model the forces that drive how a customer chooses products; in particular, we designed SHOPPER to capture how items interact with other items. We develop an efficient posterior inference algorithm to estimate these forces from large-scale data, and we analyze a large dataset from a major chain grocery store. We are interested in answering counterfactual queries about changes in prices. We found that SHOPPER provides accurate predictions even under price interventions, and that it helps identify complementary and substitutable pairs of products.

## Jun 2017: Paper Accepted at IEEE Transactions on Signal Processing

**Poisson Multi-Bernoulli Radar Mapping Using Gibbs Sampling** [link]

This paper addresses the mapping problem. Using a conjugate prior form, we derive the exact theoretical batch multiobject posterior density of the map given a set of measurements. The landmarks in the map are modeled as extended objects, and the measurements are described as a Poisson process, conditioned on the map. We use a Poisson process prior on the map and prove that the posterior distribution is a hybrid Poisson, multi-Bernoulli mixture distribution. We devise a Gibbs sampling algorithm to sample from the batch multiobject posterior. The proposed method can handle uncertainties in the data associations and the cardinality of the set of landmarks, and is parallelizable, making it suitable for large-scale problems. The performance of the proposed method is evaluated on synthetic data and is shown to outperform a state-of-the-art method.

## Apr 2017: Paper Accepted at AISTATS 2017

**Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms** [link]

*Obtained the best paper award. See also this blog post!*

Variational inference using the reparameterization trick has enabled large-scale approximate Bayesian inference in complex probabilistic models, leveraging stochastic optimization to sidestep intractable expectations. The reparameterization trick is applicable when we can simulate a random variable by applying a differentiable deterministic function on an auxiliary random variable whose distribution is fixed. For many distributions of interest (such as the gamma or Dirichlet), simulation of random variables relies on acceptance-rejection sampling. The discontinuity introduced by the accept-reject step means that standard reparameterization tricks are not applicable. We propose a new method that lets us leverage reparameterization gradients even when variables are outputs of an acceptance-rejection sampling algorithm. Our approach enables reparameterization on a larger class of variational distributions. In several studies of real and synthetic data, we show that the variance of the estimator of the gradient is significantly lower than other state-of-the-art methods. This leads to faster convergence of stochastic gradient variational inference.
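The standard reparameterization trick that the abstract starts from can be illustrated with a Gaussian, where a sample is a differentiable function of fixed-distribution noise. The sketch below covers only this standard case, not the acceptance-rejection extension proposed in the paper; the test function `f(z) = z^2` and all names are assumptions.

```python
import numpy as np

def reparam_grad_mu(mu, sigma, n_samples, rng):
    """Monte Carlo estimate of d/dmu E[z^2] for z ~ N(mu, sigma^2).

    Writes z = mu + sigma * eps with eps ~ N(0, 1), so the gradient can
    be moved inside the expectation: d f(z) / d mu = 2 * z * (dz / dmu) = 2 * z.
    """
    eps = rng.standard_normal(n_samples)   # auxiliary noise, fixed distribution
    z = mu + sigma * eps                   # differentiable transformation
    return np.mean(2.0 * z)

rng = np.random.default_rng(0)
grad_estimate = reparam_grad_mu(mu=1.5, sigma=0.5, n_samples=100_000, rng=rng)
exact_grad = 2.0 * 1.5   # since E[z^2] = mu^2 + sigma^2
```

For a gamma or Dirichlet variable there is no such simple transformation of fixed noise, because standard samplers include an accept-reject step.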

## Mar 2017: Poster Presentation at the NY Annual ML Symposium

*IBM best poster presenter award.*

Title: "Item Embeddings for Demand Estimation in Economics"

## Feb 2017: Code Released

**Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms** [link]

The code for our AISTATS paper has been released.

## Dec 2016: Invited Talk at Universidad Carlos III de Madrid

Exponential Family Embeddings: Application to Economics [slides]

## Dec 2016: Paper Accepted at NIPS 2016

**The Generalized Reparameterization Gradient** [link]

The reparameterization gradient has become a widely used method to obtain Monte Carlo gradients to optimize the variational objective. However, this technique does not easily apply to commonly used distributions such as beta or gamma without further approximations, and most practical applications of the reparameterization gradient fit Gaussian distributions. In this paper, we introduce the generalized reparameterization gradient, a method that extends the reparameterization gradient to a wider class of variational distributions. Generalized reparameterizations use invertible transformations of the latent variables which lead to transformed distributions that weakly depend on the variational parameters. This results in new Monte Carlo gradients that combine reparameterization gradients and score function gradients. We demonstrate our approach on variational inference for two complex probabilistic models. The generalized reparameterization is effective: even a single sample from the variational distribution is enough to obtain a low-variance gradient.
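For reference, the two standard Monte Carlo gradient estimators that the generalized reparameterization gradient combines can be written as below. These are the textbook forms, not the paper's generalized estimator, and they assume for simplicity that the integrand $f$ does not itself depend on the variational parameters $\theta$. The symbols $f$, $T$, and $\varepsilon$ are notation chosen for this note.

Score function gradient:

$$
\nabla_\theta \, \mathbb{E}_{q_\theta(z)}[f(z)] = \mathbb{E}_{q_\theta(z)}\big[ f(z) \, \nabla_\theta \log q_\theta(z) \big]
$$

Reparameterization gradient, when $z = T(\varepsilon; \theta)$ with $\varepsilon \sim p(\varepsilon)$ independent of $\theta$:

$$
\nabla_\theta \, \mathbb{E}_{q_\theta(z)}[f(z)] = \mathbb{E}_{p(\varepsilon)}\big[ \nabla_z f(z) \big|_{z = T(\varepsilon; \theta)} \, \nabla_\theta T(\varepsilon; \theta) \big]
$$

The first applies to any distribution with a differentiable log density but tends to have high variance; the second usually has lower variance but requires the transformation $T$.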

## Dec 2016: Paper Accepted at NIPS 2016

**Exponential Family Embeddings** [link]

Word embeddings are a powerful approach for capturing semantic similarity among terms in a vocabulary. In this paper, we develop exponential family embeddings, a class of methods that extends the idea of word embeddings to other types of high-dimensional data. As examples, we studied neural data with real-valued observations, count data from a market basket analysis, and ratings data from a movie recommendation system. The main idea is to model each observation conditioned on a set of other observations. This set is called the context, and the way the context is defined is a modeling choice that depends on the problem. In language the context is the surrounding words; in neuroscience the context is close-by neurons; in market basket data the context is other items in the shopping cart. Each type of embedding model defines the context, the exponential family of conditional distributions, and how the latent embedding vectors are shared across data. We infer the embeddings with a scalable algorithm based on stochastic gradient descent. On all three applications - neural activity of zebrafish, users' shopping behavior, and movie ratings - we found exponential family embedding models to be more effective than other types of dimension reduction. They better reconstruct held-out data and find interesting qualitative structure.
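The idea of modeling each observation conditioned on its context can be sketched for count data as follows. This assumes a Poisson conditional whose natural parameter (the log rate) is the inner product between the item's embedding and a sum of context vectors weighted by the context counts; the names `rho`, `alpha`, and `poisson_rate`, the link function, and the data are assumptions for illustration.

```python
import numpy as np

def poisson_rate(i, x, rho, alpha, context):
    """Rate of item i's count given the counts of its context items."""
    # Sum of context vectors, each weighted by the observed count.
    context_vec = (alpha[context] * x[context][:, None]).sum(axis=0)
    natural_param = rho[i] @ context_vec
    return np.exp(natural_param)   # Poisson rate from the log-rate

rng = np.random.default_rng(0)
n_items, dim = 5, 3
rho = 0.1 * rng.standard_normal((n_items, dim))    # hypothetical embedding vectors
alpha = 0.1 * rng.standard_normal((n_items, dim))  # hypothetical context vectors
basket = np.array([2, 0, 1, 0, 3])                 # counts of items in one basket
context = [j for j in range(n_items) if j != 0]    # all other items in the basket
rate_0 = poisson_rate(0, basket, rho, alpha, context)
```

Changing what counts as the context (surrounding words, nearby neurons, other items in the cart) changes the model without changing this basic structure.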

## Dec 2016: Obtained a Microsoft Azure 1-year Grant

1-year subscription for Microsoft Azure [link]

## Nov 2016: Code Released

**Exponential Family Embeddings** [link p-emb] [link b-emb]

The code for our NIPS paper has been released.

## Nov 2016: Invited Talk at University of Cambridge

Reparameterizing Challenging Distributions [slides]

## Oct 2016: Obtained a GPU Nvidia Grant

Obtained a Tesla K40 GPU, donated by Nvidia [link]

## Oct 2016: Paper Submitted to ArXiv

**Model Criticism for Bayesian Causal Inference** [link]

The goal of causal inference is to understand the outcome of alternative courses of action. However, all causal inference requires assumptions. Such assumptions can be more influential than in typical tasks for probabilistic modeling, and testing those assumptions is important to assess the validity of causal inference. We develop model criticism for Bayesian causal inference, building on the idea of posterior predictive checks to assess model fit. Our approach involves decomposing the problem, separately criticizing the model of treatment assignments and the model of outcomes. Conditioned on the assumption of unconfoundedness (that the treatments are assigned independently of the potential outcomes), we show how to check any additional modeling assumption. Our approach provides a foundation for diagnosing model-based causal inferences.
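A generic posterior predictive check, the building block mentioned above, can be sketched as follows. This is not the paper's causal criticism procedure: it uses a conjugate Gamma-Poisson model, the variance as the test statistic, and hypothetical data, all chosen for illustration.

```python
import numpy as np

def posterior_predictive_pvalue(data, statistic, n_rep, rng, a=1.0, b=1.0):
    """Posterior predictive p-value for a Poisson model with a Gamma(a, b) prior.

    Draws rates from the conjugate posterior, simulates replicated
    datasets, and compares the statistic on replicated vs. observed data.
    """
    data = np.asarray(data)
    n = data.size
    # Conjugate posterior: Gamma(shape = a + sum(x), rate = b + n).
    rates = rng.gamma(shape=a + data.sum(), scale=1.0 / (b + n), size=n_rep)
    replicated = rng.poisson(lam=rates[:, None], size=(n_rep, n))
    rep_stats = np.array([statistic(row) for row in replicated])
    return np.mean(rep_stats >= statistic(data))

rng = np.random.default_rng(0)
# Hypothetical overdispersed counts that a single Poisson rate fits poorly.
observed = np.array([0, 0, 1, 0, 12, 0, 15, 0, 1, 0])
p_value = posterior_predictive_pvalue(observed, np.var, n_rep=2000, rng=rng)
```

A p-value close to 0 or 1 indicates that data replicated from the fitted model rarely resemble the observed data on that statistic, which flags a modeling assumption worth revisiting.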