<p>Dennis Prangle's homepage/blog: dennisprangle.github.io (dennis.prangle@gmail.com)</p>
<h1>Bayesian inference by neural networks. Part 2: new paper (2016-06-07)</h1>
<p>In <a href="/research/2016/06/07/bayesian-inference-by-neural-networks">part 1</a> I reviewed likelihood-free inference and its limitations.
This post looks at a <a href="http://arxiv.org/pdf/1605.06376.pdf">new approach</a> by George Papamakarios and Iain Murray which avoids some of these issues and delivers an orders-of-magnitude improvement in performance.
Most of the post will describe their paper.
Then I’ll close with a few comments on the method’s advantages and possible limitations.</p>
<h2 id="regression-approaches-to-likelihood-free-inference">Regression approaches to likelihood-free inference</h2>
<p>For now I’ll consider the problem of approximating the posterior <script type="math/tex">\pi(\theta \vert y_0)</script> given simulations <script type="math/tex">(\theta_i, y_i)</script> from <script type="math/tex">\pi(\theta) \pi(y \vert \theta)</script> for <script type="math/tex">1 \leq i \leq N</script>.
The paper also simulates from other distributions, which I’ll discuss later.
(n.b. Out of habit, I’m using slightly different notation to the paper. They use <script type="math/tex">x</script> for datasets, <script type="math/tex">x_o</script> for the observations, and <script type="math/tex">p</script> instead of <script type="math/tex">\pi</script>.)</p>
<p>As described in <a href="/research/2016/06/07/bayesian-inference-by-neural-networks">part 1</a>, ABC operates by a nearest neighbours approach:
it approximates the posterior using <script type="math/tex">\theta_i</script> values such that <script type="math/tex">y_i</script> is close to <script type="math/tex">y_0</script>.
But this performs poorly unless <script type="math/tex">y</script> is low dimensional.</p>
<p>An alternative approach is to use <em>regression</em>.
This uses the simulations to train a system predicting <script type="math/tex">\theta</script> from <script type="math/tex">y</script> and then returns the predictions for <script type="math/tex">y_0</script>.
Rather than just base its output on the nearest neighbours, regression uses <em>all</em> the data to learn relationships between <em>features</em> of <script type="math/tex">y</script>, <script type="math/tex">f_1(y), f_2(y), \ldots</script>, and <script type="math/tex">\theta</script>.
(The assumption that such relationships exist is, I think, what the machine learning community sometimes refers to as “compositionality”.)
Hopefully even <script type="math/tex">y_i</script> values which are not close to <script type="math/tex">y_0</script> are useful in learning about these relationships.
<a href="https://en.wikipedia.org/wiki/Linear_regression">Classical linear regression</a> requires the user to define the features and allows only linear relationships.
Modern methods such as neural networks (a) can learn the features and (b) use non-linear relationships, although allowing more complexity in either requires more data to learn well.</p>
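<p>To make the regression idea concrete, here is a minimal sketch of my own (not from the paper): fit a classical linear regression of <script type="math/tex">\theta</script> on features of <script type="math/tex">y</script> using simulated pairs, then predict at the observed dataset. The toy simulator and the choice of features are invented for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy simulator: theta ~ N(0, 1) prior; the dataset y has two components,
# y = (theta + noise, theta^2 + noise).
N = 2000
theta = rng.normal(0.0, 1.0, N)
y = np.column_stack([theta + rng.normal(0.0, 0.1, N),
                     theta**2 + rng.normal(0.0, 0.1, N)])

# Classical linear regression of theta on user-chosen features of y
# (here simply the two coordinates of y, plus an intercept).
X = np.column_stack([np.ones(N), y])
beta, *_ = np.linalg.lstsq(X, theta, rcond=None)

# Point prediction of theta at the "observed" dataset y0.
y0 = np.array([0.5, 0.25])
theta_hat = np.concatenate([[1.0], y0]) @ beta
```

<p>All simulations contribute to estimating the regression coefficients, not just those with <script type="math/tex">y_i</script> near <script type="math/tex">y_0</script>; but as the text notes, the output is only a point prediction.</p>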
<p>Most regression methods output a single “point” prediction.
This is problematic here as we want to learn more than this.
In the Bayesian setting we want to learn a whole distribution, summarising the uncertainty about <script type="math/tex">\theta</script> given <script type="math/tex">y_0</script>.
In the next section I’ll outline how Papamakarios and Murray deal with this.</p>
<p>There is a history of using regression in likelihood-free methods.
<a href="http://www.genetics.org/content/162/4/2025.short">Beaumont, Zhang and Balding</a> find nearest neighbours in a traditional ABC step, then use these for regression.
A nice feature of this approach is that the set of nearest neighbours gives a distribution for <script type="math/tex">\theta</script>, which is then adjusted by the regression.
<a href="http://link.springer.com/article/10.1007/s11222-009-9116-0">Blum and François</a> generalise this to use neural network regression.
<a href="http://www.degruyter.com/view/j/sagmb.2011.10.issue-1/1544-6115.1684/1544-6115.1684.xml">Bonassi and West</a> propose a regression approach (with no initial ABC step) based on fitting a mixture of normals to the <script type="math/tex">(\theta_i,y_i)</script> simulations.</p>
<h2 id="mixture-density-networks">Mixture density networks</h2>
<p>Papamakarios and Murray use a <a href="https://en.wikipedia.org/wiki/Feedforward_neural_network">feed-forward neural network</a> approach.
These are composed of several layers of variables, the initial layer being the inputs, in this case <script type="math/tex">y</script>.
The next layer is created by applying a linear transformation and then element-wise non-linear transformations.
That is, an element of the second layer is formed by taking a weighted sum of the <script type="math/tex">y</script>s and then applying a <a href="https://en.wikipedia.org/wiki/Logistic_function">logistic function</a> or something similar.
Subsequent layers are formed similarly.
The final layer is the output, which for example could be predictions of <script type="math/tex">\theta</script>.
The network is specified by all the weights it uses.
These are tuned by optimisation: making the predictions as close as possible to the true <script type="math/tex">\theta_i</script>s according to some objective function (e.g. minimising the sum of squared errors, or maximising the likelihood).
A lot of techniques have been created to allow this to be done well.
One technique used in this paper (Section 2.5) which seemed important, although I didn’t entirely follow it, was to use a stochastic variational inference (SVI) method to avoid overfitting due to small sample sizes.</p>
<p>As mentioned above, neural network regression provides the advantages of learning features of <script type="math/tex">y</script> - the variables in intermediate layers - and fitting non-linear relationships.
However it has the disadvantage of not outputting a distribution.
To deal with this the paper uses a <em>mixture density network</em> (MDN),
which outputs the parameters of a <em>mixture distribution</em>, in this paper a normal mixture.
This can be viewed as representing the posterior distribution as several clusters.
Each cluster has an associated weight, and is defined by its mean vector and variance matrix.
The final layer of the neural network contains all of these details, represented in a neat form that ensures any output gives a valid distribution.
The network is fitted by maximising the density of the observed <script type="math/tex">\theta</script> values under the mixture density, which is performed iteratively on small batches of data using an efficient variant on stochastic gradient descent.</p>
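<p>A minimal sketch of an MDN forward pass may help. This is my own simplified illustration, not the paper’s implementation: one hidden layer, a scalar <script type="math/tex">\theta</script>, and untrained random weights, with a softmax giving valid mixture weights and an exponential giving positive standard deviations, so any network output is a valid distribution.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def mdn_forward(y, params, K=3):
    """One-hidden-layer mixture density network for a scalar theta.

    Maps a dataset y to the parameters of a K-component normal mixture:
    weights via a softmax (positive, summing to 1) and standard
    deviations via exp (positive), so the output is always valid.
    """
    W1, b1, W2, b2 = params
    h = np.tanh(W1 @ y + b1)            # hidden layer: learned features of y
    out = W2 @ h + b2                   # raw outputs, length 3*K
    logits, means, log_sds = out[:K], out[K:2 * K], out[2 * K:]
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()            # softmax
    return weights, means, np.exp(log_sds)

def mixture_logpdf(theta, weights, means, sds):
    """Log density of theta under the normal mixture; the training
    objective sums this over simulated (theta_i, y_i) pairs."""
    comps = weights * np.exp(-0.5 * ((theta - means) / sds) ** 2) \
        / (np.sqrt(2 * np.pi) * sds)
    return np.log(comps.sum())

# Random (untrained) weights, just to show the shapes involved.
d_y, d_h, K = 5, 10, 3
params = (rng.normal(0, 0.1, (d_h, d_y)), np.zeros(d_h),
          rng.normal(0, 0.1, (3 * K, d_h)), np.zeros(3 * K))

w, m, s = mdn_forward(rng.normal(size=d_y), params)
```

<p>Training would adjust <code>params</code> by stochastic gradient ascent on the summed log density; that loop is omitted here.</p>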
<p>One reason to concentrate on mixtures of normals is that it is straightforward to calculate many posterior summaries without the need to resort to Monte Carlo.
They also facilitate the sequential methods described below.</p>
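<p>For example, the mean and variance of a one-dimensional normal mixture are available in closed form, with no Monte Carlo needed (a quick sketch using the standard formulas):</p>

```python
import numpy as np

def mixture_mean_var(weights, means, variances):
    """Exact mean and variance of a 1-d normal mixture:
    E[X] = sum_k w_k m_k,
    Var[X] = sum_k w_k (v_k + m_k^2) - E[X]^2.
    """
    weights, means, variances = map(np.asarray, (weights, means, variances))
    mean = np.sum(weights * means)
    var = np.sum(weights * (variances + means**2)) - mean**2
    return mean, var

# Two-component example: posterior represented as two clusters.
post_mean, post_var = mixture_mean_var([0.3, 0.7], [-1.0, 2.0], [0.5, 1.0])
```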
<p>A nice feature of the paper is that the neural networks are not deep.
They use only one or two hidden layers and are fitted to a relatively small amount of simulated data.
Only up to around <script type="math/tex">10^5</script> simulated datasets are used for training, and far fewer for the most efficient of the proposed methods.
(Although one example does also require a pilot run of <script type="math/tex">10^5</script> simulations.)</p>
<h2 id="sequential-simulation">Sequential simulation</h2>
<p>The set-up above tries to learn the global relationship between the posterior <script type="math/tex">\pi(\theta \vert y)</script> and <script type="math/tex">y</script>, which is potentially very complicated.
The paper proposes focusing on a local part of this relationship which is relevant to learning <script type="math/tex">\pi(\theta \vert y_0)</script>, with the goal of reducing the amount of simulated data required.</p>
<p>Papamakarios and Murray do this by using a sequential approach.
First they learn a rough estimate of the posterior <script type="math/tex">\tilde{\pi}(\theta)</script> using the MDN approach outlined above.
As will be seen shortly, it turns out to be useful to just fit a normal distribution rather than a mixture.
Then they sample <script type="math/tex">(\theta_i, y_i)</script> pairs from the distribution <script type="math/tex">\tilde{\pi}(\theta) \pi(y|\theta)</script>.
Using an MDN, the corresponding mixture estimate <script type="math/tex">q(\theta | y_0)</script> is learnt.
This is adjusted by importance sampling to take into account the sampling distribution, to produce <script type="math/tex">\frac{\pi(\theta)}{\tilde{\pi}(\theta)} q(\theta | y_0)</script>.
The paper focuses on the case where <script type="math/tex">\pi(\theta)</script> is normal or uniform.
In this case, since <script type="math/tex">\tilde{\pi}(\theta)</script> is also normal, the resulting posterior estimate can be shown to itself be a mixture of normals (see Appendix C of the paper).</p>
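<p>The importance sampling correction can be illustrated numerically. Below is a toy sketch of my own, not the paper’s closed-form result: the prior, the proposal <script type="math/tex">\tilde{\pi}(\theta)</script> and the MDN fit <script type="math/tex">q(\theta | y_0)</script> are all normals with arbitrarily chosen parameters, and the adjusted estimate, proportional to <script type="math/tex">q(\theta | y_0) \, \pi(\theta) / \tilde{\pi}(\theta)</script>, is renormalised on a grid.</p>

```python
import numpy as np

def norm_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (np.sqrt(2 * np.pi) * sd)

# Grid of theta values for numerical normalisation.
theta = np.linspace(-5, 5, 2001)
dx = theta[1] - theta[0]

prior = norm_pdf(theta, 0.0, 2.0)        # pi(theta)
proposal = norm_pdf(theta, 0.5, 1.0)     # pi~(theta): rough posterior estimate
q = norm_pdf(theta, 0.6, 0.4)            # q(theta | y0): fit under the proposal

# Adjust for having sampled theta from the proposal rather than the prior.
adjusted = q * prior / proposal
adjusted /= adjusted.sum() * dx          # renormalise on the grid
```

<p>When <script type="math/tex">\tilde{\pi} = \pi</script> the ratio is 1 and no adjustment is needed; Appendix C of the paper shows that for normal or uniform priors the adjusted estimate is itself a normal mixture, so this grid step is unnecessary in practice.</p>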
<p>The paper considers several schemes:</p>
<ol>
<li>A non-sequential approach fitting a mixture estimate.</li>
<li>Several iterations of the sequential scheme above, always returning a normal estimate of the posterior. (This was where the SVI method mentioned above was needed.)</li>
<li>Using the result of 2 as a proposal density to fit a mixture estimate of the posterior.</li>
</ol>
<p>As noted in the paper, this general idea of a sequential approach is similar in spirit to existing sequential Monte Carlo algorithms for ABC.
Edit: In this literature, it’s considered to be a good idea to sample <script type="math/tex">\theta</script> from a wider distribution than the current approximation of the posterior:
see <a href="http://biomet.oxfordjournals.org/content/96/4/983">Beaumont et al</a> and a more theoretical argument in <a href="http://arxiv.org/abs/1506.03481">Li and Fearnhead</a>.
I wonder if the same applies here.</p>
<h2 id="simulation-studies">Simulation studies</h2>
<p>The paper contains several simulation studies, studying how well the method learns for observations simulated under known parameter values.
In all studies the MDN methods substantially outperform ABC methods: rejection, MCMC and SMC.
Typically fitting the <em>entire posterior</em> has a cost comparable to producing a <em>single</em> effective sample under ABC.
And the quality of the posterior fits is comparable to or better than the best produced under ABC, without the need to worry about what value of <script type="math/tex">\epsilon</script> should be used.</p>
<p>Of the three MDN schemes mentioned above, the sequential methods require substantially fewer simulations.
In some cases the quality of the fit is similar for both sequential methods, but in others the mixture model is superior to the normal model.</p>
<h2 id="comments">Comments</h2>
<p>This paper certainly seems like a big step forwards for likelihood-free inference, making substantial progress on most of the limitations listed in my <a href="/research/2016/06/07/bayesian-inference-by-neural-networks">first post</a>.
Benefits include:</p>
<ul>
<li>Beating the performance of ABC methods by orders of magnitude.</li>
<li>Avoiding the tuning parameter <script type="math/tex">\epsilon</script>, which is notoriously hard to select and interpret.</li>
<li>Reducing - but not eliminating, see below - the need to choose summary statistics.</li>
</ul>
<p>However as an academic I’m required to be cautious and say further research is needed to validate the performance of the algorithm in applied settings.
In particular one issue with existing ABC regression methods is that they can produce very poor estimates when the observations are substantially different to the simulated data.
This is because the regressions need to extrapolate rather than interpolate.
I wonder if the MDN approach is robust to this sort of problem.</p>
<p>Finally here are some other comments on the approach.</p>
<ul>
<li>
<p><strong>Tuning choices</strong> Is there a good way to decide tuning issues such as how many mixture components to use and the architecture of the neural network?</p>
</li>
<li>
<p><strong>Scaling to high dimensional data</strong> The paper considers two examples with high dimensional data: a model of population dynamics and a queuing system. The population example has around 300 observations (large in the context of likelihood-free methods!). The queuing example has a variable number of observations. In both cases a small number of summary statistics of the data are used as input to the neural network. This raises several questions: (a) what’s a good way to choose summary statistics here, and are methods proposed for ABC still of any use? (b) can we judge how much information is lost by using summary statistics? (c) can dimension reduction be included as an initial stage of the neural network, e.g. through some suitable sparse network structure, to avoid needing an enormous amount of data?</p>
</li>
<li>
<p><strong>Parameterisation</strong> Some posteriors are closer to normal after a parameter transformation. Perhaps a suitable transformation - e.g. elementwise Box-Cox transformations - could be learnt jointly with the rest of the model.</p>
</li>
<li>
<p><strong>Uncertainty</strong> It would be nice to take into account the uncertainty of the estimated posterior. Perhaps the Bayesian SVI method mentioned in the paper could produce this as a by-product.</p>
</li>
<li>
<p><strong>Validation</strong> Is it any easier to validate the quality of a fitted posterior than for ABC output?</p>
</li>
<li>
<p><strong>Scaling to high dimensional parameters</strong> The variance matrices in the mixture model require <script type="math/tex">O(p^2)</script> entries (where <script type="math/tex">p</script> is the number of model parameters), which might make scaling to large <script type="math/tex">p</script> difficult.</p>
</li>
<li>
<p><strong>Latent variables</strong> Sometimes we’d like to include latent variables as parameters, and then integrate them out.
For example in model choice it’s generally efficient to learn about the parameters of each model.
Similarly in parameter inference perhaps it would be beneficial to learn about latent variables representing part of the stochastic simulation process.</p>
</li>
<li>
<p>Edit: <strong>Tails</strong> Normal mixtures won’t be able to match especially heavy or light tailed posteriors. Is there a flexible alternative?</p>
</li>
</ul>
<h1>Bayesian inference by neural networks. Part 1: background (2016-06-07)</h1>
<p>Recently George Papamakarios and Iain Murray published <a href="http://arxiv.org/pdf/1605.06376.pdf">an arXiv preprint</a> on likelihood-free Bayesian inference which substantially beats the state-of-the-art performance by a neat application of neural networks.
I’m going to write a pair of blog posts about their paper.
First of all this post summarises background material on likelihood-free methods and their limitations.
The <a href="/research/2016/06/07/bayesian-inference-by-neural-networks2">second post</a> will review and comment on the new paper.
Over the next few months I’m hoping to spend some time experimenting with their method,
and I might post about my experiences in implementing it, if it’s amenable to short blog posts!</p>
<p><strong>Edit Nov 2016:</strong> I started trying to implement this method but got distracted by new research ideas!
Also I found George Papamakarios’s <a href="https://github.com/gpapamak/epsilon_free_inference">theano code</a> for this paper. Amongst other things, this illustrates how to implement Cholesky products neatly in theano, something I’d have struggled to figure out by myself.
More generally the <a href="http://edwardlib.org/tutorials/mixture-density-network">Edward library</a> can implement MDNs using TensorFlow, although I’m not sure if it automates Cholesky products.</p>
<h2 id="intractable-likelihoods">Intractable likelihoods</h2>
<p><em>Model-based statistics</em> assumes that the observed data (e.g. deaths from an infectious disease) has been produced from a random distribution or <em>probability model</em>.
The model usually involves some unknown <em>parameters</em> (e.g. controlling rates of infection and death for a disease).
<em>Statistical inference</em> aims to learn the parameters from the data.
This might be an end in itself - if the parameters have interesting real world implications we wish to learn - or as part of a larger workflow such as prediction or decision making.</p>
<p>Classical approaches to statistical inference are based on the probability (or probability density) of the observed data <script type="math/tex">y_0</script> given particular parameter values <script type="math/tex">\theta</script>.
This is known as the <em>likelihood function</em>, <script type="math/tex">\pi(y_0|\theta)</script>.
Since <script type="math/tex">y_0</script> is fixed this is a function of <script type="math/tex">\theta</script> and so can be written <script type="math/tex">L(\theta)</script>.
Approaches to inference involve either optimising this function, as in <a href="https://en.wikipedia.org/wiki/Maximum_likelihood">maximum likelihood</a> methods, or exploring it, as in <a href="https://en.wikipedia.org/wiki/Bayesian_statistics">Bayesian</a> methods, which are described in more detail shortly.
A crucial implicit assumption of both approaches is that it’s possible and computationally inexpensive to numerically evaluate the likelihood function.</p>
<p>As computing power has increased over the last few decades, there are an increasing number of interesting situations for which this assumption doesn’t hold.
Instead models are available from which data can be simulated, but where the likelihood function is <em>intractable</em>, in that it cannot be numerically evaluated in a practical time.
Examples include models of:</p>
<ul>
<li>Climate</li>
<li>High energy physics reactions</li>
<li>Variation in genetic sequences over a population</li>
<li>Molecular level biological reactions</li>
<li>Infectious diseases</li>
</ul>
<p>One common reason for intractability is that there are a very large number of ways in which the observable data can be generated, and it would be necessary to sum the probability contributions of all of these.</p>
<h2 id="bayesian-inference">Bayesian inference</h2>
<p>The majority of work in this area, including the paper I’m discussing, is focused on the Bayesian approach to inference.
Here a probability distribution must be specified on the unknown parameters, usually through a density <script type="math/tex">\pi(\theta)</script>.
This represents <em>prior beliefs</em> about the parameters before any data is observed.
The aim is to learn the <em>posterior beliefs</em> resulting from updating the prior to incorporate the observations.
Mathematically this is an application of conditional probability using Bayes theorem:
the posterior is <script type="math/tex">\pi(\theta | y_0) = k \pi(\theta) L(\theta)</script>, where <script type="math/tex">k</script> is a constant of proportionality that is typically hard to calculate.
A central aim of Bayesian inference is to produce methods which approximate useful properties of the posterior in a reasonable time.</p>
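<p>As a concrete sketch of Bayes theorem in action, here it is applied numerically on a grid, using a toy conjugate normal example of my own where the exact posterior is known (<script type="math/tex">\theta \sim N(0,1)</script>, <script type="math/tex">y_0 | \theta \sim N(\theta, 1)</script>, observed <script type="math/tex">y_0 = 2</script>, so the posterior is <script type="math/tex">N(1, 1/2)</script>):</p>

```python
import numpy as np

# Grid over theta, fine and wide enough for numerical integration.
theta = np.linspace(-6, 6, 4001)
dx = theta[1] - theta[0]

prior = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)          # pi(theta)
likelihood = np.exp(-0.5 * (2.0 - theta)**2) / np.sqrt(2 * np.pi)  # L(theta)

# Bayes theorem: posterior = k * prior * likelihood, with the constant k
# fixed by requiring the density to integrate to 1.
unnorm = prior * likelihood
k = 1.0 / (unnorm.sum() * dx)
posterior = k * unnorm

post_mean = (theta * posterior).sum() * dx   # exact answer here is 1
```

<p>Of course this grid trick needs a tractable likelihood and low dimension; it is exactly what intractable-likelihood models rule out.</p>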
<h2 id="simulator-based-inference-and-abc">Simulator based inference and ABC</h2>
<p><a href="/research/2016/01/03/LFtimeline">Several methods</a> have been proposed for inference using simulators rather than the likelihood function, sometimes called “likelihood-free inference”.
One of the most popular is <a href="https://en.wikipedia.org/wiki/Approximate_Bayesian_computation">approximate Bayesian computation</a>.
The simplest version of this is based on <a href="https://en.wikipedia.org/wiki/Rejection_sampling">rejection sampling</a>:</p>
<ol>
<li>Sample <script type="math/tex">\theta_i</script> values for <script type="math/tex">1 \leq i \leq n</script> from <script type="math/tex">\pi(\theta)</script>.</li>
<li>Simulate datasets <script type="math/tex">y_i</script> from the model given parameters <script type="math/tex">\theta_i</script> for <script type="math/tex">1 \leq i \leq n</script>.</li>
<li>Accept parameters for which <script type="math/tex">d(y_i, y_0) \leq \epsilon</script>, and reject the remainder.</li>
</ol>
<p>Here the user must specify the number of iterations <script type="math/tex">n</script>,
the acceptance threshold <script type="math/tex">\epsilon</script>,
and the distance function <script type="math/tex">d(\cdot,\cdot)</script>.</p>
<p>The accepted parameters are a sample from an approximation to the posterior with density proportional to
<script type="math/tex">\int \pi(\theta | y) I(d(y,y_0) \leq \epsilon) dy</script>
(where <script type="math/tex">I</script> represents an indicator function: 1 when its argument is true and 0 otherwise.)</p>
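<p>The three rejection sampling steps above can be sketched in a few lines. This is a toy example of my own: a normal model with a uniform prior, with the distance taken to be the difference in sample means.</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: theta ~ Uniform(-5, 5) prior; y | theta is 10 draws from N(theta, 1).
# Distance d(y, y0): absolute difference of the sample means.
y0 = rng.normal(1.5, 1.0, size=10)       # pretend these are the observations
n, eps = 20000, 0.2

theta = rng.uniform(-5, 5, n)                        # 1. sample from the prior
y_sim = theta[:, None] + rng.normal(size=(n, 10))    # 2. simulate datasets
dist = np.abs(y_sim.mean(axis=1) - y0.mean())        #    d(y_i, y0)
accepted = theta[dist <= eps]                        # 3. accept or reject
```

<p>The accepted values approximate the posterior; note how few of the <script type="math/tex">n</script> proposals survive, which is the inefficiency discussed below.</p>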
<p>Obtaining a sample from the posterior is a standard approach to Bayesian inference known as the <em>Monte Carlo method</em>.
Such a sample can be used to approximate most interesting properties of the posterior.
Sampling from an approximation to the posterior, as in ABC, allows inference but adds an extra layer of approximation.</p>
<p>Rejection sampling is inefficient.
Typically the posterior is much more concentrated than the prior, especially if there are more than a handful of parameters to learn.
Therefore most simulations will be rejected.
Several more sophisticated ABC algorithms have been invented to avoid this difficulty.
These include versions of <a href="http://www.pnas.org/content/100/26/15324.short">Markov chain Monte Carlo</a> (MCMC) and <a href="http://rsif.royalsocietypublishing.org/content/6/31/187.short">sequential Monte Carlo</a> (SMC).
Both of these propose <script type="math/tex">\theta</script> values which attempt to focus on regions of higher posterior probability.</p>
<p>More recently several other algorithms have been proposed to improve the efficiency of ABC.
These include ideas such as <a href="http://arxiv.org/abs/1502.05503">Bayesian optimisation</a>,
<a href="http://papers.nips.cc/paper/5881-optimization-monte-carlo-efficient-and-embarrassingly-parallel-likelihood-free-inference">learning latent variables</a>,
<a href="http://arxiv.org/abs/1512.00205">expectation propagation</a>,
and <a href="http://arxiv.org/abs/1305.4283">classical testing theory</a>,
as well as various methods which, like that of Papamakarios and Murray, use <a href="http://arxiv.org/abs/1506.02169">classifiers</a> and <a href="http://arxiv.org/abs/1605.05537">regression</a> methods from machine learning.</p>
<h2 id="limitations-of-likelihood-free-methods">Limitations of likelihood-free methods</h2>
<p>ABC works well for sufficiently simple problems.
For example see this <a href="http://www.annualreviews.org/doi/abs/10.1146/annurev-ecolsys-102209-144621?journalCode=ecolsys">review article</a> on applications in genetics and ecology.
But it has several serious drawbacks in more difficult problems.</p>
<h3 id="1-tuning-requirements">1) Tuning requirements</h3>
<p>The ABC rejection algorithm outlined above requires selection of a threshold <script type="math/tex">\epsilon</script>.
There’s some theoretical work on its optimal choice in an asymptotic regime of a very large number of simulations, but it’s hard to apply this in practice.
One reason is that more complex ABC algorithms require many other tuning choices which interact with selection of <script type="math/tex">\epsilon</script>.
A particularly challenging issue is the choice of summary statistics (see item 4 in this list).</p>
<h3 id="2-validation">2) Validation</h3>
<p>ABC samples from an approximation to the posterior, and it’s hard to know how much to trust this approximation.</p>
<h3 id="3-expensive-simulators">3) Expensive simulators</h3>
<p>Even efficient ABC algorithms usually require at least tens of thousands of simulations.
This can be impractical for expensive simulators.</p>
<h3 id="4-high-dimensional-data-and-parameters">4) High dimensional data (and parameters)</h3>
<p>ABC suffers from a <em>curse of dimensionality</em> problem.
This is because it uses a nearest neighbours type approach.
Parameters are inferred based on the simulated data sets which are closest to the observations.
But as the dimension of the data increases, the simulations become increasingly sparse in the space of data sets, so the nearest neighbours become worse representations of the observed data.
This can be a problem for ABC even when the dimension of the data is only in the tens!</p>
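<p>A quick numerical illustration of this curse of dimensionality (a toy example of my own): with a fixed budget of simulations, the distance from the observations to their nearest simulated neighbour grows rapidly with the dimension of the data.</p>

```python
import numpy as np

rng = np.random.default_rng(3)

def nearest_distance(dim, n=1000):
    """Distance from a fixed 'observed' point to its nearest neighbour
    among n simulated points, all standard normal in `dim` dimensions."""
    y0 = rng.normal(size=dim)
    sims = rng.normal(size=(n, dim))
    return np.linalg.norm(sims - y0, axis=1).min()

d2 = nearest_distance(2)     # low dimension: a close neighbour exists
d50 = nearest_distance(50)   # high dimension: even the nearest point is far away
```

<p>In 50 dimensions even the best match is far from the observations, so accepted parameters are drawn from simulations that poorly represent <script type="math/tex">y_0</script>.</p>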
<p>To deal with this problem practitioners have used dimension reduction methods to replace high dimensional data with low dimensional summary statistics.
This tuning choice involves some loss of information about the parameters.
It’s hard to judge which choice of summaries makes the best trade-off between information and dimension, although many methods have been proposed.</p>
<p>ABC usually also performs poorly for high dimensional parameters.
This is because learning such parameters generally entails having data which is high dimensional enough to cause problems.</p>
<p>In the <a href="/research/2016/06/07/bayesian-inference-by-neural-networks2">second blog post</a> I’ll look at how the new paper tackles some of these problems.</p>
<h1>Jupyter and R basics (2016-01-17)</h1>
<p>Below are some slides I wrote on the basics of using R with the <a href="http://jupyter.org/">Jupyter notebook</a>. Navigate by pressing space to avoid missing any slides! You can also see this in notebook form <a href="https://github.com/dennisprangle/R-North-East-Talk-2016">here</a>.</p>
<iframe width="840" height="630" src="http://nbviewer.ipython.org/format/slides/github/dennisprangle/R-North-East-Talk-2016/blob/master/Jupyter%20talk%20for%20R%20North-East%20meetup.ipynb" frameborder="0" allowfullscreen="true"> </iframe>
<h1>Likelihood-free timeline (2016-01-03)</h1>
<p>This is an incomplete timeline of the appearance of various “likelihood-free” inference methods.
Please let me know if there are any mistakes or things I should add.</p>
<p>The methods listed perform statistical inference based on repeated model simulations rather than likelihood evaluations, which are expensive or impossible for some complex models.
There are some other ways to avoid likelihood evaluations - e.g. <a href="http://www.pnas.org/content/110/4/1321.short">empirical likelihood</a> - which could also be thought of as “likelihood-free”, so perhaps there’s a need for a more precise but equally catchy name in the future!</p>
<p>I’ve also avoided listing papers on selecting summary statistics for use in these methods.
See <a href="http://projecteuclid.org/euclid.ss/1369147911">Blum et al, 2013</a> and <a href="http://arxiv.org/abs/1512.05633">Prangle, 2015</a> for reviews.</p>
<h3 id="1980s">1980s</h3>
<ul>
<li>
<p>1984 <a href="http://www.jstor.org/stable/2345504">Diggle and Gratton</a> on inference for implicit models. This paper also discusses some precursors in the 1970s which use ad-hoc likelihood-free methods for particular applications.</p>
</li>
<li>
<p>1984 <a href="http://projecteuclid.org/euclid.aos/1176346785">Rubin</a> presents a likelihood-free rejection sampling algorithm as an intuitive explanation of Bayesian methods, but not as a practical method.</p>
</li>
<li>
<p>1989 <strong>Simulated method of moments</strong>, <a href="http://www.jstor.org/stable/1913621">McFadden</a> (econometrics).</p>
</li>
</ul>
<h3 id="1990s">1990s</h3>
<ul>
<li>
<p>1992 <strong>GLUE</strong>, <a href="http://onlinelibrary.wiley.com/doi/10.1002/hyp.3360060305/abstract">Beven and Binley</a> (hydrology).</p>
</li>
<li>
<p>1993 <strong>Indirect inference</strong>, <a href="http://onlinelibrary.wiley.com/doi/10.1002/jae.3950080507/abstract">Gourieroux et al</a> (econometrics).</p>
</li>
<li>
<p>1997 Approximate Bayesian computation (<strong>ABC</strong>) (population genetics). Early papers include <a href="http://www.genetics.org/content/145/2/505.short">Tavare et al</a> and <a href="http://mbe.oxfordjournals.org/content/14/2/195.short">Fu and Li</a>.</p>
</li>
</ul>
<h3 id="2000s">2000s</h3>
<ul>
<li>
<p>2003 <strong>ABC-MCMC</strong>, <a href="http://www.pnas.org/content/100/26/15324.full">Marjoram et al</a>.</p>
</li>
<li>
<p>2006 <strong>Convolution filter</strong>, <a href="http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4177291">Campillo and Rossi</a>.</p>
</li>
<li>
<p>2006 <strong>Iterated filtering</strong>, <a href="http://www.pnas.org/content/103/49/18438.short">Ionides et al</a>.</p>
</li>
<li>
<p>2007-2012 <strong>ABC-SMC/PMC</strong>, <a href="http://www.pnas.org/content/104/6/1760.full">Sisson et al</a>, <a href="http://biomet.oxfordjournals.org/content/96/4/983">Beaumont et al</a>, <a href="http://rsif.royalsocietypublishing.org/content/6/31/187">Toni et al</a>, <a href="http://link.springer.com/article/10.1007/s11222-011-9271-y">Del Moral, et al</a>.</p>
</li>
</ul>
<h3 id="2010s">2010s</h3>
<ul>
<li>
<p>2010 <strong>Synthetic likelihood</strong>, <a href="http://www.nature.com/nature/journal/v466/n7310/abs/nature09319.html">Wood</a>.</p>
</li>
<li>
<p>2010 <strong>Coupled ABC</strong>, <a href="http://link.springer.com/article/10.1007/s11222-010-9216-x">Neal</a> (epidemiology) based on utilising latent variables.</p>
</li>
<li>
<p>2015 <strong>Bayesian indirect likelihood</strong>, <a href="http://projecteuclid.org/euclid.ss/1425492441">Drovandi et al</a>.</p>
</li>
<li>
<p>2015 <strong>Classifier</strong>-based approaches: the random forest method of <a href="http://bioinformatics.oxfordjournals.org/content/early/2015/12/23/bioinformatics.btv684.abstract">Pudlo et al</a> and the likelihood ratio estimation method of <a href="http://arxiv.org/abs/1506.02169">Cranmer et al</a>. A related but more expensive approach from 2014 is by <a href="http://onlinelibrary.wiley.com/doi/10.1002/sta4.56/abstract">Pham et al</a>.</p>
</li>
<li>
<p>2015 <strong>Optimisation Monte Carlo</strong>, <a href="http://papers.nips.cc/paper/5881-optimization-monte-carlo-efficient-and-embarrassingly-parallel-likelihood-free-inference">Meeds and Welling</a> and the closely related <strong>reverse sampler</strong> of <a href="http://arxiv.org/abs/1506.04017">Forneron and Ng</a>.
Both exploit a latent variable formulation.</p>
</li>
<li>
<p>2016 <a href="http://arxiv.org/abs/1605.07826">Graham and Storkey</a> use <strong>constrained Hamiltonian Monte Carlo</strong> to perform joint updates on parameters and latent variables conditioned on observations.</p>
</li>
<li>
<p>2016 <strong>Automatic variational ABC</strong>, <a href="https://arxiv.org/abs/1606.08549">Moreno et al</a>, using latent
variable draws in the estimation of loss function gradients.</p>
</li>
<li>
<p>2016 <a href="http://arxiv.org/abs/1605.06376">Papamakarios and Murray</a> learn a <strong>mixture density network</strong> to predict the parameter posterior from observations.</p>
</li>
</ul>
<h1>My work software list (2015-09-27)</h1>
<p>As I’ve just started a new job, I’ve been given a new desktop PC to work on.
I thought I’d keep track of what I install on it so that I’ve got a complete list for next time.
It’s running some (slightly customised?) flavour of ubuntu which includes quite a lot of stuff by default.</p>
<h2 id="already-installed">Already installed</h2>
<ul>
<li>firefox</li>
<li>thunderbird</li>
<li>git</li>
<li>R</li>
<li>latex</li>
</ul>
<h2 id="additions">Additions</h2>
<ul>
<li>emacs (not installed by default?!)</li>
<li>dropbox</li>
<li>julia</li>
<li>rstan</li>
<li>quicktile (python script giving keyboard shortcuts for tiling - there’s compiz options for this but I can never get them to work)</li>
<li>thunderbird extensions
<ul>
<li>exQuilla (to work with exchange server)</li>
<li>markdown here (update Jan 2016: I stopped using this as thunderbird now has an insert “Mathematical Formula” option that seems to work with a bigger range of email clients)</li>
<li>archive this (keyboard shortcuts for moving messages)</li>
</ul>
</li>
<li>firefox extensions
<ul>
<li>adblock plus</li>
<li>lastpass</li>
</ul>
</li>
</ul>
<h1>Newcastle staff email on thunderbird (2015-09-20)</h1>
<p>I’ve recently started working at Newcastle University and found it a bit tricky to read my staff email with thunderbird.
In case it saves anyone else some time, here’s the settings I used.
This worked for me as of September 2015.</p>
<ul>
<li>I used the thunderbird add-on <a href="https://exquilla.zendesk.com/home">ExQuilla</a>.
This accesses Microsoft Exchange email accounts.
It requires a $10/year license, but has a 60 day free period where you can decide if you like it.</li>
<li>Create a new email account with ExQuilla</li>
<li>Put in your email address and password</li>
<li>Log in with userid (<code class="highlighter-rouge">youruserid@newcastle.ac.uk</code>) and leave domain blank.</li>
<li>Specify manual EWS URL: <code class="highlighter-rouge">https://outlook.office365.com/EWS/Exchange.asmx</code> and complete other details as you wish.</li>
</ul>