# Finding the Best Fit: Bayesian Statistics Meets Geometry

Guest Commentary by Dr. Martyn Rittman

A common problem in research arises when a set of data has been collected on the one hand and a theoretical model (a set of equations with input parameters) exists on the other. We want to know, first, whether the model describes the data well and, second, which parameter values give the best possible fit.

A good example, discussed by Konukoglu et al. [1], is the personalization of medical data. The data are about heart rates, and a number of models have been proposed as an explanation. In application to a single patient, however, the data are often sparse, the models are complex, and it is important to know how much confidence to place in the model predictions.

Many methods exist to solve this kind of problem, and their number is increasing as the computational power required to crunch the data becomes cheaper. In most cases, the algorithm seeks a single value for each parameter. By harnessing the power of Bayesian inference, however, it is possible to seek a distribution of parameter values that best describes the data. Markov chain Monte Carlo (MCMC) is an example of such a method. It works by iteratively sampling from better and better estimates of the solution distribution to give a series of possible solution values (a Markov chain). The final answer is the distribution of the values generated after a certain number of initial iterations (the ‘burn-in’ period) have been discarded. MCMC exploits the fact that the final Markov chain must be reversible: once it has converged, the sequence of points should look statistically the same whether it is read forwards or backwards.
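The sampling loop described above can be sketched in a few lines. This is a minimal illustration, not the method of any of the cited papers: it assumes a one-dimensional target (a standard normal standing in for a real posterior), a symmetric Gaussian random-walk proposal, and hypothetical names such as `log_target` and `random_walk_metropolis`.

```python
import numpy as np

def log_target(theta):
    # Hypothetical unnormalised log-posterior: a standard normal.
    return -0.5 * theta**2

def random_walk_metropolis(log_target, theta0, n_steps, step_size, rng):
    """Build a Markov chain using a symmetric Gaussian random-walk proposal."""
    chain = np.empty(n_steps)
    theta = theta0
    log_p = log_target(theta)
    n_accepted = 0
    for i in range(n_steps):
        proposal = theta + step_size * rng.normal()
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if np.log(rng.uniform()) < log_p_new - log_p:
            theta, log_p = proposal, log_p_new
            n_accepted += 1
        chain[i] = theta  # a rejected proposal repeats the current value
    return chain, n_accepted / n_steps

rng = np.random.default_rng(0)
chain, acceptance_rate = random_walk_metropolis(
    log_target, theta0=5.0, n_steps=20000, step_size=2.4, rng=rng
)

burn_in = 2000            # assumed burn-in length; problem dependent in practice
samples = chain[burn_in:]  # these values approximate the target distribution
```

The chain starts deliberately far from the bulk of the target (`theta0=5.0`), so the first iterations are unrepresentative; discarding them is the role of the burn-in period.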

Figure: Over multiple MCMC runs, we can see how the log of the likelihood function (and each parameter) converge to a small range. Figure taken from [2].

Some advantages of the MCMC approach are that you can easily gauge how precise your values are from the width of the distribution, and thus how much confidence to place in the solution. Since a distribution is broader than a single point, proposed solutions are taken from a wider area than in so-called deterministic methods (e.g. those based on steepest descent), which helps to avoid the search being trapped in a small region of the parameter space that provides only the locally best value.
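As a small illustration of reading precision off the width of the distribution, the sketch below summarises a set of draws by their mean and a 95% interval. The draws here are synthetic stand-ins for post-burn-in MCMC output (an assumption made so the example is self-contained), and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for post-burn-in MCMC draws of a single parameter (synthetic values).
samples = rng.normal(loc=2.0, scale=0.5, size=10000)

# Point summary and a central 95% interval taken directly from the draws.
posterior_mean = samples.mean()
lower, upper = np.percentile(samples, [2.5, 97.5])
interval_width = upper - lower
```

A narrow interval suggests the data constrain the parameter well; a wide one signals that predictions depending on that parameter deserve less confidence.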

To solve the problem of medical data personalization, Konukoglu et al. [1] turned to a recent optimization of MCMC discussed by Samuel Livingstone and Mark Girolami in a paper in Entropy [3]. They present the Metropolis-Hastings MCMC algorithm [4,5] and offer some ways to improve it for particularly difficult cases. They build on their earlier work [6], in which they presented a method that makes use of the geometry of the parameter space to improve the chance of Markov chain convergence.

A major drawback of MCMC is that, unlike a deterministic algorithm, there is no sure-fire way to know when you should stop or how long convergence will take. The Markov chain may never converge at all. Practitioners try to modify the properties of the input distribution (known as the proposal kernel) to give a series of outputs that will quickly converge. The optimum proposal kernel is dependent on the shape of the final distribution.
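Although no test can prove convergence, practitioners often run several chains from different starting points and compare them. The sketch below computes the Gelman-Rubin potential scale reduction factor, a widely used heuristic that is not discussed in the papers cited here; it is included only as an illustration. The function name `gelman_rubin` and the synthetic chains are assumptions.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n_chains, n_draws = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()    # average within-chain variance
    between = n_draws * chain_means.var(ddof=1)   # between-chain variance
    pooled = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(pooled / within)

rng = np.random.default_rng(2)

# Four synthetic chains that all explore the same distribution.
mixed = rng.normal(size=(4, 1000))

# The same chains, but two of them are stuck around a different value.
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])

r_hat_mixed = gelman_rubin(mixed)
r_hat_stuck = gelman_rubin(stuck)
```

Values close to 1 are consistent with the chains having reached the same distribution, while values well above 1 indicate that they disagree. A value near 1 is a necessary sign rather than a guarantee.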

To deal with these issues, Livingstone and Girolami [3] show that the use of Langevin diffusion in the proposal step can lead to an increase of the optimum acceptance rate from 0.234, the value for a random-walk proposal, to 0.574, saving valuable computational time. The resulting algorithm is known as the Metropolis-adjusted Langevin algorithm (MALA). Taking this further, they incorporate ideas from Riemannian geometry, a branch of mathematics dealing with smooth manifolds, including how to define a metric: a concept of distance. An intelligent choice of metric can make points that were previously far apart become closer. Applied to MCMC, this means that points from which the Markov chain might diverge can be moved closer to ones from which it will converge, increasing the chance of a positive and quick result. The new algorithm is termed ‘MALA on a manifold’.
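To make the idea concrete, here is a minimal sketch of plain MALA (without the manifold extension). It assumes a two-dimensional standard normal target whose gradient is known in closed form; the names `log_target`, `grad_log_target` and `mala`, and the step size, are illustrative choices rather than anything taken from [3]. The key difference from a random walk is that each proposal is nudged along the gradient of the log target, and the accept/reject step corrects for the resulting asymmetry.

```python
import numpy as np

def log_target(theta):
    # Hypothetical unnormalised log-density: a standard normal in any dimension.
    return -0.5 * np.sum(theta**2)

def grad_log_target(theta):
    # Gradient of the log-density above.
    return -theta

def mala(log_target, grad_log_target, theta0, n_steps, step_size, rng):
    """Metropolis-adjusted Langevin algorithm with a fixed step size."""
    theta = np.asarray(theta0, dtype=float)
    dim = theta.size
    chain = np.empty((n_steps, dim))
    n_accepted = 0

    def log_q(to, frm):
        # Log-density (up to a constant) of proposing `to` when at `frm`.
        mean = frm + 0.5 * step_size**2 * grad_log_target(frm)
        return -np.sum((to - mean) ** 2) / (2.0 * step_size**2)

    for i in range(n_steps):
        # Langevin proposal: gradient drift plus Gaussian noise.
        drift = 0.5 * step_size**2 * grad_log_target(theta)
        proposal = theta + drift + step_size * rng.normal(size=dim)
        # Metropolis-Hastings ratio, including the asymmetric proposal terms.
        log_alpha = (log_target(proposal) + log_q(theta, proposal)
                     - log_target(theta) - log_q(proposal, theta))
        if np.log(rng.uniform()) < log_alpha:
            theta = proposal
            n_accepted += 1
        chain[i] = theta
    return chain, n_accepted / n_steps

rng = np.random.default_rng(3)
chain, acceptance_rate = mala(
    log_target, grad_log_target, theta0=[3.0, -3.0],
    n_steps=10000, step_size=1.0, rng=rng,
)
samples = chain[1000:]  # discard an assumed burn-in of 1000 iterations
```

The optimum acceptance rates quoted above are asymptotic results for high-dimensional targets, so a low-dimensional toy like this one should not be expected to reproduce them.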

A mapping from one metric to another can be given by a function G, and clearly the choice of G is critical. Livingstone and Girolami show how the Fisher information and the Hessian can be used.

Some interesting issues remain outstanding, such as how to avoid unwanted properties of G. For example, the best choice should be ‘positive definite’, essentially meaning that distances between two points shouldn’t become zero or negative, which is not always the case for the Hessian. Also, the added complexity means that the manifold method is not necessary in many cases. How does one go about deciding when to use it?
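The positive-definiteness issue is easy to see numerically. The sketch below checks a matrix with a Cholesky factorisation and, if needed, repairs it by replacing each eigenvalue with its absolute value, bounded below by a small floor. This eigenvalue repair is one simple workaround chosen for illustration, not the approach proposed in the papers discussed here; the function names and the example matrix are hypothetical.

```python
import numpy as np

def is_positive_definite(matrix):
    """Return True if a symmetric matrix admits a Cholesky factorisation."""
    try:
        np.linalg.cholesky(matrix)
        return True
    except np.linalg.LinAlgError:
        return False

def make_positive_definite(matrix, floor=1e-3):
    """Replace each eigenvalue by max(|eigenvalue|, floor)."""
    symmetric = 0.5 * (matrix + matrix.T)
    eigenvalues, eigenvectors = np.linalg.eigh(symmetric)
    repaired = np.maximum(np.abs(eigenvalues), floor)
    # Rebuild the matrix as V diag(repaired) V^T.
    return (eigenvectors * repaired) @ eigenvectors.T

# A Hessian-like matrix with one negative eigenvalue (a saddle point).
hessian = np.array([[2.0, 0.0],
                    [0.0, -1.0]])

metric = make_positive_definite(hessian)
```

Here `hessian` fails the check while `metric` passes it, at the cost of no longer being the true Hessian at that point.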

The interest of the MALA on a manifold method lies not just in widening the applicability of MCMC, but also in the steps forward that can be made by combining two very different fields of mathematical science: Bayesian statistics and geometry. It is a fine example of the benefits of a multidisciplinary approach.

Samuel Livingstone is a PhD student in Statistical Science at University College London, under the supervision of Professor Mark Girolami and Dr. Alexandros Beskos. He is interested in the properties of estimators constructed using Markov chain Monte Carlo methods, and in understanding how ideas from Riemannian geometry can alter these properties. Before his PhD he studied for an MSc in Statistics at Lancaster University, after a short previous career in consultancy.