
Probabilistic numerics for deep learning
Published on 2017-07-27 · 6145 views
Presentation
Probabilistic numerics for deep learning (00:00)
Probabilistic numerics treats computation as a decision - 1 (35:20:01)
Probabilistic numerics treats computation as a decision - 2 (39:38:35)
Probabilistic numerics treats computation as a decision - 3 (41:41:15)
Probabilistic numerics is the study of numerical methods as learning algorithms. (46:13:16)
Global optimisation considers objective functions that are multi-modal and often expensive to evaluate. (58:14:53)
The Rosenbrock function is expressible in closed form. (72:56:49)
Computational limits form the core of the optimisation problem. (96:53:01)
We are epistemically uncertain about f(x,y) because we cannot afford to compute it. (108:55:33)
Probabilistic modelling of functions (134:01:15)
Probability theory represents an extension of traditional logic, allowing us to reason in the face of uncertainty. (139:03:57)
A probability is a degree of belief. This might be held by any agent – a human, a robot, a pigeon, etc. (185:02:49)
‘I’ is the totality of an agent’s prior information. An agent is (partially) defined by I. (194:35:53)
The Gaussian distribution allows us to produce distributions for variables conditioned on any other observed variables. (224:49:12)
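For reference, the identity behind this slide (a standard result; notation mine): partitioning a joint Gaussian into observed and unobserved components gives a closed-form Gaussian conditional,

\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)
\quad\Rightarrow\quad
p(x_1 \mid x_2) = \mathcal{N}\!\left( \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right).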
A Gaussian process is the generalisation of a multivariate Gaussian distribution to a potentially infinite number of variables. (280:06:05)
A Gaussian process provides a non-parametric model for functions, defined by mean and covariance functions. (299:00:01)
Gaussian processes are specified by a covariance function, which flexibly allows the expression of e.g. (304:54:35)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 1 (317:50:12)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 2 (335:47:43)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 3 (339:26:42)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 4 (339:34:10)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 5 (344:22:28)
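To make the preceding slides concrete, here is a minimal GP regression sketch (NumPy only; the RBF kernel and all hyperparameter values are illustrative choices, not taken from the talk):

import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(a, b) = v * exp(-(a - b)^2 / (2 l^2))."""
    sq = (A[:, None] - B[None, :]) ** 2
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def gp_posterior(X, y, Xs, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at test points Xs,
    conditioned on observations (X, y) -- the Gaussian conditioning
    identity above, applied to function values."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    Kss = rbf_kernel(Xs, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - np.sum(v ** 2, axis=0)
    return mean, var

# Toy usage: condition on five noisy sine observations.
X = np.linspace(-3, 3, 5)
y = np.sin(X) + 0.01 * np.random.randn(5)
Xs = np.linspace(-3, 3, 101)
mean, var = gp_posterior(X, y, Xs)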
Bayesian optimisation as decision theory (344:46:19)
Bayesian optimisation is the approach of probabilistically modelling f(x,y) and using decision theory to make optimal use of computation. (349:51:10)
By defining the costs of observation and of uncertainty, we can select evaluations optimally by minimising the expected loss with respect to a probability distribution. (358:37:13)
We define the loss as the lowest function value found once our algorithm ends. (371:50:42)
This loss function makes computing the expected loss simple: we take a myopic approximation and consider only the next evaluation. (408:15:53)
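Concretely (notation mine, following the slide's description): if η is the lowest value observed so far, then after one further evaluation y at x the returned loss is min(y, η), so the myopic expected loss of evaluating at x is

\Lambda(x) = \int \min(y, \eta)\, p(y \mid x, \mathcal{D})\, \mathrm{d}y.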
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 1 (422:47:53)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 2 (453:36:16)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 3 (470:58:51)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 4 (480:27:51)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 5 (484:29:14)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 6 (487:02:48)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 7 (488:55:58)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 8 (490:32:08)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 9 (553:00:52)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 10 (610:41:51)
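Since the GP posterior at x is Gaussian, y ~ N(m(x), s(x)²), the expected loss above has the standard closed form E[min(y, η)] = η + (m − η)Φ(u) − s φ(u), with u = (η − m)/s. A minimal NumPy/SciPy sketch (function name mine):

import numpy as np
from scipy.stats import norm

def expected_loss(mean, std, eta):
    """Myopic expected loss E[min(y, eta)] for y ~ N(mean, std**2),
    where eta is the lowest function value observed so far."""
    u = (eta - mean) / std
    return eta + (mean - eta) * norm.cdf(u) - std * norm.pdf(u)

The next evaluation is placed wherever this quantity, computed from the GP posterior mean and standard deviation, is smallest.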
Tuning is used to cope with model parameters (such as periods). (613:32:39)
Bayesian optimisation gives a powerful method for such tuning. (644:11:10)
Snoek, Larochelle and Adams (2012) used Bayesian optimisation to tune convolutional neural networks. (650:20:31)
Bayesian optimisation is useful in automating structured search over the number of hidden layers, learning rates, dropout rates, the number of hidden units per layer, and L2 weight constraints. (697:03:27)
Bayesian stochastic optimisation (709:10:23)
Using only a subset of the data (a mini-batch) gives a noisy likelihood evaluation. (713:15:58)
If we use Bayesian optimisation on these noisy evaluations, we can perform stochastic learning. (743:21:05)
Lower-variance evaluations (on larger subsets) are higher cost: let's also Bayesian-optimise over the fidelity of our evaluations! (756:21:00)
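A toy illustration of the trade-off (the rescaling and cost model here are my assumptions, not necessarily the talk's formulation): the mini-batch size controls both the variance and the cost of a log-likelihood estimate, so it can be treated as a fidelity variable to optimise over alongside x.

import numpy as np

def minibatch_log_likelihood(log_lik_fn, data, batch_size, rng):
    """Unbiased but noisy estimate of the full-data log-likelihood:
    a mini-batch sum rescaled by n / batch_size. log_lik_fn maps an
    array of data points to per-point log-likelihoods. Larger batches
    give lower variance at proportionally higher cost."""
    idx = rng.choice(len(data), size=batch_size, replace=False)
    return len(data) / batch_size * np.sum(log_lik_fn(data[idx]))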
Quiz: which of these sequences is random? - 1 (817:08:58)
Quiz: which of these sequences is random? - 2 (838:47:28)
A random number (871:59:30)
Integration beats optimisation (938:35:49)
The naïve fitting of models to data performed by optimisation can lead to overfitting. (944:56:36)
Bayesian averaging over ensembles of models reduces overfitting, and provides more honest estimates of uncertainty. (952:13:36)
Our model (962:12:20)
Averaging requires integrating over the many possible states of the world consistent with data: this is often non-analytic. (995:25:47)
Numerical integration (quadrature) is ubiquitous. (1003:25:41)
Optimisation is an unreasonable way of estimating a multi-modal or broad likelihood integrand. (1031:20:12)
If optimising, flat optima are often a better representation of the integral than narrow optima. (1057:26:13)
Bayesian quadrature makes use of a Gaussian process surrogate for the integrand (the same as you might use for Bayesian optimisation). (1074:49:51)
Gaussian-distributed variables are jointly Gaussian with any affine transform of them. (1092:05:41)
A function over which we have a Gaussian process is jointly Gaussian with any integral or derivative of it, as integration and differentiation are linear. (1116:24:39)
We can use observations of an integrand ℓ in order to perform inference for its integral, Z: this is known as Bayesian quadrature. (1130:17:22)
Bayesian quadrature generalises and improves upon traditional quadrature. (1156:00:24)
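A minimal 1-D Bayesian quadrature sketch under illustrative assumptions (zero-mean GP with RBF kernel on the integrand and a standard-normal input density, for which the kernel mean is analytic; this is plain Bayesian quadrature, not the warped WSABI variant mentioned below):

import numpy as np

def bq_estimate(X, ell, lengthscale=1.0, jitter=1e-8):
    """Posterior mean of Z = integral of ell(x) N(x; 0, 1) dx under a
    zero-mean GP prior on ell with k(x, x') = exp(-(x - x')**2 / (2 l**2))."""
    l2 = lengthscale ** 2
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / l2)
    K += jitter * np.eye(len(X))
    # Kernel mean z(x) = integral of k(x, x') N(x'; 0, 1) dx',
    # analytic for this kernel/density pair.
    z = np.sqrt(l2 / (l2 + 1.0)) * np.exp(-0.5 * X ** 2 / (l2 + 1.0))
    weights = np.linalg.solve(K, z)  # quadrature weights K^{-1} z
    return weights @ ell(X)

# Toy usage: the integral of x^2 against N(0, 1) is exactly 1.
X = np.linspace(-3.0, 3.0, 15)
print(bq_estimate(X, lambda x: x ** 2))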
Quiz: what is the convergence rate of Monte Carlo? - 1 (1312:55:04)
Quiz: what is the convergence rate of Monte Carlo? - 2 (1326:35:03)
Monte Carlo (1343:20:34)
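For reference, the quiz's answer: the simple Monte Carlo estimator converges at rate O(N^{-1/2}) in root-mean-square error, independent of dimension,

\hat{Z}_N = \frac{1}{N} \sum_{i=1}^{N} \ell(x_i), \quad x_i \sim \pi,
\qquad
\mathbb{E}\!\left[ (\hat{Z}_N - Z)^2 \right]^{1/2} = \frac{\sigma}{\sqrt{N}} = \mathcal{O}(N^{-1/2}),

where σ² is the variance of ℓ(x) under π.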
Probabilistic numerics views the selection of samples as a decision problem. (1367:55:32)
Our method (Warped Sequential Active Bayesian Integration) converges quickly in wall-clock time for a synthetic integrand. (1495:57:55)
WSABI-L converges quickly in integrating out hyperparameters in a Gaussian process classification problem (CiteSeerx data). (1497:50:58)
Probabilistic numerics offers the propagation of uncertainty through numerical pipelines. (1498:03:33)
Probabilistic numerics treats computation as a decision. (1498:28:56)