
Probabilistic numerics for deep learning
Published on 2017-07-27 · 6145 views
Presentation
Probabilistic numerics for deep learning (00:00)
Probabilistic numerics treats computation as a decision - 1 (35:20:01)
Probabilistic numerics treats computation as a decision - 2 (39:38:35)
Probabilistic numerics treats computation as a decision - 3 (41:41:15)
Probabilistic numerics is the study of numerical methods as learning algorithms. (46:13:16)
Global optimisation considers objective functions that are multi-modal and often expensive to evaluate. (58:14:53)
The Rosenbrock function is expressible in closed form. (72:56:49)
Computational limits form the core of the optimisation problem. (96:53:01)
We are epistemically uncertain about f(x,y) because we cannot afford to compute it. (108:55:33)
Probabilistic modelling of functions (134:01:15)
Probability theory represents an extension of traditional logic, allowing us to reason in the face of uncertainty. (139:03:57)
A probability is a degree of belief. This might be held by any agent – a human, a robot, a pigeon, etc. (185:02:49)
‘I’ is the totality of an agent’s prior information. An agent is (partially) defined by I. (194:35:53)
The Gaussian distribution allows us to produce distributions for variables conditioned on any other observed variables. (224:49:12)
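For reference, the identity behind this slide (a standard result; notation mine): partitioning a joint Gaussian into observed and unobserved components gives a closed-form Gaussian conditional,

\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)
\quad\Rightarrow\quad
p(x_1 \mid x_2) = \mathcal{N}\!\left( \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right).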
A Gaussian process is the generalisation of a multivariate Gaussian distribution to a potentially infinite number of variables. (280:06:05)
A Gaussian process provides a non-parametric model for functions, defined by mean and covariance functions. (299:00:01)
Gaussian processes are specified by a covariance function, which flexibly allows the expression of e.g. (304:54:35)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 1 (317:50:12)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 2 (335:47:43)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 3 (339:26:42)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 4 (339:34:10)
Gaussian processes have a complexity that grows with the data; they provide flexible models, robust to overfitting. - 5 (344:22:28)
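To make the preceding slides concrete, here is a minimal GP regression sketch (NumPy only; the RBF kernel and all hyperparameter values are illustrative choices, not taken from the talk):

import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(a, b) = v * exp(-(a - b)^2 / (2 l^2))."""
    sq = (A[:, None] - B[None, :]) ** 2
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def gp_posterior(X, y, Xs, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at test points Xs,
    conditioned on observations (X, y) -- the Gaussian conditioning
    identity above, applied to function values."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    Kss = rbf_kernel(Xs, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - np.sum(v ** 2, axis=0)
    return mean, var

# Toy usage: condition on five noisy sine observations.
X = np.linspace(-3, 3, 5)
y = np.sin(X) + 0.01 * np.random.randn(5)
Xs = np.linspace(-3, 3, 101)
mean, var = gp_posterior(X, y, Xs)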
Bayesian optimisation as decision theory (344:46:19)
Bayesian optimisation is the approach of probabilistically modelling f(x,y) and using decision theory to make optimal use of computation. (349:51:10)
By defining the costs of observation and of uncertainty, we can select evaluations optimally by minimising the expected loss with respect to a probability distribution. (358:37:13)
We define the loss as the lowest function value found once our algorithm ends. (371:50:42)
This loss function makes computing the expected loss simple: we take a myopic approximation and consider only the next evaluation. (408:15:53)
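Concretely (notation mine, following the slide's description): if η is the lowest value observed so far, then after one further evaluation y at x the returned loss is min(y, η), so the myopic expected loss of evaluating at x is

\Lambda(x) = \int \min(y, \eta)\, p(y \mid x, \mathcal{D})\, \mathrm{d}y.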
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 1 (422:47:53)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 2 (453:36:16)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 3 (470:58:51)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 4 (480:27:51)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 5 (484:29:14)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 6 (487:02:48)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 7 (488:55:58)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 8 (490:32:08)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 9 (553:00:52)
We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss. - 10 (610:41:51)
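Since the GP posterior at x is Gaussian, y ~ N(m(x), s(x)²), the expected loss above has the standard closed form E[min(y, η)] = η + (m − η)Φ(u) − s φ(u), with u = (η − m)/s. A minimal NumPy/SciPy sketch (function name mine):

import numpy as np
from scipy.stats import norm

def expected_loss(mean, std, eta):
    """Myopic expected loss E[min(y, eta)] for y ~ N(mean, std**2),
    where eta is the lowest function value observed so far."""
    u = (eta - mean) / std
    return eta + (mean - eta) * norm.cdf(u) - std * norm.pdf(u)

The next evaluation is placed wherever this quantity, computed from the GP posterior mean and standard deviation, is smallest.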
Tuning is used to cope with model parameters (such as periods). (613:32:39)
Bayesian optimisation gives a powerful method for such tuning. (644:11:10)
Snoek, Larochelle and Adams (2012) used Bayesian optimisation to tune convolutional neural networks. (650:20:31)
Bayesian optimisation is useful in automating structured search over the number of hidden layers, learning rates, dropout rates, the number of hidden units per layer, and L2 weight constraints. (697:03:27)
Bayesian stochastic optimisation (709:10:23)
Using only a subset of the data (a mini-batch) gives a noisy likelihood evaluation. (713:15:58)
If we use Bayesian optimisation on these noisy evaluations, we can perform stochastic learning. (743:21:05)
Lower-variance evaluations (on larger subsets) are higher cost: let's also Bayesian-optimise over the fidelity of our evaluations! (756:21:00)
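A toy illustration of the trade-off (the rescaling and cost model here are my assumptions, not necessarily the talk's formulation): the mini-batch size controls both the variance and the cost of a log-likelihood estimate, so it can be treated as a fidelity variable to optimise over alongside x.

import numpy as np

def minibatch_log_likelihood(log_lik_fn, data, batch_size, rng):
    """Unbiased but noisy estimate of the full-data log-likelihood:
    a mini-batch sum rescaled by n / batch_size. log_lik_fn maps an
    array of data points to per-point log-likelihoods. Larger batches
    give lower variance at proportionally higher cost."""
    idx = rng.choice(len(data), size=batch_size, replace=False)
    return len(data) / batch_size * np.sum(log_lik_fn(data[idx]))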
Quiz: which of these sequences is random? - 1 (817:08:58)
Quiz: which of these sequences is random? - 2 (838:47:28)
A random number (871:59:30)
Integration beats optimisation (938:35:49)
The naïve fitting of models to data performed by optimisation can lead to overfitting. (944:56:36)
Bayesian averaging over ensembles of models reduces overfitting, and provides more honest estimates of uncertainty. (952:13:36)
Our model (962:12:20)
Averaging requires integrating over the many possible states of the world consistent with data: this is often non-analytic. (995:25:47)
Numerical integration (quadrature) is ubiquitous. (1003:25:41)
Optimisation is an unreasonable way of estimating a multi-modal or broad likelihood integrand. (1031:20:12)
If optimising, flat optima are often a better representation of the integral than narrow optima. (1057:26:13)
Bayesian quadrature makes use of a Gaussian process surrogate for the integrand (the same as you might use for Bayesian optimisation). (1074:49:51)
Gaussian-distributed variables are jointly Gaussian with any affine transform of them. (1092:05:41)
A function over which we have a Gaussian process is jointly Gaussian with any integral or derivative of it, as integration and differentiation are linear. (1116:24:39)
We can use observations of an integrand ℓ in order to perform inference for its integral, Z: this is known as Bayesian quadrature. (1130:17:22)
Bayesian quadrature generalises and improves upon traditional quadrature. (1156:00:24)
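A minimal 1-D Bayesian quadrature sketch under illustrative assumptions (zero-mean GP with RBF kernel on the integrand and a standard-normal input density, for which the kernel mean is analytic; this is plain Bayesian quadrature, not the warped WSABI variant mentioned below):

import numpy as np

def bq_estimate(X, ell, lengthscale=1.0, jitter=1e-8):
    """Posterior mean of Z = integral of ell(x) N(x; 0, 1) dx under a
    zero-mean GP prior on ell with k(x, x') = exp(-(x - x')**2 / (2 l**2))."""
    l2 = lengthscale ** 2
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / l2)
    K += jitter * np.eye(len(X))
    # Kernel mean z(x) = integral of k(x, x') N(x'; 0, 1) dx',
    # analytic for this kernel/density pair.
    z = np.sqrt(l2 / (l2 + 1.0)) * np.exp(-0.5 * X ** 2 / (l2 + 1.0))
    weights = np.linalg.solve(K, z)  # quadrature weights K^{-1} z
    return weights @ ell(X)

# Toy usage: the integral of x^2 against N(0, 1) is exactly 1.
X = np.linspace(-3.0, 3.0, 15)
print(bq_estimate(X, lambda x: x ** 2))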
Quiz: what is the convergence rate of Monte Carlo? - 1 (1312:55:04)
Quiz: what is the convergence rate of Monte Carlo? - 2 (1326:35:03)
Monte Carlo (1343:20:34)
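For reference, the quiz's answer: the simple Monte Carlo estimator converges at rate O(N^{-1/2}) in root-mean-square error, independent of dimension,

\hat{Z}_N = \frac{1}{N} \sum_{i=1}^{N} \ell(x_i), \quad x_i \sim \pi,
\qquad
\mathbb{E}\!\left[ (\hat{Z}_N - Z)^2 \right]^{1/2} = \frac{\sigma}{\sqrt{N}} = \mathcal{O}(N^{-1/2}),

where σ² is the variance of ℓ(x) under π.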
Probabilistic numerics views the selection of samples as a decision problem. (1367:55:32)
Our method (Warped Sequential Active Bayesian Integration) converges quickly in wall-clock time for a synthetic integrand. (1495:57:55)
WSABI-L converges quickly in integrating out hyperparameters in a Gaussian process classification problem (CiteSeerx data). (1497:50:58)
Probabilistic numerics offers the propagation of uncertainty through numerical pipelines. (1498:03:33)
Probabilistic numerics treats computation as a decision. (1498:28:56)