Chapter 7: Data-driven modelling with neural networks

Throughout this course we have built mathematical models from physical or biological principles: the equation for a pendulum (which follows from Newton’s second law), the logistic model from considering birth-death rates, the Sedov–Taylor blast wave from dimensional analysis. In each case we derived a model and then solved it.

But what if the governing mechanism is too complex, poorly understood, or simply unknown? In many modern applications we are given data (measurements, sensor readings, images, time series) and asked to make predictions without a governing equation. Examples include:

In these settings we need a flexible, general-purpose way to approximate an unknown function from data. This is the role of what we call a neural network.

7.1.0.0.1 A brief history.

The idea of artificial neurons dates back to McCulloch and Pitts (1943), who proposed simple mathematical models of biological neurons. In the late 1950s, Rosenblatt introduced the Perceptron, a single-layer network that could learn to classify patterns from data. Initial excitement was dampened when Minsky and Papert (1969) proved that single-layer networks have severe limitations, most notably, they cannot learn the simple exclusive-or (XOR) function (we will see why shortly). The field was revived in the 1980s when Rumelhart, Hinton and Williams showed how to train multi-layer networks using the backpropagation algorithm, overcoming the limitations Minsky and Papert had identified. Modern AI builds on this foundation, enabled by vastly more data and faster hardware.

7.2 Function approximation: the core idea

The central idea is simple and familiar. Suppose we have \(n\) data points \((x_i, y_i)\) and we want to find a function \(\hat{y}(x)\) that fits the data well.

7.2.0.0.1 Polynomial fitting

One approach is to choose a polynomial \[\hat{y}(x; \boldsymbol{\theta}) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_p x^p,\] where \(\boldsymbol{\theta} = (\theta_0, \theta_1, \ldots, \theta_p)\) are the parameters (or coefficients) to be determined. We choose \(\boldsymbol{\theta}\) to minimise the difference between the data, \(y\) and our polynomial approximation, \(\hat{y}\). A common way to achieve this is to minimise the sum of squared errors: \[L(\boldsymbol{\theta}) = \sum_{i=1}^{n} \bigl(y_i - \hat{y}(x_i; \boldsymbol{\theta})\bigr)^2.\] This is least-squares fitting, which you may have encountered before. Polynomials work well for smooth, low-dimensional data, but they struggle with high-dimensional inputs (for example, images typically have millions of pixels), sharp features, and multivariate relationships. We need a richer class of parametrised functions.

Key Idea

A neural network is simply a more flexible parametrised function \(\hat{y}(\mathbf{x}; \boldsymbol{\theta})\). Training the network means choosing \(\boldsymbol{\theta}\) to minimise a loss function, the same optimisation idea as least-squares fitting, but with a far more expressive function.

7.3 The single neuron

The basic building block of a neural network is a neuron. A single neuron with \(d\) inputs \(x_1, x_2, \ldots, x_d\) computes:

We call \(\mathbf{w} = (w_1, \ldots, w_d)^\top\) the weights and \(b\) the bias, and \(\sigma\) is a nonlinear function. The output of the neuron is \(a = \sigma(\mathbf{w}^\top \mathbf{x} + b)\). Without the nonlinearity of \(\sigma\) the neuron would just compute a linear function of its inputs, the nonlinearity is what makes neural networks powerful.

7.3.0.0.1 Common activation functions.

Example: A single neuron by hand

Consider a neuron with \(d = 2\) inputs, weights \(w_1 = 0.5\), \(w_2 = -1.0\), bias \(b = 0.3\), and ReLU activation. For input \(\mathbf{x} = (2, 1)^\top\): \[z = 0.5 \times 2 + (-1.0) \times 1 + 0.3 = 0.3, \qquad a = \max(0,\, 0.3) = 0.3.\] For input \(\mathbf{x} = (1, 3)^\top\): \[z = 0.5 \times 1 + (-1.0) \times 3 + 0.3 = -2.2, \qquad a = \max(0,\, -2.2) = 0.\] The ReLU “switches off” the neuron when \(z < 0\).

7.3.0.0.2 Geometric interpretation.

A single neuron divides its input space into two regions separated by a decision boundary, the set of points where \(z = \mathbf{w}^\top\mathbf{x} + b = 0\). For two inputs, this boundary is a straight line; for three inputs, a plane; in general, a hyperplane. The weight vector \(\mathbf{w}\) is perpendicular to this boundary, and the bias \(b\) controls how far the boundary is from the origin. On one side of the boundary the neuron is “on” (or has high output); on the other side it is “off” (or has low output). This means a single neuron can only separate inputs that are divisible by a straight cut through the space, a significant limitation.

7.3.0.0.3 The XOR problem: why one neuron is not enough.

\(x_1\)	\(x_2\)	XOR
\(0\)	\(0\)	\(0\)
\(0\)	\(1\)	\(1\)
\(1\)	\(0\)	\(1\)
\(1\)	\(1\)	\(0\)

If you plot these four points in the \((x_1, x_2)\) plane, the two “1” outputs (at \((0,1)\) and \((1,0)\)) cannot be separated from the two “0” outputs (at \((0,0)\) and \((1,1)\)) by any single straight line. Since a single neuron implements exactly one linear decision boundary, it cannot represent XOR.

The solution is to add a hidden layer, extra neurons whose outputs are not directly observed but feed into the output neuron. The hidden neurons create multiple decision boundaries, and the output neuron combines them. With even one hidden neuron, the four XOR points can be embedded into a higher-dimensional space where they are linearly separable. This is the key insight behind multi-layer networks.

7.4 Feedforward neural networks

A feedforward neural network (also called a multilayer perceptron, MLP) is built by arranging neurons into layers:

A network with \(L\) hidden layers is often called an “\(L\)-layer network” or a “deep” network (when \(L \ge 2\)). Let superscripts in square brackets denote layer number. For a network with one hidden layer of \(n_1\) neurons:

Composing these operations, the network computes \[\begin{equation} \hat{y}(\mathbf{x};\, \boldsymbol{\theta}) = \mathbf{w}^{[2]\top}\,\sigma\!\bigl(\mathbf{W}^{[1]}\,\mathbf{x} + \mathbf{b}^{[1]}\bigr) + b^{[2]}, \label{eq:one-hidden} \end{equation}\] where \(\boldsymbol{\theta} = \{\mathbf{W}^{[1]}, \mathbf{b}^{[1]}, \mathbf{w}^{[2]}, b^{[2]}\}\) collects all the parameters. Don’t worry if this doesn’t make complete sense yet or the matrix notation is new, this will all be covered in more detail next year. The key idea for now is to understand this is just a parametrised function to be compared with the polynomial \(\hat{y}(x; \boldsymbol{\theta})\) from the previous section.

Key Idea

A key mathematical result (the universal approximation theorem) states that a feedforward network with a single hidden layer containing sufficiently many neurons can approximate any continuous function on a compact set to arbitrary accuracy. In other words, the function class defined by [eq:one-hidden] is, in principle, rich enough to represent any relationship hiding in your data — provided \(n_1\) is large enough and the parameters \(\boldsymbol{\theta}\) are chosen well.

Example: Counting parameters

A network with \(d = 3\) inputs, one hidden layer of \(n_1 = 10\) neurons (ReLU), and a single output has:

\(\mathbf{W}^{[1]}\): \(10 \times 3 = 30\) weights,
\(\mathbf{b}^{[1]}\): \(10\) biases,
\(\mathbf{w}^{[2]}\): \(10\) weights,
\(b^{[2]}\): \(1\) bias.

Total: \(30 + 10 + 10 + 1 = 51\) parameters. A degree-3 polynomial in 3 variables has only 20 coefficients, so even a modest network has substantially more flexibility.

7.5 Deeper networks

Adding more hidden layers creates a deep network. For a network with \(L\) hidden layers, the computation proceeds layer by layer: \[\mathbf{a}^{[l]} = \sigma\!\bigl(\mathbf{W}^{[l]}\,\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\bigr), \qquad l = 1, 2, \ldots, L,\] with \(\mathbf{a}^{[0]} = \mathbf{x}\) (the input) and the output \(\hat{y} = \mathbf{W}^{[L+1]}\,\mathbf{a}^{[L]} + \mathbf{b}^{[L+1]}\).

In practice, deeper networks can learn hierarchical features, simple patterns in early layers (edges, trends) and complex patterns in later layers (shapes, interactions), which is why the term “deep learning” is used.

7.6 Training: optimisation of the parameters

Given training data \(\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}\) and a network \(\hat{y}(\mathbf{x}; \boldsymbol{\theta})\), we choose \(\boldsymbol{\theta}\) to minimise a loss function (think back to the least squared example), which measures how well the predictions match the data.

7.6.0.0.1 Learning paradigms.

The approach we describe here — learning from input–output pairs \((\mathbf{x}_i, y_i)\) — is called supervised learning. The “supervisor” provides the correct answer \(y_i\) for each input, and the network learns to reproduce it. There are other paradigms: unsupervised learning, where the network discovers structure (clusters, patterns) in data without being told the correct answer, and reinforcement learning, where the network learns by trial and error, receiving only a reward or penalty signal. In this course we focus on supervised learning, but it is worth knowing that the other paradigms exist and are widely used. These paradigms will also be the focus of later courses at Lancaster.

7.6.0.0.2 Loss functions.

For regression (predicting a continuous value), the standard choice is the mean squared error (MSE): \[\begin{equation} L(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}(\mathbf{x}_i;\, \boldsymbol{\theta})\bigr)^2. \label{eq:mse} \end{equation}\] This is precisely the least-squares objective, just as in polynomial fitting. Many other loss functions have been proposed and widely used, depending on the particular problem of interest.

7.6.0.0.3 Gradient descent.

For a neural network we typically cannot find the parameter value that minimises the loss function exactly. Instead, we use an iterative method called gradient descent.

To illustrate the idea, suppose the model depends on just a single parameter \(\theta\), and the loss is a function \(L(\theta)\). Starting from an initial guess \(\theta^{(0)}\), we repeatedly update according to \[\begin{equation} \theta^{(k+1)} = \theta^{(k)} - \eta \frac{dL}{d\theta}\bigg|_{\theta=\theta^{(k)}}, \end{equation}\] where \(\eta>0\) is called the learning rate.

Key Idea

The derivative \(\frac{dL}{d\theta}\) tells us how the loss changes as \(\theta\) changes:

if \(\frac{dL}{d\theta}>0\), the loss increases as \(\theta\) increases, so we move \(\theta\) to the left;
if \(\frac{dL}{d\theta}<0\), the loss decreases as \(\theta\) increases, so we move \(\theta\) to the right.

In both cases, subtracting the derivative moves us “downhill” toward smaller values of the loss.

The learning rate \(\eta\) controls the step size: too large and the iteration may overshoot the minimum and diverge; too small and convergence becomes very slow. In fact it is playing the same role as the time step \(\tau\) in Euler’s method, and so gradient descent is Euler’s method applied to a “gradient-flow” equation

Modern neural networks usually contain many parameters rather than just one. In that case, the derivative is replaced by a vector of partial derivatives called the gradient. This extension to multiple parameters will be studied in more detail in a second-year AI module.

7.6.0.0.4 Backpropagation.

Computing the gradient of \(L\) efficiently is done by an algorithm called backpropagation, which is essentially the chain rule applied systematically through the layers of the network. In this module (and in typically in-practice) open-source libraries (PyTorch, TensorFlow) compute these gradients automatically, you will specify the network and the loss, and the library handles the differentiation.

7.6.0.0.5 Interpolation vs extrapolation.

Neural networks are generally much better at interpolation (predicting within the range of the training data) than at extrapolation (predicting outside it). Unlike the physical or biologically motivated models discussed earlier, a neural network has no principled reason to produce sensible predictions in regions of input space it has never seen. This is a fundamental limitation: if you train a network on data from \(t \in [0, 10]\) and then evaluate it at \(t = 15\), the output may be completely unreliable.

If the training loss keeps decreasing but the validation loss starts increasing, the network is overfitting.

Common strategies to combat overfitting include using a smaller network (fewer parameters), stopping training early and adding a penalty term to the loss that discourages large weights (regularisation).

7.7 Worked example: learning the logistic curve from data

Let us apply these ideas to a problem we already understand analytically: the logistic growth model from week 5.

Recall the logistic solution \[N(t) = \frac{MN_0}{(M - N_0)\,e^{-\alpha t} + N_0}.\] We generate synthetic training data by evaluating this at \(n = 50\) random times \(t_i \in [0, 10]\) with parameters \(\alpha = 1\), \(M = 100\), \(N_0 = 5\), and adding a small amount of Gaussian noise: \[y_i = N(t_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2),\] with \(\sigma = 3\). The task: given only the noisy data \(\{(t_i, y_i)\}\), train a neural network to predict \(N(t)\).

7.7.0.0.1 The network.

Total parameters: \((1 \times 20 + 20) + (20 \times 20 + 20) + (20 \times 1 + 1) = 20 + 20 + 400 + 20 + 20 + 1 = 481\).

7.7.0.0.2 Implementation in PyTorch.

After training, the network’s prediction closely follows the exact logistic curve, even though the network was given no knowledge of the logistic equation, it learned the relationship purely from the noisy data.

Key Idea

The neural network recovered the logistic curve without any information about birth rates, carrying capacities, or differential equations. It treated the problem purely as function approximation: given \(t\), predict \(N\). This is the power (and the limitation) of the data-driven approaches discussed in this chapter.

7.8 Mechanistic vs data-driven models

When should you use a neural network, and when should you use an ODE? The answer depends on what you know and what you need.

Mechanistic models (ODEs)	Data-driven models (NNs)
Built from physical/biological principles	Built from data alone
Few parameters with clear meaning (\(\alpha\), \(M\), \(\mu\), …)	Many parameters with no individual meaning
Extrapolate reliably (governed by physics)	Extrapolate poorly (no physics to guide them)
Require understanding of the mechanism	Work when the mechanism is unknown
May be analytically tractable	Always require numerical computation
Struggle with very complex systems	Scale to high-dimensional, complex data

7.9 Summary

Key Idea

The course has followed a single thread: How do we build mathematical models of real-world phenomena?

Chapter 2–3: Dimensional analysis and the \(\Pi\)-theorem, deduce the form of a relationship without solving any equations.
Chapter 4: Analytical solution of ODEs, solve the model exactly when possible.
Chapter 5: Population models, build, solve and analyse mechanistic models (exponential, logistic, harvesting).
Chapter 6: Numerical methods, approximate solutions when exact ones are unavailable.
Chapter 7: Neural networks, learn the model directly from data when the governing equations are unknown.

Each approach has its place, the real skill of mathematical modelling is knowing which tool to reach for and why.

Exercise 7.1

A single neuron has weights \(w_1 = 2\), \(w_2 = -1\), bias \(b = 0.5\), and sigmoid activation \(\sigma(z) = 1/(1 + e^{-z})\).

Compute the output for input \(\mathbf{x} = (1, 0)^\top\).
Compute the output for input \(\mathbf{x} = (0, 3)^\top\).
For what values of \(\mathbf{x}\) is the output approximately \(0.5\)? Interpret this geometrically.

Exercise 7.2

Consider a network with 1 input, one hidden layer of \(n_1\) neurons (ReLU activation), and 1 output (no activation).

How many parameters does the network have (as a function of \(n_1\))?
For \(n_1 = 1\): show that the network can only represent piecewise-linear functions with at most one “kink”. Sketch an example.
For general \(n_1\): how many kinks can the function have? Why does increasing \(n_1\) improve the network’s ability to approximate smooth curves?
How many parameters would a degree-\(p\) polynomial have? For what values of \(n_1\) does the network have more parameters than a degree-10 polynomial?

Exercise 7.3

Adapt the PyTorch code from the worked example to learn the exponential decay \(y(t) = e^{-t}\) on \([0, 5]\) from 30 noisy data points.

Generate training data with noise level \(\sigma = 0.05\) and train the network. Plot the result against the exact solution.
What happens if you evaluate the trained network at \(t = 7\) (outside the training range)? Why?
Now try predicting the solution of \(y' = -y\) using Euler’s method from week 6 on the same interval. Compare the two approaches: which is more accurate? Which generalises better to \(t > 5\)?
Reflect: if you know the ODE, is a neural network a sensible approach? When might it be?

Exercise 7.4

Consider the SIR model from Section 7.2 with \(\beta = 0.3\) and \(\gamma = 0.1\), starting from \(S(0) = 990\), \(I(0) = 10\), \(R(0) = 0\).

Use Heun’s method (from week 6) to generate the solution \(I(t)\) on \([0, 50]\) with \(\tau = 0.1\). This is your “ground truth” data.
Subsample 40 points from the solution, add noise with \(\sigma = 5\), and train a neural network to predict \(I(t)\) from \(t\).
Compare the network’s prediction with the Heun solution. Where does the network perform well? Where does it struggle?
What would happen if you tried to use the trained network to predict \(I(t)\) for a different value of \(\beta\)? Why is this a fundamental limitation of the purely data-driven approach?

Chapter 7: Data-driven modelling with neural networks

7 Data-driven modelling with neural networks

Learning outcomes

7.1 Motivation: when we don’t have an equation

7.1.0.0.1 A brief history.

7.2 Function approximation: the core idea

7.2.0.0.1 Polynomial fitting

7.3 The single neuron

7.3.0.0.1 Common activation functions.

7.3.0.0.2 Geometric interpretation.

7.3.0.0.3 The XOR problem: why one neuron is not enough.

7.4 Feedforward neural networks

7.5 Deeper networks

7.6 Training: optimisation of the parameters

7.6.0.0.1 Learning paradigms.

7.6.0.0.2 Loss functions.

7.6.0.0.3 Gradient descent.

7.6.0.0.4 Backpropagation.

7.6.0.0.5 Interpolation vs extrapolation.

7.7 Worked example: learning the logistic curve from data

7.7.0.0.1 The network.

7.7.0.0.2 Implementation in PyTorch.

7.8 Mechanistic vs data-driven models

7.9 Summary