By the end of this chapter you should be able to:
explain when and why a data-driven model might be preferred over a mechanistic (ODE-based) model;
define the building blocks of a feedforward neural network including weights, biases and activation functions;
explain geometrically what a single neuron computes and why hidden layers are needed;
implement a simple neural network in Python and train it on data from a known model;
critically compare data-driven and mechanistic approaches, and describe how they can be combined.
Throughout this course we have built mathematical models from physical or biological principles: the equation for a pendulum (which follows from Newton’s second law), the logistic model from considering birth-death rates, the Sedov–Taylor blast wave from dimensional analysis. In each case we derived a model and then solved it.
But what if the governing mechanism is too complex, poorly understood, or simply unknown? In many modern applications we are given data (measurements, sensor readings, images, time series) and asked to make predictions without a governing equation. Examples include:
predicting the shape of a protein from its amino-acid sequence,
forecasting energy demand from historical consumption data,
classifying medical images as healthy or diseased.
In these settings we need a flexible, general-purpose way to approximate an unknown function from data. This is the role of what we call a neural network.
The idea of artificial neurons dates back to McCulloch and Pitts (1943), who proposed simple mathematical models of biological neurons. In the late 1950s, Rosenblatt introduced the Perceptron, a single-layer network that could learn to classify patterns from data. Initial excitement was dampened when Minsky and Papert (1969) proved that single-layer networks have severe limitations, most notably, they cannot learn the simple exclusive-or (XOR) function (we will see why shortly). The field was revived in the 1980s when Rumelhart, Hinton and Williams showed how to train multi-layer networks using the backpropagation algorithm, overcoming the limitations Minsky and Papert had identified. Modern AI builds on this foundation, enabled by vastly more data and faster hardware.
The central idea is simple and familiar. Suppose we have \(n\) data points \((x_i, y_i)\) and we want to find a function \(\hat{y}(x)\) that fits the data well.
One approach is to choose a polynomial \[\hat{y}(x; \boldsymbol{\theta}) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_p x^p,\] where \(\boldsymbol{\theta} = (\theta_0, \theta_1, \ldots, \theta_p)\) are the parameters (or coefficients) to be determined. We choose \(\boldsymbol{\theta}\) to minimise the difference between the data, \(y\) and our polynomial approximation, \(\hat{y}\). A common way to achieve this is to minimise the sum of squared errors: \[L(\boldsymbol{\theta}) = \sum_{i=1}^{n} \bigl(y_i - \hat{y}(x_i; \boldsymbol{\theta})\bigr)^2.\] This is least-squares fitting, which you may have encountered before. Polynomials work well for smooth, low-dimensional data, but they struggle with high-dimensional inputs (for example, images typically have millions of pixels), sharp features, and multivariate relationships. We need a richer class of parametrised functions.
Key Idea
A neural network is simply a more flexible parametrised function \(\hat{y}(\mathbf{x}; \boldsymbol{\theta})\). Training the network means choosing \(\boldsymbol{\theta}\) to minimise a loss function, the same optimisation idea as least-squares fitting, but with a far more expressive function.
The basic building block of a neural network is a neuron. A single neuron with \(d\) inputs \(x_1, x_2, \ldots, x_d\) computes:
A weighted sum: \(z = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b = \mathbf{w}^\top \mathbf{x} + b\).
An activation function: \(a = \sigma(z)\).
We call \(\mathbf{w} = (w_1, \ldots, w_d)^\top\) the weights and \(b\) the bias, and \(\sigma\) is a nonlinear function. The output of the neuron is \(a = \sigma(\mathbf{w}^\top \mathbf{x} + b)\). Without the nonlinearity of \(\sigma\) the neuron would just compute a linear function of its inputs, the nonlinearity is what makes neural networks powerful.
Several choices are widely used:
ReLU (Rectified Linear Unit): \(\sigma(z) = \max(0, z)\). The most common choice in modern networks, it is simple and computationally efficient to compute.
Sigmoid: \(\sigma(z) = 1/(1 + e^{-z})\), range \((0, 1)\). You have already seen this shape, it is the logistic function from chapter 5!
Tanh: \(\sigma(z) = \tanh(z) = (e^z - e^{-z})/(e^z + e^{-z})\).
Example: A single neuron by hand
Consider a neuron with \(d = 2\) inputs, weights \(w_1 = 0.5\), \(w_2 = -1.0\), bias \(b = 0.3\), and ReLU activation. For input \(\mathbf{x} = (2, 1)^\top\): \[z = 0.5 \times 2 + (-1.0) \times 1 + 0.3 = 0.3, \qquad a = \max(0,\, 0.3) = 0.3.\] For input \(\mathbf{x} = (1, 3)^\top\): \[z = 0.5 \times 1 + (-1.0) \times 3 + 0.3 = -2.2, \qquad a = \max(0,\, -2.2) = 0.\] The ReLU “switches off” the neuron when \(z < 0\).
A single neuron divides its input space into two regions separated by a decision boundary, the set of points where \(z = \mathbf{w}^\top\mathbf{x} + b = 0\). For two inputs, this boundary is a straight line; for three inputs, a plane; in general, a hyperplane. The weight vector \(\mathbf{w}\) is perpendicular to this boundary, and the bias \(b\) controls how far the boundary is from the origin. On one side of the boundary the neuron is “on” (or has high output); on the other side it is “off” (or has low output). This means a single neuron can only separate inputs that are divisible by a straight cut through the space, a significant limitation.
Consider the exclusive-or (XOR) function on two binary inputs:
| \(x_1\) | \(x_2\) | XOR |
|---|---|---|
| \(0\) | \(0\) | \(0\) |
| \(0\) | \(1\) | \(1\) |
| \(1\) | \(0\) | \(1\) |
| \(1\) | \(1\) | \(0\) |
If you plot these four points in the \((x_1, x_2)\) plane, the two “1” outputs (at \((0,1)\) and \((1,0)\)) cannot be separated from the two “0” outputs (at \((0,0)\) and \((1,1)\)) by any single straight line. Since a single neuron implements exactly one linear decision boundary, it cannot represent XOR.
The solution is to add a hidden layer, extra neurons whose outputs are not directly observed but feed into the output neuron. The hidden neurons create multiple decision boundaries, and the output neuron combines them. With even one hidden neuron, the four XOR points can be embedded into a higher-dimensional space where they are linearly separable. This is the key insight behind multi-layer networks.
A feedforward neural network (also called a multilayer perceptron, MLP) is built by arranging neurons into layers:
Input layer: the raw inputs \(\mathbf{x} \in \R^d\) (no computation, just the data).
Hidden layers: one or more layers of neurons. Each neuron takes inputs from the previous layer, computes a weighted sum plus bias, and applies an activation function.
Output layer: produces the final prediction \(\hat{y}\).
A network with \(L\) hidden layers is often called an “\(L\)-layer network” or a “deep” network (when \(L \ge 2\)). Let superscripts in square brackets denote layer number. For a network with one hidden layer of \(n_1\) neurons:
Hidden layer (\(l = 1\)): each of the \(n_1\) neurons computes \[a_j^{[1]} = \sigma\!\left(\sum_{i=1}^{d} w_{ji}^{[1]}\,x_i + b_j^{[1]}\right), \qquad j = 1, \ldots, n_1.\] In matrix form: \(\mathbf{a}^{[1]} = \sigma\!\bigl(\mathbf{W}^{[1]}\,\mathbf{x} + \mathbf{b}^{[1]}\bigr)\), where \(\mathbf{W}^{[1]}\) is an \(n_1 \times d\) weight matrix, \(\mathbf{b}^{[1]} \in \R^{n_1}\) is a bias vector, and \(\sigma\) is applied element-wise.
Output layer (\(l = 2\)): for a single output (regression), \[\hat{y} = \mathbf{w}^{[2]\top}\,\mathbf{a}^{[1]} + b^{[2]},\] where \(\mathbf{w}^{[2]} \in \R^{n_1}\) and \(b^{[2]} \in \R\). (Often no activation is applied at the output for regression problems.)
Composing these operations, the network computes \[\begin{equation} \hat{y}(\mathbf{x};\, \boldsymbol{\theta}) = \mathbf{w}^{[2]\top}\,\sigma\!\bigl(\mathbf{W}^{[1]}\,\mathbf{x} + \mathbf{b}^{[1]}\bigr) + b^{[2]}, \label{eq:one-hidden} \end{equation}\] where \(\boldsymbol{\theta} = \{\mathbf{W}^{[1]}, \mathbf{b}^{[1]}, \mathbf{w}^{[2]}, b^{[2]}\}\) collects all the parameters. Don’t worry if this doesn’t make complete sense yet or the matrix notation is new, this will all be covered in more detail next year. The key idea for now is to understand this is just a parametrised function to be compared with the polynomial \(\hat{y}(x; \boldsymbol{\theta})\) from the previous section.
Key Idea
A key mathematical result (the universal approximation theorem) states that a feedforward network with a single hidden layer containing sufficiently many neurons can approximate any continuous function on a compact set to arbitrary accuracy. In other words, the function class defined by [eq:one-hidden] is, in principle, rich enough to represent any relationship hiding in your data — provided \(n_1\) is large enough and the parameters \(\boldsymbol{\theta}\) are chosen well.
Example: Counting parameters
A network with \(d = 3\) inputs, one hidden layer of \(n_1 = 10\) neurons (ReLU), and a single output has:
\(\mathbf{W}^{[1]}\): \(10 \times 3 = 30\) weights,
\(\mathbf{b}^{[1]}\): \(10\) biases,
\(\mathbf{w}^{[2]}\): \(10\) weights,
\(b^{[2]}\): \(1\) bias.
Total: \(30 + 10 + 10 + 1 = 51\) parameters. A degree-3 polynomial in 3 variables has only 20 coefficients, so even a modest network has substantially more flexibility.
Adding more hidden layers creates a deep network. For a network with \(L\) hidden layers, the computation proceeds layer by layer: \[\mathbf{a}^{[l]} = \sigma\!\bigl(\mathbf{W}^{[l]}\,\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\bigr), \qquad l = 1, 2, \ldots, L,\] with \(\mathbf{a}^{[0]} = \mathbf{x}\) (the input) and the output \(\hat{y} = \mathbf{W}^{[L+1]}\,\mathbf{a}^{[L]} + \mathbf{b}^{[L+1]}\).
In practice, deeper networks can learn hierarchical features, simple patterns in early layers (edges, trends) and complex patterns in later layers (shapes, interactions), which is why the term “deep learning” is used.
Given training data \(\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}\) and a network \(\hat{y}(\mathbf{x}; \boldsymbol{\theta})\), we choose \(\boldsymbol{\theta}\) to minimise a loss function (think back to the least squared example), which measures how well the predictions match the data.
The approach we describe here — learning from input–output pairs \((\mathbf{x}_i, y_i)\) — is called supervised learning. The “supervisor” provides the correct answer \(y_i\) for each input, and the network learns to reproduce it. There are other paradigms: unsupervised learning, where the network discovers structure (clusters, patterns) in data without being told the correct answer, and reinforcement learning, where the network learns by trial and error, receiving only a reward or penalty signal. In this course we focus on supervised learning, but it is worth knowing that the other paradigms exist and are widely used. These paradigms will also be the focus of later courses at Lancaster.
For regression (predicting a continuous value), the standard choice is the mean squared error (MSE): \[\begin{equation} L(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}(\mathbf{x}_i;\, \boldsymbol{\theta})\bigr)^2. \label{eq:mse} \end{equation}\] This is precisely the least-squares objective, just as in polynomial fitting. Many other loss functions have been proposed and widely used, depending on the particular problem of interest.
For a neural network we typically cannot find the parameter value that minimises the loss function exactly. Instead, we use an iterative method called gradient descent.
To illustrate the idea, suppose the model depends on just a single parameter \(\theta\), and the loss is a function \(L(\theta)\). Starting from an initial guess \(\theta^{(0)}\), we repeatedly update according to \[\begin{equation} \theta^{(k+1)} = \theta^{(k)} - \eta \frac{dL}{d\theta}\bigg|_{\theta=\theta^{(k)}}, \end{equation}\] where \(\eta>0\) is called the learning rate.
Key Idea
The derivative \(\frac{dL}{d\theta}\) tells us how the loss changes as \(\theta\) changes:
if \(\frac{dL}{d\theta}>0\), the loss increases as \(\theta\) increases, so we move \(\theta\) to the left;
if \(\frac{dL}{d\theta}<0\), the loss decreases as \(\theta\) increases, so we move \(\theta\) to the right.
In both cases, subtracting the derivative moves us “downhill” toward smaller values of the loss.
The learning rate \(\eta\) controls the step size: too large and the iteration may overshoot the minimum and diverge; too small and convergence becomes very slow. In fact it is playing the same role as the time step \(\tau\) in Euler’s method, and so gradient descent is Euler’s method applied to a “gradient-flow” equation
Modern neural networks usually contain many parameters rather than just one. In that case, the derivative is replaced by a vector of partial derivatives called the gradient. This extension to multiple parameters will be studied in more detail in a second-year AI module.
Computing the gradient of \(L\) efficiently is done by an algorithm called backpropagation, which is essentially the chain rule applied systematically through the layers of the network. In this module (and in typically in-practice) open-source libraries (PyTorch, TensorFlow) compute these gradients automatically, you will specify the network and the loss, and the library handles the differentiation.
Neural networks are generally much better at interpolation (predicting within the range of the training data) than at extrapolation (predicting outside it). Unlike the physical or biologically motivated models discussed earlier, a neural network has no principled reason to produce sensible predictions in regions of input space it has never seen. This is a fundamental limitation: if you train a network on data from \(t \in [0, 10]\) and then evaluate it at \(t = 15\), the output may be completely unreliable.
To detect overfitting, we split the data into:
Training set: used to optimise \(\boldsymbol{\theta}\).
Validation set (or test set): used to monitor performance during training on data the network has not seen.
If the training loss keeps decreasing but the validation loss starts increasing, the network is overfitting.
Common strategies to combat overfitting include using a smaller network (fewer parameters), stopping training early and adding a penalty term to the loss that discourages large weights (regularisation).
Let us apply these ideas to a problem we already understand analytically: the logistic growth model from week 5.
Recall the logistic solution \[N(t) = \frac{MN_0}{(M - N_0)\,e^{-\alpha t} + N_0}.\] We generate synthetic training data by evaluating this at \(n = 50\) random times \(t_i \in [0, 10]\) with parameters \(\alpha = 1\), \(M = 100\), \(N_0 = 5\), and adding a small amount of Gaussian noise: \[y_i = N(t_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2),\] with \(\sigma = 3\). The task: given only the noisy data \(\{(t_i, y_i)\}\), train a neural network to predict \(N(t)\).
We use a simple feedforward network with:
1 input (\(t\)),
2 hidden layers of 20 neurons each (ReLU activation),
1 output (\(\hat{N}\), no activation).
Total parameters: \((1 \times 20 + 20) + (20 \times 20 + 20) + (20 \times 1 + 1) = 20 + 20 + 400 + 20 + 20 + 1 = 481\).
import torch
import torch.nn as nn
import numpy as np
# --- Generate training data ---
alpha, M, N0, sigma = 1.0, 100.0, 5.0, 3.0
t_data = np.sort(np.random.uniform(0, 10, 50))
N_exact = M * N0 / ((M - N0) * np.exp(-alpha * t_data) + N0)
y_data = N_exact + sigma * np.random.randn(50)
t_train = torch.tensor(t_data, dtype=torch.float32).unsqueeze(1)
y_train = torch.tensor(y_data, dtype=torch.float32).unsqueeze(1)
# --- Define the network ---
model = nn.Sequential(
nn.Linear(1, 20),
nn.ReLU(),
nn.Linear(20, 20),
nn.ReLU(),
nn.Linear(20, 1)
)
# --- Training loop ---
optimiser = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
for epoch in range(2000):
prediction = model(t_train)
loss = loss_fn(prediction, y_train)
optimiser.zero_grad()
loss.backward() # backpropagation
optimiser.step() # gradient descent step
if (epoch + 1) % 500 == 0:
print(f"Epoch {epoch+1}, Loss: {loss.item():.2f}")
# --- Plot results ---
import matplotlib.pyplot as plt
t_plot = torch.linspace(0, 10, 200).unsqueeze(1)
with torch.no_grad():
y_plot = model(t_plot).numpy()
plt.scatter(t_data, y_data, s=15, label="Noisy data")
plt.plot(t_plot.numpy(), y_plot, 'r-', label="Neural network")
t_fine = np.linspace(0, 10, 200)
N_fine = M * N0 / ((M - N0) * np.exp(-alpha * t_fine) + N0)
plt.plot(t_fine, N_fine, 'k--', label="Exact logistic")
plt.xlabel('$t$'); plt.ylabel('$N$')
plt.legend(); plt.show()
After training, the network’s prediction closely follows the exact logistic curve, even though the network was given no knowledge of the logistic equation, it learned the relationship purely from the noisy data.
Key Idea
The neural network recovered the logistic curve without any information about birth rates, carrying capacities, or differential equations. It treated the problem purely as function approximation: given \(t\), predict \(N\). This is the power (and the limitation) of the data-driven approaches discussed in this chapter.
When should you use a neural network, and when should you use an ODE? The answer depends on what you know and what you need.
| Mechanistic models (ODEs) | Data-driven models (NNs) |
|---|---|
| Built from physical/biological principles | Built from data alone |
| Few parameters with clear meaning (\(\alpha\), \(M\), \(\mu\), …) | Many parameters with no individual meaning |
| Extrapolate reliably (governed by physics) | Extrapolate poorly (no physics to guide them) |
| Require understanding of the mechanism | Work when the mechanism is unknown |
| May be analytically tractable | Always require numerical computation |
| Struggle with very complex systems | Scale to high-dimensional, complex data |
Key Idea
The course has followed a single thread: How do we build mathematical models of real-world phenomena?
Chapter 2–3: Dimensional analysis and the \(\Pi\)-theorem, deduce the form of a relationship without solving any equations.
Chapter 4: Analytical solution of ODEs, solve the model exactly when possible.
Chapter 5: Population models, build, solve and analyse mechanistic models (exponential, logistic, harvesting).
Chapter 6: Numerical methods, approximate solutions when exact ones are unavailable.
Chapter 7: Neural networks, learn the model directly from data when the governing equations are unknown.
Each approach has its place, the real skill of mathematical modelling is knowing which tool to reach for and why.
Exercise 7.1
A single neuron has weights \(w_1 = 2\), \(w_2 = -1\), bias \(b = 0.5\), and sigmoid activation \(\sigma(z) = 1/(1 + e^{-z})\).
Compute the output for input \(\mathbf{x} = (1, 0)^\top\).
Compute the output for input \(\mathbf{x} = (0, 3)^\top\).
For what values of \(\mathbf{x}\) is the output approximately \(0.5\)? Interpret this geometrically.
Exercise 7.2
Consider a network with 1 input, one hidden layer of \(n_1\) neurons (ReLU activation), and 1 output (no activation).
How many parameters does the network have (as a function of \(n_1\))?
For \(n_1 = 1\): show that the network can only represent piecewise-linear functions with at most one “kink”. Sketch an example.
For general \(n_1\): how many kinks can the function have? Why does increasing \(n_1\) improve the network’s ability to approximate smooth curves?
How many parameters would a degree-\(p\) polynomial have? For what values of \(n_1\) does the network have more parameters than a degree-10 polynomial?
Exercise 7.3
Adapt the PyTorch code from the worked example to learn the exponential decay \(y(t) = e^{-t}\) on \([0, 5]\) from 30 noisy data points.
Generate training data with noise level \(\sigma = 0.05\) and train the network. Plot the result against the exact solution.
What happens if you evaluate the trained network at \(t = 7\) (outside the training range)? Why?
Now try predicting the solution of \(y' = -y\) using Euler’s method from week 6 on the same interval. Compare the two approaches: which is more accurate? Which generalises better to \(t > 5\)?
Reflect: if you know the ODE, is a neural network a sensible approach? When might it be?
Exercise 7.4
Consider the SIR model from Section 7.2 with \(\beta = 0.3\) and \(\gamma = 0.1\), starting from \(S(0) = 990\), \(I(0) = 10\), \(R(0) = 0\).
Use Heun’s method (from week 6) to generate the solution \(I(t)\) on \([0, 50]\) with \(\tau = 0.1\). This is your “ground truth” data.
Subsample 40 points from the solution, add noise with \(\sigma = 5\), and train a neural network to predict \(I(t)\) from \(t\).
Compare the network’s prediction with the Heun solution. Where does the network perform well? Where does it struggle?
What would happen if you tried to use the trained network to predict \(I(t)\) for a different value of \(\beta\)? Why is this a fundamental limitation of the purely data-driven approach?