A trained neural network is, exactly, a Hamilton–Jacobi PDE — and training is a search for its initial data. One deformation parameter ε unifies network, tropical algebra, viscous PDE, and convex optimization in a single commutative diagram.
1Center for AI Research PH 2Asian Institute of Management
Training a neural network is identified, exactly, as a search through Hamilton–Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton–Jacobi equation whose Hopf–Cole propagator best fits the observations; at inference, the input is the spatial point at which that solution is evaluated and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton–Jacobi equations, with architecture-dependent Hamiltonian and viscosity. A single deformation parameter ε unifies all four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram closed under Lipschitz conditions. Quantitative consequences include the minimax-optimal generalization rate \(O(n^{-1/(d+2)})\) for fixed t, adversarial robustness controlled by ε, backpropagation as the co-state equation of the Hamiltonian system for residual networks, scaling exponents consistent with data intrinsic dimension via PDE quadrature, and a closed-form O(N) influence function whose entropy landscape undergoes fold bifurcations as ε increases.
Conventionally the question runs one way: given a PDE, design a network to approximate its solution. Here a trained network already is a Hamilton–Jacobi equation, and the question is which one.
The key is a single deformation parameter ε and ultradiscretization — the exact passage ε → 0 between two algebraic worlds. At ε = 0, addition is max and multiplication is + (the tropical semiring); at ε > 0, ordinary arithmetic is restored and the same objects reappear as the solution operator of a viscous PDE. The passage is not an approximation — it is an exact semiring homomorphism, the Maslov dequantization.
The object realizing this is a log-sum-exp layer: \[ f_\varepsilon(x) \;=\; \varepsilon \log \sum_{j=1}^{N} \exp\!\Big( (W_j \cdot x + b_j)/\varepsilon \Big). \] The Hopf–Cole linearization identifies it exactly as the heat-equation propagator of a viscous Hamilton–Jacobi equation: the weights encode the initial data, the architecture encodes the Hamiltonian, and a forward pass evaluates that PDE solution at the query point.
Moving right is ultradiscretization (Maslov dequantization): the smooth ε > 0 object hardens into its tropical / Hopf–Lax limit as ε → 0. Moving down is the exact identification of the network layer with the PDE solution. The square commutes: the two limits (ε → 0 and width N → ∞) can be taken in either order and give the same answer.
| Neural network | Physics |
|---|---|
| LSE layer at ε > 0 | Quantum statistical mechanics (partition function) |
| Tropical limit ε → 0 | Classical mechanics (principle of least action) |
| ε | ℏ (Planck's constant) |
| Softmax weights πj | Boltzmann–Gibbs probabilities |
| Network forward pass | Imaginary-time Schrödinger propagator |
| Neurons j = 1,…,N | Discrete paths in a path integral |
The Feynman–Kac formula is the unifying object: each neuron is a discrete path weighted by a Boltzmann factor, and the forward pass computes the log-partition function of the ensemble.
A trained network is a physics equation in disguise. Normally you pick an equation and build a network to approximate its solution. This work flips the arrow: once a network is trained, it already is the exact solution of a specific equation — a Hamilton–Jacobi equation, the same kind that describes how a wavefront spreads or how a particle finds its least-effort path. Training is the search for which equation (its starting shape) best fits the data; a forward pass just reads off that solution at your input point.
Everything is controlled by one knob, ε. Turn it down and the network becomes sharp and decisive — it picks the single best match. Turn it up and it blends many options smoothly, the way heat diffuses and blurs detail. The surprise is that this one number is simultaneously the softmax “temperature,” the equation's blurriness (viscosity), and how strongly the model is regularized. They are not three analogies — they are the same number wearing three hats.
The same trick that links quantum to classical physics. Sharpening the knob all the way (ε → 0) turns a soft, probabilistic computation into a hard, deterministic one — mathematically the identical move that takes quantum mechanics to classical mechanics as Planck's constant vanishes. So a forward pass is, formally, a sum over “paths” just like Feynman's path integrals, but with positive weights, which is exactly what keeps it computable on an ordinary computer.
The log-sum-exp layer is a smooth deformation of the tropical max. Drag ε to see LSEε(a, 0) relax toward max(a, 0) (left) and its gradient become the Gibbs / softmax weight σ(a/ε) (right). As ε → 0 the smooth layer hardens into the Hopf–Lax lookup; as ε grows it becomes the viscous heat propagator.
The attribution weights \(\pi_j(x;\varepsilon) \propto \exp\!\big((W_j \cdot x + b_j)/\varepsilon\big)\) form a closed-form influence function. These weights are a Gibbs measure — a probability distribution that hands each neuron a share proportional to exp(score / ε), so higher-scoring neurons get exponentially more weight and ε sets how peaked the split is (it is exactly the softmax, and the same rule statistical physics uses to weight states by energy). As ε sweeps, this measure transitions from a concentrated particle regime (Hopf–Lax, ε → 0) to a spread-out wave regime (heat equation, ε → ∞). Drag ε and watch the weights and their entropy H(π) change. The critical scale is \(\varepsilon^* = N^{-1/d}\).
Animations of the LSE network as a Hopf–Cole solution as the viscosity ε sweeps across the critical scale ε* = N−1/d.
Along a 1-D slice through the data, the attribution entropy H(π(x;ε)) has a basin (local minimum) near each support point and a ridge (local maximum) between neighbours. As the viscosity ε increases, neighbouring basins merge: a minimum and an adjacent maximum collide and annihilate in a fold (saddle-node) bifurcation. Drag ε to watch critical points disappear; the right panel traces each critical point's entropy value across ε.
Generalization rate. ℓ∞ error vs. width N for Lipschitz initial data across \(d \in \{1,2,4\}\); dashed \(O(N^{-1/d})\) slopes confirm the minimax rate.
Scaling law. Test RMSE vs. width for Adam-trained LSE networks; the empirical exponent recovers the data intrinsic dimension via PDE quadrature.
Certified robustness (MNIST). Measured spectral norm \(\|\nabla^2_x f\|_2\) vs. the theoretical bound \(\|W\|_{2,\infty}^2/\varepsilon\); the bound is never violated.
Certified robustness (CIFAR-10). Same bound across \(\varepsilon \in \{0.1, 0.3, 1.0, 3.0, 10.0\}\); raising \(\varepsilon\) monotonically reduces input sensitivity.
The layer–PDE identity \(f_\varepsilon + u_\varepsilon = |x|^2/(4t)\) and the transformer attention = Gibbs-measure identity are exact algebraic statements, not approximations: they hold to floating-point roundoff across all ε, d, and random trials tested. What remains genuinely informative is which architectures and activations admit such an exact Hamilton–Jacobi identity, and which do not.
HJ correspondence status for common activations
| Activation | Tropical limit (ε→0) | Finite-ε structure | Exact HJ identity |
|---|---|---|---|
| LSE | \(\max_j(W_j \cdot x + b_j)\) | Hopf–Cole solution | Yes (solution-class) |
| ReLU | \(\max(x,0)\) | 2-neuron LSE special case | Yes (special case) |
| Softplus | \(\max(x,0)\) | \(\log(1+e^x) = \mathrm{LSE}_1(x,0)\) | Yes (N=1 case) |
| Sigmoid | Heaviside | \(\nabla_x \mathrm{LSE}_1(x,0)\) | Yes (\(\nabla\)-class) |
| tanh | \(\mathrm{sign}(x)\) | \(\pi_+ - \pi_-\) (signed Gibbs weight) | Yes (\(\nabla\)-class) |
| SiLU | ReLU | \(x \cdot \nabla_x \mathrm{LSE}_1(x,0)\) | Open (one derivative off) |
| GELU | ReLU | \(x \cdot\) (heat kernel integral) | Open (product form) |
@article{minoza2026hjdl,
title = {The Hamilton--Jacobi Theory of Deep Learning},
author = {Mi{\~n}oza, Jose Marie Antonio and Legara, Erika Fille T. and Monterola, Christopher P.},
journal = {arXiv preprint arXiv:2605.28983},
year = {2026},
eprint = {2605.28983},
archivePrefix = {arXiv},
}