Abstract

Training a neural network is identified, exactly, as a search through Hamilton–Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton–Jacobi equation whose Hopf–Cole propagator best fits the observations; at inference, the input is the spatial point at which that solution is evaluated and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton–Jacobi equations, with architecture-dependent Hamiltonian and viscosity. A single deformation parameter ε unifies all four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram closed under Lipschitz conditions. Quantitative consequences include the minimax-optimal generalization rate \(O(n^{-1/(d+2)})\) for fixed t, adversarial robustness controlled by ε, backpropagation as the co-state equation of the Hamiltonian system for residual networks, scaling exponents consistent with data intrinsic dimension via PDE quadrature, and a closed-form O(N) influence function whose entropy landscape undergoes fold bifurcations as ε increases.

A single ε unifies four perspectives
Neural Network
ε = softmax temperature
Tropical Algebra
ε = Maslov deformation (ε → 0: max-plus)
Viscous PDE
ε = viscosity
Convex Optimization
ε = regularization strength
The Central Idea

Conventionally the question runs one way: given a PDE, design a network to approximate its solution. Here a trained network already is a Hamilton–Jacobi equation, and the question is which one.

The key is a single deformation parameter ε and ultradiscretization — the exact passage ε → 0 between two algebraic worlds. At ε = 0, addition is max and multiplication is + (the tropical semiring); at ε > 0, ordinary arithmetic is restored and the same objects reappear as the solution operator of a viscous PDE. The passage is not an approximation — it is an exact semiring homomorphism, the Maslov dequantization.

The object realizing this is a log-sum-exp layer: \[ f_\varepsilon(x) \;=\; \varepsilon \log \sum_{j=1}^{N} \exp\!\Big( (W_j \cdot x + b_j)/\varepsilon \Big). \] The Hopf–Cole linearization identifies it exactly as the heat-equation propagator of a viscous Hamilton–Jacobi equation: the weights encode the initial data, the architecture encodes the Hamiltonian, and a forward pass evaluates that PDE solution at the query point.

The correspondence assembles into a commutative diagram
Neural Network fNε,  ε > 0  (smooth layer) Tropical Network f0N  (max-plus) Viscous HJ PDE uε  (heat propagator) Inviscid HJ u0  (Hopf–Lax) ε → 0 Maslov ε → 0 Hopf–Lax exact exact

Moving right is ultradiscretization (Maslov dequantization): the smooth ε > 0 object hardens into its tropical / Hopf–Lax limit as ε → 0. Moving down is the exact identification of the network layer with the PDE solution. The square commutes: the two limits (ε → 0 and width N → ∞) can be taken in either order and give the same answer.

One parameter ε, four identities
Neural networkPhysics
LSE layer at ε > 0Quantum statistical mechanics (partition function)
Tropical limit ε → 0Classical mechanics (principle of least action)
ε (Planck's constant)
Softmax weights πjBoltzmann–Gibbs probabilities
Network forward passImaginary-time Schrödinger propagator
Neurons j = 1,…,NDiscrete paths in a path integral

The Feynman–Kac formula is the unifying object: each neuron is a discrete path weighted by a Boltzmann factor, and the forward pass computes the log-partition function of the ensemble.

In Plain Terms

A trained network is a physics equation in disguise. Normally you pick an equation and build a network to approximate its solution. This work flips the arrow: once a network is trained, it already is the exact solution of a specific equation — a Hamilton–Jacobi equation, the same kind that describes how a wavefront spreads or how a particle finds its least-effort path. Training is the search for which equation (its starting shape) best fits the data; a forward pass just reads off that solution at your input point.

Everything is controlled by one knob, ε. Turn it down and the network becomes sharp and decisive — it picks the single best match. Turn it up and it blends many options smoothly, the way heat diffuses and blurs detail. The surprise is that this one number is simultaneously the softmax “temperature,” the equation's blurriness (viscosity), and how strongly the model is regularized. They are not three analogies — they are the same number wearing three hats.

The same trick that links quantum to classical physics. Sharpening the knob all the way (ε → 0) turns a soft, probabilistic computation into a hard, deterministic one — mathematically the identical move that takes quantum mechanics to classical mechanics as Planck's constant vanishes. So a forward pass is, formally, a sum over “paths” just like Feynman's path integrals, but with positive weights, which is exactly what keeps it computable on an ordinary computer.

Interactive: The Deformation Parameter ε

The log-sum-exp layer is a smooth deformation of the tropical max. Drag ε to see LSEε(a, 0) relax toward max(a, 0) (left) and its gradient become the Gibbs / softmax weight σ(a/ε) (right). As ε → 0 the smooth layer hardens into the Hopf–Lax lookup; as ε grows it becomes the viscous heat propagator.

Layer output — \(\mathrm{LSE}_\varepsilon(a,0)\) vs \(\max(a,0)\)
ε = 1.00

LSE (viscous, ε>0)    ⎯⎯ Hopf–Lax max (ε→0)

Gradient — \(\partial_a \mathrm{LSE}_\varepsilon = \sigma(a/\varepsilon)\) (Gibbs weight)
same ε as left

The gradient is exactly the softmax / Boltzmann–Gibbs weight — a genuine probability for all ε > 0.

The same ε is simultaneously the softmax temperature, the PDE viscosity, and the convex-regularization strength. These roles are identified, not merely analogous. In plain terms: small ε = a sharp, confident decision; large ε = a soft, blurred average. The dashed red limit (ε → 0) is the Hopf–Lax rule — just take the single best-scoring option, a plain “pick the max” with no blending. (More generally, Hopf–Lax says the answer at any point is the best trade-off of starting value plus travel cost over all starting points — the cheapest-path principle.) The curve on the right is “how much credit each option gets.”
Interactive: Particle ↔ Wave Phase Transition

The attribution weights \(\pi_j(x;\varepsilon) \propto \exp\!\big((W_j \cdot x + b_j)/\varepsilon\big)\) form a closed-form influence function. These weights are a Gibbs measure — a probability distribution that hands each neuron a share proportional to exp(score / ε), so higher-scoring neurons get exponentially more weight and ε sets how peaked the split is (it is exactly the softmax, and the same rule statistical physics uses to weight states by energy). As ε sweeps, this measure transitions from a concentrated particle regime (Hopf–Lax, ε → 0) to a spread-out wave regime (heat equation, ε → ∞). Drag ε and watch the weights and their entropy H(π) change. The critical scale is \(\varepsilon^* = N^{-1/d}\).

Attribution weights \(\pi_j\) over neurons
ε = 0.50

Low ε → one neuron dominates (particle). High ε → uniform spread (wave).

Attribution entropy \(H(\pi)\) vs \(\varepsilon\)
marker tracks slider

Entropy rises monotonically with ε; fold bifurcations merge attribution basins along the way.

This is the same particle-to-wave transition shown empirically below on synthetic and MNIST data — the entropy landscape H(π) reorganizes through fold bifurcations as the viscosity ε increases. In plain terms: at low ε a prediction is explained by essentially one training example (a “particle”); at high ε it is a smeared average over many (a “wave”). The switch happens at a predictable scale.
Dynamic Results: The Phase Transition in Action

Animations of the LSE network as a Hopf–Cole solution as the viscosity ε sweeps across the critical scale ε* = N−1/d.

Particle-to-wave phase transition on synthetic data
Synthetic (two-cluster). Attribution basins sharpen in the particle regime and flatten to uniform entropy in the wave regime.
Phase transition on MNIST 3 vs 7
MNIST (3 vs. 7). The decision boundary follows inter-cluster geometry in PCA space; ε* ≈ 0.091.
Fold bifurcations of the attribution-entropy landscape
Bifurcation cascade. Critical points of the entropy landscape H(π) annihilate in fold bifurcations as ε increases — explore this live in the interactive below.
Interactive: Fold Bifurcations of the Attribution Landscape

Along a 1-D slice through the data, the attribution entropy H(π(x;ε)) has a basin (local minimum) near each support point and a ridge (local maximum) between neighbours. As the viscosity ε increases, neighbouring basins merge: a minimum and an adjacent maximum collide and annihilate in a fold (saddle-node) bifurcation. Drag ε to watch critical points disappear; the right panel traces each critical point's entropy value across ε.

Entropy landscape \(H(x;\varepsilon)\) — 11 critical points
ε = 0.50

minima (attribution basins)    maxima (attribution ridges)

Bifurcation diagram — critical \(H\)-value vs \(\varepsilon\)
vertical line tracks ε

Each branch annihilation is a fold bifurcation, merging one attribution basin.

This is the interactive counterpart of the bifurcation cascade animated above: as ε grows the entropy landscape simplifies, each fold event collapsing a minimum–maximum pair into a single smoother region. In plain terms: picture a landscape of valleys and ridges. As you blur it, neighbouring valleys suddenly merge into one — not gradually, but in a sharp “pop.” Each pop is the model deciding two groups of examples now tell the same story.
Quantitative Consequences
Quadrature convergence rate

Generalization rate. error vs. width N for Lipschitz initial data across \(d \in \{1,2,4\}\); dashed \(O(N^{-1/d})\) slopes confirm the minimax rate.

Scaling law

Scaling law. Test RMSE vs. width for Adam-trained LSE networks; the empirical exponent recovers the data intrinsic dimension via PDE quadrature.

Hessian bound on MNIST

Certified robustness (MNIST). Measured spectral norm \(\|\nabla^2_x f\|_2\) vs. the theoretical bound \(\|W\|_{2,\infty}^2/\varepsilon\); the bound is never violated.

Hessian bound on CIFAR-10

Certified robustness (CIFAR-10). Same bound across \(\varepsilon \in \{0.1, 0.3, 1.0, 3.0, 10.0\}\); raising \(\varepsilon\) monotonically reduces input sensitivity.

The Correspondence Is Exact

The layer–PDE identity \(f_\varepsilon + u_\varepsilon = |x|^2/(4t)\) and the transformer attention = Gibbs-measure identity are exact algebraic statements, not approximations: they hold to floating-point roundoff across all ε, d, and random trials tested. What remains genuinely informative is which architectures and activations admit such an exact Hamilton–Jacobi identity, and which do not.

HJ correspondence status for common activations

ActivationTropical limit (ε→0)Finite-ε structureExact HJ identity
LSE\(\max_j(W_j \cdot x + b_j)\)Hopf–Cole solutionYes (solution-class)
ReLU\(\max(x,0)\)2-neuron LSE special caseYes (special case)
Softplus\(\max(x,0)\)\(\log(1+e^x) = \mathrm{LSE}_1(x,0)\)Yes (N=1 case)
SigmoidHeaviside\(\nabla_x \mathrm{LSE}_1(x,0)\)Yes (\(\nabla\)-class)
tanh\(\mathrm{sign}(x)\)\(\pi_+ - \pi_-\) (signed Gibbs weight)Yes (\(\nabla\)-class)
SiLUReLU\(x \cdot \nabla_x \mathrm{LSE}_1(x,0)\)Open (one derivative off)
GELUReLU\(x \cdot\) (heat kernel integral)Open (product form)
Citation
@article{minoza2026hjdl,
  title         = {The Hamilton--Jacobi Theory of Deep Learning},
  author        = {Mi{\~n}oza, Jose Marie Antonio and Legara, Erika Fille T. and Monterola, Christopher P.},
  journal       = {arXiv preprint arXiv:2605.28983},
  year          = {2026},
  eprint        = {2605.28983},
  archivePrefix = {arXiv},
}