Key Idea from the paper “Dissecting Neural ODEs”
- Vanilla Neural ODEs cannot be fully considered the deep limit of ResNets.
- The first attempt to pursue the true deep limit of ResNets is the hypernetwork approach of (Zhang et al., 2019b), where another neural network parametrizes the dynamics of the parameters θ(t).
- However, this approach is not backed by any theoretical argument, and it exhibits considerable parameter inefficiency, as it generally scales polynomially in the dimension of θ.
- This paper approaches the problem by uncovering an optimization problem in function space, solved by a direct application of the adjoint sensitivity method in infinite dimensions.
Galerkin Neural ODEs
- Galerkin Neural ODEs are the spectral discretization version. The idea is to expand θ(t) on a complete orthogonal basis of a predetermined subspace and truncate the series to the m-th term: θ(t) = Σ_{j=1}^{m} α_j ψ_j(t).
- Where the ψ_j(t) are the basis functions and the trainable objects are the coefficients α_j.
- This turns an infinite-dimensional optimization over functions into an ordinary finite-dimensional optimization over the coefficient vectors α_j, whose gradient can be computed as follows.
- Corollary 1 (Spectral Gradients). Under the assumptions of Theorem 1 (Infinite-Dimensional Gradients), if θ(t) = Σ_{j=1}^{m} α_j ψ_j(t), then the gradient with respect to each coefficient α_j is obtained by projecting the infinite-dimensional gradient of Theorem 1 onto the corresponding basis function ψ_j.
- At solver time t, evaluate the basis functions ψ_j(t), reconstruct the current parameter set θ(t) = Σ_j α_j ψ_j(t), and then use that parameter set inside the vector field. So the system you solve is still an ODE, but the ODE’s neural-network parameters now evolve with depth according to the learned basis expansion.
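A minimal sketch of this evaluation step, written in Python/NumPy for illustration (the actual implementation targets Lux.jl; the monomial basis and all names here are hypothetical stand-ins):

```python
import numpy as np

def eval_basis(t, m):
    """Hypothetical basis evaluation: constant term followed by monomials t^1..t^(m-1).
    Any complete family (Fourier, Chebyshev, ...) slots in here instead."""
    return np.array([t**j for j in range(m)])

def theta_at(alpha, t):
    """theta(t) = sum_j alpha[..., j] * psi_j(t): contract the trailing basis axis."""
    return alpha @ eval_basis(t, alpha.shape[-1])

# a single 2x3 "weight" whose 4 basis coefficients make it evolve with depth t
alpha = np.zeros((2, 3, 4))
alpha[..., 0] = 1.0          # constant part: the weight at t = 0
alpha[..., 1] = 0.5          # linear-in-t part
W_t = theta_at(alpha, 2.0)   # each entry is 1.0 + 0.5 * 2.0 = 2.0
```

The reconstructed W_t is then used like an ordinary weight inside the vector field; the ODE solver only ever sees the coefficients alpha.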
Implementation Plan
1. Type and constructor.
The current NeuralODE is a thin helper around solve. For the Galerkin Neural ODE I chose to write a custom Lux layer instead of a wrapper layer, because the parameter object is no longer the wrapped model’s ordinary weight tree; it becomes a Galerkin coefficient tree.
2. Shape the coefficient tree like the ordinary Lux parameter tree.
Based on the Galerkin idea of θ(t) = Σ_{j=1}^{m} α_j ψ_j(t):
So the natural Lux-native encoding is: keep the same nested parameter tree, but replace each parameter leaf with its coefficient tensor. Lux recommends implementing initialparameters, initialstates, parameterlength, and statelength for custom layers, and it recommends a NamedTuple-style parameter tree.
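A sketch of this shaping, with nested Python dicts standing in for Lux’s NamedTuple parameter tree (names and initialization scheme are this sketch’s assumptions, not the Lux API):

```python
import numpy as np

def make_coeff_tree(param_tree, m, rng):
    """Mirror the model's parameter tree, but give every leaf a trailing
    basis axis of length m. Initializing only the j = 0 slice from the
    original weights makes the constant basis term reproduce the base model."""
    def expand(leaf):
        coeff = np.zeros(leaf.shape + (m,))
        coeff[..., 0] = leaf
        coeff[..., 1:] = 0.01 * rng.standard_normal(leaf.shape + (m - 1,))
        return coeff
    return {name: expand(leaf) if isinstance(leaf, np.ndarray)
                  else make_coeff_tree(leaf, m, rng)
            for name, leaf in param_tree.items()}

params = {"dense1": {"weight": np.ones((3, 2)), "bias": np.zeros(3)}}
coeffs = make_coeff_tree(params, m=4, rng=np.random.default_rng(0))
```

Keeping the coefficient tree structurally identical to the weight tree (same nesting, one extra axis) is what lets the rest of the layer treat it generically, leaf by leaf.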
3. Reconstruct θ(t) by contracting the basis dimension at each RHS evaluation.
The core operation is: evaluate the basis at solver time t, get psi(t), then contract each coefficient leaf with psi(t) to recover the ordinary parameter tree p_t. Since the paper’s Galerkin layer is exactly a basis expansion of θ(t), this contraction is the central primitive.
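The contraction itself is one matvec per leaf, applied recursively over the tree. A Python sketch (dicts as a stand-in for the Lux parameter tree):

```python
import numpy as np

def contract_tree(coeff_tree, psi):
    """Recover the ordinary parameter tree at time t:
    each leaf becomes p_t = sum_j coeff[..., j] * psi[j]."""
    return {name: (leaf @ psi if isinstance(leaf, np.ndarray)
                   else contract_tree(leaf, psi))
            for name, leaf in coeff_tree.items()}

# two basis functions, psi(t) = [1, t]; leaf shape (2, 2) -> coeff shape (2, 2, 2)
coeffs = {"W": np.stack([np.eye(2), 2 * np.eye(2)], axis=-1)}
p_t = contract_tree(coeffs, np.array([1.0, 0.5]))   # W(t) = I + 0.5 * (2I) = 2I
```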
4. The RHS should reconstruct p(t) inside dudt, and should not copy basic_tgrad
The current DiffEqFlux NeuralODE uses an autonomous RHS and explicitly sets tgrad = basic_tgrad, where basic_tgrad(u, p, t) = zero(u). That is reasonable for the current helper because the source defines the RHS as dudt(u, p, t) = model(u, p). For a Galerkin layer, that line is the one thing that should not be copied: once the parameters are θ(t), the RHS depends on time/depth through the basis evaluation. The safe MVP is to omit an explicit tgrad and let the generic machinery handle it.
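The intended RHS structure can be sketched as a closure that rebuilds p(t) on every call (Python stand-in for the Julia dudt; make_rhs and apply_model are hypothetical names):

```python
import numpy as np

def make_rhs(apply_model, coeff_tree, eval_basis):
    """Non-autonomous RHS: rebuild p(t) from the Galerkin coefficients at
    every evaluation, then call the base model with those parameters.
    Because p depends on t, a zero-tgrad shortcut would be wrong here."""
    def contract(tree, psi):
        return {k: (v @ psi if isinstance(v, np.ndarray) else contract(v, psi))
                for k, v in tree.items()}
    def dudt(u, t):
        p_t = contract(coeff_tree, eval_basis(t))
        return apply_model(u, p_t)
    return dudt

# toy model: du/dt = A(t) u with A(t) = t * I, encoded in basis psi(t) = [1, t]
apply_model = lambda u, p: p["A"] @ u
coeffs = {"A": np.stack([np.zeros((2, 2)), np.eye(2)], axis=-1)}
rhs = make_rhs(apply_model, coeffs, lambda t: np.array([1.0, t]))
```

The toy A(t) = t·I makes the time dependence explicit: the same coefficients yield a different Jacobian at every t, which is exactly why basic_tgrad’s zero(u) no longer applies.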
5. First basis: Fourier.
The paper explicitly mentions the Fourier and Chebyshev families and reports an experiment using a Fourier series with 5 harmonics for Galerkin NODEs.
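A sketch of the Fourier family (the constant-plus-sin/cos-pairs convention and the period handling are this sketch’s assumptions; with 5 harmonics it yields 11 basis functions):

```python
import numpy as np

def fourier_basis(t, harmonics, period=1.0):
    """Fourier family on [0, period]: constant term plus one sin/cos pair
    per harmonic, giving 2 * harmonics + 1 basis functions in total."""
    w = 2.0 * np.pi / period
    terms = [1.0]
    for k in range(1, harmonics + 1):
        terms.append(np.sin(w * k * t))
        terms.append(np.cos(w * k * t))
    return np.array(terms)

psi = fourier_basis(0.25, harmonics=5)
```

Keeping psi_1 = 1 as the leading term is convenient: dropping all higher terms (or zeroing their coefficients) recovers a constant-parameter model, which is the sanity check used in the results below.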
Result
- Testing and graphs were generated using the example code from the doc Neural Ordinary Differential Equations. Across the three figures, the results suggest that the initial Galerkin Neural ODE is implemented correctly and behaves as expected.
- In the untrained-state plots, both the standard NeuralODE and the Galerkin NeuralODE start far from the ground-truth trajectories, confirming that neither model matches the data before optimization.

- During training, however, the loss curves show two important patterns:
- First, the constant-only Galerkin model closely follows the standard NeuralODE, which is a key sanity check because the constant-only Galerkin case should reduce to the vanilla NeuralODE;
- Second, the Galerkin model with a richer basis converges substantially faster and reaches a noticeably lower training loss, indicating that the additional basis modes provide extra expressive power.

- Finally, the trained trajectory plots show that all three models recover the target dynamics well, with the constant-only Galerkin model nearly overlapping the NeuralODE baseline and the richer-basis model achieving the best overall fit.

- Taken together, these results support both the correctness of the Galerkin Neural ODE implementation and the practical benefit of allowing depth-varying parameters through a richer basis expansion, although part of the improvement for the richer-basis model may also come from its larger effective parameter count.