Compress-Add-Smooth

Compress-Add-Smooth (CAS) is a continual-learning framework for resource-constrained agents in which a fixed-capacity memory is updated through a "day–night" recursion: each new experience distribution is appended to a compressed past, and the result is rebinned onto a uniform schedule Chertkov (2026). Unlike parameter-based continual learning Kirkpatrick et al. (2017), CAS stores knowledge exclusively in a temporal stack of probability densities and samples from the stack through a bridge diffusion; no gradient step, no neural network, and no backpropagation are required. The recursion admits an analytically tractable forgetting curve: the half-life \(a_{1/2}\) of a past memory scales linearly with the segment budget \(L\), with a capacity constant \(c = a_{1/2}/L\) that lies in the range \(c \in [2.0, 3.6]\) and is reported near \(c \approx 2.4\) for the default geometry Chertkov (2026).

The continual-learning problem

Standard supervised learners trained sequentially on non-stationary data suffer from catastrophic interference: acquisition of a new task rapidly overwrites representations acquired for previous ones McCloskey & Cohen (1989). In connectionist architectures this was identified as a direct consequence of distributed representations being updated by gradient descent on the new task alone, without constraints anchoring them to prior tasks.

Parameter-space mitigations such as elastic weight consolidation (EWC) attach a quadratic penalty to the loss, centred at the parameter values obtained after each task and weighted by a diagonal approximation of the Fisher information; EWC was shown to enable sequential training on the Atari suite with limited forgetting Kirkpatrick et al. (2017). Rehearsal-based methods, such as deep generative replay, instead train a generative model alongside each task and replay synthetic samples during subsequent training Shin et al. (2017). Both families remain tied to a learnt parameter vector and to gradient-based optimisation.

Compress-Add-Smooth takes a different path: it replaces the parameter vector by a stack of probability densities indexed by age, and the optimiser by an exactly solvable bridge diffusion that samples from the stack Chertkov (2026).

The CAS recursion

Let \(p_a^{(n)}\) denote the memory density at age \(a\) on day \(n\), supported on the normalised temporal interval \([0, 1]\) divided into \(L\) uniform segments. A new experience density \(q^{(n)}\) is incorporated through a three-step, strictly local update Chertkov (2026):

  1. Compress. The existing stack, supported on \([0, 1]\), is losslessly rescaled onto the interval \([0, L/(L+1)]\) through an affine change of variable in age.
  2. Add. The freed segment \([L/(L+1), 1]\) is populated by the new density \(q^{(n)}\), which becomes the youngest memory of day \(n+1\).
  3. Smooth. The resulting \((L+1)\)-segment stack is rebinned onto a uniform \(L\)-segment grid by averaging adjacent segments; this is the only lossy step of the cycle.

The Compress and Add steps are information-preserving in continuous time; all forgetting is localised in the Smooth step, whose repeated application over many days plays the role of a discrete heat flow in the age variable. Because each \(p_a^{(n)}\) is represented as a Gaussian mixture, the per-cycle cost is \(O(L K d^2)\) in the number of segments \(L\), mixture components \(K\), and ambient dimension \(d\), and requires neither gradient descent nor neural inference Chertkov (2026).
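Under the assumption that each age segment stores a discretised density (a histogram row rather than the Gaussian-mixture parameters used in Chertkov (2026)), one day–night cycle can be sketched as follows; the length-weighted rebinning in the final loop is one natural reading of the "averaging adjacent segments" rule:

```python
import numpy as np

def cas_cycle(stack, q):
    """One Compress-Add-Smooth cycle.

    stack: (L, B) array -- L age segments, each a length-B histogram density.
    q:     (B,) array   -- the new day's experience density.
    Returns the updated (L, B) stack.
    """
    L = stack.shape[0]
    # Compress + Add: the L old segments now occupy [0, L/(L+1)] and q fills
    # the freed segment [L/(L+1), 1], giving an (L+1)-segment stack in which
    # every segment has width 1/(L+1). Both steps are lossless.
    extended = np.vstack([stack, q])
    # Smooth: rebin onto L uniform segments by length-weighted averaging of
    # the overlapping (L+1)-grid segments -- the only lossy step of the cycle.
    fine_edges = np.linspace(0.0, 1.0, L + 2)  # edges of the (L+1)-grid
    new_edges = np.linspace(0.0, 1.0, L + 1)   # edges of the target L-grid
    new_stack = np.zeros_like(stack)
    for j in range(L):
        lo, hi = new_edges[j], new_edges[j + 1]
        for i in range(L + 1):
            # overlap length of fine segment i with target segment j
            w = max(0.0, min(hi, fine_edges[i + 1]) - max(lo, fine_edges[i]))
            new_stack[j] += w * extended[i]
        new_stack[j] /= hi - lo  # normalise by the target segment width
    return new_stack
```

Because the update is a convex combination of normalised densities, each row of the output remains a normalised density, and the youngest segment is dominated by the new experience \(q\) with weight \(L/(L+1)\).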

Sampling from the stack on day \(n\) is performed by a bridge diffusion targeting a convex mixture of the \(L\) stored densities, enabling queries that jointly weight recent and remote memories. The bridge is of the Path Integral Diffusion family described in harmonic-path-integral-diffusion and inherits its closed-form drift.
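The exact PID bridge drift is given in Chertkov (2026) and is not reproduced here. As an illustrative stand-in with the same "closed-form drift, no neural network" character, the sketch below draws from a convex two-component Gaussian mixture (a toy two-memory stack) by a reverse-time diffusion whose drift is the analytically available score of the heat-smoothed mixture; this is a score-based surrogate, not the harmonic PID construction itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: convex mixture of two stored Gaussian memories.
w = np.array([0.5, 0.5])     # mixture weights over stored memories
mu = np.array([-2.0, 2.0])   # component means
s2 = np.array([0.25, 0.25])  # component variances

def score(x, t):
    """Closed-form score of p_t = pi * N(0, t): still a Gaussian mixture."""
    var = s2 + t                                   # smoothed variances, (K,)
    diff = x[:, None] - mu[None, :]                # (N, K)
    comp = w * np.exp(-0.5 * diff**2 / var) / np.sqrt(2 * np.pi * var)
    resp = comp / comp.sum(axis=1, keepdims=True)  # posterior over components
    return -(resp * diff / var).sum(axis=1)        # d/dx log p_t(x)

T, n_steps, n_samples = 9.0, 600, 4000
h = T / n_steps
# Draw exactly from p_T (a Gaussian mixture), then integrate the
# reverse-time SDE with Euler-Maruyama steps and the exact score as drift.
k = rng.choice(len(w), size=n_samples, p=w)
x = mu[k] + np.sqrt(s2[k] + T) * rng.standard_normal(n_samples)
for i in range(n_steps):
    t = T - i * h
    x = x + h * score(x, t) + np.sqrt(h) * rng.standard_normal(n_samples)
# x now approximates samples from the mixture target.
```

Because the smoothed marginal of a Gaussian mixture is again a Gaussian mixture, no learnt score model is needed: the drift is evaluated in closed form at every step, which is the structural property the PID bridge shares.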

Capacity law \(a_{1/2} \approx c \cdot L\)

The signature empirical law of CAS relates the half-life of a memory — the age at which the total-variation distance between the bridge-replay distribution of that memory and its original first reaches half of its maximal value — to the segment budget \(L\) Chertkov (2026):

\[ a_{1/2} \;\approx\; c \cdot L. \]

The constant \(c\) is geometry-dependent and is reported in the range \(c \in [2.0, 3.6]\), with a default-geometry value near \(c \approx 2.4\) Chertkov (2026). The upper end of the range is attained under favourable concentration of the stored densities and the lower end under adversarially spread mixtures; the key empirical finding is the approximate constancy of \(c\) across the ambient dimension \(d\), the number of modes \(K\), and the number of particles \(P\) used to estimate each density.
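The half-life criterion can be made concrete with a numerical total-variation distance. The sketch below, which models covariance drift as pure broadening of a stored Gaussian (a simplification of the mixture case), shows how the criterion "distance reaches one half" would be evaluated:

```python
import numpy as np

def tv_distance(p, q, dx):
    """Total-variation distance (1/2) * integral |p - q| dx on a uniform grid."""
    return 0.5 * np.sum(np.abs(p - q)) * dx

def gaussian(x, mu, sigma):
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-12.0, 12.0, 6001)
dx = x[1] - x[0]
original = gaussian(x, 0.0, 1.0)
# Covariance drift under iterated smoothing, modelled here as broadening
# of the replayed density while the mean stays put.
replayed = gaussian(x, 0.0, 2.0)
drift = tv_distance(original, replayed, dx)
# The half-life a_1/2 is the first age at which this distance reaches 1/2.
```

Doubling the standard deviation yields a TV distance of roughly 0.32, still short of the half-life threshold; broadening by a factor of about 3.5 crosses it, illustrating why slow covariance drift can accumulate for many days before the memory counts as forgotten.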

The interpretation favoured in Chertkov (2026) is information-theoretic: the Smooth step is a noisy discrete-time channel on the age variable, the segment count \(L\) plays the role of a channel-rate parameter, and the linear scaling of half-life with \(L\) is the CAS analogue of channel capacity in the sense of Shannon (1948). A memory surviving for \(a_{1/2}\) days corresponds to a code word transmitted at a rate below capacity: extending \(L\) by one segment buys roughly \(c \approx 2.4\) extra days of recall, independently of the task.

The two-regime structure of forgetting — near-perfect recall followed by a sharp sigmoidal transition into rapid decay — accompanies the linear capacity law and reflects the fact that the Smooth step acts as a heat-like operator with a characteristic mixing time of order \(L\) in age units.
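The heat-like character of the Smooth kernel can be seen in isolation. Taking the "averaging adjacent segments" rule literally (and ignoring the Compress and Add steps), repeated application to a unit mass in a single age bin produces a binomial profile whose variance grows linearly with the number of applications — exactly diffusive spreading in the age variable:

```python
import numpy as np

def smooth(f):
    """Adjacent averaging: (M,) -> (M-1,), the lossy kernel of the Smooth step."""
    return 0.5 * (f[:-1] + f[1:])

M = 201
f = np.zeros(M)
f[100] = 1.0            # unit mass concentrated in one age bin
for _ in range(20):     # twenty smoothing applications
    f = smooth(f)
idx = np.arange(f.size)
mean = (idx * f).sum()
var = ((idx - mean) ** 2 * f).sum()
# After n steps the profile is Binomial(n, 1/2): mean shifted by n/2,
# variance n/4 -- linear-in-time spreading, i.e. a discrete heat flow.
```

Mass is conserved exactly (the boundary bins stay empty here), while the profile's variance after \(n\) steps is \(n/4\), which is the discrete counterpart of the mixing-time argument above.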

MNIST experiments

On the MNIST digits 0, 3, and 8, projected to a \(d = 12\) latent space and modelled as \(K = 3\) Gaussian components per day, Chertkov (2026) reports an empirical half-life of \(a_{1/2} \approx 37\) days for a segment budget \(L = 16\), consistent with \(c \approx 2.3\) within the reported range. Separate control experiments vary \(L\), the number of modes, and the ambient dimension; in all cases the linear relation \(a_{1/2} \propto L\) persists, with the proportionality constant remaining inside \([2.0, 3.6]\) Chertkov (2026).

A decomposition of the forgetting signal into contributions from the mean, covariance, and mixture weights of each stored Gaussian mixture shows that, on MNIST, the dominant component of forgetting is covariance drift rather than mean displacement: modes slowly broaden under the iterated Smooth step long before their centroids have visibly migrated.

Relation to bridge diffusions

Compress-Add-Smooth sits inside the broader family of exactly solvable bridge diffusions that underpins the integrability hierarchy of path-integral-diffusion Chertkov (2026). Each day's sampler is a quadratic-potential bridge — an instance of harmonic-path-integral-diffusion — whose terminal law is the current mixture-of-mixtures stack. The day–night recursion is therefore a sequence of bridges: between two consecutive sampling cycles, the memory distribution is transported from the stack of day \(n\) to the stack of day \(n+1\) by a deterministic compress–add–smooth map, and within each cycle samples are drawn by a single bridge diffusion with closed-form drift.

This places CAS in the same structural position with respect to temporal memory as mean-field-pid occupies with respect to ensembles of interacting samples: both take a PID bridge as primitive and compose it — MF-PID in the particle index, CAS in the age index — while preserving analytic tractability at the level of the Green functions Chertkov (2026).

The absence of any learnt parameter vector, combined with the linear capacity law \(a_{1/2} \approx c \cdot L\) and the Shannon-style information-theoretic reading of the Smooth step Shannon (1948), is what motivates the designation "samples that remember": forgetting here is a property of temporal resolution, not of parameter overwrite Chertkov (2026).

See also

References