Homoiconic Continual Learning: bridging Lisp's self-referential core with neural plasticity
The proposed Homoiconic Continual Learning (HCL) framework draws a structurally compelling but imperfect analogy between Lisp’s metacircular evaluator and neural continual learning. The mapping is tightest for LoRA-based frozen-core architectures and in-context learning (where transformers demonstrably implement gradient descent in their forward pass), moderately strong for hypernetworks as weight generators, and weakest for the reversibility claims. While the specific Lisp-to-neural vocabulary has not appeared in prior publications, the underlying mechanisms — self-referential weight matrices, frozen cores with compositional deltas, and meta-learned continual learning algorithms — are well-established in Schmidhuber’s 30-year research program. The framework’s genuine contribution lies in unifying these threads under a principled programming-language-theoretic lens, but it needs formal categorical grounding and concrete algorithms that outperform existing methods to avoid remaining a suggestive metaphor.
1. The metacircular evaluator as a computational fixed point
The theoretical anchor of HCL is the metacircular evaluator from Chapter 4 of Abelson & Sussman’s Structure and Interpretation of Computer Programs (MIT Press, 1996). The eval/apply cycle defines a semantic fixed point: a Lisp interpreter written in Lisp, where eval dispatches expressions to evaluation rules and apply executes procedures on arguments. As SICP states, “expressions to be evaluated in environments are reduced to procedures to be applied to arguments, which in turn are reduced to new expressions to be evaluated in new environments.” The entire evaluator fits on roughly one page of code yet defines the complete semantics of Scheme.
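The eval/apply cycle is small enough to sketch directly. The toy interpreter below (written in Python rather than Scheme, to keep a single language across the code sketches here) covers only numbers, variables, lambda, and application, but it shows the mutual recursion SICP describes; the chosen subset and names are illustrative, not SICP's code.

```python
# Minimal eval/apply sketch for a tiny Lisp subset (numbers, symbols,
# (lambda (params) body), and application). Illustrative only.

def m_eval(expr, env):
    if isinstance(expr, (int, float)):          # self-evaluating
        return expr
    if isinstance(expr, str):                   # variable lookup
        return env[expr]
    if expr[0] == "lambda":                     # (lambda (params) body)
        _, params, body = expr
        return ("closure", params, body, env)
    fn = m_eval(expr[0], env)                   # application: evaluate the operator...
    args = [m_eval(a, env) for a in expr[1:]]   # ...and the operands,
    return m_apply(fn, args)                    # then hand off to apply

def m_apply(fn, args):
    if callable(fn):                            # primitive procedure
        return fn(*args)
    _, params, body, env = fn                   # compound procedure:
    new_env = dict(env, **dict(zip(params, args)))
    return m_eval(body, new_env)                # ...reduce to a new eval in a new environment

env = {"+": lambda a, b: a + b}
print(m_eval([["lambda", ["x"], ["+", "x", 1]], 41], env))  # -> 42
```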
The concept traces to McCarthy’s 1960 Communications of the ACM paper, where the universal S-function apply played “the theoretical role of a universal Turing machine and the practical role of an interpreter.” Reynolds (1972) coined the term “metacircular” and systematically classified definitional interpreters. The key property for HCL is homoiconicity — code and data share the same representation (S-expressions), enabling programs to inspect and modify themselves. Koza’s genetic programming (1992) exploited this directly: programs represented as S-expression trees are subjected to crossover and mutation as data, making code-modifying-code literal.
The neural analog of this property was articulated explicitly by Irie, Schlag, Csordás, & Schmidhuber (ICML 2022, arXiv:2202.05780): “The weight matrix of a neural network is its program.” Their Self-Referential Weight Matrix (SRWM) uses outer products and the delta update rule to modify itself during runtime, including the parts responsible for modification — the closest existing neural architecture to a metacircular evaluator. Schmidhuber’s original 1993 “self-referential weight matrix” (ICANN 1993) established this concept, and Kirsch & Schmidhuber (2022) formalized self-referential architectures that control all their own variables, proving them strictly more expressive than memory architectures without meta-optimization.
2. Catastrophic forgetting and the landscape of continual learning
Catastrophic forgetting — the tendency of neural networks to overwrite previously learned knowledge when trained on new tasks — was identified by McCloskey & Cohen (1989) and remains the central challenge in continual learning. The field has developed six main approaches, each mapping differently onto HCL’s architecture.
Regularization-based methods constrain weight updates to preserve previous knowledge. Elastic Weight Consolidation (Kirkpatrick et al., PNAS 2017, arXiv:1612.00796) adds a quadratic penalty weighted by the Fisher Information Matrix: L(θ) = L_new(θ) + Σᵢ (λ/2) Fᵢ(θᵢ − θ*ᵢ)². EWC treats all weights as part of a single mutable program — the antithesis of HCL’s frozen-core principle. Its limitations are well-documented: the number of regularization terms grows linearly with tasks, the Laplace approximation underestimates parameter importance (Huszár 2018), and accuracy degrades markedly after roughly 18 tasks on Permuted MNIST.
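The penalty itself is a few lines of PyTorch. In the sketch below, `fisher` and `theta_star` are assumed to hold the diagonal Fisher estimate and the parameter snapshot saved after the previous task; the function and variable names are illustrative.

```python
import torch

def ewc_penalty(model, fisher, theta_star, lam=1.0):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i (theta_i - theta*_i)^2.

    `fisher` and `theta_star` are dicts keyed by parameter name, holding the
    diagonal Fisher estimate and the parameter snapshot computed after the
    previous task (assumed given here).
    """
    loss = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - theta_star[name]) ** 2).sum()
    return 0.5 * lam * loss

# During training on the new task (illustrative):
#   total_loss = task_loss + ewc_penalty(model, fisher, theta_star, lam=100.0)
```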
Architecture-based methods align more closely with HCL. Progressive Neural Networks (Rusu et al. 2016, arXiv:1606.04671) freeze previous columns and add new ones with lateral connections — immune to forgetting by construction but with O(k²) parameter growth. PackNet (Mallya & Lazebnik, CVPR 2018) prunes and freezes weight subsets per task, using binary masks as “deltas” over a shared network. Dynamically Expandable Networks (Yoon et al., ICLR 2018) selectively retrain, expand, and split neurons, achieving batch-model performance with 12–60% of parameters.
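PackNet's mask-as-delta idea can be made concrete with a small sketch: at inference for task k, only the weight subsets claimed by tasks 1..k are active. The helper below shows the mask application only, not the pruning-and-retraining procedure, and the names are illustrative.

```python
import torch

def apply_task_mask(weight, task_masks, task_id):
    """Select the weights usable for `task_id`: the union of the binary masks
    of all tasks up to and including it (PackNet-style, simplified sketch)."""
    usable = torch.zeros_like(weight, dtype=torch.bool)
    for t in range(task_id + 1):
        usable |= task_masks[t]          # task_masks[t]: bool mask for task t
    return weight * usable               # masked-out weights act as if absent

# Here the binary mask itself plays the role of a per-task "delta"
# over a single shared, otherwise frozen weight tensor.
```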
Replay-based methods store or regenerate past examples. GEM (Lopez-Paz & Ranzato, NeurIPS 2017) formalized the field’s key metrics: Average Accuracy (ACC), Backward Transfer (BWT, negative = forgetting), and Forward Transfer (FWT). iCaRL (Rebuffi et al., CVPR 2017) combines exemplar storage with knowledge distillation. Dark Experience Replay (Buzzega et al., NeurIPS 2020) stores and distills logits alongside examples.
The most HCL-aligned existing paradigm is LoRA-based continual learning. LoRA (Hu et al., ICLR 2022, arXiv:2106.09685) freezes pretrained weights W₀ and learns low-rank updates ΔW = BA where rank r ≪ min(d,k). This directly implements HCL’s frozen core + compositional deltas: each task gets its own (Aₜ, Bₜ), and task switching involves swapping small adapter modules. Biderman et al. (2024, arXiv:2405.09673) showed “LoRA learns less and forgets less” — the low-rank constraint acts as implicit regularization against forgetting.
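A minimal LoRA layer makes the frozen-core + delta pattern explicit. This is a sketch following the ΔW = BA formulation above, not the reference implementation (bias, dropout, and initialization details are simplified).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W0 plus a trainable low-rank delta BA (rank r)."""
    def __init__(self, w0: torch.Tensor, r: int = 8, alpha: float = 16.0):
        super().__init__()
        d_out, d_in = w0.shape
        self.w0 = nn.Parameter(w0, requires_grad=False)      # frozen core
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # task-specific delta...
        self.B = nn.Parameter(torch.zeros(d_out, r))          # ...initialised so BA = 0
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.w0.T + self.scale * (x @ self.A.T @ self.B.T)

# Task switching swaps only (A, B); W0 is never modified, so any earlier
# task's weights can be reconstructed as W0 + B_t A_t.
```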
3. The LoRA-based continual learning explosion (2023–2025)
The period 2023–2025 saw rapid development of LoRA variants for continual learning, all implementing variations of HCL’s frozen-core architecture:
- O-LoRA (Wang et al., EMNLP 2023 Findings, arXiv:2310.14152) learns tasks in orthogonal low-rank subspaces, eliminating interference without replay
- InfLoRA (Liang & Li, CVPR 2024) designs the B matrices to project into subspaces orthogonal to previous tasks’ gradient directions, with capacity bounded by T ≤ ⌊d/r⌋
- BiLoRA (Zhu et al., CVPR 2025) achieves quadratically lower collision rates via bilinear frequency-based task separation, reaching 87.46% on CIFAR-100 versus 91.92% for joint training
- TreeLoRA (ICML 2025) uses hierarchical gradient-similarity trees for layer-wise LoRA allocation
- LiLoRA (arXiv:2508.06202, 2025) shares matrix A across tasks and applies additional low-rank decomposition to B — a hierarchical composition of deltas
- KeepLoRA (arXiv:2601.19659, 2026) projects gradient updates into residual subspaces orthogonal to both the pretrained principal subspace and previous task directions
These methods validate HCL’s core architectural claim: a frozen pretrained model with compositional, storable, swappable, and reversible low-rank weight deltas per task constitutes an effective continual learning architecture. The orthogonal variants (O-LoRA, InfLoRA, BiLoRA) add the critical property that task-specific updates don’t interfere with each other, approaching the ideal of non-destructive, reconstructible task knowledge.
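As one concrete illustration of the orthogonality idea, a regularizer in the spirit of O-LoRA can penalize overlap between the current task's A subspace and those of earlier tasks; the published loss may differ in normalization and detail, so treat this as a sketch rather than the method itself.

```python
import torch

def orthogonality_penalty(A_current, A_previous_list):
    """Penalize overlap between the row-space of the current task's LoRA A matrix
    and those of earlier tasks (O-LoRA-style sketch; details differ in the paper)."""
    penalty = 0.0
    for A_prev in A_previous_list:
        # (r_prev x d) @ (d x r_cur): pairwise inner products between subspace bases
        penalty = penalty + (A_prev @ A_current.T).abs().sum()
    return penalty
```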
Task arithmetic (Ilharco et al., ICLR 2023, arXiv:2212.04089) provides the algebraic complement: task vectors τₜ = θₜ − θ₀ can be added (multi-task), negated (unlearning), and composed by analogy. Chitale et al. (NeurIPS 2023 Workshop, arXiv:2311.02428) applied task arithmetic in LoRA space for continual learning, directly implementing HCL’s “reconstruction from core + stored deltas.” MagMax (Marczak et al., ECCV 2024) showed that simple maximum-magnitude weight selection during sequential fine-tuning outperforms many dedicated CL methods.
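In code, task arithmetic is plain arithmetic over state dicts. The sketch below assumes full fine-tuned checkpoints, but the same operations apply to LoRA deltas as in Chitale et al.; the helper names are illustrative.

```python
import torch

def task_vector(theta_task, theta_base):
    """tau_t = theta_t - theta_0, computed per parameter tensor."""
    return {k: theta_task[k] - theta_base[k] for k in theta_base}

def apply_task_vectors(theta_base, taus, alphas):
    """theta = theta_0 + sum_t alpha_t * tau_t  (addition composes tasks,
    a negative alpha approximately unlearns one)."""
    theta = {k: v.clone() for k, v in theta_base.items()}
    for tau, alpha in zip(taus, alphas):
        for k in theta:
            theta[k] = theta[k] + alpha * tau[k]
    return theta

# Usage (illustrative), given state dicts theta_0, theta_a, theta_b:
#   tau_a = task_vector(theta_a, theta_0); tau_b = task_vector(theta_b, theta_0)
#   theta_multi = apply_task_vectors(theta_0, [tau_a, tau_b], alphas=[0.5, 0.5])
```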
4. Hypernetworks as neural interpreters that generate weight-programs
Hypernetworks (Ha, Dai, & Le, ICLR 2017, arXiv:1609.09106) instantiate a direct structural parallel to eval/apply: a small network (the hypernetwork/eval) takes a task description and generates weights (programs) for a target network (apply). The target network then executes these weights on inputs to produce outputs.
Von Oswald, Henning, Grewe, & Sacramento (ICLR 2020, arXiv:1906.00695) applied this to continual learning with striking results. Their task-conditioned hypernetwork generates full target weights from task embeddings: θₜ = h(eₜ; φ). Instead of rehearsing data, the system rehearses weight configurations — a regularizer constrains h(eᵢ; φ) to remain close to previously computed weight realizations. This achieves a compressive regime where the hypernetwork parameters can be smaller than the target network yet retain memories for many tasks. The hypnettorch library (github.com/chrhenning/hypnettorch) and hypercl repository (github.com/chrhenning/hypercl) provide PyTorch implementations.
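A compact sketch of the two ingredients, the task-conditioned generator θₜ = h(eₜ; φ) and the output-space regularizer that rehearses weight configurations, is given below. It mirrors the idea rather than the hypnettorch API, and the module sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TaskConditionedHypernet(nn.Module):
    """Maps a learned task embedding e_t to a flat weight vector theta_t = h(e_t; phi)."""
    def __init__(self, n_tasks, emb_dim, target_numel):
        super().__init__()
        self.embeddings = nn.Embedding(n_tasks, emb_dim)
        self.h = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(),
                               nn.Linear(256, target_numel))

    def forward(self, task_id):
        return self.h(self.embeddings(torch.as_tensor(task_id)))

def output_regularizer(hnet, stored_weights):
    """Rehearse weight configurations instead of data: keep h(e_i; phi) close to
    the weight realizations produced for earlier tasks (sketch of the regularizer)."""
    reg = 0.0
    for task_id, theta_old in stored_weights.items():
        reg = reg + ((hnet(task_id) - theta_old) ** 2).mean()
    return reg
```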
Recent extensions include partial hypernetworks for CL (Hemati et al., PMLR 2023), HyperPEFT for ViT-based continual learning (Information Sciences, 2024), and the provocative reframing of attention itself as a hypernetwork (ICLR 2025): key-query interactions specify a low-dimensional latent code that parameterizes value-network operations, supporting compositional generalization in abstract reasoning.
The analogy to metacircular evaluation holds in several ways. Both systems separate the “interpreter” (hypernetwork/eval) from the “programs” being interpreted (generated weights/expressions). Both achieve generalization through a shared computational core. And both exhibit a form of homoiconicity: generated weights are simultaneously data (output of the hypernetwork) and programs (executable parameters of the target network).
Where it breaks: hypernetworks don’t truly interpret themselves. The metacircular evaluator is special because the interpreter and the interpreted are the same language. Schmidhuber’s SRWM (1993, modernized in Irie et al. 2022) comes closest by allowing a weight matrix to modify itself, including the parts responsible for modification. But even here, initial training relies on external gradient descent.
5. In-context learning provides the strongest evidence for the “frozen interpreter” thesis
The most compelling support for HCL’s core metaphor comes from the in-context learning literature, which demonstrates that frozen transformer weights implement adaptive learning algorithms in their forward pass.
Von Oswald, Niklasson, Randazzo, et al. (ICML 2023, arXiv:2212.07677) proved by construction that a single linear self-attention layer implements one step of gradient descent on a regression loss. Empirically, trained self-attention transformers converge to this theoretical construction — they become mesa-optimizers that learn models by gradient descent within their forward pass. The frozen weights are the outer loop (meta-learning); the forward-pass computation is the inner loop (task-specific adaptation). This is the precise structure of HCL: a frozen metacircular core executing variable programs.
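The core identity is easy to verify numerically. Under the construction's assumptions (least-squares loss, initialization W = 0, unnormalized linear attention whose keys and values are the context inputs and targets), one explicit gradient step and one linear-attention readout give the same prediction; the snippet below is that check, not the paper's full transformer.

```python
import torch

torch.manual_seed(0)
N, d = 32, 5
X = torch.randn(N, d)                  # in-context inputs x_i
w_true = torch.randn(d)
y = X @ w_true                         # in-context targets y_i (scalar regression)
x_q = torch.randn(d)                   # query token
lr = 0.1

# One explicit gradient step on L(w) = 1/(2N) * sum_i (w.x_i - y_i)^2, from w = 0
grad = -(y.unsqueeze(1) * X).mean(0)   # dL/dw evaluated at w = 0
w_one_step = -lr * grad
pred_gd = w_one_step @ x_q

# Unnormalised linear self-attention: values = y_i, keys = x_i, query = x_q
attn_scores = X @ x_q                  # k_j . q
pred_attention = (lr / N) * (y * attn_scores).sum()

print(torch.allclose(pred_gd, pred_attention, atol=1e-6))  # True
```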
Akyürek, Schuurmans, Andreas, Ma, & Zhou (ICLR 2023, arXiv:2211.15661) showed transformers implement and transition between different algorithms — gradient descent, ridge regression, and exact least-squares — depending on depth and noise, converging to Bayesian estimators at large width. Garg, Tsipras, Liang, & Valiant (NeurIPS 2022, arXiv:2208.01066) demonstrated in-context learning of linear functions, sparse linear functions, neural networks, and decision trees. Dai et al. (ACL 2023 Findings, arXiv:2212.10559) showed transformer attention has a dual form of gradient descent: the pretrained model serves as a meta-optimizer producing meta-gradients from demonstrations.
The follow-up paper on mesa-optimization (von Oswald et al., ICLR 2024, arXiv:2309.05858) deepened the picture: standard next-token prediction training gives rise to a subsidiary learning algorithm within the forward pass. Multi-layer analysis revealed that first layers perform “token binding” (constructing a mesa-dataset of input-output associations) while subsequent layers perform mesa-optimization. Crucially, the learned forward-pass optimization algorithm can be repurposed for supervised few-shot tasks — the same “interpreter” runs different “programs.”
Li, Ildiz, Papailiopoulos, & Oymak (ICML 2023, arXiv:2301.07067) formalized this as algorithm learning: the transformer constructs hypothesis functions at inference time, with generalization bounds through algorithmic stability. The inductive bias depends on task complexity and number of training tasks, not transformer complexity — the transformer effectively selects a task-appropriate algorithm.
6. Meta-learning bridges the gap between adaptation and continual learning
MAML (Finn, Abbeel, & Levine, ICML 2017, arXiv:1703.03400) provides a natural bridge. Its meta-learned initialization θ encodes general-purpose learning capability — a “frozen starting point” from which task-specific gradient steps produce fast adaptation. The inner-outer loop structure mirrors HCL’s frozen core + task-specific deltas, and the theoretical equivalence between MAML and in-context learning (von Oswald et al. 2023) makes this connection rigorous for linear models.
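The inner/outer structure can be written functionally in a few lines. Here `theta` is an explicit list of parameter tensors and `loss_fn(theta, x, y)` is a placeholder that runs the model from that list; both are assumptions of this sketch, not part of the original MAML code.

```python
import torch

def maml_inner_adapt(theta, support_batch, loss_fn, inner_lr=0.01):
    """One inner-loop step: task-specific parameters theta' = theta - alpha * grad."""
    x, y = support_batch
    loss = loss_fn(theta, x, y)
    # create_graph=True keeps the graph so the outer step can differentiate through it
    grads = torch.autograd.grad(loss, theta, create_graph=True)
    return [p - inner_lr * g for p, g in zip(theta, grads)]

def maml_outer_loss(theta, tasks, loss_fn):
    """Outer loop: evaluate each adapted theta' on that task's held-out query batch."""
    total = 0.0
    for support, query in tasks:
        theta_prime = maml_inner_adapt(theta, support, loss_fn)
        x_q, y_q = query
        total = total + loss_fn(theta_prime, x_q, y_q)
    return total / len(tasks)
```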
The meta-continual learning field, surveyed by Son, Lee, & Kim (IEEE TPAMI 2024, arXiv:2311.05241), defines five combinatorial frameworks bridging meta-learning and continual learning. Most relevant is Meta-Continual Learning (MCL), where MAML-style bi-level optimization trains an initialization that remains good for all tasks while the inner loop adapts to each. Javed & White (NeurIPS 2019) used MAML to learn representations robust to catastrophic forgetting. MAML-en-LLM (KDD 2024, arXiv:2405.11446) explicitly applies MAML’s bi-level optimization to improve in-context learning, achieving 2–4% improvements.
The most directly HCL-relevant work is Automating Continual Learning (ACL) by Kirsch, Harrison, Sohl-Dickstein, & Schmidhuber (TMLR), which uses self-referential neural networks to meta-learn their own in-context continual learning algorithms. ACL encodes CL desiderata into meta-learning objectives and resolves “in-context catastrophic forgetting” — a self-referential system that discovers its own strategy for avoiding catastrophic forgetting, implemented and benchmarked.
7. Reversibility: miniKanren’s elegant inversion versus neural approximations
HCL’s claim that weight updates should be “structured and reversible, analogous to relational/backwards execution in miniKanren” is the framework’s weakest link. miniKanren (Friedman, Byrd, & Kiselyov, The Reasoned Schemer, MIT Press 2005; Byrd’s 2009 Indiana University dissertation) treats programs as mathematical relations, eliminating the distinction between inputs and outputs. Byrd, Holk, & Friedman (2012) demonstrated quine generation via relational interpreters, and Byrd, Ballantyne, Rosenblatt, & Might (ICFP 2017) showed a single relational interpreter solving seven programming challenges including program synthesis and theorem proving. The metaKanren work (ICFP 2021 miniKanren Workshop) achieved a metacircular relational interpreter — miniKanren interpreting miniKanren, runnable backwards for program synthesis.
Neural reversibility operates on a fundamentally different level. RevNets (Gomez, Ren, Urtasun, & Grosse, NeurIPS 2017, arXiv:1707.04585) achieve activation reconstruction via coupling layers: y₁ = x₁ + F(x₂), y₂ = x₂ + G(y₁), with exact inverse x₂ = y₂ − G(y₁), x₁ = y₁ − F(x₂). i-RevNet (Jacobsen, Smeulders, & Oyallon, ICLR 2018) extends this to fully invertible networks, proving “no information is discarded.” Normalizing flows (NICE by Dinh et al. 2014; RealNVP by Dinh et al. 2016; Glow by Kingma & Dhariwal, NeurIPS 2018) provide invertible transformations with tractable Jacobian determinants. Invertible Residual Networks (Behrmann et al., ICML 2019, arXiv:1811.00995) proved that Lipschitz-constrained residual functions yield invertible networks.
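The coupling construction and its exact inverse are short enough to state directly; the sketch below uses arbitrary small MLPs for F and G and checks reconstruction numerically.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """RevNet/NICE-style block: y1 = x1 + F(x2), y2 = x2 + G(y1), inverted exactly."""
    def __init__(self, dim):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
        self.G = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

block = AdditiveCoupling(dim=4)
x1, x2 = torch.randn(2, 4), torch.randn(2, 4)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
print(torch.allclose(x1, r1, atol=1e-6), torch.allclose(x2, r2, atol=1e-6))  # True True
```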
The connection to reversible computation theory is deep. Landauer (1961) established that irreversible operations must dissipate energy; Bennett (1973) proved any computation can be made reversible at the cost of additional memory. But neural reversibility is numerical invertibility of activations, not logical reversibility of reasoning. RevNets reconstruct activations for memory-efficient training; they cannot “reason backwards” about what inputs would produce desired outputs. miniKanren’s relational execution is about running interpreters as synthesizers — a qualitatively different capability that has no real neural analog today.
The closest neural approach to logical reversibility comes from orthogonal LoRA methods (O-LoRA, InfLoRA), where task-specific updates occupy separable subspaces and can be individually added or removed. LoRA’s merge/unmerge mechanism — model.eval() merges W₀ + BA, model.train() unmerges — provides simple additive reversibility. But this is compositional reversibility of deltas, not logical reversibility of computation.
8. Formal frameworks and the category-theoretic bridge
The most promising path to formalizing HCL lies in category theory. Fong, Spivak, & Tuyéras (arXiv:1711.10455, 2019) defined a category NNet of neural networks (objects = dimensions, morphisms = architectures) and showed that the passage from a network architecture to a supervised learning algorithm is functorial: gradient descent with backpropagation assembles into a functor into a category of learners, with the chain rule supplying the compositionality. This provides the mathematical language for composing learning systems.
Gavranović et al. (ICML 2024, arXiv:2402.15332) proposed that “categorical deep learning is an algebraic theory of all architectures,” using monads in 2-categories of parametric maps. Gavranović (2020, arXiv:2009.06837) showed that functors (not just functions) can be learned via gradient descent, extending the categorical framework to meta-learning.
To formalize HCL categorically, one would need:
- A category Lisp whose objects are types and morphisms are Lisp programs, with eval as an endofunctor
- A category Neural whose objects are weight spaces and morphisms are parameterized maps
- A functor F: Lisp → Neural mapping the metacircular evaluator to the frozen core, programs to weight deltas, and relational inversion to reversible networks
- Natural transformations expressing the coherence conditions between symbolic and neural self-reference
No such formalization exists today. The Shiebler, Gavranović, & Wilson survey (ACT 2021, arXiv:2106.07032) covers category theory in machine learning broadly but does not address continual learning. This represents a genuine open problem and potential contribution.
Google’s Nested Learning (Behrouz & Mirrokni, NeurIPS 2025) provides a complementary framework: models as nested multi-level optimization problems where architecture and optimizer are fundamentally the same concept at different levels. Their HOPE architecture is a self-modifying variant of Titans with a continuum memory system, updating at different frequencies across memory levels. This is the closest independent development to HCL’s multi-level self-referential vision.
9. Where the analogy holds, breaks, and what is genuinely novel
Tight mappings. The LoRA frozen-core + compositional-deltas pattern maps cleanly onto eval/apply + programs. In-context learning as “running programs on a fixed interpreter” is now empirically validated by multiple groups (von Oswald et al. 2023; Akyürek et al. 2023; Dai et al. 2023). Task arithmetic provides genuine (if approximate) compositionality over weight-space “programs.” The hypernetwork-as-eval analogy captures the structural relationship between program generators and program executors.
Moderate mappings. Hypernetworks generate weights for separate target networks but lack true self-reference. MAML’s initialization serves as a “frozen core” but was not designed as an interpreter. Self-Referential Weight Matrices (Irie et al. 2022) achieve self-modification but through outer products, not recursive symbolic evaluation.
Weak mappings. The reversibility claim conflates numerical invertibility (RevNets) with logical reversibility (miniKanren). Weight deltas lack syntactic structure, control flow, variables, or compositional semantics — they are opaque numerical objects, not programs. The continuous-discrete gap is fundamental: Lisp’s power comes from exact symbolic manipulation, while neural networks operate in approximate continuous spaces. Task arithmetic works only in a small neighborhood of pretrained weights (scaling coefficient α < 1), degrading with larger modifications.
Prior work assessment. The specific vocabulary — homoiconic, metacircular evaluator, eval/apply — applied to continual learning appears unpublished. However, the underlying structural ideas are thoroughly explored in Schmidhuber’s research program (1987–2024), including self-referential weight matrices, meta-meta-learning, and networks that modify their own learning rules. The ACL paper (Kirsch et al.) already implements self-referential networks that meta-learn continual learning algorithms. Google’s Nested Learning independently develops the multi-level optimization interpretation.
Genuinely novel elements. HCL’s contribution would be strongest as: (1) a unifying vocabulary bridging PL theory and continual learning communities, (2) the specific insight that frozen cores should be designed as interpreters rather than inherited from pretraining, (3) formal categorical grounding connecting Lisp’s semantic fixed point to neural learning dynamics, and (4) concrete algorithms derived from the analogy that outperform existing methods — none of which yet exist.
10. PyTorch experimental infrastructure
The experimental ecosystem for validating HCL is mature. Avalanche (github.com/ContinualAI/avalanche, JMLR 2023) provides benchmarks (Split-MNIST, Split-CIFAR-10/100, Permuted-MNIST), training strategies (EWC, GEM, PackNet, replay methods), and evaluation metrics (forgetting, backward/forward transfer). The van de Ven codebase (github.com/GMvandeVen/continual-learning) supports systematic comparison across Task-IL, Domain-IL, and Class-IL scenarios.
For the frozen-core + deltas component, InfLoRA (github.com/liangyanshuo/InfLoRA, CVPR 2024) and O-LoRA (github.com/cmnfriend/O-LoRA, EMNLP 2023) implement orthogonal LoRA for continual learning with ViT and LLM backbones. Online-LoRA (github.com/christina200/online-lora-official, WACV 2025) adds automatic distribution shift detection. The task_vectors repository (github.com/mlfoundations/task_vectors) implements task arithmetic over CLIP models.
For hypernetwork experiments, hypnettorch (github.com/chrhenning/hypnettorch) provides a general hypernetwork framework and hypercl (github.com/chrhenning/hypercl) implements von Oswald et al.’s continual learning hypernetwork with Split-MNIST/CIFAR benchmarks. For reversible components, FrEIA (github.com/vislearn/FrEIA) and nflows (github.com/bayesiains/nflows) provide invertible architecture building blocks.
A prototype HCL experiment would combine: (1) a frozen ViT backbone as the metacircular core, (2) per-task LoRA adapters as compositional weight deltas with orthogonality constraints, (3) task arithmetic for reconstruction verification (θ_task = θ_core + Δ_task), and (4) comparison against EWC, Progressive Nets, and vanilla LoRA on Split-CIFAR-100 using standard ACC/BWT/FWT metrics. The key HCL-specific test: whether designing the frozen core explicitly for “interpretive capacity” (via meta-training on diverse task distributions before freezing) yields better continual learning than a standard pretrained backbone.
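As a structural stand-in for that pipeline, the self-contained toy below replaces the ViT with a frozen linear "core", Split-CIFAR-100 with synthetic Gaussian tasks, and InfLoRA with plain per-task low-rank deltas. It is meant only to show the shape of the experiment (frozen core, stored deltas, reconstruction into an accuracy matrix R), not to produce meaningful numbers.

```python
# Toy sketch of the frozen-core + per-task-delta pipeline on synthetic tasks.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_classes, n_tasks, r = 32, 4, 5, 4

def make_task(seed):
    """Synthetic classification task: labels from a random linear teacher."""
    g = torch.Generator().manual_seed(seed)
    w = torch.randn(n_classes, d, generator=g)
    x = torch.randn(400, d, generator=g)
    y = (x @ w.T).argmax(dim=1)
    return (x[:300], y[:300]), (x[300:], y[300:])   # (train, test)

core = nn.Linear(d, n_classes)            # stand-in for the frozen core
for p in core.parameters():
    p.requires_grad = False

tasks = [make_task(s) for s in range(n_tasks)]
deltas = []                               # stored per-task low-rank deltas
R = torch.zeros(n_tasks, n_tasks)         # R[i, j]: accuracy on task j after task i

for t, ((xtr, ytr), _) in enumerate(tasks):
    A = (torch.randn(r, d) * 0.01).requires_grad_()   # LoRA-style delta W0 + B A
    B = torch.zeros(n_classes, r, requires_grad=True)
    opt = torch.optim.Adam([A, B], lr=0.05)
    for _ in range(300):
        logits = core(xtr) + xtr @ A.T @ B.T
        loss = nn.functional.cross_entropy(logits, ytr)
        opt.zero_grad()
        loss.backward()
        opt.step()
    deltas.append((A.detach(), B.detach()))           # store Delta_t; core untouched

    for j in range(t + 1):                             # reconstruct theta_j = theta_core + Delta_j
        Aj, Bj = deltas[j]
        xte, yte = tasks[j][1]
        pred = (core(xte) + xte @ Aj.T @ Bj.T).argmax(dim=1)
        R[t, j] = (pred == yte).float().mean()

print(R)   # feed into the ACC/BWT/FWT computation sketched in the next section
```

Because the core is never modified and each Δₜ is stored exactly, R[t, j] = R[j, j] for every t ≥ j, so backward transfer is zero by construction; the HCL-specific question is whether meta-training the frozen core for interpretive capacity improves ACC and forward transfer relative to an off-the-shelf backbone.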
11. Benchmarks and evaluation standards
Standard continual learning evaluation follows the taxonomy of Van de Ven & Tolias (2019): Task-Incremental (task identity available at test time), Domain-Incremental (input distribution changes, same structure), and Class-Incremental (most challenging — discriminate among all seen classes without task identity). The Lopez-Paz & Ranzato (2017) metrics remain standard: ACC = (1/T) Σ_{i=1}^{T} R_{T,i}; BWT = (1/(T−1)) Σ_{i=1}^{T−1} (R_{T,i} − R_{i,i}); FWT = (1/(T−1)) Σ_{i=2}^{T} (R_{i−1,i} − bᵢ), where R_{i,j} is the accuracy on task j after training through task i and bᵢ is the accuracy of a randomly initialized model on task i.
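Given the accuracy matrix, these metrics reduce to a few lines; the function below follows the GEM conventions stated above, with R[i, j] the accuracy on task j after training through task i and b the random-initialization baseline.

```python
import torch

def cl_metrics(R, b):
    """ACC/BWT/FWT from R[i, j] = accuracy on task j after training through task i,
    and b[j] = accuracy of a randomly initialized model on task j (GEM conventions)."""
    T = R.shape[0]
    acc = R[T - 1, :].mean()
    bwt = (R[T - 1, :T - 1] - torch.diagonal(R)[:T - 1]).mean()
    fwt = torch.stack([R[i - 1, i] - b[i] for i in range(1, T)]).mean()
    return acc.item(), bwt.item(), fwt.item()

# Illustrative use with the toy R matrix from the previous section (4 classes,
# so random-init accuracy is about 0.25 per task):
#   acc, bwt, fwt = cl_metrics(R, b=torch.full((R.shape[0],), 0.25))
```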
Recent benchmarks address limitations of synthetic task splits. CLEAR (Lin et al., NeurIPS 2022, arXiv:2201.06289) provides natural temporal evolution from YFCC100M images (2004–2014). CORe50 (Lomonaco & Maltoni 2017) offers 50 objects across 11 sessions. CoIN (Chen et al. 2024) benchmarks continual instruction tuning for multimodal LLMs. For LLM-specific evaluation, the comprehensive survey by Shi et al. (ACM Computing Surveys 2025, arXiv:2404.16789) covers continual pretraining, domain-adaptive pretraining, and continual instruction tuning, with curated paper lists at github.com/Wang-ML-Lab/llm-continual-learning-survey.
State-of-the-art results on Split-CIFAR-100 with pretrained ViTs: BiLoRA achieves 87.46% final accuracy (2025), InfLoRA achieves strong results with capacity limits, versus joint training at 91.92%. On Permuted-MNIST, EWC maintains ~90%+ accuracy over 3 tasks but degrades after roughly 18. GEM shows minimal forgetting with episodic memory. The gap between the best continual learning methods and joint training has narrowed substantially with foundation model backbones.
Conclusion: a conceptual bridge that needs engineering and formalization
HCL identifies a genuine structural correspondence between Lisp’s self-referential computational model and the emerging architecture of frozen-core neural continual learning. The strongest evidence comes from three converging lines: (1) LoRA-based continual learning already implements the frozen-core + compositional-deltas pattern with strong empirical results, (2) in-context learning research proves that frozen transformer weights implement adaptive learning algorithms — mesa-optimizers — in their forward pass, and (3) self-referential weight matrices demonstrate that neural networks can modify their own computational substrate, approaching homoiconicity.
The framework’s main risk is remaining a productive metaphor rather than becoming a productive theory. Three developments would transform HCL from analogy to architecture. First, a categorical formalization mapping a Lisp-like free monad to a neural parametric-maps category, with natural transformations expressing coherence between symbolic and neural self-reference. Second, concrete algorithms where the frozen core is explicitly meta-trained for interpretive capacity rather than inherited from standard pretraining — testing whether “interpreter-designed” cores outperform “representation-designed” cores. Third, structured weight deltas with formal compositionality guarantees going beyond approximate task arithmetic. The tools exist (Avalanche, hypnettorch, InfLoRA, FrEIA); what’s missing is the synthesis. The key open question is whether the Lisp framing generates predictions that existing frameworks — stability-plasticity theory, Fisher information regularization, tangent-space task arithmetic — do not.