From LISP self-reference to self-modifying neural architectures: a literature survey
The intellectual arc from McCarthy’s 1960 eval/apply to modern self-referential transformers is not a metaphor — it is a precise technical lineage. LISP’s homoiconicity established that programs and data can inhabit the same representational space; Backus’s FP algebra showed that programs compose through a closed set of algebraic combinators; denotational semantics grounded self-reference in fixed-point theory; and today’s transformers implement learning algorithms in their forward pass, effectively becoming self-modifying computational systems. This survey maps that lineage across nine research areas, providing the bibliographic foundation for two papers: a conceptual paper connecting LISP self-reference to emergent self-awareness in modern ML, and a technical paper proposing experiments in LISP-based differentiable architecture manipulation.
1. LISP self-reference and homoiconicity: where code became data
Foundational references
John McCarthy, “Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I,” Communications of the ACM 3(4):184–195, 1960. McCarthy introduced S-expressions as a uniform representation for both programs and data, and defined the universal S-function apply, which he described as playing “the theoretical role of a universal Turing machine and the practical role of an interpreter.” Steve Russell’s implementation of eval as executable code on the IBM 704 — against McCarthy’s own expectation — created the first self-interpreting programming system. The key result: LISP’s eval/apply pair constitutes a self-describing computational system, a program that interprets programs in its own data representation.
Harold Abelson, Gerald Jay Sussman, and Julie Sussman, Structure and Interpretation of Computer Programs, MIT Press, 1985 (2nd ed. 1996). Chapter 4 presents the metacircular evaluator — a Scheme interpreter written in Scheme. The evaluator is built from two mutually recursive procedures (eval and apply) that fully specify the language’s semantics. SICP states: “An evaluator that is written in the same language that it evaluates is said to be metacircular.” This is the canonical demonstration that homoiconic languages enable complete self-description.
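To make the eval/apply pair concrete, here is a minimal sketch in Python rather than Scheme, with S-expressions as nested lists. The names m_eval and m_apply are ours, and only quote, if, and lambda are handled; SICP's full evaluator covers much more.

```python
def m_eval(exp, env):
    if isinstance(exp, str):                 # symbol: variable lookup
        return env[exp]
    if not isinstance(exp, list):            # number etc.: self-evaluating
        return exp
    op, *args = exp
    if op == 'quote':                        # code returned as data, unevaluated
        return args[0]
    if op == 'if':
        test, then, alt = args
        return m_eval(then if m_eval(test, env) else alt, env)
    if op == 'lambda':                       # (lambda (params) body)
        params, body = args
        return lambda *vals: m_eval(body, {**env, **dict(zip(params, vals))})
    proc = m_eval(op, env)                   # application: eval calls apply,
    return m_apply(proc, [m_eval(a, env) for a in args])

def m_apply(proc, args):                     # ...and apply calls back into eval
    return proc(*args)

env = {'*': lambda a, b: a * b, 'n': 6}
print(m_eval(['*', 'n', 7], env))                            # 42
print(m_eval([['lambda', ['x'], ['*', 'x', 'x']], 4], env))  # 16
```

The program being evaluated is an ordinary list, open to the same manipulation as any other data.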
John C. Reynolds, “Definitional Interpreters for Higher-Order Programming Languages,” Proceedings of the ACM Annual Conference, 2:717–740, 1972. Reynolds coined the term “metacircular” and showed that self-interpreters inherit semantic properties (e.g., evaluation order) from their metalanguage, introducing defunctionalization and CPS transformation to resolve the resulting ambiguities.
Fixed-point combinators and Kleene’s recursion theorem
Haskell B. Curry and Robert Feys, Combinatory Logic, Volume I, North-Holland, 1958. Defines the Y combinator (Y = λf.(λx.f(x x))(λx.f(x x))), which enables recursion without explicit self-reference — a function need not know its own name. Self-reference emerges from composition rules alone.
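The construction is short enough to run. Under Python's eager evaluation Curry's Y diverges, so this sketch uses the eta-expanded call-by-value variant, usually called the Z combinator; note that the factorial below never mentions its own name.

```python
# Z = λf.(λx.f(λv.x x v))(λx.f(λv.x x v)), the strict-language form of Y
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# Recursion without self-reference: 'self' is supplied by Z, not by a name.
fact = Z(lambda self: lambda n: 1 if n == 0 else n * self(n - 1))
print(fact(5))  # 120
```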
Stephen C. Kleene, Introduction to Metamathematics, North-Holland, 1952. The Second Recursion Theorem states: for any total computable function f, there exists an index n such that φ_n = φ_{f(n)}. Every computable transformation of programs has a fixed point. This guarantees the existence of quines, self-interpreters, and all forms of computational self-reference. As Jones observed, “the recursion theorem is almost trivial in a reflexive language.”
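The theorem's most tangible corollary is the quine. A classic Python instance makes the diagonal construction visible: the program holds a representation of itself and applies it to itself, the same move as φ_n = φ_{f(n)}.

```python
# A program whose output is exactly its own source text:
s = 's = %r\nprint(s %% s)'
print(s % s)
```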
Oleg Kiselyov, “Kleene’s Second Recursion Theorem: A Functional Pearl,” draft ~2018. Bridges computability theory and functional metaprogramming, noting that “Lisp-like quotation in particular helps with” efficient implementations of the recursion theorem. Demonstrates that the proof of Kleene’s SRT is structurally analogous to the Y combinator construction.
Joseph Helfer, “Y is a Least Fixed Point Combinator,” arXiv:2504.19379, 2025. Proves that Curry’s Y combinator produces least fixed points when lambda-calculus terms are viewed as partial functions, connecting lambda-calculus fixed points to domain theory.
Homoiconicity
Calvin N. Mooers and L. Peter Deutsch, “TRAC, A Text-Handling Language,” Proceedings of the ACM ‘65, pp. 229–246, 1965. Origin of the term: “Because TRAC procedures and text have the same representation inside and outside the processor, the term homoiconic is applicable, from homo meaning the same, and icon meaning representation.”
Alan Kay, The Reactive Engine, PhD thesis, University of Utah, 1969. Popularized the term, describing LISP and TRAC as “homoiconic in that their internal and external representations are essentially the same.”
LISP homoiconicity and modern ML
Kevin Ellis et al., “DreamCoder: Bootstrapping Inductive Program Synthesis with Wake-Sleep Library Learning,” PLDI 2021; arXiv:2006.08381. DreamCoder learns to solve problems by synthesizing programs in lambda calculus (essentially S-expression trees). A wake-sleep algorithm alternately synthesizes programs, refactors them into reusable library abstractions, and trains a neural recognition network. The system grows its own programming language — a form of meta-programming directly analogous to LISP’s defun and macro systems.
Steven T. Piantadosi, “The Computational Origin of Representation,” Minds and Machines 31:1–58, 2021. Proposes that cognition operates through a Language of Thought based on combinatory logic. Notes: “Interestingly, the computational power to use recursion comes for free if lambda calculus is the representational system: recursion can be constructed via the Y-combinator out of nothing more than the composition laws of lambda calculus.”
Martín Abadi and Gordon D. Plotkin, “A Simple Differentiable Programming Language,” Proc. ACM Program. Lang. 4, POPL, Article 38, 2020; arXiv:1911.04523. Defines a language with recursive functions and first-class reverse-mode differentiation, proving operational and denotational semantics coincide. Differentiation as a language-level operation parallels eval: just as eval maps code-as-data to values, the differentiation operator maps programs to their derivative programs.
Bohdan B. Khomtchouk, Edmund Weitz, and Claes Wahlestedt, “The Machine that Builds Itself: How the Strengths of Lisp Family Languages Facilitate Building Complex and Flexible Bioinformatic Models,” arXiv:1608.02621, 2016. Argues that Lisp’s homoiconicity, macro system, and DSL construction capabilities make it uniquely suited for building AI/ML applications — “the machine that builds itself.”
2. Backus’s FP algebra: programs as algebraic objects
The 1977 Turing lecture
John Backus, “Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs,” Communications of the ACM 21(8):613–641, 1978. Backus critiqued conventional languages as “fat and weak” due to their “primitive word-at-a-time style” and identified “the assignment statement is the von Neumann bottleneck.” He proposed FP: a variable-free, function-level system built on combining forms (composition, construction, apply-to-all/map, condition, insert/reduce) that serve as operations of an algebra of programs. His canonical example — Def Innerproduct = (/+) ∘ (α×) ∘ Transpose — defines inner product purely by composing three operations without naming variables. The combining forms are closed: composing them always yields functions within the FP system.
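The combining forms translate directly into higher-order functions. A minimal Python sketch, where compose, alpha, insert, and transpose are our names for ∘, α, /, and Transpose:

```python
from functools import reduce

def compose(*fs):                    # ∘ : right-to-left composition
    return lambda x: reduce(lambda acc, f: f(acc), reversed(fs), x)

def alpha(f):                        # α : apply-to-all (map)
    return lambda xs: [f(x) for x in xs]

def insert(f):                       # /  : insert (reduce)
    return lambda xs: reduce(f, xs)

def transpose(pair):
    return list(zip(*pair))

mul = lambda p: p[0] * p[1]
add = lambda a, b: a + b

# Def Innerproduct = (/+) ∘ (α×) ∘ Transpose
innerproduct = compose(insert(add), alpha(mul), transpose)
print(innerproduct([[1, 2, 3], [4, 5, 6]]))  # 32
```

No element variable ever appears: the program is assembled entirely at the function level, which is what makes its algebraic laws applicable.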
The direct mapping to neural network primitives is striking: composition = layer chaining; construction = multi-head/branching; apply-to-all = convolution/map; insert/reduce = pooling; condition = gating mechanisms.
John Backus, “The Algebra of Functional Programs: Function Level Reasoning, Linear Equations, and Extended Definitions,” LNCS 107, Springer, 1981. Extends the FP algebra with function-level reasoning and linear equations.
Bird-Meertens formalism
Jeremy Gibbons, “The School of Squiggol: A History of the Bird–Meertens Formalism,” FM 2019 International Workshops, LNCS 12233, pp. 18–42, Springer, 2020. Comprehensive history of BMF, “a calculus for program transformation by equational reasoning in a function style.” BMF’s operators — map, fold/reduce, scan, filter — are precisely Backus’s combining forms placed on categorical foundations, with fusion laws enabling systematic program derivation.
Richard S. Bird, “An Introduction to the Theory of Lists,” Monograph PRG-56, Oxford, 1986. Established the theory of lists as a calculational framework, introducing the homomorphism lemma and fusion laws.
Richard S. Bird and Oege de Moor, The Algebra of Programming, Prentice-Hall, 1997. The definitive synthesis of BMF using category theory (initial algebras, catamorphisms) for program calculation.
Categorical semantics
Erik Meijer, Maarten Fokkinga, and Ross Paterson, “Functional Programming with Bananas, Lenses, Envelopes and Barbed Wire,” LNCS 523, pp. 124–144, 1991. Introduces catamorphisms, anamorphisms, hylomorphisms, and paramorphisms as recursion operators on algebraic data types — the categorical culmination of Backus’s algebra.
Tatsuya Hagino, “A Categorical Programming Language,” PhD thesis, University of Edinburgh, 1987; arXiv:2010.05167. Presents data types based on F,G-dialgebras, connecting Backus-style definitions to initial/final (co)algebras.
Modern revivals connecting to ML
Christopher Olah, “Neural Networks, Types, and Functional Programming,” blog post, September 2015. The most explicit articulation of the Backus-to-deep-learning connection. Olah identifies: encoding RNNs = folds (Backus’s insert/reduce); generating RNNs = unfolds; general RNNs = scanl; recursive neural networks = catamorphisms. “Every model in deep learning that I am aware of involves optimizing composed functions. I believe this is the heart of what we are studying.”
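Olah's first identification can be checked in a few lines. Assuming a hypothetical tanh cell, an encoding RNN is literally functools.reduce, a fold, over the input sequence:

```python
from functools import reduce
import numpy as np

rng = np.random.default_rng(0)
W_h, W_x = rng.normal(size=(4, 4)), rng.normal(size=(4, 3))

def cell(h, x):                          # one recurrent step
    return np.tanh(W_h @ h + W_x @ x)

xs = rng.normal(size=(10, 3))            # a sequence of 10 input vectors
h_final = reduce(cell, xs, np.zeros(4))  # fold = Backus's insert
```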
Bruno Gavranović, Paul Lessard, Andrew Dudzik, et al., “Position: Categorical Deep Learning is an Algebraic Theory of All Architectures,” ICML 2024, PMLR 235:15209–15241; arXiv:2402.15332. Proposes that all major architectures (RNNs, CNNs, GNNs, transformers) are instances of a single algebraic theory: monad algebras in a 2-category of parametric maps. This is the strongest modern statement that ML is fundamentally algebraic.
3. Reflective and self-modifying computation: towers all the way up
Procedural reflection
Brian Cantwell Smith, Procedural Reflection in Programming Languages, PhD thesis, MIT, 1982 (MIT-LCS-TR-272). Introduced procedural reflection via 3-LISP, where every program is executed by a metacircular interpreter in the same language, forming an infinite tower of meta-interpreters. Three requirements for computational self-reasoning: self-reference, an embedded theory of self, and a causal connection between that theory and the system’s behavior.
Brian Cantwell Smith, “Reflection and Semantics in Lisp,” POPL ‘84, pp. 23–35, 1984. Condensed presentation: “reflective procedures are essentially analogues of subroutines to be run ‘in the implementation’” and “there is not a tower of different languages — there is a single dialect (3-LISP) all the way up.”
Pattie Maes, “Concepts and Experiments in Computational Reflection,” OOPSLA ‘87, pp. 147–155, 1987. Formalized computational reflection: a reflective system is one that “reasons about and acts upon itself.” Distinguished structural reflection (introspecting program structure) from behavioral reflection (modifying runtime semantics).
The reflective tower
Mitchell Wand and Daniel P. Friedman, “The Mystery of the Tower Revealed: A Non-Reflective Description of the Reflective Tower,” Lisp and Symbolic Computation 1(1):11–38, 1988. Provided rigorous denotational semantics for Smith’s infinite tower of metacircular interpreters without using reflection to explain reflection.
Nada Amin and Tiark Rompf, “Collapsing Towers of Interpreters,” Proc. ACM Program. Lang. 2, POPL, Article 52, 2018. Shows how towers of interpreters can be collapsed into a single-pass compiler via multi-level staging — connecting the reflective tower to modern compilation.
Hofstadter: strange loops as fixed points
Douglas R. Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid, Basic Books, 1979. The mathematical content (not merely philosophical): Gödel numbering enables statements about a formal system to be encoded within it — the paradigmatic case of computational self-reference. Hofstadter coined “Quining” for the self-referential construction where a statement is applied to its own quotation. Strange loops defined: “A shift from one level of abstraction to another, which feels like an upwards movement in a hierarchy, and yet somehow the successive ‘upward’ shifts turn out to give rise to a closed cycle.”
The Diagonal Lemma (Gödel 1931, generalized by Carnap 1934): For any formula ψ(x) with one free variable in a sufficiently strong theory T, there exists a sentence φ such that T ⊢ φ ↔ ψ(⌜φ⌝). This is the mathematical formalization of self-reference — closely related to Kleene’s recursion theorem.
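The standard proof is itself a fixed-point construction. A hedged sketch in the usual notation, assuming a substitution function sub representable in T with sub(⌜θ⌝) = ⌜θ(⌜θ⌝)⌝:

```latex
% Define \delta(x) := \psi(\mathrm{sub}(x)) and take \varphi := \delta(\ulcorner\delta\urcorner). Then
\varphi \;=\; \delta(\ulcorner\delta\urcorner)
        \;=\; \psi\bigl(\mathrm{sub}(\ulcorner\delta\urcorner)\bigr)
        \;=\; \psi\bigl(\ulcorner\delta(\ulcorner\delta\urcorner)\urcorner\bigr)
        \;=\; \psi(\ulcorner\varphi\urcorner).
```

The middle step holds provably in T rather than syntactically; the self-application of δ to its own code has the same shape as Y's x x and Kleene's diagonal.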
Fixed-point semantics
Dana S. Scott, “Outline of a Mathematical Theory of Computation,” Technical Monograph PRG-2, Oxford, 1970; “Continuous Lattices,” Lecture Notes in Mathematics 274, 1972. Scott’s domain theory provides the mathematical foundation: his crucial construction solved D ≅ [D → D], creating a domain isomorphic to its own function space — the mathematical form of homoiconicity. The Kleene fixed-point theorem guarantees that for any continuous function f on a complete partial order, the least fixed point exists as ⊔{fⁿ(⊥) : n ≥ 0}.
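On a finite lattice the ascending iteration terminates, so the theorem can be run directly. A Python sketch (lfp and the toy graph are invented for illustration), computing a least fixed point as the limit of Fⁿ(⊥), with ⊥ the empty set:

```python
def lfp(F, bottom=frozenset()):
    # iterate F from bottom until the chain F^n(bottom) stabilizes
    x = bottom
    while (nxt := F(x)) != x:
        x = nxt
    return x

# Reachability as a least fixed point: F(S) = {0} ∪ successors(S)
edges = {0: {1}, 1: {2}, 2: {0}, 3: {4}, 4: set()}
F = lambda S: frozenset({0}) | frozenset(m for n in S for m in edges[n])
print(sorted(lfp(F)))  # [0, 1, 2]
```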
Samson Abramsky and Achim Jung, “Domain Theory,” Handbook of Logic in Computer Science 3:1–168, Oxford University Press, 1994. The authoritative reference on domain theory’s role in denotational semantics.
4. Transformers as universal computers and in-context meta-learners
Turing completeness
Jorge Pérez, Javier Marinković, and Pablo Barceló, “On the Turing Completeness of Modern Neural Network Architectures,” ICLR 2019; arXiv:1901.03429. Proves transformers are Turing complete “based on their capacity to compute and access internal dense representations of data” — no external memory required. Assumes rational-valued activations with arbitrary precision.
Jorge Pérez, Pablo Barceló, and Javier Marinković, “Attention is Turing Complete,” JMLR 22(75):1–35, 2021. Expanded journal version showing a minimal encoder-decoder transformer (single-layer encoder, three-layer decoder) suffices. Model dimension scales as d = 2|Q| + 4|Σ| + 11.
Mostafa Dehghani et al., “Universal Transformers,” ICLR 2019; arXiv:1807.03819. Adds weight-sharing recurrence over depth with Adaptive Computation Time, yielding a practically Turing-complete architecture. “Given sufficient memory the Universal Transformer is computationally universal.”
Angeliki Giannou et al., “Looped Transformers as Programmable Computers,” ICML 2023; arXiv:2301.13196. Shows that a looped transformer with a constant number of layers (13) can emulate a small instruction-set computer. “A single frozen transformer, instructed by its input, can emulate a basic calculator, a basic linear algebra library, and even a full backpropagation, in-context learning algorithm.” The input acts as a “punchcard” — the transformer literally becomes a programmable computer.
In-context learning as implicit optimization
Ekin Akyürek et al., “What Learning Algorithm is In-Context Learning? Investigations with Linear Models,” ICLR 2023; arXiv:2211.15661. Proves constructively that transformers can implement gradient descent and closed-form ridge regression. Trained transformers transition between different classical estimators depending on depth and noise level.
Johannes von Oswald et al., “Transformers Learn In-Context by Gradient Descent,” ICML 2023, PMLR 202:35151–35174; arXiv:2212.07677. Shows formal equivalence between a single linear self-attention layer and one step of gradient descent on a regression loss. “Trained Transformers become mesa-optimizers — i.e. learn models by gradient descent in their forward pass.”
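The core identity is elementary enough to verify numerically. A scalar-output sketch with hypothetical names: one gradient step from W = 0 on the in-context regression loss predicts the query exactly as softmax-free attention does with keys x_i and values y_i.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 3
X = rng.normal(size=(N, d))            # in-context examples
y = X @ rng.normal(size=d)             # their targets
x_q = rng.normal(size=d)               # query point
eta = 0.1

# One GD step from w=0 on L(w) = 1/(2N) * sum_i (w·x_i - y_i)^2
w1 = (eta / N) * y @ X                 # = -eta * gradient at w = 0
pred_gd = w1 @ x_q

# Linear (softmax-free) attention: sum_i value_i * (key_i · query)
pred_attn = (eta / N) * sum(y[i] * (X[i] @ x_q) for i in range(N))

assert np.allclose(pred_gd, pred_attn)
```

The paper's equivalence extends this to trained multi-layer attention; the sketch shows only the one-layer, one-step identity.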
Damai Dai et al., “Why Can GPT Learn In-Context?” Findings of ACL 2023; arXiv:2212.10559. Derives a dual form between transformer attention and gradient descent: attention values serve as “meta-gradients,” and ICL is analogous to applying gradient updates from demonstration examples.
In-context learning as Bayesian inference
Sang Michael Xie et al., “An Explanation of In-context Learning as Implicit Bayesian Inference,” ICLR 2022; arXiv:2111.02080. “In-context learning emerges both theoretically and empirically when the pretraining distribution is a mixture distribution, resulting in the language model implicitly performing Bayesian inference in its forward pass.”
Samuel Müller et al., “Transformers Can Do Bayesian Inference,” ICLR 2022; arXiv:2112.10510. Introduces Prior-Data Fitted Networks (PFNs): a single transformer forward pass approximates Bayesian posterior predictive distributions, achieving >200× speedups over MCMC/VI.
Transformers as meta-learners
Shivam Garg et al., “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes,” NeurIPS 2022; arXiv:2208.01066. Demonstrates transformers encode diverse learning algorithms (matching OLS, Lasso, gradient descent, decision tree learning) within a single forward pass.
Gail Weiss, Yoav Goldberg, and Eran Yahav, “Thinking Like Transformers,” ICML 2021; arXiv:2106.06981. Proposes RASP (Restricted Access Sequence Processing Language) as a computational model for transformer-encoders. “The iterative process of a transformer is then not along the length of the input sequence but rather the depth of the computation.” RASP programs compile into transformer architectures — providing a formal programming-language view of attention as function composition.
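RASP's two core operations are easy to sketch in NumPy: select builds a boolean attention pattern from keys and queries, and aggregate averages values under it. Composing them reverses a sequence, one of the paper's running examples (this sketch assumes every position selects at least one other).

```python
import numpy as np

def select(keys, queries, predicate):
    # boolean attention pattern: A[q, k] = predicate(keys[k], queries[q])
    return np.array([[predicate(k, q) for k in keys] for q in queries])

def aggregate(attn, values):
    # uniform attention: average the selected values at each position
    return (attn / attn.sum(axis=1, keepdims=True)) @ values

tokens = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
idx = np.arange(len(tokens))
# reverse: position q attends to position n-1-q
rev = select(idx, idx, lambda k, q: k == len(tokens) - 1 - q)
print(aggregate(rev, tokens))  # [5. 1. 4. 1. 3.]
```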
5. Differentiable programming as algebraic program transformation
AD as combining form
Barak A. Pearlmutter and Jeffrey Mark Siskind, “Reverse-mode AD in a Functional Framework: Lambda the Ultimate Backpropagator,” ACM TOPLAS 30(2):7:1–7:36, 2008. Shows that reverse-mode AD can be a first-class function in an augmented lambda calculus, achieving closure (the derivative remains in the same language). The most direct bridge between AD and Backus: differentiation is a combining form that transforms programs into programs while preserving essential properties.
Conal Elliott, “The Simple Essence of Automatic Differentiation,” Proc. ACM Program. Lang. 2, ICFP, Article 70, 2018; arXiv:1804.00746. Derives AD from categorical specifications. “Computations expressed in this vocabulary are differentiable by construction thanks to the building blocks of category theory.” AD is a functor — a structure-preserving map between categories. The chain rule becomes composition preservation: D(g∘f) = D(g)∘D(f).
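Elliott's key move is that the bare derivative alone does not compose (D(g∘f) needs f's value as well); pairing each map with its derivative restores literal functoriality. A forward-mode sketch over functions ℝ → ℝ, with Diff a hypothetical name:

```python
from dataclasses import dataclass
from typing import Callable
import math

@dataclass
class Diff:
    f: Callable[[float], float]
    df: Callable[[float], float]           # derivative of f

    def __matmul__(self, other):           # self ∘ other
        return Diff(
            f=lambda x: self.f(other.f(x)),
            df=lambda x: self.df(other.f(x)) * other.df(x),  # chain rule, once
        )

sin = Diff(math.sin, math.cos)
sqr = Diff(lambda x: x * x, lambda x: 2 * x)

h = sin @ sqr                              # h(x) = sin(x²)
print(h.f(1.0), h.df(1.0))                 # sin(1), 2·cos(1)
```

Composition preservation is now exact: the chain rule appears once, inside the definition of ∘.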
Conal Elliott, “Compiling to Categories,” Proc. ACM Program. Lang. 1, ICFP, Article 27, 2017. Typed functional programs can be reinterpreted in any cartesian closed category, enabling automatic translation to AD, hardware circuits, or interval analysis. The most direct modern realization of Backus’s vision.
Fei Wang et al., “Demystifying Differentiable Programming: Shift/Reset the Penultimate Backpropagator,” Proc. ACM Program. Lang. 3, ICFP, Article 96, 2019; arXiv:1803.10228. Reveals a tight connection between reverse-mode AD and delimited continuations (shift/reset). Backpropagation is a specific instance of CPS transformation — another algebraic program transformation.
JAX as Backus’s algebra realized
James Bradbury et al., JAX: composable transformations of Python+NumPy programs, 2018; github.com/jax-ml/jax. JAX’s grad, jit, vmap, and pmap are composable higher-order function transformations — directly analogous to Backus’s combining forms. jit(vmap(grad(loss))) composes three transformations into a single optimized computation. All transformations work through tracing into an intermediate representation (jaxpr) that can be further transformed.
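The composition is a one-liner in current JAX; the loss function and shapes below are illustrative.

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    return jnp.sum((w * x - 1.0) ** 2)

# grad, vmap, jit: three composed program transformations
per_example_grads = jax.jit(jax.vmap(jax.grad(loss), in_axes=(None, 0)))

w = jnp.ones(3)
xs = jnp.arange(12.0).reshape(4, 3)     # a batch of 4 examples
print(per_example_grads(w, xs).shape)   # (4, 3): one gradient per example
```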
Roy Frostig, Matthew James Johnson, and Chris Leary, “Compiling machine learning programs via high-level tracing,” MLSys 2018. Describes JAX’s tracing mechanism: Python functions are captured as graphs of primitive operations, then algebraically optimized via XLA (fusion, constant folding, simplification) — a concrete pipeline of algebraic program transformations.
Category theory in ML
Brendan Fong, David I. Spivak, and Rémy Tuyéras, “Backprop as Functor: A compositional perspective on supervised learning,” LICS 2019; arXiv:1711.10455. Defines a monoidal category Learn where morphisms are “learners” and gradient descent defines a monoidal functor from parametrized Euclidean functions to learners. Composition of learners mirrors Backus’s function composition. The monoidal structure corresponds to parallel (construction) combining forms.
G.S.H. Cruttwell, B. Gavranović, N. Ghani, P.W. Wilson, and F. Zanasi, “Categorical Foundations of Gradient-Based Learning,” ESOP 2022, LNCS 13240; arXiv:2103.01931. Proposes categorical semantics using lenses, parametric maps (Para construction), and reverse derivative categories. Shows that the entire gradient-based learning pipeline — model, optimizer, loss — consists of parametric lenses. Encompasses ADAM, AdaGrad, Nesterov momentum, MSE, and softmax cross-entropy.
Bruno Gavranović, “Fundamental Components of Deep Learning: A category-theoretic approach,” PhD thesis, University of Strathclyde; arXiv:2403.13001, 2024. Identifies two fundamental properties of deep learning — parametric and bidirectional — modeled by actegories and weighted optics. Combining these yields parametric weighted optics with connections to Bayesian updating and game theory.
Dan Shiebler, Bruno Gavranović, and Paul Wilson, “Category Theory in Machine Learning,” arXiv:2106.07032, 2021. Survey noting that “as most modern machine learning systems are inherently compositional, this makes it unsurprising that a number of authors have begun to study them through the lens of category theory.”
Differentiable programming languages
Adam Paszke et al., “Getting to the Point: Index Sets and Parallelism-Preserving Autodiff for Pointful Array Programming,” Proc. ACM Program. Lang. 5, ICFP, Article 88, 2021; arXiv:2104.05372. Dex treats arrays as eagerly-memoized functions on typed index sets, enabling abstract function manipulations (currying) on arrays with parallelism-preserving reverse-mode AD.
6. Neural program synthesis and meta-learning: programs writing programs
Core references
Scott Reed and Nando de Freitas, “Neural Programmer-Interpreters,” ICLR 2016; arXiv:1511.06279. NPI learns to compose lower-level programs to express higher-level ones via a recurrent core with persistent program memory. The program memory is analogous to a LISP environment where function definitions are first-class.
Chelsea Finn, Pieter Abbeel, and Sergey Levine, “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks,” ICML 2017; arXiv:1703.03400. MAML’s bi-level optimization (outer loop optimizes initialization, inner loop adapts) parallels the self-modifying program paradigm: the system modifies its own parameters to become better at modifying itself.
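In JAX the bi-level structure collapses to grad-of-grad, because the inner update is itself a differentiable program. A minimal sketch with invented names and a toy linear-regression task:

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

def adapted_loss(w, task, inner_lr=0.1):
    x_tr, y_tr, x_te, y_te = task
    w_fast = w - inner_lr * jax.grad(loss)(w, x_tr, y_tr)  # inner loop: adapt
    return loss(w_fast, x_te, y_te)                        # evaluate adaptation

meta_grad = jax.grad(adapted_loss)  # outer loop: differentiate through the inner step

x = jax.random.normal(jax.random.PRNGKey(0), (8, 3))
task = (x[:4], x[:4] @ jnp.ones(3), x[4:], x[4:] @ jnp.ones(3))
print(meta_grad(jnp.zeros(3), task))
```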
David Ha, Andrew Dai, and Quoc V. Le, “HyperNetworks,” ICLR 2017; arXiv:1609.09106. One network generates weights for another — the most direct neural analog of LISP’s eval where code produces code. A hypernetwork that generates its own weights would be a neural quine.
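A hypernetwork in miniature, as a hedged sketch (hyper, target, and all shapes are invented): one network's output is another network's weight matrix, and gradients flow through the generator, so training updates θ rather than W directly.

```python
import jax
import jax.numpy as jnp

def hyper(theta, z):
    # generator: task embedding z in, target-network weights out
    return jnp.tanh(theta @ z).reshape(4, 3)

def target(W, x):
    return jnp.tanh(W @ x)

theta = jax.random.normal(jax.random.PRNGKey(0), (12, 2))
z = jnp.array([1.0, -1.0])

W = hyper(theta, z)                     # weights produced as data
y = target(W, jnp.ones(3))

# end-to-end: differentiate the target's output w.r.t. the generator
g = jax.grad(lambda t: jnp.sum(target(hyper(t, z), jnp.ones(3)) ** 2))(theta)
```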
Barret Zoph and Quoc V. Le, “Neural Architecture Search with Reinforcement Learning,” ICLR 2017; arXiv:1611.01578. Frames architecture design as sequence generation, where an RNN controller generates architecture descriptions as strings (essentially serialized programs). NAS is fundamentally program search.
Esteban Real et al., “Regularized Evolution for Image Classifier Architecture Search,” AAAI 2019; arXiv:1802.01548. Treats architectures as mutable symbolic structures subject to transformation — analogous to LISP macros transforming S-expressions.
Symbolic-neural hybrid synthesis
Kevin Ellis et al., “DreamCoder: Growing Generalizable, Interpretable Knowledge with Wake-Sleep Bayesian Program Learning,” PLDI 2021; arXiv:2006.08381. Operates in a functional language with LISP-like primitives (car, cdr), grows its own language by learning new abstractions, and uses neural recognition to guide search. The library learning phase is meta-programming: the system rewrites its own standard library.
Alexander L. Gaunt et al., “TerpreT: A Probabilistic Programming Language for Program Induction,” arXiv:1608.04428, 2016. Neural TerpreT (NTPT) creates hybrid programs calling learned neural subroutines — a differentiable interpreter, essentially a neural eval.
Matej Balog et al., “DeepCoder: Learning to Write Programs,” ICLR 2017; arXiv:1611.01989. Neural prediction of which functional programming primitives (map, filter, scanl) appear in a program, guiding search over functional programs.
Swarat Chaudhuri et al., “Neurosymbolic Programming,” Foundations and Trends in Programming Languages 7(3):158–243, 2021. Comprehensive survey bridging deep learning and program synthesis: functions as programs using neural modules alongside symbolic primitives.
Ziyang Li, Jiani Huang, and Mayur Naik, “Scallop: A Language for Neurosymbolic Programming,” PLDI 2023; arXiv:2304.04812. Combines Datalog-based logic programming with differentiable reasoning via provenance semirings.
7. Self-referential neural computation: Schmidhuber and beyond
Schmidhuber’s lineage
Jürgen Schmidhuber, “A ‘Self-Referential’ Weight Matrix,” Proc. ICANN ‘93, pp. 446–450, Springer, 1993. Derived a gradient-based learning algorithm for a network that can “speak about its own weight matrix in terms of activations” — special input/output units for “explicitly analyzing and manipulating all of its own weights, including those weights responsible for analyzing and manipulating weights.”
Jürgen Schmidhuber, “Gödel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements,” arXiv:cs/0309048, 2003; published in Goertzel & Pennachin (eds.), Artificial General Intelligence, Springer, 2007, pp. 199–226. “The first class of mathematically rigorous, general, fully self-referential, self-improving, optimally efficient problem solvers.” A Gödel machine “rewrites any part of its own code as soon as it has found a proof that the rewrite is useful.” The Global Optimality Theorem proves self-rewrites are globally optimal. Schmidhuber noted the machine is “‘conscious’ or ‘self-aware’ in the sense that its entire behavior is open to self-introspection, and modifiable… And this type of total self-reference is precisely the reason for its optimality.”
Kazuki Irie, Imanol Schlag, Róbert Csordás, and Jürgen Schmidhuber, “A Modern Self-Referential Weight Matrix That Learns to Modify Itself,” ICML 2022, PMLR 162; arXiv:2202.05780. Makes the 1993 vision practical using Fast Weight Programmers and linear Transformers. “The WM of a self-referential NN can keep rapidly modifying all of itself during runtime. In principle, such NNs can meta-learn to learn, and meta-meta-learn to meta-learn to learn, and so on, in the sense of recursive self-improvement.”
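A toy sketch of the idea, deliberately not Irie et al.'s exact parameterization or update rule: one matrix emits, for each input, an output plus a self-generated learning rate, then rewrites all of itself with a delta-rule-style outer product, including the row that produced the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 5, 6
W = 0.1 * rng.normal(size=(n_out, n_in))

def step(W, x):
    o = W @ x
    y, beta_logit = o[:-1], o[-1]
    beta = 1.0 / (1.0 + np.exp(-beta_logit))   # self-generated learning rate
    W = W + beta * np.outer(np.tanh(o), x)     # rewrite every row of W,
    return W, y                                # including the beta-producing row

x = rng.normal(size=n_in)
for _ in range(3):
    W, y = step(W, x)   # each pass modifies the modifier
```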
Louis Kirsch and Jürgen Schmidhuber, “Eliminating Meta Optimization Through Self-Referential Meta Learning,” arXiv:2212.14392, 2022. Proves that self-referential architectures require parameter sharing (N² weights derived from N activations). Proposes Fitness Monotonic Execution for self-modification without explicit meta-optimization.
Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber, “Linear Transformers Are Secretly Fast Weight Programmers,” ICML 2021. Demonstrates that linear transformers are equivalent to fast weight programmers from the 1990s — connecting Schmidhuber’s self-referential lineage directly to modern transformer architectures.
Self-modifying architectures (2022–2025)
Xunjian Yin et al., “Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement,” arXiv:2410.04444, 2024. LLM-based agent that recursively modifies its own logic and behavior, inspired by Schmidhuber’s Gödel Machine.
Andrea Ferigo and Giovanni Iacca, “Self-Building Neural Networks,” GECCO ‘23; arXiv:2304.01086. Networks construct their own structure during task execution through Hebbian learning and pruning.
Louis Kirsch and Jürgen Schmidhuber, “Meta Learning Backpropagation And Improving It,” NeurIPS 2021; arXiv:2012.14905. Variable Shared Meta Learning (VSML): a neural network whose weights are replaced by tiny LSTMs can implement backpropagation purely in its forward dynamics — the network learns its own learning algorithm.
8. Clojure/LISP + ML: the practical intersection
Clojure ecosystem
The Clojure ML ecosystem includes Cortex (neural network construction/training by Thinktopic), Deep Diamond (tensor computation with GPU via Intel MKL-DNN and Nvidia cuDNN, by Dragan Djuric), scicloj.ml (composable ML pipelines), clj-djl (Amazon Deep Java Library wrapper), and DL4J/DL4CLJ (JVM deep learning). Djuric’s blog series “Deep Learning in Clojure from Scratch to GPU” demonstrates implementing neural networks from scratch in “a couple dozen lines of Clojure code.” The Python interop library libpython-clj enables direct use of TensorFlow/PyTorch from Clojure.
Common Lisp ML
MGL (by Gábor Melis) supports backpropagation neural networks, Boltzmann machines, and Gaussian processes. TH provides autograd-based deep learning with pretrained model support (VGG16, ResNet50, etc.). NeuraLisp offers modular ML with tensor operations, autograd, and GPU support. Yann LeCun’s Lush (Lisp Universal Shell) was a Lisp dialect specifically designed for ML research.
Scheme + differentiable programming
Pearlmutter and Siskind’s STALINGRAD compiler (Stalin∇) compiles VLAD, a Scheme-like language with first-class AD, to efficient native code. Jeffrey Mark Siskind, “Scheme as a Framework for Deep Learning,” ICFP Scheme Workshop 2021. Directly argues that Scheme’s properties make it suitable as a DL framework.
Critical hybrid systems
Garrett E. Katz et al., “NeuroLISP: High-Level Symbolic Programming with Attractor Neural Networks,” Neural Networks 146:230–251, 2022. Implements a LISP interpreter in neural substrate using attractor dynamics — temporally-local working memory, compositional data structures, scoped variable binding, and programs-as-data manipulation. Achieves perfect performance on PCFG SET benchmark. This is the most direct bridge: neural networks natively supporting LISP’s code-as-data paradigm.
Chen Liang et al., “Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision,” ACL 2017, pp. 23–33. Combines a seq2seq neural “programmer” with a symbolic “computer” that is a Lisp interpreter. The neural network learns to write Lisp programs from weak supervision.
“From Tool Calling to Symbolic Thinking: LLMs in a Persistent Lisp Metaprogramming Loop,” arXiv:2506.10021, 2025. Augments LLMs with a persistent Common Lisp REPL for dynamic tool construction, reflective reasoning, and metaprogramming.
Eshkol (github.com/tsotchke/eshkol) is a high-performance Lisp-like language with native AD (symbolic, forward-mode, reverse-mode plus vector calculus operators). Lambda functions maintain S-expression representations through a runtime registry — directly implementing the thesis concept of homoiconic differentiable programming.
Neural Networks in Pure Lisp (SectorLISP): Hikaru Ikuta (2022) implemented a neural network in SectorLISP — a 512-byte boot sector Lisp interpreter — without built-in numbers, using only symbolic manipulation of atoms and lists, constructing fixed-point arithmetic and matrix multiplication from first principles.
9. Philosophical-computational self-awareness as framing device
Daniel C. Dennett, The Intentional Stance, MIT Press, 1987. Defines three stances (physical, design, intentional) for predicting system behavior. Any system whose behavior is best predicted by attributing beliefs and desires is an “intentional system, whatever its innards.” Applied to modern ML: Cominelli et al. (2021) argue the intentional stance is “the only viable strategy” for understanding complex AI behavior.
Douglas R. Hofstadter, I Am a Strange Loop, Basic Books, 2007. Refines the strange loop concept: self-awareness emerges from self-referential loops in symbol systems. Formal claim: “The pronoun ‘I’ doesn’t involve a stronger self-reference than Gödel’s construction.” Argues for downward causality — high-level self-referential patterns can be causally efficacious.
Thomas Nagel, “What Is It Like to Be a Bat?” The Philosophical Review 83(4):435–450, 1974. Argues that subjective experience has an irreducibly first-person character that objective description cannot capture. Serves as a foil: while a self-modeling system can represent its own states, Nagel’s argument marks the boundary of what computational self-reference can achieve.
Anthropic, “Emergent Introspective Awareness in Large Language Models,” Transformer Circuits, 2025. Finds that Claude models can detect concepts injected into their internal activations at rates well above chance, while control runs without injection yield almost no false reports. Post-training is key to eliciting introspective awareness.
“Large Language Models Report Subjective Experience Under Self-Referential Processing,” arXiv:2510.24797, 2025. Self-referential prompting consistently elicits structured experience reports across model families; cross-model semantic convergence occurs. Connects directly to Hofstadter’s Gödelian constructions.
How the pieces connect: the central thesis architecture
The nine areas form a precise intellectual pipeline supporting both papers:
Foundation layer (Papers 1 & 2): McCarthy’s homoiconicity + Kleene’s recursion theorem + Scott’s fixed-point semantics establish that self-reference is both practically accessible (LISP) and mathematically inevitable (any sufficiently expressive system).
Algebraic layer (Papers 1 & 2): Backus’s FP algebra + Bird-Meertens formalism + categorical semantics show that programs compose through a closed algebra of combining forms with equational laws for transformation.
Differentiable layer (Paper 2 primarily): JAX’s functional transformations + Elliott’s AD-as-functor + Fong-Spivak’s backprop-as-functor demonstrate that gradient-based learning is an algebraic program transformation system preserving compositional structure — Backus’s vision realized for differentiable computation.
Meta-learning layer (Papers 1 & 2): In-context learning as implicit gradient descent (von Oswald, Akyürek) + transformers as programmable computers (Giannou) + Turing completeness (Pérez) show transformers implement learning algorithms in their forward pass — they are mesa-optimizers, self-modifying computational systems.
Self-reference layer (Papers 1 & 2): Schmidhuber’s self-referential weight matrices + Smith’s reflective tower + hypernetworks demonstrate that self-reference enables recursive self-improvement in neural systems.
Synthesis layer (Paper 2 experiments): DreamCoder + NeuroLISP + Eshkol + Neural Symbolic Machines prove the practical viability of representing neural computation in LISP-like structures. The proposed experiments — neural networks as S-expressions, architecture search as symbolic transformation, a neural metacircular evaluator — sit at this intersection.
The missing piece that both papers can articulate: no existing system fully unifies homoiconicity, differentiability, and self-modification in a single framework. DreamCoder comes closest but its neural and symbolic components remain separate. A fully homoiconic differentiable architecture — where the network’s structure is represented in the same format it processes, and gradient-based optimization operates on this self-representation — would close the loop from McCarthy through Backus to modern transformers, realizing computationally what Hofstadter described philosophically: a strange loop where the system’s representation of itself is causally connected to its own operation.