Will We Ever Illuminate the Black Box?

February 13, 2026

I recently read through The Bitter Lesson, an essay by Richard Sutton, widely considered to be the “father” of reinforcement learning, about why the incorporation of human knowledge into AI has been unhelpful or possibly even detrimental to its long-term development. A bitter lesson indeed—nobody wants to believe themselves obsolete, least of all the very people working to bring about Artificial Superintelligence (ASI). I certainly didn’t want to accept it when I first read Sutton’s essay, but the more I looked back over my past experiences with machine learning, the more I realized he might have a point.

Back in my day…

It was my final year of high school, and we were all required to choose a “senior research project” that would be the capstone of our youth. Like any other teenager, I was overconfident in my abilities and hopelessly confused about what I wanted to do with my life. So I joined the biotechnology lab wanting to do a machine learning project. What project? I had no idea.

Thankfully, high school teachers have a pretty good sense of how to deal with confused and overconfident teenagers. I was assigned to a pre-existing partnership with the National Park Service in which students would collect surface samples from various national monuments in D.C. We used next-generation sequencing to identify and taxonomically classify the microbes that were living on the marble.

I took great issue with that last part. My (very flawed) understanding of biology had led me to believe that microbial life is not and ought not to be differentiated into discrete taxonomic bins. Nor should individual microbes be considered in isolation from the communities in which they naturally exist. When I visualized in my mind what a microbial community looked like at the genetic level, I did not see a taxonomic table.

What did I see? I made it my mission to communicate that to the world visually, by leveraging machine learning. Specifically, I wanted to turn the “ugly” textual representations of nucleotide sequences into beautiful mathematical representations with inherent interpretability. The idea I came up with was to train a neural network to predict a numerical vector for each DNA sequence, such that the distance between the vectors would mirror the real-world edit distance between the sequences.
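
For the curious, a minimal sketch of that objective (in modern terms, and definitely not my original code) might look like the following. The encoder here is a throwaway placeholder, and the sequence length, alphabet size, and embedding width are made-up values; the point is only the loss, which pushes the distance between two embeddings toward the true edit distance of the underlying sequences.

```python
import torch
import torch.nn as nn

SEQ_LEN, ALPHABET, EMBED_DIM = 150, 4, 64  # made-up sizes, for illustration only

# Placeholder encoder: any model that maps a one-hot sequence to a fixed-length
# vector could stand in here.
encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(SEQ_LEN * ALPHABET, 256),
    nn.ReLU(),
    nn.Linear(256, EMBED_DIM),
)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

def training_step(seq_a, seq_b, edit_dist):
    """seq_a, seq_b: (batch, SEQ_LEN, ALPHABET) one-hot tensors;
    edit_dist: (batch,) precomputed edit distances between each pair."""
    # The distance between the embeddings should mirror the true edit distance.
    pred_dist = torch.norm(encoder(seq_a) - encoder(seq_b), dim=-1)
    loss = nn.functional.mse_loss(pred_dist, edit_dist)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```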

The problems began when I actually tried implementing this. Firstly, I had never done any machine learning before, let alone taken a course on it, and I had badly misguided ideas about what actually was and wasn’t possible for neural networks. I knew nothing about convolutions, transformers, or even the mathematical basis of the perceptron. I had no idea what gradient descent was or what I was doing when I defined an activation function. Not to mention, ChatGPT didn’t exist yet.

The very first thing I tried was a simple stack of dense layers. Predictably, this did not work very well for sequential data. I read somewhere that convolutions were an option, so I added some convolution layers (and was very confused when the output size was different from the input size). Then I discovered the LSTM and bidirectional LSTM, which led to further improvements. And finally attention (which I could not for the life of me wrap my head around). At this point, I was out of easily accessible architectures. So I took the dumbest possible next step: I threw them all together and made everything bigger. Turns out, that worked better than anything else I’d tried, and at the time I essentially chalked it up to “ML magic!”
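
For a rough sense of what “throwing them all together” looked like, here is a sketch of that kind of kitchen-sink encoder (placeholder dimensions, and not the original code): convolutions for local motifs, a bidirectional LSTM for longer-range context, self-attention on top, and everything pooled into a single embedding.

```python
import torch
import torch.nn as nn

class KitchenSinkEncoder(nn.Module):
    """Convolutions, a bidirectional LSTM, and self-attention, mean-pooled
    into one embedding vector per sequence."""

    def __init__(self, alphabet=4, hidden=128, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            # padding=3 with kernel_size=7 keeps the sequence length unchanged
            # (the detail that confused me back then).
            nn.Conv1d(alphabet, hidden, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(2 * hidden, embed_dim)

    def forward(self, x):  # x: (batch, seq_len, alphabet), one-hot encoded
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # Conv1d wants channels first
        x, _ = self.lstm(x)            # (batch, seq_len, 2 * hidden)
        x, _ = self.attn(x, x, x)      # self-attention over sequence positions
        return self.head(x.mean(dim=1))  # mean-pool into one embedding vector

encoder = KitchenSinkEncoder()
dummy = torch.randn(2, 150, 4)         # stand-in for a batch of one-hot sequences
print(encoder(dummy).shape)            # torch.Size([2, 64])
```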

You can see the final result of this exploration on my projects page. But I discuss it again here to highlight that, despite having effectively 0 domain knowledge about DNA or even the inner workings of neural networks, I was still able to design an effective learning system by throwing together fundamental architectural units and letting it train on a large dataset. And now, watching the latest developments in LLMs and foundation models, it seems that much smarter and better-resourced people than myself are doing exactly the same thing, and seeing similarly effective results. Moreover, even the process of designing neural network architectures is being automated, and I suspect it won’t be long before these frameworks begin to outperform the average human machine learning engineer. Hell, you don’t even have to look that far; Claude Code could probably do a half-decent job of designing, training, and deploying machine learning models entirely on its own.

So where does the Bitter Lesson leave us humans? If scaling up simple algorithms over massive compute is all that’s required for developing ASI, what are we to do with all the human knowledge and expertise we’ve built up? Moreover, if AI systems become even less dependent on human knowledge, does that not also imply a further decrease in our ability to understand them? That raises questions of safety, and of whether we can really trust these systems to provide medical advice, monitor critical infrastructure, or, in the extreme case, not cause human extinction (unintentionally or otherwise).

Interpretability is key

Nobody can predict what the world will look like once ASI is achieved, least of all a still-very-confused undergraduate. But I do believe that understanding how neural networks learn, represent information, and make decisions is an essential first step toward safely and harmoniously co-existing with ASI. If nothing else, mechanistic interpretability (MI) is a problem that can only be solved through the accumulation of human knowledge, as we would otherwise be unable to validate the claims an ASI makes about its own behavior, or would need to trust conclusions drawn by other black-box systems (leaving us exactly where we started: unsure about failure modes and vulnerable to hallucinations and misalignment).

What I can tell you—and perhaps this is the biologist in me slipping out—is that we need to stop treating LLMs like mathematical optimization algorithms and start treating them like emergent natural systems. After all, the world already has highly complex emergent black-box systems that humans are only beginning to understand, capable of both great utility and great harm: cellular life. We’ve spent hundreds of years trying to untangle its mysteries, and especially in the last few decades with DNA sequencing, we’ve begun to make serious headway.

The very textual representations of DNA that I decried as “ugly” in high school are indeed one of the great marvels of modernity. The fact that we can directly read out on a computer screen the genetic code of an organism too small to see, hear, smell, touch, or taste is nothing short of astounding. And we have all manner of algorithms out there for further improving the presentation of this information, to the point that geneticists can diagnose hereditary diseases at a glance and prescribe life-saving treatments.

It seems to me that MI today is in a pre-sequencing dark age. Despite being able to see the activation values and gradients at every single neuron of an LLM under all manner of data inputs, we have yet to develop a comparable system for synthesizing that information into a format that our limited brains can comprehend and reason about. We do not understand why neural networks generalize well in some cases but hallucinate in others. We attempt to achieve “safety” through external frameworks of censorship rather than identifying and addressing the root causes of undesirable behavior at the circuit level.
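
To be fair, the raw measurements really are easy to get. The sketch below (using gpt2 purely as an example; any PyTorch module works the same way, and gradients would additionally require a backward pass) captures every submodule’s activations for a single forward pass. The hard part is everything that comes after: turning that flood of numbers into something a human can reason about.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example model; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        # Some modules return tuples or output objects; keep only plain tensors.
        out = output[0] if isinstance(output, tuple) else output
        if isinstance(out, torch.Tensor):
            activations[name] = out.detach()
    return hook

# Register a forward hook on every submodule of the model.
for name, module in model.named_modules():
    module.register_forward_hook(save_activation(name))

inputs = tokenizer("The black box is", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

print(f"Captured activations from {len(activations)} modules")
```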

The science of interpretability (and lack thereof)

LLMs are really not that complex when compared to biological systems. Anybody can make an exact copy of the hardware and software environment required to run an LLM, whereas determining proper culturing conditions remains one of the biggest challenges in studying microbes. No microscope is needed to examine a digital neuron, and experiments can be conducted with massive parallelization in a fraction of the time needed by biology labs. The challenge that remains is explaining the emergence of complex behaviors from simple primitives and massive scale, a problem also shared by systems biologists.

How do biologists attempt to solve this? At the most basic level, through hypothesis-driven experimentation. The philosophical principle underlying all scientific discovery is the generation of falsifiable hypotheses, the testing of those hypotheses via experimentation, and the updating of our theories of how a system works as the empirical evidence refutes our preconceptions. Of critical importance are statistical significance measures, replication studies, and meta-analyses which aggregate the results from hundreds of studies to broadly evaluate scientific ideas. In my (admittedly limited) experience reading the literature across AI subdomains, I have observed a general lack of scientific rigor in our attempts to understand neural networks, as compared to the standards applied in biology.

Because of the massive costs involved in running cutting-edge language models, much of our best MI research comes from private AI labs (particularly OpenAI and Anthropic) who refuse to open-source their models; without replicability, our understanding of LLMs is no better than blind faith in the words broadcast from a silicon tower. Moreover, the market incentive for AI companies is fundamentally misaligned with the public’s goals of ensuring AI is safe, unbiased, and open for all. If there’s one broader lesson I’ve learned from my experience in machine learning, it’s that Charlie Munger was 100% correct: “Show me the incentive, and I’ll show you the outcome.” The free market demands that companies do everything they can to grow and continue growing, lest they be replaced by more ruthless competitors. Without the ability for independent researchers to replicate their findings, private labs could easily manipulate their research publications to downplay AI safety concerns (or play them up to generate hype).

And even for peer-reviewed research on open-source models, there is a troubling lack of focus on statistical validity. Not along the data dimension (as there is usually so much data that significance can be assumed), but across instantiations of the same model. If the models themselves are the objects of study (as is usually the case in both MI research and new architecture proposals), why is a sample size of one so often considered acceptable? Compute limitations are certainly an understandable constraint, but even then, it would be more informative to train three models at a third of the size than one large model that achieves a cosmetic improvement on a benchmark.

Interest in MI also tends to follow the hottest trend on any given day. For now, everyone wants to know how LLMs work. But even much older, less performant neural networks still evade our understanding. ResNet (an early image classification model) was developed a decade ago and was all anyone was talking about at the time. Despite the MI community having made some real progress since then toward understanding how it functions, ResNet receives only a fraction of the attention that’s paid to LLMs today. LLMs are of course the more interesting and higher-stakes object of study, but there are many general principles and analytical tools that could arise from continued analysis of ResNet, were more effort directed toward it.

Part of the problem is a lack of patience, which I believe stems from the Silicon Valley culture that dominates modern computing. Researchers seek big, fast results that can rapidly be converted into marketable products and business impact. Yet almost no other scientific discipline operates this way. Microbiologists will spend their entire careers studying the behavior of a single organism, seeking to understand it at the genomic, transcriptomic, proteomic, and metabolomic levels. Well-studied model organisms serve as comparison points for more interesting, non-standard behavior in understudied organisms. It is only because of this foundational work that we have begun to understand microbial communities as complete systems. If we had instead dismissed the study of less interesting or less successful organisms, countless antibiotics and cancer therapies hiding in Earth’s microbial dark matter might never have been discovered, and we might never have learned the general principles which aided us in our study of more interesting microbes.

How can we do better?

I have personally always looked up to the “move fast and break things” attitude of Silicon Valley, and the “just build it” mindset is what drew me toward computing in the first place. I am certainly not suggesting that these things need to be abandoned or even toned down; they are some of the most admirable qualities our field possesses. Scientists are slow, scientists are inefficient, and scientists are often wrong (sometimes catastrophically). But what scientists do have is philosophical unity under the framework of hypothesis-driven testing, and centuries of experience in its application to understanding natural systems. From that, there is much we can learn.

The very first priority in making MI more scientific is to define some standardized model organisms. Not toy one-layer transformers that are conveniently scoped for a single paper; truly complex, well-established neural architectures exhibiting some well-profiled emergent capabilities, ideally with many trained variants to offer a large sample size. Existing model zoo projects, which provide thousands of trained image classifiers under different hyperparameters, are a step in the right direction, but need to be expanded to additional problem domains (object detection, natural language processing, etc.). Most importantly, these projects need to actually be used by MI researchers in a standardized way.

Secondly, AI research more broadly needs to drastically improve in statistical rigor. Demonstrating that a single trained system with finely-tuned hyperparameters can beat state-of-the-art performance on a benchmark should not be sufficient for publication in a peer-reviewed journal or conference. Proposals should quantify hyperparameter sensitivity and demonstrate training stability over multiple attempts on different data subsets. Statistical significance must be a requirement for new architectures that claim to beat state-of-the-art performance.
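
As a simplified illustration of what I mean: compare two architectures across several independently seeded training runs and report an actual significance test, rather than a single best-case number. The scores below are made up, and a real evaluation would also vary the data splits, but the principle is the same.

```python
import numpy as np
from scipy import stats

# Benchmark accuracy from five training runs per architecture, differing only
# in random seed (the numbers here are placeholders, not real results).
baseline = np.array([0.842, 0.851, 0.839, 0.847, 0.845])
proposed = np.array([0.853, 0.849, 0.861, 0.844, 0.856])

# Welch's t-test: does the proposed architecture beat the baseline beyond
# what seed-to-seed noise can explain?
t_stat, p_value = stats.ttest_ind(proposed, baseline, equal_var=False)
print(f"baseline: {baseline.mean():.3f} ± {baseline.std(ddof=1):.3f}")
print(f"proposed: {proposed.mean():.3f} ± {proposed.std(ddof=1):.3f}")
print(f"Welch's t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
```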

Thirdly, we need a public funding arrangement for MI research in the same way that we have one for basic science research. We cannot allow ourselves to slip into complacency over the gradual takeover of MI research by corporate groups with clear perverse incentives. Publicly funded academic research remains the best option for ensuring scientific rigor, as it has been for hundreds of years. If we defer this cost, we will not feel the pain until it’s far too late.

Finally, it is critical that we bring in more people from outside the field who have a different set of perspectives and experiences. This, perhaps, is an even more bitter lesson. As the old saying goes (usually attributed to the psychologist Abraham Maslow), when all you have is a hammer, everything starts looking like a nail. Our hammer is computing, and it can be difficult for us to recognize that there are other ways of solving problems. I know personally how challenging it can be to work between science and computing. You speak different languages, have different priorities, and possess different skills. But bridging this divide is non-negotiable if MI is to become a serious scientific discipline. What we most need to bring over is the most difficult thing of all: the scientists’ philosophical basis, the very lens through which they perceive the world. We will need many, many more of these hard conversations and difficult collaborations to achieve that.

Will we ever illuminate the black box?

Personally, I think it’s absolutely possible. I see MI’s current struggles as caused by a lack of time (large-scale neural networks have only been around to study for 1-2 decades; DNA sequencing was only invented hundreds of years after the discovery of cells), a lack of manpower (the intersection of people drawn to the Silicon Valley culture, and those with enough of an interest in theory and safety to care about interpretability, is much too small), and a lack of imagination (there simply hasn’t been enough crossover from the sciences into AI research). All of these will improve with time, as people reallocate themselves to fill the gap.

The better question is whether we’ll solve this problem before ASI is developed. On that, I am really not sure. If we accept the Bitter Lesson, the primary limiting factor of progress in AI capabilities will be the scaling up of compute. To keep up, MI research either needs to grow equally exponentially (which is fairly unlikely), or needs a major breakthrough that lets us interpret neural networks accurately regardless of their scale. Such breakthroughs have absolutely happened before (like the invention of microscopes), but it’s hard to put a number on how long that could take.

This presents yet another argument for people to get involved in MI. Statistically, we have a greater chance of happening upon that crucial breakthrough the more researchers we have working toward it. We just have to hope that AI does not cause major problems until such time as scale-invariant interpretation of neural networks is a solved problem. If trying to explain your years of work in the natural sciences to a vibe-coder doesn’t sound too appealing, perhaps the non-negligible probability of human extinction if we fail to understand AI might tip the scales.

Comments

All comments have a 300 character minimum to promote meaningful contributions and civil discussion. If this criterion is not met, your comment will be permanently deleted without notice. You are responsible for ensuring your contribution meets the character minimum before submitting. I look forward to reading your thoughts, feedback, and criticism!