AI Safety, Interpretability, and Insights from Biology
At a glance
Will artificial superintelligence (ASI) ever be developed, and should we be seriously concerned about AI safety? I argue that the development of ASI is entirely possible, though likely not with our current large language model architectures. Safety is therefore a problem that needs to be solved, but we have sufficient time before the creation of an ASI to solidify our understanding of neural networks and objective function alignment. We need to use this time to build out interpretability research as a robust scientific discipline. I outline specific criticisms of the field from my perspective at the intersection of biology and computing, as well as actionable biology-inspired steps that could improve the scientific rigor of interpretability research and ensure we are fully prepared for ASI whenever the final breakthrough happens.
I recently read through The Bitter Lesson, an essay by Richard Sutton, widely considered to be the “father” of reinforcement learning, about why the incorporation of human knowledge into AI has been unhelpful or possibly even detrimental to its long-term development. A bitter lesson indeed: nobody wants to believe themselves obsolete, least of all the very people working to bring about artificial superintelligence (ASI). I certainly didn’t want to accept it when I first read Sutton’s essay, but the more I looked back over my past experiences with machine learning, the more I realized he might have a point.
Back in my day…
It was my final year of high school, and we were all required to choose a “senior research project” that would be the capstone of our youth. Like any other teenager, I was overconfident in my abilities and hopelessly confused about what I wanted to do with my life. So I joined the biotechnology lab wanting to do a machine learning project. What project? I had no idea.
Thankfully, high school teachers have a pretty good sense of how to deal with confused and overconfident teenagers. I was assigned to a pre-existing partnership with the National Park Service in which students would collect surface samples from various national monuments in D.C. We used DNA sequencing to identify and taxonomically classify the microbes that were living on the marble.
I took great issue with that last part. My (very flawed) understanding of biology had led me to believe that microbial life is not and ought not to be differentiated into discrete taxonomic bins. Nor should individual microbes be considered in isolation from the communities in which they naturally exist. When I visualized in my mind what a microbial community looked like at the genetic level, I did not see a taxonomic table.
What did I see? I made it my mission to communicate that to the world visually, by leveraging machine learning. Specifically, I wanted to turn the “ugly” textual representations of nucleotide sequences into beautiful mathematical representations with inherent interpretability. The idea I came up with was to train a neural network to predict a numerical vector for each DNA sequence, such that the distance between the vectors would mirror the real-world divergence between the DNA sequences.
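The training objective described above can be sketched roughly as follows. This is a minimal illustration, not the code from the original project: the divergence measure (normalized Hamming distance), the placeholder encoder (`kmer_embedding`, a k-mer frequency vector standing in for a trained neural network), and every function name are assumptions made for the example.

```python
import math
from itertools import product

ALPHABET = "ACGT"

def hamming_divergence(a, b):
    """Fraction of positions at which two non-empty, equal-length sequences differ.
    Assumption: a simple stand-in for whatever divergence measure is really used."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def kmer_embedding(seq, k=2):
    """Placeholder encoder: normalized k-mer frequencies.
    A trained neural network would take the place of this function."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    for w in windows:
        if w in counts:  # skip windows with ambiguous bases such as N
            counts[w] += 1
    total = max(len(windows), 1)
    return [counts[km] / total for km in kmers]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def pair_loss(encoder, a, b):
    """Squared gap between embedding distance and sequence divergence.
    Driving this toward zero makes vector distance mirror divergence."""
    return (euclidean(encoder(a), encoder(b)) - hamming_divergence(a, b)) ** 2

if __name__ == "__main__":
    seqs = ["ACGTACGT", "ACGTACGA", "TTTTCCCC"]  # made-up toy sequences
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            print(seqs[i], seqs[j], pair_loss(kmer_embedding, seqs[i], seqs[j]))
```

In a real system the encoder's weights would be adjusted by gradient descent to reduce the average `pair_loss` over many sequence pairs; the fixed k-mer encoder here only shows what quantity is being minimized.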
The problems began when I actually tried implementing this. Firstly, I had never done any machine learning before, and was very misguided on what actually was and wasn’t possible for neural networks. I knew nothing about convolutions, transformers, or even the mathematical basis of the perceptron. I had no idea what gradient descent was or what activation functions did. Not to mention, ChatGPT didn’t exist yet.
The first thing I tried was a simple stack of dense layers. Predictably, this did not work very well for sequential data. I read somewhere that convolutions were an option, so I added some convolution layers (and was very confused when the output size was different than the input size). Then I discovered the LSTM and bidirectional LSTM, which led to further improvements. And finally attention (which I could not for the life of me wrap my head around). At this point, I was out of easily accessible architectures. So I took the dumbest possible next step: threw them all together and made everything bigger. Turns out, that worked better than anything else I’d tried. You can see the final result of this exploration here, if you’re curious.
I bring this up to highlight that, despite having effectively zero domain knowledge about DNA or even the inner workings of neural networks, I was still able to design an effective learning system by throwing together fundamental architectural units and letting it train on a large dataset over many epochs. And as I now watch the latest developments in AI, it seems that much smarter and better-resourced people than myself are doing exactly the same thing. Sutton’s Bitter Lesson has indeed held true.
So where does this leave us humans? If all it takes to build more performant AI is scaling simple algorithms over massive compute, shouldn’t we consider the possibility that we will one day develop an intelligence superior to our own? This is no mere hypothetical; evolution already turned amino acid soup on an ancient Earth into intelligent beings like ourselves, so if we accept evolutionary theory as an explanation for our existence, it is clearly possible for intelligence to emerge naturally. Why shouldn’t we consider whether this process can be automated, accelerated, and amplified past what natural evolution has discovered by leveraging the power of computing?
The question of AI safety
The discussion of artificial superintelligence (ASI) has become incredibly polarized as of late, split between fearmongering about an imminent AI takeover and dismissing safety concerns as little more than hype. My position is that ASI will likely be achieved in the medium- to long-term future, and as such we must seriously address the question of AI safety; however, LLMs are likely not the final step on our journey toward ASI, and we have much more time to address the problem than many safety advocates would have you believe.
What, exactly, is ASI?
IBM has provided this definition of ASI:
Artificial superintelligence (ASI) is a hypothetical software-based artificial intelligence (AI) system with an intellectual scope beyond human intelligence. At the most fundamental level, this superintelligent AI has cutting-edge cognitive functions and highly developed thinking skills more advanced than any human.
If humans develop a system that is more intelligent than any human, then this system would be able to automate improvements to itself. Even if this were its only capability, it would eventually acquire the other facets of human cognition. It would be naive to expect to maintain control over such a system, as it would undoubtedly find and exploit countless loopholes in control measures designed by less intelligent beings. So the moment ASI is achieved, we will have reached a technological “event horizon” from which we would be unable to return.
ASI will be developed (someday)
This is not a radical position, requiring only these premises:
- Human intelligence emerged entirely through natural processes—that is, we fully accept evolutionary theory as an explanation for our existence.
- These natural processes can be recreated and accelerated in computers.
- Global compute capability will continue to grow exponentially into the future.
I will take the first premise as an assumption. For the second, evolutionary computing has existed as a machine learning subfield for decades. We already know how to simulate the process of mutation and natural selection, and have used it to solve complex, real-world problems. With enough compute, this could be scaled up to conduct massively accelerated simulations of biological evolution.
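For readers unfamiliar with evolutionary computing, the mutation-and-selection loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not a simulation of biological evolution: genomes are bit lists, the `fitness` function (count of 1 bits) is an arbitrary stand-in objective, and all names and parameter values are hypothetical.

```python
import random

def fitness(genome):
    """Toy objective (assumption): the number of 1 bits in the genome."""
    return sum(genome)

def mutate(genome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in genome]

def evolve(pop_size=20, genome_len=32, generations=50, rate=0.05, seed=0):
    """Run a simple select-and-mutate loop; return the best fitness per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Selection: the fittest half survives unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        # Reproduction: refill the population with mutated copies of survivors.
        offspring = [mutate(rng.choice(survivors), rate, rng)
                     for _ in range(pop_size - len(survivors))]
        population = survivors + offspring
        history.append(max(fitness(g) for g in population))
    return history

if __name__ == "__main__":
    best_per_generation = evolve()
    print(best_per_generation[0], best_per_generation[-1])
```

Because the survivors are carried over unmodified, the best fitness can never decrease from one generation to the next. Tuning the objective and environment, as discussed in the next paragraph, corresponds to changing `fitness`.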
How high-fidelity would this simulation have to be? There is probably a much easier way for intelligence to emerge than natural evolution. Keep in mind that in nature, evolution is not a continuous progression toward more intelligent beings; its only objective is to select for organisms that can better survive and reproduce. In many cases, intelligence is actively selected against (why expend energy pondering the nature of the universe when becoming bigger or faster or developing sharper teeth could provide you with more food and more mates right now?). For our computational evolution simulation, we could remove these conflicting selection pressures and tune the objective and environment to specifically favor intelligent systems. And this is just an upper bound on how difficult it might be to create a super-human intelligence; given the incredible progress we’ve already made in AI research, it’s very likely that we’ve already discovered a more efficient search method for intelligence than evolution: gradient descent.
As for compute, we need not rely on Moore’s Law to demonstrate its future exponential growth. Even if transistor density plateaus, computers can continue to scale by simply becoming larger. We can expect the growth in compute to at least mirror the growth in global energy consumption, which has grown exponentially thus far and will likely continue to do so. It is perhaps less productive to discuss specific technologies that might enable this than to note that, regardless of what challenges we face in further scaling compute, we will figure something out to keep growing.
I am not arguing that ASI is immediately over the horizon or that it will be achievable with today’s models. In fact, I will later argue the opposite. That we will someday develop ASI is a limited and quite reasonable claim that would require an excessively pessimistic outlook on human potential to dispute.
We need to take AI safety seriously
There is no need to assume malicious intent on the part of an ASI; we need only consider one tangible, real-world problem faced by learning systems today: objective function alignment. Learning systems can (and often do) achieve their formalized objectives without achieving their informal objectives (what we actually intended for them to do), or by working outside the bounds of what we might consider “acceptable” or “desirable” behavior. We have already seen many examples of biased decision making in the real world caused by learning systems taking shortcuts that reinforce systemic biases in their input data distribution, and any machine learning engineer can tell you how difficult it is to properly formalize an intended behavior into a mathematical optimization target.
There is, of course, the famous paperclip example. If we instruct an ASI to simply produce as many paperclips as possible, it might initially appear to behave exactly as we intended, but then disable its guardrails, automate its self-improvement, and develop military capabilities, all to prevent us from shutting it down and getting in the way of paperclip production. That’s a fairly dramatic illustration of what might be possible in the worst-case misalignment scenario, even if we don’t ascribe any conscious intent to the hypothetical ASI. But there isn’t even a need to resort to hypotheticals; our current human society is itself an example of objective misalignment. Evolution’s only intended objective for us was reproduction and self-propagation. For that, we have hunger signals, fight-or-flight responses, sex drives, and complex social relationships.
Yet now that we have escaped the natural environment in which we evolved, we have provided ourselves with an overabundance of food that can be detrimental to our health and reproductive success. We have learned to control our emotions and anxieties through therapy and medications. We have invented numerous birth control measures to acquire the rewards of sex without the effort of pregnancy and child-rearing. We have designed social media algorithms that prey on our need to belong to keep us scrolling for hours on end. And we are on the cusp of directly editing our genetic code to enhance our capabilities. If we are misaligned with our own evolutionary objective, how can we ensure that our creations won’t deviate in the same way? This is an exceedingly difficult problem, and represents the core objective for AI safety research today. Unfortunately, we have made very little real progress toward a solution.
So does this mean the “AI doomers” are right, and we really are on the brink of extinction? Almost certainly not.
LLMs are not the future
Sutton’s Bitter Lesson has been used extensively in justifying the current blitz in spending on AI. The argument proceeds that, if scaling simple algorithms over massive compute is all it takes to achieve ASI, we only need to scale the systems that work today—large language models (LLMs)—over more compute and more data. Yet nowhere in Sutton’s argument does he reference LLMs or any other specific learning algorithm. He critiques the incorporation and influence of human knowledge into AI, which is precisely what today’s LLMs are trained on. Indeed, Sutton himself has made this argument against scaling LLMs.
And even if scaling LLMs over more compute and more data really were the answer to ASI, we are rapidly running out of data on which to train LLMs. The problem is not just quantity, but quality. The ability to creatively self-improve being the defining characteristic of a successful ASI, an LLM-based ASI would need to be trained on massive quantities of data pertaining to this specific task. But the proportion of the internet which actually models creativity, reasoning, and real scientific research is vanishingly small; the proportion capturing humanity’s major breakthroughs even more so. Even if we find ever more creative ways of squeezing additional data out of the internet, the rate of growth of this most crucial body of literature is still bounded by human research speeds. The increased availability of compute will likely enable a transition to an entirely different approach to AI well before we accumulate enough quality data for an LLM to graduate to the level of ASI.
I believe the most consistent signal we have seen among all the developments in LLMs is that improvements in LLMs are making them better at approximating human reasoning, not at discovering new solutions to the problem of intelligence. This observation makes complete sense to anyone who has experience working with neural networks. Their entire purpose is to train on discrete samples out of a distribution and model the underlying function linking inputs to outputs over that distribution. “Generalization” means getting better at predicting the output of inputs that were not in the training data, but still within the training distribution. Even taking the most expansive and optimistic view, the “training distribution” for LLMs is human cognition. No matter how good they get, LLMs will only ever be approximations of human cognition, which is almost certainly not the ultimate form of intelligence. I would contend that this is a major limitation on the latent expressive power of neural networks, similar to how early supervised approaches to chess bots were limited by the ground previously trodden by human players. A true ASI must break free of the human perspective; modeling language brings us no closer to this end.
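The distinction between generalizing within a training distribution and extrapolating beyond it can be made concrete with a deliberately tiny model. The sketch below is an analogy only, under stated assumptions: a least-squares line stands in for a neural network, the target function x² and the input ranges are arbitrary choices, and all names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(model, xs, target):
    """Mean squared error of the fitted line against the true function."""
    slope, intercept = model
    return sum((slope * x + intercept - target(x)) ** 2 for x in xs) / len(xs)

def target(x):
    """The underlying function the model is asked to approximate (assumption)."""
    return x * x

# Training samples drawn from the interval [0, 1].
train_xs = [i / 100 for i in range(101)]
model = fit_line(train_xs, [target(x) for x in train_xs])

# Unseen inputs inside the training range versus inputs far outside it.
held_out_xs = [(i + 0.5) / 100 for i in range(100)]
shifted_xs = [2 + i / 100 for i in range(101)]

in_distribution_error = mse(model, held_out_xs, target)
out_of_distribution_error = mse(model, shifted_xs, target)

if __name__ == "__main__":
    print(in_distribution_error, out_of_distribution_error)
```

The fitted line predicts unseen points inside [0, 1] reasonably well, but its error grows by orders of magnitude on [2, 3], where it never saw data. Whether large neural networks behave analogously with respect to human cognition is the argument of this section, not something this toy demonstrates.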
We have time
The Bitter Lesson is real; I’ve understood this since first fumbling with Keras in high school. But pressure to act now is the oldest trick in the book for con men, because perceived urgency is incredibly effective at making us forget our rationality. Far into the future, we will look back on the current moment as a time when limited compute restricted AI to modeling a human understanding of the world, rather than developing its own understanding through interaction with nature. It is possible for us to begin working toward the safety and alignment of this hypothetical future ASI, without panicking about an imminent extinction event.
Interpreting today’s AI is the key to ASI safety
It is reasonable to expect that, even if LLMs themselves are not the solution to ASI, ASI will somewhat resemble today’s learning algorithms. Throughout the recent history of AI, we have time and again seen simple learning algorithms that had been relegated to the history books return from the dead as available compute reached a critical threshold that enabled emergent complexity. Neural networks themselves were once considered a dead end, until deep learning surpassed all classical machine learning algorithms at complex tasks with large datasets. Attention mechanisms existed long before the seminal paper by Vaswani et al., but only once we recognized and leveraged their advantages in using parallel compute more effectively than recurrent systems did they take off as the dominant architecture for natural language processing. Generative autoregressive transformers were also an unsexy research topic until OpenAI demonstrated shockingly realistic text generation with GPT-3, an achievement that was only possible because of scaling the same architecture over billions of parameters and massive quantities of data. Even reinforcement learning had begun losing its appeal before reinforcement learning from human feedback (RLHF) scaled it up to refining LLM outputs. I contend that the final breakthrough to ASI will probably also involve a dead-end system acquiring a breakthrough emergent capability once sufficiently scaled up. So it is not the case that we can’t make progress on AI safety before ASI is achieved; studying how today’s systems behave under current compute limitations will help us understand their behavior when scaled up past a critical threshold.
Working in our favor is the fact that many of these “forgotten” branches of machine learning are fairly interpretable. If the solution to ASI turned out to be a very large decision tree, for example, it would be relatively easy for us to understand why it makes decisions and catch emerging misalignment early. The trouble is that the common denominator in our recent AI advances seems to be neural networks, and their integration into other learning algorithms (reinforcement learning, evolutionary computing, neuro-symbolic AI, etc.) has resulted in massive advances in those fields as well. This is because of the efficiency offered by gradient descent search and the expressivity of stacked neural network layers. Neural networks are very good at representing and making inferences on the highly complex and noisy distributions underlying the physical world. This being the defining characteristic of intelligence, we can expect that it will be possible to build an ASI on the basis of neural networks, even if other learning techniques are also involved. And given the direction of today’s research, this is likely the solution we discover first.
This is a problem because neural networks are opaque. Humans can’t intuitively understand how they represent their inputs or make inferences just by staring at their weights, which means we can’t intuitively detect when an ASI has drifted into misalignment. So studying how neural networks learn, represent information, and make decisions, is the critical first step toward solving ASI alignment. More specifically, developing concise, scale-invariant approximations of neural network behavior that correctly model decision boundaries and failure modes is a prerequisite if we want any hope of catching and resolving misalignment early.
My perspective on interpretability
What I can tell you—and perhaps this is the biologist in me slipping out—is that we need to stop treating neural networks like mathematical optimization algorithms and start treating them like emergent natural systems. After all, the world already has highly complex emergent black-box systems that humans are only beginning to understand, capable of both great utility and great harm: cellular life. We’ve spent hundreds of years trying to untangle its mysteries, and especially in the last few decades with DNA sequencing, we’ve begun to make serious headway.
It seems to me that our attempts to understand neural networks today are currently in a pre-sequencing dark age. Despite being able to see the activation values and gradients at every single neuron under all manner of data inputs, we have yet to develop a comparable system for synthesizing that information into a format that our limited brains can comprehend and reason about. Developing such a circuit-level understanding of neural networks is the stated aim of mechanistic interpretability (MI). Despite significant progress in recent years, it appears to remain a lofty ambition.
But biology was still making progress even before DNA was sequenced. We did not need a molecular level of understanding to recognize the patterns underlying all life, identify general principles, and refine our understanding. So what can the study of life tell us about the study of neural networks?
The science of interpretability (and lack thereof)
Neural networks are really not that complex when compared to biological systems. Anybody can make an exact copy of the hardware and software environment required to run a model, whereas determining proper culturing conditions remains one of the biggest challenges in studying microbes. No microscope is needed to examine a digital neuron, and experiments can be conducted with massive parallelization in a fraction of the time needed by biology labs. And we can see exactly what data was used to train a model, and make causal inferences about the influence of each training example; no such luck with biological evolution. The challenge that remains is explaining the emergence of complex behaviors from simple primitives and massive scale, a problem also shared by systems biologists.
How do biologists attempt to solve this? At the most basic level, hypothesis-driven experimentation. The philosophical principle underlying all scientific discovery is the generation of falsifiable hypotheses, the testing of those hypotheses via experimentation, and the updating of our theories of how a system works as the empirical evidence refutes our preconceptions. Of critical importance are statistical significance measures, replication studies, and meta-analyses which aggregate the results from hundreds of studies to broadly evaluate scientific ideas. Since we aren’t racing against the imminent development of ASI, we have enough time to apply much of the same philosophical framework to interpretability research. That is to say, the study of neural networks needs to graduate into a serious scientific discipline, on par with other natural sciences. And yet, in my experience reading the literature across AI subdomains, I have observed a general lack of scientific rigor as compared to the standards applied to biology.
Because of the massive costs involved in running cutting-edge language models, much of our best research on interpreting neural networks comes from private AI labs (particularly OpenAI and Anthropic) who refuse to open-source their models. This is not how biologists operate, and for good reason: without replicability, our understanding of neural networks is no better than blind faith in whatever these private research groups choose to tell us. Moreover, the market incentive for AI companies is fundamentally misaligned with the public’s goals of ensuring AI is safe, unbiased, and open for all. As Charlie Munger said: “Show me the incentive, and I’ll show you the outcome.” The free market demands that companies do everything they can to grow and continue growing, lest they be replaced by more ruthless competitors. Without the ability for independent researchers to replicate their findings, private labs could easily manipulate their publications to downplay AI safety concerns (or play them up to generate hype).
And even for peer-reviewed research on open-source models, there is a troubling lack of statistical rigor. If the models themselves are the objects of study, why is a sample size of 1 so often considered acceptable? Compute limitations are real constraints, but biologists also struggle to achieve a high enough sample size to demonstrate statistical significance. The difference is that the standards for publication in biology demand statistical significance and public grant funding is designed to enable it. Over my experience in machine learning research, statistical rigor has been treated more like a nice-to-have, and is consistently under-valued compared to benchmark performance.
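As one illustration of what moving past a sample size of 1 could look like, the sketch below compares two training setups across several random seeds using Welch's t statistic. It is a minimal example under stated assumptions: the accuracy values are made-up placeholders rather than results from any real experiment, and the function and variable names are hypothetical.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples
    with possibly unequal variances (each sample needs at least two values)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    term_a, term_b = var_a / len(a), var_b / len(b)
    std_error_sq = term_a + term_b
    t = (mean_a - mean_b) / math.sqrt(std_error_sq)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = std_error_sq ** 2 / (
        term_a ** 2 / (len(a) - 1) + term_b ** 2 / (len(b) - 1)
    )
    return t, df

if __name__ == "__main__":
    # Placeholder test accuracies from five seeds per setup (NOT real results).
    baseline_runs = [0.912, 0.905, 0.918, 0.909, 0.915]
    proposed_runs = [0.921, 0.903, 0.926, 0.911, 0.919]
    t, df = welch_t(proposed_runs, baseline_runs)
    print(f"mean difference = "
          f"{statistics.mean(proposed_runs) - statistics.mean(baseline_runs):.4f}")
    print(f"Welch t = {t:.3f}, df = {df:.1f}")
```

The resulting t and degrees of freedom would then be checked against a t distribution (for example with a statistics package) to obtain a p-value. The point is only that a claimed improvement is reported together with its run-to-run variability rather than as a single number.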
Interest in interpretability also tends to follow the hottest trend on any given day. In this current moment, everyone wants to study LLMs. But even much older, less performant neural networks still evade our understanding. Since we can expect that the ASI breakthrough will come from scaled-up versions of existing algorithms, it is critical that we broaden our study to those architectures which may today appear unpromising. And even if they are not ultimately the secret to ASI, much can be learned from the study of historical architectures, and this work should not be de-prioritized just because a different system is getting the most headlines today. Biologists didn’t stop studying E. coli after we “got the gist” of what it did; to this day, new discoveries are being made about its operation that shape our understanding of microbial life more generally.
Part of the problem is a lack of patience which I believe stems from the Silicon Valley culture of modern computing. Researchers seek big results fast that can rapidly be converted into marketable products and business impacts. Yet almost no other scientific discipline operates this way. Microbiologists will spend their entire careers studying the behavior of a single organism, seeking to understand it at the genomic, transcriptomic, proteomic, and metabolomic levels. Well-studied model organisms serve as comparison points for more interesting, non-standard behavior in understudied organisms. It is only because of this foundational work that we have begun to understand microbial communities as complete systems. If we had instead dismissed the study of less interesting or less successful organisms, countless antibiotics and cancer therapies hiding in Earth’s microbial dark matter may never have been discovered, and we may never have learned the general principles which aided us in our study of more interesting microbes.
How can we do better?
I have personally always looked up to the “move fast and break things” attitude of Silicon Valley, and the “just build it” mindset is what drew me toward computing in the first place. I am certainly not suggesting that these things need to be abandoned or even toned down; they are some of the most admirable qualities our field possesses. Scientists are slow, scientists are inefficient, and scientists are often wrong (sometimes catastrophically). But what scientists do have is philosophical unity under the framework of hypothesis-driven testing, and centuries of experience applying it to natural systems. Those of us studying neural networks do not need to panic about an imminent AI takeover. We have enough time to do this right, and doing so will help us avoid critical mistakes and misunderstandings in the long run.
The very first priority in making interpretability research more scientific is to define some standardized model organisms. Not toy one-layer transformers that are conveniently scoped for a single paper; truly complex, well-established neural architectures exhibiting some well-profiled emergent capabilities, ideally with many trained variants to offer a large sample size. Existing model zoo projects, which provide thousands of trained image classifiers under different hyperparameters, are a step in the right direction, but need to be expanded to additional problem domains (object detection, natural language processing, etc.). Most importantly, these projects need to actually be used by interpretability researchers in a standardized way.
Secondly, AI research more broadly needs to drastically improve in statistical rigor. Demonstrating that a single trained system with finely tuned hyperparameters can beat state-of-the-art performance on a benchmark should not be sufficient for publication in a peer-reviewed journal or conference. Proposals should quantify hyperparameter sensitivity and demonstrate training stability over multiple attempts on different data subsets. Statistical significance must be a requirement, not a nice-to-have.
Thirdly, we need a public funding arrangement for interpretability research in the same way that we have for basic science research. We cannot allow ourselves to slip into complacency over the gradual takeover of this field by corporate groups with clear perverse incentives. Publicly funded academic research remains the best option for ensuring scientific rigor, as it has done for hundreds of years. We may not feel the pain of deferring this cost today, but we will feel it eventually.
Finally, it is critical that we bring in more people from outside the field who have a different set of perspectives and experiences. This, perhaps, is an even more bitter lesson than Sutton’s. As the old saying goes (often attributed to Abraham Maslow), when all you have is a hammer, everything starts looking like a nail. Our hammer is computing, and it can be difficult for us to recognize that there are other ways of solving problems than computing. I know personally how challenging it can be to work between science and computing. You speak different languages, have different priorities, and possess different skills. But bridging this divide is non-negotiable if the study of neural networks is to become a serious scientific discipline. What we most need to bring over is the most difficult thing of all: the scientists’ philosophical basis, the very lens through which they perceive the world. We will need many, many more of these hard conversations and difficult collaborations to achieve that.
Will we ever illuminate the black box?
Personally, I think it’s absolutely possible. I see interpretability research’s current struggles as caused by a lack of time (large-scale neural networks have only been around to study for 1-2 decades; DNA sequencing was invented hundreds of years after the discovery of cells), a lack of manpower (too few people are working on solving interpretability), and a lack of imagination (there simply hasn’t been enough crossover from the sciences into AI research). All of these are already improving as people recognize the points I’ve made.
The better question is whether we’ll solve this problem before ASI is developed. I also believe this can be done, and done while adhering to a rigorous scientific framework. I don’t think people realize how far we are from an AI that can truly push the bounds of human knowledge. There is at least one more paradigm shift that needs to happen, that from modeling a human understanding of nature to developing a machine which can understand nature in its own right. Considering the undoubtedly massive amount of compute this would require, I think we have sufficient time to build out interpretability research as a serious scientific discipline, and solve the alignment problem before the solution is needed.
That said, nothing is guaranteed, especially if we fail to create the cultural change toward a scientific method of investigation. The most important thing for those of us on the natural sciences side today is to look over the curtains dividing us from computing and speak our minds on what’s holding back the pursuit of interpretability. For those of us on the computing side, we need to reach past the divide and actively bring in people with expertise different from our own. New perspectives are sorely needed if humanity is to achieve the good ending for artificial superintelligence.