6 Jan 2025

AI Needs Natural Language to Give Structure to Biology

Sam Rodriques, co-founder and CEO, FutureHouse

The word of the day, at least in the AI for Biology community, is foundation models. Everyone wants bigger data on more things to throw into bigger models.

Virtual cell models will enable us to predict how cell states will change in response to chemical perturbations. Protein language models will enable us to identify better enzymes for degrading plastics, or protein binders that have more drug-like properties. These models sit on top of increasingly accessible genomic data. The future is bright.

Real biology discoveries look somewhat different, though, and I think it is telling that there are not many actual biologists at AI-for-biology meetings like NeurIPS, the Conference on Neural Information Processing Systems, which I attended last month in Vancouver, BC.

Contrast these dreams of foundation models driving biological discovery with the latest table of contents from Science or Nature.

I struggle to imagine how any of the discoveries reported there could fall out of a multimodal biology foundation model.

This is not intended to be a straw man argument. A foundation model could plausibly identify the lncRNA from the first paper, but I am not sure how such a model would associate it with chromatin remodeling.

A multimodal foundation model with enough data could also potentially identify metabolic changes in melanoma cells subjected to certain kinds of treatments, but I don't see how it could identify the role of those metabolites in preventing CD8+ T cell activation. Indeed, I do not think that any of the foundation models being developed today would be capable of generating rich new biological insights of the kind described in these papers. And yet, these are the kinds of insights that new therapies are made from.

The issue, I think, is that machine learning models work extremely well on structured data, and so all the foundation models that are being built are highly structured. Take a protein sequence as input and produce a protein sequence as output. Take a cell state and a chemical perturbation as input and produce a new cell state as output.
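
To make the contrast concrete, here is a minimal sketch, in Python with entirely hypothetical names, of the kind of typed interfaces these structured models expose:

```python
# A minimal sketch of the structured interfaces described above;
# none of these names correspond to a real library.
from dataclasses import dataclass


@dataclass
class CellState:
    expression: dict[str, float]  # gene name -> expression level


def protein_model(sequence: str) -> str:
    """Protein sequence in, protein sequence out (e.g. a redesigned enzyme)."""
    raise NotImplementedError  # stands in for a trained model


def virtual_cell_model(state: CellState, perturbation: str) -> CellState:
    """Cell state plus a chemical perturbation in, predicted new state out."""
    raise NotImplementedError  # stands in for a trained model
```

Anything that does not fit these type signatures (a mechanism, a hypothesis, a causal story) simply has nowhere to live in the model.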

Biology, however, is poorly structured. The lncRNA insight is a good example: what structured representation can we use for the action of the lncRNA in modulating chromatin architecture? Protein models cannot represent it; DNA models cannot represent it; virtual cell models cannot represent it. Perhaps a model that incorporates RNA expression and 3D genome state could represent it, but then how would that model represent the lipid modulation of the monocytes?

I worry that every discovery may need its own representation space. Indeed, the nature of biology is such that there likely is no representation, short of an atomic-resolution real-space model of the entire organism, that is sufficient to represent the diversity of biological phenomena that are relevant for disease. Such a whole-organism model is far off – we still don’t have a computer model that fully represents the complexity of a single living cell.

Except, of course, for natural language, which has evolved to represent all concepts that humans are capable of contemplating. Indeed, I think natural language is ultimately unavoidable for discovery in biology, insofar as it is the only medium we know of that is sufficiently structured for machine learning and sufficiently flexible to represent the full diversity of biological concepts.

One way to combine language and biology is to use agents, like the ones we build at FutureHouse, a non-profit AI lab that I run in San Francisco. Language agents are language models – like ChatGPT – that can use literature search tools (e.g. PubMed), protein structure prediction tools (e.g. AlphaFold), DNA analysis tools (e.g. BLAST), and so on to analyze biological data in the same way humans do, but much faster and at much larger scale. We recently deployed an agent we built, PaperQA2, to search the literature and write an accurate and cited Wikipedia-style article for nearly every protein-coding gene in the human genome. In the future, language agents will be able to automatically analyze experimental data and clinical reports to provide detailed biological hypotheses similar to those in the Nature and Science papers above.
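
To make the pattern concrete, here is a minimal sketch of such an agent loop. Every name here (call_llm, search_pubmed, run_blast) is a hypothetical placeholder, not our actual implementation: the model reasons in natural language and decides, step by step, which structured tools to call.

```python
# A sketch of the language-agent loop described above, with placeholder
# functions throughout; this is not FutureHouse's actual implementation.

def search_pubmed(query: str) -> str:
    """Hypothetical wrapper around a literature search service."""
    raise NotImplementedError

def run_blast(sequence: str) -> str:
    """Hypothetical wrapper around a BLAST sequence search."""
    raise NotImplementedError

TOOLS = {"search_pubmed": search_pubmed, "run_blast": run_blast}

def call_llm(messages: list[dict]) -> dict:
    """Stands in for any chat-completion API that can request tool calls."""
    raise NotImplementedError

def answer(question: str, max_steps: int = 10) -> str:
    """Loop: the model either requests a tool or gives a final answer."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        if reply.get("tool") is None:       # no tool requested: final answer
            return reply["content"]
        result = TOOLS[reply["tool"]](reply["argument"])  # run requested tool
        messages.append({"role": "tool", "content": result})
    return "No answer found within the step budget."
```

The point of the sketch is that the glue holding the tools together is natural language, so the loop is not limited to any one structured representation.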

There are other ways to bring language and biology together as well. Training models that combine natural language with protein sequences, DNA, transcriptomics, and so on will also be extremely productive, provided that adding the structured datatypes does not restrict their ability to represent unstructured concepts.

The history of biology is built on tools that we have found in nature to study biological phenomena. CRISPR is one powerful recent example. As all biologists know, trying to engineer things from scratch (almost) never works; what works is finding things in nature and repurposing them. It will be aesthetically pleasing if it turns out that our engineered representations are yet again insufficient for studying biology, and that good old natural language is simply another such tool, found in nature, that must be applied to unravel the mysteries of biology.