DeepSeek Shocked Silicon Valley, but It’s Not Earth Shaking for Biotech

Simon Barnett, partner and head of research, Dimension
DeepSeek, the artificial intelligence (AI) research group owned by Chinese hedge fund High-Flyer, dominated last week’s news cycle—at least for 24-48 hours. The group launched R1, the latest in a series of cutting-edge large language models (LLMs). Investors panicked, erasing over $1 trillion of U.S. equity market cap in a single day.
Nvidia (NVDA), the maker of high-powered AI chips, shed $500 billion alone. Speculators feared that demand for Nvidia's chips might dry up, since DeepSeek had found an ostensible workaround delivering stellar LLM performance on vastly less compute. The news sent shockwaves throughout Silicon Valley.
Could there be a similar DeepSeek moment for the burgeoning intersection of machine learning (ML) and the life sciences?
No—I don’t think so. The conditions leading to the market frenzy in the natural language processing (NLP) space are quite different from those in biotech—setting these domains on divergent courses.
THE STATE OF THE ML UNION
The frontier of NLP research is governed by so-called scaling laws. Chief amongst these is that training large ML models on vast datasets with enormous gobs of compute results in monotonically improving model performance.
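As a concrete illustration, one widely cited formalization of these laws, the "Chinchilla" fit reported by Hoffmann et al. (2022), models expected loss L as a power law in parameter count N and training-token count D. The functional form below follows that paper; E, A, B, α, and β are empirically fitted constants, not figures specific to any model discussed here:

L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

Loss falls predictably as N and D grow, which is the empirical basis for the "more compute, better models" bet described above.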
Big tech is betting these models will steadily breach performance thresholds at which they can extract ever more economic value, first by automating low-level tasks and eventually by taking on complex knowledge work.
In the NLP realm, complex ML model architectures and Internet data are easy to come by. Compute is the scarce resource gating the progress of LLMs. Training large models requires expensive, specialized hardware, principally Nvidia GPUs.
Frontier NLP model development is thus a pay-to-play endeavor. Companies like OpenAI and Anthropic have contributed enormously to ML research. They've also leveraged the compute scaling narrative to corral the capital necessary to build veritable armamentariums of GPUs. The size and scope of these compute investments, in sheer dollar terms, function like a competitive moat.
These conditions have shaped the industry into an oligopoly of well-capitalized, closed-source groups exchanging dollars for compute with the hope that scaled ML models will power a growing set of applications—forking the history of life on Earth.
DEEPSEEK R1 CONSTITUTES A NARRATIVE VIOLATION
DeepSeek R1 seemingly violated the closed-source scaling narrative, casting uncertainty over big tech’s ML primacy.
R1 isn't the most performant model, but it's good enough to power most downstream tasks. Far more relevant are two facts: (a) DeepSeek made R1 open source under the MIT license, and (b) DeepSeek claims R1 cost ~10x less to train and ~90% less to use than other contemporary LLMs.*
The emergence of an inexpensive, open-source (ish) LLM has shifted the value-capture conversation marginally away from the closed-source oligopoly and toward the application layer: toward groups solving UI/UX issues and product-market fit questions. Understandably, the big LLM providers aren't standing still. They will answer with competitive salvos of their own. Perhaps they already have.
BIOLOGY IS A SEPARATE BEAST
Biotech companies and academics alike are releasing scaled biomolecular ML models at a fever pitch, whether to predict protein-ligand complex structures, engineer therapeutic proteins, or surface developable small molecule hits.
However, the dynamics governing the evolution of ML in the life sciences are unique. Market participants should be cautious about transplanting the parables of the current moment in natural language processing into the biotechnology field.
While GPUs cost the same across ML domains, they aren't the rate limiter (yet) in the life sciences. Throwing disproportionately large compute at sparse data results in overfitting, a dangerous phenomenon in which models appear to perform well on their training data but struggle to generalize to new examples.
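To make that failure mode concrete, here is a minimal sketch using synthetic, hypothetical data: an over-parameterized linear model fit on a handful of noisy measurements scores perfectly on its training set, then collapses on held-out data.

```python
# Minimal overfitting sketch on synthetic, hypothetical data (scikit-learn).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))            # 60 "experiments", 200 features: severely data-sparse
y = X[:, 0] + 0.5 * rng.normal(size=60)   # true signal lives in one feature, plus noise

X_train, X_test = X[:30], X[30:]
y_train, y_test = y[:30], y[30:]

# With 200 coefficients and only 30 training points, the model interpolates the training set.
model = LinearRegression().fit(X_train, y_train)
print(f"train R^2: {model.score(X_train, y_train):.2f}")  # 1.00 -- performance looks perfect
print(f"test  R^2: {model.score(X_test, y_test):.2f}")    # near or below zero -- fails to generalize
```

No amount of extra compute rescues the held-out score here; only more, and cleaner, data would.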
Data is the scarce asset in the life sciences. Biological data doesn’t expand ambiently like the Internet, with several billion users contributing to the information commons every day. Open-source life sciences datasets are relatively small and contain experimental artifacts, challenging ML model training. Even so, this public data has been enough to power innovation to date.
ML model innovation is perennially useful in biology. For example, AlphaFold2's architectural novelties burst open the field of computational protein structure prediction in 2021, despite the underlying dataset, the Protein Data Bank, having been available for decades.
The reason there's unlikely to be a singular DeepSeek moment at the intersection of ML and biology is that we're still inundated with DeepSeek moments. Acts of algorithmic clairvoyance keep sending the field in new, exciting directions.
As the open-source data wellspring dries up, however, ML in the life sciences may move in the opposite direction to NLP. The field may instead shift towards closed-source walled gardens that house high-throughput, experimental data foundries—the scarce asset that will imbue scaled biological models with economically valuable capabilities.
*DeepSeek likely distilled R1 from other LLMs. The company also did not include all the prior R&D costs it drafted off of to build R1, casting significant doubt on the stated training costs. The inference cost savings, by contrast, have been replicated in users' hands, making that figure far more credible.