Seeking Pockets of Reducibility in Personalized Medicine: Lessons from Google’s AI Health Coach Study

David Shaywitz
Technologists often imagine a future of health in which AI delivers highly personalized, preemptive guidance, powered by dense, dynamic streams of data. Continuous sensors track physiology and metabolism; lab panels and -omics assays capture molecular signatures; imaging contributes structural and functional context; and genome sequencing rounds out the picture. Collected longitudinally and at population scale, these data are linked to outcomes and interpreted by advanced computational models.
The expectation is that such a system will surface patterns invisible to clinicians today, drawing on “digital twins” — others who share your profile — to forecast risk and recommend precisely tuned preventive actions. I’ve called this the “Magic Vat” vision: pour in massive multimodal data, swirl in AI, and wait for actionable, personalized wisdom to bubble out.
The appeal is obvious, and Lee Hood and Nathan Price have articulated it eloquently in their concept of “scientific wellness.” Their vision for dense, longitudinal, personalized health data clouds is compelling, yet the leap from amassing data to producing reliable, timely guidance has proved largely elusive. That tension — between the promise of precision and the stubborn complexity of real health data — has both inspired and confounded champions of precision health for decades.
If every required component truly were in place — the near-complete measurement of relevant physiology and molecular state for everyone, cleanly linked to outcomes — this aspiration might be achievable (though still imperfect, as I noted in my recent WSJ review of Sam Arbesman’s delightful The Magic of Code). At present, it lives in the “assume a can opener” realm. The urgent question is what to do now, before that data-saturated future arrives — if it ever does.
Borrowing Stephen Wolfram’s language, I’m interested in pockets of reducibility: places where complex systems yield just enough to become tractable, and where limited data can deliver disproportionate leverage.
Meanwhile, the marketing allure of “scientifically” personalized advice has led to a spate of startups promising genetically informed guidance on what to eat, drink, or even whom to date — generally with little validation. (Remember Vinome? And Pheramor, ScientificMatch, SingldOut — all since shuttered.) Amusing as these are, they’re symptoms of a broader pattern I’ve seen for years: reach outpacing grasp (and often common sense).
The Continuum of Approaches to Health Personalization
Stepping back from the hype, it helps to map the current approaches to early diagnosis and targeted intervention, from conservative and validated methods to those more exploratory and speculative.
- Established clinical biomarkers: parameters like cholesterol, blood pressure, HbA1c, with well-validated assays typically (or at least ideally) associated with evidence-based interventions.
- Expanded panels outside traditional context: companies offering broad CLIA-certified tests (e.g., Function Health, which partners with Quest Diagnostics). While the assays themselves are analytically robust, their value when applied non-selectively, outside the targeted context in which they’re usually ordered, is questionable at best, a point Dr. Eric Topol has recently underscored.
- Digital biomarker proxies: platforms like WHOOP derive and model metrics from wearable sensors (e.g., hours of sleep, daily steps, resting heart rate, VO₂ max estimates), and may aggregate them into composite indices (e.g., “WHOOP Age”) built from parameters that, when measured with clinical rigor, have been linked to healthspan. Their key appeal is the immediacy and continuity of measurement — delivering dynamic, longitudinal streams that can engage users and support real-time course corrections. But the measurements often lack the robustness of clinical assays, and their prospective linkage to health outcomes remains unproven. Indeed, even — and perhaps especially — when the individual parameters carry well-established health associations under validated conditions, digital readouts that have not been subjected to comparable scrutiny can be contested, as illustrated by WHOOP’s recent dispute with the FDA over blood pressure measurement.
- Exploratory dense-data clouds: Hood and Price’s vision of longitudinal, multimodal “scientific wellness” profiling, seeking novel markers that flag early wellness-to-disease transitions. Enormously ambitious, but as yet mostly aspirational.
Each step along this continuum reflects a trade-off. At one end sits the narrow set of rigorously validated tests physicians order in traditional practice, aligned with Eric Topol’s view that new assays outside these boundaries should be pursued with the rigor and scientific discipline of formal clinical trials (as he emphasizes in Super Agers; my WSJ review here).
At the other end are more exploratory, often speculative analyses, justified by a readiness to act on incomplete evidence if the perceived benefits seem to outweigh the risks. As Peter Attia emphasizes in Outlive, and as I’ve argued in the context of “personalized regulation,” this approach creates space for individual preferences and tolerances to guide such choices.
Motivation and Measurement
An often underappreciated dimension of precision medicine is the remarkable psychological impact — including the ability to change behavior — that even scientifically suspect personalized health recommendations can have if they reinforce an individual’s health narrative.
A recent essay by advocate Jordan Glenn argued that dietary supplements can function as a “gateway drug to health”; he cites a 2014 publication reporting “dietary supplement users are more likely than nonusers to adopt a number of positive health-related habits. These include better dietary patterns, exercising regularly, maintaining a healthy body weight, and avoidance of tobacco products.”
The point, Glenn suggests, is that much of the benefit of supplements lies in their role as a quick, easy daily habit that can both catalyze and reinforce a commitment to healthier living. (To be sure, the causal contribution of the supplement itself remains unproven.)
Similarly, I previously described how even modestly informative genetic tests can spur genuine lifestyle changes, particularly when paired with generally sensible advice that people follow because it aligns with what Dr. Arthur Kleinman has described as one’s “explanatory model.”
I can even imagine a similar dynamic playing out in the apparently trendy domain of “engineering” better babies through genetics. The science here is tenuous at best — our ability to meaningfully enhance complex traits like intelligence through genetic tinkering remains highly uncertain (not to mention ethically suspect). Yet expectancy effects suggest such claims could still shape outcomes: parents who believe their child has been genetically enhanced might interact with them differently — echoing the Rosenthal (“Pygmalion”) effect observed in classrooms — and children might internalize these expectations in ways that alter behavior and performance.
More broadly, the point is that even scientifically shaky “precision” interventions can exert real-world influence, not through biology, but through belief.
More Precision Doesn’t Always Translate To Better Health
Not only can questionable precision science motivate health-promoting behaviors, but conversely, even well-grounded precision science can yield genuinely credible insights that offer unexpectedly little practical value.
For example, genetics can distinguish fast from slow metabolizers of the commonly used blood thinner warfarin. Yet clinical trials to date show limited incremental benefit over careful titration in usual care, with context-specific exceptions; in many settings the traditional “go low and go slow” approach performs well. I suspect one reason for the disappointingly slow adoption of pharmacogenomics in the clinic relates to similar concerns about practical value.
Of course, there are many compelling examples of the exceptional value of genetic and other measurements in enabling more personalized medicine, particularly in oncology: for instance, HER2 amplification guiding trastuzumab in breast cancer or BRAF V600E mutations predicting response to BRAF inhibitors in melanoma. Genetic testing also plays a critical role in determining the use of abacavir in HIV patients, and of fluoropyrimidines in oncology.
Even so, on balance there’s a tremendous gap between medicine’s ambition to offer more personalized care and our ability to credibly do so.
Moreover, as clinical visits become ever more rushed and the delivery of care ever more industrialized, one devastating consequence has been the erosion of what has long been among our most effective tools for personalizing care: the therapeutic relationship between patients and doctors who know and understand them well enough to tailor their guidance.
Lessons from a Thoughtful Google Experiment
Yet even the most skilled and empathetic doctor — or the best health coach — can care for only a relatively limited number of patients. In contrast, AI — particularly as a health coach — could, in theory, scale personalized guidance and impact to far larger populations. Consequently, I was intrigued by a recent effort by the team at Google to develop just such a personalized health coach in a thoughtful and rigorous fashion; the results were just published in Nature Medicine.
The researchers wanted to explore whether an AI model could be trained to integrate a range of data associated with lifestyle parameters like sleep and exercise and offer expert-level insight and advice.
This was a particularly attractive area of study for three reasons, as the authors indicate:
- Lifestyle factors such as sleep and activity have profound health impacts — as this column has frequently emphasized.
- Sleep and activity parameters can be measured passively and continuously by widely available wearables.
- Practical advice can be offered without veering into regulated medical claims, thus providing a bit more space for less encumbered exploration.
Training the Model
The team started with the Gemini 1.0 Ultra large language model (LLM) and fine-tuned it on expert-written case studies for sleep and fitness. These were built from anonymized Fitbit data, with experts crafting the “gold-standard” answers; a separate set of expert-only cases was held back for grading.
They also wanted the model to connect what wearables record with how people say they slept — their subjective experience. To do this, the Google team trained an “adapter” on a large Fitbit research cohort in which participants wore devices for several weeks and completed validated sleep questionnaires. The adapter’s job: turn streams of sensor numbers into a form the language model can reason about, so it could relate a participant’s recent data to their own reported sleep experience.
For fitness, the inputs mixed real training metrics (e.g., load and recent workouts) with short diary-style notes to mimic user logs about soreness or readiness. Experts then wrote the coaching replies.
The result was PH-LLM — a personal health large language model that knew the domain and could “speak sensor.”
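To make the adapter idea concrete, here is a minimal sketch of how such a component might work. To be clear, this is not Google’s implementation: it simply shows a small network mapping a window of daily wearable features into “soft tokens” in a language model’s embedding space, and every dimension, feature name, and architectural choice below is an illustrative assumption.

```python
# A minimal sketch (not Google's implementation) of a sensor "adapter":
# map a window of daily wearable features into "soft tokens" that a language
# model can attend to alongside its text prompt. Dimensions, feature names,
# and architecture here are illustrative assumptions.
import torch
import torch.nn as nn

class SensorAdapter(nn.Module):
    def __init__(self, n_features: int, llm_dim: int, n_soft_tokens: int = 8):
        super().__init__()
        self.n_soft_tokens = n_soft_tokens
        self.llm_dim = llm_dim
        # Small MLP: pooled per-day features -> a block of soft-token embeddings
        self.proj = nn.Sequential(
            nn.Linear(n_features, 256),
            nn.GELU(),
            nn.Linear(256, n_soft_tokens * llm_dim),
        )

    def forward(self, daily_features: torch.Tensor) -> torch.Tensor:
        # daily_features: (batch, days, n_features), e.g. nightly sleep duration,
        # resting heart rate, steps, and HRV over the last few weeks.
        pooled = daily_features.mean(dim=1)   # crude summary across days
        tokens = self.proj(pooled)            # (batch, n_soft_tokens * llm_dim)
        return tokens.view(-1, self.n_soft_tokens, self.llm_dim)

# Usage sketch: these embeddings would be concatenated with the embedded text
# prompt before the (fine-tuned) language model generates its answer.
adapter = SensorAdapter(n_features=12, llm_dim=4096)
recent_data = torch.randn(2, 28, 12)          # 2 users, 28 days, 12 features
soft_tokens = adapter(recent_data)
print(soft_tokens.shape)                      # torch.Size([2, 8, 4096])
```

The design point worth noting is that the language model itself need not be rebuilt; the adapter’s only job is to translate numeric sensor streams into a representation the model can attend to alongside the text prompt.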
Evaluating the Model
The researchers then asked three basic questions of the model:
- Does it know the material? On certification-style tests, PH-LLM scored 79% in sleep (experts: 76%) and 88% in fitness (experts: 71%).
- Can it link wearables to how people felt they slept? With the adapter, PH-LLM did better than prompt-only LLMs, but about the same as a simple logistic regression (a minimal baseline of this sort is sketched after this list). Bottom line: wearable features only modestly predict subjective sleep quality.
- Is its coaching any good? On sleep cases, fine-tuning improved PH-LLM over the base model. On fitness, its advice was judged statistically indistinguishable from human experts — and the base model landed in the same range.
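For context on the second question, a “simple logistic regression” comparator is the kind of baseline that takes minutes to stand up. The sketch below, built on synthetic data and invented feature names, shows roughly what such a baseline looks like; it is a hedged illustration, not the study’s code.

```python
# Hedged sketch (synthetic data, invented feature names) of the kind of simple
# logistic-regression baseline referenced above: predicting whether a user
# reports restful sleep from a handful of wearable-derived features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([
    rng.normal(6.8, 1.0, n),    # sleep duration (hours)
    rng.normal(62, 8, n),       # resting heart rate (bpm)
    rng.normal(0.85, 0.07, n),  # sleep efficiency (fraction of time in bed asleep)
])
# Synthetic label: self-reported restfulness loosely tied to duration and efficiency.
logit = 0.8 * (X[:, 0] - 6.8) - 0.05 * (X[:, 1] - 62) + 6.0 * (X[:, 2] - 0.85)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = LogisticRegression()
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {auc.mean():.2f}")  # the bar any adapter has to beat
```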
Taken together, these results show technical feasibility with clear limits. You can adapt a general model to a lifestyle domain, teach it to “speak sensor,” and produce advice experts often find reasonable. But the signal–outcome link is modest, and nothing here demonstrates behavior change or better health.
In other words, the bottleneck looks less like “insufficient model cleverness” and more like where the data carry usable signal about outcomes we care about — and whether there’s a lever that turns prediction into improvement.
Where to Go from Here
In my last several decades of engaging with precision health champions in academia, biopharma, and health tech, a recurrent theme has been the hope (and assumption) that if you amass enough multimodal data and add ever-smarter analytics (now AI), actionable insight will emerge — the idea of the “Magic Vat.”
PH-LLM is a well-executed reminder that volume plus AI isn’t, by itself, a shortcut. The practical question isn’t “Can an LLM coach?” so much as “Where does data plus AI buy real leverage?” — i.e., in which domains do measurement, outcomes, and actions line up tightly enough to make a difference?
The most promising areas are likely to share three features:
- Reliable, relevant signals that can be collected at scale.
- Meaningful outcomes captured consistently and in ways that matter to individuals.
- Credible, evidence-guided levers that can shift those outcomes on practical timescales.
You can see this logic in action with platforms like Tonal, which precisely captures inputs (sets, reps, loads, even form) and outcomes (strength, function), and applies well-established levers like progressive overload and recovery. With user consent, it could even support A/B testing of different approaches, extending into rehab or fall-prevention with the appropriate outcome data — a near-ideal loop of signal, outcome, and intervention.
Crucially, context matters. Whether advice is realistic often depends on factors like shift work, caregiving, travel, or acute illness. Even a single, low-burden context flag can materially improve both predictions and recommendations.
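As a toy illustration of that point (an assumption-laden sketch, not anything from the study), consider how a single shift-work flag might change the advice attached to otherwise identical sensor data:

```python
# Toy illustration (invented names, coefficients, and thresholds; not from the
# study) of how one low-burden context flag such as shift work can change the
# advice attached to otherwise identical sensor data.
from dataclasses import dataclass

@dataclass
class NightSummary:
    sleep_hours: float
    resting_hr: int
    shift_work: bool  # the single context flag

def sleep_risk_score(night: NightSummary) -> float:
    # Crude linear score: shorter sleep and higher resting HR raise risk.
    score = (7.5 - night.sleep_hours) * 0.4 + (night.resting_hr - 60) * 0.02
    return max(score, 0.0)

def recommendation(night: NightSummary) -> str:
    if night.shift_work:
        # Same physiology, different realistic advice: anchor sleep to the
        # user's actual schedule rather than a conventional bedtime.
        return "Protect a consistent 7-hour sleep window after your shift, and nap before night shifts."
    return "Move bedtime 30 minutes earlier this week and keep your wake time fixed."

night = NightSummary(sleep_hours=5.8, resting_hr=68, shift_work=True)
print(round(sleep_risk_score(night), 2), "-", recommendation(night))
```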
Unfortunately, in many health domains we care most about, it remains surprisingly difficult to find — let alone gain access to — datasets that contain all three ingredients: reliable signals, meaningful outcomes, and credible levers. These enduring gaps underscore the importance of prioritizing relevant data over vague hopes that AI alone will supply the miracle — and point to the need for more agile, iterative processes to make progress.
How to Look: Refining the Process
Too often our default logic echoes the South Park Underpants Gnomes:
Step 1: Collect data.
Step 2: ? (AI?).
Step 3: Wondrous health‑altering insight.
A more productive alternative emphasizes accelerating knowledge turns (to borrow Andy Grove’s phrase): shortening the cycle from data to hypothesis to tested result, then back again. Instead of amassing years of data before beginning the analysis, we should be probing while we collect — scanning for early signals, forming provisional hypotheses, and pressure-testing them quickly.
Nathan Price’s wellness studies (as he discusses here) show why dense longitudinal data clouds matter. By gathering deep molecular and physiological data over time, his team could later look back at individuals who eventually developed cancer and see subtle protein shifts years before diagnosis. They didn’t know which signals would matter in advance, but the ongoing collection created the opportunity to spot them in hindsight, turning retrospective observations into new hypotheses for prospective testing.
The path forward, then, looks less like passively waiting for the “Magic Vat” to yield wisdom, and more like building iterative funnels for discovery — environments designed to collect richly, analyze continuously, and refine actively. This also requires exactly the sort of mindset I’ve suggested will propel medicine’s data-driven future: inquisitive physicians and scientists willing to explore, discard, and build again, accelerating knowledge turns while maintaining rigor and empathy.
So instead of stockpiling data and praying for magic, we might be better off embracing a process that keeps us learning along the way.
To me, it looks something like this:
- Look for signals while collecting. Dense cohorts can surface candidate markers — molecular, behavioral, or physiological — that hint at risk or resilience.
- Test quickly, discard freely. Expect most to vanish; keep the probes cheap, reversible, and skeptical (a toy screening sketch follows this list).
- Deliberately follow what survives. When a candidate persists, shift gears: downselect, sharpen the measurement, and scale up evaluation with larger, more focused cohorts. The point is to convert serendipitous suggestion into deliberate study design.
- Close the loop. Feed validated findings back into both practice and data collection (dropping low-value measures, adding contextual ones).
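To make the “test quickly, discard freely” step concrete, here is a toy screening loop, built on synthetic data and invented thresholds, that probes many candidate markers against an outcome, applies a standard false-discovery correction, and promotes only the survivors to focused follow-up.

```python
# Toy sketch of "test quickly, discard freely": screen many candidate markers
# against an outcome, control the false discovery rate (Benjamini-Hochberg),
# and promote only the survivors. Data, effect sizes, and thresholds are
# illustrative assumptions, not findings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_markers = 200, 500
markers = rng.normal(size=(n_subjects, n_markers))
outcome = rng.normal(size=n_subjects) + 0.4 * markers[:, 0]  # one genuine signal

# Cheap univariate probes: correlate each marker with the outcome.
pvals = np.array([stats.pearsonr(markers[:, j], outcome)[1] for j in range(n_markers)])

# Benjamini-Hochberg at a 5% false discovery rate.
order = np.argsort(pvals)
thresholds = 0.05 * np.arange(1, n_markers + 1) / n_markers
passed = pvals[order] <= thresholds
k = int(np.max(np.nonzero(passed)[0]) + 1) if passed.any() else 0
survivors = order[:k]

print(f"{k} marker(s) survive screening; promote these to focused, prospective testing.")
```

In practice the probes would be assays or pilot analyses rather than simple correlations, but the discipline is the same: expect most candidates to wash out, and treat whatever survives as a hypothesis for deliberate prospective testing.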
As I’ve discussed (see here, here), this is often how medicine advances — by gradual refinements, not sudden leaps. The opportunity with AI isn’t to conjure insight from a vat of undifferentiated data, but to accelerate and discipline these turns of the wheel: spotting possible signals sooner, testing them faster, and deepening them in the right places with more deliberate evidence.
Beyond Metrics: Remembering What Health Is For
Finally, a caution. It’s tempting to equate precision health with metric optimization, chasing personalized nudges to lower blood pressure, trim cholesterol, or log more steps. These markers matter, but they are not health itself. Human flourishing — purpose, connection, agency — cannot be captured in tidy dashboard metrics.
If we’re fortunate, the future of personalized medicine will be a system that earns its worth one validated improvement at a time, powered not only by data and algorithms but by people: physicians, scientists, and patients alike — relentlessly curious about new signals, disciplined in testing them, willing to discard what fails, and ready to scale what endures. It will be shaped not by the arrogance of an omniscient AI guide, but by the humility of knowing that the deepest drivers of health may lie beyond what any dataset can hold.



