The Promise and Challenge of Deriving Meaningful Clinical Insights From Wearables

David Shaywitz

Wearable devices can capture enormous amounts of data, in real time and over long periods, that may reflect aspects of an individual's health.

But (and this is a common theme in the application of data science to healthcare), gathering volumes of data is one thing – deriving meaning from these data in a way that significantly improves a person’s health is another.

A recent paper in Nature Medicine highlights the delicate balance digital health researchers must maintain as they demonstrate the potential of emerging wearable device technology while taking care not to get ahead of the current state of the science, in terms of what the devices actually can tell us.

The research began in Mike Snyder’s lab at Stanford University, and was co-led by Jessilyn Dunn (a rising star in biomedical engineering now on faculty at Duke University) and Lukasz Kidzinski (now an AI researcher at Stanford and director of AI at Princeton, NJ-based Bioclinica).

Beyond Narciss-ome?

For a decade, Snyder has led the charge on wearables. He has famously used himself as a guinea pig.  So exhaustively has he monitored his own parameters, including genomics, proteomics, and every other -omic, that Baylor College of Medicine researcher Richard Gibbs, tongue-in-cheek, proposed a new term, the “narciss-ome”, to describe this comprehensive assessment.

Mike Snyder, chair, department of genetics, Stanford Medicine

As Snyder’s Stanford colleague Euan Ashley writes in Genome Odyssey (my recent WSJ review here),

“Mike Snyder was on a mission to measure everything about himself, all the time, using every technology possible. And I mean everything. Starting in 2010, shortly after he started at Stanford, Mike would show up to meetings sporting multiple different wearable devices. You would meet him, and there might be one smartwatch on one wrist and a different one on the other. Sometimes, he would wear an armband device the size of a pack of cards that detected airborne toxins in his environment. At one meeting, he showed up with a front-facing camera that took time-lapse pictures of everyone in the room. It freaked everyone out, so he stopped that soon after. Lloyd Minor, the dean of Stanford’s School of Medicine, refers to him as ‘the most studied organism in history.’”

Whether these exhaustive measurement efforts are truly useful has been less than clear; in some ways, like the dancing bear, they seem most remarkable not for the quality of the clinical insight generated, but rather because they were conducted at all. Phrased differently, it’s not clear that the burden of such comprehensive data collection is (yet) justified, as I’ve recently discussed (here).

Nevertheless, the promise of rich data collection, particularly using wearables, remains as compelling as it was when Denny Ausiello and I articulated the ambition of digital health nearly a decade ago: we live our lives continuously, yet our medical needs tend to be evaluated episodically, and (hopefully) infrequently. 

Surely, there must be meaningful insight to be obtained from relatively dense, continuous, longitudinal measurements that can’t be gleaned from the occasional clinic visit. 

The challenge has been surfacing this hidden insight, and capturing the implicit value.

From Wearables To Insight?

Which is where the latest paper comes in. Utilizing data from 54 participants in the Stanford iPOP (integrative personal omics profiling) study, researchers examined data extracted from the smart watches the participants wore. The scientists first compared these values to two vital signs (temperature and resting heart rate) obtained in clinic visits using a validated instrument, and then utilized machine learning to see whether they could use either the wearable data or the clinical data to predict the values of routine clinical laboratory tests. 

The study utilized an Intel Basis watch, subsequently withdrawn from the market for safety concerns (the device could overheat, causing burns or blisters). The paper was originally submitted for publication in September 2018, but not published until May 2021, perhaps explaining why the Basis was used in this just-reported study.

The Basis could detect four parameters:

  • Heart rate using PPG signals (the approach associated with the shiny green lights on the back of your Apple Watch);
  • Skin temperature;
  • Steps;
  • Electrodermal activity (EDA), a measure of the electrical properties of the skin.

First, the researchers wanted to get a sense of how the measurement of resting heart rate obtained on the smart watch (using PPG) compared to clinic observations. Many devices use PPG to measure heart rate, including the Apple Watch, the Whoop strap, the Oura ring, and the Fitbit tracker, among others. The approach measures the absorbance of light shined into the skin, which is proportional to blood volume (each pulse transiently increases the volume). 

Challenges of Using PPG Technology To Assess Clinical Endpoints

According to a 2018 review article in the International Journal of Biosensors and Bioelectronics:

“The popularity of the PPG technology as an alternative heart rate monitoring technique has recently increased, mainly due to the simplicity of its operation, the wearing comfort ability for its users, and its cost effectiveness.  However, one of the major difficulties in using PPG-based monitoring techniques is their inaccuracy in tracking the PPG signals during daily routine activities and light physical exercises. This limitation is due to the fact that the PPG signals are very susceptible to Motion Artifacts (MA) caused by hand movements.”

These concerns were further examined in a recent NPJ Digital Medicine paper from Dunn’s current lab at Duke, examining potential sources of PPG wearable variability, compared to an ECG gold standard. While skin tone turned out not to represent a significant source of variability, motion was; moreover, the wearables exhibited different degrees of accuracy, with the Apple Watch generally outperforming competitors.

Jessilyn Dunn, assistant professor of biomedical engineering, Duke University

Dunn’s data accord with my own experience using consumer wearables during exercise; I’ve found the Apple Watch works better than other wearables I’ve tested, but not nearly as well as measurement techniques detecting electrical activity directly, like the Polar chest strap I’ve now adopted. Consumer-facing ECG measurements, like Kardia, and like the Apple Watch measurement obtained when holding the crown for 30 seconds, also utilize electrical detection.

Notably, in Dunn’s recent paper, consumer wearables significantly outperformed several “research-grade” wearables that were also evaluated. Research wearables allow investigators access to the underlying waveforms, while consumer wearables function like black boxes from a research perspective, dramatically limiting their use in clinical research and making it prohibitively difficult to utilize more than one wearable in a given clinical trial – a critical interoperability obstacle that Jordan Brayanov, Jen Goldsack, and Bill Byrom elegantly discussed last year in STAT.

As the three authors explained,

“You’d think that monitoring heart rate remotely would be easy. But wearables from technology giants like Apple and Samsung measure it in different and proprietary ways. One device may record the number of beats over 10 seconds and multiply by six; another may communicate an ‘instant’ heart rate reported after every single heartbeat. This means the two platforms’ data aren’t consistent and so can’t easily be used simultaneously in clinical trials.”
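The incompatibility the authors describe can be made concrete with a toy sketch (hypothetical code, not any vendor's actual API): the same underlying stream of heartbeats yields different numbers depending on the reporting convention.

```python
import statistics

def windowed_hr(beat_times, window_s=10):
    """Convention 1: count beats over a 10-second window, multiply by six."""
    beats = sum(1 for t in beat_times if t < window_s)
    return beats * 6  # beats per minute

def instant_hr(beat_times):
    """Convention 2: report an 'instant' rate after every single heartbeat."""
    rates = []
    for prev, curr in zip(beat_times, beat_times[1:]):
        rates.append(60.0 / (curr - prev))  # bpm from one inter-beat interval
    return rates

# One heartbeat stream: a steady beat every 0.8 seconds (75 bpm)
beats = [i * 0.8 for i in range(80)]

print(windowed_hr(beats))                         # 78 -- window edges inflate it
print(round(statistics.mean(instant_hr(beats))))  # 75
```

The two conventions disagree even on perfectly clean data, which is why the platforms' outputs can't simply be pooled in a trial.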

Resting Heart Rate, Temperature: Wearables Data vs Clinical Data

Back to the original paper: Snyder’s team found that when they considered two weeks’ worth of resting heart rate measurements at the same time of day as the clinic visits, the values were similar, but the variability was significantly lower in the wearable measurements than in the clinical measurements. 

In other words, you get more consistency measuring resting heart rate over weeks on a wearable than assessing it once in a while in the clinic.

Score one for the wearable!

However, temperature measurement was a different story; here, as the researchers report, “clinically measured oral temperature was a more consistent and stable physiological temperature metric than wearable-measured skin temperature….” 

Translation: compared to clinical measurement, assessment of temperature on wearables was somewhat scattered.

Wearable Data + Feature Engineering + ML = Clinical Lab Predictions?

With these foundational parameters of performance established, here’s where the paper gets interesting. The researchers examined the four basic categories of output from the watch – measurements of heart rate, temperature, electrodermal activity, and steps – and began the alchemy of data science known as “feature engineering.” 

Feature engineering involves selecting or constructing attributes from the raw data to use as inputs to a machine learning model. A feature could be a statistical property of the data – average heart rate, say, or a property of the distribution of the heart rate – or it could be the implied activity state of the individual, based on number of steps. 
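A minimal sketch of what this looks like in practice (simplified and hypothetical – the names and streams below are illustrative, not the paper's actual 5,736-feature pipeline): raw per-minute streams are distilled into a handful of scalar features.

```python
import numpy as np

# Simulated raw wearable streams: one day of per-minute readings
rng = np.random.default_rng(0)
heart_rate = rng.normal(70, 8, size=1440)  # heart rate, bpm
steps = rng.poisson(2, size=1440)          # step counts per minute

def engineer_features(hr, steps):
    """Distill raw streams into scalar features a model can consume."""
    resting = hr[steps == 0]  # implied activity state: HR while sedentary
    return {
        "hr_mean": hr.mean(),               # statistical property of the data
        "hr_std": hr.std(),                 # property of the distribution
        "hr_p90": np.percentile(hr, 90),
        "resting_hr_mean": resting.mean(),
        "active_minutes": int((steps > 0).sum()),
    }

features = engineer_features(heart_rate, steps)
print(features)
```

Each dictionary entry is one candidate feature; the paper's pipeline generated thousands of such combinations before winnowing to 153.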

According to the authors, there were 5,736 possible features they could have considered, from which they selected 153 that seemed the most likely to be altered in a fashion that could conceivably be reflected in a clinical laboratory test.

These 153 features were then fed into several different types of models intended to predict the value of one of 44 different clinical labs that were also obtained from the study participants. The initial work suggested one modeling approach, called random forest, generated predictions that explained up to a fifth of the variability seen in measures of hematocrit, red blood cell count, hemoglobin levels, and platelet count. 
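The shape of that modeling exercise can be sketched with synthetic data (illustrative only – scikit-learn's RandomForestRegressor stands in for the paper's models, and the numbers here are not the study's): features that weakly track a lab value let a random forest explain a modest fraction of its variance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 500 "participant-visits", 153 wearable-derived
# features, and a lab value (think hematocrit) that depends weakly on
# just a few of them
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 153))
y = 42 + 2 * X[:, :5].sum(axis=1) + rng.normal(scale=4, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# R^2 is the fraction of variance explained -- the metric behind the
# "up to a fifth of the variability" framing
r2 = r2_score(y_test, model.predict(X_test))
print(f"variance explained (R^2): {r2:.2f}")
```

An R² well below 1 is the point: the model captures a real but partial relationship, consistent with the rough correlations the authors report.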

To be clear, the contention isn’t that wearable data predicted these clinical labs with exceptional accuracy, but rather that wearable-derived data seemed to very roughly correlate with these clinical labs, and others. 

When the researchers examined which features were driving the predictions, it turned out that various permutations of electrodermal activity played a critical role in predicting hematocrit, red blood cell count, and hemoglobin levels, while features driving platelet count predictions were all based on heart rate.

The authors then conducted what felt like a bit of a pedantic demonstration exercise, comparing predictions derived from the 153 wearable features with those derived from the two vital signs measured at the clinic visits (resting heart rate and temperature), and found that, generally, more is better. The authors typically got better predictions when they had more data to consider, even if the source data were only consumer grade, rather than regulatory grade like the clinical measurements.

Warped Perceptions

To read some of the coverage describing this paper, you’d think we could forget about the need for future blood draws, and just rely on data extracted from smart watches. “Your smartwatch can predict blood study results,” one headline declared. Another: “More than just a step-tracker, smartwatches can predict blood test results and infections, study finds.”

Some of this hyperbole likely stems from the actual title of the paper itself: “Wearable sensors enable personalized predictions of clinical laboratory measurements,” which seems, in the context of the reported data, a bit aspirational.

Co-author Dunn may have expressed the contribution of the paper best in a dialog on LinkedIn, commenting:

“I want to emphasize that this is more about directionality than about exact predictions of clinical labs. The current status of this work is certainly not to the point of replacing clinical labs with wearables, but rather it may indicate which labs are more likely to have changes, which can then be directly and specifically measured (think of it as a pre-screening tool for labs when you have limited time and resources).

This, in my opinion, falls under basic research. We need to establish the principle that these relationships exist before we can iterate over them to improve predictions toward more clinical utility. Agreed that there is much more to do, and I hope in this paper we succeed in making the case that this is a path worth following.”

In a larger sense, this assessment captures the current status of many digital and data technologies that are being brought to bear in biopharma and healthcare these days: neither ready for prime time nor quite living up to the hype, but nevertheless making real progress, which resolute cynics can choose to ignore only at their peril.