Getting the COVID-19 Numbers Wrong

Ruth Etzioni, Full Member, Division of Public Health Sciences, Fred Hutch Cancer Center

When I was in college, everyone wanted to major in psychology. I signed up, but switched out after only a few weeks.

Why? Well, the more I read, the less I seemed to know. Psychology, after all, is an inexact science.

I sought refuge in the exact worlds of computer science and mathematics. Those courses led me to build a career in statistics, the science of uncertainty.

For the past 20 years, I have partnered with colleagues in the Surveillance Program at the National Cancer Institute to explain why cancer rates go up and down from one year to the next. We can do this work because we know how many cancer cases – and deaths – there are every year. Our analysis is grounded in hard data that is reliably and consistently collected.

For years, an information infrastructure has existed to funnel data from pathology laboratories to local cancer registry offices, which work with hospitals and doctors’ offices to catalog each cancer case.

I always appreciated the NCI’s Surveillance Program for its tireless work to bring us the numbers and double-check to make sure they are right. Reliable data is the bedrock of our analysis.

I appreciate the fundamental, unglamorous data-checking and cleaning work even more as I look at the state of COVID-19 surveillance. I wonder about the data we are relying on to track cases and deaths, to prepare for the future, and to make critical policy decisions.

It seems to be a house of cards.

We know that the reported daily tally of cases is hopelessly wrong. Cases can only be confirmed if they are tested by one of the reliable RT-PCR diagnostic tests that use samples from nasal swabs. From the beginning, the US has struggled with a shortage of these tests. So the number of cases reported on any given day is determined not only by the spread of the virus itself, but also by the availability of tests. Many people don’t know that they have the virus, and of those who exhibit symptoms, many who want a test can’t get one. Access to reliable diagnostic testing varies widely over time and across locations. So the meaning of “number of confirmed cases” differs depending on the date and where you are.

The number of cases simply can’t be interpreted without understanding the state of testing. But even if we knew how many tests were being done each day, this would not be enough. Just figuring out how many symptomatic cases there are would require knowing how many of these people actually present themselves to a healthcare provider, asking for a test. We don’t have the data to understand this, but it surely varies depending on the date and where you are. And this doesn’t even get at how to account for asymptomatic cases.

New antibody studies can provide us with valuable snapshots in time, if conducted with a reliable blood test and with a rigorous random sample of the population – not just people who volunteer because they think they might have COVID-19. These studies may help us to reconstruct how many surviving individuals had the virus, whether they know it or not, and this could potentially help correct our daily tally of cases, retroactively. But the results of these studies are subject to debate and cannot be taken at face value.

As an example, an antibody study conducted in Santa Clara County was roundly criticized because the underlying test produced false positive results that could have accounted for many of those told they had antibodies. Still, as these studies accumulate, we will likely gain a better idea of the foothold the virus has established in the population.
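To see why a small false positive rate matters so much when few people have been infected, here is a minimal sketch of the arithmetic. The prevalence, sensitivity, and specificity values are hypothetical, chosen only for illustration; they are not the figures from the Santa Clara study, and the function names are my own.

```python
def expected_positive_fraction(true_prevalence, sensitivity, specificity):
    """Share of all tested people expected to test positive."""
    true_positives = true_prevalence * sensitivity
    false_positives = (1.0 - true_prevalence) * (1.0 - specificity)
    return true_positives + false_positives


def false_positive_share(true_prevalence, sensitivity, specificity):
    """Fraction of positive results that come from people never infected."""
    false_positives = (1.0 - true_prevalence) * (1.0 - specificity)
    total_positives = expected_positive_fraction(
        true_prevalence, sensitivity, specificity
    )
    return false_positives / total_positives


# Hypothetical inputs (assumptions for illustration, not study data):
# 1% of the population truly infected, a test that catches 80% of true
# infections and wrongly flags 1.5% of uninfected people.
share = false_positive_share(
    true_prevalence=0.01, sensitivity=0.80, specificity=0.985
)
print(f"Share of positive results that are false: {share:.0%}")
```

Under these assumed inputs, well over half of the people told they had antibodies would never have had the virus, which is why the choice of test matters as much as the choice of sample.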

Given that we are in a pickle when it comes to counting cases, we might turn to harder data, like the numbers of hospitalizations or deaths. Hospitalizations are limited by system factors that have nothing to do with disease burden. But deaths? Surely deaths are more reliable. You are either alive or not, there’s no squishy subjective judgment at work, and no faulty test to wonder about. Right?

It’s not quite that simple. Until recently, it seemed that most people trusted the counts of COVID-19 deaths reported around the world. Even as China kept changing its definition of what constituted a coronavirus case, its death tally was not questioned. On Mar. 19, breathless headlines declared a “grim milestone” – Italy’s COVID-19 death toll of 3,405 had exceeded China’s reported death toll of 3,249. Reported deaths from China and Europe became the basis for models that predicted the scale of the epidemic in the US and informed social distancing policies across the country. I have written in these pages about the models and warned about how they are being miscommunicated, but I never mentioned my concerns about the death data which drives them.

I can’t remember when I first became skeptical about the deaths. But it was definitely before Mar. 19 because I recall reading the grim milestone headlines on that day and wondering why everybody believed the official number from China. Maybe it was after I personally asked a Chinese statistician colleague now living in the US, who said that in her opinion the numbers from China were made up so as to imply a fatality rate that was between one and two percent.

Still, I told myself, our surveillance systems in the US would never allow this to happen here.

Recently, important data has emerged that strongly suggests that we are undercounting COVID-19 deaths in this country – by tens of thousands. The basis for this is in the numbers: the official count of deaths due to the virus is far below the excess in the overall number of deaths compared to what would be expected at this time of the year based on data from the last few years. On May 13, Nicholas Kristof of The New York Times reviewed the numbers and concluded the COVID-19 death toll had already exceeded 100,000.
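The excess-death comparison can be sketched in a few lines. This is a simplified illustration that assumes the expected number of deaths is just the average of the same period in prior years; official estimates use more careful baselines that adjust for trends and reporting delays. All numbers and function names below are hypothetical.

```python
from statistics import mean


def excess_deaths(observed_deaths, prior_year_deaths):
    """Observed all-cause deaths minus a simple baseline.

    The baseline here is the plain average of the same period in prior
    years, which is an assumption made for this sketch.
    """
    baseline = mean(prior_year_deaths)
    return observed_deaths - baseline


def unexplained_excess(observed_deaths, prior_year_deaths, official_covid_deaths):
    """Excess deaths not accounted for by the official COVID-19 count."""
    return excess_deaths(observed_deaths, prior_year_deaths) - official_covid_deaths


# Hypothetical all-cause death counts for the same month (not real data).
prior_years = [50_000, 51_000, 52_000]   # the same month in three earlier years
observed = 66_000                        # this year's all-cause deaths
official = 10_000                        # officially attributed to COVID-19

print("Excess deaths:", excess_deaths(observed, prior_years))
print("Not in official count:", unexplained_excess(observed, prior_years, official))
```

A positive gap between the excess and the official count is the signal described above: more people died than expected, and the official tally does not account for all of them.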

The counting of excess deaths to shed light on cause-specific mortality is not new. We do it to assess our progress against cancer all the time. Sometimes it is hard to know whether a person with a history of cancer died of, or with, their disease.

Suppose a breast cancer patient dies of a heart attack. Was that death due to cardiovascular disease or to the chemotherapy drug she was given and that is known to weaken the heart? Death certificates can only do so much to tell us, and generally focus on what we call the proximal cause of death, which in this case would be cardiac arrest and not cancer. To avoid undercounting deaths among patients with a specific cancer, we can tally the excess deaths among individuals with a history of that cancer and stack the result against the deaths among comparable individuals (e.g. same age and race) in the population. This technique, developed in the 1950s, is called relative survival. It has been shown to work really quite well for many cancers.
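In its simplest form, relative survival is the ratio of the survival observed among patients to the survival expected in a comparable group from the general population. The sketch below assumes that simple ratio form and uses made-up proportions; real analyses draw the expected survival from population life tables matched on age, sex, race, and calendar year.

```python
def relative_survival(observed_survival, expected_survival):
    """Ratio of patients' survival to that of a comparable population group.

    Both arguments are proportions surviving a fixed period (e.g. 5 years).
    """
    if not 0.0 < expected_survival <= 1.0:
        raise ValueError("expected_survival must be in (0, 1]")
    if not 0.0 <= observed_survival <= 1.0:
        raise ValueError("observed_survival must be in [0, 1]")
    return observed_survival / expected_survival


def excess_deaths_per_1000(observed_survival, expected_survival):
    """Deaths per 1,000 patients beyond what the comparison group predicts."""
    return (expected_survival - observed_survival) * 1000


# Hypothetical 5-year survival proportions (illustration only).
patients = 0.72     # observed among patients with a history of the cancer
comparable = 0.90   # expected among people of the same age and race

print(f"Relative survival: {relative_survival(patients, comparable):.2f}")
print(f"Excess deaths per 1,000: {excess_deaths_per_1000(patients, comparable):.0f}")
```

The point of the design is that no death certificate is consulted: every death counts, whatever cause was written down, and the comparison group absorbs the deaths that would have happened anyway.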

In the case of COVID-19, which can cause death via a diverse array of awful disease symptoms, this makes a lot of sense. Death certificates list first the proximal cause of death – what actually caused you to die.

We know COVID-19 can kill you by suffocating you, giving you a stroke, causing your heart to stop beating or your kidneys to fail. This is what gets listed first. If COVID-19 is known or suspected, this is recorded as a second, third or even fourth cause.

My cousin, an all-knowing MD and research scientist who directs a critical care unit in Denver, tells me that even when patients die of respiratory failure, the best-known COVID-19 cause of death, the proximal cause is listed as Acute Hypoxemic Respiratory Failure (AHRF) and the next cause is the underlying condition, Acute Respiratory Distress Syndrome (ARDS). COVID-19 infection only appears third – so long as the patient is known to have the virus.

Patients not tested, who don’t make it to the hospital, or whose deaths masquerade as being from strokes or heart attacks may not even have COVID-19 listed as an underlying cause in the death certificate.

So it makes sense to look beyond the cause-specific tally to the overall mortality data.

I think we can trust the overall number of deaths. It is possible that heart attack or stroke deaths may have gone up a bit this year even in the absence of COVID-19. And we certainly don’t want to count deaths that were truly due to these causes as COVID-19 deaths just because a person had a confirmed diagnosis and then had a fatal heart attack or stroke.

But, on balance, if we look at all-cause mortality rather than cause-specific deaths, we are left with the inescapable conclusion that the official tally of COVID-19 deaths in the US is far too low. When you look at all-cause mortality, and make a direct comparison of March 2020 to March 2019, or April 2020 to April 2019, you will see an unmistakable increase in death rates. This is despite various conspiracy theories to the contrary, to which I won’t devote any airtime. Let their proponents debate our National Center for Health Statistics.

At this point I hope we are all on the same page about the need to look beyond the official COVID-19 death toll to the cases who don’t get to have the virus listed on their death certificate. But what is happening in some states now is worse; they are changing their definition of a COVID-19 death to only count those for whom the virus is coded as a proximal cause.

I know that this is happening in Colorado because my cousin shared his frustrations about his patients’ causes of death being wrongly recorded by the state. Colorado coroners are objecting, according to a piece run by CBS4 in Denver. A new Scientific American article out this morning cites the Colorado issue.

In Florida, the state forced medical examiners to stop releasing their counts of COVID-19 deaths, which were at odds with official state figures, and fired the creator of the state’s COVID-19 data portal, who claims she was terminated for refusing to manipulate the numbers. And in states like Georgia, where COVID-19 can only be listed as a cause of death in confirmed cases, reducing the number of tests performed will automatically trigger a decline in deaths. Of note: Georgia is no longer releasing data on testing numbers.

We tend to think of numbers as facts and data as absolute; it feels safer that way. Unfortunately, when it comes to COVID-19, it seems that there is no safety in the numbers. When we wake up each morning and check the dashboards of cases and deaths in our local papers or cable TV news, we need to bring a healthy dose of curiosity – and perhaps skepticism – to the table.

If we don’t, we may jeopardize our understanding of the true burden of COVID-19 and compromise our ability to navigate our way out of it.  
