
The Yu Group in the fight against COVID-19: Curating data, predicting deaths, and building partnerships

The Yu Research Group at UC Berkeley Statistics and EECS, led by PI Bin Yu, has worked to curate a comprehensive COVID-19 data repository and predict the spread of deaths caused by the virus. You can read our paper here and visit our project website at covidseverity.com.

University of California, Berkeley
Lawrence Berkeley National Laboratory
Chan Zuckerberg Biohub

in partnership with

Center for Spatial Data Science, University of Chicago

In late March, as the COVID-19 epidemic began to take a firm hold on our lives, the Yu Group began working in collaboration with Response4Life, a non-profit organization striving to deliver critical emergency medical supplies to the hospitals that need them most.

To do so, we sought to forecast the severity of the epidemic for counties and hospitals across the U.S., with three primary goals:

  • Provide open access to a large repository of data related to the spread of COVID-19 in the U.S. that other groups may use to better understand the past and future trajectory of the virus.
  • Produce short-term predictions of the number of deaths at the county level to identify counties whose health-care systems are likely to face significant stress over the coming week to ten days.
  • Develop a hospital-level covid Pandemic Severity Index (cPSI) to identify hospitals most likely to be facing emergency medical supply shortages.

Our Data Processing Pipeline


One of our main contributions is the curation of a large COVID-19 data repository, which we have made publicly available on GitHub. This repository is updated daily with new data and information. Currently, it includes data on COVID-19-related cases, deaths, demographics, health resource availability, health risk factors, social vulnerability, and other COVID-19-related information.

Let’s begin first with what our data processing pipeline looks like at a very high level. At the beginning of this pandemic, there was obviously no existing database labeled “Everything you need to understand the COVID-19 pandemic”. Our job, then, was:

  • To begin curating this data repository.
  • To determine which information and sources are relevant.
  • To clean and validate the data continuously.

We’ve collected over a million records from 15+ sources and counting. We use AWS S3 to store some of the larger data sets, and an EC2 instance running JupyterLab to share results and exploratory data analysis (EDA) and to collaborate on the code.

Other Data Sources

In addition to the USAFacts and NYT data, our repository includes:

Our data pipeline is much more of an iterative and continuous cycle than a static pipeline with a well-defined start and end. We are constantly looking for new data sources and incorporating them into our pipeline and repository. Our data collection and cleaning efforts are ongoing, as really cool new data sources keep popping up on our radar every day.

Key Takeaways

When searching for, cleaning, and combining data from many different sources it can be easy to get lost in the details, but there are three big takeaways we’d like to highlight on the data curation end.

  • Know the audience or end user of your data and give them documentation.

We saw our audience as not just being our team, but also other researchers in the broader community, which meant we needed to create very clear and organized documentation so that the data is easily understandable and user-friendly. With that aim, we’ve spent significant time on documentation for each data source. We also created both abridged and unabridged versions of the data to allow newcomers to get started more quickly.

  • Don’t ignore naming/coding conventions and organization structure for storing and processing your data.

This goes hand-in-hand with the first takeaway and improves the accessibility of the data for all end users. For example, if other researchers want to see how we cleaned the data, they can easily find the clean.py script in the folder for each dataset. If they want to see the raw data, they can find it in the raw folder. Having a good organizational structure is crucial to quickly integrating new members and volunteers into the team, and it’s best to set standards at the beginning rather than spend a lot of effort reorganizing when the project is already well-established.

  • Good data curation takes a lot of effort, and it’s worth it.

Early on, everyone on the team worked on data cleaning in some capacity, but we quickly found that we needed to separate into smaller teams to be more efficient. Two of our team members worked on the data team essentially full-time for almost a month and the results speak for themselves. We’re incredibly happy to see that the broader community is already using our repository, such as for the final project for STAT 542, a graduate-level machine learning course at University of Illinois, and DATA 100 at our very own UC Berkeley, an undergrad class with over 1000 students!

Forecasting County-Level Death Counts

Data curation is an ongoing effort, but once we had the backbone of our data pipeline in place, we turned to the prediction problem.

Five Statistical Methods

Our predictive approach primarily uses the county-level case and death reports provided by USAFacts, along with some county-level demographics and health data. We use five different statistical methods, each of which captures slightly different data trends.

  • Separate-county exponential predictors

The separate predictors aim to model each county independently via a best-fit exponential curve using the most recent 5 days of data.


  • Separate-county linear predictors

The linear predictor is similar to the separate-county exponential predictor, but fits a simple linear model rather than an exponential curve.
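As a rough illustration of the two separate-county predictors, the sketch below fits an exponential curve and a straight line to the most recent five days of one county's cumulative deaths and extrapolates k days ahead. It is a simplification under stated assumptions: the exponential curve is fit by ordinary least squares on log(1 + deaths), which may differ from the fitting procedure actually used, and the function names and the toy series are hypothetical.

```python
import numpy as np

def fit_exponential(deaths, k, window=5):
    """Extrapolate k days ahead from an exponential curve fit to the last `window` days.

    Assumption: the curve is fit by least squares on log(1 + deaths).
    """
    y = np.asarray(deaths[-window:], dtype=float)
    days = np.arange(window)
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(days, np.log1p(y), deg=1)
    return float(np.expm1(intercept + slope * (window - 1 + k)))

def fit_linear(deaths, k, window=5):
    """Extrapolate k days ahead from a straight line fit to the last `window` days."""
    y = np.asarray(deaths[-window:], dtype=float)
    days = np.arange(window)
    slope, intercept = np.polyfit(days, y, deg=1)
    return float(intercept + slope * (window - 1 + k))

# Toy cumulative death counts for a single county (not real data).
county_deaths = [3, 5, 8, 12, 18, 27, 40]
exp_pred = fit_exponential(county_deaths, k=3)
lin_pred = fit_linear(county_deaths, k=3)
```

On a rapidly growing series like the toy one, the exponential fit extrapolates well above the linear fit, which is one reason to keep both around.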


  • Shared-county exponential predictor

The shared predictor fits a single exponential model to the data from all of the counties simultaneously and then predicts death counts for individual counties.


  • Demographics shared-county exponential predictor

The demographics shared predictor is similar to the shared-county exponential predictor, but also includes various county demographic and health-related predictive features.

  • Expanded shared-county exponential predictor

The expanded shared predictor is similar to the “shared” predictor, but also includes COVID-19 case numbers and neighboring county cases and deaths as predictive features.

  • Combined Linear and Exponential Predictors (CLEPs)

Ultimately, we found that the approach that worked best was to use an ensemble of the five models listed above to flexibly fit the COVID-19 trend. We use a weighting scheme previously developed by Dr. Yu and collaborators for lossless audio compression. This exponential weighting term $w_t^m$ for predictor $m$ applied on day $t$ is given by

$$w_t^m \propto \exp\left(-c(1-\mu)\sum_{i=t_0}^{t-1}\mu^{t-i}\,\ell(\hat{y}_i^m, y_i)\right),$$

where $\mu \in (0,1)$ and $c>0$ are tuning parameters, $t_0$ represents some past time point, and $t$ represents the day on which the prediction is calculated. Since $\mu<1$, the $\mu^{t-i}$ term gives greater influence to more recent predictive performance.

Note that the loss terms $\ell(\hat{y}_i^m, y_i)$ used in the weights are calculated based on the three-day predictions from seven predictors built over the course of a week. We chose the past week’s 3-day performance in the weights since it yielded good performance for our ensemble predictor for predicting death counts several days in the future. In practice, our loss function is

$$\ell(\hat{y}_i^m, y_i) = \left|\log(1+\hat{y}_i^m) - \log(1+y_i)\right|,$$

where the log is taken to help prevent vanishing weights due to the heavy-tailed nature of our error distribution.

It’s important to note that our weights are calculated at the county level, so each county uses its own distinct ensemble of the five methods above. In practice, we find that the best model is a combination of the expanded shared predictor and the linear predictor.
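To make the weighting scheme concrete, here is a minimal sketch of per-county ensemble weights. It assumes weights proportional to exp(-c(1-μ) Σ μ^(t-i) ℓ(ŷ_i^m, y_i)), summed over days i = t0, ..., t-1, with the loss ℓ taken as the absolute difference of log(1 + count). The function names and the default values of c and μ are placeholders for illustration, not tuned values from the project.

```python
import numpy as np

def clep_weights(past_preds, past_truth, c=1.0, mu=0.5):
    """Normalized ensemble weights for one county.

    past_preds: shape (n_models, n_days), each model's predictions for days t0..t-1
    past_truth: shape (n_days,), observed counts for the same days
    c, mu: tuning parameters (placeholder defaults, assumed for illustration)
    """
    past_preds = np.asarray(past_preds, dtype=float)
    past_truth = np.asarray(past_truth, dtype=float)
    n_days = past_truth.shape[0]
    # Assumed loss: absolute difference on the log(1 + count) scale.
    loss = np.abs(np.log1p(past_preds) - np.log1p(past_truth))
    # Day i gets discount mu**(t - i); the oldest day has exponent n_days, the newest 1.
    discount = mu ** np.arange(n_days, 0, -1)
    score = -c * (1.0 - mu) * (loss * discount).sum(axis=1)
    # Subtract the max before exponentiating for numerical stability.
    w = np.exp(score - score.max())
    return w / w.sum()

def clep_predict(weights, new_preds):
    """Weighted combination of the individual models' new predictions."""
    return float(np.dot(weights, new_preds))

# Toy example: model 0 tracked the truth exactly, model 1 overshot.
truth = [10, 12, 15]
preds = [[10, 12, 15], [20, 30, 45]]
w = clep_weights(preds, truth)
combined = clep_predict(w, [18, 60])
```

Because the weights are computed from each county's own recent errors, two counties can end up with very different mixtures of the same underlying predictors.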

Prediction Results and Intervals

Our five-day predictions are tracking well with current trends.

Predicted versus actual number of deaths May 02

However, we weren’t satisfied to stop at point estimates. To get a sense of the variability of our predictions, we use maximum (absolute) error prediction intervals (MEPI):

5 day forecasts and prediction intervals May 02

These intervals are formed in three steps:

  1. Find the normalized error of our predictors in the past:


  2. Find the maximum error over the past 5 days:


  3. Form the interval for predictions for k days in the future:
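One way to make the three steps concrete is sketched below. The specific formulas are assumptions for illustration: the normalized error is taken as |y - ŷ| / |ŷ|, the maximum is taken over the five most recent days, and the interval is ŷ(1 ± Δmax) around the k-day-ahead prediction, with the lower end floored at the latest observed cumulative count since cumulative deaths cannot decrease. The function name `mepi` and the toy numbers are hypothetical.

```python
import numpy as np

def mepi(past_preds, past_truth, future_pred, window=5):
    """Maximum-error prediction interval around `future_pred` (illustrative sketch).

    past_preds: earlier predictions for recent days (assumed strictly positive)
    past_truth: observed cumulative counts for the same days
    future_pred: point prediction for k days in the future
    """
    preds = np.asarray(past_preds, dtype=float)[-window:]
    truth = np.asarray(past_truth, dtype=float)[-window:]
    # Step 1: normalized absolute error of past predictions.
    delta = np.abs(truth - preds) / np.abs(preds)
    # Step 2: maximum normalized error over the window.
    delta_max = float(delta.max())
    # Step 3: scale the new prediction by (1 -/+ delta_max);
    # floor the lower end at the latest observed cumulative count (assumption).
    lower = max(future_pred * (1.0 - delta_max), float(truth[-1]))
    upper = future_pred * (1.0 + delta_max)
    return lower, upper

# Toy numbers: the worst recent miss was 10%, so the interval is roughly +/-10%.
lower, upper = mepi(
    past_preds=[100, 110, 120, 130, 140],
    past_truth=[90, 110, 126, 130, 147],
    future_pred=200,
)
```

Using the maximum rather than an average of recent errors makes the interval deliberately conservative: one bad recent day is enough to widen it.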


Using this method, we see good empirical coverage for our historical prediction intervals:

Histogram of Prediction Interval Coverage

From County-Level Predictions to the Hospital-Level

At this point we have predicted deaths at the county level, but our main deliverable as part of our partnership with Response4Life was to predict demands for emergency medical supplies at hospitals across the country. This is no small task, made all the more difficult by the fact that hospital-level case and death counts are not available to the best of our knowledge.

Covid Pandemic Severity Index (cPSI)


For this reason, we disaggregate our county-level predictions to the hospital level according to each hospital’s share of the total hospital employees in its county. We then assign a three-level covid Pandemic Severity Index (cPSI) to each hospital using its predicted number of cumulative deaths over the next five days.
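The disaggregation step can be sketched as follows. This is an illustration only: the column names, the `hospital_severity` function, and especially the cutoffs that separate the three levels are hypothetical, since the text does not state how the levels are chosen.

```python
import pandas as pd

def hospital_severity(hospitals, county_pred, cutoffs=(1.0, 10.0)):
    """Split county-level predicted deaths across hospitals by employee share.

    hospitals: DataFrame with columns "hospital", "county", "employees"
    county_pred: dict mapping county -> predicted cumulative deaths (next five days)
    cutoffs: hypothetical thresholds separating severity levels 1, 2 and 3
    """
    df = hospitals.copy()
    county_total = df.groupby("county")["employees"].transform("sum")
    df["share"] = df["employees"] / county_total
    df["pred_deaths"] = df["share"] * df["county"].map(county_pred)
    # Intervals are right-inclusive: (-inf, c1], (c1, c2], (c2, inf).
    bins = [-float("inf"), *cutoffs, float("inf")]
    df["cpsi"] = pd.cut(df["pred_deaths"], bins=bins, labels=[1, 2, 3]).astype(int)
    return df

# Toy data: two hospitals share one county, a third is alone in another.
hospitals = pd.DataFrame({
    "hospital": ["A", "B", "C"],
    "county": ["06037", "06037", "06085"],
    "employees": [300, 100, 50],
})
result = hospital_severity(hospitals, {"06037": 40.0, "06085": 0.5})
```

Employee counts stand in for hospital size here because hospital-level case and death counts are not available; any other proxy for capacity could be swapped into the same structure.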

Covid Pandemic Surge Index (cPSUI)


The cPSI gives more weight to hospitals in counties with the highest number of cumulative deaths, so we also developed an index to reflect those counties with the highest numbers of new deaths. We know that roughly half of the people on ventilators die and a hospital’s number of ventilators is roughly equal to the number of ICU beds.

Informing the Distribution of PPE

U.S. map of severity index
Predicted new deaths for LA and Santa Clara counties in California, 05/10-05/16 (predicted on 05/10/2020)

Using our indices, our partners at Response4Life can make informed decisions on how to distribute emergency medical supplies manufactured by a network of small makers to hospitals around the country. At the time of writing, Response4Life has distributed 65,000 items of personal protective equipment to 25 recipients in 15 states (and another 500,000 items outside the US). We’re incredibly proud to be partnered with an organization making such a positive impact.

What’s next?

This is a question that people across the world are asking every day, and the Yu Group is no different. We intend to stay engaged in the fight to protect our healthcare workers and respond to the spread of COVID-19.

More (and More!) Collaboration

CSDS U.S. COVID-19 Atlas

We’re very excited about our collaboration with the Center for Spatial Data Science at the University of Chicago to add our predictions and indices to their excellent U.S. COVID-19 Atlas. We see collaborations like this one, which unite expertise from groups across the country, as crucial to the fight against COVID-19.

For updates and further information on this ongoing project, visit our website at covidseverity.com.