# The Yu Group in the fight against COVID-19: Curating data, predicting deaths, and building partnerships

The Yu Research Group at UC Berkeley Statistics and EECS, led by PI Bin Yu, has worked to curate a comprehensive COVID-19 data repository and to forecast deaths caused by the virus. You can read our paper here and visit our project website at covidseverity.com.

In late March as the COVID-19 epidemic began to take a firm hold on our lives, the Yu Group began working in collaboration with Response4Life, a non-profit organization striving to deliver critical emergency medical supplies to the hospitals that need them most.

To support this effort, we sought to forecast the severity of the epidemic for counties and hospitals across the U.S., with three primary goals:

• Provide open access to a large repository of data related to the spread of COVID-19 in the U.S. that other groups may use to better understand the past and future trajectory of the virus.
• Produce short-term predictions of the number of deaths at the county level to identify counties whose health-care systems are likely to face significant stress over the coming week to ten days.
• Develop a hospital-level COVID pandemic severity index (cPSI) to identify hospitals most likely to be facing emergency medical supply shortages.

# Our Data Processing Pipeline

One of our main contributions is the curation of a large COVID-19 data repository, which we have made publicly available on GitHub. This repository is updated daily with new data and information. Currently, it includes data on COVID-19-related cases, deaths, demographics, health resource availability, health risk factors, social vulnerability, and other COVID-19-related information.

Let’s begin with what our data processing pipeline looks like at a high level. At the beginning of this pandemic, there was obviously no existing database labeled “Everything you need to understand the COVID-19 pandemic”. Our job, then, was:

• To begin curating this data repository.
• To determine which information and sources are relevant.
• To clean and validate the data continuously.

We’ve collected over a million records from 15+ sources and counting. We use AWS S3 to store some of the larger data sets, and an EC2 instance running JupyterLab lets us easily share results and exploratory analyses and collaborate on the code.
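As a small illustration of that setup, the boto3 sketch below pushes a larger cleaned file through S3 so it can be pulled from the shared EC2 instance; the bucket and key names are hypothetical placeholders, not our actual configuration.

```python
# Sketch of sharing a large cleaned dataset via S3 with boto3.
# The bucket and key names below are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

# Upload the cleaned file so everyone pulls the same version.
s3.upload_file(
    Filename="data/usafacts/cleaned/deaths.csv",
    Bucket="covid19-severity-data",  # hypothetical bucket
    Key="usafacts/cleaned/deaths.csv",
)

# Download it on the shared EC2 JupyterLab instance.
s3.download_file(
    Bucket="covid19-severity-data",
    Key="usafacts/cleaned/deaths.csv",
    Filename="deaths.csv",
)
```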

## Other Data Sources

In addition to the USAFacts and NYT case and death data, our repository includes many other sources covering demographics, health resource availability, health risk factors, and social vulnerability.

Our data pipeline is much more of an iterative and continuous cycle than a static pipeline with a well-defined start and end. We are constantly looking for new data sources and incorporating them into our pipeline and repository. Our data collection and cleaning efforts are ongoing, as promising new data sources keep appearing on our radar every day.

## Key Takeaways

When searching for, cleaning, and combining data from many different sources, it can be easy to get lost in the details, but there are three big takeaways we’d like to highlight on the data curation end.

• Know the audience or end user of your data and give them documentation.

We saw our audience as not just our team but also other researchers in the broader community, which meant we needed to create clear and organized documentation so that the data is easily understandable and user-friendly. With that aim, we’ve spent significant time on documentation for each data source. We also created both abridged and unabridged versions of the data to allow newcomers to get started more quickly.

• Don’t ignore naming/coding conventions and organization structure for storing and processing your data.

This goes hand-in-hand with the first takeaway and improves the accessibility of the data for all end users. For example, if other researchers want to see how we cleaned the data, they can easily find the clean.py script in the folder for each dataset (a hypothetical sketch of such a script appears after these takeaways). If they want to see the raw data, they can find it in the raw folder. Having a good organizational structure is crucial for quickly integrating new members and volunteers onto the team, and it’s best to set standards at the beginning rather than spend a lot of effort reorganizing once the project is already well-established.

• Good data curation takes a lot of effort; and it’s worth it.

Early on, everyone on the team worked on data cleaning in some capacity, but we quickly found that we needed to split into smaller teams to be more efficient. Two of our team members worked on the data team essentially full-time for almost a month, and the results speak for themselves. We’re incredibly happy to see that the broader community is already using our repository, such as in the final project for STAT 542, a graduate-level machine learning course at the University of Illinois, and in DATA 100 at our very own UC Berkeley, an undergrad class with over 1000 students!
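As promised above, here is a hypothetical skeleton of a per-dataset clean.py illustrating the raw-to-cleaned convention; the function body and column names are illustrative assumptions, not the actual scripts in our repository.

```python
# Hypothetical skeleton of a per-dataset clean.py, illustrating the
# raw/ -> cleaned/ convention. Column names are illustrative
# assumptions, not taken from the actual repository scripts.
import pandas as pd


def clean(raw_path: str = "raw/deaths.csv",
          out_path: str = "cleaned/deaths.csv") -> pd.DataFrame:
    """Read the raw download, standardize keys, and write a cleaned copy."""
    df = pd.read_csv(raw_path)

    # Standardize the county identifier so every dataset joins on the
    # same 5-digit FIPS string.
    df["countyFIPS"] = df["countyFIPS"].astype(str).str.zfill(5)

    # Drop rows without a usable county identifier.
    df = df[df["countyFIPS"] != "00000"]

    df.to_csv(out_path, index=False)
    return df


if __name__ == "__main__":
    clean()
```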

# Forecasting County-Level Death Counts

Data curation is an ongoing effort, but once we had the backbone of our data pipeline in place, we turned to the prediction problem.

## Five Statistical Methods

Our predictive approach primarily uses the county-level case and death reports provided by USAFacts, along with some county-level demographics and health data. We use five different statistical methods, each of which captures slightly different data trends.

• Separate-county exponential predictors

The separate predictors aim to model each county independently via a best-fit exponential curve using the most recent 5 days of data.

$E(\text{deaths}_t \mid t) = e^{\beta_0 + \beta_1 t}$

• Separate-county linear predictors

The linear predictor also models each county independently, but fits a simple linear model rather than an exponential curve.

$E(\text{deaths}_t \mid t) = \beta_0 + \beta_1 t$

• Shared-county exponential predictor

The shared predictor fits the data from all of the counties simultaneously and predicts death counts for individual counties.

$E(\text{deaths}_t \mid t) = e^{\beta_0 + \beta_1 \log(\text{deaths}_{t-1} + 1)}$

• Demographics shared-county exponential predictor

The demographics shared predictor is similar to the shared-county exponential predictor, but also includes various county demographic and health-related predictive features.

$E(\text{deaths}_t \mid t) = e^{\beta_0 + \beta_1 \log(\text{deaths}_{t-1} + 1) + \beta_{d_1} d_1^c + \dots + \beta_{d_m} d_m^c}$

• Expanded shared-county exponential predictor

The expanded shared predictor is similar to the “shared” predictor, but also includes COVID-19 case numbers and neighboring county cases and deaths as predictive features.

$E(\text{deaths}_t \mid t) = e^{\beta_0 + \beta_1 \log(\text{deaths}_{t-1} + 1) + \beta_2 \log(\text{cases}_{t-k} + 1) + \beta_3 \log(\text{neigh\_deaths}_{t-k} + 1) + \dots + \beta_4 \log(\text{neigh\_cases}_{t-k} + 1)}$

• Combined Linear and Exponential Predictors (CLEPs)

Ultimately, we found that the approach that worked best was to use an ensemble of the five models listed above to flexibly fit the COVID-19 trend. We use a weighting scheme previously developed by Dr. Yu and collaborators for lossless audio compression. The exponential weighting term $w_t^m$ for predictor $m$ applied on day $t$ is given by

$w_t^m \propto \exp\left( -c(1 - \mu) \sum_{i=t_0}^{t-1} \mu^{t-i} \, \ell(\hat{y}_i^m, y_i) \right)$

where $\mu \in (0, 1)$ and $c > 0$ are tuning parameters, $t_0$ represents some past time point, and $t$ represents the day on which the prediction is calculated. Since $\mu < 1$, the $\mu^{t-i}$ term gives greater influence to more recent predictive performance.

Note that the loss terms $\ell(\hat{y}_i^m, y_i)$ used in the weights are calculated from the three-day-ahead predictions of seven predictors built over the course of the past week. We chose the past week’s three-day performance since it yielded good results when using the ensemble to predict death counts several days into the future. In practice, our loss function is

$\ell(\hat{y}_i^m, y_i) = \left| \log(1 + \hat{y}_i^m) - \log(1 + y_i) \right|$

where the log is taken to help prevent vanishing weights due to the heavy-tailed nature of our error distribution.

It’s important to note that our weights are calculated at the county level, so each county uses a distinct ensemble of the five methods above. In practice, we find that the best model is a combination of the expanded shared predictor and the linear predictor.
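To make this concrete, here is a minimal sketch of one separate-county exponential fit plus the CLEP weighting under the formulas above. It is an illustration, not the actual implementation in our repository; the log(deaths + 1) fitting scale and the default values of $\mu$ and $c$ are assumptions.

```python
# Minimal sketch of one predictor plus the CLEP weighting scheme.
# Illustrative only; not the actual implementation in our repository.
import numpy as np


def fit_separate_exponential(deaths: np.ndarray) -> float:
    """Fit E[deaths_t | t] = exp(b0 + b1 * t) to the last 5 days of one
    county and predict one day ahead. We fit by least squares on the
    log(deaths + 1) scale (an assumption, to handle zero counts)."""
    t = np.arange(len(deaths) - 5, len(deaths))
    b1, b0 = np.polyfit(t, np.log(deaths[-5:] + 1), deg=1)
    return np.exp(b0 + b1 * len(deaths)) - 1


def clep_loss(y_hat: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Absolute error on the log(1 + x) scale, as in the text."""
    return np.abs(np.log1p(y_hat) - np.log1p(y))


def clep_weights(preds: dict, y: np.ndarray, mu: float = 0.5,
                 c: float = 1.0) -> dict:
    """Exponential weighting w_t^m for each model m: past losses are
    down-weighted by mu^(t - i), so recent performance dominates.
    mu and c are tuning parameters (defaults here are assumptions)."""
    t = len(y)  # predictions for past days i = 0, ..., t - 1 are evaluated
    raw = {}
    for m, y_hat in preds.items():
        decay = mu ** (t - np.arange(t))  # mu^(t - i)
        raw[m] = np.exp(-c * (1 - mu) * np.sum(decay * clep_loss(y_hat, y)))
    total = sum(raw.values())
    return {m: w / total for m, w in raw.items()}  # normalize to sum to 1


# Toy usage for one county with two hypothetical predictors:
y = np.array([3.0, 5.0, 8.0, 12.0, 18.0])
preds = {"linear": np.array([3.0, 6.0, 9.0, 12.0, 15.0]),
         "expanded_shared": np.array([3.0, 5.0, 9.0, 13.0, 19.0])}
print(clep_weights(preds, y))
```

In the toy example, the expanded shared predictor tracks the observed series more closely, so it receives the larger weight.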

## Prediction Results and Intervals

Our five-day predictions are tracking well with current trends.

However, we weren’t satisfied to stop at point estimates. To get a sense of the variability of our predictions, we use maximum (absolute) error prediction intervals (MEPIs), formed in three steps:

1. Find the normalized error of our predictors in the past:

$\Delta_t := |y_t - \hat{y}_t| \,/\, |\hat{y}_t|$

2. Find the maximum error over the past 5 days:

$\Delta_{\max} := \max_{0 \le j \le 4} \Delta_{t-j}$

3. Form the interval for predictions $k$ days in the future:

$\widehat{\text{PI}} := \left[ \max\{\hat{y}_{t+k}(1 - \Delta_{\max}),\, y_t\},\ \hat{y}_{t+k}(1 + \Delta_{\max}) \right]$
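A minimal sketch of these three steps for a single county, assuming aligned arrays of past observed and predicted deaths (not the actual repository code):

```python
# Minimal sketch of a maximum (absolute) error prediction interval
# (MEPI) for one county. Inputs are assumed: y and y_hat are aligned
# arrays of observed and predicted deaths for the most recent days.
import numpy as np


def mepi(y: np.ndarray, y_hat: np.ndarray, y_hat_future: float) -> tuple:
    """Interval for a prediction y_hat_future made k days ahead."""
    # Step 1: normalized errors of past predictions.
    delta = np.abs(y - y_hat) / np.abs(y_hat)

    # Step 2: maximum normalized error over the past 5 days.
    delta_max = delta[-5:].max()

    # Step 3: the interval, with the lower end clipped at today's
    # observed count y_t, as in the formula above.
    lower = max(y_hat_future * (1 - delta_max), y[-1])
    upper = y_hat_future * (1 + delta_max)
    return lower, upper


# Toy usage:
y = np.array([10.0, 13.0, 17.0, 22.0, 28.0])
y_hat = np.array([11.0, 12.0, 18.0, 21.0, 30.0])
print(mepi(y, y_hat, y_hat_future=35.0))
```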

Using this method, we see good empirical coverage for our historical prediction intervals.

# From County-Level Predictions to the Hospital-Level

At this point we have predicted deaths at the county level, but our main deliverable as part of our partnership with Response4Life was to predict demand for emergency medical supplies at hospitals across the country. This is no small task, made all the more difficult by the fact that, to the best of our knowledge, hospital-level case and death counts are not publicly available.

## COVID Pandemic Severity Index (cPSI)

For this reason, our approach is to disaggregate our county-level predictions to the hospital level based on each hospital’s share of the hospital employees in its county. We then assign a three-level COVID pandemic severity index (cPSI) to each hospital using the number of predicted cumulative deaths over the next five days.
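A hedged sketch of that disaggregation follows; the column names and the three-level thresholds below are illustrative assumptions, not the exact cutoffs in our pipeline.

```python
# Sketch of disaggregating county-level predicted deaths to hospitals
# by each hospital's share of county hospital employees, then binning
# into a three-level index. Thresholds are illustrative assumptions.
import pandas as pd


def hospital_severity(hospitals: pd.DataFrame,
                      county_pred_deaths: pd.Series) -> pd.DataFrame:
    """hospitals needs columns countyFIPS and employees;
    county_pred_deaths is indexed by countyFIPS."""
    df = hospitals.copy()

    # Each hospital's share of its county's hospital employees.
    county_total = df.groupby("countyFIPS")["employees"].transform("sum")
    df["share"] = df["employees"] / county_total

    # Disaggregate the county's predicted cumulative deaths over the
    # next five days proportionally to that share.
    df["pred_deaths"] = df["share"] * df["countyFIPS"].map(county_pred_deaths)

    # Illustrative three-level index (1 = low, 3 = high).
    df["cPSI"] = pd.cut(df["pred_deaths"],
                        bins=[-1, 1, 10, float("inf")],
                        labels=[1, 2, 3])
    return df
```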

## COVID Pandemic Surge Index (cPSUI)

The cPSI gives more weight to hospitals in counties with the highest numbers of cumulative deaths, so we also developed an index that reflects counties with the highest numbers of new deaths. We know that roughly half of the people placed on ventilators die, and that a hospital’s number of ventilators is roughly equal to its number of ICU beds; together, these two facts let us translate predicted new deaths into an estimate of ventilator demand relative to each hospital’s ICU capacity.
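Putting those two rough equivalences together, a surge-style index can compare estimated ventilator demand against ICU capacity. The sketch below is a hedged illustration of that reasoning; our exact cPSUI formula may differ.

```python
# Hedged sketch of a surge-style index: since roughly half of
# ventilated patients die, twice the predicted new deaths approximates
# ventilator demand, and ICU beds proxy for ventilators. Our exact
# cPSUI formula may differ.
def surge_index(pred_new_deaths: float, icu_beds: int) -> float:
    """Estimated ventilator demand relative to ICU capacity."""
    est_ventilator_demand = 2.0 * pred_new_deaths
    return est_ventilator_demand / max(icu_beds, 1)  # guard against zero beds


# Toy usage: 12 predicted new deaths, 20 ICU beds.
print(surge_index(pred_new_deaths=12, icu_beds=20))  # 1.2: demand exceeds capacity
```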

## Informing the Distribution of PPE

Using our indices, our partners at Response4Life can make informed decisions about how to distribute emergency medical supplies manufactured by a network of small makers to hospitals around the country. At the time of writing, Response4Life has distributed 65,000 items of personal protective equipment to 25 recipients in 15 states (and another 500,000 items outside the US). We’re incredibly proud to be partnered with an organization making such a positive impact.

# What’s next?

This is a question that people across the world are asking every day, and the Yu Group is no different. We intend to stay engaged in the fight to protect our healthcare workers and respond to the spread of COVID-19.

## More (and More!) Collaboration

We’re very excited about our collaboration with the Center for Spatial Data Science at the University of Chicago to add our predictions and indices to their excellent U.S. COVID-19 Atlas. We see collaborations like this one, which unite expertise from groups across the country, as crucial in the fight against COVID-19.