CookUp Analytics - Covid-19 Forecasting

A CRITICAL EVALUATION OF DIVERSE REGRESSION PARADIGMS IN FORECASTING COVID-19 OUTCOMES

I. Introduction

A. Executive Summary

Viruses are microscopic infectious agents that rely on living cells to reproduce and multiply. They can infect a myriad of organisms, ranging from animals and plants to microorganisms like bacteria and archaea (Lodish et al., 2000). The nature of viruses is unique; they exist on the boundary between living and non-living. They lack cellular structures and cannot carry out metabolic processes by themselves, but when they infect a susceptible host cell, they can direct the cell machinery to produce more viruses. One such virus, SARS-CoV-2, led to the global pandemic known as COVID-19. First identified in Wuhan, China in late 2019, the outbreak escalated rapidly into a pandemic that spread across the entire globe (Hui et al., 2020). While the primary symptoms were respiratory, severe cases often exhibited multi-organ complications leading to a higher mortality rate (Huang et al., 2020). The repercussions of the COVID-19 pandemic, however, extended beyond the medical realm. Economically, it ignited a global recession (Nicola et al., 2020). Socially, the pandemic compelled us to redefine norms, with lockdowns and social distancing becoming part and parcel of our lives (Brooks et al., 2020). It unmasked the stark disparities within healthcare and socio-economic systems and propelled mental health issues to an all-time high (Pfefferbaum et al., 2020). In essence, the COVID-19 pandemic precipitated a paradigm shift across various facets of human existence.

The healthcare system has been stressed beyond precedent, with hospitals and healthcare professionals straining to accommodate the influx of patients, leading to shortages of resources and the need for crisis management (Ranney et al., 2020). Due to the airborne nature of the virus, millions of people have been infected, leading to significant morbidity and mortality (Johns Hopkins University, 2023). Education has been significantly disrupted, as schools and universities shifted to online learning to safeguard public health, introducing both opportunities and challenges. The digital divide was highlighted, affecting the quality of education received by students without reliable internet access or technology (Burgess et al., 2020). The pandemic drastically transformed the global economy. Countries have faced a recession due to lockdown measures, which disrupted global supply chains and led to a rise in unemployment rates (International Monetary Fund, 2020). Certain sectors such as tourism and hospitality experienced significant declines, while industries such as e-commerce and remote working technologies experienced growth (Gössling et al., 2021). Societal shifts were also evident. In many communities, the pandemic has exacerbated existing inequalities, hitting vulnerable populations the hardest (Yancy, 2020). Meanwhile, various forms of racial and ethnic discrimination related to the virus have been reported across the world (Devakumar et al., 2020). The psychological impact of the pandemic cannot be overstated. Prolonged periods of isolation, fear, and anxiety have contributed to a surge in mental health issues globally (Pfefferbaum et al., 2020). Furthermore, the pandemic has generated an infodemic, with a rapid spread of misinformation causing public confusion and mistrust (Cuan-Baltazar et al., 2020). COVID-19 has also highlighted the importance of global cooperation and coordination. The scientific community globally embarked on an unprecedented race to develop vaccines and treatments, with several vaccines developed in record time, demonstrating the power of collaboration and shared scientific knowledge (Lurie, 2020).

In conclusion, the impacts of COVID-19 have been far-reaching and transformative, leading to changes in health, education, economy, society, and global cooperation. Future research should focus on understanding the long-term impacts and how to mitigate such future pandemics. The primary aim of this study is to utilize various regression models to predict COVID-19 cases and deaths. Being able to predict how the virus is behaving on a macro level is crucial due to its inherent ability to inform public health responses and policy decisions in real-time. It enables resource allocation, healthcare preparation, and implementation of containment measures. Furthermore, the insights derived from these models can enhance our understanding of the virus’s spread dynamics. As such, they hold significant potential for future pandemics. These predictive models could be promptly applied to any new infectious disease outbreak, allowing us to anticipate disease spread and intervene more effectively, thus potentially reducing the societal and health impacts of future pandemics. By learning from the COVID-19 experience, we are better equipped to harness data and predictive modelling techniques to navigate through future public health crises.

II. Process

A. Data Gathering

Data features were retrieved from the Johns Hopkins University open-source GitHub repository. Data characteristics and sources can be found in the Data Sources section for each data table retrieved. The data covers the period 01/22/2020 – 03/09/2023. The data gathering process focused on the following categories.

Global Daily Counts

Number of Cases: This refers to the total number of confirmed COVID-19 infections in a specific region (e.g., city, state, country, globally) during a specific time frame. It includes both people who have symptoms and those who are asymptomatic or pre-symptomatic but have tested positive for the virus. However, it's important to note that the reported number of cases is often an underestimate of the true number of infections due to limitations in testing capacity, different testing strategies across regions, and the presence of asymptomatic or mildly symptomatic individuals who do not get tested.

Number of Deaths: This refers to the total number of individuals who have died after being diagnosed with COVID-19 in a specific region during a specific time frame. These are often individuals who had severe symptoms and complications from the virus, although some deaths can also occur in individuals with mild symptoms. The reported number of deaths due to COVID-19 can also be an underestimate due to discrepancies in reporting, especially in the early stages of the pandemic or in regions with weaker health infrastructure. Additionally, some deaths may be indirectly caused by COVID-19, for example, due to healthcare systems being overwhelmed and unable to provide adequate care for other conditions, and these may not always be included in the official count.

B. Data Cleaning

In this study, we performed a rigorous data cleaning process to ensure the high quality and reliability of the data. The cleaning procedures were based on custom-made business rules, specifically tailored to meet the research objectives. This crucial step allowed us to address potential inconsistencies, errors, and missing values present in the dataset. By rectifying these issues, our primary aim was to significantly enhance the accuracy of the subsequent analysis and predictive modeling. It is important to emphasize that the data cleaning was primarily focused on preparing the dataset for visualization and modeling purposes. The following description outlines the specific steps taken during the cleaning process dedicated to the modeling section.

Assumptions/Business Rules

Diamond Princess (Cruise Ship)
MS Zaandam (Cruise Ship)
Summer Olympics 2020 (Event)
Winter Olympics 2022 (Event)
North Korea (Lack of Valid Entries)

Negative Values: During the data analysis, we found instances of negative cases and deaths per day for various countries. This was unexpected, since the running sum of cases and deaths should never decrease over time; a negative daily value means the cumulative total reported for that day was lower than the total for the previous day. We investigated and identified a total of 372 rows with negative values. Of these 372 rows, 14 were associated with countries that reported negative cases and deaths on the same date. To protect the integrity of the modeling process, we removed all 372 rows from the dataset before any further analysis, so that the data used for modeling was free from this inconsistency.

Continent

Europe – 53 (30.8%)
Africa – 33 (19.2%)
Asia – 32 (18.6%)
South America – 21 (12.2%)
North America – 17 (9.9%)
Australia/Ocean – 16 (9.3%)

Country (Top 5)

France – 12 (7%)
Peru – 10 (5.8%)
Czechia – 5 (2.9%)
Spain – 5 (2.9%)
Mexico – 5 (2.9%)

Continent

Europe – 115 (57.8%)
Africa – 20 (10.1%)
Asia – 24 (12.1%)
South America – 9 (4.5%)
North America – 20 (10.1%)
Australia/Ocean – 11 (5.5%)

Country (Top 5)

Switzerland – 18 (9.0%)
Lithuania – 16 (8.0%)
Czechia – 13 (6.5%)
Israel – 9 (4.5%)
Spain – 8 (4.0%)

Aggregating Values: The data collected comprised daily counts for each country. To consolidate this information, it was aggregated by date, summing the number of cases and deaths across countries to obtain a single global count per date. Because the source data is cumulative (a running sum), all daily counts were expected to be non-negative. However, some dates showed negative counts and were removed. A negative count occurs when the cumulative total for a particular date is lower than the total for the previous day, which contradicts the expected behavior of a running sum. In total, there were 250 dates with negative counts, and 108 of these dates exhibited both negative cases and negative deaths. This irregularity emphasizes the need for careful handling and verification of data during analysis.
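The cleaning and aggregation steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the row layout (date, country, daily cases, daily deaths), the function names, and the sample numbers are all hypothetical.

```python
from collections import defaultdict

def daily_from_cumulative(cumulative):
    """Convert a running total into day-over-day changes (first day kept as-is)."""
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

def clean_and_aggregate(rows):
    """Drop rows with a negative daily count, then sum the rest by date.

    rows: iterable of (date, country, daily_cases, daily_deaths) tuples
          (an assumed layout, chosen only for this sketch).
    Returns ({date: (total_cases, total_deaths)}, number_of_dropped_rows).
    """
    totals = defaultdict(lambda: [0, 0])
    dropped = 0
    for date, _country, cases, deaths in rows:
        if cases < 0 or deaths < 0:
            dropped += 1          # inconsistent row: excluded from modeling
            continue
        totals[date][0] += cases
        totals[date][1] += deaths
    return {d: tuple(v) for d, v in totals.items()}, dropped

# Hypothetical cumulative counts; the dip on the third day mimics a
# downward revision of the running total by the reporting source.
cumulative_cases = [10, 25, 22, 40]
daily_cases = daily_from_cumulative(cumulative_cases)

rows = [
    ("2020-03-01", "A", 10, 0),
    ("2020-03-01", "B", 5, 1),
    ("2020-03-02", "A", 15, 1),
    ("2020-03-02", "B", -3, 0),   # negative daily count, removed
]
by_date, dropped = clean_and_aggregate(rows)
```

In this toy input, the third element of daily_cases is negative because the running total fell from 25 to 22, which is exactly the kind of row the business rule removes.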

Covid-19 Global Tracking: 01/22/2020 - 03/09/2023

C. Data Modeling

With data gathering and cleaning complete, the next stage of our research was the selection of the regression model. This stage required a detailed assessment and comparison of several algorithms to identify the one best suited to the nature of our problem and dataset. The goal of model selection was to find a model capable of capturing the inherent trends and associations within the data. Each algorithm brings its own advantages and limitations, so it was important to choose a model that could deliver precise and meaningful results for our investigation. By considering the attributes of the dataset and the goals of the study, we confirmed that the selected model was appropriate for our research, enabling us to draw conclusions and offer recommendations. Two use cases were developed: one to predict the number of cases and one to predict the number of deaths. The dependent (target) variable was the outcome of these predictions for each use case (number of cases and number of deaths), and the independent (predictor) variable was the date.

Model Selection

Regression: Regression analysis serves as a fundamental technique in predictive modeling. It aims to model the relationships between a target variable and one or more predictor variables (James et al., 2013). In this case study, we present an overview of three regression methods: Linear Regression, Polynomial Regression, and Extreme Gradient Boosting (XGBoost) Regression. We conduct a comparative analysis of their advantages and disadvantages in predicting COVID-19 cases and deaths. To evaluate the performance of these models, we employ two commonly used metrics: R-squared (R²) and Mean Absolute Error (MAE). These metrics are justified as suitable measures for this specific use case. While there are other evaluation techniques available, R² and MAE prove to be well-suited for our analysis. To ensure robust evaluation, the data is split into training and testing sets independently for both cases and deaths. The split ratio is 80% for training and 20% for testing. Additionally, a validation set is created to find the optimal polynomial for the Polynomial Regression, accounting for 10% of the total data. We extract and shuffle the data randomly to maintain the validity of our analysis. Crucially, the same dataset is used for each use case (number of cases and number of deaths) to ensure fair and accurate comparisons among the different regression methods. This approach guarantees that any observed differences in results can be attributed to the performance of the models rather than the variability of the data.
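The shuffled split described above can be sketched as follows. The text gives 80% training, 20% testing, and a validation set of 10% of the total data, but does not say where the validation rows come from; this sketch assumes they are carved out of the training portion, and the function name and seed are hypothetical.

```python
import random

def split_indices(n, test_frac=0.20, val_frac=0.10, seed=42):
    """Shuffle row indices and cut them into train / validation / test.

    Assumption for this sketch: the validation rows (10% of all rows) are
    taken from the 80% training portion, leaving 70% for fitting.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # seeded so the split is reproducible
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
```

Reusing the same index lists for every model is one way to keep the comparison between regression methods on identical data.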

Linear Regression is a statistical model that assumes a linear relationship between the predictor variable (X) and the single target variable (Y) (Freedman, 2009). The aim is to find a linear function that, when the values for the predictor variables are given, will predict the target variable with minimal error. Mathematically, this relationship is represented as Y = mX + b, where Y is the target variable, m is the slope, X is the predictor variable, and b is the intercept.

Advantages

Linear Regression is simple, interpretable, and computationally efficient, and does not require tuning of hyperparameters (Freedman, 2009).

Disadvantages

Linear Regression assumes linearity, independence of errors, homoscedasticity, and normality, which may not hold true for all real-world data. It may underperform with non-linear data and is sensitive to outliers (James et al., 2013).
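As a small illustration of Y = mX + b, the slope and intercept of a one-predictor least-squares fit can be computed in closed form. The sketch below assumes the date has been encoded as a day index (0, 1, 2, ...); that encoding, the function name, and the sample values are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope m, intercept b)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    b = y_mean - m * x_mean
    return m, b

days = [0, 1, 2, 3, 4]           # hypothetical day index standing in for the date
counts = [3, 5, 7, 9, 11]        # hypothetical target, exactly 2 * day + 3
m, b = fit_line(days, counts)
forecast_day_5 = m * 5 + b       # prediction for the next day
```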

Polynomial Regression, an extension of Linear Regression, fits a non-linear relationship between X and Y using a polynomial (Kutner et al., 2004). This approach can model relationships that are not merely straight lines, allowing for curves and other complex patterns. This relation is represented as Y = a + b1X + b2X² + b3X³ + … + bnXⁿ + ε, where Y is the target variable, a is the Y-intercept, b1…bn are the coefficients of the polynomial terms, X is the predictor variable, n is the degree of the polynomial, and ε is the error term. With n=1 the model is Linear Regression, with n=2 it is Quadratic Regression, and so forth. The higher the degree, the more complex the curve that can be fit to the data.

Advantages

Polynomial Regression can model non-linear relationships and provide better fits than Linear Regression for certain data.

Disadvantages

This approach can easily lead to overfitting if the degree of the polynomial is high, and deciding the correct degree can be challenging. It is also less interpretable and more computationally expensive than Linear Regression (Kutner et al., 2004).

XGBoost is a powerful ensemble machine learning algorithm that builds and combines weak prediction models, typically decision trees, in a stage-wise way to construct a strong predictive model (Chen et al., 2016). While it’s not possible to capture the entire algorithm in a simple formula, the basic idea is to construct the model as an additive sequence, f(x) = b + h1(x) + h2(x) + … + hn(x), where f(x) is the predicted outcome, b is a constant, and h1(x)…hn(x) are the weak learner functions (usually decision trees) that are added to the model one at a time. At each stage, XGBoost adds the function hi(x) that minimizes the overall loss function, which includes both a measure of prediction error and regularization terms to prevent overfitting.

Advantages

XGBoost is versatile, efficient, and typically yields high performance, dealing well with a variety of structured data. It can handle missing values and provides a feature importance score.

Disadvantages

XGBoost can be prone to overfitting, particularly with noisy data. It requires careful tuning of hyperparameters and is computationally intensive. Furthermore, its predictions are less interpretable compared to simpler models like Linear Regression (Chen et al., 2016).
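The additive form f(x) = b + h1(x) + … + hn(x) can be illustrated with a deliberately simplified gradient-boosting loop. This is not XGBoost itself: it uses depth-1 trees (stumps) on a single predictor, squared error only, and no regularization terms. The function names and the sample data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:                 # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_estimators=50, learning_rate=0.3):
    """Build f(x) = b + learning_rate * (h1(x) + ... + hn(x)) stage by stage."""
    b = sum(ys) / len(ys)                          # constant starting prediction
    stumps = []
    preds = [b] * len(ys)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)               # weak learner fit to current errors
        stumps.append(h)
        preds = [p + learning_rate * h(x) for p, x in zip(preds, xs)]
    return lambda x: b + learning_rate * sum(h(x) for h in stumps)

xs = list(range(10))                               # hypothetical day index
ys = [x * x for x in xs]                           # hypothetical non-linear target
model = boost(xs, ys)

baseline = sum(ys) / len(ys)
baseline_sse = sum((y - baseline) ** 2 for y in ys)
model_sse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
```

Each round fits a new stump to the errors left by the previous rounds, which is why more trees and a larger learning rate both increase how closely the training data is followed.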

R² represents a statistical measure that depicts how close the data are to the fitted regression line, essentially quantifying the goodness of fit (Singer et al., 2003). It typically ranges from 0 to 1, with a higher value indicating a better fit, although it can be negative on held-out data when a model predicts worse than the mean of the target. R² is calculated from the sum of squared residuals of the regression model (SSR) and the total sum of squares (SST). It measures the proportion of variance in the target variable that can be explained by the predictor variable(s). A high R² does not by itself imply that the model is good at prediction; it indicates only that the features can explain the variance in the target variable.

The general formula is R² = 1 – (SSR/SST)

SSR = ∑ (Yi – Ŷi) ²; where Yi is the actual value and Ŷi is the predicted value.
SST = ∑ (Yi – Yavg) ²; where Yi is the actual value and Yavg is the average value of Y.

Example: Suppose a restaurant owner has collected data on the number of pizzas sold (Y) and the temperature outside (X, in degrees Fahrenheit) for five days.

The following data for X and Y was used:

Suppose a simple Linear Regression model has been fit to this data and the predicted number of pizzas sold (Ŷ) for each temperature is as follows:

We first calculate the SSR – the sum of the squared differences between the actual and predicted values:
SSR = ∑ (Yi – Ŷi) ² = (200-210) ² + (220-230) ² + (250-250) ² + (275-270) ² + (300-290) ² = 100 + 100 + 0 + 25 + 100 = 325

Next, we calculate the average value of Y (Yavg), which is:
(200+220+250+275+300)/5 = 249

Then we calculate the SST – the sum of the squared differences between the actual values and their average:
SST = ∑ (Yi – Yavg) ² = (200-249) ² + (220-249) ² + (250-249) ² + (275-249) ² + (300-249) ² = 2401 + 841 + 1 + 676 + 2601 = 6520

Finally, R² is calculated as:
R² = 1 – (SSR/SST) = 1 – (325/6520) = 0.9502
In this example, R² is 0.95, indicating that the model explains approximately 95% of the variance in the number of pizzas sold.
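The same calculation can be written as a short Python function. The actual and predicted values below are the ones from the pizza example; the function name is chosen only for this sketch.

```python
def r_squared(actual, predicted):
    """R² = 1 - SSR/SST, with SSR the squared residuals and SST the total variation."""
    mean_y = sum(actual) / len(actual)
    ssr = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    sst = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ssr / sst

actual = [200, 220, 250, 275, 300]      # pizzas sold
predicted = [210, 230, 250, 270, 290]   # model predictions
r2 = r_squared(actual, predicted)       # roughly 0.95
```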

MAE quantifies the average of the absolute differences between the predicted and actual values. It provides a straightforward and interpretable measure of prediction error (Willmott et al., 2005). MAE is particularly useful because it is easy to understand and calculate. It’s a linear score, which means that all the individual differences are weighted equally in the average. The absolute value is taken to avoid cancellations between positive and negative errors. MAE is more robust to outliers than, for example, Mean Squared Error, which squares the residuals and thus gives higher weight to larger errors.

The general formula is MAE = (1/n) * Σ|Yi – Ŷi|

n = Number of data points
Yi = Actual output
Ŷi = Predicted output

Example: Let’s say you have built a model to predict the price of houses, and you want to test the model on 3 new houses. The actual prices of the houses are $300,000, $400,000, and $500,000. Your model predicts the prices as $310,000, $420,000, and $470,000 respectively. Here’s how you would calculate the MAE.

Calculate the absolute error.
|$300,000 – $310,000| = $10,000
|$400,000 – $420,000| = $20,000
|$500,000 – $470,000| = $30,000

Calculate the mean of these absolute errors.
MAE = (1/3) * ($10,000 + $20,000 + $30,000) = $20,000

So, the MAE of your model on these three houses is $20,000. This means that, on average, your model’s predictions are off by $20,000.
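The house-price example translates directly into code. The sketch below also computes the root mean squared error on the same three houses, to show how squaring gives the largest miss more weight; the function names are chosen for illustration.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all data points."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error; large misses count more."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))

actual_prices = [300_000, 400_000, 500_000]
predicted_prices = [310_000, 420_000, 470_000]

mae = mean_absolute_error(actual_prices, predicted_prices)        # 20000.0
rmse = root_mean_squared_error(actual_prices, predicted_prices)   # larger than the MAE here
```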

Model Hyperparameter Tuning

Polynomial Regression: Polynomial Regression extends simple Linear Regression to accommodate non-linear relationships between the target and predictor variables by introducing higher-degree terms. However, one challenge with Polynomial Regression is choosing the right degree of the polynomial. Higher degrees can model more complex shapes, but they can also overfit the data, i.e., the model becomes excessively complex and starts capturing the noise along with the underlying pattern, making it perform poorly on unseen data.

Root Mean Square Error (RMSE) is a common metric used to evaluate the performance of a Regression model. It measures the average magnitude of the error. In the context of Polynomial Regression, you can use RMSE to find the optimal polynomial degree. The process involves training multiple Polynomial Regression models, each with a different degree, calculating the RMSE for each, and then selecting the model with the lowest RMSE as the optimal one.

For each degree of polynomial (starting from 1 and increasing), do the following: Fit a Polynomial Regression model on the training set, use the trained model to predict the output in the validation set, calculate the RMSE of these predictions against the actual output in the validation set. Compare the RMSE values for the models. The model with the smallest RMSE is typically chosen as the optimal one. However, you also need to be cautious of overfitting. If the RMSE of the validation set starts to increase as you add more degrees to your model (even though the RMSE of the training set continues to decrease), this is a sign that your model may be overfitting.
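The degree search described above can be sketched with NumPy. This is an illustration under assumptions, not the study's code: the data is synthetic, the predictor is scaled to the range -1 to 1 (raw dates or large day indices tend to make high-degree fits numerically unstable), and the function names are hypothetical.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two arrays."""
    diff = np.asarray(actual) - np.asarray(predicted)
    return float(np.sqrt(np.mean(diff ** 2)))

def select_degree(x_train, y_train, x_val, y_val, max_degree=10):
    """Fit one polynomial per degree on the training set and score it on validation."""
    scores = {}
    for degree in range(1, max_degree + 1):
        coeffs = np.polyfit(x_train, y_train, degree)
        scores[degree] = rmse(y_val, np.polyval(coeffs, x_val))
    best = min(scores, key=scores.get)     # degree with the lowest validation RMSE
    return best, scores

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)                          # scaled predictor
y = 2.0 * x**3 - x + rng.normal(0.0, 0.05, x.size)       # hypothetical cubic signal plus noise
idx = rng.permutation(x.size)
train, val = idx[:160], idx[160:]

best_degree, scores = select_degree(x[train], y[train], x[val], y[val], max_degree=8)
```

Plotting the scores dictionary against degree gives the kind of curve the overfitting check relies on: validation RMSE that stops improving, or rises, as the degree grows.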

Figure 1 and Figure 2 present the results of the RMSE analysis for cases and deaths, respectively. The x-axis represents the degree of the polynomial, while the y-axis shows the corresponding RMSE values. The optimal polynomial degree for the cases model was found to be 8, with a lowest RMSE value of approximately 16,037,820. For the deaths model, the optimal polynomial degree was 5, and the lowest RMSE value was approximately 89,322. These findings indicate that the 8th-degree polynomial provided the best fit for predicting the number of cases, while the 5th-degree polynomial was the most suitable for predicting the number of deaths. Lower RMSE values signify that the models with these polynomial degrees captured the patterns and trends in the data more accurately than the other degrees tested.

Figure 1: Optimal Polynomial for Predicting Cases

Figure 2: Optimal Polynomial for Predicting Deaths

XGBoost Regression: To ensure optimal performance, a grid search was used to explore and determine the best combination of hyperparameters for the XGBoost Regression model. By systematically evaluating different parameter values, we aimed to identify the configuration that maximized the model's predictive capabilities. The objective function chosen was "reg:squarederror", a commonly used objective for regression problems. It minimizes the squared difference between the predicted and actual values, often called the sum of squared errors (SSE). Because the differences are squared, large errors are penalized more heavily than small ones, which pushes the model to avoid large misses. The model is trained by learning the parameters that minimize the SSE.

N Estimators: This specifies the number of gradient boosted trees to use, i.e., the number of boosting rounds or iterations. A higher value makes the model more complex and more likely to overfit.

Number of Cases: The optimal number selected was 466 trees.
Number of Deaths: The optimal number selected was 500 trees.

Max Depth: This controls the depth of the tree. A larger value makes the model more complex and more likely to overfit. The depth of a tree is the length of the longest path from the root to a leaf.

Number of Cases: The optimal number selected was 2.
Number of Deaths: The optimal number selected was 2.

Learning Rate: This determines the impact of each tree on the outcome. It shrinks the contribution from each tree by the set value and helps in preventing overfitting. Lower values generally require more trees to model all the relations and will thus be more computationally expensive.

Number of Cases: The optimal number selected was 0.03.
Number of Deaths: The optimal number selected was 0.03.
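The mechanics of a grid search over these three hyperparameters can be sketched generically. In the sketch below, the candidate values are hypothetical, and toy_evaluate is a stand-in for "train the model with these settings and return its validation error"; it does not train anything and is written only so the example runs on its own.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and keep the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical candidate values; the grid used in the study is not stated.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [2, 4, 6],
    "learning_rate": [0.01, 0.03, 0.1],
}

def toy_evaluate(params):
    """Stand-in scoring function, NOT a real model fit (assumption for this sketch)."""
    return (abs(params["n_estimators"] - 500) / 500
            + abs(params["max_depth"] - 2)
            + abs(params["learning_rate"] - 0.03))

best_params, best_score = grid_search(param_grid, toy_evaluate)
```

With three values per hyperparameter, the search evaluates 27 combinations; the cost grows multiplicatively with every value added to the grid.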

D. Model Results

Following the model selection process, the next step was interpreting the results obtained from the chosen regression algorithms. This phase involved analyzing the results of the regression models for predicting the number of cases and the number of deaths, and identifying which regression model was most accurate for each use case. Ideally, the optimal model has a high R² and a low MAE. Additionally, residuals were calculated for each model. Analyzing residuals provided insights into the appropriateness of the model. If the residuals appeared to be randomly dispersed around zero, it indicated that the model was appropriate for the data. Systematic patterns in the residuals could suggest that the model was not capturing some aspect of the data. To ensure a fair comparison across all data points and models, we took the absolute value of these residuals. By doing this, we treated underpredictions and overpredictions equally, focusing purely on the magnitude of the error and not its direction.

Number of Cases Results

Number of Deaths Results

Analysis

Number of Cases: The XGBoost Regression model had the highest R² value and the lowest MAE, making it the most accurate of the three models for predicting the number of COVID-19 cases on the test data. These results suggest that the XGBoost Regression model provides the best predictions, as it appears to capture the data's underlying patterns most effectively and has the smallest average error. The dashboard visualization titled "Number of Cases Results" illustrates the model results: in the top graph, the XGBoost Regression line (red) is almost identical to the actual output (yellow). Additionally, the residuals graph on the bottom shows large variations for the Linear Regression and Polynomial Regression models, while the XGBoost Regression residuals stay closest to 0 and are consistent throughout the predictions.

Number of Deaths: The XGBoost Regression model displayed both the highest R² value and the lowest MAE among the three models for predicting the number of COVID-19 deaths on the test data. These results suggest that the XGBoost Regression model performs best at capturing the intricacies of the data and providing the closest predictions to the actual values. The dashboard visualization titled "Number of Deaths Results" illustrates the model results: in the top graph, the XGBoost Regression line (red) is almost identical to the actual output (yellow). Additionally, the residuals graph on the bottom shows large variations for the Linear Regression model, while the Polynomial Regression and XGBoost Regression models show smaller variations and stay closer to 0.

III. Conclusion

A. Recommendations

In predicting both the number of COVID-19 cases and deaths, the XGBoost Regression model consistently outperformed the Linear Regression and Polynomial Regression models. XGBoost demonstrated the highest R² values, indicating a better fit to the data, and the lowest MAE, signifying the most accurate predictions. This showcases the robustness and superiority of gradient-boosted tree algorithms like XGBoost in capturing intricate patterns in data. It is important to acknowledge that as time progresses and additional data is collected, there might be a point where the models could reach a plateau in their predictive capabilities. In such cases, it becomes necessary to update the models to adapt to the evolving dynamics of the pandemic. While this case study primarily focused on regression models, it would be interesting to explore additional time series models and compare their accuracy to the ones discussed here. Here are some examples of other time series models that could be considered:

Long Short-Term Memory (LSTM) with Attention Mechanism: An extension of LSTM that incorporates an attention mechanism to give more weight to certain time steps in the sequence, potentially improving its ability to capture long-term dependencies.

Seasonal Autoregressive Integrated Moving Average with Exogenous Variables (SARIMAX): Like SARIMA but includes exogenous variables that may influence the time series, such as external factors or interventions.

Facebook’s Prophet with Additional Regressors: Enhancing the Prophet model by including additional regressors that might impact the time series, allowing for more comprehensive forecasting.

Holt-Winters Exponential Smoothing: A popular method for time series forecasting that handles trends and seasonality, particularly suitable for short-term predictions.

Gaussian Process Regression with Temporal Kernels: Incorporating temporal kernels in Gaussian Process Regression to capture complex temporal patterns in the data.

Predicting diseases such as COVID-19 is of paramount importance from multiple perspectives. Accurate predictions enable governments and public health organizations to make informed decisions about interventions, such as lockdowns or social distancing measures. This aids in managing the spread of the disease, thus protecting the well-being and lives of countless individuals. Additionally, timely and accurate predictions can educate the public about the impending risks, motivating them to adopt preventive measures. The economic fallout from pandemics can be staggering, with businesses closing, unemployment rates rising, and stock markets plummeting. Accurate forecasting allows governments and businesses to plan more effectively, allocate resources judiciously, and minimize economic disruptions.

Early predictions can guide policymakers in implementing strategic economic buffers and aids, helping economies weather the storm more effectively. Predicting the surge in cases or deaths aids healthcare systems in better resource allocation – from ICU beds to ventilators, and from medical staff deployment to the procurement of essential drugs. It ensures that hospitals are not overwhelmed and can provide care to those who need it most. Moreover, predictions can guide research in terms of where efforts might be focused, be it in vaccine development or therapeutic interventions. In summary, the accurate prediction of diseases like COVID-19 is a linchpin in mounting an effective response. It ensures that societies are equipped not only to deal with the immediate health implications but also the cascading economic and societal ramifications. Given the demonstrated effectiveness of the XGBoost Regression model in our analysis, leveraging such advanced predictive tools is indispensable in our global fight against pandemics.

IV. Data Sources

A. GitHub

Covid 19 Data

Contact Organization Unit: https://github.com/CSSEGISandData/COVID-19
Contact Name: Johns Hopkins University
Contact Person Function: Contributors
Contact Mail Address: jhusystems@gmail.com
Contact Email Address: None
Contact Phone Number: None

FIPS: US only. Federal Information Processing Standards code that uniquely identifies counties within the USA.
Admin2: County name. US only.
Province_State: Province, state or dependency name.
Country_Region: Country, region or sovereignty name. The names of locations included on the Website correspond with the official designations used by the U.S. Department of State.
Last Update: MM/DD/YYYY HH:mm:ss (24 hour format, in UTC).
Lat and Long: Dot locations on the dashboard. All points (except for Australia) shown on the map are based on geographic centroids, and are not representative of a specific address, building or any location at a spatial scale finer than a province/state. Australian dots are located at the centroid of the largest city in each state.
Confirmed: Counts include confirmed and probable (where reported).
Deaths: Counts include confirmed and probable (where reported).
Active: Active cases = total cases – total recovered – total deaths. This value is for reference only after we stopped to report the recovered cases (see Issue #4465)
Incident_Rate: Incidence Rate = cases per 100,000 persons.
Case_Fatality_Ratio (%): Case-Fatality Ratio (%) = number of recorded deaths / number of cases, expressed as a percentage.
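
The derived fields above follow directly from the raw counts. The sketch below is illustrative only and is not the code used by the data provider or in this analysis: the function names, the `population` argument, and the choice to return 0.0 when there are no confirmed cases are all assumptions. It also assumes the case-fatality ratio is scaled by 100, as the (%) label suggests.

```python
from datetime import datetime, timezone


def parse_last_update(value: str) -> datetime:
    """Parse an 'MM/DD/YYYY HH:mm:ss' timestamp and tag it as UTC."""
    parsed = datetime.strptime(value, "%m/%d/%Y %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def active_cases(confirmed: int, recovered: int, deaths: int) -> int:
    """Active cases = total cases - total recovered - total deaths."""
    return confirmed - recovered - deaths


def incidence_rate(confirmed: int, population: int) -> float:
    """Cases per 100,000 persons; `population` is a hypothetical input."""
    return confirmed * 100_000 / population


def case_fatality_ratio(deaths: int, confirmed: int) -> float:
    """Recorded deaths divided by cases, as a percentage (assumed scaling)."""
    if confirmed == 0:
        return 0.0  # assumption: treat the ratio as zero when no cases exist
    return deaths * 100 / confirmed


if __name__ == "__main__":
    # Made-up numbers, used only to exercise the formulas.
    print(active_cases(confirmed=1000, recovered=600, deaths=50))  # 350
    print(incidence_rate(confirmed=500, population=1_000_000))     # 50.0
    print(case_fatality_ratio(deaths=50, confirmed=1000))          # 5.0
```

Because recovered counts are no longer reported, an Active value computed this way is indicative at best for later dates.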

196 Countries
1,143 Days

Counts

B. Citation

Works Cited

Brooks, S. K; Webster, R. K; Smith, L. E; Woodland, L; Wessely, S; Greenberg, N; Rubin, G. J., (2020).
The Psychological Impact of Quarantine and How to Reduce it: Rapid Review of the Evidence.
The Lancet (London, England), 395(10227), 912–920. https://doi.org/10.1016/S0140-6736(20)30460-8.

Burgess, S; Sievertsen, H. H., (2020). Schools, Skills, and Learning: The Impact of COVID-19 on Education.
VoxEU.org. https://voxeu.org/article/impact-covid-19-education.

Chen, T; Guestrin, C., (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).
https://doi.org/10.1145/2939672.2939785.

Cuan-Baltazar, J. Y; Muñoz-Perez, M. J; Robledo-Vega, C; Pérez-Zepeda, M. F; Soto-Vega, E., (2020).
Misinformation of COVID-19 on the Internet: Infodemiology Study. JMIR Public Health and Surveillance, 6(2),
e18444. https://doi.org/10.2196/18444.

Devakumar, D; Shannon, G; Bhopal, S. S; Abubakar, I., (2020). Racism and Discrimination in COVID-19 Responses.
The Lancet (London, England), 395(10231), 1194. https://doi.org/10.1016/S0140-6736(20)30792-3.

Freedman, D. A., (2009). Statistical Models: Theory and Practice. Cambridge University Press.

Huang, C; Wang, Y; Li, X; Ren, L; Zhao, J; Hu, Y; Zhang, L; Fan, G; Xu, J; Gu, X; Cheng, Z; Yu, T;
Xia, J; Wei, Y; Wu, W; Xie, X; Yin, W; Li, H; Liu, M; Xiao, Y; Cao, B., (2020).
Clinical Features of Patients Infected with 2019 Novel Coronavirus in Wuhan, China.
The Lancet (London, England), 395(10223), 497–506. https://doi.org/10.1016/S0140-6736(20)30183-5.

Hui, D. S; Azhar, E. I; Madani, T. A; Ntoumi, F; Kock, R; Dar, O; Ippolito, G; McHugh, T. D; Memish,
Z. A; Drosten, C; Zumla, A; Petersen, E., (2020). The Continuing 2019-nCoV Epidemic Threat of
Novel Coronaviruses to Global Health—The Latest 2019 Novel Coronavirus Outbreak in Wuhan,
China. International Journal of Infectious Diseases, 91, 264-266.
https://www.ijidonline.com/article/S1201-9712(20)30011-4/fulltext.

International Monetary Fund. (2020). World Economic Outlook, April 2020: The Great Lockdown.
IMF. https://www.imf.org/en/Publications/WEO/Issues/2020/04/14/weo-april-2020.

James, G; Witten, D; Hastie, T; Tibshirani, R., (2013). An Introduction to Statistical Learning (Vol. 112). Springer.

Johns Hopkins University. (2023). COVID-19 Dashboard by the Center for Systems Science and
Engineering (CSSE) at Johns Hopkins University. https://coronavirus.jhu.edu/map.html.

Kutner, M. H; Nachtsheim, C. J; Neter, J; Li, W., (2004). Applied Linear Statistical Models. McGraw Hill/Irwin.

Lodish, H; Berk, A; Zipursky, S. L; Matsudaira, P; Baltimore, D; Darnell, J., (2000). Photosynthetic
Stages and Light-Absorbing Pigments. Molecular Cell Biology. 4th Edition, W. H. Freeman, New York.

Lurie, N; Saville, M; Hatchett, R; Halton, J., (2020). Developing Covid-19 Vaccines at Pandemic Speed.
The New England Journal of Medicine, 382(21), 1969–1973. https://doi.org/10.1056/NEJMp2005630.

Nicola, M; Alsafi, Z; Sohrabi, C; Kerwan, A; Al-Jabir, A; Iosifidis, C; Agha, M; Agha, R., (2020).
The Socio-Economic Implications of the Coronavirus Pandemic (COVID-19): A Review. International
Journal of Surgery (London, England), 78, 185–193. https://doi.org/10.1016/j.ijsu.2020.04.018.

Pfefferbaum, B; North, C. S., (2020). Mental Health and the COVID-19 Pandemic. New England
Journal of Medicine, 383, 510-512. https://doi.org/10.1056/NEJMp2008017.

Ranney, M. L; Griffeth, V; Jha, A. K., (2020). Critical Supply Shortages – The Need for Ventilators
and Personal Protective Equipment during the Covid-19 Pandemic. The New England Journal of
Medicine, 382(18), e41. https://doi.org/10.1056/NEJMp2006141.

Singer, J. D; Willett, J. B., (2003). Applied Longitudinal Data Analysis: Modeling Change and Event
Occurrence. Oxford University Press. https://doi.org/10.1093/acprof:oso/9780195152968.001.0001.

Gössling, S; Scott, D; Hall, C. M., (2021). Pandemics, Tourism and Global Change: A
Rapid Assessment of COVID-19. Journal of Sustainable Tourism, 29(1), 1–20.
https://doi.org/10.1080/09669582.2020.1758708.

Willmott, C. J; Matsuura, K., (2005). Advantages of the Mean Absolute Error (MAE) Over the Root
Mean Square Error (RMSE) in Assessing Average Model Performance. Climate Research, 30(1), 79–82.
http://www.jstor.org/stable/24869236.

World Health Organization. (2020). Coronavirus Disease 2019 (COVID-19): Situation Report, 72.
https://apps.who.int/iris/handle/10665/331685.

Yancy, C. W., (2020). COVID-19 and African Americans. JAMA, 323(19), 1891–1892.
https://doi.org/10.1001/jama.2020.6548.