I. Introduction
A. Executive Summary
Viruses are microscopic infectious agents that rely on living cells to replicate. They can infect a wide range of organisms, from animals and plants to microorganisms such as bacteria and archaea (Lodish et al., 2000). Viruses occupy a unique position on the boundary between living and non-living: they lack cellular structures and cannot carry out metabolic processes on their own, but once they infect a susceptible host cell they can direct the cell’s machinery to produce more viruses. One such virus, SARS-CoV-2, caused the disease COVID-19 and the resulting global pandemic. First identified in Wuhan, China, in late 2019, the outbreak escalated rapidly into a pandemic that spread across the globe (Hui et al., 2020). While the primary symptoms were respiratory, severe cases often involved multi-organ complications that led to higher mortality (Huang et al., 2020). The repercussions of the COVID-19 pandemic, however, extended beyond the medical realm. Economically, it triggered a global recession (Nicola et al., 2020). Socially, it compelled us to redefine norms, with lockdowns and social distancing becoming part of daily life (Brooks et al., 2020). It exposed stark disparities within healthcare and socio-economic systems and contributed to a sharp rise in mental health problems (Pfefferbaum and North, 2020). In essence, the COVID-19 pandemic precipitated a paradigm shift across many facets of human existence.
The healthcare system has been stressed beyond precedent, with hospitals and healthcare professionals straining to accommodate the influx of patients, leading to resource shortages and the need for crisis management (Ranney et al., 2020). Because the virus spreads through the air, millions of people have been infected, resulting in significant morbidity and mortality (Johns Hopkins University, 2023). Education has been significantly disrupted as schools and universities shifted to online learning to safeguard public health, introducing both opportunities and challenges. The shift highlighted the digital divide, which affected the quality of education received by students without reliable internet access or technology (Burgess and Sievertsen, 2020). The pandemic also drastically transformed the global economy. Lockdown measures pushed countries into recession, disrupted global supply chains, and raised unemployment rates (International Monetary Fund, 2020). Sectors such as tourism and hospitality declined significantly, while e-commerce and remote-working technologies grew (Gössling et al., 2021). Societal shifts were also evident. In many communities, the pandemic has exacerbated existing inequalities, hitting vulnerable populations the hardest (Yancy, 2020). Meanwhile, various forms of racial and ethnic discrimination related to the virus have been reported across the world (Devakumar et al., 2020). The psychological impact of the pandemic cannot be overstated. Prolonged periods of isolation, fear, and anxiety have contributed to a surge in mental health issues globally (Pfefferbaum and North, 2020). Furthermore, the pandemic has generated an infodemic, with the rapid spread of misinformation causing public confusion and mistrust (Cuan-Baltazar et al., 2020). COVID-19 has also highlighted the importance of global cooperation and coordination.
The global scientific community embarked on an unprecedented race to develop vaccines and treatments, with several vaccines developed in record time, demonstrating the power of collaboration and the sharing of scientific knowledge (Lurie et al., 2020).
In sum, the impacts of COVID-19 have been far-reaching and transformative, leading to changes in health, education, the economy, society, and global cooperation. Future research should focus on understanding the long-term impacts and on how to mitigate future pandemics. The primary aim of this study is to use several regression models to predict COVID-19 cases and deaths. Predicting how the virus behaves at a macro level is crucial because such predictions can inform public health responses and policy decisions in real time. They support resource allocation, healthcare preparation, and the implementation of containment measures. Furthermore, the insights derived from these models can enhance our understanding of the virus’s spread dynamics and hold significant potential for future pandemics. Such predictive models could be promptly applied to a new infectious disease outbreak, allowing us to anticipate disease spread and intervene more effectively, thus potentially reducing the societal and health impacts of future pandemics. By learning from the COVID-19 experience, we are better equipped to harness data and predictive modelling techniques to navigate future public health crises.
II. Process
A. Data Gathering
Data features were retrieved from the Johns Hopkins University open-source GitHub repository. Data characteristics and sources for each retrieved data table can be found in the Data Sources section. The data cover the period from 01/22/2020 to 03/09/2023. The data gathering process focused on the following categories.
Global Daily Counts
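As an illustration of how global daily counts might be derived from per-region time-series files, the Python sketch below sums cumulative counts across all rows for each date. It assumes the wide CSV layout used by the Johns Hopkins CSSE time-series files (four identifying columns followed by one column per date in m/d/yy form). The sample rows, the region names, and the function name global_daily_counts are hypothetical and are not taken from the study.

```python
import csv
import io
from datetime import datetime

# Illustrative sample in the assumed wide layout (all values are made up).
SAMPLE_CSV = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,Aland,0.0,0.0,1,2,4
North,Borduria,1.0,1.0,0,3,3
South,Borduria,2.0,2.0,5,5,9
"""

ID_COLUMNS = 4  # Province/State, Country/Region, Lat, Long


def global_daily_counts(csv_text):
    """Return [(date, cumulative_total, new_that_day), ...] summed over all rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    # Date columns start after the identifying columns.
    dates = [datetime.strptime(label, "%m/%d/%y").date()
             for label in header[ID_COLUMNS:]]
    totals = [0] * len(dates)
    for row in body:
        for i, value in enumerate(row[ID_COLUMNS:]):
            totals[i] += int(value)
    result = []
    previous = 0  # the first day's cumulative total is treated as new that day
    for day, total in zip(dates, totals):
        result.append((day, total, total - previous))
        previous = total
    return result


counts = global_daily_counts(SAMPLE_CSV)
```

The function returns both the cumulative total and the day-over-day difference, since either could serve as the modeling target depending on how "number of cases" is defined.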
B. Data Cleaning
In this study, we performed a rigorous data cleaning process to ensure the quality and reliability of the data. The cleaning procedures were based on custom business rules tailored to the research objectives. This step allowed us to address inconsistencies, errors, and missing values in the dataset, with the aim of improving the accuracy of the subsequent analysis and predictive modeling. The cleaning focused primarily on preparing the dataset for visualization and modeling. The following outlines the specific steps taken during the cleaning process for the modeling section.
Assumptions/Business Rules
C. Data Modeling
With data gathering and cleaning complete, the next stage of our research was the selection of the regression model. This stage required a detailed assessment and comparison of several algorithms to determine the one best suited to the nature of our problem and dataset. The goal of model selection was to identify a model capable of capturing the inherent trends and associations within the data. Each algorithm has its own advantages and limitations, so it was imperative to choose a model that could deliver precise and meaningful results for our investigation. By considering the attributes of the dataset and the study’s goals, we confirmed that the selected model was appropriate for our research, enabling us to draw valuable conclusions and offer recommendations. Two use cases were developed: one to predict the number of cases and one to predict the number of deaths. The dependent (target) variable for each use case was the quantity being predicted (number of cases or number of deaths), and the independent (predictor) variable was the date.
Model Selection
Model Hyperparameter Tuning
Figure 1: Optimal Polynomial for Predicting Cases
Figure 2: Optimal Polynomial for Predicting Deaths
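The criterion used to choose the optimal polynomial degree is not stated here. One common approach, sketched below in Python under that assumption, fits polynomials of increasing degree on the earlier part of the series and keeps the degree with the lowest MAE on the held-out later part. The data, the candidate degrees, the 80/20 chronological split, and all function names are hypothetical; in practice a library routine such as numpy.polyfit would normally replace the hand-written least-squares solver.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients (lowest power first) via the normal equations."""
    size = degree + 1
    gram = [[sum(x ** (i + j) for x in xs) for j in range(size)]
            for i in range(size)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear_system(gram, rhs)


def predict(coefficients, x):
    return sum(c * x ** k for k, c in enumerate(coefficients))


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def select_degree(days, counts, candidate_degrees, train_fraction=0.8):
    """Return (best_degree, {degree: validation MAE}) using a chronological split."""
    scale = float(max(days)) or 1.0  # rescale days to [0, 1] for numerical stability
    xs = [d / scale for d in days]
    split = int(len(xs) * train_fraction)
    scores = {}
    for degree in candidate_degrees:
        coefficients = fit_polynomial(xs[:split], counts[:split], degree)
        predictions = [predict(coefficients, x) for x in xs[split:]]
        scores[degree] = mean_absolute_error(counts[split:], predictions)
    return min(scores, key=scores.get), scores


# Illustrative series: day index as the predictor, a made-up quadratic as the target.
days = list(range(50))
counts = [3.0 * d * d + 2.0 * d + 10.0 for d in days]
best_degree, validation_mae = select_degree(days, counts, [1, 2, 3])
```

Here days plays the role of the date predictor after conversion to a day index. A chronological split is used rather than a random one so that the validation error reflects forecasting beyond the training period.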
D. Model Results
Following the model selection process, the next step was to interpret the results obtained from the chosen regression algorithms. This phase involved analyzing the results of the regression models for predicting the number of cases and the number of deaths, and identifying which model was most accurate for each use case. Ideally, the optimal model has a high R² and a low MAE. Additionally, residuals (the differences between observed and predicted values) were calculated for each model, as they provide insight into the appropriateness of the model. Residuals that are randomly dispersed around zero indicate that the model form may be appropriate, whereas systematic patterns in the residuals suggest that the model is not capturing some aspect of the data. To ensure a fair comparison across all data points and models, we took the absolute value of these residuals. By doing this, we treated underpredictions and overpredictions equally, focusing purely on the magnitude of the error and not its direction.
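The evaluation criteria described above can be illustrated with a short, self-contained Python sketch that computes R², MAE, and absolute residuals for a set of predictions. The numbers are made up for illustration, and the function names are hypothetical and not taken from the study.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def absolute_residuals(actual, predicted):
    """Magnitude of each error, ignoring whether it is an over- or underprediction."""
    return [abs(a - p) for a, p in zip(actual, predicted)]


def mean_absolute_error(actual, predicted):
    residuals = absolute_residuals(actual, predicted)
    return sum(residuals) / len(residuals)


# Made-up observed and predicted values for four days.
actual = [100.0, 150.0, 210.0, 280.0]
predicted = [110.0, 140.0, 215.0, 275.0]

abs_res = absolute_residuals(actual, predicted)
mae = mean_absolute_error(actual, predicted)
r2 = r_squared(actual, predicted)
```

A model that scores a higher r2 and a lower mae than its alternatives on the same data would be preferred under the criteria stated above.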
Number of Cases Results
Number of Deaths Results
Analysis
III. Conclusion
A. Recommendations
In predicting both the number of COVID-19 cases and deaths, the XGBoost Regression model consistently outperformed the Linear Regression and Polynomial Regression models. XGBoost achieved the highest R² values, indicating a better fit to the data, and the lowest MAE, signifying the most accurate predictions. This highlights the ability of gradient-boosted tree algorithms such as XGBoost to capture intricate patterns in data. It is important to acknowledge that, as time progresses and additional data is collected, the models may reach a plateau in their predictive capabilities. In such cases, it becomes necessary to update the models to adapt to the evolving dynamics of the pandemic. While this case study focused primarily on regression models, it would be worthwhile to explore time series models and compare their accuracy with the ones discussed here. Examples of time series models that could be considered include:
Long Short-Term Memory (LSTM) with Attention Mechanism: An extension of LSTM that incorporates an attention mechanism to give more weight to certain time steps in the sequence, potentially improving its ability to capture long-term dependencies.
Seasonal Autoregressive Integrated Moving Average with Exogenous Variables (SARIMAX): An extension of SARIMA that includes exogenous variables that may influence the time series, such as external factors or interventions.
Facebook’s Prophet with Additional Regressors: Enhancing the Prophet model by including additional regressors that might impact the time series, allowing for more comprehensive forecasting.
Holt-Winters Exponential Smoothing: A popular method for time series forecasting that handles trends and seasonality, particularly suitable for short-term predictions.
Gaussian Process Regression with Temporal Kernels: Incorporating temporal kernels in Gaussian Process Regression to capture complex temporal patterns in the data.
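To make one of these alternatives concrete, the Python sketch below implements Holt's linear-trend exponential smoothing, which corresponds to Holt-Winters without the seasonal component; the full method would add a seasonal term. The smoothing parameters, the sample series, and the function name are hypothetical, and a library implementation would normally be used instead.

```python
def holt_linear_forecast(series, alpha, beta, horizon):
    """Forecast `horizon` steps ahead with Holt's linear-trend method.

    alpha smooths the level and beta smooths the trend (both between 0 and 1).
    """
    if len(series) < 2:
        raise ValueError("need at least two observations to initialise the trend")
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        # Blend the new observation with the previous one-step-ahead forecast.
        level = alpha * value + (1 - alpha) * (previous_level + trend)
        # Blend the latest change in level with the previous trend estimate.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + step * trend for step in range(1, horizon + 1)]


# Made-up, exactly linear series used only to exercise the function.
series = [1.0, 3.0, 5.0, 7.0, 9.0]
forecast = holt_linear_forecast(series, alpha=0.5, beta=0.3, horizon=3)
```

On an exactly linear series such as this sample, the forecasts simply continue the line; on real case or death counts, alpha and beta would need to be tuned, for example against a held-out period.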
Predicting diseases such as COVID-19 is of paramount importance from multiple perspectives. Accurate predictions enable governments and public health organizations to make informed decisions about interventions, such as lockdowns or social distancing measures. This aids in managing the spread of the disease, thus protecting the well-being and lives of countless individuals. Additionally, timely and accurate predictions can educate the public about the impending risks, motivating them to adopt preventive measures. The economic fallout from pandemics can be staggering, with businesses closing, unemployment rates rising, and stock markets plummeting. Accurate forecasting allows governments and businesses to plan more effectively, allocate resources judiciously, and minimize economic disruptions.
Early predictions can guide policymakers in implementing strategic economic buffers and aids, helping economies weather the storm more effectively. Predicting the surge in cases or deaths aids healthcare systems in better resource allocation – from ICU beds to ventilators, and from medical staff deployment to the procurement of essential drugs. It ensures that hospitals are not overwhelmed and can provide care to those who need it most. Moreover, predictions can guide research in terms of where efforts might be focused, be it in vaccine development or therapeutic interventions. In summary, the accurate prediction of diseases like COVID-19 is a linchpin in mounting an effective response. It ensures that societies are equipped not only to deal with the immediate health implications but also the cascading economic and societal ramifications. Given the demonstrated effectiveness of the XGBoost Regression model in our analysis, leveraging such advanced predictive tools is indispensable in our global fight against pandemics.
IV. Data Sources
A. GitHub
COVID-19 Data
B. Citation
Works Cited
Brooks, S. K; Webster, R. K; Smith, L. E; Woodland, L; Wessely, S; Greenberg, N; Rubin, G. J., (2020).
The Psychological Impact of Quarantine and How to Reduce it: Rapid Review of the Evidence.
The Lancet (London, England), 395(10227), 912–920. https://doi.org/10.1016/S0140-6736(20)30460-8.
Burgess, S; Sievertsen, H. H., (2020). Schools, Skills, and Learning: The Impact of COVID-19 on Education.
VoxEU.org. https://voxeu.org/article/impact-covid-19-education.
Chen, T; Guestrin, C., (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).
https://doi.org/10.1145/2939672.2939785.
Cuan-Baltazar, J. Y; Muñoz-Perez, M. J; Robledo-Vega, C; Pérez-Zepeda, M. F; Soto-Vega, E., (2020).
Misinformation of COVID-19 on the Internet: Infodemiology Study. JMIR Public Health and Surveillance, 6(2),
e18444. https://doi.org/10.2196/18444.
Devakumar, D; Shannon, G; Bhopal, S. S; Abubakar, I., (2020). Racism and Discrimination in COVID-19 Responses.
The Lancet (London, England), 395(10231), 1194. https://doi.org/10.1016/S0140-6736(20)30792-3.
Freedman, D. A., (2009). Statistical Models: Theory and Practice. Cambridge University Press.
Huang, C; Wang, Y; Li, X; Ren, L; Zhao, J; Hu, Y; Zhang, L; Fan, G; Xu, J; Gu, X; Cheng, Z; Yu, T;
Xia, J; Wei, Y; Wu, W; Xie, X; Yin, W; Li, H; Liu, M; Xiao, Y; Cao, B., (2020).
Clinical Features of Patients Infected with 2019 Novel Coronavirus in Wuhan, China.
The Lancet (London, England), 395(10223), 497–506. https://doi.org/10.1016/S0140-6736(20)30183-5.
Hui, D. S; Azhar, E. I; Madani, T. A; Ntoumi, F; Kock, R; Dar, O; Ippolito, G; McHugh, T. D; Memish,
Z. A; Drosten, C; Zumla, A; Petersen, E., (2020). The Continuing 2019-nCoV Epidemic Threat of
Novel Coronaviruses to Global Health—The Latest 2019 Novel Coronavirus Outbreak in Wuhan,
China. International Journal of Infectious Diseases, 91, 264–266.
https://www.ijidonline.com/article/S1201-9712(20)30011-4/fulltext.
International Monetary Fund. (2020). World Economic Outlook, April 2020: The Great Lockdown.
IMF. https://www.imf.org/en/Publications/WEO/Issues/2020/04/14/weo-april-2020.
James, G; Witten, D; Hastie, T; Tibshirani, R., (2013). An Introduction to Statistical Learning (Vol. 112). Springer.
Johns Hopkins University. (2023). COVID-19 Dashboard by the Center for Systems Science and
Engineering (CSSE) at Johns Hopkins University. https://coronavirus.jhu.edu/map.html.
Kutner, M. H; Nachtsheim, C. J; Neter, J; Li, W., (2004). Applied Linear Statistical Models. McGraw Hill/Irwin.
Lodish, H; Berk, A; Zipursky, S. L; Matsudaira, P; Baltimore, D; Darnell, J., (2000). Photosynthetic
Stages and Light-Absorbing Pigments. Molecular Cell Biology. 4th Edition, W. H. Freeman, New York.
Lurie, N; Saville, M; Hatchett, R; Halton, J., (2020). Developing Covid-19 Vaccines at Pandemic Speed.
The New England Journal of Medicine, 382(21), 1969–1973. https://doi.org/10.1056/NEJMp2005630.
Nicola, M; Alsafi, Z; Sohrabi, C; Kerwan, A; Al-Jabir, A; Iosifidis, C; Agha, M; Agha, R., (2020).
The Socio-Economic Implications of the Coronavirus Pandemic (COVID-19): A Review. International
Journal of Surgery (London, England), 78, 185–193. https://doi.org/10.1016/j.ijsu.2020.04.018.
Pfefferbaum, B; North, C. S., (2020). Mental Health and the COVID-19 Pandemic. The New England
Journal of Medicine, 383(6), 510–512. https://doi.org/10.1056/NEJMp2008017.
Ranney, M. L; Griffeth, V; Jha, A. K., (2020). Critical Supply Shortages – The Need for Ventilators
and Personal Protective Equipment during the Covid-19 Pandemic. The New England Journal of
Medicine, 382(18), e41. https://doi.org/10.1056/NEJMp2006141.
Singer, J. D; Willett, J. B., (2003). Applied Longitudinal Data Analysis: Modeling Change and Event
Occurrence. Oxford University Press. https://doi.org/10.1093/acprof:oso/9780195152968.001.0001.
Gössling, S; Scott, D; Hall, C. M., (2021). Pandemics, Tourism and Global Change: A
Rapid Assessment of COVID-19. Journal of Sustainable Tourism, 29(1), 1–20.
https://doi.org/10.1080/09669582.2020.1758708.
Willmott, C. J; Matsuura, K., (2005). Advantages of the Mean Absolute Error (MAE) Over the Root
Mean Square Error (RMSE) in Assessing Average Model Performance. Climate Research, 30(1), 79–82.
http://www.jstor.org/stable/24869236.
World Health Organization. (2020). Coronavirus Disease 2019 (COVID-19): Situation Report, 72.
https://apps.who.int/iris/handle/10665/331685.
Yancy, C. W., (2020). COVID-19 and African Americans. JAMA, 323(19), 1891–1892.
https://doi.org/10.1001/jama.2020.6548.