Predicting COVID-19 Cases Using Some Statistical Models: An Application to the Cases Reported in China, Italy and USA

Today, the new coronavirus disease (COVID-19) is a global epidemic that spreads rapidly among individuals in most countries around the world and has therefore become the greatest worldwide threat. The aim of this study is to find the best predictive models for daily confirmed cases in countries with a large number of confirmed cases. The study was conducted on the countries that recorded the highest infection rates, namely China, Italy and the United States of America. The second goal is to use predictive models to help health care systems become better prepared. In this study, predictions were made with statistical models, namely ARIMA models and the exponential growth model. The results indicate that the exponential growth model is better than ARIMA models for forecasting COVID-19 cases.


Introduction
The World Health Organization (WHO) has declared COVID-19 a pandemic. All countries should take the necessary measures to limit the spread of the virus, on the basis of the analysis and modeling of data issued by the WHO. This research recommends that all countries impose quarantine measures, including restrictions on travel and public gatherings and the closure of schools, universities and workplaces, in order to achieve "social distancing" in the short term. With limited intensive care units and a growing number of deaths, the failure of intensive care systems is highly probable. This will affect a much larger proportion of the world's population, including young people in the workforce who may be asymptomatic but are forced to work from home and cut their spending. The number of affected people in the world is expected to increase dramatically in the next four weeks.
This study aims to find the best predictive models for daily confirmed cases in China, Italy and the United States of America. Its second aim is to obtain clear forecasts of cases with these models so that health care systems can be better prepared, both in these countries and in countries where the disease is beginning to spread.
The virus is a novel coronavirus that infects the respiratory system of humans and animals; the disease it causes is called COVID-19. The resulting pneumonia was initially idiopathic (of unknown cause) and was associated with individuals working at the Huanan seafood market, where live animals are sold. At the end of January 2020, nearly 75,755 people were confirmed as infected in China, and a number of the first 41 patients were found to be linked to the Huanan seafood market in Wuhan, China; the first death from the virus was confirmed on January 9, 2020. It is estimated that a large number of people were infected with the virus, but this was not revealed until it spread beyond the borders of China.
The first recorded case of transmission within a family was in Vietnam, where the virus passed from father to son, while the first outbreak was reported on January 22 in Germany, in the state of Bavaria, where the infection was introduced by infected Chinese individuals.
To control the spread of this virus, China banned the movement of 57 million people in Wuhan and 15 surrounding cities. Tourist sites were closed and celebrations of the Chinese New Year were cancelled to avoid transmission of the virus. It should be mentioned here that infection can be transmitted through droplets of mucus and saliva that are expelled from the mouth or nose of an infected person and settle on surfaces. Moreover, direct infection can occur by inhaling the breath of an infected person. Some studies are investigating the factors that lead to transmission of the infection among individuals, including the incubation period of the disease, which is five or more days according to World Health Organization (WHO) reports so far. The WHO has stated that antibiotics do not eliminate viruses but only bacterial infections and, therefore, that antibiotics should not be used in the prevention or treatment of this virus. Antibiotics should be used only according to a doctor's direct instructions to treat bacterial infections. Until now there is no specific vaccine or drug. Therefore, people receive care to relieve symptoms, and although there are some medications and antibiotics that reduce the severity of the symptoms of this virus, there is no current treatment to prevent the virus. Trials are now underway to test some vaccines and possible drugs to treat this particular disease. In this regard, the WHO is coordinating efforts to face this severe worldwide epidemic.

Methodology
This study was conducted on the basis of confirmed daily cases of COVID-19 that were collected from the official WHO website from January 22, 2020 to March 25, 2020. Time series analysis can offer precise short-run forecasts when a relatively large amount of data is available on the variables concerned; see Granger and Newbold [1]. ARIMA models are widely used in analyzing univariate time series data. An ARIMA model combines three procedures: the first is the autoregressive (AR) procedure, the second is differencing, and the third is the moving-average (MA) procedure.
Such procedures are described as the default models in univariate time series analysis in the statistical modeling literature and are widely used in many applications. The autoregressive form of order p, AR(p), may be described as:
$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t$ (1)
where $\varepsilon_t$ is the error term in the equation; $\{\varepsilon_t\}$ is a series of independent and identically distributed (iid) random variables, also called white noise, with $E(\varepsilon_t) = 0$ and $Var(\varepsilon_t) = \sigma^2$, i.e. $\varepsilon_t \sim iid(0, \sigma^2)$. This model assumes that all past values have a cumulative impact on the current value $y_t$, since $y_{t-1}$ in turn depends on $y_{t-2}$, and so on. Therefore, it is called a long-term memory model.
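As an illustration of the autoregressive form, the following minimal sketch simulates an AR(1) series, $y_t = \phi y_{t-1} + \varepsilon_t$, and recovers the coefficient by least squares. The coefficient 0.7, the series length, the seed and the function names are illustrative assumptions and are not taken from this study.

```python
import random

def simulate_ar1(phi, n, sigma=1.0, seed=0):
    """Simulate y_t = phi * y_{t-1} + e_t with Gaussian white noise e_t."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, sigma))
    return y

def estimate_ar1(y):
    """Least-squares estimate of phi (regress y_t on y_{t-1}, no intercept)."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den

series = simulate_ar1(phi=0.7, n=5000)  # 0.7 is an assumed, illustrative value
phi_hat = estimate_ar1(series)
print(round(phi_hat, 2))
```

With a long simulated series the estimate should lie close to the assumed value of 0.7; real case counts would first need to be made stationary.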

Moving-Average (MA) Model
The MA procedure in time series analysis is said to have order q, MA(q), if:
$y_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q}$ (2)
This model is expressed in terms of past errors as explanatory variables. Therefore, only the last q errors have an effect on $y_t$; errors of higher order do not affect $y_t$, which means that it is a short-memory model.
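The short-memory property can be checked numerically. The sketch below simulates an MA(1) series, $y_t = \varepsilon_t + \theta \varepsilon_{t-1}$, and compares its sample autocorrelation at lags 1 and 2; for an MA(1) process the lag-1 autocorrelation is $\theta / (1 + \theta^2)$ and all higher lags are zero. The value $\theta = 0.6$, the seed and the helper names are assumptions chosen for illustration.

```python
import random

def simulate_ma1(theta, n, seed=1):
    """Simulate y_t = e_t + theta * e_{t-1} with standard Gaussian white noise."""
    rng = random.Random(seed)
    e = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return [e[t] + theta * e[t - 1] for t in range(1, n + 1)]

def sample_acf(y, lag):
    """Sample autocorrelation of y at the given lag."""
    n = len(y)
    mean = sum(y) / n
    c0 = sum((v - mean) ** 2 for v in y)
    ck = sum((y[t] - mean) * (y[t - lag] - mean) for t in range(lag, n))
    return ck / c0

theta = 0.6  # assumed, illustrative value
y = simulate_ma1(theta, n=20000)
acf1 = sample_acf(y, 1)
acf2 = sample_acf(y, 2)
theoretical_acf1 = theta / (1 + theta ** 2)
print(round(acf1, 3), round(theoretical_acf1, 3), round(acf2, 3))
```

The lag-1 value should be near its theoretical counterpart while the lag-2 value should be near zero, which is the cut-off pattern used to identify the MA order from an ACF plot.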

Autoregressive Moving-Average (ARMA) Model
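The ARMA procedure in time series analysis is said to have orders p and q, ARMA(p,q), if:
$y_t = \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$ (3)
This model is a mixture of the autoregressive and moving-average models above.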

ARIMA Models
ARMA models may be generalized to non-stationary time series by differencing the data series, which gives the ARIMA models. The general non-seasonal model is known as ARIMA(p, d, q), where p, d and q are the order of the autoregressive part, the degree of differencing, and the order of the moving-average part, respectively. For instance, if $y_t$ is a non-stationary time series and its first difference $\Delta y_t$ is stationary, then the ARIMA(p, 1, q) model can be written as:
$\Delta y_t = \phi_1 \Delta y_{t-1} + \dots + \phi_p \Delta y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$ (4)
where $\Delta y_t = y_t - y_{t-1}$. A special case of equation (4) is the random walk model, in which the orders p and q equal zero, i.e. ARIMA(0, 1, 0).
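The differencing step can be sketched in a few lines. The toy series below is a hypothetical quadratic sequence, not WHO data; it is chosen because its second difference is constant, which mirrors the case d = 2. The function name `difference` is an assumption of this sketch.

```python
def difference(series, d=1):
    """Apply the difference operator d times: each pass maps y_t to y_t - y_{t-1}."""
    out = list(series)
    for _ in range(d):
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out

# Hypothetical toy series with a quadratic trend (not real case counts).
toy = [t * t for t in range(8)]   # 0, 1, 4, 9, 16, 25, 36, 49
first = difference(toy, d=1)      # still trending
second = difference(toy, d=2)     # constant, hence stationary in mean
print(first)
print(second)
```

Each pass of differencing shortens the series by one observation, so a series of length n leaves n - d values for model fitting.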

Box-Jenkins Approach
In the analysis of time series, the Box-Jenkins approach uses ARIMA models to find the best fit of a time series model to past values of the series. The method [2] was introduced by the statisticians George Box and Gwilym Jenkins.
For more information about the Box-Jenkins method for analyzing time series data, see Young [3], Frain [4], Kirchgässner, et al. [5], and Chatfield and Xing [6]. The successive stages of this methodology are summarized in figure 1.

Figure-1. Stages in the Box-Jenkins iterative approach
Source: Abonazel and Abd-Elftah [7]
These stages are described in detail below:
• Model identification: Ensure that the variables are stationary, identify any seasonality in the series, and use the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of the series to determine the autoregressive or moving-average components to be included in the model.
• Model estimation: Use statistical techniques to derive the coefficients that best fit the chosen ARIMA model. The most popular approach is the maximum likelihood (ML) method or, alternatively, nonlinear least squares.
• Model checking: Check whether the fitted model complies with the requirements of a stationary univariate process. In particular, the residuals should be independent of each other and constant in mean and variance over time; the ACF and PACF plots of the residuals are helpful in detecting misspecification. If the estimated model is inadequate, we need to go back to the first stage and seek a better one.
In practice, the estimated model is compared with other ARIMA models in order to select the best one. To choose between models, one may use model selection criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), which are defined as:
$AIC = -2 \log(L) + 2k$, $BIC = -2 \log(L) + k \log(n)$,
where L is the maximized value of the likelihood function, k is the number of estimated parameters and n is the number of observations. The model with the smallest value of the criterion is preferred.
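The model selection step can be illustrated with a small sketch that computes AIC = -2 log(L) + 2k and BIC = -2 log(L) + k log(n) for candidate models and keeps the one with the smallest AIC. The log-likelihood values, parameter counts and number of observations below are made-up placeholders, not results from this study, and the function names are assumptions.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion for a model with k estimated parameters."""
    return -2.0 * log_likelihood + 2.0 * k

def bic(log_likelihood, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return -2.0 * log_likelihood + k * math.log(n)

# Hypothetical candidates: name -> (maximized log-likelihood, number of parameters).
candidates = {
    "ARIMA(0,2,1)": (-310.0, 2),
    "ARIMA(0,2,2)": (-309.5, 3),
}
n_obs = 62  # assumed number of observations left after differencing twice

scores = {name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in candidates.items()}
best_by_aic = min(scores, key=lambda name: scores[name][0])
for name, (a, b) in scores.items():
    print(name, round(a, 2), round(b, 2))
print("selected:", best_by_aic)
```

In this made-up case the extra MA term improves the likelihood too little to offset its penalty, so the simpler model is selected; BIC penalizes the extra parameter more heavily than AIC whenever log(n) exceeds 2.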

Holt's Linear Trend Method
The exponentially weighted moving average is an average that smooths random variability and has the following attractive properties: (1) older data receive a declining weight; (2) it is very simple to calculate; and (3) minimal data are needed. The current value of the average is obtained simply by computing the weighted average of two quantities: the present value of the variable and the value of the average from the last period.
This paper uses these attractive properties to smooth out random variations and to continually update seasonal and trend changes, which can then be extrapolated to predict future values. The simplicity of the process, coupled with its modest computational and data requirements, makes it particularly suitable for industrial contexts in which a large number of forecasts are required, for example for the sales of individual goods. A simple example of an exponentially weighted moving average arises in the following stochastic setting. Consider the problem of estimating the expected value of a random variable whose mean varies between successive drawings. The following approach can be proposed: take a weighted mean of all previous data and use it as the prediction of the current mean of the distribution.
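The recursive update described above, in which the new average is a weighted mean of the current observation and the previous average, can be sketched as follows. The smoothing weight 0.5, the toy data and the choice of the first observation as the starting value are assumptions made only for illustration.

```python
def exponential_smoothing(series, alpha):
    """Exponentially weighted moving average.

    Each new value is alpha * observation + (1 - alpha) * previous average,
    so the weight given to older observations decays geometrically.
    The first observation is used as the starting average (an assumption).
    """
    level = series[0]
    smoothed = [level]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
        smoothed.append(level)
    return smoothed

toy = [10, 12, 11, 13]  # hypothetical observations
smoothed = exponential_smoothing(toy, alpha=0.5)
print(smoothed)  # [10, 11.0, 11.0, 12.0]
```

The last smoothed value serves as the forecast for every future period, which is why the forecast function of this simple method is flat.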
Holt [8] generalized simple exponential smoothing to allow the forecasting of data with a trend. This approach includes a forecast equation and two smoothing equations (one for the level and one for the trend):
Forecast equation: $\hat{y}_{t+h|t} = \ell_t + h b_t$
Level equation: $\ell_t = \alpha y_t + (1 - \alpha)(\ell_{t-1} + b_{t-1})$
Trend equation: $b_t = \beta (\ell_t - \ell_{t-1}) + (1 - \beta) b_{t-1}$
where $\ell_t$ denotes the estimate of the level of the series at time t, $b_t$ is the estimate of the trend (slope) of the series at time t, $\alpha$ is the smoothing parameter for the level and $\beta$ is the smoothing parameter for the trend. As in the simple exponential smoothing method, the level equation shows that $\ell_t$ is a weighted mean of the observation $y_t$ and the one-step-ahead training forecast for time t, here given by $(\ell_{t-1} + b_{t-1})$. The trend equation indicates that $b_t$ is a weighted mean of the trend estimated at time t, based on $(\ell_t - \ell_{t-1})$, and $b_{t-1}$, the previous estimate of the trend. The forecast function is no longer flat: the h-step-ahead forecast equals the last estimated level plus h times the last estimated value of the trend, so the forecast is a linear function of h.
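A minimal sketch of Holt's linear trend recursion is given below, using a level that is updated as alpha * y_t + (1 - alpha) * (previous level + previous trend) and a trend that is updated as beta * (change in level) + (1 - beta) * (previous trend). It is not the implementation used in this paper: the initial level and trend are set by a simple heuristic rather than optimized, no Box-Cox transformation is applied, and the data, parameter values and function name are hypothetical.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Holt's linear trend method with heuristic initial values.

    Initial level = first observation and initial trend = first difference
    (an assumption of this sketch; these can also be optimized).
    Returns the forecasts for h = 1, ..., horizon.
    """
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]

# Hypothetical series that grows by exactly 5 per period.
toy = [100 + 5 * t for t in range(10)]  # last value is 145
forecasts = holt_forecast(toy, alpha=0.5, beta=0.5, horizon=3)
print(forecasts)
```

For a perfectly linear series the recursion reproduces the line, so the three forecasts continue it at 150, 155 and 160; on real data the forecasts depend on the smoothing parameters, which are usually chosen by minimizing the sum of squared one-step errors.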

Data Analysis
The purpose of this paper is to predict the confirmed, recovered, and death cases of the coronavirus pandemic (COVID-19) for three selected countries: China, Italy, and the USA. The analysis is conducted by comparing the performance of the selected models, i.e. the exponential growth and ARIMA models, using the R language.

Descriptive Analysis
The data used in this paper are daily time series covering the period from the 22nd of January 2020 to the 25th of March 2020. The resulting sample size is 64, which satisfies the rule of having more than 50 observations in the Box-Jenkins approach to time series forecasting [6]. Based on these data, we propose the appropriate ARIMA model and then compare it to the exponential growth model in forecasting the numbers of confirmed, recovered, and death cases for the next 30 days (26th of March to 24th of April).
Table (1) presents the summary statistics for the numbers of confirmed, recovered, and death cases for the selected countries. Because the coronavirus pandemic began in China, the descriptive statistics for China are higher than those of the other countries. It is also worth noting that the distributions of the data deviate from normality, which suggests that the situation remains out of control.
The preliminary analysis of the data was done using time plots of the series, as shown in figures 2-4. A visual inspection of the line charts indicates that the numbers of confirmed, recovered, and death cases are not stationary. So, to reach stationarity, we difference the series, as is standard practice, when developing the ARIMA models, while the observed (undifferenced) data are used for the exponential growth model. For each country, the best ARIMA model is selected. For more details, see Benvenuto, et al. [9], Bianconi, et al. [10], Dehesh, et al. [11], Elmousalami and Hassanien [12], Gupta and Pal [13], Utsunomiya, et al. [14], among others. Table (2) shows the parameter estimates of the ARIMA models for the selected countries, with the standard errors in parentheses. ARIMA models are very important in time series analysis and are used to analyze autocorrelated data. These models include the autoregressive (AR) model and the moving-average (MA) model.

Exponential Growth VS ARIMA Models
As is obvious from the table, the parameter d in all models equals 2, i.e. the second difference is used. The suitable models for China are ARIMA(0,2,1), ARIMA(0,2,2), and ARIMA(0,2,2) for the confirmed, recovered, and death cases respectively, indicating that the parameters are the same for the three types of data except for the parameter q for confirmed cases. For Italy, the parameters are the same for the three types of data.
Results of the exponential growth model using Holt's method are reported in table (3). The initial values were optimized along with the smoothing parameters; see Hyndman, et al. [15] for details. The optimal values of the smoothing parameters were calculated by minimizing the sum of squared errors over the entire time series and are reported in table (3), where alpha and beta are the smoothing parameters for the level and the trend, respectively. Lambda is the Box-Cox transformation parameter, and the transformation is automatically selected. The very small values of beta indicate that the slope hardly changes over time; the series whose trend is most susceptible to change over time is the number of deaths in Italy.

Forecasted Results
Figures 6 and 7 present the trend of the actual and the forecasted numbers of confirmed, recovered, and death cases with their 95% confidence intervals. The forecasted results indicate the following:
• For China, the numbers of confirmed and death cases will continue to increase slightly, while the number of recovered cases will increase noticeably.
• For Italy, the exponential growth model predicts a substantial increase in the forecasted values compared to the ARIMA models.
• For the USA, the forecasts are the highest among the three countries. The two proposed methods give close predicted values for both recovered and death cases, while the exponential growth model predicts a substantial increase in confirmed cases compared to the ARIMA model.
Selection and accuracy measures for the forecasting models are reported in table (4) and presented in the form of a stacked bar chart in figure (5). The selection and accuracy measures are the Akaike information criterion (AIC), Bayesian information criterion (BIC), mean error (ME), root mean squared error (RMSE), mean absolute error (MAE), mean percentage error (MPE), mean absolute percentage error (MAPE), mean absolute scaled error (MASE) and autocorrelation of errors at lag 1 (ACF1). It is obvious from the table and the graph that the exponential growth method is better than the ARIMA models for these data, since it is associated with lower values for most of the criteria.
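The error-based accuracy measures listed above can be computed from the actual and fitted values as in the following sketch. The numbers are hypothetical and do not come from table (4); the function name and the scaling of MASE by the in-sample error of the naive one-step forecast are stated assumptions of this sketch.

```python
import math

def accuracy(actual, fitted):
    """Common forecast accuracy measures; errors are actual minus fitted.

    Percentage measures assume that no actual value is zero. MASE scales the
    MAE by the mean absolute error of the naive forecast y_t = y_{t-1}.
    """
    errors = [a - f for a, f in zip(actual, fitted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    naive_mae = sum(abs(actual[t] - actual[t - 1]) for t in range(1, n)) / (n - 1)
    return {
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": mae,
        "MPE": 100.0 * sum(e / a for e, a in zip(errors, actual)) / n,
        "MAPE": 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "MASE": mae / naive_mae,
    }

# Hypothetical actual and fitted values (not taken from this study).
actual = [100, 110, 120, 130]
fitted = [98, 112, 119, 133]
measures = accuracy(actual, fitted)
for name, value in measures.items():
    print(name, round(value, 4))
```

A MASE below 1 means the model's average error is smaller than that of the naive forecast, which makes it convenient for comparing models across series with very different scales, such as the case counts of different countries.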

Summary and Conclusion
From the analysis of data issued by the World Health Organization, it is noted that there is an increase in the number of patients and deaths in the countries under study: China, Italy and the United States of America. For China, the numbers of confirmed cases and deaths will continue to increase slightly, while the number of recovered cases will increase noticeably. The situation in China shows a relative improvement compared to the other countries: an increased number of recovered cases and only a slightly increased number of deaths. The Chinese health system has 3.6 beds in intensive care units per 100,000 citizens, but given the slight increase in confirmed cases, China can control this crisis by expanding the construction of mobile intensive care units. However, China remains the most affected country when compared to Italy and the United States of America in terms of the numbers of confirmed cases, deaths and recoveries.
Regarding Italy, the exponential growth model predicts a significant increase in the expected values compared to the ARIMA models. Consequently, more infections and fatalities are expected, with a significant impact in the coming period. The health system of Italy has 12.5 beds in intensive care units per 100,000 citizens, and with the increase in the number of confirmed cases the health system will be disrupted because it cannot absorb them; therefore, Italy has to create more intensive care units.
The forecasts for the United States are the highest among the three countries, and the numbers of confirmed cases, deaths, and recovered cases are expected to continue to increase. A crisis is therefore expected even though the health system in the United States of America has 34.7 beds in intensive care units per 100,000 citizens. The United States of America is a country with a good health sector, but the increased number of confirmed cases will lead to an imbalance. It is thus necessary to expand the establishment of mobile intensive care units to keep this crisis from worsening. Finally, this virus affects all sectors and activities: not only the health sector, but also the economic, commercial, technological, educational, and other domains. Such harmful effects must be met with proper measures in order to protect the other sectors from harm.