Forecasting of COVID-19 Cases in Kurdistan Region Using Some Statistical Models

The novel coronavirus disease (COVID-19) is spreading among people around the world like a geometric progression, and the pathogen is considered one of the most dangerous threats facing humanity. This study aimed to derive the best forecasting models for near-future infected, recovered, and death cases in the four provinces of the Kurdistan Region of Iraq, so that further loss of life can be avoided by directing more health care to specific provinces. Two forecasting methods were used: Exponential Smoothing and ARIMA models. The results indicate that the ARIMA and Exponential Smoothing models performed similarly in predicting the infected cases of COVID-19 in the Kurdistan Region provinces, and the forecasts show that the pandemic might not come under control unless people follow the government's health care instructions and maintain social distancing.


Introduction
The term COVID-19 comes from COrona VIrus Disease, which was identified in China in 2019. COVID-19 is an infectious animal and human disease caused by coronavirus group two [1]. It infects the human respiratory system, causes infections such as pneumonia, and has been declared a pandemic by the World Health Organization (WHO) because of its fast spread (like a geometric progression) around the world. Moreover, the disease has no vaccine, and its genetic material (RNA) mutates periodically. The first infected case was reported in China on November 17, 2019 [2]. About eight months later (on July 21, 2020), the total number of cases worldwide had reached 14.7 million, with about 8.29 million recovered cases and 610,000 deaths [3]. In Iraq, the infected, recovered, and death cases reported on July 21, 2020, were 97003, 62836, and 3969 individuals, respectively, while in the Kurdistan region, in the north of Iraq, the cases reported on the same date were 11367, 6044, and 419 individuals, respectively [4,5]. Some countries (such as Spain, Italy, and China) reported their highest case counts in April 2020, while the USA had a continuous peak after that date [4]. The disease has imposed a quarantine on people around the world, resulting in restrictions on travel and gatherings as well as the closure of schools, institutes, universities, companies, and factories. As a result, the pandemic has caused huge problems for governments and people. Because of its continuous and dramatic increase, it is necessary to forecast its growth or decline in order to judge whether it can be brought under control soon.
Two traditional models are often used to forecast such cases: Exponential Smoothing and ARIMA models, both available within the time series modeler procedure. In a recent study of COVID-19 cases reported in the USA, Italy, and China, Abotaleb [6] used both ARIMA and exponential growth models and concluded that the exponential model is better than ARIMA for predicting COVID-19 cases. The author added that the number of confirmed and death cases in China will continue to increase insignificantly while the number of recovered cases will increase significantly, and that the best ARIMA models for China will be ARIMA(0,2,1), ARIMA(0,2,2), and ARIMA(0,2,2) for the confirmed, recovered, and death cases, respectively. Comparing the two approaches, the author stated that the exponential model produced expected values significantly higher than those of the ARIMA model for the Italy cases. Finally, he expected that the USA would record the highest number of confirmed, recovered, and death cases compared to the other two countries. In another study, Benvenuto, et al. [7] applied the Auto Regressive Integrated Moving Average (ARIMA) model to predict the trend of the incidence and prevalence of COVID-19. They concluded that neither the prevalence nor the incidence of COVID-19 is influenced by seasonality. They added that more data were needed to obtain a reliable trend, and they expected that the spread of COVID-19 would slightly decrease and that the cases would reach a plateau if the virus does not mutate. In a similar study, Perc, et al. [8] predicted COVID-19 cases for four countries (USA, Slovenia, Germany, and Iran) over two successive weeks. They used an exponential growth method and concluded that daily case growth rates should be kept below 5%, which may be achieved by restricting people's behavior.
Chakraborty and Ghosh [9] forecasted the daily infected cases of COVID-19 in Canada, France, India, South Korea, and the UK using ARIMA, Wavelet-based Forecasting (WBF), and hybrid ARIMA-WBF models. They proposed that the hybrid model can explain the non-linear and non-stationary behavior present in the studied datasets of COVID-19 cases. They added that the fitted ARIMA models for the studied datasets were ARIMA(1,2,1), ARIMA(1,1,2), ARIMA(0,1,1), ARIMA(2,1,0), and ARIMA(2,2,2) for India, Canada, France, South Korea, and the UK, respectively. Several other investigations have also predicted future COVID-19 cases for China using traditional time series forecasting models [10][11][12].
The present study aims to derive the best forecasting models for infected cases in the four provinces of the Kurdistan region (Erbil, Sulaymaniyah, Duhok, and Halabja), so that further loss of life can be avoided by directing more health care to specific governorates.

Methodology
The present study was carried out on the daily and cumulative data of infected, recovered, and death cases of COVID-19 in the four provinces (Erbil, Sulaymaniyah, Duhok, and Halabja) of the Kurdistan region (KR-Iraq) during the period from March 1, 2020, to July 31, 2020 (153 days). The raw data were obtained from the official website of the Kurdistan region government [5]. The objective was to predict COVID-19 cases for the upcoming days, over a horizon equal to half of the observed period (about 76 days). The study was performed using two forecasting methods of time series analysis (ARIMA and Exponential Smoothing models) within SPSS software [13]. Such procedures often yield precise forecasts, especially for short-run predictions based on a relatively large amount of raw data [14].

ARIMA Model
ARIMA is a classical time series analysis method used for predicting linear tendencies in time series data. Its autoregressive and moving-average components assume a stationary series; ARIMA handles a non-stationary series by differencing it (for example, by modeling the first difference of y) until it becomes stationary. The model is written ARIMA(p, d, q), where p and q are the orders of the Autoregressive (AR) and Moving-Average (MA) components, respectively, and d is the order of differencing [9]. Mathematically, the ARIMA model can be expressed as the main model (Model 1) as follows:
$\Delta y_t = c + a_1 \Delta y_{t-1} + a_2 \Delta y_{t-2} + \cdots + a_p \Delta y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \cdots - \theta_q \varepsilon_{t-q}$ (Model 1)
where $\Delta y_t = y_t - y_{t-1}$ is the change in the actual observations at time t; $\varepsilon_t$ is the random error at the same time t; and c, $a_i$, and $\theta_j$ are the ARIMA model's coefficients. The ARIMA model assumes that the errors have zero mean and constant variance ($E(\varepsilon_t) = 0$ and $\mathrm{Var}(\varepsilon_t) = \sigma^2$) and that they are independently and identically distributed (iid). In the case of p = q = 0, the model is called the random walk model and is described as ARIMA(0,1,0). Deriving an ARIMA model for a given time series can be summarized in three steps: identification of the model (achieving stationarity); estimation of the parameters, where plots of the autocorrelation function (ACF) and the partial autocorrelation function (PACF) are used to choose the values of p and q; and diagnostics, in which the candidate model(s) are checked for adequacy and the 'optimum' fitted predicting model is found using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) [15].
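To make the role of d and of the MA term concrete, the following is a minimal Python sketch of forecasting with an ARIMA(0,2,1) model, the form most often selected later in this study. It is not the SPSS procedure: it assumes no constant term (c = 0), takes the MA coefficient `theta` as given rather than estimating it, and starts the residual recursion from zero. The function names are illustrative.

```python
def difference(series, d=1):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def forecast_arima_021(series, theta, horizon):
    """Forecast ARIMA(0,2,1) without constant: w_t = eps_t - theta * eps_{t-1},
    where w_t is the second difference of the series (assumed sketch)."""
    w = difference(series, d=2)

    # Recover residuals recursively, assuming the pre-sample residual is 0.
    eps = 0.0
    for w_t in w:
        eps = w_t + theta * eps

    # Integrate the forecast second differences back to the original scale.
    level = series[-1]
    slope = series[-1] - series[-2]
    forecasts = []
    for h in range(1, horizon + 1):
        # Only the first step depends on the last residual; later steps are 0.
        w_hat = -theta * eps if h == 1 else 0.0
        slope += w_hat
        level += slope
        forecasts.append(level)
    return forecasts


if __name__ == "__main__":
    # Hypothetical cumulative counts, for illustration only.
    print(forecast_arima_021([0, 1, 3, 6], theta=0.5, horizon=2))
```

Because the forecast second differences are zero after the first step, the forecasts of an ARIMA(0,2,1) model continue along a straight line whose slope is fixed at the first forecast step.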
The autoregressive model of order p, AR(p), expresses the time series ($y_t$) in terms of its own past values:
$y_t = c + a_1 y_{t-1} + \cdots + a_p y_{t-p} + \varepsilon_t$ (Model 2)
The Moving-Average (MA) model expresses the time series ($y_t$) in terms of past errors and is called the moving-average process of order q, MA(q):
$y_t = c + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \cdots - \theta_q \varepsilon_{t-q}$ (Model 3)
This model is a short-memory model because only the q most recent errors affect $y_t$. The Autoregressive Moving-Average (ARMA) model has both orders p and q, written ARMA(p, q), and combines the two previous models as follows:
$y_t = c + a_1 y_{t-1} + \cdots + a_p y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \cdots - \theta_q \varepsilon_{t-q}$ (Model 4)
There is also the Box-Jenkins method [17] for analyzing time series. This method derives the best-fitting ARIMA model by iterating over the previous values of a time series until a satisfactory model is obtained. It has four stages instead of three, as follows:
Model identification: identifying any seasonality in the series, ensuring the stationarity of the variables, and using the Auto-Correlation Function (ACF) and Partial Auto-Correlation Function (PACF) plots to determine the best autoregressive (AR) or moving-average (MA) component in the series model.
Model estimation: using Maximum Likelihood Estimation (MLE) or non-linear least-squares estimation to obtain the best-fitting ARIMA model coefficients.
Model checking: checking whether the estimated model complies with the characteristics of a stationary univariate process, where the residuals should be independent and constant in mean and variance; the residual ACF and PACF plots are used to detect misspecification. If the estimated model is unsuitable, the analysis returns to the first stage to derive an adequate model. The resulting model is then compared with other ARIMA models to select the best fit, using two criteria: Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC).
Forecasting: once the chosen ARIMA model is judged satisfactory for a stationary univariate process, it can be used for prediction.
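The identification and checking stages both rely on the sample autocorrelation function. A minimal Python sketch, assuming a plain list of numbers as input, computes the sample ACF and the Ljung-Box Q statistic, which is also reported among the fit statistics in this study. The function names are illustrative, and the p-value, which requires the chi-square distribution, is not computed here.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag of a series."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squares
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]


def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic Q = n(n+2) * sum_k r_k^2 / (n - k)."""
    n = len(residuals)
    r = sample_acf(residuals, max_lag)
    return n * (n + 2) * sum(rk * rk / (n - k) for k, rk in enumerate(r, start=1))


if __name__ == "__main__":
    # Hypothetical residuals that alternate in sign (strong lag-1 correlation).
    resid = [1, -1, 1, -1, 1, -1]
    print(sample_acf(resid, 2))
    print(ljung_box_q(resid, 1))
```

A large Q relative to the chi-square reference distribution indicates that autocorrelation remains in the residuals, which is the kind of misspecification the checking stage looks for.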

Exponential Smoothing Model
Exponential smoothing averages out the random variability in a series and is also called an exponentially weighted moving average; here it is applied using Holt's Linear Trend Method [18]. This model has the following properties: (i) the weights decline for older data; (ii) it is simple to calculate; and (iii) only a small amount of data is required. It updates the current average as a weighted average of the present value of a variable and its average value from the past. In effect, the model takes a weighted average of all past data and uses it as the forecast of the current mean of the distribution, creating an exponentially smoothed average from which forecasts can be made. The method involves a prediction model and two smoothing equations (level and trend equations). The prediction model is:
$\hat{y}_{t+h|t} = l_t + h b_t$ (Model 5)
The level equation is:
$l_t = \alpha y_t + (1 - \alpha)(l_{t-1} + b_{t-1})$
The trend equation is:
$b_t = \gamma (l_t - l_{t-1}) + (1 - \gamma) b_{t-1}$
where $l_t$ denotes the estimate of the series level at time t; $b_t$ is the estimate of the series slope at the same time t; $\alpha$ is the smoothing factor in the level equation ($0 \le \alpha \le 1$); and $\gamma$ is the smoothing parameter in the trend equation ($0 \le \gamma \le 1$). As in simple exponential smoothing, the level equation shows that $l_t$ is a weighted average of the observation $y_t$ and the one-step-ahead forecast $(l_{t-1} + b_{t-1})$; in the trend equation, $b_t$ is a weighted average of $(l_t - l_{t-1})$ and $b_{t-1}$ at time t. The prediction function is therefore linear in h: it equals the last estimated level plus h times the last estimated trend.
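The level and trend recursions above translate directly into code. The following is a minimal Python sketch of Holt's linear trend method, assuming fixed values of alpha and gamma (SPSS estimates them from the data) and a simple initialization in which the first level is the first observation and the first trend is the first difference. Both the initialization and the function name are assumptions made for illustration.

```python
def holt_forecast(series, alpha, gamma, horizon):
    """Holt's linear trend method: returns forecasts for h = 1..horizon."""
    # Assumed initialization: l_1 = y_1, b_1 = y_2 - y_1.
    level = series[0]
    trend = series[1] - series[0]

    for y in series[1:]:
        prev_level = level
        # Level equation: l_t = alpha*y_t + (1 - alpha)*(l_{t-1} + b_{t-1})
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        # Trend equation: b_t = gamma*(l_t - l_{t-1}) + (1 - gamma)*b_{t-1}
        trend = gamma * (level - prev_level) + (1 - gamma) * trend

    # Prediction model: y_hat_{t+h|t} = l_t + h*b_t
    return [level + h * trend for h in range(1, horizon + 1)]


if __name__ == "__main__":
    # Hypothetical cumulative counts growing by 2 per day.
    print(holt_forecast([2, 4, 6, 8], alpha=0.5, gamma=0.5, horizon=3))
```

For a perfectly linear input the method reproduces the line exactly, whatever the smoothing parameters; with a small gamma, the trend estimate reacts slowly to changes in slope.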

Running the Analysis
The time series Forecasting procedure in SPSS version 26 [13] was used to run the analysis. Two forecasting models (ARIMA and Exponential Smoothing) were applied to the cumulative daily infected cases of the four provinces (Sulaymaniyah, Erbil, Halabja, and Duhok) of the Kurdistan region. The analysis covered the period from March 1, 2020, to July 31, 2020 (153 days), and the forecasting was carried out for the next interval, equal to half of that period (76 days), from August 1, 2020, to October 15, 2020. Descriptive statistics were computed for the daily cases, and the Chi-square test was applied to the recovered, death, and gender proportions, using the same software.
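The observation and forecast windows described above can be checked with simple date arithmetic. The short Python sketch below assumes that both end dates are inclusive and that the horizon is half of the observed days, rounded down; the function name is illustrative.

```python
from datetime import date, timedelta


def forecast_window(start, end):
    """Return (observed days, horizon, first forecast day, last forecast day)."""
    n_obs = (end - start).days + 1   # both end dates inclusive
    horizon = n_obs // 2             # half of the observed period, rounded down
    first = end + timedelta(days=1)
    last = end + timedelta(days=horizon)
    return n_obs, horizon, first, last


if __name__ == "__main__":
    print(forecast_window(date(2020, 3, 1), date(2020, 7, 31)))
    # -> (153, 76, datetime.date(2020, 8, 1), datetime.date(2020, 10, 15))
```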

Descriptive Statistics
The descriptive statistics of the daily infected cases in the four studied provinces over the study period are presented in Table (1). Table (1) shows that Sulaymaniyah province had the highest mean number of infected cases compared to the other provinces, because the COVID-19 pandemic first began in Sulaymaniyah. The skewness, kurtosis, and standard deviation estimates indicate that the cases of all provinces are not normally distributed; therefore, the median values were used to give a general view of the cases. Sulaymaniyah province had the highest median (6), followed by Erbil province (4), while both Duhok and Halabja provinces were in a safer position than the other two provinces. This finding is a warning sign that more health care should be directed to both Sulaymaniyah and Erbil provinces. Regarding the percentages of recovered and death cases, the Chi-square analysis showed that the differences in both proportions were highly significant (p<0.01), with Chi-square values of 39.34 and 58.24 for the recovered and death percentages, respectively. Halabja province recorded the highest recovered percentage (85.3 %) of infected cases, while Duhok province recorded the lowest death percentage (0 %) of infected cases (Figure 1). The overall expected percentages of recovered and death cases were 65.55 % and 3.9 %, respectively; the remaining percentage represents the cases still under treatment. The people of both Sulaymaniyah and Erbil should therefore apply the instructions followed by the people of Duhok and Halabja provinces, to minimize the death cases and maximize the recovered ones.
With respect to the gender percentages of the infected cases, the Chi-square test showed no significant difference (p>0.05) between the ratios of the two sexes across the four provinces (Figure 2), with a Chi-square value of 6.7. In general, however, the male percentage was higher than the female percentage, which may be due to the greater movement of males than females in Kurdistan during the quarantine period.
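The Chi-square tests reported above compare proportions across provinces. The following is a minimal Python sketch of the Pearson Chi-square statistic for a contingency table; the counts in the usage example are hypothetical and are not the study's data, and the function name is illustrative. If the gender test is read as a 4 x 2 table (an assumption, since the degrees of freedom are not stated), it has 3 degrees of freedom and a 5 % critical value of about 7.815, so the reported value of 6.7 falls below it, in line with p>0.05.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


if __name__ == "__main__":
    # Hypothetical (male, female) counts for four provinces, illustration only.
    counts = [[60, 40], [55, 45], [58, 42], [50, 50]]
    stat, df = chi_square_statistic(counts)
    print(round(stat, 3), df)
```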
The cumulative observed infected cases of the four provinces are illustrated in Figure (3), which shows that Sulaymaniyah province had the highest number of cases (8522), followed by Erbil province (4328), while both Halabja and Duhok provinces had the lowest numbers of cases (475 and 532, respectively). The curves fluctuated in a similar way across the four provinces.

ARIMA Model Results
The results of the ARIMA models for the cumulative infected cases of the four provinces are presented in Table (2), together with their autoregressive (AR) and moving-average (MA) components and significance levels. The ARIMA models used non-transformed data for all provinces except Sulaymaniyah, where a square root transformation was applied to bring the data closer to a normal distribution. As shown in Table (2), the best forecasting model for the infected cases of Erbil province is ARIMA(1,2,2), while the corresponding model for Sulaymaniyah is ARIMA(0,2,1), which is also suitable for the Duhok infected cases; for Halabja's infected cases, ARIMA(0,2,4) is reported as the best forecasting model. Table (2) also shows that the differencing parameter (d) equals two (d=2) in all models, meaning that the series were differenced twice. The autoregressive (AR) component, which corresponds to (p) in the ARIMA model, is active only in the Erbil model (Lag 1), while the moving-average (MA) components, which correspond to (q), are active and significant in all forecasting models (at Lag 2, Lag 1, Lag 1, and Lag 4 for Erbil, Sulaymaniyah, Duhok, and Halabja provinces, respectively). Abotaleb [6] found similar results and trends in ARIMA forecasting models for the USA and China, but his findings for Italy did not agree with the present results. The forecast curves of the ARIMA models for the cumulative number of infected cases in the four provinces are illustrated in Figure (4). As shown in Figure (4), the number of infected cases in Erbil province is expected to increase linearly for a considerable time, and the observed numbers may triple. For Sulaymaniyah province, the expected number of infected cases will increase slightly, and less than in Erbil.
For Duhok, the predicted number of infected cases will increase linearly, and more than in Erbil. For Halabja, the expected number of infected cases will increase gradually and linearly, similar to Erbil. Contrary findings were reported by Benvenuto, et al. [7], who stated that the spread of COVID-19 will slightly decrease and that the cases will reach a plateau if the virus does not mutate.

Exponential Smoothing Model Results
The forecasting results for the infected cases of the four provinces of the Kurdistan region, obtained by exponential smoothing models using Holt's method, are presented in Table (3). The table shows that the estimates of all model parameters (the Alpha level and Gamma trend parameters) for the cumulative infected cases of the four provinces are highly significant (p<0.01). The Exponential Smoothing Model used a square root transformation to bring the data closer to a normal distribution. The significant parameters of the Exponential Smoothing Model indicate that the predictions of future cases for the studied areas are meaningful, and the government should issue specific health care instructions to control the spread of this pandemic disease (COVID-19). A similar study by Perc, et al. [8], which forecasted COVID-19 cases in the USA, Slovenia, Germany, and Iran for 14 successive days using an exponential growth method, reported that the rate of cases should not exceed 5 %. The forecast curves of the exponential smoothing models (using Holt's method) for the cumulative infected cases, with 95 % confidence intervals, are illustrated for the studied provinces in Figure (5).
As Figure (5) shows, the expected number of infected cases in Erbil province will increase non-linearly with a narrow range of fluctuation (depending on the government instructions and restrictions), and the observed numbers may be multiplied 5.5 times. For Sulaymaniyah province, the expected number of infected cases will increase slightly, with wide but reasonable ranges; this trend is similar to that of the ARIMA model, which may be related to the data transformation used. For Duhok province, the expected number of cumulative infected cases will increase non-linearly within a tight range (here the government should apply more health care and restrictions), and the observed numbers may be multiplied 8 times. For Halabja province, the expected trend will be similar to that of Erbil but less intense, and the observed cases may be multiplied 4 times. The sharp changes over time in the slope of the curves might result from the small Gamma estimates of the models. These results confirm that the government should maintain the restrictions and health care instructions for a longer time to control the COVID-19 pandemic. Moreover, several studies by Chakraborty and Ghosh [9]; Li, et al. [10]; Roosa, et al. [11]; and Wu, et al. [12] have forecasted COVID-19 cases in different countries, and their findings differed in the shape of the curves because of the structure of the data distribution.

ARIMA vs. Exponential Smoothing Models
To compare the two forecasting models (ARIMA vs. Exponential Smoothing) and select the best prediction models, the comparison should be based on fit statistics (accuracy measures), which include: the autocorrelation of errors at the first lag (ACF1), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), Maximum Absolute Percentage Error (MaxAPE), Maximum Absolute Error (MaxAE), Bayesian Information Criterion (BIC), and Ljung-Box significance. Table (4) compares the fit statistics of the two forecasting models for the COVID-19 infected cases of the studied provinces of the Kurdistan region, Iraq. The table shows that the ARIMA models for Erbil and Sulaymaniyah are significant at the levels of (p<0.01) and (p<0.05), respectively, according to the Ljung-Box fit statistic, while the Duhok and Halabja models were insignificant. For the Exponential Smoothing Models, only the Duhok province model was highly significant (p<0.01) according to the Ljung-Box fit statistic. In general, the two methods are close to each other for forecasting the time series of the studied COVID-19 cases in the studied provinces, since almost half of the fit statistics (accuracy measures) are lower for ARIMA and the remaining half are lower for the Exponential Smoothing model (Table 4). The present results disagree with the findings of Abotaleb [6], who found that the exponential growth models were better than the ARIMA models for forecasting COVID-19 confirmed cases in the USA, Italy, and China, which may be attributed to the data structure. Finally, the cumulative daily cases of all provinces are summarized in Figure (7) as the overall infected cases that might be expected in the Kurdistan region. Because the previous results indicated that ARIMA and Exponential Smoothing give close findings, the Expert Modeler within SPSS software was used for the overall Kurdistan cases.
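Several of the accuracy measures listed above are simple functions of the observed and fitted values. The following is a minimal Python sketch of RMSE, MAE, MAPE, MaxAE, and MaxAPE; it divides by the number of observations, whereas statistical packages may adjust for the number of estimated parameters, so the values need not match SPSS output exactly. The function name and the example numbers are illustrative.

```python
import math


def fit_statistics(observed, fitted):
    """Accuracy measures comparing observed values with model-fitted values."""
    errors = [o - f for o, f in zip(observed, fitted)]
    abs_errors = [abs(e) for e in errors]
    # Percentage errors are undefined where the observed value is zero.
    pct_errors = [100 * abs(e) / abs(o) for e, o in zip(errors, observed) if o != 0]
    n = len(errors)
    return {
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs_errors) / n,
        "MAPE": sum(pct_errors) / len(pct_errors),
        "MaxAE": max(abs_errors),
        "MaxAPE": max(pct_errors),
    }


if __name__ == "__main__":
    # Hypothetical observed and fitted cumulative counts.
    print(fit_statistics([100, 200, 400], [110, 190, 400]))
```

Lower values indicate a closer fit for each of these measures, which is the sense in which the two model families are compared in Table (4).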
The Expert Modeler selected an ARIMA(0,2,1) model ($R^2 = 1$) with a square root transformation, an MA lag 1 estimate of 0.684 (p<0.01), and an ACF value of -0.100. As shown in Figure (7), the forecasted number of infected cases is expected to grow exponentially, so people should follow health care instructions and maintain social distancing.

Conclusions
It can be concluded from the present study that the ARIMA and Exponential Smoothing models applied to the studied data give similar forecasts of the infected cases of COVID-19. The trends of the predicted cases for the four provinces indicate that the COVID-19 epidemic cannot be controlled in the Kurdistan region provinces unless people follow all government instructions, because the present results point to a dangerous situation.

Suggestion
This study suggests that the negative effects of COVID-19 (on health, commercial relationships, the economy, education, technology, travel, etc.) must be confronted by following the government's instructions, taking more health care precautions, and maintaining social distancing, in order to avoid further loss of life and to allow human activities to continue normally.