Intro
When disruptive events grow in frequency and reach global scale, affecting our health, our society and the economy, it is crucial to manage and analyze them and to help find solutions that reduce both the events and their impact.
At the moment, the worldwide outbreak of COVID-19 has reached more than 81,100 people in China alone, where the outbreak originated, and the number of confirmed cases globally is more than 185,041.
The number of people confirmed to have died as a result of the virus at the moment is more than 8000.
The virus has been declared a pandemic by the World Health Organization, meaning that it is spreading rapidly all around the world. So far, 156 countries have confirmed cases.
At the moment the epicenter of the virus is Europe, with the largest number of confirmed cases in Italy, approximately 31500 cases.
The number of deaths in Europe is growing even more rapidly than it did in China at the same stage of the outbreak. Our main objective in showing this process of Compared Analysis and Forecasts is to aggregate the existing research, bring together all of the relevant data, and allow everyone who reads this to understand how important early research on the coronavirus outbreak is.
Data for this analysis is obtained from
Timeframe of the collected data
- from 22 January 2020 to 17 March 2020
Software used in this analysis
- Python 3.7.1
- pandas 1.0.1
- numpy 1.18.1
- fbprophet 0.6
SETUP
Import needed libraries
First we need to import every package that is necessary for the analysis and forecast
import pandas as pd
import glob
from fbprophet import Prophet
READ FILES
Reading files
Read all the comma-separated (csv) files, one per day, and concatenate them into one big DataFrame in order to analyse the whole data set (table)
df_all=pd.DataFrame()
for name in glob.glob("*.csv"):
    df_temp=pd.read_csv(name)
    # the last 10 characters before ".csv" are the report date
    df_temp['Date']=name[-14:-4]
    frames=[df_all,df_temp]
    df_all=pd.concat(frames,sort=False)
    del(df_temp)
# convert the date strings to real datetime values
df_all['Date']=pd.to_datetime(df_all['Date'])
df_all.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6438 entries, 0 to 6437
Data columns (total 9 columns):
Province/State    3826 non-null object
Country/Region    6438 non-null object
Last Update       6438 non-null object
Confirmed         6419 non-null float64
Deaths            5997 non-null float64
Recovered         6050 non-null float64
Date              6438 non-null datetime64[ns]
Latitude          3620 non-null float64
Longitude         3620 non-null float64
dtypes: datetime64[ns](1), float64(5), object(3)
memory usage: 503.0+ KB
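As a variation on the loop above, the daily frames can be collected in a list and concatenated once, which avoids re-copying the growing DataFrame on every iteration. This is only a minimal sketch: the function name `read_daily_reports` and its `folder` parameter are hypothetical, and it assumes each file name ends in `MM-DD-YYYY.csv`, as the slice `name[-14:-4]` suggests.

```python
import glob
import os

import pandas as pd


def read_daily_reports(folder):
    """Read every daily CSV in `folder` into one DataFrame with a Date column.

    Hypothetical helper; assumes file names end in MM-DD-YYYY.csv.
    """
    frames = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        df_day = pd.read_csv(path)
        # Take the 10 characters before ".csv" as the report date.
        df_day["Date"] = os.path.basename(path)[:-4][-10:]
        frames.append(df_day)
    # One concat at the end; raises ValueError if no files were found.
    df = pd.concat(frames, sort=False, ignore_index=True)
    df["Date"] = pd.to_datetime(df["Date"], format="%m-%d-%Y")
    return df
```

Calling `read_daily_reports(".")` would read the files from the current directory, like the original loop does.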
DATA NORMALIZATION
Mend Discrepancies in Country Names
There are many naming inconsistencies in the 'Country/Region' column, so they should be fixed in order to have precise and clean data afterwards
df_all['Country/Region']=df_all.apply(lambda x: 'China' if x['Country/Region'] =='Mainland China' else x['Country/Region'],axis=1)
df_all['Country/Region']=df_all.apply(lambda x: 'Iran' if x['Country/Region'] =='Iran (Islamic Republic of)' else x['Country/Region'],axis=1)
df_all['Country/Region']=df_all.apply(lambda x: 'South Korea' if x['Country/Region'] =='Republic of Korea' else x['Country/Region'],axis=1)
df_all['Country/Region']=df_all.apply(lambda x: 'South Korea' if x['Country/Region'] =='Korea, South' else x['Country/Region'],axis=1)
df_all['Country/Region']=df_all.apply(lambda x: 'Taiwan' if x['Country/Region'] =='Taiwan*' else x['Country/Region'],axis=1)
df_all['Country/Region']=df_all.apply(lambda x: 'Russia' if x['Country/Region'] =='Russian Federation' else x['Country/Region'],axis=1)
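The same renaming can also be sketched with a single lookup table and `Series.replace`, which keeps all the variant spellings in one place. The names `COUNTRY_ALIASES` and `normalize_countries` are hypothetical; the spellings themselves are the ones handled above.

```python
import pandas as pd

# Variant spelling -> normalized name (same pairs as in the replacements above).
COUNTRY_ALIASES = {
    "Mainland China": "China",
    "Iran (Islamic Republic of)": "Iran",
    "Republic of Korea": "South Korea",
    "Korea, South": "South Korea",
    "Taiwan*": "Taiwan",
    "Russian Federation": "Russia",
}


def normalize_countries(df, column="Country/Region"):
    """Return a copy of df with the known variant names replaced."""
    out = df.copy()
    # Exact-match replacement; names not in the mapping are left untouched.
    out[column] = out[column].replace(COUNTRY_ALIASES)
    return out
```

A new variant spelling then only needs one more entry in the dictionary.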
Make Distinction between China and other Countries
Mark China and Others separately, since cases in China have almost stopped growing
df_all['Country']=df_all.apply(lambda x: 'China' if x['Country/Region'] =='China' else 'Other',axis=1)
Analysis on China vs Others
- Group by Country (China/Others) and Date
- Calculate Percentages
df_all_grouped=df_all.groupby(by=['Date','Country'])[['Confirmed','Recovered','Deaths']].sum().reset_index()
# share of confirmed cases that recovered / died, in percent
df_all_grouped['Percentage_Recovered']=df_all_grouped['Recovered']/df_all_grouped['Confirmed']*100
df_all_grouped['Percentage_Dead']=df_all_grouped['Deaths']/df_all_grouped['Confirmed']*100
df_all_grouped.head(10)
|   | Date       | Country | Confirmed | Recovered | Deaths | Percentage_Recovered | Percentage_Dead |
| 0 | 01-22-2020 | China   | 547.0     | 28.0      | 17.0   | 5.118830             | 3.107861        |
| 1 | 01-22-2020 | Other   | 8.0       | 0.0       | 0.0    | 0.000000             | 0.000000        |
| 2 | 01-23-2020 | China   | 639.0     | 30.0      | 18.0   | 4.694836             | 2.816901        |
| 3 | 01-23-2020 | Other   | 14.0      | 0.0       | 0.0    | 0.000000             | 0.000000        |
| 4 | 01-24-2020 | China   | 916.0     | 36.0      | 26.0   | 3.930131             | 2.838428        |
| 5 | 01-24-2020 | Other   | 25.0      | 0.0       | 0.0    | 0.000000             | 0.000000        |
| 6 | 01-25-2020 | China   | 1399.0    | 39.0      | 42.0   | 2.787706             | 3.002144        |
| 7 | 01-25-2020 | Other   | 39.0      | 0.0       | 0.0    | 0.000000             | 0.000000        |
| 8 | 01-26-2020 | China   | 2062.0    | 49.0      | 56.0   | 2.376334             | 2.715810        |
| 9 | 01-26-2020 | Other   | 56.0      | 3.0       | 0.0    | 5.357143             | 0.000000        |
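The percentages are plain ratios, so the first China row can be checked by hand: 17 / 547 * 100 ≈ 3.107861. One caveat is that a group with zero confirmed cases would produce a division by zero. The sketch below adds a guard for that case; the guard and the helper name `add_percentages` are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd


def add_percentages(df):
    """Add recovered/dead percentages, leaving NaN where Confirmed is 0.

    Hypothetical helper; the zero guard is an added assumption.
    """
    out = df.copy()
    # Replace non-positive confirmed counts with NaN so the ratio is NaN, not inf.
    confirmed = out["Confirmed"].where(out["Confirmed"] > 0)
    out["Percentage_Recovered"] = out["Recovered"] / confirmed * 100
    out["Percentage_Dead"] = out["Deaths"] / confirmed * 100
    return out


# The two rows for 22 January 2020 from the output above.
first_day = pd.DataFrame({
    "Country": ["China", "Other"],
    "Confirmed": [547.0, 8.0],
    "Recovered": [28.0, 0.0],
    "Deaths": [17.0, 0.0],
})
first_day_pct = add_percentages(first_day)
```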
LAST DATE ANALYSIS
Last Date Analysis
- Analyze only last/max date
- Grouped by Country/Region
df_all_last=df_all[df_all.Date==df_all.Date.max()]
df_all_last=df_all_last.groupby(by=['Country/Region'])[['Confirmed','Deaths']].sum().reset_index()
df_all_last['Percentage_Dead']=df_all_last['Deaths']/df_all_last['Confirmed']*100
df_all_last.sort_values(by=['Percentage_Dead'],ascending=False,inplace=True)
df_all_last.head(10)
|     | Country/Region | Confirmed | Deaths | Percentage_Dead |
| 135 | Sudan          | 1.0       | 1.0    | 100.000000      |
| 58  | Guatemala      | 2.0       | 1.0    | 50.000000       |
| 61  | Guyana         | 4.0       | 1.0    | 25.000000       |
| 148 | Ukraine        | 7.0       | 1.0    | 14.285714       |
| 110 | Philippines    | 142.0     | 12.0   | 8.450704        |
| 69  | Iraq           | 124.0     | 10.0   | 8.064516        |
| 72  | Italy          | 27980.0   | 2158.0 | 7.712652        |
| 2   | Algeria        | 54.0      | 4.0    | 7.407407        |
| 10  | Azerbaijan     | 15.0      | 1.0    | 6.666667        |
| 90  | Martinique     | 15.0      | 1.0    | 6.666667        |
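The top of this ranking is dominated by countries with only a handful of confirmed cases (Sudan, with one case and one death, comes out at 100%). One possible refinement, sketched below, is to rank only countries above a minimum number of confirmed cases. The threshold of 100 and the helper name `rank_by_death_percentage` are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd


def rank_by_death_percentage(df_last, min_confirmed=100):
    """Rank countries by death percentage, skipping very small case counts.

    Hypothetical helper; min_confirmed=100 is an arbitrary illustrative cutoff.
    """
    out = df_last[df_last["Confirmed"] >= min_confirmed].copy()
    out["Percentage_Dead"] = out["Deaths"] / out["Confirmed"] * 100
    return out.sort_values(by="Percentage_Dead", ascending=False)


# A few rows taken from the output above.
sample_last = pd.DataFrame({
    "Country/Region": ["Sudan", "Guatemala", "Philippines", "Iraq", "Italy"],
    "Confirmed": [1.0, 2.0, 142.0, 124.0, 27980.0],
    "Deaths": [1.0, 1.0, 12.0, 10.0, 2158.0],
})
ranked = rank_by_death_percentage(sample_last)
```

With this cutoff, only the Philippines, Iraq and Italy remain from the sample rows, in that order.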
FORECASTING
Forecasting
1. We use a linear model with daily seasonality turned off, since the data contains only dates (no times of day), and with yearly seasonality turned off as well
2. This version of FB Prophet raises an error if the columns are not named ['ds', 'y']
3. We predict the next 10 days
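# total confirmed cases per day, summed over all regions
df_forecast = df_all.groupby('Date')['Confirmed'].sum().reset_index()
# Prophet expects the date column to be named 'ds' and the value column 'y'
df_forecast.columns = ['ds', 'y']
model = Prophet(daily_seasonality=False, yearly_seasonality=False)
model.fit(df_forecast)
# extend the timeline by 10 days and predict
future = model.make_future_dataframe(periods=10)
forecast = model.predict(future)
fig1 = model.plot(forecast)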
Covid-19 Forecast
CONFIRMED CASES
Confirmed Cases Trend Model
The Trend Model for the Confirmed Cases is highly significant (p-value < 0.0001), and the trend is more than 10000 new cases per day
Trend model
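A per-day trend like this is the slope of a line fitted to the daily totals. The sketch below shows one simple way to estimate such a slope, an ordinary least-squares fit with `numpy.polyfit`. It does not reproduce the trend model or the p-value reported here; it uses only the five China totals for 22-26 January 2020 from the table earlier in this article, so the resulting number describes those early days only.

```python
import numpy as np

# China confirmed totals for 22-26 January 2020 (from the grouped table earlier).
confirmed = np.array([547.0, 639.0, 916.0, 1399.0, 2062.0])
days = np.arange(len(confirmed))  # 0, 1, 2, 3, 4

# Degree-1 fit: confirmed ~ slope * day + intercept.
slope, intercept = np.polyfit(days, confirmed, deg=1)

print(f"Estimated trend: {slope:.1f} new cases per day")
```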
VISUALIZATIONS
Visualizations
Visualizations are made with Tableau Desktop Public 2019.2.3
On 15.03, confirmed cases in Other Countries surpassed those in China (86444 vs 81003)
Confirmed Cases by Country 01.03.2020
Confirmed Cases by Country 17.03.2020
Confirmed Cases by Day show anomalous spikes on 13.02.2020 (15133) and 13.03.2020 (16837), and there is also an evident trend of more than 10000 cases per day in the last 5 days
Confirmed Cases by day
Difference from Previous Day on Reported Confirmed Cases shows a sharp rise in Other Countries
Confirmed Cases difference from previous day
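The difference from the previous day can be computed from the cumulative totals with a grouped `diff`. This is a minimal sketch using the first five days of the China/Other table from earlier in this article; the column name `New_Confirmed` is a hypothetical choice.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25", "2020-01-26"]
)
daily = pd.DataFrame({
    "Date": list(dates) * 2,
    "Country": ["China"] * 5 + ["Other"] * 5,
    "Confirmed": [547.0, 639.0, 916.0, 1399.0, 2062.0,
                  8.0, 14.0, 25.0, 39.0, 56.0],
})

# Sort so that each group's rows are in date order before differencing.
daily = daily.sort_values(["Country", "Date"])
# First day of each group has no previous day, so its difference is NaN.
daily["New_Confirmed"] = daily.groupby("Country")["Confirmed"].diff()
```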
Forecasts on Confirmed Cases show a halt in China and (almost) exponential growth in other countries. The top countries by cases besides China are Italy, Iran, South Korea, Spain, Germany, France and the US. They all show a serious growth rate (>1) on both linear and logarithmic scales.
30 Days Linear forecast on Confirmed Cases
10 Days Linear forecast on Confirmed Cases
10 Days Logarithmic Forecast
10 Days Linear Forecast on Top Countries
10 Days Logarithmic Forecast on Confirmed Cases in Top Countries
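The text does not define the growth rate, so the sketch below assumes one common reading: the ratio of each day's cumulative total to the previous day's. A ratio above 1 means the total is still rising, and on a logarithmic scale a constant ratio shows up as equal steps, which is why exponential growth looks like a straight line there. The numbers are the "Other" totals for 22-26 January 2020 from the table earlier in this article.

```python
import numpy as np
import pandas as pd

confirmed = pd.Series(
    [8.0, 14.0, 25.0, 39.0, 56.0],
    index=pd.to_datetime(
        ["2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25", "2020-01-26"]
    ),
)

# Assumed definition: today's total divided by yesterday's total.
growth_factor = confirmed / confirmed.shift(1)

# On a log scale, each daily step equals the log of that day's growth factor.
log_steps = np.log10(confirmed).diff()
```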
RECOVERED CASES
The number of Recovered cases is quite high in China, but that is probably because the outbreak there started more than a month earlier.
In other countries the Recovered Percentage of Confirmed Cases oscillates around 10%
Recovered Cases
Recovered Cases by Day
Recovered Cases % from Confirmed Cases
Recovered Cases by Country 17.03.2020
The forecast on Recovered Cases predicts that China will be “clean” by the end of March 2020
10 Days Linear Forecast on Recovered Cases
DEATHS
Deaths have almost come to a halt in China, while Other countries show strong growth, with 3100+ deaths in the last 5 days, especially in Italy.
Deaths
Deaths by Day
Deaths by Country 17.03.2020
Death Percentage is close to 4%, but in some Regions it is considerably higher (Italy 7.94%, Iran 6.11%)
Deaths % from Confirmed Cases
Deaths % from Confirmed Cases in Top Countries
Deaths forecast is a sad thing to do...
10 Days Linear Forecast on Deaths
COMPARED ANALYSIS BENEFITS
What kind of benefit does compared analysis and forecast provide?
- One of the biggest advantages of comparing events over time is discovering trends, finding patterns, marking key points/influences and getting “in touch” with the actual situation through the numbers or data.
- Accurate forecasting helps us reduce negative outcomes, schedule meaningful actions and avoid unnecessary ones, and ultimately manage any situation/case/problem better overall.
Why are the compared analysis and forecast important in situations like this?
In these dire times for humanity as a whole, every (qu)bit of intelligence and wisdom we can find and share with each other, by any means and methods, will help in saving lives. What is more important than that?