Clustering COVID-19 around the world with the DBSCAN algorithm

1. Introduction

What data was analyzed?
I analyzed a time series of confirmed COVID-19 cases in different countries around the world. For (almost) every country in the world, I had a list of dates and count of coronavirus cases confirmed by the test outcome. I used the data from Kaggle Novel Corona Virus 2019 Dataset with the addition of more detailed data for my home country - Poland (from KWIRUS.pl) and some statistical data (GDP, population, population density, average yearly temperature) from Wikipedia.
What is the purpose of this analysis?
I wanted to be able to compare how the coronavirus pandemic develops in various countries. I accounted for various factors: different populations of the country, different dates the coronavirus reached the given country, etc. I wanted to have a simple tool with which I could look at any two countries and compare them.
What methods did I use to analyze the data?
I was inspired by the analysis of Mark Handley from UCL. My methodology was similar: analyzing confirmed cases count on a logarithmic scale, computing the daily increase rate (%) as the angle on the plot of confirmed cases in time with a logarithmic scale. I went one step further and used machine learning: I clustered time series for countries with the DBSCAN algorithm. In this way, I extracted several groups of patterns followed by different countries.

I was inspired by the analysis of Mark Handley from UCL. My methodology was similar: analyzing confirmed cases count on a logarithmic scale, computing the daily increase rate (%) as the angle on the plot of confirmed cases in time with a logarithmic scale. I went one step further and used machine learning: I clustered time series for countries with the DBSCAN algorithm. In this way, I extracted several groups of patterns followed by different countries.

2. The analysis

2.1 Italy – analyzing one country

Here are 4 plots that are basic tools of analysis of confirmed cases for a given country. In these cases it is Italy. Italy was the first country in Europe to have a major coronavirus outbreak. The first 2 plots in the upper row are just standard time series of the cumulative number of cases (left) and daily new cases (right). The scale is linear. Exponential growth makes it difficult to show more details. However, the daily new cases (right) has started to decline a few days ago – the peak is clearly visible.

The biggest plot (second row) shows the same data as the plot on the left but has a logarithmic scale. In order not to analyze noise I choose as the beginning of pandemic (day 0) the day the cumulative number of cases was 100 or above. This is quite a standard approach across many analysis which can be found online.

There are a few straight lines fitted to the curve. Each line represents an exponential function with different parameters. It seems that:

up to day 10, the daily increase was roughly 43%. It means doubling the number of cases every 1.94 days
after day 10, the daily increase was 22%. It means doubling the number of cases every 3.46 days
The daily increase kept slowing down to 13% (day 24), 8% (day 31). Currently, it is 3% which means doubling every 26.63 days.

The last plot shows a daily increase (in %) as it slowed down from above 40% in the beginning to 3%.

For me, the plot indicates that the major factor that contributed to the tragic situation in Italy was not stopping the spread in the first 2 weeks where the daily increase rates were extremely high.

2.2 Comparing selected countries

The next step of the analysis was comparing multiple countries together on one plot. I selected a few countries (China, Italy, Spain, the US, etc.) which have been making headlines recently and my home country – Poland. For China, I restricted data to Hubei province, for the US – to New York state.

Here are the major countries plotted together. I made a few changes to the plots which make the visualization nicer and easier to look at:

normalizing the confirmed cases by the population of a given country/region – the y-axis is now the count per million inhabitants
now the day 0 of the epidemic is the day when there were at least 10 confirmed cases per million inhabitants
smoothing daily increase over one week and computing it for week 0, week 1, week 2, etc. The plot now looks like stairs :)

Here are the key observations:

Spain and New York, the US have an order of magnitude more cases than other countries had (Hubei, China, Italy, etc.) at this stage of the epidemic
New York, the US had an extremely high daily increase % in week 0 and week 1 before joining the other countries in week 2
Spain had an extremely high daily increase % in week 0 before joining the other countries in week 1
South Korea had quite high daily increase % in week 0 but managed to slow it down much better than other countries
Germany follows the same path as Italy. However, reading the news, I know that Germany is performing many more tests and the situation in Germany is much better than in Italy. It is the limitation of analyzing only confirmed case data.
Poland (my home country) daily increase was lower from the beginning. Hopefully, it is an effect of the early lockdown, not the limited testing capability.

2.3 Clustering all the countries

Finally having established good methods to plot countries together I switched to machine learning. I wanted to cluster countries so that countries with similar behaviour could be detected automatically. Clustering is a machine learning technique of automatically detecting patterns in the data. For more technical details please read section 5. Clustering - technical details

Clustering results

Here are the clustering results. The first plots are the same as the previous plots comparing different countries. This time the lines with the same color correspond to the countries in the same cluster. Below the plots, there is a table with basic information about every cluster.

cluster	countries	cluster description	average GDP per capita nominal ($)	average yearly temperature (°C)	average daily increase (%) week 0	average daily increase (%) week 1	average daily increase (%) week 2	average infected count per million week 0	average infected count per million week 1	average infected count per million week 2
N/A	averages for all 73 analyzed countries	N/A	25 525$	14°C	28%	14%	11%	53	157	442
outside cluster	Bahrain, Hubei, China, Denmark, Djibouti, Estonia, Israel, Lebanon, Oman, Qatar, Slovenia, Trinidad and Tobago, New York, US, Kosovo	noise points not belonging to any cluster	30 331$	17°C	38%	15%	10%	79	257	741
cluster 0	Ireland, Norway, Turkey	extremely high increase in week 0 (43%) and quite efficient slowing down in week 1 (but still high - 19%). Slowing down in week 2 is substantial - to 11%.	54 901$	7°C	43%	19%	11%	92	309	658
cluster 1	Armenia, Australia, Austria, Belgium, Bosnia and Herzegovina, Chile, Croatia, Czechia, Finland, Germany, Lithuania, Moldova, Portugal, Romania, Serbia, Spain, Switzerland	High increases in week 0 (31%) followed by a high increase in week 1 (18%) and lower but still above average in week 2 (13%). This is quite a diverse cluster and cluster 2 and cluster 3 are subparts of it with more specific behaviour. However, it contains a lot of Western European countries with moderate climate and quite a high GDP	28 674$	9°C	31%	18%	13%	59	203	518
cluster 2	Canada, France, Italy, Netherlands, Panama, United Kingdom	High increases in week 0 (29%) not slowed down much in week 1 (22%). Week 2 is also high - 15%. High GDP (Western Europe mostly) and moderate climate	38 426$	10°C	29%	22%	15%	46	182	482
cluster 3	Dominican Republic, Ecuador, Iran, Latvia, Mauritius, New Zealand, North Macedonia, Sweden	Extremely high increase in week 0 (36%) slowed down more efficiently than cluster 2 (to 12% and then to 8%). Much lower GDP than cluster 2 and 1	18 611$	14°C	36%	12%	8%	68	152	266
cluster 4	Greece, Korea, South, Uruguay	High increase in week 0 (28%) followed by slowing down to 12%	22 811$	15°C	28%	12%	5%	48	105	140
cluster 5	Azerbaijan, Bulgaria, Costa Rica, Hungary, Malaysia, Poland, Saudi Arabia, Slovakia, Tunisia	A lot of Central Eastern European countries and countries with warm climate. Countries started slower from the beginning (21%) and decreased to 10% in week 1. Moderate average GDP.	12 824$	16°C	21%	10%	6%	33	65	94
cluster 6	Albania, Brazil, Peru, United Arab Emirates	Countries which kept the increase at around 14% for 3 weeks. Warm climate.	14 741$	21°C	14%	15%	13%	22	60	167
cluster 7	Argentina, Jordan, South Africa, Thailand	Similar start to cluster 6 (15%) with slowing down in week 1. Warm climate.	7 041$	19°C	15%	5%	N/A	22	32	N/A
cluster 8	Georgia, Japan, West Bank and Gaza	Countries which kept the increase at around 9% for 3 weeks. Very diverse economically	16 111$	13°C	9%	9%	9%	17	32	52
cluster 9	Kuwait, Singapore, Taiwan	High GDP countries with warm climate. Managed to keep extremely small increases for 3 weeks (3-6%)	39 360$	25°C	6%	3%	6%	13	17	26

Conclusions from the cluster table and the plots

The clusters are ordered by the count of confirmed cases per million inhabitants after 2 weeks (descending). It also means that the average daily increase % is lower for the last few clusters. It is worth noting that the cluster number seems to correlate with GDP per capita and average temperature. It may indicate that fewer reported cases are the result of smaller testing capacity in the poorer countries. Or that transmission rate depends on the climate. Further investigation is needed which effect is dominant and truly casual - GDP or climate. GDP is correlated with the climate which must be taken into account.

However, putting aside the GDP and climate correlations, this clustering method gives a few interesting insights such as the behaviour of Kuwait-Taiwan-Singapore cluster 9 or the differentiation of Western European countries into a few different clusters of behaviour (clusters 0-2).

It is also worth noting that the distribution of daily increase % in week 0 seems to be the most sparse. It tends to converge for later weeks (week 1, week 2, etc.). The typical value of daily grow % for week 1 is roughly 14% and the typical value for week 2 is 10-11%. In other words, the factor that makes the difference between the countries is the daily increase % in the early weeks.

3. Conclusions

different countries experience different daily increase % (growth rates) in various stages of the pandemic. The daily increase usually slows down after 2-3 weeks.
many countries follow similar trajectories of the pandemic development. However different clusters of behaviour seem to correlate with GDP per capita and/or climate
if the country didn’t slow down the exponential growth in the first week, the results are bad
the behaviour of different countries during the pandemic can be clustered into a few patterns
it is very hard to account for underreporting of coronavirus cases by analyzing only the confirmed cases data (see the correlation with GDP)

4. Limitations

What are the limitations of this analysis?

4.1 Using only counts of confirmed cases

It would be nice to include other currently available data: the daily count of death cases, the daily count of performed tests, the daily count of recovered cases. I only use confirmed cases counts, I do not compute CFR (case fatality rate) which is a crucial factor. In the future (sometime after pandemic) it would be nice to include the statistics of all deaths in a country. Unreported coronavirus cases could be accounted for with an increase in deaths during pandemic relative to the previous year. The availability of such data may strongly depend on the country.

Links

Here are the sources hinting the large numbers of unreported coronavirus deaths:

4.2 Different testing capacity in different countries

Around the world, countries test and detect the different percentages of the coronavirus cases. Most probably some countries detect only a small sample of real cases, in other countries only the most obvious cases are tested and asymptotic coronavirus carriers are undetected. Some high-tech societies (for example South Korea) are known to track and test every single contact of a coronavirus case.

It would be good if for a given country the ratio detected cases and all cases would be constant in time. In such a situation, we would be able to evaluate the daily increase % factor. Sadly, it may not be the cases when the pandemic overwhelms the country’s health system or when the country increases its testing capability gradually as a response to the pandemic.

For example, the official number of coronavirus cases is similar in Ecuador and in South Korea (about 190 deaths as of April 7th, 2020) but the situation in these two countries is quite the opposite: in South Korea daily life is relatively unaffected whereas in Ecuador government cannot manage to collect dead bodies from the streets. It is a clear example of an underreporting and overwhelmed health system.

Links

Here are links to articles describing the situation in Ecuador and South Korea

4.3 Not modeling the pandemic

I am not an epidemiologist. I do not model the pandemic with differential equations. I only fit exponential functions to the data. I do not make any predictions for the future apart from fitting trends. As far as I know, the most popular model in epidemiology is SEIR.

Links

Here are links to analysis based on variations of the SEIR model:

5. Clustering - technical details

What is clustering?

Clustering is an unsupervised machine learning technique. A clustering algorithm finds by itself groups (so-called clusters) such that objects in one group are more similar to each other than to those in other groups. Here is a good primer on clustering methods from the scikit-learn library.

Here are typical results from a clustering algorithm which detected 3 clusters in the data (image source: scikit-learn)

Clustering methodology

I used the DBSCAN clustering algorithm (more precisely the scikit-learn implementation). Each clustering algorithm needs either a vectorization of the records to be clustered or a definition of a distance between the records. I used the mean relative difference in the logarithm of confirmed cases counts averaged over time. I also included a small addition for the mean relative difference in daily increase % averaged over time. The purpose is of such an addition is to favour clustering countries with similar dynamics of changes not only with similar counts. The most important hyperparameter in the DBSCAN algorithm is the epsilon parameter - ε. I chose the ε by hand, trying to maximize the number of clusters and minimize the number of points outside any cluster. I tested about 5 different values of ε. I use minimum sample size=3. DBSCAN has a known limitation of assuming a constant density of data points in all clusters. To overcome it I ran two additional clustering rounds:

with ε/2 to divide the biggest clusters into smaller subclusters
with 2ε to additionally cluster noise points outside any cluster

From all the countries in the world I kept 73 countries which have:

the population of at least 1 million
reached at least 10 cases per million inhabitants at least 2 weeks ago
more than 100 cases

Who am I?

I am Aleksandra Chrabrowa and here is my LinkedIn profile.

You can contact me via e-mail at ok.l1m3k at gmail.com

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.