Hong Kong SARS-CoV-2 Pandemic Trend Analysis

About

This notebook presents a simple analysis of the SARS-CoV-2 pandemic in Hong Kong. It analyzes and visualizes the progress of the pandemic from several perspectives, using two datasets.

**Note** In Jupyter Notebook and JupyterLab, you can see the documentation for a Python function by pressing ``SHIFT + TAB``. Press it twice to expand the view.

In this tutorial, we obtain the Hong Kong dataset 'Latest situation of reported cases of COVID-19' from data.gov.hk, which hosts a range of important data related to the Coronavirus Disease (COVID-19). We use the pandas package, a fast and easy-to-use data analysis tool for the Python programming language, to manipulate the table contents. In this step, we read the data table into a DataFrame object named 'df_hk_covid'. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure.
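As a minimal sketch of this step, the cell below reads a CSV table into 'df_hk_covid' with pandas.read_csv. The three rows are made-up placeholder values, not real figures, and the day/month/year date strings are an assumption about the file layout. In the notebook, the path or URL of the file downloaded from data.gov.hk would take the place of the in-memory sample.

```python
import io

import pandas as pd

# Made-up three-row excerpt standing in for the downloaded CSV file.
# The values are placeholders for illustration only.
sample_csv = io.StringIO(
    "As of date,Number of confirmed cases,Number of death cases,Number of discharge cases\n"
    "30/08/2021,100,2,90\n"
    "31/08/2021,103,2,92\n"
    "01/09/2021,107,3,95\n"
)

# Read the table into a DataFrame; a file path or URL works the same way.
df_hk_covid = pd.read_csv(sample_csv)

print(df_hk_covid)        # table contents
print(df_hk_covid.shape)  # (number of rows, number of columns)
```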

From the cell above, we can see an overview of the data table, including the number of rows, the number of columns, the column names, the contents of the first five rows, and the contents of the last five rows. We can further investigate a DataFrame object through the pandas.DataFrame.info function.
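A short sketch of the call, using a made-up two-column table in place of the real data:

```python
import pandas as pd

# Made-up placeholder table; the real notebook uses the data.gov.hk table.
df_hk_covid = pd.DataFrame({
    "As of date": ["30/08/2021", "31/08/2021", "01/09/2021"],
    "Number of confirmed cases": [100, 103, 107],
})

# Print one summary line per column: index, name, non-null count, dtype.
df_hk_covid.info()
```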

The cell above shows the column index (#), column name (Column), number of non-missing values of a column (Non-Null Count), and the variable type of the column (Dtype). That information is helpful for the further analysis of the data.

Preprocessing

Before the analysis, we need to preprocess the data: we use the function pandas.to_datetime to convert the dates in the DataFrame into a format that the machine can recognize.
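The conversion can be sketched as follows. The dates are made-up placeholders, and the explicit day/month/year format string is an assumption about how the raw file writes its dates; it should be adjusted if the file uses another layout.

```python
import pandas as pd

# Made-up placeholder table with dates stored as plain strings.
df_hk_covid = pd.DataFrame({
    "As of date": ["30/08/2021", "31/08/2021", "01/09/2021"],
    "Number of confirmed cases": [100, 103, 107],
})

# Parse the strings into datetime values and write them back to the column.
# The format string is an assumption (day/month/year).
df_hk_covid["As of date"] = pd.to_datetime(
    df_hk_covid["As of date"], format="%d/%m/%Y"
)

print(df_hk_covid["As of date"].dtype)  # a datetime64 dtype
print(df_hk_covid["As of date"].max())  # latest date in the table
```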

In the cell above, we select a column with the expression "df_hk_covid['As of date']". This expression returns the contents of the selected column as a pandas Series. You can select another column of interest by replacing 'As of date' with its name. The function pandas.to_datetime then converts the contents of the column to the datetime64 format, and we replace the contents of the 'As of date' column with the converted values.

Through the function pandas.DataFrame.info, we can see that the data type of the column 'As of date' has changed to 'datetime64[ns]', which the machine can recognize as dates. The maximum date in the table is the latest timestamp: "Timestamp('2021-09-01 00:00:00')".

We can therefore find the row holding the latest record and store its index in 'idx_latest_date'.
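One way to locate that row is pandas.Series.idxmax, which returns the index label of the largest value. Whether the original cell used this call is an assumption; the dates below are placeholders.

```python
import pandas as pd

# Made-up placeholder table with already-parsed dates.
df_hk_covid = pd.DataFrame({
    "As of date": pd.to_datetime(["2021-08-30", "2021-08-31", "2021-09-01"]),
    "Number of confirmed cases": [100, 103, 107],
})

# Index label of the row whose date is the latest.
idx_latest_date = df_hk_covid["As of date"].idxmax()

print(idx_latest_date)
print(df_hk_covid.loc[idx_latest_date])
```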

Visual analysis of the number of SARS-CoV-2 cases over time

To visualize the number of SARS-CoV-2 cases over time, we select the column 'Number of confirmed cases' as follows:

We also extract the latest number of confirmed cases as follows:
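The column selection and the latest-value lookup can be sketched together. The numbers are made-up placeholders, and using a single index label with pandas.DataFrame.loc (rather than a boolean condition) is a choice made for this sketch.

```python
import pandas as pd

# Made-up placeholder table with already-parsed dates.
df_hk_covid = pd.DataFrame({
    "As of date": pd.to_datetime(["2021-08-30", "2021-08-31", "2021-09-01"]),
    "Number of confirmed cases": [100, 103, 107],
})

# Select one column; the result is a pandas Series.
confirmed_cases = df_hk_covid["Number of confirmed cases"]

# Row label of the latest date, then a (row, column) lookup with .loc.
idx_latest_date = df_hk_covid["As of date"].idxmax()
latest_confirmed = int(
    df_hk_covid.loc[idx_latest_date, "Number of confirmed cases"]
)

print(latest_confirmed)
```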

Here, we apply pandas.DataFrame.loc to access a group of rows and columns by a specific condition. The condition of a row is the last day of the table, and the condition of the column is to select the number of confirmed cases. Therefore, we select the number of confirmed cases on the last day of the table. We apply the function int() to transform our query result to an integer.

Then, we visualize the Accumulated SARS-CoV-2 Cases in Hong Kong by Time as a line plot as follows:
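A minimal matplotlib sketch of such a line plot, with made-up placeholder values. The "Agg" backend line only lets the sketch run without a display; in a notebook it can be dropped so the figure renders inline. The axis labels are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up placeholder table with already-parsed dates.
df_hk_covid = pd.DataFrame({
    "As of date": pd.to_datetime(["2021-08-30", "2021-08-31", "2021-09-01"]),
    "Number of confirmed cases": [100, 103, 107],
})

fig, ax = plt.subplots(figsize=(8, 4))

# Cumulative confirmed cases against the date.
ax.plot(df_hk_covid["As of date"], df_hk_covid["Number of confirmed cases"])
ax.set_title("Accumulated SARS-CoV-2 Cases in Hong Kong by Time")
ax.set_xlabel("Date")
ax.set_ylabel("Number of confirmed cases")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```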

Visual analysis of the daily cases of SARS-CoV-2

In this analysis, we would like to visualize the daily confirmed cases of SARS-CoV-2 in Hong Kong. However, the data table does not include this information directly. Therefore, we apply pandas.DataFrame.diff to calculate the difference between the number of confirmed cases in each row and the number in the previous row, as follows:
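A sketch of the calculation on made-up placeholder values. The new column name 'Daily confirmed cases' is hypothetical, and the sketch calls diff on the selected column (a Series).

```python
import pandas as pd

# Made-up placeholder cumulative counts.
df_hk_covid = pd.DataFrame({
    "As of date": pd.to_datetime(["2021-08-30", "2021-08-31", "2021-09-01"]),
    "Number of confirmed cases": [100, 103, 107],
})

# diff() gives row-to-row differences; the first row has no predecessor and
# becomes NaN, so it is filled with 0 before converting to integers.
df_hk_covid["Daily confirmed cases"] = (
    df_hk_covid["Number of confirmed cases"].diff().fillna(0).astype(int)
)

print(df_hk_covid)
```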

In the previous cell, we also applied pandas.DataFrame.fillna because the first row has no previous row to compare with, so its difference is missing (NaN). Then we applied pandas.DataFrame.astype to convert the calculated results to integers.

Afterward, we select the latest number of new cases, and find the peak number of daily cases with pandas.DataFrame.max, to annotate our line plot:
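These two values can be sketched as below. The column name 'Daily confirmed cases' and the variable names are hypothetical, and the counts are placeholders.

```python
import pandas as pd

# Made-up placeholder daily counts.
df_hk_covid = pd.DataFrame({
    "As of date": pd.to_datetime(
        ["2021-08-29", "2021-08-30", "2021-08-31", "2021-09-01"]
    ),
    "Daily confirmed cases": [0, 5, 3, 4],
})

# Daily count on the latest date.
idx_latest_date = df_hk_covid["As of date"].idxmax()
latest_new_cases = int(df_hk_covid.loc[idx_latest_date, "Daily confirmed cases"])

# Peak daily count, and the date on which it occurred.
peak_new_cases = int(df_hk_covid["Daily confirmed cases"].max())
idx_peak = df_hk_covid["Daily confirmed cases"].idxmax()
peak_date = df_hk_covid.loc[idx_peak, "As of date"]

print(latest_new_cases, peak_new_cases, peak_date)
```

The peak date and value can then be passed to an annotation call on the plot, for example matplotlib's Axes.annotate.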

Then, we visualize the SARS-CoV-2 Daily Confirmed Cases in Hong Kong as a line plot as follows:

Visual analysis of the active cases of SARS-CoV-2

The number of active SARS-CoV-2 cases is not provided in the dataset either. By definition, the number of active cases is the 'Number of confirmed cases' minus the 'Number of death cases' minus the 'Number of discharge cases'. This time, we would like to visualize the trend of active cases of SARS-CoV-2 in Hong Kong in an alternative way. First, we calculate the number of active cases as follows:
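The definition above translates into a column-wise subtraction. The values are made-up placeholders and the new column name 'Number of active cases' is hypothetical.

```python
import pandas as pd

# Made-up placeholder cumulative counts.
df_hk_covid = pd.DataFrame({
    "Number of confirmed cases": [100, 103, 107],
    "Number of death cases": [2, 2, 3],
    "Number of discharge cases": [90, 92, 95],
})

# active = confirmed - deaths - discharged, computed row by row.
df_hk_covid["Number of active cases"] = (
    df_hk_covid["Number of confirmed cases"]
    - df_hk_covid["Number of death cases"]
    - df_hk_covid["Number of discharge cases"]
)

print(df_hk_covid)
```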

Likewise, we select the latest records to make some annotation on our stack plot:

The trend of SARS-CoV-2 Active Cases in Hong Kong as a stack plot is then shown as follows:
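A minimal sketch of a stack plot with matplotlib's Axes.stackplot. Which series the original figure stacks is not stated, so stacking active, discharged, and death counts is an assumption, as are the title wording and the placeholder values.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up placeholder counts.
df_hk_covid = pd.DataFrame({
    "As of date": pd.to_datetime(["2021-08-30", "2021-08-31", "2021-09-01"]),
    "Number of active cases": [8, 9, 9],
    "Number of discharge cases": [90, 92, 95],
    "Number of death cases": [2, 2, 3],
})

fig, ax = plt.subplots(figsize=(8, 4))

# Each series is drawn as a filled band stacked on top of the previous one.
layers = ax.stackplot(
    df_hk_covid["As of date"],
    df_hk_covid["Number of active cases"],
    df_hk_covid["Number of discharge cases"],
    df_hk_covid["Number of death cases"],
    labels=["Active", "Discharged", "Deaths"],
)
ax.set_title("SARS-CoV-2 Active Cases in Hong Kong")
ax.legend(loc="upper left")
fig.autofmt_xdate()
```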

Analysis of Vaccination

In this section, we analyze vaccination statistics from the SARS-CoV-2 pandemic in Hong Kong. The data can be accessed through the complete Our World in Data COVID-19 dataset:

Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020)

In this analysis, we would like to visualize the proportion of vaccination in Hong Kong as a pie chart.

This dataset contains worldwide information on the SARS-CoV-2 pandemic. We select the subset of data related to the Hong Kong vaccination status with pandas.DataFrame.loc as follows:

Here, we extracted 588 rows and 7 related columns. We can see that vaccination information in the early stage of the pandemic is missing. Therefore, we apply pandas.DataFrame.fillna to fill in zeros as the vaccination counts for the early stage.
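The subsetting and the zero fill can be sketched on a tiny made-up table. The name 'df_owid' is hypothetical, the column names are assumptions about the Our World in Data layout, and the sketch keeps four columns rather than the seven selected in the notebook.

```python
import pandas as pd

nan = float("nan")

# Made-up placeholder rows standing in for the worldwide dataset.
df_owid = pd.DataFrame({
    "location": ["Hong Kong", "Hong Kong", "Hong Kong", "Japan"],
    "date": ["2021-02-21", "2021-02-22", "2021-02-23", "2021-02-22"],
    "total_cases": [10.0, 12.0, 15.0, 99.0],
    "people_vaccinated": [nan, nan, 5.0, 7.0],
    "people_fully_vaccinated": [nan, nan, 1.0, 2.0],
})

columns_of_interest = [
    "date", "total_cases", "people_vaccinated", "people_fully_vaccinated",
]

# Row condition: Hong Kong only. Column condition: the selected columns.
is_hong_kong = df_owid["location"] == "Hong Kong"
df_hk_vac = df_owid.loc[is_hong_kong, columns_of_interest].copy()

# Missing vaccination counts in the early stage are treated as zero.
df_hk_vac = df_hk_vac.fillna(0)

print(df_hk_vac)
```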

Similar to the previous analysis, we also process date information as follows:

Then we calculate the number and portion of unvaccinated and vaccinated people. Here, the number of people who have received only the first dose is the number of people vaccinated minus the number of people fully vaccinated.
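A sketch of these calculations, assuming the table carries a 'population' column to serve as the denominator. The new column names and all values are hypothetical.

```python
import pandas as pd

# Made-up placeholder counts for a population of 100 people.
df_hk_vac = pd.DataFrame({
    "date": pd.to_datetime(["2021-02-22", "2021-09-01"]),
    "population": [100.0, 100.0],
    "people_vaccinated": [0.0, 40.0],
    "people_fully_vaccinated": [0.0, 10.0],
})

# First dose only = vaccinated at least once - fully vaccinated.
df_hk_vac["first_dose_only"] = (
    df_hk_vac["people_vaccinated"] - df_hk_vac["people_fully_vaccinated"]
)
# Unvaccinated = population - vaccinated at least once.
df_hk_vac["unvaccinated"] = (
    df_hk_vac["population"] - df_hk_vac["people_vaccinated"]
)

# Convert each count into a portion of the population.
for column in ["unvaccinated", "first_dose_only", "people_fully_vaccinated"]:
    df_hk_vac[column + "_portion"] = df_hk_vac[column] / df_hk_vac["population"]

print(df_hk_vac)
```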

Then we select the portion of unvaccinated/vaccinated people on the last day in the dataset:

The SARS-CoV-2 Vaccine Dose Status in Hong Kong as a pie chart is then shown as follows:
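A minimal sketch of selecting the last day and drawing the pie chart with matplotlib's Axes.pie. The portion columns, labels, and values are hypothetical placeholders.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up placeholder portions; each row sums to 1.
df_hk_vac = pd.DataFrame({
    "date": pd.to_datetime(["2021-08-31", "2021-09-01"]),
    "unvaccinated_portion": [0.62, 0.60],
    "first_dose_only_portion": [0.29, 0.30],
    "fully_vaccinated_portion": [0.09, 0.10],
})

# Row for the last day in the table.
latest = df_hk_vac.loc[df_hk_vac["date"].idxmax()]

labels = ["Unvaccinated", "First dose only", "Fully vaccinated"]
sizes = [
    latest["unvaccinated_portion"],
    latest["first_dose_only_portion"],
    latest["fully_vaccinated_portion"],
]

fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(sizes, labels=labels, autopct="%1.1f%%")  # print percentages on wedges
ax.set_title("SARS-CoV-2 Vaccine Dose Status in Hong Kong")
```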

Cross-validation of multiple datasets

Sometimes, we have multiple data tables that contain different information but can be related through a specific column. In this case, we can apply pandas.DataFrame.merge to merge those data tables. In our case, the two datasets are related by the 'date' and 'As of date' columns. Therefore, we can merge the two data sources and cross-validate the number of cases recorded in the two datasets.
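A sketch of such a merge on two tiny made-up tables. Both key columns must hold the same data type (here, parsed dates) for the rows to match.

```python
import pandas as pd

# Made-up placeholder tables that overlap on two dates.
df_hk_vac = pd.DataFrame({
    "date": pd.to_datetime(["2021-08-30", "2021-08-31", "2021-09-01"]),
    "total_cases": [100.0, 103.0, 107.0],
})
df_hk_covid = pd.DataFrame({
    "As of date": pd.to_datetime(["2021-08-31", "2021-09-01", "2021-09-02"]),
    "Number of confirmed cases": [103, 107, 110],
})

# Inner join: keep only dates that appear in both tables.
df_merged = df_hk_vac.merge(
    df_hk_covid, how="inner", left_on="date", right_on="As of date"
)

print(df_merged)
```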

In the previous cell, we merged the data tables 'df_hk_vac' and 'df_hk_covid' on 'date' and 'As of date'. We specified how="inner" to perform an inner join, which means that only rows that appear in both tables are preserved.

As a result, the merged data table has 22 columns drawn from both data tables, and 588 rows appear in both datasets. Then we select the number of cases from the two data sources:

Here, we calculate the proportion of rows in which the case counts agree between the two datasets, as follows:
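One way to compute such a proportion is to compare the two columns element-wise and take the mean of the resulting boolean Series. The column names and values below are placeholders, and treating exact equality as agreement is an assumption.

```python
import pandas as pd

# Made-up placeholder merged table; one of the four rows disagrees.
df_merged = pd.DataFrame({
    "total_cases": [10.0, 12.0, 15.0, 20.0],
    "Number of confirmed cases": [10, 12, 16, 20],
})

# Element-wise comparison gives True/False per row; the mean of a boolean
# Series is the fraction of True values.
agrees = df_merged["total_cases"] == df_merged["Number of confirmed cases"]
total_cases_agreement = float(agrees.mean())

print(f"{total_cases_agreement:.0%} of rows agree")
```

The same comparison can be repeated for the daily case columns.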

We can see that only around 90% of total cases and 87% of daily cases are consistent across the two datasets. We draw a scatter plot to show the inconsistency between the two datasets: