COVID-19 | Underreported Deaths

A visualization of the official number of COVID-19 deaths and the remaining that went underreported in Mexico City.

5 min readOct 5, 2022

The Context

The government of Mexico City provides open data on the city’s daily death certificates. At the same time, by mid-2020, it was providing daily reports on the daily death toll caused by COVID-19. It was apparent, for many people living in the capital, that the official numbers were much lower than what people were experiencing. A comparison between these numbers provides an approximation to the extent of the underreporting.

The Idea

The idea is to visualize the number of deaths caused by COVID-19 registered by federal authorities for Mexico City (CDMX) and compare that figure with the number of death certificates registered by the local authorities of CDMX. The intuition behind this is that the records should show that the excess deaths when comparing 2020 relative to the same dates for previous years is greater than the official number of COVID-19 deaths. The suspicion is that a number of public policies may have caused a significant underreporting of coronavirus deaths. Comparing these datasets will provide a more accurate picture of the extent of this probable phenomenon.

The Data

The data needed to implement the idea are the following, all corresponding to CDMX:

The average daily number of deaths during past years.
The daily number of deaths during the year 2020.
The official daily number of COVID-19 deaths.

The first two pieces of information can be found on the official open data site of the CDMX local government. In that site, we can find a page where a database is available that includes the record of all the death certificates issued by the local government since 2017 until approximately a couple of weeks before the query. This database has the basic details for each record: date, place, and cause of death.

The last piece of information can be found in an official database by the federal government. A CSV file can be downloaded from the site which contains records of tests performed for SARS-CoV-2 up to one day prior to the query, plenty of additional information accompanies each record. Of particular interest is the information on where the test was performed and whether the person tested died.

The Process

The process consists in data cleaning and manipulation in order to give the information the desired format. Likewise, certain computations are made from this data in order to obtain additional variables. Finally, the graph is plotted, in this case a time series with lines that reveal differences over time. The tool used in the process is the R language with the support of multiple tidyverse libraries.

The first part of the process involves creating a table with the number of deceased cases for CDMX. To do this, we filter from the federal database those COVID-19 cases that resulted in deaths and whose state of residence is the CDMX, ENTIDAD_RES == 9. Only those records with a date of death earlier than the most recent date for which there is a record of death certificates in the CDMX are taken. This is done to subsequently group all these records according to the date of death — FECHA_DEF — and then count the number of cases with the function n().

covid_data <- covid19 %>%%
   filter(ENTITY_RES == 9, DEATH_DATE_DEF <= max(acts$death_date)) %>%
   group_by(DEF_DATE_DEF) %>%%
   summarise(DEF_T = n())

The next part consists of using the CDMX’s government database to obtain the number of death certificates issued each year on the same dates. Specifically, we filter out those dates between February 28th — date of the first confirmed case of COVID-19 in Mexico in 2020 — and the most recent date for which there is a record during the current year 2020. Since this database includes records beginning January 1, 2017, the desired information is available for the years 2017, 2018, 2019 and 2020. Finally, as in the previous part, all these records are grouped according to the date — fecha_defuncion — and then the number of records is counted with the function n().

actas_data <- actas %>%
   filter(fecha_defuncion >= "2020-02-28" | 
            (fecha_defuncion >= "2019-02-28" & fecha_defuncion <= as.Date(max(actas$fecha_defuncion)) - years(1)) | 
            (fecha_defuncion >= "2018-02-28" & fecha_defuncion <= as.Date(max(actas$fecha_defuncion)) - years(2)) | 
            (fecha_defuncion >= "2017-02-28" & fecha_defuncion <= as.Date(max(actas$fecha_defuncion)) - years(3))) %>%
   group_by(fecha_defuncion) %>%
   summarise(DEF_T = n())

Additional data cleaning is done to prepare the data for the next step of the process, which is to create a new table containing the average number of daily death certificates issued between 2017 and 2019, the years without the COVID-19 factor. During the preparation of the data, a new variable, CATEGORY, was created, where the year of each record was indicated in order to standardize all the dates for 2020. This is done in order to facilitate data manipulation and the plotting of the final graph. Having done this, we can filter those records from the years without coronavirus and then group them by date and calculate the average for each day over the three years.

promedio_data <- actas_data %>%
   filter(CATEGORIA %in% c("2017", "2018", "2019")) %>%
   group_by(FECHA) %>%
   summarise(DEF_T = mean(DEF_T)) %>%
   mutate(CATEGORIA = "PROM_3")

At this point in the process, we have the number of COVID-19 cases registered as deceased and the daily average number of death certificates issued for the years 2017–2019. The final step before plotting the graph is to obtain a new variable: the number of daily deaths that there should be in CDMX if only the number of official COVID-19 deaths were added to the average of past years. That is, this new variable should closely follow the number of certificates issued during 2020. The likely difference between these variables is defined as the excess deaths that are explained by an underreporting of the number of deaths caused by the coronavirus. After creating this new table, the official daily number of death certificates issued during 2020 is filtered and finally all these variables are added to the same table.

covidmean_data <- mean_data %>%
   inner_join(covid_data, by = "FECHA") %>%
   summarise(FECHA, DEF_T = DEF_T.x + DEF_T.y, CATEGORIA = "COVID + PROM")
 
 actas_data <- filter(actas_data, CATEGORIA == "2020")
 
 plot_data <- bind_rows(actas_data, mean_data, covidmean_data)

The Result

The graph shows three parallel time series. One tracks the reported daily number of death certificates during 2020, colored red. Another represents the aggregate average number of daily death certificates between 2017, 2018 and 2019, in gray. A final series adds to the one described above, the daily number of COVID-19 deaths reported for that same day in CDMX.

Red: Daily death certificates issued for 2020| Gray: Average 2017–2019 | Yellow: Average 2017–2019 + COVID-19 deaths reported in CDMX

To see more stuff like this, check out my GitHub profile. Be good!