New Measures of the COVID-19 Pandemic: A New Time-Series Dataset

The multitude of papers exploring the effects of the COVID-19 pandemic over the last 12 months has motivated us to develop new, alternative measures of COVID-19. One limitation of current research has been the lack of robustness in quantifying the effects of the pandemic. We use a novel approach, keyword searches of articles in popular newspapers, to construct alternative proxies for the pandemic. We thus construct six indices relating to the COVID-19 pandemic: a COVID index, a medical index, a vaccine index, a travel index, an uncertainty index, and an aggregate COVID-19 sentiment index.


I. Introduction
The literature on the COVID-19 pandemic from a business and economics point of view has grown at an unprecedented rate, with over 100 papers on this subject in business and economics journals (for surveys of the literature, see Padhan & Prabheesh, 2021). In this note, we develop different measures of COVID-19 using data from all articles on the COVID-19 pandemic in the 45 most popular newspapers worldwide. We propose a total of 327 keywords that relate to different aspects of COVID-19. These keywords are exhaustive and thus representative of the pandemic, allowing us to construct its different dimensions. Specifically, we develop indices for travel, vaccine, medical, COVID-19, and uncertainty. We add to these five indices an aggregate index (containing all 327 words) that we refer to as the pandemic sentiment index. These six indices capture all COVID-19-related matters, are unique, and differ from what has been predominantly used in the literature. To see this, consider the proxies used in the literature. The most popular measures of COVID-19 have been the number of deaths and the number of virus cases (Gurrib et al., 2021; Haroon & Rizvi, 2020). Other COVID-19 measures include the Google-based COVID-19 fear sentiment and investor attention to the pandemic (Chen et al., 2020), a global fear index that combines indices of COVID-19 cases and deaths (Salisu & Akanni, 2020), an index of uncertainty due to pandemics and epidemics (Salisu & Sikiru, 2020), an accounting index reflecting the periods before and after the COVID-19 outbreak (He et al., 2020), and government response indices, such as the COVID-19 government response stringency, containment and health, and economic support indices (Chang et al., 2021).
The literature has not considered more specific news-based measures, such as those that relate to the development of vaccines (our vaccine index), medical progress (our medical index), or travel (our travel index). Researchers have used COVID measures such as the numbers of deaths and infections from COVID-19 and governmental responses to the virus. In this regard, our COVID index is similar in spirit but differs sharply in how it is measured (see Section II for details). Our set of keywords provides a COVID-19 measure that goes far beyond numbers of virus cases and deaths, and we therefore consider it a more comprehensive measure of COVID-19. Similarly, while researchers have used measures of COVID-19-related uncertainty (e.g., Ahir et al., 2019; Chen et al., 2020; Salisu & Akanni, 2020; Salisu & Sikiru, 2020), these capture only a narrow definition of uncertainty. Our measure, because it covers keywords depicting uncertainty, is broader and, at the very least, offers an alternative proxy for uncertainty due to the pandemic. Finally, our aggregate measure of COVID-19 (A_COVID), which incorporates all 327 words, is a broad representation of pandemic sentiment, both positive and negative, and it stands out as an overall measure of pandemic sentiment.
Our contribution to the literature consists of our alternative measures of COVID-19. We do not obtain these by counting the numbers of deaths and virus cases, as has predominantly been done, but by searching for keywords in newspapers that best represent the ramifications of the COVID-19 pandemic. In addition, we offer variants of the COVID-19 proxies that capture events related to vaccines, medical progress, travel, and uncertainty. We provide a total of six indices representing different aspects of the pandemic, none of which has been considered to date, at least from a business and economics point of view. This variety of indices offers researchers an opportunity to (a) conduct robustness tests on previous findings in the literature and (b) determine the effects of specific COVID-19-related developments on financial and economic systems. Our proposed dataset thus provides a platform for additional research.

II. Data and index construction
A. Data collection
Our data collection begins with the identification of keywords that relate to the COVID-19 pandemic. The search is deliberately broad, so that the dictionary captures as many aspects of the pandemic as possible. We thus create a dictionary of 327 words, which are listed in Table 1 (Panel A). This is the first step.
The second step is to consider the newspapers from which to extract COVID-19-related articles. From Narayan's (2019) list of 100 major sources of newspapers globally, we select 41 international newspapers, which represent the bulk of the publications, and add another four premier publications (The Washington Post, Los Angeles Times, USA Today, and Chicago Tribune) for our data. These newspapers are available from the ProQuest database, our main data source for newspapers.

B. Index construction
For each of the 45 newspapers, we retrieve daily news articles published between December 31, 2019, and April 28, 2021, from the ProQuest database. The start date is motivated by the discovery of the COVID-19 virus, while the end date is based on the time this paper was concluded. Using the ProQuest TDM Python algorithm, we count the number of times each word in our dictionary appears in these publications' daily articles. For each type of data (i.e., COVID, medical, vaccine, travel, uncertainty, and aggregate), say, T, we sum the daily word count, as detailed in Table 2.
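The counting and summing step can be illustrated with a minimal Python sketch. It is not the ProQuest TDM algorithm itself. The two-category dictionary, the sample articles, and the function names `count_keyword` and `daily_totals` are hypothetical, and whole-word, case-insensitive matching is an assumption rather than a rule stated in this note.

```python
import re
from collections import defaultdict

# Hypothetical mini-dictionary for illustration only; the actual
# 327 keywords are those listed in Table 1 (Panel A).
KEYWORDS = {
    "vaccine": ["vaccine", "vaccination"],
    "travel": ["travel ban", "border closure"],
}

def count_keyword(text, keyword):
    """Count whole-word, case-insensitive occurrences of keyword in text."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def daily_totals(articles, keywords=KEYWORDS):
    """Sum keyword counts per category and day.

    articles: iterable of (date_string, article_text) pairs.
    Returns {category: {date_string: total_count}}.
    """
    totals = {category: defaultdict(int) for category in keywords}
    for date, text in articles:
        for category, words in keywords.items():
            totals[category][date] += sum(count_keyword(text, w) for w in words)
    return totals

# Made-up sample articles (assumption, not data from the paper).
articles = [
    ("2020-12-08", "The first vaccine doses arrived. Vaccination begins today."),
    ("2020-12-08", "A new travel ban was announced."),
    ("2020-12-09", "Vaccine supply remains limited; vaccine trials continue."),
]
totals = daily_totals(articles)
print(dict(totals["vaccine"]))
print(dict(totals["travel"]))
```

Each category's daily total plays the role of T for that category. An aggregate series would apply the same function to the full keyword list.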
In the next step, we run a heteroskedasticity-consistent ordinary least squares (OLS) regression of T on day-of-the-week dummy variables. We exclude the Wednesday dummy to avoid the dummy variable trap. The constant and residuals from the OLS model are then added together to give the data adjusted for day-of-the-week effects. We thus obtain the adjusted series for each T, where t denotes time. In the third step, we compute the index from the adjusted series, as given in Equation (1).
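The day-of-the-week adjustment can be sketched in a few lines of Python using NumPy. This is a minimal illustration under stated assumptions, not the authors' code. The function name `adjust_day_of_week`, the weekday coding (0 = Monday, so Wednesday is 2), and the sample counts are hypothetical. The heteroskedasticity-consistent correction changes only the standard errors, not the coefficient estimates, so plain least squares yields the same constant and residuals.

```python
import numpy as np

def adjust_day_of_week(values, weekdays, base_day=2):
    """Remove day-of-the-week effects from a daily series.

    values:   daily word counts (the series T).
    weekdays: integers 0=Monday ... 6=Sunday, one per observation.
    base_day: the omitted dummy (2 = Wednesday), which avoids the
              dummy variable trap.
    Returns the regression constant plus the residuals.
    """
    y = np.asarray(values, dtype=float)
    d = np.asarray(weekdays)
    included = [k for k in range(7) if k != base_day]
    # Design matrix: a constant plus six weekday dummies.
    X = np.column_stack(
        [np.ones(len(y))] + [(d == k).astype(float) for k in included]
    )
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return beta[0] + residuals

# Made-up example: three weeks of counts with a weekday pattern and a trend.
weekdays = np.arange(21) % 7
values = 100.0 + 5.0 * weekdays + np.arange(21)
adjusted = adjust_day_of_week(values, weekdays)
```

Because the constant equals the Wednesday mean and the residuals average zero within each weekday, every weekday of the adjusted series has the same mean. Variation over time that is unrelated to the weekday is preserved.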

III. Dataset
The time-series indices based on Equation (1) are presented in Table 3. There are six indices, and the aggregate index includes all 327 keywords. An MS Excel version of the dataset is available from the authors.

IV. Concluding remarks
The objective of this note is to present a new dataset for measures of COVID-19. Motivated by the large volume of empirical studies on the COVID-19 pandemic, we present researchers with multiple proxies for COVID-19, allowing future researchers to both test new hypotheses and confirm the robustness of existing empirical conclusions. We intend to update the dataset quarterly and post the updated data on various platforms, including the journal's webpage, the authors' homepages, and ResearchGate.