Methodology

What data source did we use?

The news articles about South Africa water-related issues are extracted from two datasets from the GDELT project. GDELT maintains a database of the world's news media in different languages.

  1. The articles data are extracted from the GDELT DOC 2.0 API. This dataset includes basic information about each news article, such as its URL, author, title, and publication date.
  2. The knowledge graph data are extracted from the GDELT Global Knowledge Graph 2.0. The database is updated daily. The knowledge graph connects every person, organization, location, count, theme, news source, and event across the planet into a single network. We use the location and theme data from this dataset.

We did both extractions in Python: the newsfeed library for the knowledge graph data and the gdelt-doc-api library for the news articles data.
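
For readers who want to see what a DOC 2.0 API request looks like, here is a minimal sketch that only builds the request URL with the standard library and sends nothing. It is not the gdelt-doc-api call we ran. The search term "water" and the sourcecountry:southafrica operator are illustrative assumptions, and build_doc_query_url is a hypothetical helper name.

```python
from datetime import datetime
from urllib.parse import urlencode

# Public endpoint of the GDELT DOC 2.0 API.
DOC_API_ENDPOINT = "https://api.gdeltproject.org/api/v2/doc/doc"


def build_doc_query_url(query, start, end, max_records=250):
    """Return a DOC 2.0 article-list URL. No request is sent."""
    params = {
        "query": query,
        "mode": "artlist",   # return a list of matching articles
        "format": "json",
        # The API expects timestamps as YYYYMMDDHHMMSS.
        "startdatetime": start.strftime("%Y%m%d%H%M%S"),
        "enddatetime": end.strftime("%Y%m%d%H%M%S"),
        "maxrecords": max_records,
    }
    return f"{DOC_API_ENDPOINT}?{urlencode(params)}"


# Assumed query: the real search criteria are not reproduced here.
url = build_doc_query_url(
    "water sourcecountry:southafrica",
    datetime(2017, 1, 1),
    datetime(2022, 4, 14, 23, 59, 59),
)
print(url)
```

The API caps the number of records returned per request, so covering several years would mean issuing many requests over shorter date windows.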

How did we extract the data from the data source?

DOC 2.0 API data (Articles data)

Extraction

We extracted DOC 2.0 API data that satisfy the following two criteria:

Result

The news articles data extracted from the GDELT DOC 2.0 API for the period from Jan 1, 2017 to Apr 14, 2022 include the following information:

Knowledge Graph data

Extraction

We extracted knowledge graph data that satisfy the following two criteria:

Result

The data extracted from the GDELT knowledge graph dataset include the following information:

We didn’t use the person names and organization names in the current data repository. Because organization names usually consist of multiple words, searching on them would require a different methodology, which we haven’t worked out yet.

Merged Articles and Knowledge Graph Data

Merging the DOC 2.0 data with the knowledge graph data on the article URL yielded about 26,000 articles that carry all of the information mentioned above from both datasets. Articles are deduplicated based on title.
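
The merge and deduplication can be sketched as follows. This is an illustration under assumptions, not our production code: each dataset is taken to be a list of dictionaries, the field names url, title, themes and locations are hypothetical, and normalizing titles by case and surrounding whitespace before comparing them is an assumption of the sketch.

```python
def merge_on_url(doc_rows, gkg_rows):
    """Join two lists of dicts on 'url', then keep the first article per title."""
    gkg_by_url = {row["url"]: row for row in gkg_rows}
    merged = []
    seen_titles = set()
    for row in doc_rows:
        match = gkg_by_url.get(row["url"])
        if match is None:
            continue  # article is missing from the knowledge graph data
        title_key = row["title"].strip().lower()  # assumed normalization
        if title_key in seen_titles:
            continue  # duplicate title, keep only the first occurrence
        seen_titles.add(title_key)
        merged.append({**match, **row})
    return merged


# Made-up sample rows for illustration only.
doc_rows = [
    {"url": "https://example.org/a", "title": "Dam levels drop"},
    {"url": "https://example.org/b", "title": "Dam levels drop "},
    {"url": "https://example.org/c", "title": "Budget speech"},
]
gkg_rows = [
    {"url": "https://example.org/a", "themes": ["WATER_SECURITY"], "locations": ["South Africa"]},
    {"url": "https://example.org/b", "themes": ["WATER_SECURITY"], "locations": ["South Africa"]},
]

merged = merge_on_url(doc_rows, gkg_rows)
print(len(merged))
```

Only articles present in both datasets survive the join, which is why the merged collection is smaller than either source.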

Data Processing and Filtering

Filter by content

We found many articles coded with water-related themes in the GDELT database that were not focused on water sanitation, water supply or water security. Such an article might contain relevant terms without being about water issues. We needed to filter out these irrelevant articles, but the volume of articles made manual review impractical. Instead, we obtained further information for these articles and then filtered them with OpenAI.

Extracting article summaries and keywords

We used the Full-text downloader (built with newspaper3k and Wayback Machine) from the newsfeed library to extract more detailed information for each article.

This information includes the following elements:

Letting OpenAI decide whether this is an article of our interest

We categorized the articles from the two datasets using the criteria listed below. Then we gave the classification task to OpenAI:

OpenAI Methodology Diagram

OpenAI charges by word fragments that it calls “tokens.” For articles without the word “water” in their summaries, using only the title for the first round of classification helped reduce the cost.
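
The cost-saving routing for the first round can be sketched like this. The sketch only shows how the input text is chosen. The classify argument is a hypothetical stand-in for the actual OpenAI request, the field names summary and title are assumptions, and matching “water” case-insensitively is also an assumption.

```python
def first_round_text(article):
    """Pick the text sent in the first classification round."""
    summary = article.get("summary") or ""
    if "water" in summary.lower():
        return summary       # the summary mentions water: send it
    return article["title"]  # otherwise send only the shorter title


def classify_first_round(article, classify):
    """Run a caller-supplied classifier on the selected text."""
    return classify(first_round_text(article))


# Stub classifier used only so the sketch runs without any API call.
def stub_classify(text):
    return "water" in text.lower()


with_water = {"title": "City faces shortage", "summary": "Water supply was cut for days."}
without_water = {"title": "Council approves budget", "summary": "The vote passed narrowly."}

print(first_round_text(with_water))
print(first_round_text(without_water))
```

Titles are much shorter than summaries, so every article routed to the title path costs fewer tokens in that round.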


Filter by location

Although all of the articles were sourced from South Africa, the collection included reports from all over the world. Since we are only interested in news articles about Africa, we needed to keep only those articles and filter out the rest.

Because most articles have multiple related locations, we needed to decide which country each article is about. We used the most frequently mentioned location in the Location column to determine this and created a country label for each article accordingly. Where several locations are tied for the most mentions, we put “Multiple” as the country label.

This narrowed the collection of stories to about 5,000 water-related articles focused on African countries.
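
The country-labeling rule described above can be sketched as follows. It assumes the Location column has already been parsed into a list with one country name per mention. The function name country_label and the choice to return None for an article with no locations are assumptions of this sketch.

```python
from collections import Counter


def country_label(locations):
    """Return the most-mentioned country, or "Multiple" on a tie for first place."""
    counts = Counter(locations)
    if not counts:
        return None  # assumed handling for articles with no location data
    ranked = counts.most_common()
    top_country, top_count = ranked[0]
    # A tie exists when the runner-up has as many mentions as the leader.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return "Multiple"
    return top_country


print(country_label(["Kenya", "South Africa", "Kenya"]))
print(country_label(["Kenya", "South Africa"]))
```

An article would then be kept when its label is an African country. How articles labeled “Multiple” are treated at that step is not covered by this sketch.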

Adding OpenUp dataset

We added a dataset of 800 water-related articles in Africa that OpenUp had collected, which included keywords and country. This brought the total to about 6,000 articles.

Creating keywords for summary index

Creating Article tags based on the themes

We used this list of 10 water-related themes as tags for each article.

  1. Water Sanitation
  2. Water Security
  3. Water Supply
  4. Wastewater
  5. Water Treatment
  6. Natural Disaster
  7. Urban Water
  8. Waterborne Diseases
  9. Rural Water
  10. Water Pricing