An Experiment in Crowdscraping

A look back at creating "A map of terrorist attacks, according to Wikipedia"

From July 2016 until June 2020, the StoryMaps team published an ongoing worldwide map of terrorist attacks. The map showed cumulative activity over time, filterable by date, location and perpetrator group. It was sourced from information curated by a Wikipedia discussion community. The crowdsourcing aspect of this project made the map unique and allowed for an up-to-the-hour view of activity. It also, as we were quick to point out, meant that the dataset was less than fully authoritative.  

Still, we thought the experiment was worthwhile. If you took the map with the appropriately sized grain of salt and were guarded about the conclusions you drew, it provided a good first draft of events.

Alas, in recent months, the editing activity on these Wikipedia pages has become very irregular, and we have lost confidence in the integrity of the data (even taking into account the disclaimers above). Therefore, we have decided to discontinue this map.

Over its four-year run, this project generated quite a bit of interest. At its best, it also offered a lot of information. And for us, the challenge of getting the data from the web pages onto a map provided a lot of lessons. If you’re interested in what we’ve learned, please read on.

Crowdsourcing from Wikipedia: A Mixed Bag...

Most high school students learn that Wikipedia, while helpful, is not to be relied on as a definitive information source. We certainly won’t dispute that. Still, when a Wikipedia forum is well moderated and firing on all cylinders, it gets a lot right. Here are some observations gleaned from monitoring the pages and discussions over the past few years.

  • Wikipedia contributors can be very diligent. The users we encountered were meticulous in revising details and numbers as the picture became clearer. They also fixed each other’s typos and refined grammar.
  • Contributors challenged information and enforced sourcing.
  • No table row, no matter how far in the past, was considered frozen; editors often revisited pages from a year or more in the past to make corrections.
  • Though the terrorism forum was conducted in English, it was evident that many of the contributors were not native speakers – an indication that the user community reflected international perspectives.
  • Sometimes the wording of an entry hinted strongly at a political agenda. This was not often the case, but when it did occur, the community eventually smoothed out the rough edges – often by requiring citations or flagging biased language.

Considerable list of citations involved in one month of terrorism reporting on Wikipedia

"Crowdscraping"

The term “Crowdsourcing” is a good general term for when a community provides information to a data repository. However, in the case of this project, it doesn’t quite fit, because the “crowd” wasn’t contributing directly to our map. Instead, the crowd was pouring its information into Wikipedia pages, which we in turn scraped and turned into a database. So, the term “Crowdscraping” seems more apropos.

One of the series of tables to be scraped

Node.js

The most fundamental challenge in creating our map was how to derive a database from a series of pages. Furthermore, the workflow would need to be automated, which meant choosing a programming environment.

We chose to use Node.js, because it allowed us to leverage our familiarity with JavaScript. Node.js is an environment that allows JavaScript to run in contexts other than the browser. Often, Node.js is used to create web server applications. However, we just needed to run a utility process at regular intervals, and Node.js is good for that too.

There’s also a vast assortment of third-party libraries available for Node.js. For example, it was easy to find a library that downloads a web page and parses it for specific elements. Another library enables the program to write data out in CSV format. These libraries range in impact from time saver (I’m so glad I don’t have to reinvent that wheel) to game changer (I’d never know how to invent that wheel).

Other Technical Challenges

Here’s a list of other challenges peculiar to this project:

A Plethora of Pages

The Wikipedia project was organized into multiple pages, each page representing a month. For 2016 alone, for example, there were twelve different pages to be scraped. As each month arrived, another page came online. By the end of the project, we were scraping more than 45 pages!

A plethora of pages
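
Since the page names follow a predictable pattern (List_of_terrorist_incidents_in_<Month>_<Year>, as the example in the workflow section shows), building the full list of URLs to scrape is a small piece of code. Here’s a minimal sketch; the January 2016 starting point is illustrative:

    // Build the list of monthly Wikipedia pages to scrape. Each run regenerates
    // the full list, since every page (not just the current month) gets re-scraped.
    const MONTHS = ['January', 'February', 'March', 'April', 'May', 'June',
                    'July', 'August', 'September', 'October', 'November', 'December'];

    function buildPageUrls(startYear, startMonth) {
      const urls = [];
      const now = new Date();
      for (let year = startYear; year <= now.getFullYear(); year++) {
        const first = (year === startYear) ? startMonth : 0;
        const last = (year === now.getFullYear()) ? now.getMonth() : 11;
        for (let month = first; month <= last; month++) {
          urls.push(
            `https://en.wikipedia.org/wiki/List_of_terrorist_incidents_in_${MONTHS[month]}_${year}`
          );
        }
      }
      return urls;
    }

    // e.g. every month from January 2016 (month index 0) through the current month
    const pageUrls = buildPageUrls(2016, 0);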

Consolidation

The goal is to get this data into a form that can efficiently be ingested by the viewer client. For a variety of reasons, it's easier for the client to query a single database table than a more complicated data structure that would store each page's info separately. This means that our process, in addition to scraping each page, needs to consolidate the information from the myriad pages into a single table.

Everything Changes

As mentioned before, each Wikipedia page remains a work in progress – revisions can be made at any time to any month. That means that we can’t just scrape a page once and leave the information alone. Each time we run the process, ALL pages must be re-scraped and the database updated.

Deep Scraping for Coordinates

For mapping purposes, an obviously important detail of each table row is the location where the attack took place. Conveniently, each row indeed contains a location field. Not so conveniently, the location field points to a separate location page. Getting the coordinates for a location therefore requires a second level of scraping.

Each row in the terrorist attack table points to a location page.

Workflow

With the goal defined (scrape a series of pages and consolidate the resulting data into a single geocoded table) and the tool set chosen (Node.js and accompanying libraries), here is a high-level view of the workflow:

Scrape

Scrape the table data from each web page and write the output to a corresponding CSV (comma-separated values) file.

Page https://en.wikipedia.org/wiki/List_of_terrorist_incidents_in_January_2016 becomes file 01_2016.csv.

Libraries used:

  •  request-promise  - Handles the mechanics of downloading the Wikipedia page.
  •  cheerio  - An implementation of core jQuery for Node.js. Enables the program to parse specific parts of the page HTML.
  •  json2csv  - Writes the resulting data array to CSV.
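
Putting those three libraries together, a condensed sketch of the scrape step might look like the following. The table selector and the column order here are assumptions for illustration; the real pages had to be inspected to map each cell to the right field.

    const rp = require('request-promise');
    const cheerio = require('cheerio');
    const { parse } = require('json2csv');   // json2csv v4+ exposes parse()
    const fs = require('fs');

    // Scrape one monthly page and write its table rows to a CSV file.
    // The 'wikitable' selector and the cell positions are illustrative.
    async function scrapePage(url, outFile) {
      const html = await rp(url);              // download the page HTML
      const $ = cheerio.load(html);            // jQuery-style parsing
      const rows = [];

      $('table.wikitable tr').each((i, tr) => {
        const cells = $(tr).find('td');
        if (cells.length < 7) return;          // skip header and partial rows
        rows.push({
          date: $(cells[0]).text().trim(),
          type: $(cells[1]).text().trim(),
          dead: $(cells[2]).text().trim(),
          injured: $(cells[3]).text().trim(),
          location: $(cells[4]).text().trim(),
          // keep the link target for the deep-scraping (coordinates) step
          locationUrl: 'https://en.wikipedia.org' + ($(cells[4]).find('a').attr('href') || ''),
          details: $(cells[5]).text().trim(),
          perpetrator: $(cells[6]).text().trim()
        });
      });

      fs.writeFileSync(outFile, parse(rows));  // serialize the array as CSV
    }

    scrapePage(
      'https://en.wikipedia.org/wiki/List_of_terrorist_incidents_in_January_2016',
      '01_2016.csv'
    ).catch(console.error);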

Merge

Merge all of the CSVs resulting from the initial scrape into a single master CSV file.

Now all incidents are in a single CSV.

Libraries used:

  •  csv2json  - reads each monthly CSV back into a data array.
  •  json2csv  - writes the consolidated array out as master.csv.
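
A sketch of the merge step follows. One substitution worth flagging: instead of csv2json, the sketch reads each file back in with the similar csvtojson package, whose fromFile call is shown here. Either way, the idea is the same – gather every monthly file into one array and write it out as master.csv.

    const csvtojson = require('csvtojson');   // stand-in CSV reader for this sketch
    const { parse } = require('json2csv');
    const fs = require('fs');

    // Read every monthly CSV back into memory and write one consolidated table.
    // File names follow the 01_2016.csv pattern from the scrape step.
    async function mergeCsvFiles(monthlyFiles, outFile) {
      let allRows = [];
      for (const file of monthlyFiles) {
        const rows = await csvtojson().fromFile(file);   // array of objects keyed by header
        allRows = allRows.concat(rows);
      }
      fs.writeFileSync(outFile, parse(allRows));
    }

    mergeCsvFiles(['01_2016.csv', '02_2016.csv'], 'master.csv').catch(console.error);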

Create a Location Table

Many of the individual incidents in the master CSV share a common location page (e.g. Somalia or Aleppo). From the master CSV file, create a list of unique Wikipedia location page URLs. Then, using a scraping process similar to that in step 1, mine each of those URLs for location coordinates.

Libraries used:

  •  request-promise  - downloads each location page.
  •  cheerio  - parses the location page HTML for coordinate markup.
  •  csv2json  - reads master.csv to build the list of unique location URLs.
  •  json2csv  - writes the results out as location-table.csv.
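
The per-page coordinate lookup can be sketched as follows. It leans on the “geo” span that Wikipedia’s coordinate template usually emits (decimal latitude and longitude separated by a semicolon); that selector is an assumption, and pages without the template would need a fallback.

    const rp = require('request-promise');
    const cheerio = require('cheerio');

    // Given a Wikipedia location page, return { lat, lon } or null.
    // Assumes the page carries the coordinate template's "geo" span,
    // e.g. <span class="geo">2.0469; 45.3182</span>.
    async function scrapeCoordinates(locationUrl) {
      const html = await rp(locationUrl);
      const $ = cheerio.load(html);
      const geoText = $('span.geo').first().text();   // "lat; lon"
      if (!geoText) return null;
      const [lat, lon] = geoText.split(';').map(s => parseFloat(s.trim()));
      return (isNaN(lat) || isNaN(lon)) ? null : { lat, lon };
    }

    scrapeCoordinates('https://en.wikipedia.org/wiki/Somalia')
      .then(coords => console.log(coords))
      .catch(console.error);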

Join

Join the resulting location information from step 3 to the incidents in the master CSV file.

Libraries used:

  •  csv2json  - reads the input CSVs, master.csv and location-table.csv.
  •  json2csv  - exports the results of the join to the filesystem (as final.csv).
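
The join itself is just a lookup keyed on the location page URL. In the sketch below, the field names (locationUrl, lat, lon) are illustrative, and csvtojson again stands in as the CSV reader.

    const csvtojson = require('csvtojson');   // stand-in CSV reader for this sketch
    const { parse } = require('json2csv');
    const fs = require('fs');

    // Attach lat/lon from location-table.csv to each incident in master.csv,
    // matching on the location page URL.
    async function joinLocations() {
      const incidents = await csvtojson().fromFile('master.csv');
      const locations = await csvtojson().fromFile('location-table.csv');

      // Build a URL -> coordinates lookup table.
      const lookup = {};
      for (const loc of locations) {
        lookup[loc.locationUrl] = { lat: loc.lat, lon: loc.lon };
      }

      const joined = incidents.map(incident => ({
        ...incident,
        ...(lookup[incident.locationUrl] || { lat: '', lon: '' })
      }));

      fs.writeFileSync('final.csv', parse(joined));
    }

    joinLocations().catch(console.error);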

Publish as Feature Service

Finally, import the consolidated, geocoded CSV (final.csv) into a feature service.

Libraries used:

  •  csv2json  - reads final.csv.
  •  https  - part of core Node.js; facilitates the record-loading POST calls to the feature service.
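
Here’s a rough sketch of that last step. It targets the ArcGIS REST API’s addFeatures endpoint; the service URL and token are placeholders, and a production script would batch the uploads (and map many more attributes) rather than send everything in one request.

    const https = require('https');
    const csvtojson = require('csvtojson');   // stand-in CSV reader for this sketch

    // POST incident records to a hosted feature layer via the ArcGIS REST API.
    // SERVICE_URL and TOKEN are placeholders.
    const SERVICE_URL = 'https://services.arcgis.com/<org-id>/arcgis/rest/services/<service>/FeatureServer/0/addFeatures';
    const TOKEN = '<access-token>';

    async function publish() {
      const rows = await csvtojson().fromFile('final.csv');

      // Convert each CSV row into an ArcGIS feature (geometry + attributes).
      const features = rows.map(row => ({
        geometry: { x: parseFloat(row.lon), y: parseFloat(row.lat), spatialReference: { wkid: 4326 } },
        attributes: { date: row.date, location: row.location, perpetrator: row.perpetrator }
      }));

      const body = new URLSearchParams({
        f: 'json',
        token: TOKEN,
        features: JSON.stringify(features)
      }).toString();

      const req = https.request(SERVICE_URL, {
        method: 'POST',
        headers: {
          'Content-Type': 'application/x-www-form-urlencoded',
          'Content-Length': Buffer.byteLength(body)
        }
      }, res => {
        let data = '';
        res.on('data', chunk => (data += chunk));
        res.on('end', () => console.log(data));   // echo the addFeatures response
      });
      req.on('error', console.error);
      req.write(body);
      req.end();
    }

    publish().catch(console.error);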
