Extracting Text from Wikipedia



[ 2008-April-13 16:50 ]

One of the greatest things about Wikipedia is that it is a completely open project. The software used to run it is open source, and the data is freely available. For a natural language processing course, I processed some text from Wikipedia. It was considerably harder than I expected. One of the biggest problems is that there is no well-defined parser for the wiki text that is used to write the articles. The parser is a mess of regular expressions, and users frequently add fragments of arbitrary HTML. Here is how I managed to wade through this and get something useful out the other end, including the software and the resulting data.

I only wanted a subset of Wikipedia, since the entire thing is too much data. I chose to extract the articles that are part of the Wikipedia "release version" project. This project is trying to identify the articles that are good enough to be included in Wikipedia "releases," such as the Wikipedia Selection for Schools.

My code is available under a BSD licence. The data is taken from Wikipedia, and is covered by Wikipedia's licence (the GFDL).

How to extract text from Wikipedia:

  1. Get the Wikipedia articles dump [direct link to English Wikipedia]. It is about 3 GB compressed with bzip2, and about 16 GB uncompressed.
  2. Get the list of "best" articles. I used the following shell command:
    for i in `seq 1 7`; do
        wget "http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team"\
    "/Release_Version_articles_by_quality/$i"
    done
    
  3. Extract the list of titles using extracttop.py:
    ./extracttop.py toparticles/* | sort > top.txt
  4. Use MWDumper to filter only the articles you care about. The version in SVN is considerably newer, but the prebuilt version works fine. Warning: It takes 28 minutes for my 3.8GHz P4 Xeon to decompress and filter the entire English Wikipedia pages dump. It produced 127 MB of output for 2722 articles.
    time bzcat enwiki-20080312-pages-articles.xml.bz2 \
        | java -server -jar mwdumper.jar --format=xml --filter=exactlist:top.txt \
                --filter=latest --filter=notalk \
        > pages.xml
  5. Use xmldump2files.py to split the filtered XML dump into individual files (a rough sketch of this step appears after the list). This only takes about 2 minutes.
    ./xmldump2files.py pages.xml files_directory
  6. Use wiki2xml_command.php to parse the wiki text to XML. This can lead to segmentation faults or infinite loops when regular expressions go wrong. It doesn't always output valid XML since it passes a lot of the text through directly. This took 90 minutes on my machine.
    ./wiki2xml_all.sh files_directory
  7. Use wikiextract.py to extract plain text from all the articles. It uses BeautifulSoup to parse the so-called "XML" output, then my code attempts to extract just the body text of the article, ignoring headers, images, tables, lists, and other formatting (see the second sketch after the list). This took 24 minutes to execute.
    ./wikiextract.py files_directory wikitext.txt
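
For the curious, step 5 boils down to walking the dump and writing one file per <page> element. The sketch below is not xmldump2files.py itself, just a minimal illustration: the element names come from the MediaWiki export format, but the namespace handling and the output file naming are simplifications made up for this example.

    import os
    import sys
    import xml.etree.cElementTree as ElementTree

    def local_name(tag):
        # ElementTree reports tags as '{http://www.mediawiki.org/xml/export-...}page';
        # strip the namespace so we can match on the bare element name.
        return tag.rsplit('}', 1)[-1]

    def split_dump(dump_path, out_dir):
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        # iterparse streams the dump, so the whole file is never held in memory.
        for event, elem in ElementTree.iterparse(dump_path):
            if local_name(elem.tag) != 'page':
                continue
            title, text = None, ''
            for child in elem:
                tag = local_name(child.tag)
                if tag == 'title':
                    title = child.text
                elif tag == 'revision':
                    for sub in child:
                        if local_name(sub.tag) == 'text':
                            text = sub.text or ''
            if title:
                # Crude filesystem-safe name, for illustration only.
                safe = title.replace('/', '_').replace(' ', '_')
                out = open(os.path.join(out_dir, safe + '.wiki'), 'w')
                out.write(text.encode('utf-8'))
                out.close()
            elem.clear()  # discard the page once written to keep memory bounded

    if __name__ == '__main__':
        split_dump(sys.argv[1], sys.argv[2])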
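
Step 7 is conceptually just as blunt: delete the structural elements and keep whatever text is left. Again, this sketch is not wikiextract.py; the tag names in SKIP_TAGS are guesses at what the wiki2xml output contains rather than its real schema.

    import sys
    from BeautifulSoup import BeautifulSoup

    # Guessed element names for the structure we want to throw away.
    SKIP_TAGS = ['heading', 'table', 'list', 'image', 'gallery', 'template']

    def extract_body_text(path):
        # BeautifulSoup is forgiving, which matters because the "XML" is often not well formed.
        soup = BeautifulSoup(open(path).read())
        # Remove the unwanted structure, then collect the text nodes that remain.
        for name in SKIP_TAGS:
            for tag in soup.findAll(name):
                tag.extract()
        text = u' '.join(soup.findAll(text=True))
        return u' '.join(text.split())  # collapse runs of whitespace

    if __name__ == '__main__':
        for path in sys.argv[1:]:
            sys.stdout.write(extract_body_text(path).encode('utf-8') + '\n')

Deleting the unwanted elements and keeping the rest is easier than trying to whitelist body paragraphs, because the output is too irregular to rely on any fixed structure.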

The worst part of this process is parsing the wiki text. The best solution would be to use the real parser from MediaWiki, but that seemed like more work. Many people have attempted to write Wikipedia parsers. FlexBisonParse is an abandoned attempt to build a "real" parser in C, and it also fails to parse many articles. A more recent project is mwlib, which PediaPress uses to convert Wikipedia articles to PDF. I should have tried it, but I didn't. There is also a recent effort to create a new parser: a mailing list has been created for it, and some documentation has been written. Changing the parser Wikipedia uses will be an enormous amount of work, but it would be very valuable for anyone who wants to extract data from this resource. I hope they see it through.