One of the greatest things about Wikipedia is that it is a completely open project. The software used to run it is open source, and the data is freely available. For a natural language processing course, I processed some text from Wikipedia. It was considerably harder than I expected. One of the biggest problems is that there is no well-defined parser for the wiki text that is used to write the articles. The parser is a mess of regular expressions, and users frequently add fragments of arbitrary HTML. Here is how I managed to wade through this and get something useful out the other end, including the software and the resulting data.
I only wanted a subset of Wikipedia, since the entire thing is too much data. I chose to extract the articles that are part of the Wikipedia "release version" project. This project is trying to identify the articles that are good enough to be included in Wikipedia "releases," such as the Wikipedia Selection for Schools.
My code is available under a BSD licence. The data is taken from Wikipedia, and is covered by Wikipedia's licence (the GFDL).
wikipedia2text.tar.bz2
wikipedia2text-toparticles.xml.bz2 (35 MB compressed; 127 MB uncompressed)
wikipedia2text-toparticles.tar.bz2 (34 MB compressed; 200 MB uncompressed)
wikipedia2text-extracted.txt.bz2 (18 MB compressed; 63 MB uncompressed; 10 million words)

How to extract text from Wikipedia:
Download the lists of release version articles, one page per quality level:

for i in `seq 1 7`; do
    wget -P toparticles "http://en.wikipedia.org/wiki/Wikipedia:Version_1.0_Editorial_Team/Release_Version_articles_by_quality/$i"
done
Run extracttop.py to extract the article titles from the downloaded lists:
./extracttop.py toparticles/* | sort > top.txt
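For a sense of what this step involves, here is a minimal sketch of that kind of title extraction. It is not the real extracttop.py: the link heuristics and output format are assumptions about how the quality-list pages can be scraped.

# Hypothetical sketch of the title-extraction step (not the real
# extracttop.py): pull article titles out of the downloaded
# Release_Version_articles_by_quality pages.
import sys
import urllib.parse
from bs4 import BeautifulSoup

def titles_from_page(path):
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for link in soup.find_all("a", href=True):
        href = link["href"]
        # Article links look like /wiki/Title; skip Wikipedia:, Talk:, etc.
        if href.startswith("/wiki/") and ":" not in href[6:]:
            yield urllib.parse.unquote(href[6:]).replace("_", " ")

if __name__ == "__main__":
    seen = set()
    for path in sys.argv[1:]:
        for title in titles_from_page(path):
            if title not in seen:
                seen.add(title)
                print(title)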
Use mwdumper to filter the full Wikipedia dump down to just those articles:

time bzcat enwiki-20080312-pages-articles.xml.bz2 \
    | java -server -jar mwdumper.jar --format=xml --filter=exactlist:top.txt \
        --filter=latest --filter=notalk \
    > pages.xml
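As a quick sanity check (my addition, not part of the original pipeline), you can stream through pages.xml and confirm that every title is in top.txt. This assumes top.txt holds one title per line in the same form as the dump's <title> elements.

# Sanity check: verify every <title> in the filtered dump is in top.txt.
import xml.etree.ElementTree as ET

wanted = set(line.rstrip("\n") for line in open("top.txt", encoding="utf-8"))
pages = missing = 0
for event, elem in ET.iterparse("pages.xml"):
    tag = elem.tag.rsplit("}", 1)[-1]  # strip the MediaWiki export namespace
    if tag == "title":
        pages += 1
        if elem.text not in wanted:
            missing += 1
    elif tag == "page":
        elem.clear()  # free finished pages to keep memory bounded
print(pages, "pages;", missing, "titles not in top.txt")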
Run xmldump2files.py to split the filtered XML dump into individual files. This only takes about 2 minutes:
./xmldump2files.py pages.xml files_directory
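The splitting step is straightforward to approximate. The sketch below is not the real xmldump2files.py, which chooses its own file naming and directory layout; it just streams the dump and writes one file of wiki text per article.

# Hypothetical sketch of the splitting step: one file of wiki text per page.
import hashlib
import os
import sys
import xml.etree.ElementTree as ET

def split_dump(dump_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    title = None
    for event, elem in ET.iterparse(dump_path):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip the export namespace
        if tag == "title":
            title = elem.text
        elif tag == "text":
            # Hash the title so odd characters can't break the filename.
            name = hashlib.md5(title.encode("utf-8")).hexdigest() + ".txt"
            with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
                f.write(elem.text or "")
        elif tag == "page":
            elem.clear()  # keep memory bounded on large dumps

if __name__ == "__main__":
    split_dump(sys.argv[1], sys.argv[2])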
Run wiki2xml_command.php to parse the wiki text to XML. This can lead to segmentation faults or infinite loops when regular expressions go wrong, and it doesn't always output valid XML, since it passes a lot of the text through directly. This took 90 minutes on my machine:
./wiki2xml_all.sh files_directory
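Because the parser can segfault or loop forever, it helps to run it with a per-article time limit. The following is a hypothetical Python driver, not the author's wiki2xml_all.sh; in particular, invoking wiki2xml_command.php over stdin/stdout and the *.txt naming are assumptions.

# Hypothetical driver with a timeout guard around the PHP parser.
# Assumption: wiki2xml_command.php reads wiki text on stdin and writes
# XML on stdout; the real script's interface may differ.
import glob
import subprocess
import sys

def parse_all(files_directory, timeout_seconds=60):
    for path in sorted(glob.glob(files_directory + "/*.txt")):
        with open(path, "rb") as src, open(path + ".xml", "wb") as dst:
            try:
                subprocess.run(["php", "wiki2xml_command.php"],
                               stdin=src, stdout=dst,
                               timeout=timeout_seconds, check=True)
            except subprocess.TimeoutExpired:
                print("timed out:", path, file=sys.stderr)
            except subprocess.CalledProcessError:
                print("parser failed on:", path, file=sys.stderr)

if __name__ == "__main__":
    parse_all(sys.argv[1])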
Run wikiextract.py to extract plain text from all the articles. It uses BeautifulSoup to parse the so-called "XML" output, then attempts to extract just the body text of each article, ignoring headers, images, tables, lists, and other formatting. This took 24 minutes:
./wikiextract.py files_directory wikitext.txt
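To give a flavour of this step, here is a minimal sketch of the extraction. The tag names (paragraph, heading, table, and so on) are assumptions about the parser's output; the real wikiextract.py has more elaborate rules for deciding what counts as body text.

# Hypothetical sketch of the body-text extraction step.
from bs4 import BeautifulSoup

SKIP_TAGS = ["heading", "table", "list", "gallery", "image", "template"]

def body_text(xml_string):
    # html.parser tolerates the not-quite-valid XML the parser emits.
    soup = BeautifulSoup(xml_string, "html.parser")
    for tag in soup.find_all(SKIP_TAGS):
        tag.decompose()  # drop headings, tables, lists, images, templates
    paragraphs = []
    for para in soup.find_all("paragraph"):
        text = " ".join(para.get_text().split())
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)

if __name__ == "__main__":
    import sys
    with open(sys.argv[1], encoding="utf-8") as f:
        print(body_text(f.read()))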
The worst part of this process is parsing the wiki text. The ideal approach would be to use MediaWiki's real parser, but that seemed like more work. Many people have attempted to write Wikipedia parsers. FlexBisonParse is an abandoned attempt to build a "real" parser in C, and it also fails to parse many articles. A more recent project is mwlib, which PediaPress uses to convert Wikipedia articles to PDF; I should have tried it, but I didn't. There is also a recent development effort to create a new parser, with a mailing list and some documentation. Replacing Wikipedia's parser will be an enormous amount of work, but it would be a huge help to anyone who wants to extract data from this valuable resource. I hope they see it through.