Wikipedia-related stuff
Introduction
This webpage contains a collection of scripts that are useful for extracting content from Wikipedia and other Wikimedia wikis.
Graph extraction
The following C++ program and Perl scripts can be used to extract the graph of Wikipedia from an XML dump of the database (such dumps can be downloaded from the Wikipedia servers):
- wikipedia_graph.cpp, a C++ program that uses GNOME's libxml2, libunicode, and some features from the C++ TR1 library (recent compilers should support them natively).
- ordonne.pl, Perl script, which uses the external Unix sort program.
- merge.pl, Perl script, which uses the external Unix sort program.
Usage:
mkdir temp
cd temp
$PROG/wikipedia_graph $SRC/wiki.xml.gz
$PROG/ordonne.pl
$PROG/merge.pl > ../edge_list
cd ..
This produces an edge_list file containing the list of the graph edges, along with an index file containing the node labels. The format of this file is the one used by the Large Sparse Graph library.
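To give a rough idea of what graph extraction involves, here is a minimal C++ sketch. It assumes the graph in question is the graph of links between pages, and that links appear in the wikitext as [[Target]] or [[Target|label]]. The function name extract_link_targets is hypothetical and is not taken from wikipedia_graph.cpp, which may work quite differently (it parses the XML dump with libxml2, which this sketch does not attempt).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only: returns the target of every
// [[Target]] or [[Target|label]] link found in a wikitext string.
std::vector<std::string> extract_link_targets(const std::string& text) {
    std::vector<std::string> targets;
    std::size_t pos = 0;
    while ((pos = text.find("[[", pos)) != std::string::npos) {
        const std::size_t start = pos + 2;
        const std::size_t end = text.find("]]", start);
        if (end == std::string::npos) break;  // unterminated link: stop scanning
        std::string inner = text.substr(start, end - start);
        // Drop the display label after '|' and the section anchor after '#'.
        const std::size_t cut = inner.find_first_of("|#");
        if (cut != std::string::npos) inner.erase(cut);
        if (!inner.empty()) targets.push_back(inner);
        pos = end + 2;
    }
    return targets;
}
```

Called on the text "See [[Graph theory]] and [[Perl|the Perl language]]", this returns the two targets Graph theory and Perl; pairing each target with the title of the page being read gives one edge per link. The sketch ignores nested links, redirects, namespaces and title normalization, all of which a real extractor would probably need to handle.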
Snapshot extraction
split_xml.pl is a Perl script that extracts snapshots from a Wikipedia dump containing multiple revisions.
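As an illustration of the idea, the following C++ sketch selects, for one page, the revision that was current at a given date. It assumes that a "snapshot" means the state of the wiki at some point in time, and that revision timestamps use the UTC format found in MediaWiki dumps (for example 2006-03-01T12:00:00Z). The Revision struct and the snapshot_at function are hypothetical names; split_xml.pl itself may proceed differently.

```cpp
#include <string>
#include <vector>

// Hypothetical record for one revision of a page.
struct Revision {
    std::string timestamp;  // e.g. "2006-03-01T12:00:00Z"
    std::string text;       // wikitext of that revision
};

// Returns a pointer to the latest revision whose timestamp is not after
// `cutoff`, or nullptr if the page had no revision yet at that date.
// Timestamps in this fixed-width UTC format sort chronologically when
// compared as plain strings, so no date parsing is needed.
const Revision* snapshot_at(const std::vector<Revision>& revisions,
                            const std::string& cutoff) {
    const Revision* best = nullptr;
    for (const Revision& r : revisions) {
        if (r.timestamp <= cutoff &&
            (best == nullptr || r.timestamp >= best->timestamp)) {
            best = &r;
        }
    }
    return best;
}
```

Applying this to every page of a multi-revision dump, with the same cutoff, would give one snapshot. The linear scan does not require the revisions to be sorted, at the cost of looking at each of them once.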
