Wikipedia-related stuff
Introduction
This page collects scripts that are useful for extracting content from Wikipedia and other Wikimedia wikis.
Graph extraction
The following C++ program and Perl scripts can be used to extract the graph of Wikipedia from an XML dump of the database (such dumps can be downloaded from the Wikipedia servers):
- wikipedia_graph.cpp, C++ program, which uses GNOME's libxml2, libicu and some features from the C++ TR1 library (recent compilers should support them natively). A typical compile-and-link command line is:
g++ wikipedia_graph.cpp -I/usr/include/libxml2 -I/usr/include/unicode -o wikipedia_graph -lxml2 -licuuc
- ordonne.pl, Perl script, which uses the external Unix sort program.
- merge.pl, Perl script, which uses the external Unix sort program.
Usage:
$PROG/wikipedia_graph $SRC/wiki.xml.gz $PROG/ordonne.pl $PROG/merge.pl
This will produce an edge_list file, containing the list of the graph edges, along with an index file, containing the node labels. The format of these files is the one used by the Large Sparse Graph library.
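To give an idea of what extracting a graph from wikitext involves, here is a minimal sketch in C++. It assumes the edges are the internal [[...]] links between pages, which this page does not state explicitly. The function name extract_link_targets is hypothetical and the code is not taken from wikipedia_graph.cpp, which may handle many more cases (redirects, namespaces, title normalization).

```cpp
#include <string>
#include <vector>

// Collect the targets of [[...]] links found in a page's wikitext.
// Simplifications (assumptions of this sketch): nested brackets, templates,
// <nowiki> sections and namespace prefixes get no special treatment.
std::vector<std::string> extract_link_targets(const std::string& wikitext) {
    std::vector<std::string> targets;
    std::string::size_type pos = 0;
    while ((pos = wikitext.find("[[", pos)) != std::string::npos) {
        const std::string::size_type start = pos + 2;
        const std::string::size_type end = wikitext.find("]]", start);
        if (end == std::string::npos) break;  // unterminated link: stop scanning
        std::string target = wikitext.substr(start, end - start);
        // Drop the display label ("[[Target|label]]") and the
        // section anchor ("[[Target#Section]]").
        const std::string::size_type cut = target.find_first_of("|#");
        if (cut != std::string::npos) target.erase(cut);
        if (!target.empty()) targets.push_back(target);
        pos = end + 2;
    }
    return targets;
}
```

Each (page title, target) pair returned for a page would then become one edge of the graph; links that point only to a section of the same page, such as [[#Top]], yield no target and are skipped.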
Snapshot extraction
split_xml.pl is a Perl script that extracts snapshots from a Wikipedia dump containing multiple revisions.
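As an illustration only, the following C++ sketch shows one plausible reading of "snapshot": for a given page, keep the newest revision that is not later than a chosen date. This interpretation, the Revision struct and the snapshot_at function are assumptions made for the example and do not describe how split_xml.pl actually works. The sketch relies on the fact that revision timestamps in MediaWiki dumps use a fixed-width UTC format (for example 2006-03-01T12:00:00Z), so they can be compared as plain strings.

```cpp
#include <string>
#include <vector>

// One revision of a page (hypothetical type for this sketch).
struct Revision {
    std::string timestamp;  // fixed-width UTC, e.g. "2006-03-01T12:00:00Z"
    std::string text;       // wikitext of the revision
};

// Return the newest revision whose timestamp is <= cutoff,
// or nullptr if the page did not exist yet at that date.
const Revision* snapshot_at(const std::vector<Revision>& revisions,
                            const std::string& cutoff) {
    const Revision* best = nullptr;
    for (const Revision& rev : revisions) {
        // String comparison is valid because the timestamp format is fixed-width.
        if (rev.timestamp <= cutoff &&
            (best == nullptr || rev.timestamp > best->timestamp)) {
            best = &rev;
        }
    }
    return best;
}
```

The revisions do not need to be sorted, since every candidate is compared against the best one found so far.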