Wikipedia-related stuff
This page collects scripts for extracting content from Wikipedia and other Wikimedia wikis.
The following C++ program and Perl scripts can be used to extract the graph of Wikipedia from an XML dump of the database (such dumps can be downloaded from the Wikipedia servers):
- wikipedia_graph.cpp, C++ program, which uses Gnome's libxml2, libicu and some features from the C++ TR1 library (recent compilers should support them natively). A typical compile-and-link command line is:
g++ wikipedia_graph.cpp -I/usr/include/libxml2 -I/usr/include/unicode -o wikipedia_graph -lxml2 -licuuc
- ordonne.pl, Perl script, which uses the external Unix sort program.
- merge.pl, Perl script, which uses the external Unix sort program.
mkdir temp
cd temp
$PROG/wikipedia_graph $SRC/wiki.xml.gz
$PROG/ordonne.pl
$PROG/merge.pl > ../edge_list
cd ..
This will produce an edge_list file, containing the list of the graph edges, along with a separate file containing the node labels. The format of these files is the one used by the Large Sparse Graph library.
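To give an idea of what the edges of such a graph are, here is a minimal C++ sketch. It assumes that nodes are pages and that an edge corresponds to a [[...]] link in the wikitext of a page. It is not taken from wikipedia_graph.cpp: the function name extract_link_targets is hypothetical, and redirects, namespaces, title normalization and links nested inside other links (as in image captions) are all ignored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only (not part of wikipedia_graph.cpp).
// Returns the targets of the [[...]] links found in a piece of wikitext.
std::vector<std::string> extract_link_targets(const std::string& wikitext) {
    std::vector<std::string> targets;
    std::size_t pos = 0;
    while ((pos = wikitext.find("[[", pos)) != std::string::npos) {
        const std::size_t start = pos + 2;
        const std::size_t end = wikitext.find("]]", start);
        if (end == std::string::npos) {
            break;  // unterminated link: nothing more to extract
        }
        std::string inner = wikitext.substr(start, end - start);
        // Drop the display label (after '|') and the section anchor (after '#').
        const std::size_t cut = inner.find_first_of("|#");
        if (cut != std::string::npos) {
            inner.erase(cut);
        }
        if (!inner.empty()) {
            targets.push_back(inner);  // one outgoing edge of the current page
        }
        pos = end + 2;
    }
    return targets;
}
```

Calling this function on the text of each page and writing one (page, target) pair per result would give a raw edge list, which would still have to be sorted and deduplicated.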
split_xml.pl is a Perl script that extracts snapshots from a Wikipedia dump containing multiple revisions.
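One possible reading of "snapshot", assumed here and not confirmed by the script itself, is the state of each page at a given date, that is, its latest revision not newer than that date. The C++ sketch below illustrates that selection for a single page. The Revision struct and the snapshot_index function are hypothetical names. It relies on the fact that revision timestamps in MediaWiki XML dumps are written in the form 2006-03-01T12:00:00Z, so that comparing them as strings orders them chronologically.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory form of one <revision> element of a dump.
struct Revision {
    std::string timestamp;  // e.g. "2006-03-01T12:00:00Z"
    std::string text;       // wikitext of that revision
};

// Returns the index of the latest revision whose timestamp is not later than
// cutoff, or -1 if the page did not exist yet at that date.
// Assumption: all timestamps use the same fixed-width UTC format, so
// lexicographic comparison matches chronological order.
int snapshot_index(const std::vector<Revision>& revisions,
                   const std::string& cutoff) {
    int best = -1;
    for (std::size_t i = 0; i < revisions.size(); ++i) {
        if (revisions[i].timestamp > cutoff) {
            continue;  // revision made after the snapshot date
        }
        if (best < 0 || revisions[i].timestamp > revisions[best].timestamp) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

The function does not require the revisions to be sorted, since nothing in the text above says in which order a dump lists them.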