Pierre Senellart

Contact: pierre@senellart.com

Last Modification
2017-05-02 20:33:55 UTC

Wikipedia-related stuff

Introduction

This webpage collects a few programs and scripts that are useful for extracting content from Wikipedia and other Wikimedia wikis.

Graph extraction

The following C++ program and Perl scripts can be used to extract the graph of Wikipedia from an XML dump of the database (such dumps can be downloaded from the Wikimedia servers):

  • wikipedia_graph.cpp, C++ program, which uses GNOME's libxml2, libicu, and some features of the C++ TR1 library (recent compilers should support them natively). A typical compile-and-link command line is:
    g++ wikipedia_graph.cpp -I/usr/include/libxml2 -I/usr/include/unicode -o wikipedia_graph -lxml2 -licuuc
    
  • ordonne.pl, Perl script, which uses the external Unix sort program.
  • merge.pl, Perl script, which uses the external Unix sort program.

Usage (with $PROG the directory containing the programs and $SRC the directory containing the dump):

$PROG/wikipedia_graph $SRC/wiki.xml.gz
$PROG/ordonne.pl
$PROG/merge.pl
    

This produces an edge_list file, listing the edges of the graph, and an index file, containing the node labels. Both files use the format of the Large Sparse Graph library.
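
For readers unfamiliar with what graph extraction involves, here is a minimal sketch of one plausible core step: collecting the targets of internal links ("[[Target]]" or "[[Target|label]]") from the wikitext of a page. It is not taken from wikipedia_graph.cpp. The function name extract_link_targets is hypothetical, and the sketch assumes that graph edges correspond to internal links between pages. It ignores nested links, namespaces, redirects, and title normalization, all of which a real extractor would have to handle.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from wikipedia_graph.cpp): returns the targets of
// internal links found in a wikitext string, in order of appearance.
// Assumption: links are written "[[Target]]" or "[[Target|label]]", possibly
// with a "#section" suffix; nested links are not handled.
std::vector<std::string> extract_link_targets(const std::string& wikitext) {
    std::vector<std::string> targets;
    std::size_t pos = 0;
    while ((pos = wikitext.find("[[", pos)) != std::string::npos) {
        const std::size_t start = pos + 2;
        const std::size_t close = wikitext.find("]]", start);
        if (close == std::string::npos) {
            break;  // unterminated link: stop scanning
        }
        std::string inner = wikitext.substr(start, close - start);
        // Keep only the page title: drop the display label after '|'
        // and the section anchor after '#'.
        inner = inner.substr(0, inner.find_first_of("|#"));
        if (!inner.empty()) {
            targets.push_back(inner);
        }
        pos = close + 2;
    }
    return targets;
}
```

Each (page title, link target) pair obtained this way would correspond to one candidate edge, with the titles playing the role of node labels.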

Snapshot extraction

split_xml.pl is a Perl script that extracts snapshots from a Wikipedia dump containing multiple revisions.
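
As an illustration of what a snapshot can mean, the sketch below selects, for a single page, the latest revision made at or before a cutoff date. This is only one plausible reading of "snapshot" and is not taken from split_xml.pl. The Revision struct and the snapshot_at function are hypothetical names. The sketch relies on the fact that MediaWiki dumps store revision timestamps as ISO 8601 UTC strings (for example 2017-05-02T20:33:55Z), which sort chronologically when compared as plain strings.

```cpp
#include <string>
#include <vector>

// Hypothetical representation of one revision of a page, as it might be
// read from a multi-revision dump.
struct Revision {
    std::string timestamp;  // ISO 8601 UTC, e.g. "2017-05-02T20:33:55Z"
    std::string text;       // wikitext of the page at that revision
};

// Returns the latest revision whose timestamp is at or before `cutoff`,
// or nullptr if the page did not exist yet at that time.
// Timestamps in this fixed format compare chronologically as strings.
const Revision* snapshot_at(const std::vector<Revision>& revisions,
                            const std::string& cutoff) {
    const Revision* best = nullptr;
    for (const Revision& rev : revisions) {
        if (rev.timestamp <= cutoff &&
            (best == nullptr || rev.timestamp > best->timestamp)) {
            best = &rev;
        }
    }
    return best;
}
```

Applying such a selection to every page of the dump, with the same cutoff, would yield one view of the wiki as it stood at that date.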