Pierre Senellart

Contact: pierre@senellart.com

Last Modification
2009-02-24 13:24:37 UTC

Wikipedia-related stuff

Introduction

This page collects scripts for extracting content from Wikipedia and other Wikimedia wikis.

Graph extraction

The following C++ program and Perl scripts extract the graph of Wikipedia from an XML dump of the database (such dumps can be downloaded from the Wikipedia servers):

  • wikipedia_graph.cpp, a C++ program that uses GNOME's libxml2, libunicode, and some features of the C++ TR1 library (recent compilers should support these natively).
  • ordonne.pl, a Perl script that uses the external Unix sort program.
  • merge.pl, a Perl script that uses the external Unix sort program.

Usage:

mkdir temp
cd temp
$PROG/wikipedia_graph $SRC/wiki.xml.gz
$PROG/ordonne.pl
$PROG/merge.pl > ../edge_list
cd ..
    

This produces an edge_list file, containing the list of the graph edges, along with an index file, containing the node labels. The output format is the one used by the Large Sparse Graph library.
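
To give a rough idea of what building such a graph involves, here is a minimal C++ sketch. It assumes that the graph is the graph of links between pages, with one node per page title and one edge per [[link]]. It is not taken from wikipedia_graph.cpp: the names extract_link_targets, NodeIndex and page_edges are hypothetical, and the sketch ignores title normalization, namespaces and redirects, as well as the on-disk format of the edge_list and index files.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper (an assumption, not part of wikipedia_graph.cpp):
// returns the targets of all [[target]], [[target|label]] and
// [[target#section]] links found in a wikitext string.
std::vector<std::string> extract_link_targets(const std::string& wikitext) {
    // Group 1 captures everything after "[[" up to the first '|', '#' or ']'.
    static const std::regex link_re(R"(\[\[([^\]|#]+)[^\]]*\]\])");
    std::vector<std::string> targets;
    for (std::sregex_iterator it(wikitext.begin(), wikitext.end(), link_re), end;
         it != end; ++it) {
        targets.push_back((*it)[1].str());
    }
    return targets;
}

// Assigns consecutive integer ids to node labels, in order of first appearance.
// labels() plays the role of the index of node labels.
class NodeIndex {
public:
    std::size_t id_of(const std::string& label) {
        auto res = ids_.emplace(label, labels_.size());
        if (res.second) labels_.push_back(label);  // label seen for the first time
        return res.first->second;
    }
    const std::vector<std::string>& labels() const { return labels_; }

private:
    std::unordered_map<std::string, std::size_t> ids_;
    std::vector<std::string> labels_;
};

using Edge = std::pair<std::size_t, std::size_t>;  // (source id, target id)

// Builds the outgoing edges of one page from its title and wikitext.
std::vector<Edge> page_edges(const std::string& title,
                             const std::string& wikitext,
                             NodeIndex& index) {
    std::vector<Edge> edges;
    const std::size_t src = index.id_of(title);
    for (const std::string& target : extract_link_targets(wikitext)) {
        edges.emplace_back(src, index.id_of(target));
    }
    return edges;
}
```

Calling page_edges once per page of the dump, with a shared NodeIndex, yields the edges of the whole graph, and NodeIndex::labels() gives the label of each node id.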

Snapshot extraction

split_xml.pl is a Perl script that extracts snapshots from a Wikipedia dump containing multiple revisions.
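
As an illustration of what extracting a snapshot can mean, the C++ sketch below assumes that a snapshot is the state of the wiki at a given date, so that for each page one keeps the latest revision made on or before that date. This is an assumption about the idea, not a description of what split_xml.pl actually does; the Revision type and the snapshot_at function are hypothetical. Timestamps are assumed to be in the UTC form used in MediaWiki XML dumps (for example 2009-02-24T13:24:37Z), which can be compared as plain strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical in-memory form of one revision of a page (an assumption,
// not a structure used by split_xml.pl).
struct Revision {
    std::string timestamp;  // e.g. "2009-02-24T13:24:37Z"
    std::string text;       // wikitext of the page at that revision
};

// Returns the most recent revision whose timestamp is not later than cutoff,
// or nullptr if the page did not exist yet at that date.
// Revisions may be given in any order.
const Revision* snapshot_at(const std::vector<Revision>& revisions,
                            const std::string& cutoff) {
    // String comparison is only meaningful if all timestamps share this format.
    assert(cutoff.size() == 20 && cutoff.back() == 'Z');
    const Revision* best = nullptr;
    for (const Revision& rev : revisions) {
        if (rev.timestamp <= cutoff &&
            (best == nullptr || rev.timestamp > best->timestamp)) {
            best = &rev;
        }
    }
    return best;
}
```

Applying snapshot_at to every page of a multi-revision dump with the same cutoff gives one snapshot of the wiki. Repeating this with several cutoffs gives a series of snapshots.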