DBpedia 15% sample for analysis purposes
Dear DBpedia users,
If you want to observe how an algorithm behaves on the DBpedia graph, it can be useful to run it on a representative subset to reduce the computational cost. If that interests you, here is some material:
- a paper comparing sampling methods and their ability to preserve graph properties: http://www.stat.cmu.edu/~fienberg/Stat36-835/Leskovec-sampling-kdd06.pdf
- random walker code to sample a linked data graph from a SPARQL endpoint (tune the SPARQL filters in the function getRandomNeighbor to add or remove constraints on node selection, and run it on a powerful computer if you need a large subset): http://fr.dbpedia.org/fichiers/DBpedia/en/3.7/DBpediaSampler.java
- a sampling result on the English DBpedia (including wikiPageWikiLink relations), 15% of 3.64 million resources = 546,000 nodes: http://fr.dbpedia.org/fichiers/DBpedia/en/3.7/randomWalkerResults_en.dbpedia.rar
The sampling method and results presented above give a representative selection of nodes; the sample does not contain any arcs.
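For readers who want a feel for the idea before opening the linked file, here is a minimal sketch of random-walk node sampling on a small in-memory graph. It is not the code of DBpediaSampler.java: the class name RandomWalkSampler, the adjacency-map input, the restart probability and the "stale walk" threshold are all assumptions made for illustration. Only the function name getRandomNeighbor is borrowed from the post, and its body here simply picks a neighbor uniformly at random, whereas the real one queries a SPARQL endpoint with filters. Like the published sample, it returns nodes only, no arcs.

```java
import java.util.*;

// Illustrative sketch only: random walk with restart over an in-memory graph.
class RandomWalkSampler {

    // Hypothetical stand-in for the SPARQL-backed neighbor lookup:
    // picks one neighbor uniformly at random.
    static String getRandomNeighbor(List<String> neighbors, Random rng) {
        return neighbors.get(rng.nextInt(neighbors.size()));
    }

    // Walks the graph until targetSize distinct nodes have been visited.
    // graph maps each node to its outgoing neighbors.
    static Set<String> sample(Map<String, List<String>> graph, int targetSize,
                              double restartProb, Random rng) {
        List<String> nodes = new ArrayList<>(graph.keySet());
        Set<String> visited = new LinkedHashSet<>();
        int target = Math.min(targetSize, nodes.size());
        if (target <= 0) {
            return visited;
        }
        String start = nodes.get(rng.nextInt(nodes.size()));
        String current = start;
        visited.add(current);
        int stepsSinceNew = 0;
        int maxStale = 100 * nodes.size(); // assumed threshold for "walk is stuck"

        while (visited.size() < target) {
            List<String> neighbors = graph.getOrDefault(current, Collections.emptyList());
            if (neighbors.isEmpty() || stepsSinceNew > maxStale) {
                // Dead end or stuck in a small region: pick a fresh start node.
                start = nodes.get(rng.nextInt(nodes.size()));
                current = start;
                stepsSinceNew = 0;
            } else if (rng.nextDouble() < restartProb) {
                // Fly back to the start node of the current walk.
                current = start;
            } else {
                current = getRandomNeighbor(neighbors, rng);
            }
            if (visited.add(current)) {
                stepsSinceNew = 0;
            } else {
                stepsSinceNew++;
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Toy ring graph of 20 nodes; each node links to its two ring neighbors.
        Map<String, List<String>> graph = new HashMap<>();
        int n = 20;
        for (int i = 0; i < n; i++) {
            graph.put("n" + i, Arrays.asList("n" + ((i + 1) % n), "n" + ((i + n - 1) % n)));
        }
        int target = (int) Math.round(0.15 * n); // 15% of the nodes
        Set<String> sampled = sample(graph, target, 0.15, new Random(42));
        System.out.println(sampled.size() + " nodes sampled: " + sampled);
    }
}
```

To adapt the sketch to a SPARQL endpoint, only getRandomNeighbor would need to change; adding or removing filters there is what changes which nodes the walk can reach.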
Feel free to share and have fun!
Nicolas

