How to load Yago into Apache Jena / Fuseki

Ferdinand Mütsch

2016-11-11

This article describes how to load the Yago Linked Data knowledge collection into an Apache Jena triple store database on Windows 10 as well as on Linux.

First of all, make sure you have the Java 8 Runtime Environment installed on your system.
Download all Yago graphs you need from the downloads section as .ttl files. In my case I took all graphs from TAXONOMY and CORE, and additionally the yagoDBpediaInstances and yagoDBpediaClasses collections to have relations from Yago entities to DBpedia ones. Download the files to a folder on your system, let’s say /home/ferdinand/yago/ on Linux or C:\Users\Ferdinand\yago on Windows, and extract them using 7zip.
Delete all .7z files.
Download apache-jena-3.1.1.zip (or a newer version) and apache-jena-fuseki-2.4.1.zip from the Apache Jena download page and extract them to, let’s say, /home/ferdinand/jena/ and /home/ferdinand/fuseki/ (or the analogous directories on Windows).
Now the .ttl files need to be preprocessed: a few characters that Jena would otherwise reject (pipes, double backslashes and en dashes) are replaced with hyphens. On Linux run sed -i 's/|/-/g' ./* && sed -i 's/\\\\/-/g' ./* && sed -i 's/–/-/g' ./* from within the directory where your .ttl files are. On Windows, start the Ubuntu Bash, navigate to the respective directory (e.g. /mnt/c/Users/Ferdinand/yago) and run the same command. It will take several minutes. I mean, really several…
Create a folder to be used for the database later, e.g. /home/ferdinand/yago/data.
Add the Fuseki root directory (e.g. /home/ferdinand/fuseki) and the Jena bin directory (bat on Windows; e.g. /home/ferdinand/jena/bin) to your PATH environment variable. On Linux you would do this by editing your ~/.bash_profile; on Windows you can search for “environment variables” and then use the Windows system settings dialog.
Load the graphs using tdbloader: run tdbloader --loc data ./*.ttl (tdbloader.bat on Windows) from the directory where your .ttl files are located. Matching only *.ttl keeps the data folder itself out of the file list. This may take several hours. Not joking…
Start Fuseki by running java -jar fuseki-server.jar --update --loc /home/ferdinand/yago/data /myGraph from the Fuseki root directory to run Fuseki with your entire Yago graph available under the myGraph alias.
Open http://localhost:3030 in your browser and start making queries.
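
The preprocessing and loading steps above can also be wrapped in two small shell functions. This is only a rough sketch: the function names clean_ttl and load_ttl are made up for illustration, it assumes GNU sed (for the -i flag) and that Jena's bin directory is already on your PATH, and it only touches .ttl files directly inside the given folder.

```shell
#!/usr/bin/env bash
# Hypothetical helpers sketching the preprocessing and loading steps.
# Assumptions: GNU sed (for -i) and Jena's bin directory on PATH.

# Replace pipes, double backslashes and en dashes with hyphens
# in every .ttl file directly inside the given directory.
clean_ttl() {
    local dir="$1"
    local f
    for f in "$dir"/*.ttl; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # Same three substitutions as the chained sed calls, one pass per file.
        sed -i -e 's/|/-/g' -e 's/\\\\/-/g' -e 's/–/-/g' "$f"
    done
}

# Load all .ttl files from a directory into a TDB database directory.
load_ttl() {
    local dir="$1"
    local db="$2"
    if ! command -v tdbloader >/dev/null 2>&1; then
        echo "tdbloader not found on PATH" >&2
        return 1
    fi
    mkdir -p "$db"
    tdbloader --loc "$db" "$dir"/*.ttl
}
```

With the paths used in this article you would call clean_ttl /home/ferdinand/yago followed by load_ttl /home/ferdinand/yago /home/ferdinand/yago/data. Expect the same long running times as with the plain commands.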

If you’re about to run really expensive queries, consider the following.

Set the JVM_ARGS environment variable to -Xms512m -Xmx2048M -XX:-UseGCOverheadLimit -XX:+UseParallelGC. This raises the maximum heap size to 2 GB and disables the GC overhead limit check, which should prevent most OutOfMemory errors.
Use tdbquery since it might be a little more performant than the web SPARQL endpoint. An example tdbquery command might look like this, assuming you have a file q.txt that contains your SPARQL query: tdbquery --loc=/home/ferdinand/yago/data --time --results=CSV --query=q.txt > output.txt
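
Put together, the two tips above might look like the following sketch. The helper names write_sample_query and run_query are hypothetical, and the query is just a placeholder that fetches ten arbitrary triples; replace it with your own SPARQL.

```shell
#!/usr/bin/env bash
# Sketch: set the JVM options, write a placeholder query to a file
# and run it with tdbquery. Helper names are made up for illustration.

# JVM options from the first tip, exported for Jena's command line tools.
export JVM_ARGS="-Xms512m -Xmx2048M -XX:-UseGCOverheadLimit -XX:+UseParallelGC"

# Write a placeholder SPARQL query (ten arbitrary triples) to the given file.
write_sample_query() {
    cat > "$1" <<'EOF'
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
EOF
}

# Run a query file against a TDB database and store the CSV result.
run_query() {
    local db="$1"
    local query="$2"
    local out="$3"
    tdbquery --loc="$db" --time --results=CSV --query="$query" > "$out"
}
```

For example, write_sample_query q.txt followed by run_query /home/ferdinand/yago/data q.txt output.txt corresponds to the tdbquery command shown above.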
