First of all, please make sure you have the Java 8 Runtime Environment installed on your system.
Download all Yago graphs you need from the downloads section as .ttl files. In my case I took all graphs from TAXONOMY, CORE and additionally the yagoDBpediaInstances and yagoDBpediaClasses collections to have relations from Yago entities to DBpedia ones. Download the files to a folder on your system, let’s say
/home/ferdinand/yago/ on Linux or
C:\Users\Ferdinand\yago on Windows, and extract them using 7zip.
Download
apache-jena-3.1.1.zip (or a newer version) and
apache-jena-fuseki-2.4.1.zip from the Apache Jena downloads page and extract them to, let’s say,
/home/ferdinand/jena/ and /home/ferdinand/fuseki/ (or the analogous directories on Windows).
Now the .ttl files need some preprocessing: a few characters that Jena’s parser rejects have to be replaced before it will accept the data. On Linux run
sed -i 's/|/-/g' ./* && sed -i 's/\\\\/-/g' ./* && sed -i 's/–/-/g' ./*
from within the directory where your .ttl files are. On Windows, start the Ubuntu Bash, navigate to the respective directory (e.g.
/mnt/c/Users/Ferdinand/yago) and run the same command. It will take several minutes. I mean, really several…
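If you want to check what those substitutions do before letting them loose on gigabytes of data, here is a quick demo on a throwaway file (the file name sample.ttl is made up):

```shell
# Create a tiny file containing the three problem patterns:
# a pipe, a double backslash, and an en dash.
printf 'a|b c\\\\d e–f\n' > sample.ttl

# The same three substitutions as above, applied to the sample file.
sed -i 's/|/-/g' sample.ttl
sed -i 's/\\\\/-/g' sample.ttl
sed -i 's/–/-/g' sample.ttl

cat sample.ttl   # all three patterns are now plain dashes: a-b c-d e-f
```

Once the small file looks right, run the combined command on the real data.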
Create a folder to be used for the database later; the tdbloader and Fuseki commands below assume a folder named data inside your Yago directory.
Add the Fuseki root directory (e.g.
/home/ferdinand/fuseki) and the Jena bin directory (bat on Windows) (e.g.
/home/ferdinand/jena/bin) to your
PATH environment variable. On Linux you would do this by editing your
~/.bash_profile; on Windows you can search for “environment variables” and then use the Windows system settings dialog.
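On Linux, the corresponding ~/.bash_profile addition would look like this (the paths are the example locations from this guide; adjust them to yours):

```shell
# Append the Fuseki root and the Jena bin directory to PATH
# so that fuseki-server, tdbloader and tdbquery are found.
export PATH="$PATH:/home/ferdinand/fuseki:/home/ferdinand/jena/bin"
```

Remember to open a new terminal (or source the file) for the change to take effect.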
Load the graphs using tdbloader (tdbloader.bat on Windows):
tdbloader --loc data ./*
from the directory where your .ttl files are located. This may take several hours. Not joking…
Start Fuseki by typing
java -jar fuseki-server.jar --update --loc /home/ferdinand/yago/data /myGraph
to run Fuseki with your entire Yago graph available under the myGraph alias.
Open http://localhost:3030 in your browser and start making queries.
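For a first smoke test in the web UI, a query like the following should work; it uses only the standard RDFS vocabulary, so it does not depend on any Yago-specific names:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?sub ?super
WHERE { ?sub rdfs:subClassOf ?super }
LIMIT 10
```

If this returns ten taxonomy edges, the store is loaded and reachable.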
If you’re about to run really expensive queries, consider the following. Set the
JVM_ARGS environment variable to
-Xms512m -Xmx2048M -XX:-UseGCOverheadLimit -XX:+UseParallelGC. This should keep you from running into OutOfMemory errors.
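On Linux, setting that variable before starting Fuseki looks like this (the value is the one from above):

```shell
# Give the JVM a larger heap and disable the GC overhead limit.
export JVM_ARGS="-Xms512m -Xmx2048M -XX:-UseGCOverheadLimit -XX:+UseParallelGC"
```

On Windows, set the same variable via the “environment variables” dialog mentioned earlier.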
Use tdbquery since it might be a little more performant than the web SPARQL endpoint. An example tdbquery command might look like this, assuming you have a file
q.txt that contains your SPARQL query:
tdbquery --loc=/home/ferdinand/yago/data --time --results=CSV --query=q.txt > output.txt
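As an illustration, q.txt could contain something as simple as a triple count, which is a cheap way to verify the store is fully populated:

```sparql
SELECT (COUNT(*) AS ?triples)
WHERE { ?s ?p ?o }
```

The CSV result lands in output.txt, and --time prints how long the query took.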