Sunday, January 8, 2017

A short tutorial on using the Virtuoso RDF database to store RDF data and execute a SPARQL query


I have used 4store, which is easy to install ("apt-get install 4store") and very easy to use. There is short but reasonable documentation for 4store at https://4store.danielknoell.de/trac/wiki/Documentation/. Unfortunately, the code on GitHub (https://github.com/garlik/4store) has not seen much development lately, and newer SPARQL constructs (e.g. subselect queries) are not implemented. My task required queries with nested SELECT clauses, so I was forced to look for another RDF database and SPARQL endpoint.


Install

My understanding of "easy to install" is "apt-get install" or very close. Virtuoso was the next candidate I found; for Debian stretch it is packaged and ready to install. The installation is smooth, but you have to enter a password during installation - write it down, it is required later!

sudo apt-get install virtuoso-opensource 

which gives version 6.1.6 at the time of writing. 

Start

The Virtuoso tutorial https://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSStart does not start at the beginning and points to various places that are not connected to my task and sometimes do not even work. You can see in it, however, that you should point your browser at http://localhost:8890/conductor (obvious!), where the default user is "dba". The password is not the default stated in the tutorial but the one you set during installation. "The rest is self-explanatory", says the "tutorial".

The initial screen, after login, presents 5 major themes (from Business Process Integration to Enterprise Collaboration), none of them relevant if you just want to load RDF data and execute a query. Bewildering!

At the top of the screen, a list of 9 tabs includes "LinkedData", which opens a second list of tabs and a screen for "SPARQL Execution" - closer. But how to load data?


Load data

Check the other 8 tabs... perhaps "QuadStore Upload"? Correct. Use "Browse" to find the file (in my case N-Triples with the .nt extension) and extend the proposed "Named Graph IRI" with some name (e.g. xx) to get "http://localhost:8890/DAVxx". This value will be required for the query. Click on upload. The filename will disappear and in the upper right you can read - if you look there - "upload finished". Done.
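To try the upload without real data, a tiny N-Triples file is enough. The following is a minimal sketch in Python, assuming the three dove properties used in the query further down (md5, filename, filepath) carry plain string values. The subject IRIs, file names and hash values are made up for illustration and are not taken from my data.

```python
DOVE = "http://gerastree.at/dove_2017#"

def escape_literal(text):
    """Escape backslash, double quote and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))

def to_ntriples(units):
    """Turn (md5, filename, filepath) tuples into N-Triples text, one triple per line."""
    lines = []
    for i, (md5, filename, filepath) in enumerate(units):
        subject = f"<{DOVE}unit{i}>"  # hypothetical subject IRI scheme
        for prop, value in (("md5", md5), ("filename", filename), ("filepath", filepath)):
            lines.append(f'{subject} <{DOVE}{prop}> "{escape_literal(value)}" .')
    return "\n".join(lines) + "\n"

# Made-up sample: two files sharing the same (shortened) hash value.
sample = [
    ("d41d8cd9", "a.txt", "/home/frank/a.txt"),
    ("d41d8cd9", "b.txt", "/home/frank/b.txt"),
]

nt_text = to_ntriples(sample)
print(nt_text, end="")
```

Saving the printed text in a file with the .nt extension gives something that can be uploaded as described above.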

Query

Now click on the SPARQL tab again, paste the "Named Graph IRI" value into the "Default Graph IRI" field and the query into the query field. Test with 

SELECT * WHERE { ?s ?p ?o } LIMIT 10

and you should see the list of s, p and o values underneath. Neat.
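The same test query can also be sent over HTTP without going through Conductor. The sketch below assumes that the SPARQL endpoint is reachable at http://localhost:8890/sparql (the usual Virtuoso default, not something I configured) and that the graph IRI is the one chosen at upload. The names build_url and run_query are my own helpers, not part of Virtuoso.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:8890/sparql"  # assumed default endpoint
GRAPH = "http://localhost:8890/DAVxx"      # the Named Graph IRI chosen at upload

def build_url(query, graph=GRAPH, endpoint=ENDPOINT):
    """Build a GET URL for the query, asking for JSON results."""
    params = {
        "default-graph-uri": graph,
        "query": query,
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urllib.parse.urlencode(params)

def run_query(query):
    """Send the query and return the result bindings (needs a running server)."""
    with urllib.request.urlopen(build_url(query)) as response:
        data = json.load(response)
    return data["results"]["bindings"]

# Only the URL is built here; run_query() is not called, so no server is needed.
url = build_url("SELECT * WHERE { ?s ?p ?o } LIMIT 10")
print(url)
```

With the server running, run_query("SELECT * WHERE { ?s ?p ?o } LIMIT 10") should return one dictionary per result row, with the keys s, p and o.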

A more complex query can be pasted in, e.g. 

SELECT ?filename ?fpath ?count ?md5group
WHERE {
    ?unit dove:md5 ?md5group .
    ?unit dove:filename ?filename .
    ?unit dove:filepath ?fpath .
    {
        SELECT ?md5group (COUNT(?unitgroup) AS ?count)
        WHERE {
            ?unitgroup dove:md5 ?md5group .
        }
        GROUP BY ?md5group
        HAVING (COUNT(?unitgroup) > 1)
    }
}
ORDER BY ?md5group ?fpath

and produces the expected results.
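To make clear what the nested query computes: the inner SELECT counts the units per md5 value and keeps only the values that occur more than once; the outer part then lists every file belonging to such a group. The same logic in plain Python, as a sketch on made-up data (file names, paths and hash values are invented):

```python
from collections import Counter

# (filename, filepath, md5) for each unit; values are invented.
units = [
    ("a.txt", "/data/a.txt", "md5-1"),
    ("b.txt", "/backup/b.txt", "md5-1"),
    ("c.txt", "/data/c.txt", "md5-2"),
]

# Inner SELECT: count units per md5 value, keep groups with more than one unit.
counts = Counter(md5 for _, _, md5 in units)
duplicated = {md5: n for md5, n in counts.items() if n > 1}

# Outer SELECT: every file in such a group, ordered by md5 and then by path.
rows = sorted(
    ((name, path, duplicated[md5], md5)
     for name, path, md5 in units if md5 in duplicated),
    key=lambda row: (row[3], row[1]),
)

for row in rows:
    print(row)
```

The file c.txt does not appear in the result because its hash value occurs only once.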

The complex query above relies on the prefix "dove" having been added in the "Namespace" tab with the value "http://gerastree.at/dove_2017#".
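An alternative to registering the prefix in Conductor is to declare it in the query itself with a standard SPARQL PREFIX line, which makes the query self-contained. A small sketch of a helper that prepends such declarations; with_prefixes is a name I made up, and the short query is only an example:

```python
PREFIXES = {"dove": "http://gerastree.at/dove_2017#"}

def with_prefixes(query, prefixes=PREFIXES):
    """Prepend one PREFIX declaration per entry to the query text."""
    header = "".join(f"PREFIX {name}: <{iri}>\n" for name, iri in prefixes.items())
    return header + query

query = with_prefixes("SELECT ?unit ?md5 WHERE { ?unit dove:md5 ?md5 } LIMIT 10")
print(query)
```

The resulting text can be pasted into the query field like any other query.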

Queries can be saved (by clicking on "Save") by giving a "File of XML template value" starting with "/DAV/", e.g. "/DAV/groupsWC", and a description text. In the SPARQL Execution tab, selecting saved queries shows a list of the stored ones. Click "Edit" to get a query back into the query window and execute it as above, or click the query name, which executes it and opens the XML file in the browser. The URL, in my case http://localhost:8890/DAV/groupsWC, can be opened in any browser and produces the XML document, which could be useful.

Conclusion

My purpose, loading data and executing a SPARQL query, was satisfied, and for now I did not venture deeper into Virtuoso and Conductor. I hope this short tutorial helps you get started quickly and easily with Virtuoso. It also shows, implicitly, what is wrong with (a) tutorials which do not start at the beginning, (b) inconsistent terminology and (c) software which is not modular but includes "all but the kitchen sink".

Update

I had some trouble with loading RDF and found:
  1. The RDF triples can be provided as a gzipped file (no other compression method is currently supported) - this saves a lot of disk space!
  2. There is a bulk loader which can be used; it took me a while to understand that the directory from which the files are loaded must be listed under DirsAllowed in the /etc/virtuoso/virtuoso.ini file:
    DirsAllowed              = ., /usr/share/virtuoso-opensource-6.1/vad, /home/frank/virtuosoData, /home/frank/.Dove
        
    You can then use the isql-vt console (the name on Debian; it may have a different name in other distributions) as
    sudo isql-vt localhost username password
    ld_dir('/home/frank/virtuosoData/', '*.*', 'http://graphname');
    rdf_loader_run();
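For point 1 above, the gzipped file can be produced with the gzip command line tool or with a few lines of Python. The sketch below is only an illustration: gzip_file is my own helper name, and the demonstration runs on a throwaway file in a temporary directory rather than on the real data directory.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def gzip_file(source):
    """Compress source into a new file named source + '.gz' and return its path."""
    source = Path(source)
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target

# Demonstration on a throwaway file; in practice pass the real .nt file.
workdir = Path(tempfile.mkdtemp())
nt_file = workdir / "sample.nt"
nt_file.write_text('<http://example.org/s> <http://example.org/p> "o" .\n', encoding="utf-8")

gz_file = gzip_file(nt_file)
print(gz_file.name)
```

The original file is left in place; if both versions sit in the directory given to ld_dir, the '*.*' mask used above would presumably pick up both, so it seems safer to keep only one of them there.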
    
