It is a simple Java program that supports you in manually evaluating links between resources of the Semantic Web.
Evaluating means that you take a random sample (or all) of a set of links and that you determine for each of those
links if they are correct or incorrect.
It can
load reference data (the links) as ntriples or alignment format
save the evaluated links as ntriples (according to the LATC standards split in positive.nt and negative.nt),
alignment format or as tab separated CSV
work with geocoordinates, calculate and show geographical distances between the two nodes of a link
create a README file with the number of links, sample size, sample precision and date in it
show a graph of precision by confidence cutoff which helps you determine the optimal confidence threshold
Included are also executable classes, scripts and XSLT stylesheets that allow you to:
determine the number of resources with multiple link partners (high link "polygamy" is an indicator of bad
link and/or data quality in sameAs links)
convert alignment files to tab separated CSV which helps create random samples
convert ntriples files to csv
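As a sketch of what the ntriples-to-CSV conversion boils down to, here is a hypothetical one-liner (not the bundled converter itself): it strips the angle brackets and the trailing " ." from simple N-Triples link lines and emits "subject,object" CSV rows. The filenames are assumptions.

```shell
# Hypothetical conversion: <s> <p> <o> .  ->  s,o
# Only handles plain URI-to-URI link triples; filenames are examples.
sed -E 's/^<([^>]+)> <[^>]+> <([^>]+)> \.$/\1,\2/' links.nt > links.csv
```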
Motivation
One of the core aims of the semantic web is to create useful links between already existing resources. If you
happen to create some of those links, you most probably do it with a tool like Silk that reads in a configuration
file ("link specification") and can potentially create millions of links out of such a link specification. Before
flooding the semantic web with millions of links, it is however a good idea to check if those links are correct in
the first place :-) Often those URIs contain large sequences of seemingly random numbers and do not provide enough
information to know what the URI represents. Manually creating a random sample and then copy-pasting dozens of
URLs into the browser can get tedious however so the Evaluation Tool was created to support the user with that
task.
Disclaimer
This program was gradually developed as my own tool to help me in my work and is in no way guaranteed to be bug
free or thoroughly optimized for usability or ease of installation. Before you implement such a program yourself,
however, I think just using mine may save you a lot of time and headache should you need the same functionality.
If you only need to evaluate a small set of links once, it may not be worth it, as you need quite a lot of tools
to execute it, namely svn, Maven and Java, but if you use it regularly I'm sure it can save you a lot of work. To
my knowledge such a tool does not already exist (if one does, please tell me).
Installation & Execution
Prerequisites
First, you need Subversion, Maven 2 or higher and Java 6 or higher. If you don't have them, you can install them
in Ubuntu with the following commands (although Subversion and Java should already be preinstalled for most
versions):
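For example (the package names are assumptions based on Ubuntu 11.04 and may differ on your release):

```shell
# Package names assumed for Ubuntu 11.04; adjust for your release.
sudo apt-get install subversion maven2 openjdk-6-jdk
```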
Note that Maven 3 is already available and backwards compatible, so you can install that as well, but Maven 2 is
included in the standard package sources for Ubuntu 11.04 and is thus easier to install. For Windows, I guess you
can find them on the download pages of Subversion, Maven 3 and Java.
Installation
Go to your favourite directory and then execute:
svn checkout https://saim.svn.sourceforge.net/svnroot/saim/trunk saim
cd saim/saim-core
mvn compile
If you work with very big files you may need
export MAVEN_OPTS=-Xmx2048m
(or some other value) beforehand, but in those cases you should probably have set a reasonable load limit anyway.
Update
The Evaluation Tool is now developed in its own project on GitHub, but you should still use the old SourceForge
link if you just want to run it. On GitHub, my colleague Mofeed Hassan is developing a web interface for it with a
different visualization, so if you want to help out with the development, create an issue or open a pull request
there.
Workflow
0. Creating the links
Make sure that you create the links as either ntriples or alignment format. I actually suggest using both, with
the ntriples file containing the links above your chosen thresholds and the alignment file also containing
low-precision links. This makes it easy to identify the best threshold and reselect the links without the need to
run the matching again. In Silk that may look like this:
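A minimal sketch of what such an Outputs section might look like in a Silk link specification (the attribute and parameter names are written from memory of the Silk LSL, and the thresholds and file names are examples, not recommendations):

```xml
<Outputs>
  <!-- accepted links above the chosen threshold, as N-Triples -->
  <Output minConfidence="0.95" type="file">
    <Param name="file" value="accepted_links.nt"/>
    <Param name="format" value="ntriples"/>
  </Output>
  <!-- all links, including low-confidence ones, in alignment format -->
  <Output minConfidence="0.5" type="file">
    <Param name="file" value="all_links.xml"/>
    <Param name="format" value="alignment"/>
  </Output>
</Outputs>
```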
1. Creating a random sample
In order for your evaluation to be representative, your sample has to be random. If you just want to take a quick
peek at your file you can of course just set the load limit and then load your file but depending on the matching
program used for creating them, the links at the beginning of the file may have totally different properties than
those at the end of the file. And if you want to put your evaluation in a paper, it has to be a random sample
anyway.
1.1 With a small linkset
Set the load limit to 0 (unlimited) and load your file. It will be automatically shuffled after being loaded.
Then, set the load limit to your desired sample size (e.g. 250) and go to Operations->Shrink to load limit. You
now have a random sample loaded.
1.2 With a big linkset
If your linkset file is hundreds of megabytes in size, the program may crash due to insufficient heap size (a
character in Java is always 16 bits, so a string needs about twice as much memory as the equivalent UTF-8
encoding). While you can increase the heap size via
export MAVEN_OPTS=-Xmx2048m
(or more), loading and shuffling still take a while, so you can speed up the loading with the following:
1.2.1 If the format is ntriples
Most modern Linux distributions include the sort command with the -R (random) option. If your sort does not have
the -R option, you need to upgrade your GNU Coreutils. If you don't have the sort command at all, it is available
as part of GNU Coreutils for Linux, and ports exist for Windows. Now you can just do:
sort -R links.nt -o links.nt
head -n yoursamplesize links.nt > sample.nt
And load sample.nt.
1.2.2 If the format is alignment
Because the alignment format is XML based, you cannot just shuffle it directly. Fortunately, the Evaluation Tool
includes an XSLT (XSL Transformations) 2.0 stylesheet named aligntocsv. Unfortunately, the Ubuntu standard XSLT
processor xsltproc is only XSLT 1.0 compatible, so you need to install an XSLT 2.0 processor like Saxon. You can
then transform the alignment file to a simple CSV table:
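With Saxon HE, the transformation could look like the following (the jar name and the stylesheet path are assumptions; adjust them to your installation):

```shell
# -s: source file, -xsl: stylesheet, -o: output file (Saxon 9 CLI)
# Jar and stylesheet paths are assumptions.
java -jar saxon9he.jar -s:links.xml -xsl:aligntocsv.xsl -o:links.csv
```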
If you don't want to install an XSLT 2.0 processor, you can also just use your browser, since modern browsers
include built-in XSLT processors (note, though, that most of them only support XSLT 1.0). In this case you would
just prepend the following line to your links.xml file:
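The line in question is presumably an xml-stylesheet processing instruction pointing at the aligntocsv stylesheet, something like the following (the href is an assumption and must point to wherever the stylesheet lives on your machine):

```xml
<?xml-stylesheet type="text/xsl" href="aligntocsv.xsl"?>
```

Opening links.xml in the browser then renders the CSV, which you can save and shuffle: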
sort -R links.csv -o links.csv
head -n yoursamplesize links.csv > sample.csv
Now you can load sample.csv via Load->Reference as CSV.
2. Evaluating
After loading and shuffling, the program displays a list of the links together with a few buttons. Initially,
only the URLs of the links are displayed, but the label thread sequentially loads the representative property
(probably rdfs:label) for each URL from a SPARQL endpoint. If the labels are loaded correctly and display the
right property, you can now evaluate all the links with the buttons "correct", "incorrect" and "unsure". The
"URLs" button resolves the URLs of a link in the browser and also displays all their triples from the SPARQL
endpoint. If the labels are not properly displayed, you need to...
Configure the name sources
The name source file is located under saim-core/config/namesources.csv. You can open it in the program with
Options->Edit name source file and, when you are finished, reload it with Options->Reload the name source file. On
some platforms you may need to edit the file manually, as the program tries Desktop.edit() first, which is not
supported on all platforms, and then falls back to "gedit" and, if that fails, "edit". The table below shows the
structure of the name source file. There are four columns: prefix, property, endpoint and hasGeoCoordinates.
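An entry might look like the following (all values are illustrative assumptions, not taken from the shipped file):

```
prefix,property,endpoint,hasGeoCoordinates
http://dbpedia.org/resource/,http://www.w3.org/2000/01/rdf-schema#label,http://dbpedia.org/sparql,false
```

The prefix selects which source handles a URL, the property is the one displayed as the label, the endpoint is the SPARQL endpoint queried for it, and hasGeoCoordinates tells the tool whether geographical distances can be computed for resources from that source.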