
Web annotations: addressing complex regions with URIs

At the moment there is no Web standard for addressing complex spatial regions (ellipses, polygons, etc.) in media objects by means of URIs. Applications that need this feature, such as our multimedia annotation clients (e.g., http://dme.arcs.ac.at/annotation/), currently use their own schemes for defining such regions, with the result that annotations created by one client cannot be interpreted by others that do not support or understand these schemes.

The W3C Media Fragments URI specification proposes a syntax for addressing media fragments in several dimensions (temporal, spatial, etc.). However, the spatial dimension currently supports only rectangular segments, which is insufficient for most annotation use cases. A few days ago I raised that issue on the MF public mailing list: http://lists.w3.org/Archives/Public/public-media-fragment/2010Sep/0000.html
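
For reference, the rectangular case that the MF syntax does cover looks like `#xywh=160,120,320,240`, optionally prefixed with `pixel:` or `percent:`. The following is a minimal sketch of how a client could pull that rectangle out of a URI. The helper name `parse_xywh` and the example URIs are assumptions made for illustration and are not part of the specification.

```shell
#!/bin/sh
# Sketch: extract the spatial (xywh) dimension from a Media Fragments URI.
# parse_xywh is a hypothetical helper name; only rectangles can be expressed.

parse_xywh() {
    uri=$1
    case $uri in
        *'#'*) fragment=${uri#*\#} ;;  # keep everything after the first '#'
        *) return 1 ;;                 # no fragment at all
    esac

    # Walk through the '&'-separated name=value pairs of the fragment.
    rest=$fragment
    while [ -n "$rest" ]; do
        pair=${rest%%&*}
        case $rest in
            *'&'*) rest=${rest#*&} ;;
            *) rest= ;;
        esac
        case $pair in
            xywh=*)
                value=${pair#xywh=}
                unit=pixel             # default unit when no prefix is given
                case $value in
                    pixel:*)   value=${value#pixel:} ;;
                    percent:*) unit=percent; value=${value#percent:} ;;
                esac
                # Print: unit x y width height
                echo "$unit $(echo "$value" | tr ',' ' ')"
                return 0
                ;;
        esac
    done
    return 1                           # fragment has no spatial dimension
}
```

Calling `parse_xywh 'http://example.com/image1.jpg#xywh=160,120,320,240'` prints `pixel 160 120 320 240`. There is no equivalent call for a polygon or an ellipse, which is exactly the gap discussed here.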

The following snippet (taken from http://www.w3.org/2001/Annotea/User/Protocol.html) shows how “document”-based annotations worked back in 2002, when the Web was still quite document-centric. At that time, XPointer was sufficient for addressing regions in Web documents:



 
  
  
    http://serv1.example.com/some/page.html#xpointer(id("Main")/p[2])
  
  Annotation of Sample Page
  ...
 

Over the last decade people started building multimedia annotation tools that require more sophisticated fragment identification mechanisms than XPointer. Because there was no specification for how to represent complex media fragments, they introduced their own solutions, with the side effect that interoperability among annotation clients is no longer given. Another annotation client would probably understand the <a:annotates> property, but since there is no fragment definition in the media object URI and no <context> property, it would simply assume that there is no fragment.

Here is an example of how we represent an image annotation with a polygon region using SVG fragments; this does not mean that any other client can interpret it. We embedded the annotated image in our SVG definition so that any SVG-capable client that does not understand RDF (e.g., a browser) can also render the annotation.



  
    
	Some annotation text....

	
		
		
			
				
				
		
		
		
	

 

With the MF specification there is – at least in my opinion – a chance to bring this back on the interoperability track. We could, for instance, allow for the specification of complex segments in MF URIs, which could result in something like this:



  
    
	Some annotation text....
 

…which does not really make sense, because there is such a variety of possible fragment types that it is hardly possible to cover them all in a single spec. Therefore we propose to introduce a by-reference MF identification: a key/value pair telling clients that more information about (spatial) fragments is available at some other resource. Here is an example of what this could look like:



	
		
		Some annotation text....
	
	
		image/svg+xml
		...
	

This tells a client that receives the RDF document and has to render the annotation that the annotation annotates an image (http://example.com/image1.jpg) and that information about the fragment is available at a second resource. In this example I used a URN instead of a URL for that resource to illustrate that a URI in RDF is not necessarily dereferenceable. In this case it is simply defined in the annotation RDF document.

Technically, we can achieve the same by defining a new annotation standard that provides means (additional classes and properties) for representing complex media fragments. However, if we can reuse the MF spec for addressing complex segments in media objects we can avoid the divergence between browser implementations and annotation use cases.

In 2002 it was possible to address segments in documents with existing Web specs (XPointer). IMHO it should be possible to address complex segments in media objects on the Web with existing and upcoming Web specs (e.g., MF).

Semantic Web and Libraries

Here I try to explain the relationship between libraries and semantic web technology as I observed it throughout the past years: http://www.semantic-web.at/1.36.resource.297.bernhard-haslhofer-x22-linked-data-is-an-attempt-to-continue-the-well-established-informat.htm

MacPorts cleanup

If your MacPorts installation becomes corrupted for some reason, you might want to get rid of all installed ports and restart from a fresh ports installation. The following commands might be useful for that:

  • Remove all ports and their dependencies: sudo port uninstall --follow-dependents all
  • Clean up everything: sudo port clean --all all

Nice quote on standardization

Found a nice quote on standardization in Goldfarb’s paper “HyTime: A standard for structured hypermedia interchange”

You can standardize some of the facilities
for all of the applications, or all
of the facilities for some of the applications,
but you can’t standardize all
of the facilities for all of the applications.

How to set up your local DBpedia instance #2

I learned on the DBpedia mailing list that a new script for importing RDF dumps into Virtuoso has been released: http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderExampleDbpedia

The instructions in my previous blog post should still work, but might be outdated.

Notes from SWIB09

From November 24th to 25th a workshop on Semantic Web in Digital Libraries, called SWIB’09, took place in Cologne, Germany. It was an event dedicated to librarians, information professionals, and other interested people who have implemented, or are currently thinking of implementing, linked data in their institutions. The workshop language was German, which makes it somewhat difficult for the rest of the world to follow what was going on there. But the main goal was to reach the librarians in Germany, so the language choice is understandable.

Before summarizing some (by far not all!!!) of the main points / outcomes, my personal impression of this workshop: I was amazed at how many libraries in Germany are already dealing with Linked Data. Many are currently investigating how they can implement LD in practice, and some of them already have…which I really was not aware of, although I have been following the developments in that domain for quite a while now.

Day 1

The target audience of the first day was decision makers, so it started with an introduction to the topic:

Jakob Voss was the first speaker and started his “Introduction to the Semantic Web” talk with a quote from the LibraryThing founder Tim Spalding, who wrote in a blog entry “…it would be nice if you could link to a book in a library catalogue [...].”, a statement that describes the problem with current library catalogues very well. He then explained the origins of the Web, which was built because of motivations (uniform encoding, uniform addressing, uniform transport) that must sound very familiar to any librarian. The problem with current libraries is that they are still gatekeepers. Although they already have all the linked data ingredients (identifiers, metadata) in place, they are still locking them in their databases and provide no way to dereference them directly. In that way they do not (or only slowly) take the chance to integrate their assets with the Web, and they risk that others take over their role in a global information network such as the Web.

Stefan Gradmann from HU Berlin then continued with a great talk entitled “Why should Cultural Institutions care about the Semantic Web?”. He first pointed at library 101, which currently doesn’t mention the term Semantic Web a single time, and pointed out that librarians will never be the ones who have to implement these technologies, but that they at least need to know the basics. I very much liked his enumeration of good and bad reasons why librarians should care:

Bad reasons:

  1. because it is required to mention “semantic web” in research proposals in order to get funding
  2. because Web 3.0 is a logical consequence of Web 2.0 – Web X.0 are marketing terms and indicate that the developments in the Web are linear in time, which is not the case

Good reasons (with ascending importance):

  1. because otherwise others (not the library people) will do it
  2. because the resources in libraries might in future be considered as a collection of museum resources (that are not on the Web)
  3. because after the introduction of RDA, libraries cannot ignore Semantic Web anymore (the RDA vocabulary is defined in RDFS)
  4. because this can enable new forms of interaction. For instance, a search for European capitals mentioned in the literature of the 20th century
  5. because librarians can contribute their knowledge and their catalogues to the Semantic Web. The Semantic Web idea actually originates from a library background
  6. because also in future, in a global information network, libraries should be recognized as leading academic institutions

He concluded his talk with a threat (“Semantic Web may lead to the extinction of stored catalogue records”) and a perspective (libraries can contribute their knowledge in dealing with changes within the Semantic Web).

After the break it was my turn. The main messages I tried to communicate were:

  • Libraries must care about the (Semantic) Web. Their future clients expect that information is available on the Web. If this is not the case, closed resources within libraries won’t be perceived as possible sources for fulfilling the users’ information needs.
  • Semantic Web (Identifier, Metadata, Controlled Vocabularies) is actually an attempt to continue on the Web what librarians have been doing in their catalogues for centuries.

Then I gave some pointers to existing Linked Data implementations in the context of digital libraries (see previous blog post).

Then Ed Summers and Anders Söderbäck gave great talks on LCSH and LIBRIS. Like the previous speakers, they pointed out the importance of the topic for libraries. Since there is already much information available on these services (see my previous post), I won’t repeat the details here, just the updates I was not aware of: LIBRIS now links to http://dbpedia.org/, http://id.loc.gov/authorities/, and http://lcsubjects.org/. In future they would like to add even more links to http://dbpedia.org/, http://musicbrainz.org/, http://viaf.org/, http://openlibrary.org/, etc.

Jürgen Kett from the German National Library (DNB) announced that an internal linked data project started in November 2009 and that the plan is to expose the DNB’s authority data as Linked Data by mid 2010. They also plan to provide a SPARQL endpoint, and he pointed out that the business model will be adapted so that all data can/will be free of charge. Great to hear that.

Timo Borst from the German National Library of Economics (ZBW) announced that their “STW Thesaurus for Economics” is now exposed as Linked Data at http://zbw.eu/beta/stw-ws.

The first day ended with a talk by Anja Jentzsch from FU Berlin about DBpedia and the current developments in that project. She announced that the current DBpedia version now also integrates authority data from the German national library.

Day 2

Elena Semenova started the day by explaining ontologies from the perspective of information specialists. Her explanations sounded quite logical to me, so I conclude that the perspectives in the distinct communities are quite similar.

For me, the second talk was quite interesting, because one day earlier I attended a CIDOC CRM workshop at the German Archaeological Institute, and there too the integration of CIDOC CRM with the Linked Data approach was definitely an issue. Therefore I was glad to see that Karin Teichmann from the DNB gave an introduction to the CRM at a Linked Data-related workshop.

The Bibliographic Ontology (Bibo) as a potential successor for bibliographic data formats was the topic of the next talk. Jakob Voss explained why this ontology was developed (“There was a need at Zitgist”). He explained its main structural elements (documents) and emphasized that reusing existing ontologies is a central paradigm for the design of ontologies on the Web.

Stefan Gradmann and Marlies Orlensky then gave some insights into the current developments in the semantic layer of Europeana – the European Digital Libraries flagship project. The project is still in an early state, but an experimental ThoughtLab for Linked Data in Europeana is already in place. At the moment they are collecting the user requirements for the semantic layer in terms of search and browsing functionality.

Patrick Danowsky from the CERN Library talked about licensing issues and explained that public domain (as alternative to copyright and creative commons) is probably the best solution for open data. He also gave a preview of the CERN digital library which will publish its data as linked data next week.

Joachim Neubert, also from the German National Library of Economics, presented their experiences with publishing data as Linked Data, described the data conversion methodology they used, and pointed to a beta service (e.g., http://zbw.eu/beta/pm20/person/00012). He also pointed to the Pedantic Web Group, which summarizes frequently observed problems on the Web of Data and gives valuable hints on how to solve them.

Then Sven Lars Svensson and Jürgen Kett gave an extended presentation on the project at the DNB, which aims at exposing the DNB authority data as Linked Data. They built the ontology on FOAF, extended with DNB-specific fields, use SKOS, etc. Most importantly, we can expect a first running prototype by mid 2010.

Felix Ostrowski discussed analogies between models in the application logic of (repository) software and the models (ontologies) on the Linked Data Web, and André Hagenbruch presented his ideas on how the library of Bochum University could integrate Linked Data in the design of a (their) library portal. More details in the slides….

Some outcomes of the concluding session were:
  • An infrastructure is required so that people working on linked data in the library field know about each other. This is not the case at the moment.
  • Semantic Web / Linked Data must also be integrated into information science curricula at universities and into training courses for librarians.
  • The community should also start talking with vendors of digital library systems. They are currently not taking part in the discussion (at least in Germany).
  • …

Linked Data in the Context of Digital Libraries

Next week I am giving a talk on Linked Data in the Context of Digital Library Systems at SWIB09 in Cologne. So it is time to catch up with recent developments in that area. Some projects I am mentioning here will be presented in separate talks. Nevertheless I would like to briefly sketch their main contributions.

For each project, I try to find the answers to the following questions, which might be important for future adopters of Linked Data technology:

  1. What kind of data are exposed as Linked Data?
  2. How did they implement the Linked Data Principles?
  3. What was the motivation of the institutions to consider Linked Data as a way for sharing data?

The Library of Congress Subject Headings (LCSH)

The Library of Congress was one of the early adopters of Linked Data. Last year (or even earlier?) Ed Summers and his colleagues started to build a first prototype and exposed the approximately 260,000 authority records held at the Library of Congress according to the Linked Data Principles. They converted the LCSH, which were originally available as MARCXML, into SKOS and used the Library of Congress Control Number to mint the URIs of the exposed concepts. On May 1st 2009 the experimental service went into production (see http://lcsh.info/) and the LCSH are now available at http://id.loc.gov/, following the scheme http://id.loc.gov/authorities/{lccn}#concept. Here is an example: http://id.loc.gov/authorities/sh85058486#concept. Dereferencing this URI using the HTTP Accept Header “Accept: application/rdf+xml” returns:



  
    
    Hallstatt period
    
    
    
    1986-02-11T00:00:00-04:00
    
    1996-09-11T10:10:33-04:00
    
    
  
  
    Mounds--Rhine River Valley
  
  
    Iron age
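
The request behind a response like the one above can be sketched as follows. This is only an illustration: the helper names `lcsh_uri` and `fetch_rdf` are made up for this example, and the service may behave differently today than it did when this post was written.

```shell
#!/bin/sh
# Sketch: mint an LCSH concept URI from an LCCN and dereference it as RDF/XML.
# lcsh_uri and fetch_rdf are hypothetical helper names.

lcsh_uri() {
    # Follows the scheme http://id.loc.gov/authorities/{lccn}#concept
    echo "http://id.loc.gov/authorities/$1#concept"
}

fetch_rdf() {
    # The '#concept' fragment is never sent to the server, so strip it first.
    doc_uri=${1%%\#*}
    # -s: quiet, -L: follow redirects, -H: send the content-negotiation header
    curl -s -L -H 'Accept: application/rdf+xml' "$doc_uri"
}

# Usage (needs network access):
#   fetch_rdf "$(lcsh_uri sh85058486)"
```

The point of the fragment identifier is that the concept (`...#concept`) and the document describing it get distinct URIs while only one HTTP request is needed.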
  

The Swedish Union Catalogue (LIBRIS)

The Swedish Union Catalogue was the other Linked Data service presented at DC2008 (here is the paper). The catalogue comprises data from about 175 libraries and contains six million records following the MARC21 standard. Today records describing various types of resources including persons, authors, subjects, organizations are accessible at http://libris.kb.se/. This example shows a record about a book: http://libris.kb.se/bib/10542240. When you dereference this URI, you get a 303 See Other response with Location http://libris.kb.se/data/bib/10542240?format=application%2Frdf%2Bxml. This in turn returns the RDF record reproduced at the end of this section.

The records are made available by building a simple RDF wrapper on top of the integrated library system. Similar to the LCSH, persistent URIs were created using the record’s unique number. So the URIs follow the pattern http://libris.kb.se/resource/bib/{number} for bibliographic records and http://libris.kb.se/resource/auth/{number} for authority records. Dublin Core (http://dublincore.org/) is used as vocabulary for bibliographic data, FOAF for persons and organizations. Currently the returned data are not interlinked with any other data source.
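
The URI patterns and the 303 redirect can be explored with a few lines of shell. This is a rough sketch under the assumption that the patterns above are still valid; `libris_uri` and `show_redirect` are hypothetical helper names, not part of any LIBRIS tooling.

```shell
#!/bin/sh
# Sketch: build LIBRIS resource URIs and inspect the 303 redirect.
# libris_uri and show_redirect are hypothetical helper names.

libris_uri() {
    # $1 is the record type (bib or auth), $2 the record's unique number
    case $1 in
        bib|auth) echo "http://libris.kb.se/resource/$1/$2" ;;
        *) echo "unknown record type: $1" >&2; return 1 ;;
    esac
}

show_redirect() {
    # Print the status code and redirect target without following the redirect.
    curl -s -o /dev/null -H 'Accept: application/rdf+xml' \
         -w '%{http_code} %{redirect_url}\n' "$1"
}

# Usage (needs network access):
#   show_redirect "$(libris_uri bib 10542240)"
```

Not following the redirect automatically makes the two steps visible: first the resource URI answers with 303 See Other, then the document URI in the Location header delivers the RDF.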

The motivation for LIBRIS to make their bibliographic data available as linked data was that the standard way to access bibliographic data is still through search-retrieve protocols such as SRU/W or Z39.50. There is currently no way to address records directly and there are hardly any links between records. Since the LIBRIS developers had to create a new Web interface anyway, they decided to make the data available for machines as well, i.e., to expose them following the linked data principles. With minor effort, they are now opening up data that was previously available only to the library community and/or people familiar with the complexity of existing specifications.


  

    Hallstatt textiles : technical analysis, scientifc investigation and experiment on Iron Age textiles /
    2005
    text
    Anton Kern
    Kern, Anton 1947-
    Hallstatt : eine Einleitung zu einem sehr bemerkenswerten Ort
    
    
  

RAMEAU Subject Headings

Within the TELplus project, European institutions started to work on a service (http://www.cs.vu.nl/STITCH/rameau/) that exposes a SKOSified version of the RAMEAU subject headings as open linked data. RAMEAU is the main subject vocabulary at the French national library (BnF) and other French institutions. It contains approx. 160,000 concepts including common nouns and geographic names. The concepts are interlinked with the LCSH concepts based on 60,000 available manual mappings. The service was announced in April 2009 and is still experimental. Here is the announcement mail.

Depending on a client’s preference, the service can return rdf/xml, or html descriptions containing RDFa markup. Here is the sample response returned when dereferencing the resource http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb11942233p (which is linked in the above LCSH record) and following the 303 See Other response:






]>

   FRBNF119422333
   Civilisation de Hallstatt
   Civilisation hallstattienne
   Culture de Hallstatt
   Culture hallstatienne
   Hallstatt, Civilisation de
   Hallstattien
   Hallstattkultur
   Osthalstattkultur
   Premier âge du fer
   Civilisation du premier âge du fer en Europe
   Source : Dict. de la préhistoire / A. Leroi-Gourhan, 1994. - Les sociétés de la préhistoire / J.-P. Mohen, Y. Taborin, 1998. - Les Celtes / V. Kruta, 2000. - La préhistoire / D. Vialou, 2004
   Domaine : 930
   
   
   
   
   
   
   




   
   
   http://www.w3.org/2004/02/skos/core#closeMatch
   1.0




   
   
   http://www.w3.org/2004/02/skos/core#closeMatch
   1.0



   


Dewey Decimal Classification (DDC) Summaries

In August 2009 Michael Panzer from OCLC announced that the Dewey Decimal Classification (DDC) Summaries would be published as linked data. From http://dewey.info one can retrieve the top 1000 classes of the Dewey Decimal Classification in nine languages. Like all the other services mentioned before, it uses content negotiation to determine whether to deliver RDF/XML, XHTML + RDFa, or other serialization formats (Turtle, JSON). The service is still experimental, and the data are not interlinked with others. The technical details about this service are described here, including the promising statement that OCLC plans to conduct further development in this direction.
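
To make the content negotiation step more concrete, here is a deliberately simplified sketch of how a server could map an Accept header to one of these serializations. It is not how dewey.info is implemented: `choose_format` is a hypothetical helper, the media types are common choices rather than the ones the service is known to use, and q-values and wildcards are ignored.

```shell
#!/bin/sh
# Sketch: pick a serialization from an Accept header (simplified conneg).
# choose_format is a hypothetical helper; q-values and wildcards are ignored.

choose_format() {
    rest=$1
    # Check the media types in the order in which the client listed them.
    while [ -n "$rest" ]; do
        entry=${rest%%,*}
        case $rest in
            *,*) rest=${rest#*,} ;;
            *) rest= ;;
        esac
        mtype=${entry%%;*}                      # drop parameters such as ;q=0.9
        mtype=$(echo "$mtype" | tr -d ' ')      # drop surrounding blanks
        case $mtype in
            application/rdf+xml)              echo rdfxml;     return 0 ;;
            application/xhtml+xml|text/html)  echo xhtml+rdfa; return 0 ;;
            text/turtle)                      echo turtle;     return 0 ;;
            application/json)                 echo json;       return 0 ;;
        esac
    done
    echo xhtml+rdfa                             # fallback for browsers and unknown types
}
```

For example, `choose_format 'text/plain, text/turtle;q=0.8'` prints `turtle`, while an empty or unknown Accept header falls back to the human-readable representation.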

I couldn’t find any information about the effort it took to implement that service. It seems that the design of the URI patterns was far from trivial: the DDC Summary service supports things like versioning and allows clients to retrieve the changes that happened to concepts over time.

A very interesting aspect of the DDC Summaries Linked Data Service is that the exposed data include licensing information, which explicitly permits reuse of the exposed data under the terms of the Creative Commons BY-NC-ND license. This means that you can use the exposed DDC Summaries under the conditions that (i) you attribute the work in the manner specified by the author, (ii) you do not use the data for commercial purposes, and (iii) you do not alter, transform or build upon this work. I am not a lawyer, but in my interpretation linking from your data sets to the DDC Summaries should be perfectly legal. Using the DDC Summaries data for organizing assets in your own commercial library system is not.

Here is a sample resource exposed by this service retrieved from http://dewey.info/class/943/2009/08/about.en:




  
    
    
    OCLC Online Computer Library Center, Inc.
    
    
    en
    943
    
    Central Europe; Germany
    
    
  


VIAF

Another very interesting project highly relevant for the digital libraries domain is called VIAF: The Virtual International Authority File (http://viaf.org/). It is a joint project of more than ten national libraries, implemented and hosted by OCLC. Its goal is to match and link the authority files of national libraries and then make that information available on the Web.

Unfortunately there is not much information available on this project, besides a set of slides saying that Semantic Web building blocks are used within this service. But dereferencing a given URL (e.g., http://viaf.org/viaf/56611857) with the HTTP Accept Header application/rdf+xml clearly shows that the VIAF service is also taking the Linked Data principles into account:


	
	
	
		קפקה, פרנץ, 1883-1924
		Kafka, Franz, 1883-1924
		
		
		
	

Summary

So what were the main motivations of all these developments?

First of all, several people within library institutions realized that libraries should open their data to the context of a globally interlinked information network, which is the Web. To achieve that, they must integrate their vocabularies and data with the Web environment (or actually the Web architecture) so that their data can be used and integrated with any other Web application, also in other communities. Linked Data is one possible technical realization of that idea.

The problem with existing library-data exchange protocols is that (i) although they use the Internet infrastructure for exchanging data, they do not really integrate with the Web architecture, and (ii) they are very specific to the digital libraries community and difficult to adopt in other domains. The common building blocks of the Web (e.g., URI, HTTP) are widely accepted across domains.

The effort for implementing a Linked Data service on top of existing systems was obviously rather low: LIBRIS implemented their service as part of their Website reorganization; the first LCSH prototype was implemented by a single person (?).

Interestingly the motivation to adopt Linked Data within library institutions was – as far as I know – always bottom-up, driven by a few technical enthusiasts. This nicely reflects the beginning of developments in other areas (e.g., the Web, Open Source Software, Wikipedia,…). Hopefully we can see similar developments in the digital libraries domain.

How to set up your local DBpedia instance

For research purposes – if you don’t want to become a DBpedia DoS attacker :-) – it might sometimes be necessary to set up a local DBpedia instance in your local Virtuoso Server. The following steps describe how we did that on our Linux machine:

Prerequisites:

So far I haven’t found an official hardware requirements specification for hosting a DBpedia instance. From our experience, you need lots of RAM (>8GB) and fast hard disks.

Perform the following steps in order to set up your local DBpedia instance:

Install a local Virtuoso instance

I think there is no need to recite the Virtuoso installation instructions. In our case everything worked fine.

Check if the isql commandline interface and the conductor interface at http://myhost:8890/ are working.

Make sure that you assign enough memory to Virtuoso. You can do this by tuning the virtuoso.ini parameters. For a machine with 16GB RAM, use the following parameters:

NumberOfBuffers = 1000000
MaxDirtyBuffers = 800000
MaxCheckpointRemap = 1000000
DefaultIsolation = 2

More information about tuning is available here and here.

Download (and convert) the DBpedia dumps

If you want to make your DBpedia data dereferenceable within your domain (e.g., myhost.org instead of dbpedia.org) you should rename the resources in the DBpedia dumps before importing them into Virtuoso.

For a single dump file you can use this command:

nohup sed -i -e 's%http://dbpedia.org/%http://myhost.org/%g' dumpFile.nt &
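
Since `sed -i` rewrites the dump in place, it can be worth trying the substitution on a single triple first. The following sketch does that; `rewrite_uris` is a hypothetical helper name, myhost.org is a placeholder domain, and the sample triple is made up for illustration.

```shell
#!/bin/sh
# Sketch: try the URI rewrite on one triple before touching a whole dump.
# rewrite_uris is a hypothetical helper; myhost.org is a placeholder domain.

rewrite_uris() {
    # '%' is used as the sed delimiter because URIs are full of '/'
    sed -e 's%http://dbpedia.org/%http://myhost.org/%g'
}

sample='<http://dbpedia.org/resource/Vienna> <http://www.w3.org/2000/01/rdf-schema#label> "Vienna"@en .'
printf '%s\n' "$sample" | rewrite_uris
# prints: <http://myhost.org/resource/Vienna> <http://www.w3.org/2000/01/rdf-schema#label> "Vienna"@en .
```

Only the DBpedia resource URI changes; URIs from other namespaces, such as the rdfs:label property, are left untouched.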

For a whole directory of dump files you can use this script:

#!/bin/sh 

OLDURI=$1
NEWURI=$2
FILE=$3
LOGF=`basename $0`.log

if [ -z "$OLDURI" -o -z "$NEWURI" -o -z "$FILE" ]
then
  echo "Usage: `basename $0` [olduri] [newuri] [file-or-directory]"
  exit 1
fi

echo "======================================="
echo "URI replacement started."
echo "======================================="

TRANSFORM_URI () {

   file=$1
   echo "Replacing URI $OLDURI in file $file with $NEWURI in background"

   regexp="s%$OLDURI%$NEWURI%g"

#  echo $regexp 

   # Run sed in the background and append its output (including errors) to the log
   nohup sed -i -e "$regexp" "$file" >> "$LOGF" 2>&1 &

}

if [ -f "$FILE" ]
then
	TRANSFORM_URI $FILE
elif [ -d "$FILE" ]
then
    for ff in `find $FILE -name '*.nt'`
    do
        TRANSFORM_URI $ff
    done
else
   echo "The input is not file or directory"
fi
# Wait for the background sed jobs so that the message below is accurate
wait
echo "======================================="
echo "URI transformation finished."
echo "======================================="

exit 0

Start the script as follows, passing either a directory such as data/ or a single file such as turtleFile.nt as the last argument:

nohup ./adapturis.sh http://dbpedia.org http://myhost.org data/

Load the DBpedia dumps

For loading a single dump file, I am using the following command:

For loading the DBpedia dumps, I used the following script. It either takes a single file or iterates through all files in a directory and loads the dump files using Virtuoso’s ttlp_mt function. If a single dump-file contains erroneous triples, it cuts out these triples in a file called bad.nt and continues inserting the remaining ‘good’ triples.


#!/bin/sh 

PORT=$1
USER=$2
PASS=$3
file=$4
g=$5
LOGF=`basename $0`.log

if [ -z "$PORT" -o -z "$USER" -o -z "$PASS" -o -z "$file" -o -z "$g" ]
then
  echo "Usage: `basename $0` [DSN] [user] [password] [ttl-file] [graph-iri]"
  exit 1
fi

if [ ! -f "$file" -a ! -d "$file" ]
then
    echo "$file does not exist"
    exit 1
fi

mkdir READY 2>/dev/null
rm -f $LOGF $LOGF.*

echo "Starting..."
echo "Logging into: $LOGF"

DOSQL ()
{
    isql $PORT $USER $PASS verbose=on banner=off prompt=off echo=ON errors=stdout exec="$1" > $LOGF
}

LOAD_FILE ()
{
    f=$1
    g=$2
    echo "Loading $f (`cat $f | wc -l` lines) `date \"+%H:%M:%S\"`" | tee -a $LOG

    DOSQL "ttlp_mt (file_to_string_output ('$f'), '', '$g', 255); checkpoint;" > $LOGF

    if [ $? != 0 ]
    then
	echo "An error occurred, please check $LOGF"
	exit 1
    fi

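    # Extract the line number of the first parse error from the log.
    # Note: the three-argument form of match() requires GNU awk (gawk).
    line_no=`grep Error $LOGF | awk '{ match ($0, /line [0-9]+/, x) ; match (x[0], /[0-9]+/, y); print y[0] }'`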
    newf=$f.part
    inx=1
    while [ ! -z "$line_no" ]
    do
	cat $f |  awk "BEGIN { i = 1 } { if (i==$line_no) { print \$0; exit; } i = i + 1 }"  >> bad.nt
	line_no=`expr $line_no + 1`
	echo "Retrying from line $line_no"
	echo "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> ." > tmp.nt
	cat $f |  awk "BEGIN { i = 1 } { if (i>=$line_no) print \$0; i = i + 1 }"  >> tmp.nt
	mv tmp.nt $newf
	f=$newf
	mv $LOGF $LOGF.$inx
	DOSQL "ttlp_mt (file_to_string_output ('$f'), '', '$g', 255); checkpoint;" > $LOGF

	if [ $? != 0 ]
	then
	    echo "An error occurred, please check $LOGF"
	    exit 1
	fi
	line_no=`grep Error $LOGF | awk '{ match ($0, /line [0-9]+/, x) ; match (x[0], /[0-9]+/, y); print y[0] }'`
	inx=`expr $inx + 1`
    done
    rm -f $newf 2>/dev/null
    echo "Loaded.  "
}

echo "======================================="
echo "Loading started."
echo "======================================="

echo "Disabling transaction logging...."
DOSQL "log_enable(0);"

if [ -f "$file" ]
then
    LOAD_FILE $file $g
    mv $file READY 2>> /dev/null
elif [ -d "$file" ]
then
    for ff in `find $file -name '*.nt'`
    do
	abspath=`readlink -f $ff`
	LOAD_FILE $abspath $g
	mv $ff READY 2>> /dev/null
    done
else
   echo "The input is not file or directory"
fi
echo "======================================="
echo "Final checkpoint."
DOSQL "checkpoint;" > temp.res
echo "Enabling transaction logging (row autocommit) again..."
DOSQL "log_enable(2);"
echo "======================================="
echo "Check bad.nt file for skipped triples."
echo "======================================="

exit 0

Assuming you have downloaded all dump files into a directory named data, you can start the RDF data ingest script (named importDumps.sh) as follows:

nohup ./importDumps.sh 1111 dba dba data/ http://mygraphuri.org

Especially with older DBpedia dumps (e.g., 3.2) I had the problem that certain triples contained invalid nodes and were rejected by the Virtuoso Turtle parser. Because in my case I need all triples, even the erroneous ones, I set the parser to a high tolerance level (255). You might change it to 17 or even lower levels. More information about this is available here.

Emergent Semantics # 2 (and Linked Data)

At WWW 2009 I found a paper entitled “idMesh: Graph-based Disambiguation of Linked Data” describing the application of the so-called emergent semantics principles in the Web of Data. It is by the same author as the overview paper mentioned in my previous post.

If we assume that there exist several interlinked profiles on the Web, e.g., a foaf file about Tim Berners-Lee linked with other foaf files, then there is the problem that some profiles actually refer to different entities (e.g., another Tim Berners-Lee). Hence, some profiles within the network (graph) of foaf files do not describe the intended real-world referent. This might happen because links (e.g., owl:sameAs) are created automatically with a certain level of uncertainty or because of spamming.

The approach described in this paper does not use common record linkage / instance matching techniques to decide if two profiles refer to the same real-world entity. It rather analyses the relationships between entities and uses trust-values to detect conflicting links between entities.

The semantics of a profile is not examined based on the underlying model (foaf) and its instances but based on the relationship between the instances, which are treated as purely syntactic structures.

The algorithmic technique applied to this problem is called Factor Graphs, a topic I should probably read more about.