Kohacon 2020

Linked Data 101

Presented by Jonathan Hunt (@kayakr)

https://kayakr.gitlab.io/presentation-2020-kohacon-lod/

HTML: Navigate via left/right/up/down, ESC to see all slides, 's' for speaker notes

Linked Data 101

  • 10:30–12:00 Linked Data 101 (Part 1)
       Linked data: what? why? how? who?
  • 12:00–1:30 Lunch
  • 1:30–3:30 Linked Data 101 (Part 2)
       SPARQL, Wikidata, applications, linked data for libraries & archives
  • 3:30–4:00 Afternoon tea & questions

CC-BY-2.0 @dullhunk

Introductions

  • Your aims for workshop?
  • Level of knowledge of Linked Data?
  • kia ora tātou
  • ko Kilimanjaro te maunga
  • ko Hurunui te awa
  • Ōtautahi ahau
  • ko Jonathan Hunt tōku ingoa

In Kakariki canyon, Hokitika river

Catalyst

Catalyst Map

What is Linked Data?

  • 1945: Vannevar Bush, "As we may think", Memex
  • 1965: Ted Nelson, "hypertext", "hypermedia"
  • 1989: Tim Berners-Lee, World Wide Web, HTTP at CERN
  • 2001: "The Semantic Web"($) Tim Berners-Lee, James Hendler, and Ora Lassila
  • Web technologies: HTTP, RDF, URIs [1]

[1] http://knowledgegraph.today/paper.html

Why?

  • Data at scale
  • Hierarchies & graphs
  • Evolving schemas
  • Mapping between data models
  • Sparse data
  • Connecting data islands
  • HTTP: Hypertext Transfer Protocol
    • GET, POST, PUT, DELETE
  • RDF: Resource Description Framework
    • simple data model (subject, predicate, object)
    • more complex models via RDF Schema, shapes, etc.
  • URIs: Uniform Resource Identifiers

TBL - 5 star open data I

TBL - 5 star open data II

★ make your stuff available on the web (whatever format) under an open licence
★★ make it available as structured data (e.g. Excel instead of an image scan of a table)
★★★ use a non-proprietary format (e.g. CSV instead of Excel)
★★★★ use URIs to identify things, so that people can point at your stuff
★★★★★ link your data to other people’s data to provide context

Linked data principles

  • Use URIs as names for things: things not strings
  • Use HTTP URIs so that people can look up those names: "dereferencing"
  • When someone looks up a URI, provide useful info e.g. RDF, JSON-LD
  • Include links to other URIs

https://www.w3.org/DesignIssues/LinkedData.html
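
A minimal sketch of "dereferencing" in Python, assuming the publisher honours the Accept header. It only prepares the request; the URI is the hypothetical one used elsewhere in these slides, and build_lookup is a name made up for this example.

```python
from urllib.request import Request


def build_lookup(uri: str, media_type: str = "application/ld+json") -> Request:
    """Prepare (but do not send) an HTTP GET asking for one representation of a URI."""
    return Request(uri, headers={"Accept": media_type}, method="GET")


# Hypothetical URI: nothing is guaranteed to answer at this address.
req = build_lookup("https://digital.nz/entity/person/101")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing req to urllib.request.urlopen would perform the lookup against a real linked data endpoint.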

Example (hypothetical)

{
  "@context": "http://api.digitalnz.org/schema",
  "@type": "foaf:Person",
  "@id": "https://digital.nz/entity/person/101",
  "name": "Colin John McCahon",
  "altLabel": {
    "en": ["Colin McCahon", "McCahon, Colin John"]
  },
  "dateOfBirth": "1919-08-01",
  "dateOfDeath": "1987-05-27",
  "sameAs": [
    "http://collections.tepapa.govt.nz/Person/1502",
    "http://findnzartists.org.nz/artist/9785/"
  ],
  "seeAlso": [ "https://teara.govt.nz/en/biographies/5m4" ]
}

https://digital.nz/entity/person/101
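
A rough sketch of reading a trimmed copy of the hypothetical record above as plain JSON in Python. It deliberately ignores @context, so keys such as "sameAs" are not expanded to full URIs as a real JSON-LD processor would do; same_as_triples is a helper invented for this example.

```python
import json

# Trimmed copy of the hypothetical record from the slide.
doc = json.loads("""
{
  "@context": "http://api.digitalnz.org/schema",
  "@type": "foaf:Person",
  "@id": "https://digital.nz/entity/person/101",
  "name": "Colin John McCahon",
  "sameAs": [
    "http://collections.tepapa.govt.nz/Person/1502",
    "http://findnzartists.org.nz/artist/9785/"
  ]
}
""")


def same_as_triples(doc: dict) -> list:
    """Turn the sameAs links into (subject, predicate, object) tuples."""
    subject = doc["@id"]
    return [(subject, "sameAs", target) for target in doc.get("sameAs", [])]


for triple in same_as_triples(doc):
    print(triple)
```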

From table to graph (1)

  • Id    First names  Family name  Birth date      Birth place
    9785  Colin        McCahon      1 August 1919   Timaru
    328   Rita         Angus        12 March 1908   Hastings
  • #9785 -> Family name -> "McCahon"

From table to graph (2)

  • Id    First names  Family name  Birth date      Birth place
    9785  Colin        McCahon      1 August 1919   Timaru
  • #9785 -> Family name -> "McCahon"
  • #9785 -> birthDate -> "1 August 1919"
  • subject -> predicate -> object
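
The table-to-graph step can be sketched in Python as follows, assuming the Id column supplies the subject and each remaining column name acts as the predicate. The row_to_triples helper is hypothetical.

```python
rows = [
    {"Id": "9785", "First names": "Colin", "Family name": "McCahon",
     "Birth date": "1 August 1919", "Birth place": "Timaru"},
    {"Id": "328", "First names": "Rita", "Family name": "Angus",
     "Birth date": "12 March 1908", "Birth place": "Hastings"},
]


def row_to_triples(row: dict, key: str = "Id") -> list:
    """One triple per non-key cell: (subject, predicate, object)."""
    subject = "#" + row[key]
    return [(subject, column, value) for column, value in row.items() if column != key]


triples = [t for row in rows for t in row_to_triples(row)]
for s, p, o in triples:
    print(f'{s} -> {p} -> "{o}"')
```

Each row of four non-key cells becomes four triples, so sparse rows simply yield fewer triples instead of empty cells.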

From table to graph (3)

  • Id    First names  Family name  Birth date      Birth place
    9785  Colin        McCahon      1 August 1919   Timaru
  • #9785 -> Family name -> "McCahon"
  • #9785 -> birthDate -> "1 August 1919"

From table to graph (4)

  • Id    First names  Family name  Birth date      Birth place
    9785  Colin        McCahon      1 August 1919   Timaru

Graphs (history)

Konigsberg bridges.png, 7 bridges.svg, Königsberg graph.svg

Images from https://en.wikipedia.org/wiki/Seven_Bridges_of_Königsberg

But we should have URIs...

Namespaces and prefixes

subject can be object

These are Compact URIs, or CURIEs: https://www.w3.org/TR/2010/NOTE-curie-20101216/
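
A small sketch of how a prefix map turns a CURIE into a full URI and back, in Python. The two namespaces are the ones quoted in these slides; expand_curie and compact_uri are illustrative names, not part of any library.

```python
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def expand_curie(curie: str, prefixes: dict = PREFIXES) -> str:
    """foaf:familyName -> http://xmlns.com/foaf/0.1/familyName"""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown prefix in {curie!r}")
    return prefixes[prefix] + local


def compact_uri(uri: str, prefixes: dict = PREFIXES) -> str:
    """Inverse of expand_curie; returns the URI unchanged if no namespace matches."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri


print(expand_curie("foaf:familyName"))
print(compact_uri("http://purl.org/dc/elements/1.1/creator"))
```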
"Tomorrow will be the same but not as this is" thumbnail appears courtesy of the Colin McCahon Research and Publication Trust.

subject can be object

Cool URIs

  • Uniform Resource Identifier
    ISBN, car rego, NZBN, ORCID, DOI, etc., see @ldodds
  • TBL: Cool URIs don't change (1998)
  • On a domain you control
  • Use natural keys (e.g. ids, names)
  • Technology neutral (not .aspx, .php)
  • HTTP response codes (RFC2616 or https://http.cat/)
    301, 300, 404, 410 Gone, 420, 451, etc.
    • @ldodds https://twitter.com/ldodds/status/1042150473615204359
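
A quick Python lookup of the response codes listed above, using the standard library's http.HTTPStatus enum. The describe helper is made up for this example; 420 is not in that enum, so it falls through to the placeholder text.

```python
from http import HTTPStatus


def describe(code: int) -> str:
    """Return the standard reason phrase, or a placeholder for unknown codes."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        return "not in Python's HTTPStatus enum"


for code in (301, 300, 404, 410, 420, 451):
    print(code, describe(code))
```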

httpRange-14

  • nza:9785 -> dc:creator -> dnz:36542738
  • from W3C Technical Architecture Group
  • distinguish between id of resource and representation of the resource
  • content negotiation; HTTP 303 "See other"
  • Europeana proxies

"Tomorrow will be the same but not as this is" thumbnail appears courtesy of the Colin McCahon Research and Publication Trust.

httpRange-14

  • https://digitalnz.org/id/36542738
       dc:creator "Colin McCahon"
       dc:creator "Christchurch Art Gallery"
  • if Accept: text/html, 303 to https://digitalnz.org/page/36542738
  • if Accept: application/ld+json, 303 to https://digitalnz.org/data/36542738
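
The redirect rule above can be sketched as a small Python function. This is a deliberately naive reading of the Accept header (no q-values or wildcards), and returning 406 when nothing matches is an assumption of this sketch, not something the slide specifies.

```python
REPRESENTATIONS = {
    "text/html": "https://digitalnz.org/page/{id}",
    "application/ld+json": "https://digitalnz.org/data/{id}",
}


def negotiate(record_id: str, accept: str):
    """Return (status, location) for a request to the /id/ URI of a thing."""
    media_types = (part.split(";")[0].strip() for part in accept.split(","))
    for media_type in media_types:
        template = REPRESENTATIONS.get(media_type)
        if template:
            # 303 See Other: the thing itself is not a document, but this one describes it.
            return 303, template.format(id=record_id)
    return 406, None  # assumption: no acceptable representation


print(negotiate("36542738", "text/html"))
print(negotiate("36542738", "application/ld+json"))
```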

Entity !== Subject

  • "Colin John McCahon" vs ATL: "McCahon, Colin John, 1919-1987" vs Te Ara: "McCahon, Colin John"
  • foaf:name "Colin John McCahon"
  • ATL: skos:prefLabel "McCahon, Colin John, 1919-1987"
  • FindNZArtists: skos:prefLabel "McCahon, Colin"
  • Te Ara: skos:prefLabel "McCahon, Colin John"
  • Te Papa: skos:prefLabel "Colin McCahon"

Entity !== Subject, 2

Serialisation (as Turtle, .ttl)


@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix tgn: <http://vocab.getty.edu/tgn/> .

<http://findnzartists.org.nz/artist/9785>
  foaf:familyName "McCahon" ;
  foaf:birthday "1 August 1919" ;
  dbpedia-owl:birthPlace tgn:1099274 .

<http://findnzartists.org.nz/artist/328>
  foaf:familyName "Angus" ;
  foaf:birthday "12 March 1908" ;
  dbpedia-owl:birthPlace tgn:1098956 .

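Serialisation (as RDF/XML, .xml)

<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:dbpedia-owl="http://dbpedia.org/ontology/">

  <rdf:Description rdf:about="http://findnzartists.org.nz/artist/9785">
    <foaf:familyName>McCahon</foaf:familyName>
    <foaf:birthday>1 August 1919</foaf:birthday>
    <dbpedia-owl:birthPlace rdf:resource="http://vocab.getty.edu/tgn/1099274"/>
  </rdf:Description>

  <rdf:Description rdf:about="http://findnzartists.org.nz/artist/328">
    <foaf:familyName>Angus</foaf:familyName>
    <foaf:birthday>12 March 1908</foaf:birthday>
    <dbpedia-owl:birthPlace rdf:resource="http://vocab.getty.edu/tgn/1098956"/>
  </rdf:Description>

</rdf:RDF>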

Serialisation (JSON-LD, .jsonld)

{
  "@context": "http://api.digitalnz.org/schema",
  "@type": "foaf:Person",
  "@id": "https://digital.nz/entity/person/101",
  "name": "Colin John McCahon",
  "altLabel": {
    "en": ["Colin McCahon", "McCahon, Colin John"]
  },
  "dateOfBirth": "1919-08-01",
  "dateOfDeath": "1987-05-27",
  "sameAs": [
    "http://collections.tepapa.govt.nz/Person/1502",
    "http://findnzartists.org.nz/artist/9785/"
  ],
  "seeAlso": [ "https://teara.govt.nz/en/biographies/5m4" ]
}

https://digital.nz/entity/person/101

Other serialisations

  • RDFa (in HTML)
  • Notation3 (.n3)
  • N-Triples (.nt), N-Quads (.nq)
  • HexTuples
  • HDT (Header, Dictionary, Triples)
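
As an illustration of the simplest format in this list, here is a rough Python sketch that writes triples as N-Triples, one statement per line. It assumes any object starting with http:// or https:// is a URI and everything else is a plain literal, which is a shortcut for this example and not a rule of the format.

```python
def nt_term(term: str) -> str:
    """Format an object as <uri> or as an escaped "literal"."""
    if term.startswith(("http://", "https://")):
        return f"<{term}>"
    escaped = term.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def to_ntriples(triples) -> str:
    """Subjects and predicates are always URIs in this sketch."""
    return "\n".join(f"<{s}> <{p}> {nt_term(o)} ." for s, p, o in triples)


print(to_ntriples([
    ("http://findnzartists.org.nz/artist/9785",
     "http://xmlns.com/foaf/0.1/familyName", "McCahon"),
    ("http://findnzartists.org.nz/artist/9785",
     "http://dbpedia.org/ontology/birthPlace", "http://vocab.getty.edu/tgn/1099274"),
]))
```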

SPARQL (round 1)

All triples for an object


              SELECT ?p ?o
              WHERE {
                <info:fedora/qsr-collection:500> ?p ?o
              }
            

Result


              "p","o"
              http://purl.org/dc/elements/1.1/title,Christchurch Press October 2010
              http://purl.org/dc/elements/1.1/identifier,qsr-collection:500
              info:fedora/fedora-system:def/relations-external#isMemberOfCollection,info:fedora/qsr-collection:158
              http://purl.org/dc/elements/1.1/type,Collection
              http://purl.org/dc/elements/1.1/description,Newspapers published by the Christchurch Press in October 2010.
              http://quakestudies.canterbury.ac.nz#isPublishedBy,info:fedora/qsr-contentpartner:9
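
What the query does can be imitated in a few lines of Python: fix the subject, leave predicate and object open, and keep every triple that fits. This is only a sketch of basic pattern matching, not how a SPARQL engine is implemented; the third triple is hypothetical and is there to show a non-matching subject.

```python
def match(triples, s=None, p=None, o=None):
    """Yield triples that fit the pattern; None behaves like a SPARQL variable."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


DC = "http://purl.org/dc/elements/1.1/"
triples = [
    ("info:fedora/qsr-collection:500", DC + "title", "Christchurch Press October 2010"),
    ("info:fedora/qsr-collection:500", DC + "type", "Collection"),
    ("info:fedora/qsr-collection:158", DC + "type", "Collection"),  # hypothetical
]

# Equivalent in spirit to: SELECT ?p ?o WHERE { <info:fedora/qsr-collection:500> ?p ?o }
rows = [(p, o) for _, p, o in match(triples, s="info:fedora/qsr-collection:500")]
for row in rows:
    print(row)
```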
            

Exercise: Real-world federation example

e.g. all books in the Harvard Library written by people born in San Francisco

  1. Visit https://comunica.dev/docs/query/advanced/federation/
  2. In LDF query, click gear icon, set proxy to https://proxy.linkeddatafragments.org/, note HTTPS
  3. Copy & paste the three source URIs from step 1.
  4. Copy & paste the SELECT query (the text between the quotes) from step 1.
  5. Click "Execute query" button

Reification (1)

Reification (2)

  • Assert that Jacinda Ardern is Prime Minister, then reference that statement and give it a date.

@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:jacinda_adern ex:hasRole "Prime Minister" .
<<ex:jacinda_adern ex:hasRole "Prime Minister">> ex:sworn "2017-10-26"^^xsd:date .
            

Reification (3)

  • query for the date of that triple:

PREFIX ex: <http://example.org/>
SELECT ?p ?o
WHERE {
  <<ex:jacinda_adern ex:hasRole "Prime Minister">> ?p ?o
}
  • result:
1	ex:sworn "2017-10-26"^^xsd:date
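
The idea of a statement about a statement can be sketched in Python by letting a whole triple sit in the subject position. This only mirrors the quoted-triple syntax shown above and is not an RDF store; the about helper is hypothetical.

```python
# The identifier is copied as written in the slide's Turtle.
statement = ("ex:jacinda_adern", "ex:hasRole", "Prime Minister")

graph = {
    statement,                               # the plain assertion
    (statement, "ex:sworn", "2017-10-26"),   # a statement about that assertion
}


def about(graph, quoted):
    """Return (predicate, object) pairs whose subject is the quoted triple."""
    return [(p, o) for s, p, o in graph if s == quoted]


print(about(graph, statement))
```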

Reification (4)

screenshot of Apache Jena Fuseki querying using SPARQL

Common vocabularies

  • DCTERMS
  • FOAF (me & my friends)
  • SKOS (terms, hierarchy)
  • Wikidata
  • schema.org (people, place, events, etc.)
  • CIDOC-CRM v6.2.3, E99, P188
  • BIBFRAME (work, instance, item, agent, subject, event)
  • PROV-O (Agent, Activity, Entity)
  • Records in Contexts (WIP RiC-O v0.2, ric, Agent, RecordSet, Record)
  • BIO & more

Choosing predicates

When owl:sameAs isn’t the Same

  • owl:sameAs === schema:sameAs, sameas.org
    "sameAs": [
        "http://collections.tepapa.govt.nz/Person/1502",
        "http://www.teara.govt.nz/en/biographies/5m4/mccahon-colin-john",
        "http://natlib.govt.nz/records/22355455"]
  • challenges with sameAs: e.g. context (PDF)
    • <http://www.w3.org/2000/01/rdf-schema#seeAlso>
    • <http://umbel.org/umbel/sc/isLike>
    • <http://www.w3.org/2004/02/skos/core#exactMatch>
    • <http://www.w3.org/2004/02/skos/core#closeMatch>
    • <http://www.w3.org/2004/02/skos/core#broadMatch>
    • <http://www.w3.org/2004/02/skos/core#narrowMatch>
    • <http://open.vocab.org/terms/similarTo>
    • <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym>
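
One reason sameAs is risky: if it is read as strict identity, every link merges two URIs into one thing, and the merging is transitive. A minimal union-find sketch in Python, using the links listed above with the hypothetical digital.nz person URI as the subject; same_as_clusters is an invented name.

```python
def same_as_clusters(pairs):
    """Group URIs connected by sameAs links, treating the links as transitive."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


subject = "https://digital.nz/entity/person/101"  # hypothetical
pairs = [
    (subject, "http://collections.tepapa.govt.nz/Person/1502"),
    (subject, "http://www.teara.govt.nz/en/biographies/5m4/mccahon-colin-john"),
    (subject, "http://natlib.govt.nz/records/22355455"),
]
clusters = same_as_clusters(pairs)
print(len(clusters), "cluster of", len(clusters[0]), "URIs")
```

A person, a biography page and a catalogue record end up in one cluster, which is why the weaker predicates in the list are often a better fit.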

Validation

Enrichment

Named Entity Recognition (NER)

language models that allow entities, e.g. People, Places, Topics, etc. to be identified

What linked data won't do

  • Doesn't fix data quality
  • Licensing (e.g. CC-BY)
  • Tabular data isn't going away...
    see W3C CSV on the Web

Who (GLAM)

Who (World)?

LOD cloud

  • 1,260 datasets (May 2020)
  • none from NZ!?
  • At best, mentions in Wikidata, VIAF, placenames in Getty (TGN)

LOD cloud as of 27 July 2020

What we haven't talked about

  • Triples -> Quads
  • Language handling, e.g. @en, @mi
  • RDF Schema, SHACL etc.
  • Protégé
  • Inference and logic
  • W3C Linked Data Platform (LDP)
  • LDflex / Comunica
  • GraphQL vs SPARQL
  • PID server

SPARQL (round 2)

All triples for an object


              SELECT ?p ?o
              WHERE {
                <info:fedora/qsr-collection:500> ?p ?o
              }
            

Result


              "p","o"
              http://purl.org/dc/elements/1.1/title,Christchurch Press October 2010
              http://purl.org/dc/elements/1.1/identifier,qsr-collection:500
              info:fedora/fedora-system:def/relations-external#isMemberOfCollection,info:fedora/qsr-collection:158
              http://purl.org/dc/elements/1.1/type,Collection
              http://purl.org/dc/elements/1.1/description,Newspapers published by the Christchurch Press in October 2010.
              http://quakestudies.canterbury.ac.nz#isPublishedBy,info:fedora/qsr-contentpartner:9
            

SPARQL: Adding a second variable

  • Query in previous slide is only subject
  • Visit instance to see available fields
  • Add PREFIX dc: <http://purl.org/dc/elements/1.1/>
  • Add new pattern: ?subject dc:title ?title .
  • or same ?subject via ;
Exercise: Real-world federation example

e.g. all books in the Harvard Library written by people born in San Francisco

  1. Visit https://comunica.dev/docs/query/advanced/federation/
  2. In LDF query, click gear icon, set proxy to https://proxy.linkeddatafragments.org/, note HTTPS
  3. Copy & paste the three source URIs from step 1.
  4. Copy & paste the SELECT query (the text between the quotes) from step 1.
  5. Click "Execute query" button

Exercise

  • a: Query some triples
  • b: Query multiple data sources
  • c: Fork the triples to create new data

a. Query some triples

  1. Visit https://query.linkeddatafragments.org/ with some data
  2. Click "Execute query" button;
    should see 8 triples returned
  3. Query for name of person born in Timaru:
    ?name "Colin McCahon"

b. Add a second datasource

  1. Paste https://untitled-go4y1dompky1.runkit.sh/ and select option rendered in red.
  2. Click "Execute query" button:
    ?name "Colin McCahon"
    ?name "Violet Faigan"

c. Fork the triples to create new data

  1. Visit https://runkit.com/kayakr/nz-artists-turtle
  2. Click "Clone this notebook" button (on left).
  3. On sign up form, click "skip this".
  4. Click "Clone notebook" button.
  5. Click notebook title - add your initials.
  6. Overwrite existing triples with other data.
  7. Click refresh icon
  8. Confirm endpoint is returning data (e.g. click endpoint link)

c, continued

  1. Copy & share endpoint URL from part c
  2. Paste URL into "Choose datasources" and select option rendered in red.
  3. For proxy, click gear icon, click "Set proxy", set to https://proxy.linkeddatafragments.org/. Note HTTPS

Wikidata

Wikidata statements

s - p - o == item - property - value
Wikidata statements: 1.2 billion

Anatomy of a Wikidata item

Wikidata data model diagram https://commons.wikimedia.org/wiki/File:Datamodel_in_Wikidata.svg

Properties

WDQS

  • Locations depicted in paintings: https://w.wiki/hNQ
  • Find all: Nobel Prize winners in Literature, who fought in at least one war, the year they won the prize, and the year the war(s) started

Wikidata for GLAMs

Applications (1)

Applications: Entity Explosion

Applications: OpenRefine

  • OpenRefine, data-cleaning
  • facets: text facet, source by count desc
  • clusters: off-by-one, multiple algorithms
  • sort by length
  • reconciliation
  • geocoding
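
A minimal Python sketch of key-collision clustering, in the spirit of OpenRefine's "fingerprint" method but not its actual code: normalise each value to a key, then group values that share a key. The sample names are illustrative.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Fold to ASCII, lowercase, drop punctuation, then sort the unique tokens."""
    ascii_text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    cleaned = ascii_text.strip().lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Return groups of two or more values with the same fingerprint."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


names = ["McCahon, Colin", "Colin McCahon", "colin mccahon ", "Rita Angus"]
print(cluster(names))
```

Near misses such as single-character typos need a different algorithm (e.g. an edit-distance method), which is why the tool offers several.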

Linked data for libraries

  • MARC
  • FRBR
  • BIBFRAME

MARC (MAchine-Readable Cataloging)

  • developed late 1960s, standardised 1971
  • 245 10 $a Mothership connection $h [sound recording] / $c Parliament.
  • overloaded
  • divergence of strings
  • challenges with complex relations
  • library-specific
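
To show how much structure is packed into one field, here is a small Python sketch that splits the display form of the 245 line above into tag, indicators and subfields. It assumes this human-readable layout with $ as the subfield delimiter, not the binary MARC exchange format; parse_marc_line is a hypothetical helper.

```python
import re


def parse_marc_line(line: str) -> dict:
    """Split 'TAG IND $a ... $b ...' into its parts."""
    tag, indicators, rest = line.split(" ", 2)
    subfields = [(code, value.strip())
                 for code, value in re.findall(r"\$(\w)\s*([^$]*)", rest)]
    return {"tag": tag, "indicators": indicators, "subfields": subfields}


field = parse_marc_line("245 10 $a Mothership connection $h [sound recording] / $c Parliament.")
print(field["tag"], field["indicators"])
for code, value in field["subfields"]:
    print(code, "=", value)
```

Note that the trailing " /" and "." are ISBD punctuation stored inside the data, one of the ways the format is overloaded.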

RDF to the rescue...?

  • IFLA FRBR: Work, Expression, Manifestation, Item, from 1992
  • LC BIBFRAME draft 2012, v2.0 2016
    Bibframe model diagram
  • Data model and vocabulary:
    Work, Instance, Item
    Agent, Subject, Event
  • MARC 21 to BIBFRAME 2.0, BIBFRAME to MARC

Library of Congress

schema.org

Linked data for Archives

Resources

Summary

5-star linked data, RDF, SPARQL, OpenRefine, various tools

data as infrastructure

Questions/Discussion

jhunt@catalyst.net.nz