
Could Oxigraph define a Default Base IRI?

Open sgoetz-brox opened this issue 2 years ago • 13 comments

When trying to upload this Turtle file (example.ttl):

<#> a <http://schema.org/Thing> .

with curl:

curl --request "POST" --header "Content-Type:text/turtle" --upload-file "example.ttl" "http://oxigraph?default"

the upload fails with this error:

error while parsing IRI '#': No scheme found in an absolute IRI on line 1 at position 4


The Turtle file seems to be valid, e.g. according to riot --validate example.ttl.


The relevant section in the Turtle spec is 6.3 IRI References. For my example, this statement applies, I think:

If none of the above specifies the Base URI, the default Base URI (section 5.1.4, "Default Base URI") is used.

The referenced section (5.1.4. Default Base URI) in the URI standard says:

If none of the conditions described above apply, then the base URI is defined by the context of the application.

Would it be possible / make sense for Oxigraph to define such a Default Base IRI? (ideally with the option for users to overwrite it)

In the same section it also says "A sender of a representation containing relative references is responsible for ensuring that a base URI for those references can be established.", so I suppose it could be argued that if Oxigraph doesn’t define one, users shouldn’t upload such files (which is why I submitted this as a feature request instead of a bug report).

sgoetz-brox avatar May 17 '23 09:05 sgoetz-brox
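As a rough illustration of the resolution rule quoted above, the sketch below uses Python's standard urllib.parse.urljoin to resolve relative references against a base, and fails when no base is available. The base http://example.com/data and the helper name resolve are made up for illustration; this is not how Oxigraph resolves IRIs internally.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical base IRI, chosen only for this example.
BASE = "http://example.com/data"


def resolve(reference, base=None):
    """Return an absolute IRI for `reference`, resolving it against `base` if needed."""
    # A reference that already carries a scheme is absolute and is returned unchanged.
    if urlsplit(reference).scheme:
        return reference
    # Without a base there is nothing to resolve against, which mirrors the upload error.
    if base is None:
        raise ValueError(f"cannot resolve relative reference {reference!r} without a base IRI")
    return urljoin(base, reference)


print(resolve("#me", BASE))
print(resolve("http://schema.org/Thing"))
```

With a base supplied, a fragment-only reference such as #me resolves to the base plus that fragment; without one, the helper raises, which is the situation described in this issue.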

Hi! Thank you for trying Oxigraph.

so I suppose it could be argued that if Oxigraph doesn’t define one, users shouldn’t upload such files (which is why I submitted this as a feature request instead of a bug report)

Yes, it is the current behavior.

Would it be possible / make sense for Oxigraph to define such a Default Base IRI? (ideally with the option for users to overwrite it)

Also, yes. It would be a great addition.

If a target named graph is set in the URL (for example http://oxigrah/store?graph=http://example.com), it would make sense to use the target graph URL (in the example, http://example.com) as the base IRI by default. When loading to the default graph (?default), I am less sure. Maybe we should not set a default base IRI in this case and instead allow the user to provide a base URI with another parameter (like &base=http://example.com/base)?

What do you think?

Tpt avatar May 17 '23 19:05 Tpt
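One possible reading of the proposal above, sketched in Python with only the standard library. The function name pick_base_iri is hypothetical, the base parameter is only the suggestion made in the comment above (not a confirmed Oxigraph feature), and letting base take precedence over graph is an assumption of this sketch.

```python
from urllib.parse import parse_qs, urlsplit


def pick_base_iri(request_url):
    """Choose a base IRI from the query string of a Graph Store request URL.

    Assumed precedence: an explicit `base` parameter wins, then the target
    named graph, and loading into the default graph yields no base at all.
    """
    query = parse_qs(urlsplit(request_url).query, keep_blank_values=True)
    if "base" in query:
        return query["base"][0]
    if "graph" in query:
        return query["graph"][0]
    return None  # e.g. `?default` without `&base=...`


print(pick_base_iri("http://oxigraph/store?graph=http://example.com"))
print(pick_base_iri("http://oxigraph/store?default"))
```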

This is the only issue that is preventing me from transitioning my Python library rdfhash to use Oxigraph.

Here I have an example .ttl file showing 64 blank nodes nested within each other, each blank node contains a simple predicate URI <:> (most nested blank node contains the triple _:b <:> <:>):

@prefix : <:> .
[ : [ : [ : [ : [ : [ : [ : [ :
[ : [ : [ : [ : [ : [ : [ : [ :
[ : [ : [ : [ : [ : [ : [ : [ :
[ : [ : [ : [ : [ : [ : [ : [ :
[ : [ : [ : [ : [ : [ : [ : [ :
[ : [ : [ : [ : [ : [ : [ : [ :
[ : [ : [ : [ : [ : [ : [ : [ :
[ : [ : [ : [ : [ : [ : [ : [ :
: 
] ] ] ] ] ] ] ] ] ] ] ] ] ] ] ]
] ] ] ] ] ] ] ] ] ] ] ] ] ] ] ]
] ] ] ] ] ] ] ] ] ] ] ] ] ] ] ]
] ] ] ] ] ] ] ] ] ] ] ] ] ] ] ] .

Getting the error when trying to load the file using pyoxigraph/oxrdflib:

ValueError: No scheme found in an absolute IRI

NeilGraham avatar May 23 '23 17:05 NeilGraham

@NeilGraham Congrats on this library! It looks very nice!

The pyoxigraph Store.load method has a base_iri parameter exactly for that. You need to provide a base URI (for example http://example.com/) and it should work well. The other pyoxigraph parsing methods also allow this parameter.

RDFlib (hence oxrdflib) also has a base parameter in most of its parsing functions.

Tpt avatar May 23 '23 17:05 Tpt

There is an issue because <:> is not a valid URI. The data shown isn't a factor.

A URI either has a scheme, which is at least one character before the colon, or it is a relative reference, where the first segment of a path can not contain a :. (rule path-noscheme leading to segment-nz-nc; rule path-rootless only applies after a scheme).

https://www.rfc-editor.org/rfc/rfc3986#section-3 and https://www.w3.org/TR/rdf-concepts/#iri-abnf

<#> is legal with the external base or using BASE. <x:> is legal as absolute URI.

afs avatar May 23 '23 18:05 afs
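The rule described in the comment above can be approximated with a small check. This is a simplified sketch, not a full RFC 3986 validator: the regular expression covers only the scheme production, and the function name classify and its return labels are made up for illustration.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*:")


def classify(reference):
    """Classify a reference as 'has-scheme', 'relative', or 'invalid'."""
    if _SCHEME.match(reference):
        return "has-scheme"
    # Without a scheme, the first path segment must not contain a colon
    # (rule path-noscheme, whose first segment is segment-nz-nc).
    first_segment = re.split(r"[/?#]", reference, maxsplit=1)[0]
    if ":" in first_segment:
        return "invalid"
    return "relative"


for ref in (":", "#", "x:", "s"):
    print(repr(ref), classify(ref))
```

Under this check, : is rejected because nothing precedes the colon, # is a relative reference that still needs a base, and x: counts as absolute because x is a syntactically valid scheme.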

@Tpt Thank you, I'm excited to speed it up with oxigraph!

@afs Ah ok, I can accept having to specify a scheme, thanks for the RDF 1.2 reference

NeilGraham avatar May 24 '23 00:05 NeilGraham

I'd like to broaden out this issue and revive it to say it would really expand the utility of this library and also be spec compliant to a) allow a default base IRI, and actually even better b) have an option to disable IRI validation.

The linked spec is for the abstract RDF syntax. Clearly, Turtle, a valid concrete syntax, allows for non-absolute IRIs that are resolved by the details of the concrete syntax. This deferral of conformance between abstract and concrete syntaxes is why there are multiple ways to represent RDF in the first place, and clearly the ability to not always specify the whole dang IRI is a highly desired one because it's true of pretty much every syntax.

By enforcing compliance with the abstract spec at the level of the database, we close off entire categories of valid program-level logic. A simple example would be one that wanted, for whatever reason, say space, to use its own internal short node IDs that mapped to absolute IRIs when serializing and sending.

I haven't read the source enough to know if IRI validation is somehow load bearing, but IMO conformance with the abstract spec at the expense of being able to use this library, the closest thing to an sqlite for graph databases that i've been able to find, seems like a not worth it trade to me! I would love to help out with any of the chore of making either a) default base IRIs or b) IRI validation optional.

sorry if my tone is harsh, i have been reading and arguing about the history and culture of RDF literally all day while trying to make a schema layer on top of oxigraph lol, so i am just tired. I really appreciate all y'alls work here!!!!!

sneakers-the-rat avatar Jan 10 '24 12:01 sneakers-the-rat

Hi @sneakers-the-rat

About parsing with a base IRI: setting a base IRI is now supported nearly everywhere for parsing, in the Rust API (example), the Python API (example), the JS API, the CLI (--base argument) and the HTTP API (the ?graph query parameter is used for resolution). It is likely that some APIs are still missing; please report them to me (or even better, open a pull request!).

About serialization with a base IRI: this is a tricky topic because a naive serialization will make either none or all of the IRIs that can be relative actually relative. I am not sure there is a nice behavior in this area.

About making IRI validation optional, it is something I am currently adding. The next release will have a mode to disable validation and so, allow inserting invalid IRIs. Example of API.

make a schema layer on top of oxigraph

Amazing! Thank you for building on top of Oxigraph.

Tpt avatar Jan 11 '24 17:01 Tpt

About making IRI validation optional, it is something I am currently adding.

hell yes!!!! thank you for being responsive (so responsive you started working on it before i even asked!). I was going to PR but I'll hold off if you're already doing it (or i can help if that would be useful!)

About parsing with a base IRI

Awesome, yes saw this for parsing external RDF :). I'm thinking of creating new RDF in such a way that oxigraph can make local IRIs without needing to prepend some bogus fake: IRI prefix just to satisfy the validation, so that when served it can be hydrated with the correct prefix. Just saying that for clarity of the issue history, sounds like you got where i'm coming from!!

The next release will have a mode to disable validation and so, allow inserting invalid IRIs. Example of API.

It seems like this is for parsing without validation? I was imagining some env variable or a parameter given to Store that disables validation for that whole store - I'm not sure if a given store keeps any metadata yet, but that seems like something that you'd want to be true or not for a whole triple store - this one is either made of IRIs or strings that behave like tokens.

So eg. here: https://github.com/oxigraph/oxigraph/blob/c2df0b829d8528218ee77733c0ecf49379fb37cc/lib/oxrdf/src/named_node.rs#L24

Which is exposed to python here: https://github.com/oxigraph/oxigraph/blob/c2df0b829d8528218ee77733c0ecf49379fb37cc/python/src/model.rs#L68

could be switched to using this method which just accepts a string: https://github.com/oxigraph/oxigraph/blob/c2df0b829d8528218ee77733c0ecf49379fb37cc/lib/oxrdf/src/named_node.rs#L39

given some environment variable or store configuration.

About serialization with a base IRI ... I am not sure there is a nice behavior in this area.

Totally, imo that's something left to app-level logic: DB shouldn't have to be responsible for knowing whether or not someone is doing RDF correctly, that sounds like a quick way to sprawl the scope of the project. Definitely not asking the devs to "solve RDF plz."

sneakers-the-rat avatar Jan 12 '24 01:01 sneakers-the-rat

It seems like this is for parsing without validation? I was imagining some env variable or a parameter given to Store that disables validation for that whole store - I'm not sure if a given store keeps any metadata yet, but that seems like something that you'd want to be true or not for a whole triple store - this one is either made of IRIs or strings that behave like tokens.

For now I was mostly thinking it would be a flag that needs to be given every time something might be validated (think NamedNode::new vs NamedNode::new_unchecked) and the .unchecked() option of the new parsers. Oxigraph would still assume that everything inside it is always valid (to avoid doing e.g. escaping on serialization...), and if that is not the case, too bad for the user (broken serialization...). This way we allow people to e.g. load a DBpedia/Wikidata dump without validation because they know it is already valid, while still validating the syntax of the SPARQL requests that users send to the system. And the places where unvalidated content might get introduced are crystal clear in the code (in a Rust-like fashion). In your Python example it would mean e.g. that we would add an unchecked flag to the NamedNode constructor in Python: NamedNode('http://example.com/', unchecked=True). What do you think?

Tpt avatar Jan 12 '24 07:01 Tpt
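To make the "flag given every time" idea concrete, here is a toy sketch in pure Python. The class Node is hypothetical and is not the pyoxigraph NamedNode; the validation is reduced to a scheme check, and the unchecked keyword simply mirrors the spelling floated in the comment above.

```python
import re

# Simplified stand-in for IRI validation: require a syntactically valid scheme.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*:")


class Node:
    """Toy named node: validated by default, with an explicit opt-out per call."""

    def __init__(self, value, *, unchecked=False):
        # The caller must ask for the unchecked path explicitly, so every
        # place where unvalidated content can enter is visible in the code.
        if not unchecked and not _SCHEME.match(value):
            raise ValueError(f"No scheme found in an absolute IRI: {value!r}")
        self.value = value


checked = Node("http://example.com/")
local = Node("short-id", unchecked=True)  # the caller takes responsibility here
print(checked.value, local.value)
```

The design point is that nothing downstream re-validates: anything created through the unchecked path is trusted as-is, so a broken value would only surface later, for instance at serialization time.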

That could work - in working with the python wrapper I'm wondering if there's any way to allow the class to be inherited from? there's a typing.final decorator added, but i also don't know if that's some sort of limitation with the bindings themselves.

That way, we could wrap the class in a python way and get the best of both ways :).

sneakers-the-rat avatar Jan 12 '24 10:01 sneakers-the-rat

I'm wondering if there's any way to allow the class to be inherited from? there's a typing.final decorator added, but i also don't know if that's some sort of limitation with the bindings themselves.

Yes to both. The binding system allows inheritance, but allowing it introduces a small performance hit. Hence the binding system does not enable inheritance by default, and pyoxigraph has not opted in to allow it. I would be curious to know what use case you have where inheritance would be better than composition.

Tpt avatar Jan 12 '24 20:01 Tpt

Aha if it's a performance hit, then I can just wrap around rather than underneath!

I am getting started writing an ORM (literally one day of work so not much to see yet: https://github.com/p2p-ld/pyoxigraph-pydantic ) and had wanted to do something like this for working with namespaces, for example

import urllib.parse
from pyoxigraph import NamedNode as BaseNamedNode

class NamedNode(BaseNamedNode):

    def __getattr__(self, item: str) -> "NamedNode":
        # Resolve the attribute name against this node's IRI
        iri = urllib.parse.urljoin(self.value, item)
        return NamedNode(iri)

    def __mul__(self, item: str) -> "NamedNode":
        # `item` is a plain string, so check it directly for an existing anchor
        if "#" in item:
            raise ValueError("Only one anchor is allowed in an IRI")
        iri = urllib.parse.urljoin(self.value, '#' + item)
        return NamedNode(iri)

to do things like

>>> FOAF = NamedNode('http://xmlns.com/foaf/0.1/')
>>> FOAF.name
NamedNode('http://xmlns.com/foaf/0.1/name')
>>> Nested = NamedNode('http://example.com/')
>>> Nested.term * 'anchor'
NamedNode('http://example.com/term#anchor')

or, since it's related to this issue

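import urllib.parse
from pyoxigraph import NamedNode as BaseNamedNode
from myPackage import config

class NamedNode(BaseNamedNode):

    def __init__(self, value: str, use_default: bool = True):
        if use_default:
            # Resolve relative values against the application-level default base
            value = urllib.parse.urljoin(
                config.default_iri, value
            )
        super().__init__(value)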

to be able to do a default base IRI at the level of the application rather than at the level of the database.

So not to derail the issue - Both of those are examples of the underlying desire to make working with RDF less cumbersome at smaller scales and for program logic, rather than for the traditional cases of large external datasets that already have well-established IRIs. Relaxing the need for complete IRIs lets us blend things that should have absolute IRIs and be unique entities with terms that would be more easily implemented as relative or local entities (this tension is arguably at the heart of the structure of blank nodes, but i won't open that can of worms). Default base IRI is a great step towards that, as would being able to opt-out of validation, etc!

I think the question is basically "is oxigraph a Linked Data Platform database, an abstract RDF database, a concrete N-triples/N-quads database, or a more generalized 'sqlite for graph databases' tool," and the answer to that influences which strategy to take :).

Sorry for being longwinded - I certainly can just wrap around and virtualize all the oxigraph objects if there is some performance hit, I wasn't sure if inheritance was a) impossible or b) just a code correctness thing. Just wanted to clarify a bit why I think relaxing some of the RDF requirements would be a big benefit towards broadening the scope of use for this tool beyond traditional RDF triple store towards an unfilled need for simple, low deployment complexity graph databases. Also understand you and y'all have your own priorities <3 again thanks for the work, have been really enjoying working with oxigraph! (learning rust in the process, which is one of the reasons I'm excited about it!)

sneakers-the-rat avatar Jan 12 '24 22:01 sneakers-the-rat

I think the question is basically "is oxigraph a Linked Data Platform database, an abstract RDF database, a concrete N-triples/N-quads database, or a more generalized 'sqlite for graph databases' tool," and the answer to that influences which strategy to take :).

That's a great question. For now, Oxigraph's description is "a SPARQL graph database", so its main reference is the SPARQL specifications. I definitely do not have the bandwidth to move it to a generic sqlite-for-graphs database, just like AWS Neptune is a generic cloud graph database with both SPARQL and OpenCypher. So, the assumption that the stored data is valid RDF is likely to stay. However, I think it is perfectly OK to allow people to load not perfectly valid RDF (broken IRIs...) with some "unsafe" option. In this case users would be on their own in case of weird behaviors (but I guess there should not be too many). Does that answer your question and sound good to you?

The way I would implement your easy namespace support would be to create a Namespace class like:

from pyoxigraph import NamedNode

class Namespace:
    def __init__(self, prefix: str):
        self.prefix = prefix

    def __getattr__(self, local: str) -> NamedNode:
        # Plain concatenation, like a Turtle prefix
        return NamedNode(self.prefix + local)

This way

FOAF = Namespace('http://xmlns.com/foaf/0.1/')
FOAF.name

would be the exact equivalent of

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
foaf:name

and namespaces and node identifiers are well separated.

have been really enjoying working with oxigraph!

Thank you!

Note: beware that urllib.parse.urljoin does relative IRI resolution, whereas Turtle handles prefixes with simple concatenation. So, in your code, the __getattr__ implementation behaves like Turtle "base" and not like Turtle "prefix". Not sure if that is what you intended.

For example:

@prefix ex: <http://example.com/bar> .
ex:s ex:p ex:o .

will be parsed like

<http://example.com/bars> <http://example.com/barp> <http://example.com/baro> .

whereas

@base <http://example.com/bar> .
<s> <p> <o> .

will be parsed like

<http://example.com/s> <http://example.com/p> <http://example.com/o> .

Tpt avatar Jan 13 '24 10:01 Tpt
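The difference between the two Turtle examples above can be reproduced with the Python standard library alone. The helper names expand_prefixed and resolve_relative are made up for this sketch; the namespace value is the one used in the examples.

```python
from urllib.parse import urljoin

NAMESPACE = "http://example.com/bar"


def expand_prefixed(namespace, local):
    """Turtle @prefix semantics: plain string concatenation."""
    return namespace + local


def resolve_relative(base, reference):
    """Turtle @base semantics: relative reference resolution."""
    return urljoin(base, reference)


print(expand_prefixed(NAMESPACE, "s"))   # http://example.com/bars
print(resolve_relative(NAMESPACE, "s"))  # http://example.com/s
```

Resolution replaces the last path segment of the base (bar) with the reference, while concatenation keeps it, which is why a namespace intended for either style usually ends in / or #.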