sophia_rs

How to get more information from parsers?

Open pchampin opened this issue 4 years ago • 13 comments

Beyond the parsed triples/quads, parsers may collect additional useful information, for example prefix declarations or base IRI. What would be the best API to get this information?

My initial idea was to add methods to the triple/quad source returned by the parse methods, to access this information. For example:

    let mut parsed_triples = sophia::parser::turtle::parse_bufread(ttl_file)?;
    my_graph.insert_all(&mut parsed_triples);
    let prefix_map = parsed_triples.get_prefix_map();

The drawback of this approach is that it forces us to keep the triple source around, even when it is exhausted. Methods such as Graph::insert_all cannot consume it; they have to borrow it mutably (which is rather counterintuitive).

Another approach would be to use a kind of callback:

    let mut parsed_triples = sophia::parser::turtle::parse_bufread(ttl_file)?;
    let mut prefix_map = HashMap::new();
    // register the callback before the source is consumed
    parsed_triples.on_prefix(|prefix, iri| { prefix_map.insert(prefix, iri); });
    my_graph.insert_all(parsed_triples);

This approach might be slightly harder to implement, but offers more flexibility. And it makes it possible to consume sources while still getting the additional information.
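To make the ownership argument concrete, here is a minimal, self-contained sketch of the callback idea. Everything in it (`Event`, `CallbackSource`, `on_prefix`, `insert_into`) is a hypothetical stand-in invented for illustration, not Sophia's actual API. It only shows that a source can be consumed by value while the prefix declarations still end up in a map owned by the caller.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed triple (not Sophia's real type).
type Triple = (String, String, String);

/// Hypothetical events a parser could emit while reading a document.
enum Event {
    Prefix(String, String),
    Triple(Triple),
}

/// Hypothetical triple source that forwards prefix declarations to a callback.
struct CallbackSource<'a> {
    events: Vec<Event>,
    on_prefix: Option<Box<dyn FnMut(&str, &str) + 'a>>,
}

impl<'a> CallbackSource<'a> {
    fn new(events: Vec<Event>) -> Self {
        Self { events, on_prefix: None }
    }

    /// Registers the callback invoked for every prefix declaration.
    fn on_prefix(&mut self, callback: impl FnMut(&str, &str) + 'a) {
        self.on_prefix = Some(Box::new(callback));
    }

    /// Consumes the source: triples go into `graph`, prefixes go to the callback.
    fn insert_into(self, graph: &mut Vec<Triple>) {
        let CallbackSource { events, mut on_prefix } = self;
        for event in events {
            match event {
                Event::Prefix(label, iri) => {
                    if let Some(callback) = on_prefix.as_mut() {
                        callback(&label, &iri);
                    }
                }
                Event::Triple(triple) => graph.push(triple),
            }
        }
    }
}

/// Usage: the source is moved into `insert_into`, yet the map survives it.
fn collect_with_prefixes(events: Vec<Event>) -> (Vec<Triple>, HashMap<String, String>) {
    let mut prefix_map = HashMap::new();
    let mut graph = Vec::new();
    let mut source = CallbackSource::new(events);
    source.on_prefix(|label, iri| {
        prefix_map.insert(label.to_string(), iri.to_string());
    });
    source.insert_into(&mut graph); // `source` is consumed here
    (graph, prefix_map)
}
```

The closure borrows `prefix_map` only for as long as the source lives, so once the source has been consumed the map can be used freely.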

@Tpt @MattesWhite any thoughts?

pchampin avatar Feb 25 '20 10:02 pchampin

I don't know if such information is that valuable. For me the task of a parser is to read and interpret the contents of an RDF document. Therefore, it should return only absolute IRIs, resolving prefixes and base IRI. Accordingly, the information someone could get from a parser should already be included in the triples returned.

As a special case, retrieving the base IRI from a Turtle document (or another Notation3-derived format) can be ... dangerous. As stated in section 6.3 of the RDF 1.1 Turtle specification:

[...] Each @base or BASE directive sets a new In-Scope Base URI, relative to the previous one.

So there can be an unlimited number of @base directives in a Turtle document, doing some fun stuff like:

    # This is valid Turtle
    @base <http://example.com/fun/stuff/> .
    <Peter> a <../Human> .

    @base <../> .
    <Human> <#hasEntity> <stuff/Peter> .

So what base IRI should be returned?

Of course, I see that there are cases where it can be beneficial to yield such information and not to convert IRIs immediately, but I think those are heavily optimized use cases, and in such cases parsers should return their own type of Graph rather than implement the streaming interface.

MattesWhite avatar Feb 25 '20 13:02 MattesWhite

For me the task of a parser is to read and interpret the contents of an RDF document. Therefore, it should return only absolute IRIs, resolving prefixes and base IRI.

Absolutely, I'm not suggesting otherwise.

Consider the following use case, though. I get a Turtle file, I parse it as a graph, then apply some changes to the graph. Then I want to serialize it back to Turtle, into something as similar as possible to the original file. For this, I need to "remember" which prefixes were declared in the original file (and possibly its @base declaration), because this information is no longer available in the graph itself.

Note that in other APIs (such as RDFLib in Python and Jena in Java), every graph has an associated prefix mapping, precisely to address the use case above. But I never liked this design, because for me the graph should be only the abstract syntax.

Granted, some Turtle files might override their @base or some prefix in the middle of the file, making such a "round-trip" nearly impossible to achieve. But at least for the majority of "simple" cases, we should provide a simple way to do it.

pchampin avatar Mar 02 '20 14:03 pchampin

And for the record, the more I think about it, the more I'm leaning towards the callback solution. Among other things, it has the advantage of allowing us to consume the TripleSource or QuadSource, as we would do with a standard iterator.

pchampin avatar Mar 02 '20 15:03 pchampin

Concerning #23, I get the feeling that this is too much detail for a common API, as this topic is pretty specific to Turtle (e.g. N-Triples has no prefixes and JSON-LD has its "@context") and targets a "majority of simple cases".

You said yourself in another issue that you like to keep sophia as general as applicable which I totally agree with. Therefore, I would suggest to leave this question open for implementers of Turtle parsers. In addition, this would increase the incompatibility with rio's parser.


BTW, metis now also includes a Turtle parser. I wrote it as an intermediate step towards a Notation3 parser (which I have started, but which does not work yet). It is still WIP and has many rough edges, but it parses at least most documents. It is not yet aligned with sophia's parser API, but I plan to do so in the future. Maybe you can use it as additional inspiration for "RDF in Rust".

MattesWhite avatar Mar 03 '20 08:03 MattesWhite

this topic is pretty specific to Turtle

I beg to differ: except for N-Triples and N-Quads, all major concrete syntaxes have a notion of base and prefix binding (Turtle, TriG, N3, RDF/XML, JSON-LD, RDFa).

pchampin avatar Mar 03 '20 15:03 pchampin

Hi @pchampin , is there any progress on this?

Could we just add a prefixes hashmap to the Dataset trait?

Sophia: a Rust toolkit for RDF and Linked Data

Sophia is a toolkit to work with RDF and Linked Data, not a canonical representation of RDF. When building a toolkit, it is important to consider developer experience.

From my point of view, in an RDF graph there are only 2 things that matter to developers: the list of triples/quads and the map of prefixes.

Prefixes are important:

  • For humans because they make the RDF more readable
  • For machines because they make the RDF smaller in size (important with large datasets)
  • For humans and machines they give more information on the Linked Data space in which this RDF operates

But right now Sophia only lets us work with 1 of those 2 essential components of RDF.

Ideally this prefixes hashmap would be populated automatically when parsing. The rio_turtle TurtleParser and TriGParser structs already populate a prefixes: HashMap<String, String> when parsing the file, with a public function .prefixes() to get the prefixes hashmap. We would just need to collect the prefixes at the end of the collect_quads() function. There could also be an add_prefix() function on the Dataset.

What do you think? I can help with implementing this if needed

vemonet avatar Mar 12 '24 10:03 vemonet

Syntactically the Turtle format allows redefining/shadowing existing prefixes (and bases!) at any point within the file, making that line a pivot point for how prefixed names and relative IRIs are resolved. Two textually identical foo:bar terms before and after such a line may resolve very differently during parsing. It's only an informal convention that many Turtle files set a base or prefix one time each, but this is not actually a requirement.

So a HashMap of prefix-label-to-value already isn't enough to accurately capture the full spectrum of what the file format would permit while being syntactically valid. This flexibility in the format makes armchair design of the proposed API and data types needed to fulfill the ask pretty non-trivial.

shanesveller avatar Mar 12 '24 14:03 shanesveller

Hi @shanesveller, thanks for the feedback! The URIs stored in the graph are already resolved, so it does not matter if some prefixes are redefined midway. We can just keep the last prefix defined (which is what the RIO turtle parser already does).
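As a rough illustration of this "last declaration wins" behaviour, the sketch below folds a sequence of prefix declarations into a plain `HashMap`. The function name `last_declared_prefixes` is made up for this example and is not part of Rio or Sophia. The sketch says nothing about how either parser is implemented; it only shows what information such a map keeps and what it drops.

```rust
use std::collections::HashMap;

/// Folds prefix declarations, given in document order, into a map in which
/// a later declaration of the same label overwrites the earlier one.
/// (Hypothetical helper, for illustration only.)
fn last_declared_prefixes<'a>(
    declarations: impl IntoIterator<Item = (&'a str, &'a str)>,
) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for (label, iri) in declarations {
        // `insert` replaces any previous value stored under the same label.
        map.insert(label.to_string(), iri.to_string());
    }
    map
}
```

If a document declares `ex:` twice, only the second namespace survives in the map. Triples parsed before the redefinition keep their fully resolved IRIs, but a serializer using this map would no longer abbreviate them with `ex:`.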

So a HashMap of prefix-label-to-value already isn't enough to accurately capture the full spectrum of what the file format would permit while being syntactically valid.

We do not want to "capture the full spectrum of what the turtle file can define". We just want to capture the prefixes used in an already defined RDF resource (e.g. a file), for which a HashMap<String,String> is perfectly enough. (I don't see anything else that we need to properly capture the prefixes, to be honest. We could use the curies.rs crate we are currently developing, but that feels overkill for the current use :p )

And for the edge case of the 2 people who are having fun redefining prefixes in the middle of their Turtle files: it is OK, we will "lose" the non-important information of the first prefix defined. But that does not take anything away from how helpful making prefixes available in Sophia will be to RDF developers!

This flexibility in the format makes armchair design of the proposed API and data types needed to fulfill the ask pretty non-trivial.

What do you mean by flexibility of the format and armchair design? We are not going to do any parsing of the Turtle format; I just propose that Sophia properly integrate the prefixes hashmap already returned by the parsers it is using.

It could be implemented quite fast and easily:

  1. we add the prefixes hashmap on the GenericLightDataset and GenericFastDataset
  2. apparently the RIO turtle parser already handles it, so if it works as expected all we need is to save the prefixes hashmap at the end of the .collect_quads() function in the Dataset returned by .collect_quads()

I am not sure what is non-trivial here? There are no changes proposed to the API, and nothing we need to implement. We just pick up the prefixes already served by the RIO turtle parser at the end of .collect_quads(), and it's all benefits. No breaking changes in the API, just more features!

vemonet avatar Mar 12 '24 18:03 vemonet

Hi @pchampin , is there any progress on this?

Unfortunately not at the moment.

Could we just add a prefixes hashmap to the Dataset trait?

Definitely not! See below.

What this issue is not about

From my point of view, in an RDF graph there are only 2 things that matter to developers: the list of triples/quads and the map of prefixes.

I strongly disagree. Prefix maps are not part of the RDF graph. They are part of some serialization formats, but they are in no way intrinsic to the graph.

For machines [prefix maps] make the RDF smaller in size (important with large datasets)

No, AFAIK, RDF storage implementations use other strategies for efficiency (like indexing). They don't rely on prefix maps.

Now, I don't deny that prefix maps have a great value for developers, and that's why I opened this issue in the first place: I want to be able to get (a good approximation of) the prefixes declared in the parsed content, so that I can use them later (in particular when I serialize back the graph).

However, I don't want to make the prefix map part of the Graph or Dataset traits (even though several other RDF APIs do that), precisely because it perpetuates the misconception that prefixes are "part" of the data model. What I could live with would be a WithPrefixMap trait, that some implementations of Graph or Dataset could also implement, if they really want to. But that should definitely be a separate trait. But that's actually not what this issue is about.
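For illustration, a separate opt-in trait along these lines could look like the sketch below. The name `WithPrefixMap` comes from the comment above, but its methods, the `AnnotatedGraph` type and the `abbreviate` helper are assumptions invented for this example; none of them exist in Sophia. The point is only that code needing prefixes can bound on the extra trait, while graph and dataset traits stay free of any prefix notion.

```rust
use std::collections::HashMap;

/// Hypothetical opt-in trait, deliberately separate from any Graph/Dataset trait.
trait WithPrefixMap {
    fn prefix_map(&self) -> &HashMap<String, String>;
    fn prefix_map_mut(&mut self) -> &mut HashMap<String, String>;
}

/// Hypothetical graph type that chooses to carry a prefix map next to its triples.
struct AnnotatedGraph {
    triples: Vec<[String; 3]>,
    prefixes: HashMap<String, String>,
}

impl WithPrefixMap for AnnotatedGraph {
    fn prefix_map(&self) -> &HashMap<String, String> {
        &self.prefixes
    }

    fn prefix_map_mut(&mut self) -> &mut HashMap<String, String> {
        &mut self.prefixes
    }
}

/// Abbreviates `iri` with the longest matching namespace, if there is one.
/// Works with any type that opts into `WithPrefixMap`.
fn abbreviate<G: WithPrefixMap>(graph: &G, iri: &str) -> Option<String> {
    graph
        .prefix_map()
        .iter()
        .filter(|(_, namespace)| iri.starts_with(namespace.as_str()))
        .max_by_key(|(_, namespace)| namespace.len())
        .map(|(label, namespace)| format!("{}:{}", label, &iri[namespace.len()..]))
}
```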

What this issue is about

This issue is not about bundling the prefix map in the graph, but about extracting the prefix map from the parser (which, anyway, would be necessary to bundle it with the graph if we really wanted to).

Currently, the Rio parsers on which most Sophia parsers are based do not make this possible. That's why this issue has been stalling; we first need to change Rio, then reflect this change in Sophia.

Furthermore, Rio is no longer actively maintained. Its main developer has moved towards a new parser architecture (see for example https://github.com/oxigraph/oxigraph/tree/main/lib/oxttl). Ultimately, I might drop Rio and use the new Oxigraph parsers instead. But that's also a major refactoring.

What do you think? I can help with implementing this if needed

Help is always welcome :) The most future-proof path would probably be

  • make a PR to oxttl to make it possible to retrieve the prefix map after/during parsing
  • if that works, make an implementation of Sophia parsers based on oxttl
  • expose the "prefix-map extraction" features of oxttl in the Sophia parser
  • integrate the latter in the TripleParser and QuadParser traits

I realize this is a big workplan...

pchampin avatar Mar 19 '24 18:03 pchampin

make a PR to oxttl to make it possible to retrieve the prefix map after/during parsing

It's already done with the prefixes method

Tpt avatar Mar 19 '24 20:03 Tpt

It's already done with the prefixes method

I did look quickly at the oxttl doc (too quickly, obviously!) but I missed it. That's great news. All the more reason to migrate to oxttl...

pchampin avatar Mar 21 '24 14:03 pchampin

Actually, I found the prefixes method even in RIO: https://github.com/oxigraph/rio/blob/main/turtle/src/turtle.rs#L67 (not sure if it actually works though). That's why I said I thought it should not be too hard to implement.

We just need to call this method after the last step of parsing :D (but it is also a good opportunity to upgrade to oxttl)

However, I don't want to make the prefix map part of the Graph or Dataset traits (even though several other RDF APIs do that), precisely because it perpetuates the misconception that prefixes are "part" of the data model. What I could live with would be a WithPrefixMap trait, that some implementations of Graph or Dataset could also implement, if they really want to. But that should definitely be a separate trait.

I don't mind if this is implemented as a different trait DatasetWithPrefixes so we can keep proper separation of concerns. All I want is to get a single object out of parsing! One that I can use easily for triple matching and serializing with the right prefixes

But that's actually not what this issue is about.

It is nice that we talk about where to put the PrefixMap here, because this way if we want to help you implement it we already know where to start ;)

vemonet avatar Mar 28 '24 18:03 vemonet

Actually, I found the prefixes method even in RIO: https://github.com/oxigraph/rio/blob/main/turtle/src/turtle.rs#L67 (not sure if it actually works though). That's why I said I thought it should not be too hard to implement.

:open_mouth: ok, sorry for missing that one! So yes, we have a low-hanging fruit here for solving this issue, at least for sophia_turtle. Adding a method to TurtleParser and TriGParser to retrieve the current prefix map sounds like an easy add.

(...) I don't mind if this is implemented as a different trait DatasetWithPrefixes so we can keep proper separation of concerns. All I want is to get a single object out of parsing!

Please note that you currently get a TripleSource or a QuadSource out of parsing, not a Graph or a Dataset. So you would still need two steps when parsing:

  • insert the triple/quads into the graph/dataset (or alternatively collect a new graph/dataset from the source)
  • then extract the final prefix map from the parser and attach it to your graph/dataset

I don't see an easy way to automate that, because I don't think it even makes sense to attach a prefix-map to a TripleSource or a QuadSource -- because the prefix map is potentially changing while the source is being parsed...
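The two steps, and the reason the prefix map only becomes final once the source is exhausted, can be sketched with a toy parser. `ToyParser`, `Event` and `parse_with_prefixes` are hypothetical names used for illustration only; they mirror neither Sophia's nor Rio's real types.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a parsed triple.
type Triple = (String, String, String);

/// Hypothetical events found in a document, in document order.
enum Event {
    Prefix(String, String),
    Triple(Triple),
}

/// Toy parser: yields triples and updates its prefix map as it goes.
struct ToyParser {
    events: std::vec::IntoIter<Event>,
    prefixes: HashMap<String, String>,
}

impl ToyParser {
    fn new(events: Vec<Event>) -> Self {
        Self { events: events.into_iter(), prefixes: HashMap::new() }
    }

    /// Prefix map as of the last event read; it may still change later.
    fn prefixes(&self) -> &HashMap<String, String> {
        &self.prefixes
    }
}

impl Iterator for ToyParser {
    type Item = Triple;

    fn next(&mut self) -> Option<Triple> {
        for event in self.events.by_ref() {
            match event {
                Event::Prefix(label, iri) => {
                    self.prefixes.insert(label, iri);
                }
                Event::Triple(triple) => return Some(triple),
            }
        }
        None
    }
}

/// Step 1: drain the source into a collection.
/// Step 2: only then read the (now final) prefix map.
fn parse_with_prefixes(mut parser: ToyParser) -> (Vec<Triple>, HashMap<String, String>) {
    let graph: Vec<Triple> = parser.by_ref().collect();
    let prefixes = parser.prefixes().clone();
    (graph, prefixes)
}
```

Reading `prefixes()` after only part of the input has been consumed returns whatever was declared up to that point, which is why attaching the map to the source itself would be misleading.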

One that I can use easily for triple matching and serializing with the right prefixes

But that's actually not what this issue is about.

It is nice that we talk about where to put the PrefixMap here, because this way if we want to help you implement it we already know where to start ;)

Help is always welcome :smile:. I just want to avoid getting a PR merging two different issues (exposing a prefix method in parsers, and allowing prefix maps to be attached to graphs/datasets). I might refuse such a PR for the 2nd aspect, while the first one would be ok. This would waste everybody's time.

pchampin avatar Mar 29 '24 11:03 pchampin