rdftab.rs

statements table cannot have a primary key defined over it due to nulls

Open cmungall opened this issue 4 years ago • 13 comments

For some applications it is desirable to declare primary keys for tables. For example, sqlalchemy requires all mapped tables to have a PK if the ORM layer is to be used.

If we were to declare a primary key for statements we would need to either

  1. Add an additional statement id column and declare it PK
  2. Use a unique combination such as s,p,o,v as a composite PK

The first has disadvantages, e.g. the overhead of adding SEQUENCEs and divergence from the RDF model.

The second seems reasonable. However, many RDBMSs do not allow NULLs in a PK (sqlite is more lax here). And even though sqlite allows it, certain tools such as the sqlalchemy ORM will not allow NULLs in PK fields. By design, with statements, EITHER object OR value MUST be NULL.

e.g. given:

from sqlalchemy import Column, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Statements(Base):
    """
    Represents an RDF triple
    """
    __tablename__ = 'statements'
    stanza = Column(Text, primary_key=True)
    subject = Column(Text, primary_key=True)
    predicate = Column(Text, primary_key=True)
    object = Column(Text, primary_key=True)
    value = Column(Text, primary_key=True)
    datatype = Column(Text, primary_key=True)
    language = Column(Text, primary_key=True)

if we try to insert a row with an rdfs:label (which has a NULL object) we get:

sqlalchemy.orm.exc.FlushError: Can't update table statements using NULL for primary key value on column statements.object

This is OK if one doesn't want to use the sqlalchemy ORM layer (the lower-level Core layer should work just fine), but it might be good to be aware of this limitation for future evolution.
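For contrast, plain SQLite really does accept such a composite PK with NULLs. A minimal sketch using the standard library (column names follow the statements table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A composite PK over nullable TEXT columns: SQLite accepts this on an
# ordinary rowid table, and its underlying unique index treats NULLs
# as distinct values.
con.execute("""
    CREATE TABLE statements (
        subject TEXT, predicate TEXT, object TEXT, value TEXT,
        PRIMARY KEY (subject, predicate, object, value))""")
con.execute("INSERT INTO statements VALUES ('ex:a', 'rdfs:label', NULL, 'A')")
# Because NULL is never equal to NULL, the "same" row goes in again
# without a constraint violation:
con.execute("INSERT INTO statements VALUES ('ex:a', 'rdfs:label', NULL, 'A')")
count = con.execute("SELECT count(*) FROM statements").fetchone()[0]
```

So the PK here not only tolerates NULLs, it also fails to enforce uniqueness for rows containing them.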

Aside: I used to be more of a computer science purist and decried ORMs as evil, impedance mismatch yadda yadda. But the sqlalchemy ORM is quite nice and I see the advantages of ORMs in solving more pedestrian problems...

I don't think we should change anything as I think the existing structure is nice and simple and good for certain kinds of queries.

I think the solution here is: if you want to do CRUD with an ORM, then define a separate schema in which object and value are combined into one field (e.g. a basic "triples" table), and define a trivial bridge layer between rdftab.statements and your.triples (e.g. populate one from the other via INSERT INTO ... SELECT ...). This is a separate concern from rdftab.
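A minimal sketch of that bridge, assuming a hypothetical `triples` table whose object column is NOT NULL (the table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE statements (stanza TEXT, subject TEXT, predicate TEXT,
                             object TEXT, value TEXT, datatype TEXT, language TEXT);
    -- hypothetical ORM-friendly table: no NULLs in the composite PK
    CREATE TABLE triples (subject TEXT NOT NULL, predicate TEXT NOT NULL,
                          object TEXT NOT NULL,
                          PRIMARY KEY (subject, predicate, object));
    INSERT INTO statements VALUES
        ('ex:a', 'ex:a', 'rdf:type', 'owl:Class', NULL, NULL, NULL),
        ('ex:a', 'ex:a', 'rdfs:label', NULL, 'A', NULL, NULL);
    -- the bridge: collapse object/value into a single column
    INSERT INTO triples
    SELECT subject, predicate, COALESCE(object, value) FROM statements;
""")
n = con.execute("SELECT count(*) FROM triples").fetchone()[0]
```

Since exactly one of object/value is NULL per row, COALESCE always yields a non-NULL object, and the composite PK on triples is legal everywhere.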

This issue can be immediately closed, just wanted to flag this if useful for planning future evolution.

cmungall avatar May 12 '21 02:05 cmungall

Note that one consequence of the lack of PKs is the lack of any enforcement against triple duplication, e.g.

(venv) ~/repos/semantic-sql(main) $ sqlite3 db/pato.db "SELECT count(*) FROM statements"
73122
(venv) ~/repos/semantic-sql(main) $ ./bin/rdftab db/pato.db < owl/pato.owl 
(venv) ~/repos/semantic-sql(main) $ sqlite3 db/pato.db "SELECT count(*) FROM statements"
109683
(venv) ~/repos/semantic-sql(main) $ ./bin/rdftab db/pato.db < owl/pato.owl 
(venv) ~/repos/semantic-sql(main) $ sqlite3 db/pato.db "SELECT count(*) FROM statements"
146244

I think this lack of enforcement is a good thing. Different ontologies may contain the same triple (via import modules etc.; although having a named graph column would alleviate this, see #13). It would be tedious to filter these in advance, and it would be slow for the code to have to do dupe checks.

The solution here is to put the burden on the client: have good loading hygiene, avoid loading the same thing multiple times, and possibly do SQL DELETEs to remove dupes (this may not be super-trivial, since NULL != NULL in SQL... it's like a blank node!). Just wanted to flag this too.
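One NULL-safe way to remove dupes in SQLite: unlike the `=` operator, GROUP BY treats NULLs as equal, so keeping the lowest rowid per group works. A sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE statements (stanza TEXT, subject TEXT, predicate TEXT,
               object TEXT, value TEXT, datatype TEXT, language TEXT)""")
# Load the same statement three times, as if the same file were loaded thrice:
row = ('ex:a', 'ex:a', 'rdfs:label', None, 'A', None, None)
con.executemany("INSERT INTO statements VALUES (?,?,?,?,?,?,?)", [row, row, row])
# GROUP BY groups NULLs together (unlike =), so this dedup is NULL-safe:
con.execute("""
    DELETE FROM statements WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM statements
        GROUP BY stanza, subject, predicate, object, value, datatype, language)""")
count = con.execute("SELECT count(*) FROM statements").fetchone()[0]
```

This relies on the implicit rowid of an ordinary SQLite table, so no extra ID column is needed for the cleanup itself.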

cmungall avatar May 12 '21 02:05 cmungall

I still think ORMs are a bad idea but I do care about NULLs and primary keys. I don't mind reworking things if we can find a better design.

I've tried lots of ways to represent RDF over the years. With RDFTab I decided to try lots of columns even though there are lots of NULLs -- call the current design "Schema 0".

The core of the problem is that RDF objects come in four forms: ID, plain literal, typed literal, language literal. We need at least three columns: subject, predicate, object. We can add a graph column. The stanza column is an optimization to avoid chasing blank nodes across the whole table. If we use thick triples then stanza is not needed.

I've given a fair bit of thought to a column with a hash of the triple, carefully designed for checking triple identity, but that seems complicated on a number of levels.

We could use just these columns and avoid NULLs by always using JSON to represent the object: {"object": "ex:foo"}, {"value": "Foo"}. I don't like this because it's a pain to do the simplest thing and just query for literals such as labels. Call this "Schema 1".

The next simplest thing is to add a datatype column and overload it -- call this "Schema 2":

  • graph: ID or NULL?
  • stanza: ID
  • subject: ID
  • predicate: ID
  • object: ID or literal value
  • datatype:
    • ID (typed literal): xsd:integer
    • @lang (language literal): @en-CA
    • keyword for a plain literal (legacy?): :plain
    • keyword for an ID: :id

For thick triples we could add another datatype for the case when the object is a nested JSON object.
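To make the overloading concrete, here is a sketch of Schema 2 with one row for each of the four object forms (keyword spellings as proposed above; the table name and example IRIs are illustrative). With no NULLs, a composite PK is legal:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE statements2 (
        stanza TEXT NOT NULL, subject TEXT NOT NULL, predicate TEXT NOT NULL,
        object TEXT NOT NULL, datatype TEXT NOT NULL,
        PRIMARY KEY (stanza, subject, predicate, object, datatype));
    INSERT INTO statements2 VALUES
        ('ex:a', 'ex:a', 'rdf:type',     'owl:Class', ':id'),         -- ID
        ('ex:a', 'ex:a', 'rdfs:label',   'A',         ':plain'),      -- plain literal
        ('ex:a', 'ex:a', 'ex:size',      '5',         'xsd:integer'), -- typed literal
        ('ex:a', 'ex:a', 'rdfs:comment', 'Eh',        '@en-CA');      -- language literal
""")
# The simple query for labels stays simple:
labels = con.execute(
    "SELECT object FROM statements2 WHERE predicate = 'rdfs:label'").fetchall()
```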

Do you have any other alternatives?

For writing queries and working with results, I think Schema 1 is strictly worse than 0 and 2, so I want to discard Schema 1. Schema 0 and Schema 2 have different tradeoffs and annoyances, but it seems to me that they're roughly the same.

One key difference could be indexing. For Schema 0 we would build different indexes for IRIs and literals, but for Schema 2 they'd share the same index. I don't know if that's better or worse; I haven't tested it.
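One way this could play out in SQLite (a sketch of the DDL only, not a performance test): Schema 0 can use partial indexes to split IRI objects from literal values, while Schema 2 gets a single combined index:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE s0 (subject TEXT, predicate TEXT, object TEXT, value TEXT);
    -- Schema 0: separate partial indexes for IRI objects and literal values
    CREATE INDEX s0_obj ON s0(predicate, object) WHERE object IS NOT NULL;
    CREATE INDEX s0_val ON s0(predicate, value) WHERE value IS NOT NULL;

    CREATE TABLE s2 (subject TEXT, predicate TEXT, object TEXT, datatype TEXT);
    -- Schema 2: one index covers IRIs and literals alike
    CREATE INDEX s2_obj ON s2(predicate, object);
""")
indexes = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name")]
```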

If you think Schema 2 is better on the whole, then I'd be OK with moving in that direction.

jamesaoverton avatar May 12 '21 12:05 jamesaoverton

Great analysis. I think I am 100% in alignment. I didn't think there was a good solution when I wrote my ticket, but I am actually liking schema 2.

Mild preference for non-CURIE spellings of the two keyword values for datatype (e.g. just id or plain), but that is very minor.


cmungall avatar May 12 '21 16:05 cmungall

What do you think of my proposal @beckyjackson, @lmcmicu, @ckindermann?

jamesaoverton avatar May 12 '21 18:05 jamesaoverton

I'm not sure I see the overhead issue with having an extra ID column as the primary key. It would make inserting into the database more efficient (and thus probably make it faster to generate a given DB), plus save on the space required to actually instantiate the primary key (as compared with the other proposals).

But that aside, I think Schema 2 sounds reasonable. I'm not sure, though, that I understand the need for two different datatypes to cover these two cases:

  • ID (typed literal): xsd:integer
  • keyword for an ID: :id

In both cases, wouldn't these just be URIRefs?

lmcmicu avatar May 18 '21 14:05 lmcmicu

Further to the first part of my comment (regarding having an auto-incremented ID column as a primary key): after thinking on it a bit more, I don't think it would actually save space, since we'd still want a secondary index on the fields that need to be queried over. However, secondary indexes can contain NULLs unproblematically. And this doesn't nullify my first observation: an auto-incremented ID primary key would make insertions a lot faster than the other proposals for a primary key, which seems an important consideration regardless of what is decided about the index used for queries.
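A sketch of that layout in SQLite: `INTEGER PRIMARY KEY` aliases the rowid (so inserts stay cheap and the ID is auto-assigned), and the secondary index happily holds NULLs and duplicates:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE statements (
        id INTEGER PRIMARY KEY,  -- alias for rowid; auto-assigned, cheap
        stanza TEXT, subject TEXT, predicate TEXT,
        object TEXT, value TEXT, datatype TEXT, language TEXT);
    -- secondary index for queries; NULLs and duplicates are no problem here
    CREATE INDEX statements_spov ON statements(subject, predicate, object, value);
""")
cur = con.execute(
    "INSERT INTO statements (subject, predicate, value) "
    "VALUES ('ex:a', 'rdfs:label', 'A')")
new_id = cur.lastrowid  # the auto-assigned PK of the inserted row
```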

lmcmicu avatar May 18 '21 14:05 lmcmicu

Some initial thoughts on the auto-increment ID:

  • we would be deviating from the RDF spec, and we would have to think through the semantics of operations like merge
  • how would this play with reification? If statements are reified in either RDF or OWL, a statement/axiom node is generated (this can be blank or an IRI in RDF, but in OWL it must be blank). There can be zero to many such nodes per triple.

One possibility would be to make statements map to RDF/OWL Statements rather than triples, with an auto-reification step that gives any unreified triple a fresh blank node. This has some advantages, particularly for efficient querying of reified triples, and we have to deal with blank nodes for merge semantics anyway. But there are disadvantages here too: just take a look at the rdfstar mailing list to get an idea of the complexities that emerge when you try to introduce statements as first-class entities.

cmungall avatar May 18 '21 18:05 cmungall

Some initial thoughts on the auto-increment ID:

* we would be deviating from the RDF spec, and we would have to think through the semantics of operations like merge

I wasn't looking at it from this perspective, since I am still learning as far as RDF goes. I was just speaking to the "overhead" point that was made initially: the added overhead, in and of itself, doesn't seem problematic to me. I also wanted to mention that if we did this, then we could use secondary indexes, which (if I'm not mistaken) can have NULLs and duplicates. I can't really speak to the other points about the RDF spec and about reification.

That said, I think James's Schema 2 sounds reasonable.

lmcmicu avatar May 18 '21 22:05 lmcmicu

Thanks Michael, great points!

cmungall avatar May 19 '21 00:05 cmungall

Schema 2 sounds reasonable to me as well.

Note that one consequence of the lack of PKs is the lack of any enforcement against triple duplication [...].

Even with PKs there is no straightforward way to avoid triple duplication, as far as I can see. Consider the thick triples D subclassOf ObjectIntersectionOf(A,B,C) and D subclassOf ObjectIntersectionOf(B,C,A). These triples are the same (in OWL), yet they would have different values for the PK. So if a client manipulates such triples without maintaining a consistent order for elements in RDF lists (or sets in thick triples), it is very easy to end up with an exponential number of duplicates.

So,

The solution here is to put the burden on the client, to have good loading hygiene and to avoid loading the same thing multiple times.

seems the way to go!
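Client-side hygiene for this case could include canonicalizing order-insensitive constructs before loading. A hypothetical sketch (the thick-triple JSON shape and construct names here are invented for illustration, not rdftab's actual format):

```python
import json

SET_LIKE = {"ObjectIntersectionOf", "ObjectUnionOf"}  # assumed construct names

def canonical(expr):
    """Sort the operands of order-insensitive constructs so that, e.g.,
    ObjectIntersectionOf(A,B,C) and ObjectIntersectionOf(B,C,A) serialize
    identically and collapse to a single row."""
    if isinstance(expr, dict):
        expr = {k: canonical(v) for k, v in expr.items()}
        if expr.get("type") in SET_LIKE:
            # json.dumps gives a total order over nested operands
            expr["operands"] = sorted(expr["operands"], key=json.dumps)
    elif isinstance(expr, list):
        expr = [canonical(v) for v in expr]
    return expr

a = {"type": "ObjectIntersectionOf", "operands": ["ex:A", "ex:B", "ex:C"]}
b = {"type": "ObjectIntersectionOf", "operands": ["ex:B", "ex:C", "ex:A"]}
```

Running `canonical` over both expressions yields the same structure, so a dedup keyed on the serialized form would catch them.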

ckindermann avatar May 19 '21 09:05 ckindermann

Thanks for your input, everyone. Nobody is objecting to Schema 2, so I'll make a plan to move towards that.

jamesaoverton avatar May 19 '21 12:05 jamesaoverton

If we're going to switch to Schema 2, what do we think about having very short one-letter column names?

  • g → graph
  • s → subject
  • p → predicate
  • o → object
  • d → datatype
  • z → stanza

Note that thick triples don't need a stanza column, and Chris isn't using it for semantic-sql so far.

jamesaoverton avatar May 25 '21 20:05 jamesaoverton

No objections from me.

lmcmicu avatar May 26 '21 13:05 lmcmicu