Shelve basicindex

Open jpmccu opened this issue 6 years ago • 22 comments

Shelve is a local key-value database in Python that stores native Python objects. I have made a simple store with it that optimizes for read and write, while keeping indexing simple and local. The data structure is a tree that looks like this (including the native store):

{ "context" : { "subject": { "predicate" : set(["object"]) } } }

The context-level is the unit of storage in Shelve, and so operations are performed by reading and writing whole contexts in memory, then storing them as a unit to disk on each mutation. This gets shaky with really big contexts, but is performant when contexts aren't huge. Additionally, an LRU cache is enabled so that sequential and near-sequential mutations to the context don't require lots of disk reads. Searches that require grabbing all of a subtree from the data structure should be pretty fast. Finding all the subjects with a matching object will be an order N operation over the graph, which is worst case performance.

Order-N transformations like data generation, simple filtering, format conversion, etc. are therefore optimal, but don't go doing complex graph queries with it. It might work for a linked data server if you don't supply a SPARQL endpoint on top of it.
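As a rough illustration of the layout described above, here is a minimal sketch that uses only the standard-library `shelve` module. It is not the code from this PR: the helper names `add` and `triples` are hypothetical, the LRU cache is left out, and terms are plain strings rather than RDFLib nodes.

```python
import os
import shelve
import tempfile


def add(db, context, s, p, o):
    """Load the whole context tree, mutate it in memory, write it back."""
    tree = db.get(context, {})
    tree.setdefault(s, {}).setdefault(p, set()).add(o)
    db[context] = tree  # the whole context is re-pickled on every mutation


def triples(db, context, s=None, p=None, o=None):
    """Yield matching (s, p, o) triples; None acts as a wildcard."""
    tree = db.get(context, {})
    subjects = [s] if s is not None else list(tree)
    for subj in subjects:
        preds = tree.get(subj, {})
        for pred in ([p] if p is not None else list(preds)):
            for obj in preds.get(pred, ()):
                if o is None or obj == o:
                    yield (subj, pred, obj)


with tempfile.TemporaryDirectory() as tmp:
    with shelve.open(os.path.join(tmp, "graph")) as db:
        add(db, "ctx", "alice", "knows", "bob")
        add(db, "ctx", "carol", "knows", "bob")
        # Subject bound: only one subtree of the context is walked.
        by_subject = sorted(triples(db, "ctx", s="alice"))
        # Only the object bound: every subject and predicate is visited.
        by_object = sorted(triples(db, "ctx", o="bob"))
```

The object-only lookup has no index to lean on, which is the order-N case mentioned above.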

jpmccu avatar May 22 '18 22:05 jpmccu

Additional issues: changes needed for dbm on 2.7 mean breaking 3, since dbm doesn't like unicode keys. Also, whatever backend is being used in my local test doesn't scale well, so I need to find a better one. Watch this space (or not) for updates.

jpmccu avatar May 23 '18 00:05 jpmccu

@jimmccusker would you be interested in updating this PR to work with 5.0.0+? I see that it previously passed only the Python 2.7 tests and none of the 3.x tests. In 5.0.0+, you might get this to pass 3.5+ only.

nicholascar avatar May 01 '20 11:05 nicholascar

Hi @jimmccusker, are you interested in getting this to work in Python 3.6+ / RDFlib 5.0.0? We are keen to see a couple more store implementations as we are planning on killing off the old in-memory store that doesn't support a lot of expected features (like Turtle parsing!).

nicholascar avatar Jul 30 '20 04:07 nicholascar

I'm actually thinking of trying again with sqlite so I can use its full text search. It should match the syntax I'm working on for a fuseki implementation too. Do you mean the in-memory store that's the default store, or is that a different one?

jpmccu avatar Jul 30 '20 13:07 jpmccu

Of the two stores in memory.py, Memory & IOMemory, I think, from memory (!) that one of them doesn't support all features, perhaps IOMemory. The performance advantages it has over Memory might be negated with a change to Python 3.6 dicts, so then no need to maintain IOMemory if it was faster but supported fewer features.

Can't quite remember all this though so will have to test out the stores' features and speeds first.

nicholascar avatar Jul 30 '20 16:07 nicholascar

If you're worried about performance, for what it's worth I've brought the default memory store in py3 up over 1 billion triples.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/RDFLib/rdflib/pull/830#issuecomment-666508587, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAETCEI7OUKRGRILMKKK7IDR6GN3TANCNFSM4FBGCTCA .

-- Jim McCusker

Director, Data Operations Tetherless World Constellation Rensselaer Polytechnic Institute [email protected] [email protected] http://tw.rpi.edu

jpmccu avatar Jul 30 '20 17:07 jpmccu

@jimmccusker are you still keen on this store? I've got a PR in to update the Sleepycat store to the newer version of (Python's wrapper for) BerkeleyDB and that works well. I'm keen to see more stores added.

nicholascar avatar Jul 02 '21 11:07 nicholascar

I haven't had the chance to work on it. It would probably be SQLite based, as I've found some unexpectedly expensive operations with shelve.

Jamie

-- Jamie McCusker (she/they)

Director, Data Operations Tetherless World Constellation Rensselaer Polytechnic Institute @.*** @.***> http://tw.rpi.edu

jpmccu avatar Jul 02 '21 13:07 jpmccu

Would there be advantages to this approach instead of just https://github.com/RDFLib/rdflib-sqlalchemy with SQLite?

westurner avatar Aug 28 '21 12:08 westurner

@jimmccusker I don't know much about Shelve but when you say "It would probably be SQLite based" what do you mean? Is it that you would reuse the SQLite Store, as indicated by @westurner or perhaps something else?

Wouldn't Shelve be quite a different thing to SQLite? I'm really keen for options here and need to brush up on my use of SQLite. The SQLite store doesn't appear much, or perhaps at all, in the main RDFlib docco.

It would be good to do a stock take of all the Stores and to perhaps list their similarities & differences in mainline RDFlib docco.

nicholascar avatar Aug 29 '21 01:08 nicholascar

If there's a SQLite store, it's probably already doing better what I'd try to do. If benchmarks say it's comparable to sleepycat, we should just go with that.

jpmccu avatar Aug 29 '21 02:08 jpmccu

But I'm guessing that going through the sqlalchemy layer is probably not as performant as it could be. I haven't investigated, though.

jpmccu avatar Aug 29 '21 02:08 jpmccu

going through the sqlalchemy layer is probably not as performant as it could be

Well that’s the thing: I assumed a Shelve implementation using the native Shelve API would be best, but then you’d have to invent (or borrow, if you could copy from BerkeleyDB) all the CRUD equivalent functions in RDFlib-speak as well as all the SPO, PSO etc indexing. That’s what I thought you wanted to do!
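For anyone unfamiliar with what "SPO, PSO etc indexing" involves, here is a minimal sketch of permutation indexes kept as nested dicts. It is an illustration only, not how the BerkeleyDB store or any RDFLib store is written; the class name `TripleIndex` and the choice of the three orderings SPO, POS and OSP are assumptions made for the example.

```python
from collections import defaultdict


class TripleIndex:
    """Three permutation indexes so any bound term gives a direct lookup."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        # Every triple is written once per index.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def triples(self, s=None, p=None, o=None):
        """Yield matching triples; None is a wildcard."""
        if s is not None:
            # .get avoids creating empty entries in the defaultdict.
            for pred, objs in self.spo.get(s, {}).items():
                if p is None or p == pred:
                    for obj in objs:
                        if o is None or o == obj:
                            yield (s, pred, obj)
        elif p is not None:
            for obj, subjs in self.pos.get(p, {}).items():
                if o is None or o == obj:
                    for subj in subjs:
                        yield (subj, p, obj)
        elif o is not None:
            for subj, preds in self.osp.get(o, {}).items():
                for pred in preds:
                    yield (subj, pred, o)
        else:
            for subj, preds in self.spo.items():
                for pred, objs in preds.items():
                    for obj in objs:
                        yield (subj, pred, obj)


index = TripleIndex()
index.add("alice", "knows", "bob")
index.add("carol", "knows", "bob")
index.add("alice", "likes", "tea")
who_knows_bob = sorted(index.triples(p="knows", o="bob"))
about_tea = sorted(index.triples(o="tea"))
```

The trade-off is that each write touches three structures, in exchange for avoiding a full scan when only the predicate or object is known.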

Perhaps we really do need a Store features and performance comparison table. Then we will know what, if anything, is missing.

Is this something an RPI student might be able to do @jimmccusker?

nicholascar avatar Aug 29 '21 03:08 nicholascar

Here are the sqla Tables and Indexes: https://github.com/RDFLib/rdflib-sqlalchemy/blob/develop/rdflib_sqlalchemy/tables.py

Tests for SQLite: https://github.com/RDFLib/rdflib-sqlalchemy/blob/develop/test/test_sqlalchemy_sqlite.py

500-25K triples: https://github.com/RDFLib/rdflib-sqlalchemy/blob/develop/test/test_store_performance.py

Is shelve similar to LevelDB in key/value interface?

  • https://github.com/python/cpython/blob/main/Lib/shelve.py
    • Someday pickle could be extended to load data but not code: https://github.com/python/cpython/blob/main/Lib/pickle.py#L1497-L1539
    • jsonpickle
    • ijson +& simdjson for performance
  • https://github.com/RDFLib/rdflib-leveldb/blob/master/rdflib_leveldb/leveldbstore.py
  • https://github.com/cosmos/iavl

From https://github.com/jsonpickle/jsonpickle#security :

Security

jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.

Warning

The jsonpickle module is not secure. Only unpickle data you trust.

It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.

Consider signing data with an HMAC if you need to ensure that it has not been tampered with.

HMACs and Merkle hashes help with data integrity, but not [cryptographic] identity (which we now have W3C ld-proofs for part of)

Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.
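The HMAC suggestion in the quoted warning could look roughly like the sketch below, using only the standard-library `hmac`, `hashlib` and `pickle` modules. The function names `dumps_signed` and `loads_signed` and the "tag followed by payload" layout are assumptions for illustration. As noted above, this only detects tampering by someone who lacks the key; it says nothing about who produced the data.

```python
import hashlib
import hmac
import pickle

TAG_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def dumps_signed(obj, key: bytes) -> bytes:
    """Pickle obj and prepend an HMAC-SHA256 tag computed over the payload."""
    payload = pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def loads_signed(blob: bytes, key: bytes):
    """Verify the tag before unpickling; refuse anything that does not match."""
    tag, payload = blob[:TAG_SIZE], blob[TAG_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC mismatch: refusing to unpickle")
    return pickle.loads(payload)


key = b"example-key-for-illustration-only"
blob = dumps_signed({"s": {"p": {"o"}}}, key)
restored = loads_signed(blob, key)
```

The important detail is the order of operations: the tag is checked with a constant-time comparison before `pickle.loads` ever sees the bytes.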

https://pypi.org/project/ijson/#performance-tips

https://github.com/simdjson/simdjson#performance-results

westurner avatar Aug 29 '21 05:08 westurner

... FWIW, Arrow + Parquet is definitely faster than pickle or JSON (because Parquet is already shaped like it needs to be in RAM); but serialization of arbitrary types does not make a proper DBMS with ACID guarantees, like SQLite.

https://arrow.apache.org/docs/python/pandas.html#handling-pandas-indexes

datetime64 indexes, too? https://arrow.apache.org/docs/python/pandas.html#date-types


Edit: Shelve solves for persisting a dict of python objects; for when there's not enough RAM. But:

  • SEC: shelve executes unsigned code due to pickle,
  • PERF,SCAL,BUG: shelve doesn't do ordered transactions, so shelve is not safe for parallel use: if there are e.g. writes during reads, the behavior is nondeterministic due to lack of (database transaction) Isolation. From https://en.wikipedia.org/wiki/ACID re Isolation:

depending on the method used, the effects of an incomplete transaction might not even be visible to other transactions

rdflib-sqlalchemy should already be wrapping the SQL statements that mutate the database in transaction BEGIN and COMMIT?
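To make the transaction point concrete, here is a small sketch using the standard-library `sqlite3` module directly. It does not show what rdflib-sqlalchemy does internally; the `triples` table layout and the `add_all` helper are hypothetical. It only shows that a batch of writes either commits as a whole or is rolled back as a whole, which plain shelve does not offer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT, UNIQUE (s, p, o))")


def add_all(conn, rows):
    """Insert a batch atomically: commit on success, roll back on any error."""
    with conn:  # the connection context manager commits or rolls back
        conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)


def count(conn):
    return conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]


add_all(conn, [("alice", "knows", "bob"), ("carol", "knows", "bob")])
count_after_first_batch = count(conn)

try:
    # The second row violates the UNIQUE constraint, so the whole batch,
    # including the new first row, is rolled back.
    add_all(conn, [("dave", "knows", "bob"), ("alice", "knows", "bob")])
    second_batch_failed = False
except sqlite3.IntegrityError:
    second_batch_failed = True

count_after_failed_batch = count(conn)
```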

westurner avatar Aug 29 '21 05:08 westurner

The major issue with shelve is that it is expensive to iterate keys. The documentation doesn't explain how bad, but I've definitely seen performance be far worse than iterating through a key list file. I can see if there's an undergraduate who'd like a project like this that I can mentor.

Thanks, Jamie

jpmccu avatar Aug 29 '21 15:08 jpmccu

I can see if there's an undergraduate who'd like a project like this that I can mentor

Great, well I'm happy to assist too with students and anyone else! I already have an undergrad working on RDFlib-related things now too but he is consumed with Units of Measure. Going forward, I think I want to do more at the fundamental RDFlib end - all this Store stuff, JSON-LD & Turtle serialization improvements etc. as there really are some critical issues that lots of people would benefit from solutions to.

nicholascar avatar Aug 30 '21 00:08 nicholascar

We have something we did for units of measure conversion, but it's Ontology specific. https://pypi.org/project/whyis-unit-converter/

jpmccu avatar Aug 30 '21 02:08 jpmccu

Thanks for the link to that work Jamie: yes I know about OM and have communicated a bit with Hajo Rijgersberg about it and I'm glad to have your Python converter in mind now.

The current work we are doing is building a Python UCUM converter, based on the JavaScript one (https://github.com/lhncbc/ucum-lhc). Once that's done, we will work on a QUDT (http://qudt.org/) converter, which should be pretty easy compared to the UCUM one, given the conversion vectors present in QUDT. The goal is to have multi-system unit conversions available in RDFlib.

nicholascar avatar Aug 30 '21 03:08 nicholascar

  • [ ] how to publish a Dataset as CSVW Linked Data with units of measure
    • [ ] how to specify the units of measure of a CSV/CSVW column as a URI
    • http://wrdrd.github.io/docs/consulting/units#csvw-and-units

http://wrdrd.github.io/docs/consulting/linkedreproducibility#csv-csvw-and-metadata-rows

CSV, CSVW, and metadata rows

A data table with 7 metadata header rows (column label, property URI path, DataType, unit, accuracy, precision, significant figures)

  • [ ] how to publish a ScholarlyArticle of StructuredPremises like Datasets {from a Jupyter-Book in a repo2docker REES container image}
    • "#LinkedReproducibility"
    • #LinkedResearch


westurner avatar Aug 30 '21 07:08 westurner

@jpmccu @westurner RDFLib's gone up a couple of versions now, any continued interest here?

nicholascar avatar Mar 20 '24 02:03 nicholascar

Not in the near future. I've been using the OxiGraph store for similar use cases.

jpmccu avatar Mar 20 '24 03:03 jpmccu