rdflib
Shelve basicindex
Shelve is a local key-value database in Python that stores native Python objects. I have made a simple store with it that optimizes for read and write, while keeping indexing simple and local. The data structure is a tree that looks like this (including the native store):
{ "context" : { "subject": { "predicate" : set(["object"]) } } }
The context-level is the unit of storage in Shelve, and so operations are performed by reading and writing whole contexts in memory, then storing them as a unit to disk on each mutation. This gets shaky with really big contexts, but is performant when contexts aren't huge. Additionally, an LRU cache is enabled so that sequential and near-sequential mutations to the context don't require lots of disk reads. Searches that require grabbing all of a subtree from the data structure should be pretty fast. Finding all the subjects with a matching object will be an order N operation over the graph, which is worst case performance.
Order-N transformations like data generation, simple filtering, format conversion, etc. are therefore optimal, but don't go doing complex graph queries with it. It might work for a linked data server if you don't supply a SPARQL endpoint on top of it.
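A minimal sketch of the layout described above, not the code from this PR: one shelve entry per context, rewritten whole on each mutation, with a small LRU cache in front. The class name `ContextShelf`, its method names, and the `cache_size` parameter are all hypothetical and chosen only for illustration.

```python
import shelve
from collections import OrderedDict


class ContextShelf:
    """Hypothetical helper: each context is one shelve value shaped like
    {subject: {predicate: set(objects)}}."""

    def __init__(self, path, cache_size=4):
        self.db = shelve.open(path)
        self.cache = OrderedDict()  # context -> tree, most recently used last
        self.cache_size = cache_size

    def _load(self, context):
        if context in self.cache:
            self.cache.move_to_end(context)  # mark as recently used
        else:
            self.cache[context] = self.db.get(context, {})
            if len(self.cache) > self.cache_size:
                # Safe to drop: every mutation has already been written to disk.
                self.cache.popitem(last=False)
        return self.cache[context]

    def add(self, context, s, p, o):
        tree = self._load(context)
        tree.setdefault(s, {}).setdefault(p, set()).add(o)
        self.db[context] = tree  # the whole context is re-pickled on each mutation

    def objects(self, context, s, p):
        # Subtree lookup: two dict hops once the context is in memory.
        return self._load(context).get(s, {}).get(p, set())

    def subjects_with_object(self, context, o):
        # No object index exists, so this scans every triple in the context.
        return {
            s
            for s, preds in self._load(context).items()
            for objs in preds.values()
            if o in objs
        }

    def close(self):
        self.db.close()
```

The `add` method shows why large contexts hurt (the full tree is rewritten each time), and `subjects_with_object` shows the order-N scan for object lookups.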
Additional issues: changes needed for dbm on 2.7 mean breaking 3, since dbm doesn't like unicode keys. Also, whatever backend is being used in my local test doesn't scale well, so I need to find a better one. Watch this space (or not) for updates.
@jimmccusker would you be interested in updating this PR to work with 5.0.0+? I see that it previously passed only the Python 2.7 tests and none of the 3.x tests. In 5.0.0+, you might get this to pass 3.5+ only.
Hi @jimmccusker, are you interested in getting this to work in Python 3.6+ / RDFlib 5.0.0? We are keen to see a couple more store implementations as we are planning on killing off the old in-memory store that doesn't support a lot of expected features (like Turtle parsing!).
I'm actually thinking of trying again with sqlite, to use its full text search. It should match the syntax I'm working on for a fuseki implementation too. Do you mean the in memory store that's the default store, or is that a different one?
Of the two stores in memory.py, Memory & IOMemory, I think, from memory (!) that one of them doesn't support all features, perhaps IOMemory. The performance advantages it has over Memory might be negated with a change to Python 3.6 dicts, so then no need to maintain IOMemory if it was faster but supported fewer features.
Can't quite remember all this though so will have to test out the stores' features and speeds first.
If you're worried about performance, for what it's worth I've brought the default memory store in py3 up over 1 billion triples.
-- Jim McCusker
Director, Data Operations, Tetherless World Constellation, Rensselaer Polytechnic Institute http://tw.rpi.edu
@jimmccusker are you still keen on this store? I've got a PR in to update the Sleepycat store to the newer version of (Python's wrapper for) BerkeleyDB and that works well. I'm keen to see more stores added.
I haven't had the chance to work on it. It would probably be SQLite based, as I've found some unexpectedly expensive operations with shelve.
Jamie
-- Jamie McCusker (she/they)
Director, Data Operations, Tetherless World Constellation, Rensselaer Polytechnic Institute http://tw.rpi.edu
Would there be advantages to this approach instead of just https://github.com/RDFLib/rdflib-sqlalchemy with SQLite?
@jimmccusker I don't know much about Shelve but when you say "It would probably be SQLite based" what do you mean? Is it that you would reuse the SQLite Store, as indicated by @westurner or perhaps something else?
Wouldn't Shelve be quite a different thing to SQLite? I'm really keen for options here and need to brush up on my use of SQLite. SQLite doesn't appear much, or perhaps at all, in the main RDFlib docco.
It would be good to do a stock take of all the Stores and to perhaps list their similarities & differences in mainline RDFlib docco.
If there's a SQLite store, it's probably already doing better what I'd try to do. If benchmarks say it's comparable to sleepycat, we should just go with that.
But I'm guessing that going through the sqlalchemy layer is probably not as performant as it could be. I haven't investigated, though.
> going through the sqlalchemy layer is probably not as performant as it could be
Well that’s the thing: I assumed a Shelve implementation using the native Shelve API would be best, but then you’d have to invent (or borrow, if you could copy from BerkeleyDB) all the CRUD equivalent functions in RDFlib-speak as well as all the SPO, PSO etc indexing. That’s what I thought you wanted to do!
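To make the SPO, PSO etc. indexing mentioned above concrete, here is a rough in-memory sketch. It is not taken from RDFlib or from the BerkeleyDB store; the class `TripleIndex` and its methods are hypothetical. Each triple is written to three nested dicts so that any lookup with two bound terms avoids a full scan, at the cost of storing every triple three times.

```python
class TripleIndex:
    """Hypothetical three-way index over (subject, predicate, object) triples."""

    def __init__(self):
        # Plain dicts (rather than defaultdict with a lambda) stay picklable,
        # which matters if the structure is ever handed to shelve.
        self.spo = {}  # subject   -> predicate -> {objects}
        self.pos = {}  # predicate -> object    -> {subjects}
        self.osp = {}  # object    -> subject   -> {predicates}

    def add(self, s, p, o):
        self.spo.setdefault(s, {}).setdefault(p, set()).add(o)
        self.pos.setdefault(p, {}).setdefault(o, set()).add(s)
        self.osp.setdefault(o, {}).setdefault(s, set()).add(p)

    def remove(self, s, p, o):
        # Discard from all three indexes; empty inner dicts are left in place
        # to keep the sketch short.
        self.spo.get(s, {}).get(p, set()).discard(o)
        self.pos.get(p, {}).get(o, set()).discard(s)
        self.osp.get(o, {}).get(s, set()).discard(p)

    def objects(self, s, p):
        return set(self.spo.get(s, {}).get(p, set()))

    def subjects(self, p, o):
        return set(self.pos.get(p, {}).get(o, set()))

    def predicates(self, s, o):
        return set(self.osp.get(o, {}).get(s, set()))
```

A Shelve-backed store would need equivalents of `add` and `remove` that keep all three structures consistent on disk, which is the bookkeeping being referred to here.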
Perhaps we really do need a Store features and performance comparison table. Then we will know what, if anything, is missing.
Is this something an RPI student might be able to do, @jimmccusker?
Here are the sqla Tables and Indexes: https://github.com/RDFLib/rdflib-sqlalchemy/blob/develop/rdflib_sqlalchemy/tables.py
Tests for SQLite: https://github.com/RDFLib/rdflib-sqlalchemy/blob/develop/test/test_sqlalchemy_sqlite.py
500-25K triples: https://github.com/RDFLib/rdflib-sqlalchemy/blob/develop/test/test_store_performance.py
Is shelve similar to LevelDB in key/value interface?
- https://github.com/python/cpython/blob/main/Lib/shelve.py
- Someday pickle could be extended to load data but not code: https://github.com/python/cpython/blob/main/Lib/pickle.py#L1497-L1539
- jsonpickle
- ijson +& simdjson for performance
- https://github.com/RDFLib/rdflib-leveldb/blob/master/rdflib_leveldb/leveldbstore.py
- https://github.com/cosmos/iavl
From https://github.com/jsonpickle/jsonpickle#security :
Security jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.
Warning
The jsonpickle module is not secure. Only unpickle data you trust.
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
Consider signing data with an HMAC if you need to ensure that it has not been tampered with.
HMACs and Merkle hashes help with data integrity, but not [cryptographic] identity (which we now have W3C ld-proofs for part of)
Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.
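As a rough illustration of the HMAC suggestion in the quoted warning, the sketch below prepends an HMAC-SHA256 tag to a pickled payload and verifies it before unpickling. The function names `dumps_signed` and `loads_signed` are hypothetical, and how the secret key is managed is assumed to be handled elsewhere. This only detects tampering by someone who does not hold the key; it does not make pickle safe for data from untrusted producers.

```python
import hashlib
import hmac
import pickle

DIGEST_SIZE = hashlib.sha256().digest_size  # length of the tag in bytes


def dumps_signed(obj, key: bytes) -> bytes:
    """Pickle obj and prepend an HMAC-SHA256 tag computed over the payload."""
    payload = pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def loads_signed(blob: bytes, key: bytes):
    """Verify the tag first; only unpickle if it matches."""
    tag, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking where the mismatch occurs via timing.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC mismatch: refusing to unpickle")
    return pickle.loads(payload)
```

The important detail is the order of operations: `pickle.loads` is never called on bytes whose tag has not been checked.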
https://pypi.org/project/ijson/#performance-tips
https://github.com/simdjson/simdjson#performance-results
... FWIW, Arrow + Parquet is definitely faster than pickle or JSON (because Parquet is already shaped like it needs to be in RAM); but serialization of arbitrary types does not make a proper DBMS with ACID guarantees, like SQLite.
https://arrow.apache.org/docs/python/pandas.html#handling-pandas-indexes
datetime64 indexes, too? https://arrow.apache.org/docs/python/pandas.html#date-types
Edit: Shelve solves for persisting a dict of python objects; for when there's not enough RAM. But:
- SEC: shelve executes unsigned code due to pickle,
- PERF,SCAL,BUG: shelve doesn't do ordered transactions, so shelve is not safe for parallel use: if there are e.g. writes during reads, the behavior is nondeterministic due to lack of (database transaction) Isolation. From https://en.wikipedia.org/wiki/ACID re Isolation:
depending on the method used, the effects of an incomplete transaction might not even be visible to other transactions
rdflib-sqlalchemy should already be wrapping SQL statements that mutate the database in transaction BEGIN and COMMIT?
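For reference, this is roughly what explicit BEGIN and COMMIT wrapping looks like with the standard-library `sqlite3` module. It is a sketch only and says nothing about how rdflib-sqlalchemy actually issues its statements; the `triples` table and the `add_triples` helper are hypothetical.

```python
import sqlite3

# isolation_level=None stops the sqlite3 module from opening transactions
# implicitly, so the BEGIN/COMMIT/ROLLBACK below are the only ones issued.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")


def add_triples(conn, triples):
    """Insert a batch atomically: either every row lands or none do."""
    conn.execute("BEGIN")
    try:
        conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def count(conn):
    return conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
```

If a batch fails partway through, the rollback discards the rows already inserted in that batch, which is the kind of guarantee shelve does not provide on its own.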
The major issue with shelve is that it is expensive to iterate keys. The documentation doesn't explain how bad, but I've definitely seen performance be far worse than iterating through a key list file. I can see if there's an undergraduate who'd like a project like this that I can mentor.
Thanks, Jamie
> I can see if there's an undergraduate who'd like a project like this that I can mentor
Great, well I'm happy to assist too with students and anyone else! I already have an undergrad working on RDFlib-related things now too but he is consumed with Units of Measure. Going forward, I think I want to do more at the fundamental RDFlib end - all this Store stuff, JSON-LD & Turtle serialization improvements etc. as there really are some critical issues that lots of people would benefit from solutions to.
We have something we did for units of measure conversion, but it's Ontology specific. https://pypi.org/project/whyis-unit-converter/
Thanks for the link to that work Jamie: yes I know about OM and have communicated a bit with Hajo Rijgersberg about it and I'm glad to have your Python converter in mind now.
The current work we are doing is building a Python UCUM converter, based on the JavaScript one (https://github.com/lhncbc/ucum-lhc). Once that's done, we will work on a QUDT (http://qudt.org/) converter, which should be pretty easy compared to the UCUM one, given the conversion vectors present in QUDT. The goal is to have multi-system unit conversions available in RDFlib.
- [ ] how to publish a Dataset as CSVW Linked Data with units of measure
- [ ] how to specify the units of measure of a CSV/CSVW column as a URI
- http://wrdrd.github.io/docs/consulting/units#csvw-and-units
http://wrdrd.github.io/docs/consulting/linkedreproducibility#csv-csvw-and-metadata-rows
CSV, CSVW, and metadata rows
A data table with 7 metadata header rows (column label, property URI path, DataType, unit, accuracy, precision, significant figures)
- [ ] how to publish a ScholarlyArticle of StructuredPremises like Datasets
{from a Jupyter-Book in a repo2docker REES container image}
- "#LinkedReproducibility"
- #LinkedResearch
@jpmccu @westurner RDFLib's gone up a couple of versions now, any continued interest here?
Not in the near future. I've been using the OxiGraph store for similar use cases.