Reuse of previously loaded data
Reusing a file that has already been loaded in the past should be faster. This could be achieved by some form of caching of the loaded data.
On a similar note, I was wondering how one could reuse the generated db. Changing :memory: to q.sqlite and ending with db.conn.commit() instead of table_creator.drop_table() did the trick.
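For reference, here is a minimal sketch of that idea using Python's sqlite3 module directly. It is not q's actual code; the file name q.sqlite and the table layout are just placeholders:

```python
import sqlite3

# db = sqlite3.connect(':memory:')   # original behaviour: the data disappears when the process exits
db = sqlite3.connect('q.sqlite')     # file-backed: the loaded table survives this run

# placeholder table standing in for whatever q builds from the input file
db.execute('CREATE TABLE IF NOT EXISTS data (a TEXT, b TEXT)')
db.executemany('INSERT INTO data VALUES (?, ?)', [('1', 'x'), ('2', 'y')])

db.commit()   # commit instead of dropping the table, so the data is persisted
db.close()

# a later run can reopen q.sqlite and query it without re-parsing the original file
```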
Caching the data is not obvious, as you need to check whether the file is still the same (the file's md5sum could be used for that), and you need some sort of garbage collection.
Yeah, as we all know, the 2 most difficult problems in computer science are cache invalidation, naming things and off by one errors.
Exactly :)
Hi, sorry for the late reply. Been offline for a couple of days.
Thanks a lot, I'll take a deeper look at your tip and see if I can find some trick to make the invalidation fast enough (was planning on cksum, perhaps a sampled cksum or something, with an option to be stricter and slower through a command line parameter).
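A rough sketch of what such a check could look like, with a strict full md5 and a faster sampled checksum. The function names, the sampling scheme and the strict/fast switch are only assumptions for illustration, not anything q actually implements:

```python
import hashlib
import os

def full_md5(path, chunk_size=1 << 20):
    """Strict but slow: hash the whole file."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def sampled_checksum(path, sample_size=64 * 1024):
    """Faster, weaker check: hash the file size plus its first and last blocks."""
    size = os.path.getsize(path)
    h = hashlib.md5(str(size).encode())
    with open(path, 'rb') as f:
        h.update(f.read(sample_size))
        if size > sample_size:
            # jump to the tail of the file without re-reading the first block
            f.seek(max(size - sample_size, sample_size))
            h.update(f.read(sample_size))
    return h.hexdigest()

def cache_is_valid(path, stored_signature, strict=False):
    # 'strict' could be wired to a command line parameter; the option is hypothetical
    current = full_md5(path) if strict else sampled_checksum(path)
    return current == stored_signature
```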
Harel
I've created an API which will allow q to be used from python code as a module. The changes also inherently include the possibility to reuse previously loaded data (e.g. running multiple queries against the same loaded data).
Alpha version of the new API will be committed into the main branch in a couple of days.
Alpha branch of the python api has been committed to https://github.com/harelba/q/tree/expose-as-python-api.
The python api supports reuse of already-loaded data, and this capability is exposed to the command line by allowing the user to write multiple queries in the same q execution, e.g. q "select ..." "select ..." "select ...". Running q like that will load the data only once for each file, even if it's used in multiple queries. In the future, I'll probably add an interactive REPL for this as well.
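To illustrate the reuse pattern itself (this is not the actual API of the branch, which is documented in its readme), here is a plain sqlite3 sketch of loading a file once and then running several queries against the same in-memory table; the file name and columns are hypothetical:

```python
import csv
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data (a TEXT, b TEXT)')   # hypothetical two-column layout

with open('data.csv', newline='') as f:              # hypothetical input file
    conn.executemany('INSERT INTO data VALUES (?, ?)', csv.reader(f))

# the expensive load above happens once; every query below reuses the same table
for sql in ['select count(*) from data',
            'select a from data limit 3']:
    print(conn.execute(sql).fetchall())
```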
Any input would be helpful and appreciated.
Harel
Forgot to write - The readme file of the branch contains the required information about the API.
This capability is now fully supported internally, and exposed partially by running multiple queries on the same command line (every invocation of q reuses the loaded data across the multiple queries it runs).
This issue will be closed when the feature is fully exposed.
This could also be done like an interactive SQL client: at the start it loads all the data (into memory, I don't care, I have 32GB of RAM) and then we can execute the queries. My sample file is about 3GB, and waiting another minute for each query isn't good. Like:
$ q --client -H data.csv as data
q > select count(*) from data
------------
| count(*) |
------------
| 10000000 |
------------
q > select my_field from data where condition=true limit 3
------------
| my_field |
------------
| val1     |
------------
| val2     |
------------
| val3     |
------------
Support for multiple files is open for discussion.
Hi @msangel @Fil
I'm going to release a new version of q soon. It's a large change, which includes inherent caching capabilities similar to the ones you're describing, eliminating the need to wait between multiple queries of the same file.
Harel