Add caching mechanism and rework remote loader/stats?
Originally by @femtotrader on https://github.com/trickvi/datapackage/issues/61:
Hello,
I think datapackage should provide a cache mechanism.
To support this, two optional dependencies could be used (only needed if the user wants caching): requests and requests-cache:
- http://requests.readthedocs.org/en/latest/
- http://requests-cache.readthedocs.org/

I don't suggest monkey-patching requests to use `CachedSession` by default, but rather passing a `CachedSession` instead of a `requests.Session`.
One possible use could be:

```python
import datetime

import datapackage
import requests_cache

session = requests_cache.CachedSession(
    cache_name='cache',
    backend='sqlite',
    expire_after=datetime.timedelta(days=60),
)
datapkg = datapackage.DataPackage('http://data.okfn.org/data/cpi/', session=session)
```

The default value of the `session` parameter should be `None`. This session should be stored as a member of `DataPackage`.
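A minimal sketch of what this could look like inside the class (the constructor signature and the `_get` helper here are illustrative assumptions, not the actual datapackage-py code):

```python
import requests


class DataPackage:
    def __init__(self, uri, session=None):
        self.uri = uri
        # Store the caller-provided session (default None) as a member.
        self.session = session

    def _get(self, url):
        # Use the custom (possibly caching) session when one was given,
        # otherwise fall back to plain requests.
        if self.session is not None:
            return self.session.get(url)
        return requests.get(url)
```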
When `session` is not `None`, requests will be performed using:

```python
self.session.get(url)
```

Kind regards
PS: a similar approach was used in https://github.com/femtotrader/pandas_datareaders_unofficial

edit: and is now (Oct 2015) used in the official pandas-datareader: https://github.com/pydata/pandas-datareader/
see also pydata/pandas-datareader#48
This caching mechanism is very important if we want to test the datapackage library against all the datapackages in the /datasets organisation. So see also https://github.com/frictionlessdata/testsuite-py/issues/12
Any news about this? It would help a lot: https://github.com/datasets/registry/issues/114
Just adding that, for testing purposes, instead of implementing caching in the library itself, we can use it only in tests with a library like https://github.com/sigmavirus24/betamax.
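For context, a betamax test wraps a `requests.Session` and records HTTP interactions to a "cassette" on the first run, replaying them offline afterwards. A rough sketch (the cassette name, directory, and URL are placeholders):

```python
import requests
from betamax import Betamax

session = requests.Session()

# The first run records real HTTP traffic into tests/cassettes/user.json
# (the directory must exist); later runs replay the recorded responses.
with Betamax(session, cassette_library_dir='tests/cassettes').use_cassette('user'):
    response = session.get('https://api.github.com/users/octocat')
```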
```python
vcr.use_cassette('user')
```

looks like the monkey-patching approach of requests-cache: http://requests-cache.readthedocs.io/en/latest/user_guide.html#installation

```python
requests_cache.install_cache()
```
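For comparison, a minimal sketch of that monkey-patching approach (the cache name is arbitrary):

```python
import requests
import requests_cache

# install_cache() patches requests globally: from here on, every
# requests.get() in the process is transparently cached.
requests_cache.install_cache('cache', backend='sqlite')

requests.get('http://data.okfn.org/data/cpi/datapackage.json')  # hits the network
requests.get('http://data.okfn.org/data/cpi/datapackage.json')  # served from the cache
```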
I'm not a big fan of this approach. I prefer passing a session object; it's much simpler than their approach using a context manager (`with`)... (my 2 cts)
It's not really about "implementing caching in the library itself". It's just about changing calls like

```python
requests.get(...)
```

to

```python
session.get(...)
```
But anyway, whatever your technical choices are, what is important is being able to know quickly which datapackages in https://github.com/datasets/registry are not valid. It seems that Rufus has some ideas/plans, though.
Just to be clear, I'm pointing out betamax because, as far as I can see, the reason you're suggesting this task is to allow us to monitor the datasets. With it, we can solve the testing issue without having to add more code to datapackage-py.
I didn't know about betamax previously. Both can be used with the monkey-patching approach, and so both can allow monitoring the datasets.
I suppose it's kinda blocked by https://github.com/frictionlessdata/specs/issues/243
@roll while frictionlessdata/datapackage-py#243 refers to a cache property, it has a different use and meaning from the above, as far as I can see.
It's also related to https://github.com/frictionlessdata/goodtables-py/issues/140 - both could require providing a custom requests session to tabulator. But caching could lead to other kinds of problems, like memory usage (we do streaming for everything).
So this issue is something to investigate.
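To make the trade-off concrete, here is a rough sketch of handing a cached session down to tabulator. It assumes tabulator accepts a custom requests session via an `http_session` option (check the tabulator docs); the URL is just an example:

```python
import datetime

import requests_cache
from tabulator import Stream

# A caching session: responses are persisted to SQLite and replayed
# until they expire.
session = requests_cache.CachedSession(
    cache_name='cache',
    backend='sqlite',
    expire_after=datetime.timedelta(days=1),
)

# Caveat: to cache a response, the session has to buffer the full body,
# which works against the row-by-row streaming mentioned above.
with Stream('http://data.okfn.org/data/cpi/data/cpi.csv',
            http_session=session) as stream:  # http_session is an assumption
    for row in stream.iter():
        print(row)
```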