
Add caching mechanism and rework remote loader/stats?

Open vitorbaptista opened this issue 9 years ago • 9 comments

Originally by @femtotrader on https://github.com/trickvi/datapackage/issues/61:

Hello,

I think datapackage should provide a cache mechanism.

For this cache mechanism (if the user wants it), two optional dependencies could be requests and requests-cache:

  • http://requests.readthedocs.org/en/latest/
  • http://requests-cache.readthedocs.org/

I don't suggest monkey-patching requests to use CachedSession by default, but rather passing a CachedSession instead of a requests.Session.

One possible use could be:

import datapackage
import requests_cache
import datetime
session = requests_cache.CachedSession(cache_name='cache', backend='sqlite', expire_after=datetime.timedelta(days=60))
datapkg = datapackage.DataPackage('http://data.okfn.org/data/cpi/', session=session)

The default value of the session parameter should be `None`. This session should be stored as a member of DataPackage.

When session is not None, requests will be performed using

self.session.get(url)
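
For illustration, a minimal sketch of what that could look like inside the library; the session keyword and the _get helper are hypothetical and only restate the proposal above, not the library's current API:

import requests

class DataPackage(object):
    # Hypothetical sketch: `session` is optional, defaults to None,
    # and is kept as a member so remote fetches can reuse it.
    def __init__(self, descriptor, session=None):
        self.descriptor = descriptor
        self.session = session

    def _get(self, url):
        # When a session was provided, use it (it may be a CachedSession);
        # otherwise fall back to a plain module-level requests.get().
        if self.session is not None:
            return self.session.get(url)
        return requests.get(url)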

Kind regards

PS : a similar approach was used in https://github.com/femtotrader/pandas_datareaders_unofficial

edit: and is now (oct 2015) used in official "pandas-datareader" https://github.com/pydata/pandas-datareader/

see also pydata/pandas-datareader#48

vitorbaptista avatar Apr 18 '16 11:04 vitorbaptista

This cache mechanism is very important if we want to test the datapackage library with all datapackages in the /datasets organisation. So see also https://github.com/frictionlessdata/testsuite-py/issues/12

femtotrader avatar Apr 18 '16 16:04 femtotrader

Any news about this? It would help a lot with https://github.com/datasets/registry/issues/114

femtotrader avatar Aug 29 '16 12:08 femtotrader

Just adding that, for testing purposes, instead of implementing caching in the library itself, we can use it only in the tests, using a library like https://github.com/sigmavirus24/betamax.
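
For illustration, a minimal sketch of how betamax records and replays HTTP interactions in a test; the cassette directory and cassette name are assumptions, and the URL is just the CPI package from the example above:

import requests
from betamax import Betamax

session = requests.Session()
# First run records the interaction to a cassette; later runs replay it
# without hitting the network.
recorder = Betamax(session, cassette_library_dir='tests/cassettes')

with recorder.use_cassette('cpi-datapackage'):
    response = session.get('http://data.okfn.org/data/cpi/datapackage.json')
    assert response.status_code == 200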

vitorbaptista avatar Aug 29 '16 13:08 vitorbaptista

vcr.use_cassette('user')

This looks like the monkey-patch approach of requests-cache: http://requests-cache.readthedocs.io/en/latest/user_guide.html#installation

requests_cache.install_cache()

I'm not a big fan of this approach. I prefer passing a session object; it's much simpler than their approach using a context manager (with)... (my 2 cents)
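
For contrast, a minimal sketch of the two styles being compared, using requests-cache's documented entry points; passing the session to DataPackage is still the proposed API, not an existing one:

import datetime
import requests
import requests_cache

# Monkey-patch style: install_cache() transparently caches every requests
# call in the process, which is the global behaviour objected to above.
requests_cache.install_cache('cache', backend='sqlite',
                             expire_after=datetime.timedelta(days=60))
response = requests.get('http://data.okfn.org/data/cpi/datapackage.json')

# Explicit style: only code that is handed this session gets cached responses.
session = requests_cache.CachedSession('cache', backend='sqlite',
                                       expire_after=datetime.timedelta(days=60))
response = session.get('http://data.okfn.org/data/cpi/datapackage.json')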

It's not really about "implementing caching on the library itself"

It's just about changing calls like

requests.get(...)

to

session.get(...)

But anyway, whatever your technical choices are, what is important is being able to know quickly which datapackages in https://github.com/datasets/registry are not valid. But it seems that Rufus has some ideas/plans.

femtotrader avatar Aug 29 '16 13:08 femtotrader

Just to be clear, I'm pointing out betamax because, as far as I can see, the reason you're suggesting this task is to allow us to monitor the datasets. With it, we can solve the testing issue without having to add more code to datapackage-py.

vitorbaptista avatar Aug 29 '16 14:08 vitorbaptista

I didn't know about betamax previously. Both can be used with the monkey-patch approach and so both can be used to monitor the datasets.

femtotrader avatar Aug 29 '16 16:08 femtotrader

I suppose it's kinda blocked by https://github.com/frictionlessdata/specs/issues/243

roll avatar Aug 29 '16 16:08 roll

@roll while frictionlessdata/datapackage-py#243 refers to a cache property, it has a different use and meaning from the above, as far as I can see.

pwalsh avatar Aug 29 '16 17:08 pwalsh

It's also related to https://github.com/frictionlessdata/goodtables-py/issues/140 - both could require providing a custom requests session to tabulator. But caching could lead to other kinds of problems, like memory usage (we do streaming for everything).

So this issue is something to investigate.

roll avatar Aug 22 '17 12:08 roll