pybliometrics
Compress cache files
As a team of scientometricians, my colleagues and I are considering sharing our ~/.scopus/scopus_search/ directories to avoid re-downloading data and to parallelise multiple downloads for a single project.
To speed up synchronisation and to avoid filling up our local drives, gz compression (or any other scheme) of the md5-named cache files would be tremendously helpful.
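For illustration, here is a minimal sketch of the kind of one-off compression pass we have in mind. The directory is the one mentioned above; the in-place `.gz` renaming is only an assumption, not anything pybliometrics does today:

```python
import gzip
import shutil
from pathlib import Path

# Hypothetical one-off pass: gzip every md5-named cache file in place.
cache_dir = Path.home() / ".scopus" / "scopus_search"
for path in cache_dir.rglob("*"):
    if path.is_file() and not path.name.endswith(".gz"):
        with open(path, "rb") as src, \
                gzip.open(path.with_name(path.name + ".gz"), "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()  # drop the uncompressed original
```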
Yes, we also share our cache.
Compression seems like a good idea. Do you have an idea of the decompression times? That's the cost of saving space on disk.
I haven't measured them, but I used the benchmark at rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison to select gz for my use case.
In any case, compared to the download rate from Scopus of 30-40 MB per hour, any delay due to (de)compression will be negligible.
Okay, this sounds good and certainly makes sense.
I am thinking about how to best implement the compression:
- Should the filename change or should it not?
- Should it always be there or should there be a switch?
- Should all classes be affected or just ScopusSearch results?
Depending on the answers, all previously cached files could become useless, which I'd like to avoid.
In any case, that's something for pybliometrics 3.0.
… previously cached files could become useless, which I'd like to avoid.
I presume that in this case some kind of inference and an if … else … will be needed, regardless of whether the files receive an extension or not. Maybe Pandas' compression inference is a good example of that?
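A minimal sketch of what such inference could look like, loosely modelled on Pandas' approach; the helper name and the magic-byte check are illustrative, not an existing pybliometrics API:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def open_cache(path):
    """Hypothetical helper: open a cache file whether or not it is gzipped.

    Sniffing the magic bytes means old, uncompressed cache files keep
    working even if no extension was ever added to the md5 names.
    """
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == GZIP_MAGIC
    return gzip.open(path, "rt") if is_gzip else open(path, "r")
```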
… just ScopusSearch results?
Having used only the latter, I still guess that significant benefits are possible for each search class.
… should there be a switch?
Yes, please :-) Different situations call for prioritising speed over storage, or the other way round, and (de)compression will most likely add some delay. The main question is probably: what should the default be? I vote for compression='gz'.
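To make the vote concrete, the switch could look roughly like this; the function name and the parameter's placement are hypothetical, not the actual pybliometrics API:

```python
import gzip
import json

def write_cache(path, data, compression="gz"):
    """Hypothetical cache writer with a compression switch.

    compression="gz" (the proposed default) writes gzip; compression=None
    keeps the old plain-text behaviour for speed-over-storage setups.
    """
    if compression == "gz":
        with gzip.open(path, "wt") as fh:
            json.dump(data, fh)
    elif compression is None:
        with open(path, "w") as fh:
            json.dump(data, fh)
    else:
        raise ValueError(f"Unknown compression: {compression!r}")
```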