
Compress cache files

Open katrinleinweber opened this issue 6 years ago • 4 comments

As a team of scientometricians, my colleagues and I are considering sharing our ~/.scopus/scopus_search/ directories to avoid redownloading data and to parallelise multiple downloads for a single project.

In order to speed up synchronisation and to avoid filling up our local drives too much, gz compression (or any other) of the md5-named cache files would be tremendously helpful.

katrinleinweber avatar Jul 22 '19 10:07 katrinleinweber

Yes, we also share our cache.

Compression seems like a good idea. Do you have an idea of the decompression times? That's the cost of saving space on disk.

Michael-E-Rose avatar Jul 22 '19 11:07 Michael-E-Rose

I haven't measured, but I used rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison to select gz for my use case.

In any case, compared to the download rate from Scopus of 30-40 MB per hour, any delay due to (de)compression will be negligible.
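For anyone who would rather measure than estimate, here is a minimal timing sketch using only the standard library. The helper name `time_gzip_roundtrip` and the synthetic JSON payload are made up for illustration; real cache files will compress differently, so treat any numbers it prints as a rough indication only.

```python
import gzip
import json
import time


def time_gzip_roundtrip(payload, repeats=5):
    """Return (compressed size / original size, mean seconds per decompression)."""
    compressed = gzip.compress(payload)
    start = time.perf_counter()
    for _ in range(repeats):
        gzip.decompress(compressed)
    mean_seconds = (time.perf_counter() - start) / repeats
    return len(compressed) / len(payload), mean_seconds


# Synthetic stand-in for a cached search result; the field names are invented.
sample = json.dumps(
    [{"eid": f"2-s2.0-{i}", "title": "Example title"} for i in range(10_000)]
).encode("utf-8")

ratio, seconds = time_gzip_roundtrip(sample)
print(f"compressed to {ratio:.1%} of original, {seconds * 1000:.2f} ms per decompression")
```

Pointing the same function at the bytes of a few real cache files would give a more meaningful comparison against the download rate.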

katrinleinweber avatar Jul 22 '19 18:07 katrinleinweber

Okay, this sounds good and certainly makes sense.

I am thinking about how to best implement the compression:

  • Should the filename change or should it not?
  • Should it always be there or should there be a switch?
  • Should all classes be affected or just ScopusSearch results?

Depending on the answers, all previously cached files will be useless, which I'd like to avoid.

In any case, that's something for pybliometrics 3.0.

Michael-E-Rose avatar Jul 24 '19 16:07 Michael-E-Rose

… previously cached files will be useless which I'd like to avoid.

I presume that in this case, some kind of inference & if … else … will be needed, regardless of whether the files receive an extension or not. Maybe pandas' compression inference is a good example of that?
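A rough sketch of what such an inference could look like, assuming the cached files are UTF-8 text. pandas infers compression from the file extension; because the md5-named cache files have none, this sketch checks the first two bytes instead, which are always 0x1f 0x8b for a gzip stream. The function name `read_cache_file` is hypothetical and not part of pybliometrics.

```python
import gzip
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def read_cache_file(path):
    """Return the cached text, whether or not the file is gzip-compressed."""
    raw = Path(path).read_bytes()
    if raw[:2] == GZIP_MAGIC:
        # Compressed file: decompress before decoding.
        raw = gzip.decompress(raw)
    # Otherwise the file is an old, uncompressed cache file and is used as-is.
    return raw.decode("utf-8")
```

With a check like this, previously cached uncompressed files would stay readable next to newly compressed ones, and the filenames would not have to change.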

… just ScopusSearch results?

Having used only the latter, I still guess that significant benefits are possible for each search class.

… should there be a switch?

Yes, please :-) Different situations require different prioritisations of speed over storage or the other way round. (De)compression will most likely add some delay. The main question is probably: What should the default be? I vote for compression='gz'.
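To make the switch concrete, here is a minimal sketch of a writer with a `compression` parameter defaulting to 'gz'. The function name `write_cache_file` and its signature are assumptions for illustration only, not an existing pybliometrics API.

```python
import gzip
from pathlib import Path


def write_cache_file(path, text, compression="gz"):
    """Write text to path, gzip-compressed by default (hypothetical helper)."""
    data = text.encode("utf-8")
    if compression == "gz":
        data = gzip.compress(data)
    elif compression is not None:
        raise ValueError(f"unsupported compression: {compression!r}")
    # compression=None keeps the file uncompressed, as it is today.
    Path(path).write_bytes(data)
```

Passing `compression=None` would favour speed, while the default favours storage.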

katrinleinweber avatar Jul 24 '19 17:07 katrinleinweber