Bugfix/clean tmpfiles
Should handle this issue: https://github.com/InseeFrLab/pynsee/issues/205
- systematic removal of tempdirs (when not generated through `lru_cache`) or tempfiles (in the other case)
- some automatic black formatting
Thanks for that, should we add `ignore_errors=True` to `rmtree`?
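i.e. something like this (a minimal sketch):

```python
import shutil
import tempfile

tmpdir = tempfile.mkdtemp()
# ignore_errors=True makes rmtree swallow failures (e.g. a file still locked
# on Windows) instead of raising, so cleanup never crashes the session.
shutil.rmtree(tmpdir, ignore_errors=True)
```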
Ok, so that should be good to go once the last tests are completed. @hadrilec @tfardet I'll let you review the PR
Reading the PR in more detail, I'm actually not sure I understand what should be going on here.
For instance, you're clearing the tmpdir generated by a function relying on `lru_cache` for the value it returns... a value which relies on the tmpdir!
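For illustration, a minimal sketch of that pitfall (hypothetical function name, not pynsee's actual code):

```python
import shutil
import tempfile
from functools import lru_cache

@lru_cache(maxsize=None)
def get_data_dir():
    # The tempdir is created once; its path is memoized for the session.
    return tempfile.mkdtemp()

path = get_data_dir()
shutil.rmtree(path)  # "cleanup"...
# ...but every later call returns the same memoized path, which now
# points to a directory that no longer exists.
assert get_data_dir() == path
```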
Basically there are three possibilities here:
- the file is actually supposed to be stored for a long time and we move it to the cache folder (unlikely, but @hadrilec will know)
- the file is supposed to last the whole python session, in which case we actually need to keep a list of these tmpfiles/dirs and clear them all up in an `atexit` function (see the sketch below)
- the file is really supposed to be temporary (how temporary, then?) and we should remove the `lru_cache` from `get_capabilities`
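For the second option, a minimal sketch of the `atexit` bookkeeping I have in mind (hypothetical helper names):

```python
import atexit
import shutil
import tempfile

_SESSION_TMPDIRS: list[str] = []

def make_session_tmpdir() -> str:
    # Track every tempdir created during the session so that they can
    # all be removed when the interpreter exits.
    path = tempfile.mkdtemp()
    _SESSION_TMPDIRS.append(path)
    return path

@atexit.register
def _cleanup_session_tmpdirs() -> None:
    for path in _SESSION_TMPDIRS:
        shutil.rmtree(path, ignore_errors=True)
```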
Yes, you're right. In fact, I first went a bit aggressively at the cleaning after it froze my WSL partition, and had to scale back after seeing the `lru_cache` trick. I might have missed some checks, and I mostly relied on the automated tests to find what went well and what went wrong. That may entirely be the case with the nested `lru_cache` you've found (the nesting being the reason why emptying the file from the tempdir in the inner function does not trigger an exception).
As far as I can see, tempdir creation is not consistent across the whole package. I suspect the `lru_cache`d function creating the tempdir was added later in pynsee's development to avoid continuous hard-drive writing. There are still some other non-lru-cached tempdirs created, though.
I'm not sure many tempdirs should be kept longer than short-term. To be honest, some tempdirs are not even necessary, as you could perfectly well unzip a file without writing to the hard drive, using an io object (with the caveat that this might not work with nested zips...). But that should be something for a future PR...
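Something along these lines, for the record (a sketch with a placeholder URL; nested zips would need more care):

```python
import io
import zipfile

import requests

# Download the archive and unzip it entirely in memory: no tempdir needed.
response = requests.get("https://example.com/data.zip")  # placeholder URL
response.raise_for_status()
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    for name in archive.namelist():
        raw = archive.read(name)  # bytes of each member, straight from RAM
```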
I think I'm good.
As I've said, I removed what seemed to me unnecessary calls to tempfiles. I've kept them in:
- `_download_store_file` (which now serves a fixed temp directory to `_download_file` and `_import_options`). It should clean up after itself; we could rewrite those functions to avoid the tempdir altogether, but the size of those files (> 1 GB of CSV) might be too much for some machines to hold in RAM;
- `_get_temp_dir`, which is now only used by `_create_insee_folder`. I've taken the liberty of adding a warning to the user, stating that if platformdirs fails, pynsee cannot be expected to clean up the tempdir at the end of the python session.
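The warning logic is roughly the following (a sketch of the idea, not the exact code from the PR):

```python
import tempfile
import warnings

import platformdirs

def _get_folder() -> str:
    try:
        # Preferred: a persistent per-user cache directory.
        return platformdirs.user_cache_dir("pynsee")
    except Exception:
        # Fallback: a plain tempdir that nothing tracks afterwards.
        warnings.warn(
            "platformdirs failed: pynsee cannot be expected to clean up "
            "this temporary directory at the end of the python session"
        )
        return tempfile.mkdtemp()
```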
I hope there aren't too many mistakes in that PR. If there are (which is probably the case), note that I will be absent for 2 weeks, so feel free to correct it yourself if needed.
EDIT
I've also lightly altered `_get_temp_dir` to simplify the code. The path to the insee folder should not have changed (if it has, then it is a mistake).
I've also had some thoughts for future PRs:
- wherever there is a call to `_create_insee_folder`, it might be worth checking whether functions can be rewritten to use the `@save_df` decorator instead (it could simplify maintenance in the long run);
- maybe we should run black once and for all to clean up the whole codebase. There is a lot of "garbage" in my PR due to black reformatting each modified file (sorry about that 😳);
- I've just seen there is a call to Nominatim in the sirene section:
  - it might be worth setting up a cache on Nominatim (which is strongly encouraged); I've just made one in french-cities using `diskcache` and that could be reused at almost no cost (with the slight downside of adding one more dependency), see the sketch after this list;
  - note that we could also get faster results using the BAN with its CSV endpoint;
  - in any case, I think this query is in violation of Nominatim's geocoding policy. We need to set a User-Agent (EDIT: no, that's already done) and a rate limiter / sleep between calls (with `backoff_factor` currently set to 1, the first retry fires after .5 s); if these are there somewhere, I've missed them.
- I've added a "TODO" comment in download/_download_store_file. This function does not use the `update` argument, and on first reading I'd say it isn't necessary at all, since the function is called by `download_file` (which uses the `@save_df` decorator).
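As announced above, a rough sketch of the `diskcache` + rate-limit idea (placeholder cache path and User-Agent; not pynsee's actual geocoding code):

```python
import time

import requests
from diskcache import Cache

cache = Cache("nominatim_cache")  # persistent on-disk cache directory

@cache.memoize()
def geocode(address: str) -> list:
    # Only cache misses reach this body, so sleeping here rate-limits the
    # actual calls to Nominatim (at most 1 request per second).
    time.sleep(1)
    response = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": address, "format": "json"},
        headers={"User-Agent": "pynsee-example"},  # placeholder User-Agent
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```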
If that's ok with you both, @hadrilec @tfardet, feel free to open issues on this without me.
Nominatim is called from `_get_location_openstreetmap`, which is, in my opinion, compliant with the guidelines.
> wherever there is a call to `_create_insee_folder`, it might be worth checking whether functions can be rewritten to use the `@save_df` decorator instead (it could simplify maintenance in the long run)
I plan on restarting the config PR soon, breaking it into smaller bits this time; this is definitely on my todo list and should indeed go away.
> maybe we should run black once and for all to clean up the whole codebase. There is a lot of "garbage" in my PR due to black reformatting each modified file (sorry about that 😳)
good point, we could also add this as a pre-commit hook
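E.g. a minimal `.pre-commit-config.yaml` (the `rev` pin below is just an example):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2  # pin to whatever black version we settle on
    hooks:
      - id: black
```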
> I've added a "TODO" comment in download/_download_store_file. This function does not use the `update` argument, and on first reading I'd say it isn't necessary at all, since the function is called by `download_file` (which uses the `@save_df` decorator)
OK, I'll check and will open an issue about it if it seems necessary
> Nominatim is called from `_get_location_openstreetmap`, which is, in my opinion, compliant with the guidelines
Yes about the User-Agent (sorry I missed that, I was looking at the session object and didn't even see you added it on each get call...) and the cache (let's say I have no excuse for missing that...).
And also yes about the rate. I've just tested your example from the doc and it queries Nominatim at a rate of 1.4 requests/sec. It might be something I misunderstood about requests' retry system (probably), or even the lag time to run a query and write it to disk (which I doubt). I'll have to check what I've missed once I get some time. Well, sorry about that too.
😳
@hadrilec @tfardet do you want to review this once more or should this be merged?
I will have a look, thanks