Bugfix/clean tmpfiles
Should handle this issue: https://github.com/InseeFrLab/pynsee/issues/205
- systematic removal of tempdirs (when not generated through `lru_cache`) or tempfiles (in the other case)
- some automatic black formatting
Thanks for that, should we add `ignore_errors=True` to `rmtree`?
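i.e. something like this (a minimal sketch):

```python
import shutil
import tempfile

tmpdir = tempfile.mkdtemp()
# ignore_errors=True makes rmtree swallow failures (e.g. a file still locked
# on Windows) instead of raising, so cleanup never crashes the session.
shutil.rmtree(tmpdir, ignore_errors=True)
```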
Ok, so that should be good to go once the last tests are completed. @hadrilec @tfardet I'll let you review the PR
Reading the PR in more detail, I'm actually not sure I understand what should be going on here.
For instance, you're clearing the tmpdir generated by a function relying on `lru_cache` for the value it returns... a value which relies on the tmpdir!
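For illustration, a minimal sketch of that pitfall (hypothetical function name, not pynsee's actual code):

```python
import shutil
import tempfile
from functools import lru_cache

@lru_cache(maxsize=None)
def get_data_dir():
    # The tempdir is created once; its path is memoized for the session.
    return tempfile.mkdtemp()

path = get_data_dir()
shutil.rmtree(path)  # "cleanup"...
# ...but every later call returns the same memoized path, which now
# points to a directory that no longer exists.
assert get_data_dir() == path
```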
Basically there are three possibilities here:
- the file is actually supposed to be stored for a long time and we move it to the cache folder (unlikely, but @hadrilec will know)
- the file is supposed to last the whole python session, in which case we actually need to keep a list of these tmpfiles/dirs and clear them all up in an `atexit` function (see the sketch below)
- the file is really supposed to be temporary (how temporary, then?) and we should remove the `lru_cache` from `get_capabilities`
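For the second option, a minimal sketch of the `atexit` bookkeeping I have in mind (hypothetical helper names):

```python
import atexit
import shutil
import tempfile

_SESSION_TMPDIRS: list[str] = []

def make_session_tmpdir() -> str:
    # Track every tempdir created during the session so that they can
    # all be removed when the interpreter exits.
    path = tempfile.mkdtemp()
    _SESSION_TMPDIRS.append(path)
    return path

@atexit.register
def _cleanup_session_tmpdirs() -> None:
    for path in _SESSION_TMPDIRS:
        shutil.rmtree(path, ignore_errors=True)
```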
Yes, you're right. In fact, I first went a bit aggressively at the cleaning after it froze my WSL partition, and had to scale back after seeing the `lru_cache` trick. I might have missed some checks, and I mostly relied on the automated tests to find what went well and what went wrong. That may entirely be the case with the nested `lru_cache` you've found (the nesting being the reason why emptying the file from the tempdir in the inner function does not trigger an exception).
As far as I can see, tempdir creation is not consistent across the whole package. I suspect the `lru_cache`d function creating the tempdir was added later in pynsee's development to avoid continuous hard-drive writing. There are still some other non-lru-cached tempdirs created, though.
I'm not sure many tempdirs should be kept longer than short-term. To be honest, some tempdirs are not even necessary, as you could perfectly well unzip a file without writing to the hard drive, using an io object (with the caveat that this might not work with nested zips...). But that should be something for a future PR...
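Something along these lines, for the record (a sketch with a placeholder URL; nested zips would need more care):

```python
import io
import zipfile

import requests

# Download the archive and unzip it entirely in memory: no tempdir needed.
response = requests.get("https://example.com/data.zip")  # placeholder URL
response.raise_for_status()
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    for name in archive.namelist():
        raw = archive.read(name)  # bytes of each member, straight from RAM
```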
I think I'm good.
As I've said, I removed what seemed to me unnecessary calls to tempfiles. I've kept them in:
- `_download_store_file` (which now serves a fixed temp directory to `_download_file` and `_import_options`). It should clean up after itself; we could rewrite those functions to avoid the tempdir altogether, but the size of those files (> 1 GB of CSV) might be too much for some machines to hold in RAM;
- `_get_temp_dir`, which is now only used by `_create_insee_folder`. I've taken the liberty of adding a warning to the user, stating that if platformdirs fails, pynsee cannot be expected to clean up the tempdir at the end of the python session.
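The warning logic is roughly the following (a sketch of the idea, not the exact code from the PR):

```python
import tempfile
import warnings

import platformdirs

def _get_folder() -> str:
    try:
        # Preferred: a persistent per-user cache directory.
        return platformdirs.user_cache_dir("pynsee")
    except Exception:
        # Fallback: a plain tempdir that nothing tracks afterwards.
        warnings.warn(
            "platformdirs failed: pynsee cannot be expected to clean up "
            "this temporary directory at the end of the python session"
        )
        return tempfile.mkdtemp()
```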
I hope there aren't too many mistakes in that PR. If there are (which is probably the case), note that I will be absent for 2 weeks, so feel free to correct it yourself if needed.
EDIT
I've also lightly altered `_get_temp_dir` to simplify the code. The path to the insee folder should not have changed (if it has, then it is a mistake).
I've also had some thoughts for future PRs:
- wherever there is a call to `_create_insee_folder`, it might be worth checking whether functions can be rewritten to use the `@save_df` decorator instead (it could simplify maintenance in the long run);
- maybe we should run black once and for all to clean up the whole codebase. There is a lot of "garbage" in my PR due to black reformatting each modified file (sorry about that 😳);
- I've just seen there is a call to Nominatim in the sirene section:
  - it might be worth setting up a cache on Nominatim (which is strongly encouraged); I've just made one in french-cities using `diskcache` and that could be reused at almost no cost (with the slight downside of adding one more dependency), see the sketch after this list;
  - note that we could also get faster results using the BAN with its CSV endpoint;
  - in any case, I think this query is in violation of Nominatim's geocoding policy. We need to set a User-Agent (EDIT: no, that's already done) and a rate limiter / sleep between calls (with `backoff_factor` currently set to 1, the first retry fires after .5 s); if these are there somewhere, I've missed them.
- I've added a "TODO" comment in download/_download_store_file. This function does not use the `update` argument, and on first reading I'd say it isn't necessary at all, since the function is called by `download_file` (which uses the `@save_df` decorator).
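As announced above, a rough sketch of the `diskcache` + rate-limit idea (placeholder cache path and User-Agent; not pynsee's actual geocoding code):

```python
import time

import requests
from diskcache import Cache

cache = Cache("nominatim_cache")  # persistent on-disk cache directory

@cache.memoize()
def geocode(address: str) -> list:
    # Only cache misses reach this body, so sleeping here rate-limits the
    # actual calls to Nominatim (at most 1 request per second).
    time.sleep(1)
    response = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": address, "format": "json"},
        headers={"User-Agent": "pynsee-example"},  # placeholder User-Agent
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```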
If that's ok with you both, @hadrilec @tfardet, feel free to open issues on this without me.
Nominatim is called from `_get_location_openstreetmap`, which is, in my opinion, compliant with the guidelines.
> wherever there is a call to `_create_insee_folder`, it might be worth checking whether functions can be rewritten to use the `@save_df` decorator instead (it could simplify maintenance in the long run)
I plan on restarting the config PR soon, breaking it into smaller bits this time; this is definitely on my todo list and should indeed go away.
> maybe we should run black once and for all to clean up the whole codebase. There is a lot of "garbage" in my PR due to black reformatting each modified file (sorry about that 😳)
good point, we could also add this as a pre-commit hook
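E.g. a minimal `.pre-commit-config.yaml` (the `rev` pin below is just an example):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2  # pin to whatever black version we settle on
    hooks:
      - id: black
```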
> I've added a "TODO" comment in download/_download_store_file. This function does not use the `update` argument, and on first reading I'd say it isn't necessary at all, since the function is called by `download_file` (which uses the `@save_df` decorator)
OK, I'll check and will open an issue about it if it seems necessary
> Nominatim is called from `_get_location_openstreetmap`, which is, in my opinion, compliant with the guidelines
Yes about the User-Agent (sorry I missed that, I was looking at the session object and didn't even see you added it on each get call...) and the cache (let's say I have no excuse for missing that...).
And also yes about the rate. I've just tested your example from the doc and it queries Nominatim at a rate of 1.4 requests/sec. It might be something I misunderstood about requests' retry system (probably), or even the lag time to run a query and write it to disk (which I doubt). I'll have to check what I've missed once I get some time. Well, sorry about that too.
😳
@hadrilec @tfardet do you want to review this once more or should this be merged?
I will have a look, thanks