Spine-Toolbox
Importing large datasets eats all memory
I have an example project from a user that imports a lot of data from numerous .csv files (see the discussion in #1742). Because Importer reads all data from all files before writing anything to the database, executing the project eventually exhausts all memory on my system. A simple relief would be to write the data to the database after each file has been processed, but there is still the problem that we cache database data for validation. As the database grows, so does the memory footprint of the cache, leading again to an out-of-memory situation. Not sure what to do with this yet.
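For reference, a minimal sketch of the per-file approach, assuming spinedb_api's DiffDatabaseMapping, import_data and commit_session; the read_objects helper and its one-object-per-row .csv layout are made up for illustration:

```python
import csv
from spinedb_api import DiffDatabaseMapping, import_data

def read_objects(path):
    # Hypothetical layout: each row is "object_class,object_name".
    with open(path, newline="") as csv_file:
        rows = [row[:2] for row in csv.reader(csv_file) if row]
    classes = sorted({class_name for class_name, _ in rows})
    return {"object_classes": classes, "objects": rows}

def import_files_one_by_one(csv_files, db_url):
    db_map = DiffDatabaseMapping(db_url)
    for path in csv_files:
        count, errors = import_data(db_map, **read_objects(path))
        if errors:
            print(f"{path}: {len(errors)} import error(s)")
        # Committing after each file flushes the pending items to the
        # database instead of keeping everything in memory for the run.
        db_map.commit_session(f"Imported {path}")
    db_map.connection.close()
```

This only bounds the memory taken by the not-yet-written import data; as noted above, the validation cache still grows with the database itself.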
I see, nice one. I believe the make_cache argument to import_data can be set to None to skip all caching and run the queries directly from the db. That could be an option if we exposed it to the user, or even automatically detected that memory is about to run out and switched to that mode.
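Roughly what that could look like, assuming import_data accepts a make_cache keyword as described above (the database URL and item names here are made up):

```python
from spinedb_api import DiffDatabaseMapping, import_data

db_map = DiffDatabaseMapping("sqlite:///large_import.sqlite")
# Passing make_cache=None would skip building the in-memory cache,
# so validation queries go straight to the database at the cost of speed.
count, errors = import_data(
    db_map,
    object_classes=["unit"],
    objects=[("unit", "power_plant_1")],
    make_cache=None,
)
db_map.commit_session("Import without caching")
```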
We could also consider implementing some form of incremental validation that doesn't need to look at the entire db every time, but that doesn't sound straightforward.