
Importing large datasets eats all memory

Open · soininen opened this issue 2 years ago · 1 comment

I have an example project from a user that imports a lot of data from numerous .csv files (see the discussion in #1742). Because Importer reads all data from all files before writing anything to the database, executing the project eventually exhausts all memory on my system. A simple relief would be to write the data to the database after each file has been processed, but there is still the problem that we cache database data for validation: as the database grows, so does the memory footprint of the cache, leading again to an out-of-memory situation. Not sure what to do with it yet.
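
To illustrate the per-file relief, something along these lines (a rough sketch only, not the actual Importer code; `DiffDatabaseMapping`, `import_data` and `commit_session` are the spinedb_api names I have in mind, while `read_csv_rows` is a hypothetical stand-in for the Importer's mapping step):

```python
import csv

import spinedb_api


def read_csv_rows(path):
    # Hypothetical stand-in for the Importer's mapping step: yields
    # (object_class, object, parameter, value) rows from one .csv file.
    with open(path, newline="") as csv_file:
        yield from (tuple(row) for row in csv.reader(csv_file))


def import_files_one_by_one(db_url, csv_paths):
    # Import one file at a time and commit after each file, so only that
    # file's data needs to be held in memory at once.
    db_map = spinedb_api.DiffDatabaseMapping(db_url)
    try:
        for path in csv_paths:
            data = {"object_parameter_values": list(read_csv_rows(path))}
            count, errors = spinedb_api.import_data(db_map, **data)
            db_map.commit_session(f"Imported {path}")
            for error in errors:
                print(error)
    finally:
        db_map.connection.close()
```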

soininen · Sep 12 '22, 13:09

I see, nice one. I believe the make_cache argument to import_data can be set to None to skip all caching and run the queries directly against the db. That could be an option if we exposed it to the user, or even automatically detected that memory is about to run out and switched to that mode.
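
For instance, something like this (a minimal sketch; I'm assuming here that passing make_cache=None really does make import_data query the db directly instead of building a cache):

```python
import spinedb_api


def import_without_cache(db_map, data):
    # Assumption per the comment above: make_cache=None skips building the
    # in-memory validation cache, so memory use stays flatter as the db grows.
    count, errors = spinedb_api.import_data(db_map, make_cache=None, **data)
    db_map.commit_session("Imported without building a validation cache")
    return count, errors
```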

We could also consider implementing some form of incremental validation that doesn't need to look at the entire db every time, but that doesn't sound straightforward.
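
As a very rough illustration of that direction (a sketch only, not a real proposal; db_map.query and object_sq are the existing spinedb_api query helpers, find_missing_objects is hypothetical): instead of caching the whole db, we could query only for the names that appear in the chunk currently being imported:

```python
def find_missing_objects(db_map, object_names):
    # Query only the objects referenced by the current chunk instead of
    # loading everything into a cache up front.
    existing = {
        row.name
        for row in db_map.query(db_map.object_sq).filter(
            db_map.object_sq.c.name.in_(object_names)
        )
    }
    return set(object_names) - existing
```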

manuelma · Sep 12 '22, 13:09