
run_startup() has measurable memory overhead that could be improved for large datasets

Open · MahmoodEtedadi opened this issue on Jun 13, 2025 · 0 comments

Problem Summary

When using sm.run_startup() with large datasets, memory usage becomes a concern. In testing, peak memory consumption was approximately 1.6× to 1.7× the in-memory size of the DataFrame that seismometer ultimately retains. The overhead comes from internal processing such as indexing, filtering, cohort construction, and temporary intermediate structures that scale with row count, and it can make large data impractical on systems with limited RAM. It may be worth exploring ways to reduce memory usage in these scenarios, either through internal optimizations or through support for more memory-efficient data handling.
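For concreteness, one rough way to reproduce this kind of measurement is to sample the process's resident set size while run_startup() executes and compare the peak growth against the raw data's own in-memory size. This is only a sketch: the file path, the config_path value, and the polling interval are illustrative assumptions, not anything seismometer prescribes.

```python
# Rough reproduction sketch (not part of seismometer): sample process RSS
# in a background thread while sm.run_startup() runs, then compare peak
# growth to the raw data's in-memory size. Paths are hypothetical.
import os
import threading
import time

import pandas as pd
import psutil

import seismometer as sm

proc = psutil.Process(os.getpid())


def rss() -> int:
    """Current resident set size of this process, in bytes."""
    return proc.memory_info().rss


# Reference size of the raw data (hypothetical path). Freed pages may not
# be returned to the OS, so treat the resulting ratio as approximate.
df = pd.read_parquet("predictions.parquet")
data_bytes = int(df.memory_usage(deep=True).sum())
del df

baseline = rss()
peak = baseline
stop = threading.Event()


def sample() -> None:
    """Poll RSS periodically to catch transient peaks during startup."""
    global peak
    while not stop.is_set():
        peak = max(peak, rss())
        time.sleep(0.01)


threading.Thread(target=sample, daemon=True).start()
sm.run_startup(config_path="config.yml")  # config location is an assumption
stop.set()

print(f"raw data:            {data_bytes / 2**20:,.0f} MiB")
print(f"peak startup growth: {(peak - baseline) / 2**20:,.0f} MiB "
      f"(~{(peak - baseline) / data_bytes:.1f}x the data)")
```

Because the sampler only runs when the GIL is released and allocators may hold freed pages, the ratio is approximate, but it is enough to observe overhead of the magnitude described above.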

Impact

The memory overhead in run_startup() can make large datasets difficult to work with on machines with limited RAM, potentially causing out-of-memory crashes or preventing full-data analysis in environments where the data itself would otherwise fit.

Possible Solution

One potential approach is to optimize internal data handling to reduce memory overhead, for example by avoiding unnecessary copies, reusing structures, or restructuring the logic to process the full dataset in smaller internal chunks while still retaining access to all of the data (see the sketch below). This would limit peak memory usage without requiring external chunking or sacrificing full-dataset context. Another direction could be compatibility with libraries like Dask, which enable more memory-efficient, lazy or partitioned computation, though this would involve broader architectural changes.
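As a sketch of the internal-chunking idea (not seismometer's actual code; the function, column names, and chunk size are all hypothetical), an aggregation can walk the retained frame in row slices so that only one slice's intermediates are alive at a time, rather than materializing filtered copies of the whole frame:

```python
# Illustrative only: these names are not seismometer internals.
from __future__ import annotations

import pandas as pd


def summarize_cohorts(df: pd.DataFrame, chunk_rows: int = 250_000) -> pd.DataFrame:
    """Per-cohort count/sum/mean of a 'score' column, accumulated chunk by chunk."""
    parts: list[pd.DataFrame] = []
    for start in range(0, len(df), chunk_rows):
        chunk = df.iloc[start:start + chunk_rows]
        # Each per-chunk result is tiny: one row per cohort seen in the slice.
        parts.append(chunk.groupby("cohort")["score"].agg(["count", "sum"]))
    # Re-aggregate the small per-chunk results across chunks.
    out = pd.concat(parts).groupby(level=0).sum()
    out["mean"] = out["sum"] / out["count"]
    return out
```

The same shape of computation maps naturally onto Dask, where the frame is loaded lazily in partitions and peak memory stays bounded by a few partitions rather than the whole dataset, though plumbing this through seismometer is the larger architectural change noted above. A minimal illustration, again with a hypothetical path:

```python
import dask.dataframe as dd

ddf = dd.read_parquet("predictions.parquet")  # lazy, partitioned load
means = ddf.groupby("cohort")["score"].mean().compute()
```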
