Memory usage with high # of predicates
Possible area of enhancement: memory usage spikes when a high number of predicates is specified.
I remember that when I profiled this, memory peaked during the creation of the predicate columns.
@Jwoo5 could you please help confirm if this is the case for your dataset/tasks as well? I just used mprof to run a script with the extraction code and looked at the memory plots.
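For reference, a minimal sketch of the profiling setup (the `run_extraction` entry point is hypothetical; substitute the actual extraction script). The equivalent CLI flow is `mprof run <script>` followed by `mprof plot`:

```python
from memory_profiler import memory_usage

# Hypothetical entry point wrapping the extraction code; replace with
# the real script/function under test.
def run_extraction():
    ...

# Sample resident memory every 0.5s while the extraction runs; the peak
# of this series is what spikes during predicate-column creation.
samples = memory_usage((run_extraction, (), {}), interval=0.5)
print(f"peak memory: {max(samples):.1f} MiB")
```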
Tagging @mmcdermott
Yes, the memory peaked when creating the predicate columns.
It would be great if we could expand the predicate grammar to support an "any of codes" operation without creating intermediate predicate columns to realize the final expression (e.g., `or(...)`). See the sketch below.
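To make the idea concrete, here is a rough sketch (assuming a polars frame with a `code` column; the data and predicate names are made up): an `is_in`-based "any of codes" predicate materializes a single column, whereas the per-code approach materializes one column per code before `or(...)`-ing them together:

```python
import polars as pl

# Toy event data; real extractions have one row per event.
events = pl.DataFrame({
    "subject_id": [1, 1, 2],
    "code": ["ICD9:401.9", "ICD9:250.00", "ICD9:410.1"],
})
codes = ["ICD9:401.9", "ICD9:250.00"]

# Per-code approach (sketch): one Int64 column per code, later combined
# with or(...) -- memory grows with the number of codes.
per_code = events.with_columns(
    [(pl.col("code") == c).cast(pl.Int64).alias(f"is_{c}") for c in codes]
)

# "Any of codes" approach: a single column via is_in, with no
# intermediate predicate columns.
any_of = events.with_columns(
    pl.col("code").is_in(codes).cast(pl.Int64).alias("any_of_codes")
)
```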
FYI: when I used only ~70 predicates to define the task and process the same data, it took about 14 minutes and ~15GB of RAM. With ~1400 predicates, it took about 2 hours and ~150GB of RAM: a ~20x increase in predicates cost roughly 10x the memory, so I believe the number of predicates is the main problem.
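For intuition, a back-of-envelope estimate (the row count is purely hypothetical, for illustration only): dense Int64 predicate columns cost roughly `n_predicates * n_rows * 8` bytes, so peak memory grows linearly with the number of predicates:

```python
# Rough cost of materializing one dense Int64 column per predicate.
N_ROWS = 10_000_000  # hypothetical event count, for illustration only

for n_predicates in (70, 1400):
    gib = n_predicates * N_ROWS * 8 / 2**30
    print(f"{n_predicates:>5} predicates -> ~{gib:,.1f} GiB of predicate columns")
```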
Ok, this is great information. We can pretty easily support this.
I think #90 will likely solve this, so we'll relegate active analysis to that issue. In the end, we may also need to invest in dedicated, easy-to-use profiling scripts to better understand computational performance, but hopefully #90 will eliminate this issue and we'll be good to go.
Can we check if #90 actually solved this? @justin13601 or @Jwoo5