knime-rdkit

Adjust parallelization of TwoComponentReaction node to significantly reduce memory usage

Open chaubold opened this issue 9 months ago • 0 comments

The TwoComponentReaction node submitted tasks to an executor service using the following scheme: one task for each element in the first input column. Each task then performed the reaction of this element (the reactant) with all reactants of the second input column, so the output of a task is the list of reaction results for all of these pairings. That output needs to be kept in memory until it has been written out.
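The scheme described above can be sketched roughly as follows. This is a minimal illustration and not the node's actual code: the class `PerRowTasks`, the method `runAll`, and the string-based `react` helper are hypothetical stand-ins, with `react` taking the place of the real RDKit reaction call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerRowTasks {
    // Hypothetical stand-in for running the reaction on one pair of reactants.
    static String react(String a, String b) {
        return a + "+" + b;
    }

    // One task per element of the first column. Each task builds the complete
    // list of products for that element paired with every second-column element.
    static List<String> runAll(List<String> first, List<String> second, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String a : first) {
                Callable<List<String>> task = () -> {
                    List<String> products = new ArrayList<>(second.size());
                    for (String b : second) {
                        products.add(react(a, b));
                    }
                    return products; // the whole list is held until it is consumed
                };
                futures.add(pool.submit(task));
            }
            List<String> out = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                out.addAll(f.get()); // stands in for writing the results out, in input order
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAll(List.of("A", "B"), List.of("x", "y", "z"), 2));
    }
}
```

The point to notice is that each task's return value grows with the size of the second column, regardless of how small the first column is.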

Now imagine the second input column has a lot of rows, meaning each task needs to keep a lot of results in memory.

The thread pool is configured to use ~2x as many threads as there are CPU cores. On a 4-core CPU this means 8 tasks run in parallel, so at least 8 large results need to be kept in memory at the same time.
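As a rough back-of-the-envelope estimate of that effect, the number of products held by the running tasks scales with the thread count times the length of the second column. The sketch below assumes exactly 2 threads per core and one product per pairing; both are simplifications, and the class and method names are made up for illustration.

```java
class OldSchemeMemoryEstimate {
    // Rough count of products held by running tasks shortly before they finish,
    // assuming 2 threads per core and one product per reactant pairing.
    static long productsInFlight(int cpuCores, long rowsInSecondColumn) {
        int threads = 2 * cpuCores;
        return (long) threads * rowsInSecondColumn;
    }

    public static void main(String[] args) {
        // 4 cores and 1,000,000 rows in the second column
        System.out.println(productsInFlight(4, 1_000_000L)); // prints 8000000
    }
}
```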

Changed with this commit: each reaction is handled as an individual task. While this might increase the bookkeeping overhead, it ensures that far fewer results need to be kept in memory. In practice this showed much better performance, because the operating system doesn't need to manage as much memory (the memory in question lies outside the JVM, since RDKit molecules are allocated in native C++ code and accessed via JNI).
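A minimal sketch of the per-reaction scheme is shown below. It is not taken from the commit: `PerPairTasks`, `runAll`, `react`, and in particular the `maxPending` window that caps how many submitted results are outstanding are assumptions made for this illustration. How the actual node limits outstanding work is not described here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerPairTasks {
    // Hypothetical stand-in for running the reaction on one pair of reactants.
    static String react(String a, String b) {
        return a + "+" + b;
    }

    // One task per reactant pair. At most maxPending (>= 1) futures are
    // outstanding, so the number of unwritten results stays bounded by that window.
    static List<String> runAll(List<String> first, List<String> second,
                               int threads, int maxPending) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Deque<Future<String>> pending = new ArrayDeque<>();
            List<String> out = new ArrayList<>();
            for (String a : first) {
                for (String b : second) {
                    Callable<String> task = () -> react(a, b);
                    pending.addLast(pool.submit(task));
                    if (pending.size() >= maxPending) {
                        // Drain the oldest result first; this keeps input order.
                        out.add(pending.removeFirst().get());
                    }
                }
            }
            while (!pending.isEmpty()) {
                out.add(pending.removeFirst().get());
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAll(List.of("A", "B"), List.of("x", "y", "z"), 2, 4));
    }
}
```

Each task now returns a single product, so the amount of result data waiting to be written depends on the pending window rather than on the length of the second column. The cost is one `Future` and one submission per pairing, which is the bookkeeping overhead mentioned above.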

chaubold, Mar 31 '25 14:03