kinoml

Memory usage during featurization

Open jaimergp opened this issue 3 years ago • 0 comments

Recent changes to the featurization pipeline altered how the featurizers iterate over the systems in a dataset.

Previously, a single system would go all the way through the pipeline (composed of N featurizers). This made it difficult to auto-detect global properties of the dataset (e.g. the maximum length to pad bit-vectors to), so we refactored the pipeline to traverse featurizers first, with each featurizer processing the whole dataset.

# before
for system in systems:
    for featurizer in featurizers:
        featurizer.featurize(system)

# now
for featurizer in featurizers:
    featurizer.featurize(systems)

This, however, implies that all the artifacts created by each featurizer coexist in memory for the full dataset, which means higher peak memory usage. To give some numbers, ChEMBL28 (158K systems) peaks at around 6 GB of RAM, mostly from the RDKit molecules created from the SMILES. We do clear the featurizations dictionary after each pass by default (recent change), but I am writing this down as an issue because memory might become a bottleneck for more complex schemes. These might require featurizing datasets in batches.
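
A rough sketch of what batched featurization could look like, not a proposal for the actual implementation. It assumes each featurizer exposes `featurize(systems)` and returns the featurized systems; `batched`, `featurize_in_batches`, `collect` and `ToyFeaturizer` are hypothetical names made up for illustration. Only one batch of intermediate artifacts is alive at a time, so peak memory should scale with `batch_size` rather than with the dataset size.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def featurize_in_batches(systems, featurizers, batch_size, collect):
    """Run every featurizer on one batch before moving to the next batch.

    ASSUMPTION: `featurizer.featurize(batch)` returns the featurized batch.
    `collect` is a hypothetical callback that stores the final features
    (e.g. appends to a list or writes to disk).
    """
    for batch in batched(systems, batch_size):
        for featurizer in featurizers:
            batch = featurizer.featurize(batch)
        collect(batch)  # intermediates for this batch can now be freed


class ToyFeaturizer:
    """Hypothetical stand-in: maps each SMILES string to its length."""

    def featurize(self, systems):
        return [len(s) for s in systems]


results = []
featurize_in_batches(
    ["CCO", "c1ccccc1", "N", "CC(=O)O", "C"],
    [ToyFeaturizer()],
    batch_size=2,
    collect=results.extend,
)
print(results)  # [3, 8, 1, 7, 1]
```

The catch is that batching brings back the original problem: global properties such as the maximum padding length are no longer visible within a single batch, so they would have to be computed in a cheap first pass or supplied by the user.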

jaimergp · May 18 '21 20:05