Memory usage during featurization
Recent changes to the featurization pipeline altered how featurizers iterate over the systems in a dataset.
Previously, a single system would go all the way through the pipeline (composed of N featurizers). This made it difficult to auto-detect global properties of the dataset (e.g. the maximum length to pad bit-vectors to), so we refactored the pipeline to traverse featurizers first.
```python
# before: each system runs through the whole pipeline
for system in systems:
    for featurizer in featurizers:
        featurizer.featurize(system)

# now: each featurizer processes the whole dataset
for featurizer in featurizers:
    featurizer.featurize(systems)
```
This, however, means that ALL the artifacts created by each featurizer coexist in memory for the full dataset; in other words, higher memory usage. To give some numbers, ChEMBL28 (158K systems) peaks at around 6 GB of RAM, mostly due to the RDKit molecules created from the SMILES strings. We do clear the featurizations dictionary after each pass by default (a recent change), but I am writing this down as an issue because it might become a bottleneck for more complex schemes. Those might require featurizing datasets in batches.
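A batched scheme could keep the featurizer-first traversal while bounding peak memory to one batch at a time. Below is a minimal, self-contained sketch of that idea; the toy `SmilesLengthFeaturizer`, the `batch_size` parameter, and `featurize_in_batches` are illustrative assumptions, not KinOML's actual API (only the `featurize` method name mirrors the snippet above).

```python
class SmilesLengthFeaturizer:
    """Toy featurizer: records the length of each system's SMILES string."""

    def featurize(self, systems):
        for system in systems:
            system["features"] = len(system["smiles"])


def featurize_in_batches(featurizers, systems, batch_size=2):
    """Run every featurizer over one batch before moving to the next,
    so only one batch's intermediate artifacts are alive at a time.
    Peak memory then scales with batch_size, not the full dataset."""
    for start in range(0, len(systems), batch_size):
        batch = systems[start:start + batch_size]
        for featurizer in featurizers:
            featurizer.featurize(batch)
        # Any per-batch intermediates (e.g. RDKit molecules) could be
        # cleared here before the next batch is processed.
    return systems


systems = [{"smiles": s} for s in ("CCO", "c1ccccc1", "CC(=O)O")]
featurize_in_batches([SmilesLengthFeaturizer()], systems)
print([s["features"] for s in systems])  # [3, 8, 7]
```

The trade-off is that global properties (like a padding length) can only be auto-detected per batch, so they would need to be fixed up front or computed in a cheap first pass over the dataset.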