Coordgen is slower than RDKit's native 2d coordinate generation

Open d-b-w opened this issue 5 years ago • 6 comments

Coordgen is slower than RDKit's native 2D coordinate generation. On average it is about 100x slower, and in the worst cases coordgen can take multiple seconds.

The two tools don't do the same things, and I think that coordgen results are much better, so the comparison is not totally fair. I do think that coordgen should target consistently producing coordinates in less than 0.1s, with averages closer to 0.001s. This will allow us to discuss making coordgen the default in RDKit, which would be cool.

I'm going to link to the internal Schrödinger bug tracker, and our internal display for performance testing below, sorry...

  • https://jira.schrodinger.com/browse/CRDGEN-244
  • https://jira.schrodinger.com/browse/CRDGEN-212
  • https://stu.schrodinger.com/performance/Shared%20components/Coordgen%20Performance

At the time I post this, our automated performance testing says that:

| 2D coordinate generator | Average speed (s) | Slowest (s) | Count > 0.1s | Count > 1s |
|---|---|---|---|---|
| RDKit native | 0.00035 | 0.04 | 0 | 0 |
| coordgen | 0.028 | 3.9 | 17 | 235 |
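
For anyone without access to the internal dashboard, the columns above can be approximated with a few lines of Python. This is only a minimal sketch, not the actual performance harness: `time_each` and `summarize` are hypothetical helper names, and the thresholds are taken from the column headers.

```python
import time
from statistics import mean


def time_each(func, items):
    """Return the wall-clock seconds spent in func(item), one entry per item."""
    timings = []
    for item in items:
        start = time.perf_counter()
        func(item)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings, thresholds=(0.1, 1.0)):
    """Compute the average, the slowest time, and the count above each threshold."""
    summary = {"average": mean(timings), "slowest": max(timings)}
    for threshold in thresholds:
        summary[f"count > {threshold}s"] = sum(1 for t in timings if t > threshold)
    return summary
```

Assuming a list of RDKit molecules, `func` would be `rdDepictor.Compute2DCoords`, called once after `rdDepictor.SetPreferCoordGen(True)` and once after `rdDepictor.SetPreferCoordGen(False)` to compare the two generators.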

d-b-w, Sep 26 '19 22:09

@d-b-w, it might be a good idea to add some of the molecules (especially the slow ones) from these benchmarks as tests in this repository.
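
One possible way to pick such molecules, sketched under the assumption that the benchmark set is available as a Python list: time each call and keep the worst offenders. `slowest_items` is a hypothetical helper, not part of coordgen or RDKit.

```python
import heapq
import time


def slowest_items(func, items, n=10):
    """Return the n (seconds, index) pairs with the longest func(item) runtime."""
    timed = []
    for index, item in enumerate(items):
        start = time.perf_counter()
        func(item)
        timed.append((time.perf_counter() - start, index))
    # nlargest sorts by the first tuple element, i.e. the elapsed time
    return heapq.nlargest(n, timed)
```

The returned indices can then be mapped back to the input structures, so that the slowest cases can be added to the test suite.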

ricrogz, Sep 27 '19 01:09

Sorry for reviving this 2-year-old ticket. I have just stumbled on the same problem with an internal dataset using the latest RDKit 2021.03.1 release, so I decided to reproduce it on public data: I fetched 2000 indoles with 50 to 60 heavy atoms from ChEMBL (CSV file attached: chembl27_2000_indoles_50-60_ha.csv.gz).

Native RDKit depiction of these 2000 molecules takes ~3 s:

%%time
rdDepictor.SetPreferCoordGen(False)
for m in mols:
    rdDepictor.Compute2DCoords(m)
CPU times: user 3.02 s, sys: 23 ms, total: 3.05 s
Wall time: 3.04 s

CoordGen takes ~360x longer:

%%time
rdDepictor.SetPreferCoordGen(True)
for m in mols:
    rdDepictor.Compute2DCoords(m)
CPU times: user 18min 10s, sys: 868 ms, total: 18min 11s
Wall time: 18min 10s

At the moment, this means that CoordGen cannot be used to depict large-ish molecules in a table. Do you have plans to address this in the near future? Thanks a lot in advance.

ptosco, Apr 09 '21 16:04

Ugh, we just accidentally blew up coordgen time by at least 10x, which should be addressed in #90.

Sorry about that. When #90 is merged, I'll immediately issue a patch release of coordgen and post a PR to RDKit.

We're definitely hoping to do further work on this before the fall RDKit release. The bug in #90 actually provides some clues to next steps.

d-b-w, Apr 09 '21 16:04

Thank you for the super-fast reply, Dan! Looking forward to the PR.

ptosco, Apr 09 '21 19:04

Thanks Dan! It looks much better now :-)

%%time
rdDepictor.SetPreferCoordGen(True)
for m in chembl_mols_2000:
    rdDepictor.Compute2DCoords(m)
CPU times: user 2min 5s, sys: 53 ms, total: 2min 5s
Wall time: 2min 5s

ptosco, Apr 10 '21 07:04

Great! This issue should remain open; I feel like the current rate is still too slow, but it's acceptable for many use cases.

d-b-w, Apr 10 '21 20:04