[RFC] Enabling Cython-based PDB parser backend for speed improvements

Open a-r-j opened this issue 2 years ago • 6 comments

Describe the workflow you want to enable

Currently, the pure-Python implementation of PDB parsing in BioPandas is quite slow, certainly too slow for high-throughput structural bioinformatics or ML.

Describe your proposed solution

I have written a Cython-based implementation (CPDB) which is considerably faster and would like to set this as the default parsing backend. As it stands, I believe this to be one of the fastest (if not the fastest) available PDB parsers for Python.

[Image: performance comparison (Screenshot 2023-08-29 at 13 25 44)]

However, given BioPandas' widespread usage, I am unsure whether distributing it with a Cython component will lead to dependency problems for users.

Describe alternatives you've considered, if relevant

Speeding up the passage of time

Additional context

a-r-j avatar Aug 29 '23 11:08 a-r-j

@a-r-j This is super cool.

Btw. perhaps we don't need to worry about extra dependencies here because NumPy already uses Cython (https://github.com/numpy/numpy/blob/main/build_requirements.txt), and pandas is built on NumPy, and BioPandas is built on pandas :P

rasbt avatar Aug 29 '23 11:08 rasbt

That's a good point! I was mainly concerned about the potential for build problems (cpdb is my first time working with Cython). I'll make a PR tonight and push a dev release so we can collect some feedback.

a-r-j avatar Aug 29 '23 13:08 a-r-j
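
One common way to limit build or dependency problems is to treat the compiled parser as optional and fall back to the pure-Python path when it cannot be imported. The sketch below only illustrates that pattern and is not BioPandas' actual mechanism: the function name `pick_backend`, the returned labels, and the default module name `cpdb` are all assumptions made for the example.

```python
import importlib.util


def pick_backend(optional_module: str = "cpdb") -> str:
    """Return "compiled" if the optional module is importable, else "pure-python".

    `optional_module` is a hypothetical top-level module name; nothing is
    imported here, we only ask the import system whether it can be found.
    """
    if importlib.util.find_spec(optional_module) is not None:
        return "compiled"
    return "pure-python"


if __name__ == "__main__":
    # Prints which path would be taken in the current environment.
    print(pick_backend())
```

With this kind of check, users without a working compiled extension would still get the slower parser instead of an import error.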

One difference in the comparison is that your Cython implementation only reads ATOM, HETATM, and ENDMDL lines, while biopandas reads all lines. It would be interesting to compare the performance if all lines are read (perhaps without parsing them the way biopandas does?).

Ruibin-Liu avatar Aug 30 '23 18:08 Ruibin-Liu
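
To make the difference concrete, here is a minimal pure-Python sketch of the two reading strategies: keeping only ATOM, HETATM, and ENDMDL lines versus keeping every line grouped by record name. It is not the code of cpdb or biopandas; the function names and the tiny synthetic `SAMPLE` are assumptions for illustration. It relies only on the PDB convention that the record name occupies the first six columns of a line.

```python
from collections import defaultdict

# Record types the comment above says the Cython parser reads.
COORD_RECORDS = ("ATOM", "HETATM", "ENDMDL")


def record_name(line: str) -> str:
    """The record name sits in columns 1-6 of a PDB line."""
    return line[:6].strip()


def read_coordinate_lines(pdb_text: str) -> list[str]:
    """Keep only coordinate-related lines; everything else is skipped."""
    return [ln for ln in pdb_text.splitlines() if record_name(ln) in COORD_RECORDS]


def read_all_lines(pdb_text: str) -> dict[str, list[str]]:
    """Keep every line, grouped by record name, without parsing the fields."""
    grouped = defaultdict(list)
    for ln in pdb_text.splitlines():
        grouped[record_name(ln)].append(ln)
    return dict(grouped)


# Synthetic sample, not taken from a real structure.
SAMPLE = "\n".join([
    "HEADER    EXAMPLE",
    "REMARK   1 SYNTHETIC SAMPLE",
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "HETATM    2  O   HOH A 101      10.000  10.000  10.000  1.00 20.00           O",
    "ENDMDL",
    "END",
])

if __name__ == "__main__":
    print(len(read_coordinate_lines(SAMPLE)))
    print(sorted(read_all_lines(SAMPLE)))
```

Timing both functions on the same real file (for example with `timeit`) would give a rough idea of how much the extra record types cost before any field parsing is done.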

@Ruibin-Liu Hmm, that's a really great point. I could add a read_header arg to cpdb. In any case, I wouldn't have thought it would make a huge difference to speed; in terms of line count, PDB files are mostly coordinates.

a-r-j avatar Aug 30 '23 18:08 a-r-j
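
The claim that PDB files are mostly coordinate lines is easy to check for any given file. The following is a small sketch, assuming only that the record name is in the first six columns; the function name `coordinate_fraction` is made up for this example and no real measurements are implied.

```python
from collections import Counter


def coordinate_fraction(pdb_text: str) -> float:
    """Fraction of non-blank lines that are ATOM or HETATM records."""
    counts = Counter(
        line[:6].strip() for line in pdb_text.splitlines() if line.strip()
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["ATOM"] + counts["HETATM"]) / total


if __name__ == "__main__":
    import sys

    # Usage: python coordinate_fraction.py path/to/structure.pdb
    with open(sys.argv[1]) as handle:
        print(f"{coordinate_fraction(handle.read()):.1%} of lines are coordinates")
```

Running this over a handful of representative files would show how much header content there is to read in practice.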