graphein RFAM family <-> PDB structure ID mapping

Here's a script to retrieve a mapping between RFAM families and PDB structure IDs. How could this be integrated into the codebase?

There are two points that might need to be addressed:

In the RFAM API, I couldn't figure out how to get a complete list of RFAM families. Judging from here, it seems that the family accession IDs follow the format: RF00001, RF00002, ..., RF04236. To retrieve all families, I introduced an argument max_id to specify the max ID limit (e.g. setting max_id=4236 will stop querying after RF04236).
Downloading the mappings for all families is time-consuming, since it seems we can only query a single family at a time; it took ~40 min on my laptop. Would it make sense to cache the data in graphein/datasets? Users should probably be able to re-download the data in case the RFAM database is updated.
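
To make the two points above concrete, here is a minimal sketch of the ID enumeration and of a cache with a forced refresh. It is not the code in this PR: `rfam_accessions`, `load_or_download`, and the `force` flag are hypothetical names, and the `download` callable stands in for whatever function actually queries RFAM.

```python
from pathlib import Path
from typing import Callable, List


def rfam_accessions(max_id: int, start: int = 1) -> List[str]:
    """Return accession IDs RF00001 ... RF<max_id>, zero-padded to 5 digits."""
    return [f"RF{i:05d}" for i in range(start, max_id + 1)]


def load_or_download(
    cache_path: Path, download: Callable[[], str], force: bool = False
) -> str:
    """Return cached text if present; otherwise call `download` and cache it.

    `force=True` ignores the cache so users can refresh after an RFAM update.
    (Hypothetical helper, not part of graphein.)
    """
    if cache_path.exists() and not force:
        return cache_path.read_text()
    data = download()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(data)
    return data
```

With this shape, `rfam_accessions(4236)` would enumerate every ID up to RF04236, and the 40-minute download would only be repeated when the caller passes `force=True`.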

Pull Request Checklist

[ ] Added a note about the modification or contribution to the ./CHANGELOG.md file (if applicable)
[ ] Added appropriate unit test functions in the ./graphein/tests/* directories (if applicable)
[ ] Modified documentation in the corresponding Jupyter Notebook under ./notebooks/ (if applicable)
[ ] Ran python -m py.test tests/ and made sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., python -m py.test tests/protein/test_graphs.py)
[x] Checked for style issues by running black . and isort .

Jun 27 '23 13:06 rvinas

Codecov Report

Attention: Patch coverage is 0% with 66 lines in your changes missing coverage. Please review.

Project coverage is 44.70%. Comparing base (8123f42) to head (2f3e6d6). Report is 184 commits behind head on master.

Files	Patch %	Lines
graphein/rna/download_rfam.py	0.00%	66 Missing :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #324      +/-   ##
==========================================
+ Coverage   40.27%   44.70%   +4.43%     
==========================================
  Files          48      114      +66     
  Lines        2811     7982    +5171     
==========================================
+ Hits         1132     3568    +2436     
- Misses       1679     4414    +2735

Jun 27 '23 13:06 codecov-commenter

Thanks for this Ramon, looks great! Let me have a think about how to integrate this more. An immediate thought is to couple this to the PDBManager which can take care of retrieving structures etc. We'd probably need some adaptations to support splitting RNA datasets properly.

Re finding families, is family.txt.gz what you want?

There's also Rfam.pdb.gz that could be helpful?

I think favouring the metadata/indices stored on the FTP server over the API might be better from a user POV (probably faster & no worries about being rate limited). We could make a wrapper for this similar to the PDBManager?
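
As a rough illustration of the FTP-based approach, the sketch below parses a gzipped, tab-separated file into a family-to-PDB-ID mapping. Several things here are assumptions rather than confirmed facts: the URL in `RFAM_PDB_URL`, the column layout (accession in column 0, PDB ID in column 1), and the function names. Rows whose accession does not start with `RF` are skipped, so a header row is tolerated whether or not the file has one.

```python
import csv
import gzip
import io
from collections import defaultdict
from typing import Dict, Set
from urllib.request import urlopen

# Assumed location of the file on the Rfam FTP mirror; verify before relying on it.
RFAM_PDB_URL = "https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.pdb.gz"


def parse_family_to_pdb(
    gz_bytes: bytes, acc_col: int = 0, pdb_col: int = 1
) -> Dict[str, Set[str]]:
    """Map each RFAM accession to the set of (lower-cased) PDB IDs it covers.

    The column indices are assumptions about the file layout.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    mapping: Dict[str, Set[str]] = defaultdict(set)
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) <= max(acc_col, pdb_col):
            continue  # skip blank or truncated lines
        acc = row[acc_col].strip()
        if not acc.startswith("RF"):
            continue  # skip a header row or anything that is not an accession
        mapping[acc].add(row[pdb_col].strip().lower())
    return dict(mapping)


def fetch_family_to_pdb(url: str = RFAM_PDB_URL) -> Dict[str, Set[str]]:
    """Download the gzipped file in a single request and parse it."""
    with urlopen(url) as response:
        return parse_family_to_pdb(response.read())
```

Keeping the parsing separate from the download means the parser can be unit-tested on a small in-memory fixture without touching the network.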

There also seems to be a ton of metadata on the FTP server. I'm not sure what else could be useful to pull in 🤔

Jun 27 '23 14:06 a-r-j

This sounds good. I feel quite silly, I completely missed these two files! Yes, downloading via the FTP server would definitely be much faster. I'll modify the script to just download these two files (and perhaps merge everything into a single dataframe?). We could then look into how to integrate this into PDBManager.
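
One possible shape for the merge step, sketched with plain Python so it does not depend on any particular library: each PDB row is joined with the family metadata on the accession. The field names (`rfam_acc`, `family_id`, `pdb_id`, `chain`) and the function `merge_family_and_pdb` are illustrative assumptions, not the actual columns of the two files.

```python
from typing import Dict, Iterable, List, Tuple


def merge_family_and_pdb(
    families: Dict[str, str],
    pdb_rows: Iterable[Tuple[str, str, str]],
) -> List[Dict[str, str]]:
    """Left-join PDB rows onto family metadata, keyed by RFAM accession.

    `families` maps accession -> family name; `pdb_rows` yields
    (accession, pdb_id, chain). Unknown accessions get an empty family name.
    """
    merged = []
    for rfam_acc, pdb_id, chain in pdb_rows:
        merged.append(
            {
                "rfam_acc": rfam_acc,
                "family_id": families.get(rfam_acc, ""),
                "pdb_id": pdb_id,
                "chain": chain,
            }
        )
    return merged
```

The resulting list of dicts can be passed directly to `pandas.DataFrame(...)` if a single dataframe is the desired end product.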

Jun 28 '23 06:06 rvinas

Kudos, SonarCloud Quality Gate passed!