graphein
RFAM family <-> PDB structure ID mapping
Here's a script to retrieve a mapping between RFAM families and PDB structure IDs. How could this be integrated into the codebase?
There are two points that might need to be addressed:
- In the RFAM API, I couldn't figure out how to get a complete list of RFAM families. Judging from here, it seems that the family accession IDs follow the format: RF00001, RF00002, ..., RF04236. To retrieve all families, I introduced an argument `max_id` to specify the max ID limit (e.g. setting `max_id=4236` will stop querying after RF04236).
- Downloading the mappings for all families is time-consuming (it seems we can only query a single family at a time); it took ~40 min on my laptop. Would it be good to cache the data in `graphein/datasets`? It might be important to allow the users to re-download the data in case of updates in the RFAM database.
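To make the two points above concrete, here is a minimal sketch of the accession enumeration and the caching idea. It is not the code in this PR: `rfam_accessions`, `load_or_build_mapping`, `fetch_pdb_ids` and `force_download` are hypothetical names, the per-family query is passed in as a callable rather than tied to a specific RFAM API endpoint, and the JSON cache format is an assumption.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterator, List


def rfam_accessions(max_id: int) -> Iterator[str]:
    """Yield RF00001, RF00002, ... up to and including max_id."""
    for number in range(1, max_id + 1):
        # Accessions are "RF" followed by a zero-padded five-digit number.
        yield f"RF{number:05d}"


def load_or_build_mapping(
    cache_path: Path,
    fetch_pdb_ids: Callable[[str], List[str]],  # hypothetical per-family query
    max_id: int,
    force_download: bool = False,
) -> Dict[str, List[str]]:
    """Return {rfam_acc: [pdb_id, ...]}, reusing a JSON cache when present."""
    if cache_path.exists() and not force_download:
        return json.loads(cache_path.read_text())

    mapping: Dict[str, List[str]] = {}
    for accession in rfam_accessions(max_id):
        pdb_ids = fetch_pdb_ids(accession)
        if pdb_ids:  # skip families without any mapped structure
            mapping[accession] = sorted(set(pdb_ids))

    # Write the cache so later calls do not repeat the slow per-family queries.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(mapping, indent=2))
    return mapping
```

Passing `force_download=True` rebuilds the cache, which would cover the case of updates in the RFAM database.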
Pull Request Checklist
- [ ] Added a note about the modification or contribution to the `./CHANGELOG.md` file (if applicable)
- [ ] Added appropriate unit test functions in the `./graphein/tests/*` directories (if applicable)
- [ ] Modify documentation in the corresponding Jupyter Notebook under `./notebooks/` (if applicable)
- [ ] Ran `python -m py.test tests/` and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., `python -m py.test tests/protein/test_graphs.py`)
- [x] Checked for style issues by running `black .` and `isort .`
Codecov Report
Attention: Patch coverage is 0% with 66 lines in your changes missing coverage. Please review.
Project coverage is 44.70%. Comparing base (8123f42) to head (2f3e6d6). Report is 184 commits behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| graphein/rna/download_rfam.py | 0.00% | 66 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #324 +/- ##
==========================================
+ Coverage 40.27% 44.70% +4.43%
==========================================
Files 48 114 +66
Lines 2811 7982 +5171
==========================================
+ Hits 1132 3568 +2436
- Misses 1679 4414 +2735
Thanks for this Ramon, looks great! Let me have a think about how to integrate this more. An immediate thought is to couple this to the PDBManager which can take care of retrieving structures etc. We'd probably need some adaptations to support splitting RNA datasets properly.
Re finding families, is family.txt.gz what you want?
There's also Rfam.pdb.gz that could be helpful?
I think favouring the metadata/indices stored on the FTP server over the API might be better from a user POV (probably faster & no worries about being rate limited). We could make a wrapper for this similar to the PDBManager?
There also seems to be a ton of metadata on the FTP server. I'm not sure what else could be useful to pull in 🤔
This sounds good. I feel quite silly, I completely missed these two files! Yes, downloading via the FTP server would definitely be much faster. I'll modify the script to just download these two files (and perhaps merge everything into a single dataframe?). We could then look into how to integrate this into PDBManager.
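A rough sketch of what the FTP-based approach could look like, using only the standard library. Everything specific here is an assumption to verify against the actual files: the URLs, that both files are tab-separated, that the first two columns of `family.txt.gz` are the accession and the family ID, and that the first two columns of `Rfam.pdb.gz` are the accession and the PDB ID. `read_tsv_gz` and `merge_family_pdb` are hypothetical helper names.

```python
import csv
import gzip
from collections import defaultdict
from typing import Dict, List, Set

# Assumed locations of the two files mentioned above; check before relying on them.
RFAM_FTP = "https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT"
FAMILY_URL = f"{RFAM_FTP}/database_files/family.txt.gz"
PDB_URL = f"{RFAM_FTP}/Rfam.pdb.gz"


def read_tsv_gz(source) -> List[List[str]]:
    """Read a gzipped tab-separated file (path or binary file object) into rows."""
    with gzip.open(
        source, mode="rt", encoding="utf-8", errors="replace", newline=""
    ) as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]


def merge_family_pdb(
    family_rows: List[List[str]], pdb_rows: List[List[str]]
) -> List[Dict[str, object]]:
    """Combine both tables into one record per family that has PDB structures."""
    # Assumption: column 0 is the accession, column 1 the family ID.
    names = {row[0]: row[1] for row in family_rows if len(row) > 1}

    pdb_ids: Dict[str, Set[str]] = defaultdict(set)
    for row in pdb_rows:
        # Rows not starting with an "RF" accession (e.g. a header) are skipped.
        if len(row) > 1 and row[0].startswith("RF"):
            pdb_ids[row[0]].add(row[1])

    return [
        {"rfam_acc": acc, "rfam_id": names.get(acc), "pdb_ids": sorted(ids)}
        for acc, ids in sorted(pdb_ids.items())
    ]
```

The files could be fetched once with `urllib.request.urlretrieve(FAMILY_URL, "family.txt.gz")` (and likewise for `PDB_URL`), and the resulting list of records can be passed directly to `pandas.DataFrame` if a single dataframe is preferred.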
Quality Gate passed
Issues
0 New issues
0 Accepted issues
Measures
0 Security Hotspots
No data about Coverage
No data about Duplication







