biopandas
biopandas copied to clipboard
Add mmCIF -> PDB conversion
Fixes #97
Description
I started adding the conversion method proposed by @mrauha.
I ran into an issue writing the test:
from pandas.testing import assert_frame_equal
def test_mmcif_pdb_conversion():
pdb =PandasPdb().fetch_pdb("3EIY")
mmcif = PandasMmcif().fetch_mmcif("3EIY")
mmcif_pdb = mmcif.get_pandas_pdb()
assert_frame_equal(pdb.df["ATOM"].drop(columns=["line_idx"]), mmcif_pdb.df["ATOM"].drop(columns=["line_idx"]))
assert_frame_equal(pdb.df["HETATM"].drop(columns=["line_idx"]), mmcif_pdb.df["HETATM"].drop(columns=["line_idx"]).reset_index(drop=True))
Which gives an error relating to atom numbering in the HETATM dataframes:
AssertionError Traceback (most recent call last)
/Users/arianjamasb/github/biopandas/dev.ipynb Cell 7' in <cell line: 11>()
[8](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=7) assert_frame_equal(pdb.df["ATOM"].drop(columns=["line_idx"]), mmcif_pdb.df["ATOM"].drop(columns=["line_idx"]))
[9](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=8) assert_frame_equal(pdb.df["HETATM"].drop(columns=["line_idx"]), mmcif_pdb.df["HETATM"].drop(columns=["line_idx"]).reset_index(drop=True))
---> [11](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=10) test_mmcif_pdb_conversion()
/Users/arianjamasb/github/biopandas/dev.ipynb Cell 7' in test_mmcif_pdb_conversion()
[6](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=5) mmcif_pdb = mmcif.get_pandas_pdb()
[8](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=7) assert_frame_equal(pdb.df["ATOM"].drop(columns=["line_idx"]), mmcif_pdb.df["ATOM"].drop(columns=["line_idx"]))
----> [9](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=8) assert_frame_equal(pdb.df["HETATM"].drop(columns=["line_idx"]), mmcif_pdb.df["HETATM"].drop(columns=["line_idx"]).reset_index(drop=True))
[... skipping hidden 2 frame]
File ~/opt/anaconda3/envs/biopandas/lib/python3.8/site-packages/pandas/_libs/testing.pyx:52, in pandas._libs.testing.assert_almost_equal()
File ~/opt/anaconda3/envs/biopandas/lib/python3.8/site-packages/pandas/_libs/testing.pyx:167, in pandas._libs.testing.assert_almost_equal()
File ~/opt/anaconda3/envs/biopandas/lib/python3.8/site-packages/pandas/_testing/asserters.py:682, in raise_assert_detail(obj, message, left, right, diff, index_values)
679 if diff is not None:
680 msg += f"\n[diff]: {diff}"
--> 682 raise AssertionError(msg)
AssertionError: DataFrame.iloc[:, 1] (column name="atom_number") are different
DataFrame.iloc[:, 1] (column name="atom_number") values are different (100.0 %)
[index]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
[left]: [1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1346, 1347, 1348, 1349, 1350, 1351, 1352, 1353, 1354, 1355, 1356, 1357, 1358, 1359, 1360, 1361, 1362, 1363, 1364, 1365, 1366, 1367, 1368, 1369, 1370, 1371, 1372, 1373, 1374, 1375, 1376, 1377, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386, 1387, 1388, 1389, 1390, 1391, 1392, 1393, 1394, 1395, 1396, 1397, 1398, 1399, 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423, 1424, 1425, 1426, 1427, 1428, 1429, 1430, 1431, ...]
[right]: [1331, 1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1346, 1347, 1348, 1349, 1350, 1351, 1352, 1353, 1354, 1355, 1356, 1357, 1358, 1359, 1360, 1361, 1362, 1363, 1364, 1365, 1366, 1367, 1368, 1369, 1370, 1371, 1372, 1373, 1374, 1375, 1376, 1377, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386, 1387, 1388, 1389, 1390, 1391, 1392, 1393, 1394, 1395, 1396, 1397, 1398, 1399, 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423, 1424, 1425, 1426, 1427, 1428, 1429, 1430, ...]
Any idea of the source of the discrepancy?
Update: This seems to be fixed now! 🥳
Related issues or pull requests
#97
Pull Request Checklist
- [ ] Added a note about the modification or contribution to the
./docs/sources/CHANGELOG.mdfile (if applicable) - [ ] Added appropriate unit test functions in the
./biopandas/*/testsdirectories (if applicable) - [ ] Modify documentation in the corresponding Jupyter Notebook under
biopandas/docs/sources/(if applicable) - [ ] Ran
PYTHONPATH='.' pytest ./biopandas -svand make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g.,PYTHONPATH='.' pytest ./biopandas/classifier/tests/test_stacking_cv_classifier.py -sv) - [ ] Checked for style issues by running
flake8 ./biopandas
Thanks for adding that! For the test, just curious, does this happen with other structures as well?
It's consistent across a few examples. I narrowed it down to TER records being the problem.
E.g.
# 3eiy.pdb
ATOM 1329 NZ LYS A 175 8.861 36.709 -13.496 1.00 63.20 N
ATOM 1330 OXT LYS A 175 10.451 35.432 -10.086 1.00 55.71 O
TER 1331 LYS A 175
HETATM 1332 K K A 176 24.990 43.276 0.005 0.50 24.45 K
HETATM 1333 NA NA A 177 1.633 34.181 11.897 1.00 26.73 NA
# 3eiy.cif
ATOM 1329 N NZ . LYS A 1 196 ? 8.861 36.709 -13.496 1.00 63.20 ? 175 LYS A NZ 1
ATOM 1330 O OXT . LYS A 1 196 ? 10.451 35.432 -10.086 1.00 55.71 ? 175 LYS A OXT 1
HETATM 1331 K K . K B 2 . ? 24.990 43.276 0.005 0.50 24.45 ? 176 K A K 1
HETATM 1332 NA NA . NA C 3 . ? 1.633 34.181 11.897 1.00 26.73 ? 177 NA A NA 1
Now, I'm not sure how reliably TER records are placed in PDB files (some info here ) and if we can be super comfortable just ignoring it. I imagine it would cause a bit of a headache trying to make selections consistent between the same structure in PDB & .cif formats.
Hmm... I think that we could have it without the TER records and just a note about that in the docs for now. So that people are aware. Otherwise, inferring the TER records would be some additional work (I guess that could be done based on the CIF structure index? But I don't have that much experience with TER). My guess is the primary usecase of BioPandas is data analysis, and the TER records are not super necessary anyways.
So, for now I'd say let's warn people about the TER entries but not worry about it.
Hello @a-r-j! Thanks for updating this PR.
- In the file
biopandas/mmcif/pandas_mmcif.py:
Line 534:89: E501 line too long (95 > 88 characters) Line 536:89: E501 line too long (106 > 88 characters)
Comment last updated at 2022-09-03 21:28:28 UTC
Sure, I added an offset arg so people at least have an access point. Hopefully that's a sufficient remedy for most usecases.
I think that works! Do you have a unit test handy? Otherwise, I can help looking into this!
Think I managed to infer it automatically based on the number of chains. There was also an issue in ATOM atom numbering for multichain proteins (again, TER records). So I figured we can infer the appropriate offsets for the ATOM and HETATM dfs from the number of chains.
Added appropriate unittest and seems to check out.
Wow this looks good! Currently traveling and will take one more close look in a calm moment, when I am not out and about. Thanks a lot!
Arg totally missed this and am currently traveling (on mobile). I will make a reminder to check this out as soon as I am back on my computer again!
Check out this pull request on ![]()
See visual diffs & provide feedback on Jupyter Notebooks.
Powered by ReviewNB
Just did a little documentation update (sorry for the delay, I was traveling during the last couple of weeks, and there's tons of stuff to catch up on)
Happy New Year @rasbt ! Anything I can do to get this PR merged? :)