biopandas Add mmCIF -> PDB conversion

Fixes #97

Description

I started adding the conversion method proposed by @mrauha.

I ran into an issue writing the test:

from pandas.testing import assert_frame_equal

def test_mmcif_pdb_conversion():
    pdb =PandasPdb().fetch_pdb("3EIY")
    mmcif = PandasMmcif().fetch_mmcif("3EIY")
    mmcif_pdb = mmcif.get_pandas_pdb()
    
    assert_frame_equal(pdb.df["ATOM"].drop(columns=["line_idx"]), mmcif_pdb.df["ATOM"].drop(columns=["line_idx"]))
    assert_frame_equal(pdb.df["HETATM"].drop(columns=["line_idx"]), mmcif_pdb.df["HETATM"].drop(columns=["line_idx"]).reset_index(drop=True))

Which gives an error relating to atom numbering in the HETATM dataframes:

AssertionError                            Traceback (most recent call last)
/Users/arianjamasb/github/biopandas/dev.ipynb Cell 7' in <cell line: 11>()
      [8](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=7)     assert_frame_equal(pdb.df["ATOM"].drop(columns=["line_idx"]), mmcif_pdb.df["ATOM"].drop(columns=["line_idx"]))
      [9](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=8)     assert_frame_equal(pdb.df["HETATM"].drop(columns=["line_idx"]), mmcif_pdb.df["HETATM"].drop(columns=["line_idx"]).reset_index(drop=True))
---> [11](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=10) test_mmcif_pdb_conversion()

/Users/arianjamasb/github/biopandas/dev.ipynb Cell 7' in test_mmcif_pdb_conversion()
      [6](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=5) mmcif_pdb = mmcif.get_pandas_pdb()
      [8](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=7) assert_frame_equal(pdb.df["ATOM"].drop(columns=["line_idx"]), mmcif_pdb.df["ATOM"].drop(columns=["line_idx"]))
----> [9](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=8) assert_frame_equal(pdb.df["HETATM"].drop(columns=["line_idx"]), mmcif_pdb.df["HETATM"].drop(columns=["line_idx"]).reset_index(drop=True))

    [... skipping hidden 2 frame]

File ~/opt/anaconda3/envs/biopandas/lib/python3.8/site-packages/pandas/_libs/testing.pyx:52, in pandas._libs.testing.assert_almost_equal()

File ~/opt/anaconda3/envs/biopandas/lib/python3.8/site-packages/pandas/_libs/testing.pyx:167, in pandas._libs.testing.assert_almost_equal()

File ~/opt/anaconda3/envs/biopandas/lib/python3.8/site-packages/pandas/_testing/asserters.py:682, in raise_assert_detail(obj, message, left, right, diff, index_values)
    679 if diff is not None:
    680     msg += f"\n[diff]: {diff}"
--> 682 raise AssertionError(msg)

AssertionError: DataFrame.iloc[:, 1] (column name="atom_number") are different

DataFrame.iloc[:, 1] (column name="atom_number") values are different (100.0 %)
[index]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
[left]:  [1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1346, 1347, 1348, 1349, 1350, 1351, 1352, 1353, 1354, 1355, 1356, 1357, 1358, 1359, 1360, 1361, 1362, 1363, 1364, 1365, 1366, 1367, 1368, 1369, 1370, 1371, 1372, 1373, 1374, 1375, 1376, 1377, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386, 1387, 1388, 1389, 1390, 1391, 1392, 1393, 1394, 1395, 1396, 1397, 1398, 1399, 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423, 1424, 1425, 1426, 1427, 1428, 1429, 1430, 1431, ...]
[right]: [1331, 1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1346, 1347, 1348, 1349, 1350, 1351, 1352, 1353, 1354, 1355, 1356, 1357, 1358, 1359, 1360, 1361, 1362, 1363, 1364, 1365, 1366, 1367, 1368, 1369, 1370, 1371, 1372, 1373, 1374, 1375, 1376, 1377, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386, 1387, 1388, 1389, 1390, 1391, 1392, 1393, 1394, 1395, 1396, 1397, 1398, 1399, 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423, 1424, 1425, 1426, 1427, 1428, 1429, 1430, ...]

Any idea of the source of the discrepancy?

Update: This seems to be fixed now! 🥳

Related issues or pull requests

#97

Pull Request Checklist

[ ] Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
[ ] Added appropriate unit test functions in the ./biopandas/*/tests directories (if applicable)
[ ] Modify documentation in the corresponding Jupyter Notebook under biopandas/docs/sources/ (if applicable)
[ ] Ran PYTHONPATH='.' pytest ./biopandas -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./biopandas/classifier/tests/test_stacking_cv_classifier.py -sv)
[ ] Checked for style issues by running flake8 ./biopandas

Aug 06 '22 05:08 a-r-j

Thanks for adding that! For the test, just curious, does this happen with other structures as well?

Aug 06 '22 15:08 rasbt

It's consistent across a few examples. I narrowed it down to TER records being the problem.

E.g.

# 3eiy.pdb
ATOM   1329  NZ  LYS A 175       8.861  36.709 -13.496  1.00 63.20           N  
ATOM   1330  OXT LYS A 175      10.451  35.432 -10.086  1.00 55.71           O  
TER    1331      LYS A 175                                                      
HETATM 1332  K     K A 176      24.990  43.276   0.005  0.50 24.45           K  
HETATM 1333 NA    NA A 177       1.633  34.181  11.897  1.00 26.73          NA

# 3eiy.cif
ATOM   1329 N  NZ  . LYS A 1 196 ? 8.861   36.709 -13.496 1.00 63.20 ? 175 LYS A NZ  1 
ATOM   1330 O  OXT . LYS A 1 196 ? 10.451  35.432 -10.086 1.00 55.71 ? 175 LYS A OXT 1 
HETATM 1331 K  K   . K   B 2 .   ? 24.990  43.276 0.005   0.50 24.45 ? 176 K   A K   1 
HETATM 1332 NA NA  . NA  C 3 .   ? 1.633   34.181 11.897  1.00 26.73 ? 177 NA  A NA  1

Now, I'm not sure how reliably TER records are placed in PDB files (some info here ) and if we can be super comfortable just ignoring it. I imagine it would cause a bit of a headache trying to make selections consistent between the same structure in PDB & .cif formats.

Aug 06 '22 16:08 a-r-j

Hmm... I think that we could have it without the TER records and just a note about that in the docs for now. So that people are aware. Otherwise, inferring the TER records would be some additional work (I guess that could be done based on the CIF structure index? But I don't have that much experience with TER). My guess is the primary usecase of BioPandas is data analysis, and the TER records are not super necessary anyways.

So, for now I'd say let's warn people about the TER entries but not worry about it.

Aug 06 '22 21:08 rasbt

Hello @a-r-j! Thanks for updating this PR.

In the file biopandas/mmcif/pandas_mmcif.py:

Line 534:89: E501 line too long (95 > 88 characters) Line 536:89: E501 line too long (106 > 88 characters)

Comment last updated at 2022-09-03 21:28:28 UTC

Aug 07 '22 15:08 pep8speaks

Sure, I added an offset arg so people at least have an access point. Hopefully that's a sufficient remedy for most usecases.

Aug 07 '22 15:08 a-r-j

I think that works! Do you have a unit test handy? Otherwise, I can help looking into this!

Aug 07 '22 17:08 rasbt

Think I managed to infer it automatically based on the number of chains. There was also an issue in ATOM atom numbering for multichain proteins (again, TER records). So I figured we can infer the appropriate offsets for the ATOM and HETATM dfs from the number of chains.

Added appropriate unittest and seems to check out.

Aug 08 '22 12:08 a-r-j

Wow this looks good! Currently traveling and will take one more close look in a calm moment, when I am not out and about. Thanks a lot!

Aug 13 '22 21:08 rasbt

Arg totally missed this and am currently traveling (on mobile). I will make a reminder to check this out as soon as I am back on my computer again!

Aug 18 '22 07:08 rasbt

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

Sep 03 '22 21:09 review-notebook-app[bot]

Just did a little documentation update (sorry for the delay, I was traveling during the last couple of weeks, and there's tons of stuff to catch up on)

Sep 03 '22 21:09 rasbt

Happy New Year @rasbt ! Anything I can do to get this PR merged? :)

Jan 04 '23 18:01 a-r-j

biopandas biopandas copied to clipboard

Add mmCIF -> PDB conversion

Description

Related issues or pull requests

Pull Request Checklist

Comment last updated at 2022-09-03 21:28:28 UTC

biopandas
biopandas copied to clipboard