CifWriter writes non-standard-compliant chemical formulae and cell formula units

Open e-kwsm opened this issue 2 years ago • 2 comments

Describe the bug

Chemical formulae-related items written by CifWriter do not comply with the specification.

To Reproduce

Run the following script (POSCAR of CH₃NH₃PbI₃ from https://materialsproject.org/materials/mp-1194604/):

#!/usr/bin/env python3
import pymatgen.core
import pymatgen.io.cif
import pymatgen.io.vasp


def main():
    print("#", pymatgen.core.__version__)
    s = pymatgen.io.vasp.Poscar.from_string(
        """H24 Pb4 C4 I12 N4
1.0
   8.6500830000000004    0.0000000000000000    0.0000000000000005
   0.0000000000000014    8.9913910000000001    0.0000000000000006
   0.0000000000000000    0.0000000000000000   13.1226540000000007
H Pb C I N
24 4 4 12 4
direct
   0.3037930000000000    0.4412980000000000    0.2500000000000000 H+
   0.1962070000000000    0.9412980000000000    0.2500000000000000 H+
   0.6962070000000000    0.5587020000000000    0.7500000000000000 H+
   0.8037930000000000    0.0587020000000000    0.7500000000000000 H+
   0.4531830000000000    0.3445690000000000    0.1812530000000001 H+
   0.0468170000000000    0.8445690000000000    0.3187469999999999 H+
   0.5468170000000000    0.6554310000000000    0.6812530000000001 H+
   0.9531830000000000    0.1554310000000000    0.8187469999999999 H+
   0.5468170000000000    0.6554310000000000    0.8187469999999999 H+
   0.9531830000000000    0.1554310000000000    0.6812530000000001 H+
   0.4531830000000000    0.3445690000000000    0.3187469999999999 H+
   0.0468170000000000    0.8445690000000000    0.1812530000000001 H+
   0.4961490000000000    0.6095140000000000    0.1858000000000000 H+
   0.0038510000000000    0.1095139999999999    0.3141999999999999 H+
   0.5038510000000000    0.3904860000000000    0.6858000000000001 H+
   0.9961490000000000    0.8904860000000000    0.8141999999999999 H+
   0.5038510000000000    0.3904860000000000    0.8141999999999999 H+
   0.9961490000000000    0.8904860000000000    0.6858000000000001 H+
   0.4961490000000000    0.6095140000000000    0.3141999999999999 H+
   0.0038510000000000    0.1095139999999999    0.1858000000000000 H+
   0.3604920000000000    0.4782410000000000    0.7500000000000000 H+
   0.1395080000000000    0.9782410000000000    0.7500000000000000 H+
   0.6395080000000000    0.5217590000000000    0.2500000000000000 H+
   0.8604920000000000    0.0217590000000000    0.2500000000000000 H+
   0.5000000000000000    0.0000000000000000    0.0000000000000000 Pb2+
   0.0000000000000000    0.5000000000000000    0.5000000000000000 Pb2+
   0.5000000000000000    0.0000000000000000    0.5000000000000000 Pb2+
   0.0000000000000000    0.5000000000000000    0.0000000000000000 Pb2+
   0.4259970000000000    0.4084900000000000    0.2500000000000000 C2-
   0.0740030000000000    0.9084900000000000    0.2500000000000000 C2-
   0.5740030000000000    0.5915100000000000    0.7500000000000000 C2-
   0.9259970000000000    0.0915100000000000    0.7500000000000000 C2-
   0.5692760000000000    0.9730720000000000    0.2500000000000000 I-
   0.9307240000000000    0.4730720000000002    0.2500000000000000 I-
   0.4307240000000000    0.0269280000000000    0.7500000000000000 I-
   0.0692759999999999    0.5269280000000000    0.7500000000000000 I-
   0.3263700000000000    0.6794470000000000    0.0175740000000000 I-
   0.1736300000000000    0.1794470000000001    0.4824260000000000 I-
   0.6736300000000000    0.3205530000000000    0.5175740000000000 I-
   0.8263700000000000    0.8205530000000000    0.9824260000000000 I-
   0.6736300000000000    0.3205530000000000    0.9824260000000000 I-
   0.8263700000000000    0.8205530000000000    0.5175740000000000 I-
   0.3263700000000000    0.6794470000000000    0.4824260000000000 I-
   0.1736300000000000    0.1794470000000001    0.0175740000000000 I-
   0.4788280000000000    0.4539980000000000    0.7500000000000000 N3-
   0.0211720000000000    0.9539980000000000    0.7500000000000000 N3-
   0.5211720000000000    0.5460020000000000    0.2500000000000000 N3-
   0.9788280000000000    0.0460020000000000    0.2500000000000000 N3-
"""
    )

    w = pymatgen.io.cif.CifWriter(s.structure)
    print(w)


if __name__ == "__main__":
    main()

Output is as follows:

# 2022.11.7
# generated using pymatgen
data_H6PbCI3N
_symmetry_space_group_name_H-M   'P 1'
_cell_length_a   8.65008300
_cell_length_b   8.99139100
_cell_length_c   13.12265400
_cell_angle_alpha   90.00000000
_cell_angle_beta   90.00000000
_cell_angle_gamma   90.00000000
_symmetry_Int_Tables_number   1
_chemical_formula_structural   H6PbCI3N
_chemical_formula_sum   'H24 Pb4 C4 I12 N4'
_cell_volume   1020.63119132
_cell_formula_units_Z   4
loop_
 _symmetry_equiv_pos_site_id
 _symmetry_equiv_pos_as_xyz
  1  'x, y, z'
loop_
 _atom_site_type_symbol
 _atom_site_label
 _atom_site_symmetry_multiplicity
 _atom_site_fract_x
 _atom_site_fract_y
 _atom_site_fract_z
 _atom_site_occupancy
  H  H0  1  0.30379300  0.44129800  0.25000000  1
  H  H1  1  0.19620700  0.94129800  0.25000000  1
  H  H2  1  0.69620700  0.55870200  0.75000000  1
  H  H3  1  0.80379300  0.05870200  0.75000000  1
  H  H4  1  0.45318300  0.34456900  0.18125300  1
  H  H5  1  0.04681700  0.84456900  0.31874700  1
  H  H6  1  0.54681700  0.65543100  0.68125300  1
  H  H7  1  0.95318300  0.15543100  0.81874700  1
  H  H8  1  0.54681700  0.65543100  0.81874700  1
  H  H9  1  0.95318300  0.15543100  0.68125300  1
  H  H10  1  0.45318300  0.34456900  0.31874700  1
  H  H11  1  0.04681700  0.84456900  0.18125300  1
  H  H12  1  0.49614900  0.60951400  0.18580000  1
  H  H13  1  0.00385100  0.10951400  0.31420000  1
  H  H14  1  0.50385100  0.39048600  0.68580000  1
  H  H15  1  0.99614900  0.89048600  0.81420000  1
  H  H16  1  0.50385100  0.39048600  0.81420000  1
  H  H17  1  0.99614900  0.89048600  0.68580000  1
  H  H18  1  0.49614900  0.60951400  0.31420000  1
  H  H19  1  0.00385100  0.10951400  0.18580000  1
  H  H20  1  0.36049200  0.47824100  0.75000000  1
  H  H21  1  0.13950800  0.97824100  0.75000000  1
  H  H22  1  0.63950800  0.52175900  0.25000000  1
  H  H23  1  0.86049200  0.02175900  0.25000000  1
  Pb  Pb24  1  0.50000000  0.00000000  0.00000000  1
  Pb  Pb25  1  0.00000000  0.50000000  0.50000000  1
  Pb  Pb26  1  0.50000000  0.00000000  0.50000000  1
  Pb  Pb27  1  0.00000000  0.50000000  0.00000000  1
  C  C28  1  0.42599700  0.40849000  0.25000000  1
  C  C29  1  0.07400300  0.90849000  0.25000000  1
  C  C30  1  0.57400300  0.59151000  0.75000000  1
  C  C31  1  0.92599700  0.09151000  0.75000000  1
  I  I32  1  0.56927600  0.97307200  0.25000000  1
  I  I33  1  0.93072400  0.47307200  0.25000000  1
  I  I34  1  0.43072400  0.02692800  0.75000000  1
  I  I35  1  0.06927600  0.52692800  0.75000000  1
  I  I36  1  0.32637000  0.67944700  0.01757400  1
  I  I37  1  0.17363000  0.17944700  0.48242600  1
  I  I38  1  0.67363000  0.32055300  0.51757400  1
  I  I39  1  0.82637000  0.82055300  0.98242600  1
  I  I40  1  0.67363000  0.32055300  0.98242600  1
  I  I41  1  0.82637000  0.82055300  0.51757400  1
  I  I42  1  0.32637000  0.67944700  0.48242600  1
  I  I43  1  0.17363000  0.17944700  0.01757400  1
  N  N44  1  0.47882800  0.45399800  0.75000000  1
  N  N45  1  0.02117200  0.95399800  0.75000000  1
  N  N46  1  0.52117200  0.54600200  0.25000000  1
  N  N47  1  0.97882800  0.04600200  0.25000000  1

Focus on these lines:

_chemical_formula_structural   H6PbCI3N
_chemical_formula_sum   'H24 Pb4 C4 I12 N4'
_cell_formula_units_Z   4
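
To make the inconsistency concrete, here is a minimal sketch in plain Python (not pymatgen code; `parse_formula_sum` is a hypothetical helper written for this illustration). It multiplies the atoms in the written `_chemical_formula_sum` by the written `_cell_formula_units_Z` and compares the result with the 48 fully occupied P 1 atom sites listed in the output above.

```python
import re


def parse_formula_sum(text):
    """Parse a space-separated formula such as 'H24 Pb4 C4 I12 N4' into counts.

    Hypothetical helper: handles only plain 'SymbolCount' tokens,
    no parentheses or fractional counts.
    """
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", text):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


# Values copied from the CifWriter output above.
formula_sum = parse_formula_sum("H24 Pb4 C4 I12 N4")
z = 4
n_sites = 48  # atom sites H0 ... N47, each with occupancy 1

# Atoms per cell implied by the written formula and Z.
implied_atoms = z * sum(formula_sum.values())
print(implied_atoms, n_sites)  # 192 48
```

The written formula and Z imply 192 atoms per cell, four times the number actually listed.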

Expected behavior

The lines must be either

_chemical_formula_structural   '(C H3 N H3)4 Pb4 I12'
_chemical_formula_sum   'C4 H24 I12 N4 Pb4'
_cell_formula_units_Z   1

or

_chemical_formula_structural   '(C H3 N H3)1 Pb I3'
_chemical_formula_sum   'C1 H6 I3 N1 Pb1'
_cell_formula_units_Z   4
  • [ ] The elemental order of _chemical_formula_sum is wrong
  • [ ] IIUC, the element counts in _chemical_formula_structural and _chemical_formula_sum must agree. The _chemical_formula_structural values above are only illustrative, though; generating them in that form automatically would be difficult

From _cell_formula_units_Z:

The number of the formula units in the unit cell as specified by _chemical_formula_structural, _chemical_formula_moiety or _chemical_formula_sum.
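
One way to keep these items consistent can be sketched in plain Python. This assumes the formula unit is simply the cell contents divided by the greatest common divisor of the element counts, which is not always the chemically meaningful choice. `split_cell_contents` is a hypothetical helper, not part of pymatgen and not how CifWriter is implemented.

```python
from functools import reduce
from math import gcd


def split_cell_contents(cell_counts):
    """Split per-cell element counts into (per-formula-unit counts, Z).

    Assumption: Z is the greatest common divisor of all counts, so that
    Z times the returned counts reproduces the cell contents exactly.
    """
    z = reduce(gcd, cell_counts.values())
    per_unit = {element: n // z for element, n in cell_counts.items()}
    return per_unit, z


cell = {"H": 24, "Pb": 4, "C": 4, "I": 12, "N": 4}
per_unit, z = split_cell_contents(cell)
print(per_unit, z)
```

For the cell above this gives Z = 4 together with H6 Pb1 C1 I3 N1, which matches the second expected variant. Writing the unreduced counts would instead require Z = 1.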

From _chemical_formula_structural:

See the _chemical_formula_[] category description for the rules for writing chemical formulae for inorganics, organometallics, metal complexes etc., in which bonded groups are preserved as discrete entities within parentheses, with post-multipliers as required. The order of the elements should give as much information as possible about the chemical structure. Parentheses may be used and nested as required. This formula should correspond to the structure as actually reported, i.e. trace elements not included in atom-type and atom-site lists should not be included in this formula (see also _chemical_formula_analytical).

From _chemical_formula_sum:

See the _chemical_formula_[] category description for the rules for writing chemical formulae in which all discrete bonded residues and ions are summed over the constituent elements, following the ordering given in general rule (5) in the _chemical_formula_[] category description. Parentheses are not normally used.

From https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Cchemical_formula.html

(5) Unless the elements are ordered in a manner that corresponds to their chemical structure, as in _chemical_formula_structural, the order of the elements within any group or moiety depends on whether carbon is present or not. If carbon is present, the order should be: C, then H, then the other elements in alphabetical order of their symbol. If carbon is not present, the elements are listed purely in alphabetical order of their symbol. This is the 'Hill' system used by Chemical Abstracts. This ordering is used in _chemical_formula_moiety and _chemical_formula_sum.

(emphasis mine)
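
Rule (5) is straightforward to apply in code. The following is a minimal sketch in plain Python (`hill_formula_sum` is a hypothetical name, not an existing pymatgen function) that orders element counts as the rule describes and joins them with spaces, as in the expected output above.

```python
def hill_formula_sum(counts):
    """Format element counts in Hill order, e.g. 'C4 H24 I12 N4 Pb4'.

    If carbon is present: C, then H, then the rest alphabetically.
    Otherwise: all elements alphabetically by symbol.
    """
    symbols = sorted(counts)
    if "C" in counts:
        head = [s for s in ("C", "H") if s in counts]
        symbols = head + [s for s in symbols if s not in ("C", "H")]
    return " ".join(f"{s}{counts[s]}" for s in symbols)


print(hill_formula_sum({"H": 24, "Pb": 4, "C": 4, "I": 12, "N": 4}))
# C4 H24 I12 N4 Pb4
```

This sketch always writes the count, including 1, to match the 'C1 H6 I3 N1 Pb1' style used above.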

Environment (please supply relevant versions and platform info):

  • OS: EndeavourOS
  • Version: 2022.11.7

e-kwsm, Feb 05 '23 23:02

Thanks for reporting this. But I would like to understand the actual implication of this beyond non-compliance with the standard. Does it affect the use of the CIF in any software out there?

I am happy for someone to write a PR to fix this. But unless there is a pressing compatibility problem, I don't foresee being able to spend time working on this.

shyuep, Feb 13 '23 20:02

Thank you for your reply. So far I have had no software problems caused by the elemental order of _chemical_formula_sum or by the discrepancy between _cell_formula_units_Z and _chemical_formula_sum/_chemical_formula_structural, but the latter is, I believe, incorrect.

e-kwsm, Feb 14 '23 06:02