emmet icon indicating copy to clipboard operation
emmet copied to clipboard

[WIP] Large structural changes to MPcules

Open espottesmith opened this issue 1 year ago • 11 comments

Meant to do this piecemeal, but there kept being things that I wanted/needed. Also, the main point was to make changes/fixes to the API, and that's actually one of the things I still need to mess with/test. Anyways, here we are.

Contributor Checklist

  • [X] I have broken down my PR scope into the following TODO tasks
    • [X] Add new document containing molecular dipole/multipole data and associated builders
    • [X] On @munrojm's suggestion, changed the has_props field in the molecule summary doc to aid in querying
    • [X] Remove probably unnecessary query operators
    • [X] Switch molecule element querying to be based on composition, which should be more efficient?
    • [X] Add dipoles/multipoles to the information returned in a trajectory query. Also improved testing for the trajectory query operator
    • [X] Add three-center (3C) bonds and 3-center 4-electron hyperbonds to the orbital doc/builder. Thanks @samblau for making the changes to pymatgen necessary to parse this data
    • [X] Change all molecule property builders to break down documents based on hash, rather than formula. Even for the small and quick property builders, the current scale of the data (~4M tasks, ~600k molecules) made separation by formula problematic.
    • [x] Reformat some data to remove lists (where possible) and make querying on summary document faster
    • [ ] Make some changes to NBO-based bonding, based on some issues identified by @samblau and @orionarcher
    • [ ] (Potentially) add string reps (InChI and SMILES) to all molecules docs, rather than just tasks (for SMILES) and molecules/summary (for InChI)
  • [X] I have run the tests locally and they passed. (I never run ALL the tests, cause that takes a hot minute, but I did run all of the tests that I messed with!)
  • [X] I have added tests, or extended existing tests, to cover any new features or bugs fixed in this PR

espottesmith avatar Jan 18 '24 16:01 espottesmith

Codecov Report

Attention: Patch coverage is 90.95941% with 98 lines in your changes missing coverage. Please review.

Project coverage is 90.03%. Comparing base (7ded807) to head (1ddb41c).

Files with missing lines Patch % Lines
emmet-core/emmet/core/molecules/trajectory.py 81.18% 38 Missing :warning:
...et-builders/emmet/builders/molecules/trajectory.py 91.62% 16 Missing :warning:
emmet-core/emmet/core/molecules/orbitals.py 91.02% 14 Missing :warning:
...mmet-api/emmet/api/routes/molecules/tasks/utils.py 50.00% 11 Missing :warning:
...t/api/routes/molecules/electric/query_operators.py 76.92% 9 Missing :warning:
...mmet-builders/emmet/builders/molecules/electric.py 94.23% 6 Missing :warning:
emmet-core/emmet/core/molecules/electric.py 95.65% 4 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #926      +/-   ##
==========================================
+ Coverage   90.00%   90.03%   +0.02%     
==========================================
  Files         143      147       +4     
  Lines       13441    13460      +19     
==========================================
+ Hits        12098    12119      +21     
+ Misses       1343     1341       -2     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Jan 18 '24 16:01 codecov-commenter

This is still WIP (will be contributing to it this week), but it'd be helpful at this point to have folks' eyes on it. @munrojm @orionarcher

espottesmith avatar Jul 10 '24 17:07 espottesmith

Yo @tschaume @yang-ruoxi @tsmathis, can this be merged?

espottesmith avatar Aug 01 '24 17:08 espottesmith

I'd say more or less yes, just a couple questions from me

  1. I see the enums for qchem are now excluded from version control in the .gitignore, where the vasp enums are still included. Should that be the case?
  2. I see large chunks of code in some places that are commented out like here: https://github.com/materialsproject/emmet/blob/74917fbc62d5709474a5b9473926ff7adbab75ab/emmet-api/emmet/api/routes/molecules/molecules/query_operators.py#L77 is there a reason to keep sections of code like this for later? Or should they be removed?
  3. And I see some new test files, are these just generated task documents? Nothing copyrighted?

tsmathis avatar Aug 01 '24 17:08 tsmathis

In addition to the comments by @tsmathis, I'd like to go through the indexes that we need to make sure are built for the most common queries to be performant. I've added a few already but it'd be great if you could let me know any that are missing:

  • molecule_id
  • last_updated
  • formula_alphabetical
  • charge
  • spin_multiplicity
  • natoms
  • nelements
  • builder_meta.$**
  • elements.$**
  • composition.$**
  • chemsys
  • symmetry.$**
  • has_props.$**

tschaume avatar Aug 01 '24 19:08 tschaume

To @tsmathis:

  1. Since the enums get auto-generated, I figured there was no need for changes made to them to get pushed. If you want the line removed from .gitignore, that's fine.
  2. Those lines can be removed for now. If they need to be added back in later, that's fine.
  3. All of the test data that I've added is from collections that are already in MP or are going to be in MP, I believe. So no copyright issues.

Some fields that should probably be indexed:

  • species_hash
  • inchi
  • inchi_key
  • task_ids
  • unique_levels_of_theory
  • unique_solvents

espottesmith avatar Aug 01 '24 20:08 espottesmith

Great, thanks! I added the indexes. Are all query operators for the summary collection covered by indexes now?

tschaume avatar Aug 01 '24 20:08 tschaume

Okay, cool. All fine by me with those then, just wanted to check

tsmathis avatar Aug 01 '24 20:08 tsmathis

I just added HashQuery to the summary resource, which means that we should add an index to coord_hash in addition to species_hash. Otherwise, yeah, everything should be covered.

espottesmith avatar Aug 02 '24 13:08 espottesmith

Ok great. I've added the index on coord_hash too. Are there any mp-api client changes necessary that we should be aware of?

tschaume avatar Aug 02 '24 17:08 tschaume

It doesn't look like it? I think all of the fields available in the rester search() function are still query-able through the API. Though weirdly, in mprester.py, the molecule endpoint is

molecules: MoleculeRester

which points to the molecules collection, while I'm pretty sure it should be

molecules: MoleculesSummaryRester

pointing to the molecules summary collection.

espottesmith avatar Aug 02 '24 18:08 espottesmith