Inconsistent results of materials query
Python version
3.12.8
Pymatgen version
2025.1.24
Operating system version
macOS 15.2
Current behavior
The following code:
with MPRester(apikey) as mpr:
    mp_data = mpr.materials.summary.search(
        fields=["material_id", "deprecated", "formula_pretty", "nelements", "structure", "theoretical", "symmetry"]
    )
    print("Number of materials found:", len(mp_data))
    print("Database version", mpr.get_database_version())
returns
Number of materials found: 178580
Database version 2024.12.18
Of these, 169385 are non-deprecated materials, as returned by:
> sum(1 for x in mp_data if not x.deprecated)
This is consistent with the number shown on the web portal. Good. But now consider this:
with MPRester(apikey) as mpr:
    mp_data = mpr.materials.summary.search(
        deprecated=False,
        fields=["material_id", "deprecated", "formula_pretty", "nelements", "structure", "theoretical", "symmetry"]
    )
    print("Number of materials found:", len(mp_data))
    print("Database version", mpr.get_database_version())
It is exactly the same query, except that I ask for only non-deprecated materials by passing deprecated=False. But it now returns:
Number of materials found: 153902
Database version 2024.12.18
Expected Behavior
I expect the two routes to return the same number (and the same list) of non-deprecated materials.
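One way to narrow down where the two counts diverge is to compare the material IDs returned by each route. The following is a minimal sketch, assuming both result lists expose the `material_id` and `deprecated` attributes requested in `fields`. The helper `missing_from_filtered` is hypothetical and not part of pymatgen or the API client, and the stand-in documents are made up for illustration:

```python
from types import SimpleNamespace


def missing_from_filtered(all_docs, filtered_docs):
    """Return IDs that are non-deprecated in the unfiltered result
    but absent from the deprecated=False result."""
    client_side = {str(d.material_id) for d in all_docs if not d.deprecated}
    server_side = {str(d.material_id) for d in filtered_docs}
    return sorted(client_side - server_side)


# Stand-in documents (made up); real ones would come from the two queries.
all_docs = [
    SimpleNamespace(material_id="mp-1", deprecated=False),
    SimpleNamespace(material_id="mp-2", deprecated=True),
    SimpleNamespace(material_id="mp-3", deprecated=False),
]
filtered_docs = [SimpleNamespace(material_id="mp-1", deprecated=False)]

print(missing_from_filtered(all_docs, filtered_docs))
```

Inspecting a few of the returned IDs (for example their `builder_meta`, if requested) may show what the missing materials have in common.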
Minimal example
Relevant files to reproduce this bug
No response
Hi @fxcoudert! Just to check --- was this meant for pymatgen or for https://github.com/materialsproject/api? I want to make sure you get the quickest feedback possible.
CC @tschaume in case it's relevant to him.
Thanks @Andrew-S-Rosen! The difference is due to the new ~15k GNoME materials, which are included in the API response if the user has accepted the GNoME terms on the website. You can set include_gnome=False in mpr.materials.summary.search() to exclude GNoME materials regardless of whether the terms have been accepted. HTH
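A rough sketch of how the `include_gnome` flag mentioned above might be used to compare counts with and without GNoME materials. The helper `count_non_deprecated` is hypothetical; it takes an already-open client so that it can be exercised without network access, and the `include_gnome` keyword is taken from the comment above rather than verified against a specific client release:

```python
def count_non_deprecated(mpr, include_gnome):
    """Count non-deprecated summary documents, optionally excluding GNoME.

    `mpr` is assumed to be an open MPRester-like client.
    """
    docs = mpr.materials.summary.search(
        deprecated=False,
        include_gnome=include_gnome,  # keyword as described in this thread
        fields=["material_id"],       # keep the response small
    )
    return len(docs)


# Possible usage (requires a valid API key and network access):
# from mp_api.client import MPRester
# with MPRester(apikey) as mpr:
#     print(count_non_deprecated(mpr, include_gnome=True))
#     print(count_non_deprecated(mpr, include_gnome=False))
```

If the explanation above holds, the difference between the two counts should correspond to the GNoME materials visible to the account.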
The all-knowing Patrick has spoken!!
Thanks @tschaume. I don't know whether I have accepted the new terms, but what I am sure of is that both queries were made at the same time, with the same function. So whether or not the terms were accepted, shouldn't the numbers be consistent? (169k in the first case, 154k in the second case)
PS: how can I check whether I have accepted the new terms or not? I can't seem to find the information in the dashboard for my account.
Re. @Andrew-S-Rosen: I have no idea whether it is an API bug or a pymatgen bug. I have only queried through the pymatgen functions and have not tried calling the API directly from other code.
@fxcoudert I agree we should and can make this a lot more transparent. If you see the group TERMS:ACCEPT-NC listed under "Groups" on your dashboard, you've accepted the non-commercial terms for GNoME and should be able to access the GNoME explorer.
@yang-ruoxi We might have to add an explicit line on the dashboard that indicates whether the user has accepted the GNoME terms or not.
@tsmathis Would you mind taking a look at the example code in this issue and seeing if you can reproduce it? We might have to double-check the deprecated fields for the GNoME data. Thanks!
@tschaume, the results here are reproducible. This is a side effect of user group access control behavior mixing with bulk download behavior in the client. I'll link you my Slack messages from when I investigated this a little while ago, and we can discuss from there.
@fxcoudert I started PR #974 to address this inconsistency. It's still work in progress and will need some data reorg on our end. We're hoping we can get this out with our next data release.
@tschaume quick question: once I have run a query and gotten structures back, how can I identify whether a specific structure is in the GNoME dataset or not? I thought it would be somewhere in the metadata, for example as struct.builder_meta.license, but that one always has the value 'BY-C' (which is actually weird, because it's not a valid license code?)
@fxcoudert both the builder_meta.batch_id and builder_meta.license fields in a SummaryDoc will help with that. The batch_id for GNoME materials is gnome_r2scan_statics and its license is BY-NC. The two license options BY-C and BY-NC refer to the Creative Commons licenses. HTH
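The check described above could be sketched as follows. This assumes each document exposes `builder_meta.batch_id` and `builder_meta.license` as attributes; `is_gnome` is a hypothetical helper, and treating a document as GNoME when either field matches is a choice made for this illustration, not something the API prescribes. The stand-in documents are made up:

```python
from types import SimpleNamespace

GNOME_BATCH_ID = "gnome_r2scan_statics"  # batch_id value quoted in this thread
GNOME_LICENSE = "BY-NC"                  # license value quoted in this thread


def is_gnome(doc):
    """Return True if the document's builder metadata marks it as GNoME."""
    meta = doc.builder_meta
    return meta.batch_id == GNOME_BATCH_ID or meta.license == GNOME_LICENSE


# Stand-in documents (made up) instead of real query results.
docs = [
    SimpleNamespace(
        material_id="mp-1",
        builder_meta=SimpleNamespace(batch_id=None, license="BY-C"),
    ),
    SimpleNamespace(
        material_id="mp-2",
        builder_meta=SimpleNamespace(batch_id=GNOME_BATCH_ID, license="BY-NC"),
    ),
]

gnome_ids = [d.material_id for d in docs if is_gnome(d)]
print(gnome_ids)
```

For real queries, `builder_meta` would need to be included in the `fields` argument so that the attribute is populated.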
@tschaume can you share info on how to interpret e.g. mp-3202637? Via the AWS Open Data bucket it looks like this task is a GNoME calculation based on its path; however, the mp-3202637 material defined in the materials collection (which has one and only one task) has builder_meta.license: "BY-C". This is all with reference to the 2025.06.09 release.
@mkhorton, running incremental builds caused some of the materials that would have been binned in the GNoME dataset to get the default "BY-C" license in the builder_meta. I've fixed all the objects in the Open Data bucket for 2025.06.09 so they are correctly "BY-NC" now.
Querying the materials endpoint now with an empty search (routed to s3) and a filter expression (routed to mongo) will yield the same count of "BY-NC" entries:
>>> with MPRester(<API_KEY>, use_document_model=False, monty_decode=False) as mpr:
...     mats = mpr.materials.search()
>>> len(list(filter(lambda x: x["builder_meta"]["license"] == "BY-NC", mats)))
45608
>>> with MPRester(<API_KEY>) as mpr:
...     projected_mats = mpr.materials.search(fields=["builder_meta"])
>>> len(list(filter(lambda x: x.builder_meta.license == "BY-NC", projected_mats)))
45608
Thanks Tyler! Really appreciate the rapid response too.
Short of re-downloading the open data, is there an easy way to search for affected materials? No problem if there is not.
That would still require running a query with the client, I think: get the GNoME material IDs by doing something similar to what I pasted above (just include material_id in the fields param), slice your local dataset using those "BY-NC" material IDs, and split the resulting set on the builder_meta.license field.
I would say that finding which materials in your local dataset have the incorrect license field would just be an academic exercise, though. If you only care about having the data with the correct fields, re-downloading is of course easier.
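For anyone who does want to run that check, the slicing step could look roughly like this. It is a sketch under assumptions: the local dataset is taken to be a list of dicts with `material_id` and `builder_meta["license"]` keys, `gnome_ids` stands for the "BY-NC" material IDs obtained from a fresh client query, and `find_mislabeled` is a hypothetical helper. The IDs below are made up:

```python
def find_mislabeled(local_docs, gnome_ids):
    """Return IDs of local documents that belong to the GNoME set
    but do not carry the "BY-NC" license locally."""
    gnome_ids = set(gnome_ids)
    return sorted(
        doc["material_id"]
        for doc in local_docs
        if doc["material_id"] in gnome_ids
        and doc["builder_meta"]["license"] != "BY-NC"
    )


# Made-up local dataset and ID list, for illustration only.
local_docs = [
    {"material_id": "mp-a", "builder_meta": {"license": "BY-C"}},
    {"material_id": "mp-b", "builder_meta": {"license": "BY-NC"}},
    {"material_id": "mp-c", "builder_meta": {"license": "BY-C"}},
]
gnome_ids = ["mp-a", "mp-b"]

print(find_mislabeled(local_docs, gnome_ids))
```

Here `mp-a` is reported because it is in the GNoME ID list but still has the default "BY-C" license locally, while `mp-c` is ignored because it is not a GNoME material.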