Inconsistent results of materials query

Open fxcoudert opened this issue 10 months ago • 13 comments

Python version

3.12.8

Pymatgen version

2025.1.24

Operating system version

macOS 15.2

Current behavior

The following code:

with MPRester(apikey) as mpr:
    mp_data = mpr.materials.summary.search(
        fields=["material_id", "deprecated", "formula_pretty", "nelements", "structure", "theoretical", "symmetry"]
    )
    print("Number of materials found:", len(mp_data))
    print("Database version", mpr.get_database_version())

returns

Number of materials found: 178580
Database version 2024.12.18

Of these, 169385 are non-deprecated materials, as returned by:

> sum(1 for x in mp_data if not x.deprecated)

This is consistent with the number shown on the web portal. Good. But now, consider this:

with MPRester(apikey) as mpr:
    mp_data = mpr.materials.summary.search(
        deprecated=False,
        fields=["material_id", "deprecated", "formula_pretty", "nelements", "structure", "theoretical", "symmetry"]
    )
    print("Number of materials found:", len(mp_data))
    print("Database version", mpr.get_database_version())

It is exactly the same query, except that I ask for only non-deprecated materials by passing deprecated=False. But it now returns:

Number of materials found: 153902
Database version 2024.12.18

Expected Behavior

I expect the two routes to return the same number (and the same list) of non-deprecated materials.
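
One way to check this expectation is to compare the two result sets by material_id rather than by count alone. The following is only a minimal sketch: it assumes each returned document exposes material_id and deprecated attributes, as in the queries above, and diff_non_deprecated is a hypothetical helper, not part of pymatgen or the API client. The documents are toy stand-ins.

```python
from types import SimpleNamespace


def diff_non_deprecated(unfiltered_docs, filtered_docs):
    """Compare client-side filtering against the deprecated=False query.

    Returns two sorted lists of IDs: those that are non-deprecated in the
    unfiltered result but absent from the filtered one, and the reverse.
    (Hypothetical helper for illustration only.)
    """
    client_side = {str(d.material_id) for d in unfiltered_docs if not d.deprecated}
    server_side = {str(d.material_id) for d in filtered_docs}
    return sorted(client_side - server_side), sorted(server_side - client_side)


# Toy stand-ins for the documents returned by the two queries.
unfiltered = [
    SimpleNamespace(material_id="mp-1", deprecated=False),
    SimpleNamespace(material_id="mp-2", deprecated=True),
    SimpleNamespace(material_id="mp-3", deprecated=False),
]
filtered = [SimpleNamespace(material_id="mp-1", deprecated=False)]

missing_from_filtered, missing_from_unfiltered = diff_non_deprecated(unfiltered, filtered)
print(missing_from_filtered)    # ['mp-3']
print(missing_from_unfiltered)  # []
```

With real data, the first list would show which non-deprecated materials the deprecated=False query leaves out.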

Minimal example


Relevant files to reproduce this bug

No response

fxcoudert avatar Jan 25 '25 14:01 fxcoudert

Hi @fxcoudert! Just to check --- was this meant for pymatgen or for https://github.com/materialsproject/api? I want to make sure you get the quickest feedback possible.

CC @tschaume in case it's relevant to him.

Andrew-S-Rosen avatar Jan 28 '25 20:01 Andrew-S-Rosen

Thanks @Andrew-S-Rosen! The difference is due to the new ~15k GNoMe materials that are included in the API response if a user accepted its terms on the website. You can set include_gnome=False in mpr.materials.summary.search() to exclude GNoMe materials regardless of whether their terms have been accepted or not. HTH
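
To see how much of a count comes from GNoMe materials, the same query can be run with the flag on and off. This is a rough sketch rather than an official recipe: count_non_deprecated and gnome_contribution are hypothetical helper names, the shortened field list is an arbitrary choice, and the only client call used is mpr.materials.summary.search() with the deprecated, include_gnome and fields arguments mentioned in this thread.

```python
FIELDS = ["material_id", "deprecated", "formula_pretty"]


def count_non_deprecated(mpr, include_gnome):
    """Count non-deprecated summary docs with GNoMe materials switched on or off.

    `mpr` is expected to be an open MPRester-like client. (Hypothetical helper.)
    """
    docs = mpr.materials.summary.search(
        deprecated=False,
        include_gnome=include_gnome,
        fields=FIELDS,
    )
    return len(docs)


def gnome_contribution(mpr):
    """Difference between the counts with and without GNoMe materials."""
    return count_non_deprecated(mpr, True) - count_non_deprecated(mpr, False)


# Usage (needs network access and a valid API key, so it is left commented out):
# from mp_api.client import MPRester
# with MPRester(apikey) as mpr:
#     print("GNoMe materials in this query:", gnome_contribution(mpr))
```

If the terms have not been accepted, the two counts would presumably be equal, since no GNoMe materials are returned either way.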

tschaume avatar Jan 28 '25 20:01 tschaume

The all-knowing Patrick has spoken!!

Andrew-S-Rosen avatar Jan 28 '25 20:01 Andrew-S-Rosen

Thanks @tschaume. I don't know if I have accepted the new terms or not, but what I am sure of is that both queries were made at the same time, with the same function. So whether the terms were accepted or not, shouldn't the numbers be consistent? (169k in the first case, 154k in the second case)

PS: how can I check whether I have accepted the new terms or not? I can't seem to find the information in the dashboard for my account.

Re. @Andrew-S-Rosen: I have no idea if it is an API bug or a pymatgen bug. I have only queried through the pymatgen functions and have not tried calling the API directly from other code.

fxcoudert avatar Jan 28 '25 22:01 fxcoudert

@fxcoudert I agree we should and can make this a lot more transparent. If you see the group TERMS:ACCEPT-NC listed under "Groups" on your dashboard, you've accepted the non-commercial terms for GNoMe and should be able to access the GNoMe explorer.

@yang-ruoxi We might have to add an explicit line on the dashboard that indicates whether the user has accepted the GNoMe terms or not.

@tsmathis Would you mind taking a look at the example code in this issue and see if you can reproduce it? We might have to double-check the deprecated fields for the GNoMe data. Thanks!

tschaume avatar Mar 04 '25 17:03 tschaume

@tschaume, the results here are reproducible. This is a side effect of user group access control behavior mixing with bulk download behavior in the client. I'll link you my Slack messages where I investigated this a little while ago; we can discuss from there.

tsmathis avatar Mar 04 '25 17:03 tsmathis

@fxcoudert I started PR #974 to address this inconsistency. It's still work in progress and will need some data reorg on our end. We're hoping we can get this out with our next data release.

tschaume avatar Mar 05 '25 19:03 tschaume

@tschaume quick question: once I have run a query and gotten structures back, how can I identify if a specific structure is in the gnome dataset or not? I thought it would be somewhere in the metadata, for example as struct.builder_meta.license, but that one always has value 'BY-C' (which is actually weird, cause it's not a valid license code?)

fxcoudert avatar Mar 06 '25 11:03 fxcoudert

@fxcoudert both the builder_meta.batch_id and builder_meta.license fields in a SummaryDoc will help with that. The batch_id for GNoMe materials is gnome_r2scan_statics and its license is BY-NC. The two license options BY-C and BY-NC refer to the Creative Commons licenses. HTH
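
As an illustration of that check, a small predicate over the two fields could look like the sketch below. It assumes the documents were requested with builder_meta among the fields and expose it as an attribute; is_gnome is a hypothetical helper name, and the documents here are toy stand-ins built with SimpleNamespace.

```python
from types import SimpleNamespace

GNOME_BATCH_ID = "gnome_r2scan_statics"
GNOME_LICENSE = "BY-NC"


def is_gnome(doc):
    """True if builder_meta marks the doc as GNoMe by batch_id or license.

    (Hypothetical helper; either field matching is treated as sufficient.)
    """
    meta = doc.builder_meta
    return (
        getattr(meta, "batch_id", None) == GNOME_BATCH_ID
        or getattr(meta, "license", None) == GNOME_LICENSE
    )


# Toy stand-ins for SummaryDoc objects.
docs = [
    SimpleNamespace(builder_meta=SimpleNamespace(batch_id=GNOME_BATCH_ID, license="BY-NC")),
    SimpleNamespace(builder_meta=SimpleNamespace(batch_id=None, license="BY-C")),
]
print([is_gnome(d) for d in docs])  # [True, False]
```

Checking both fields rather than one is a design choice here, so that a document still matches if only one of them is populated as expected.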

tschaume avatar Mar 06 '25 18:03 tschaume

@tschaume can you share info on how to interpret e.g. mp-3202637. Via the AWS Open Data bucket it looks like this task is a GNoME calculation based on its path, however the mp-3202637 material defined in the materials collection (which has one and only one task) has builder_meta.license: "BY-C". This is all with reference to the 2025.06.09 release.

mkhorton avatar Jul 08 '25 19:07 mkhorton

@mkhorton, running incremental builds caused some of the materials that would have been binned in the GNoME dataset to get the default "BY-C" license in the builder_meta. I've fixed all the objects in the Open Data bucket for 2025.06.09 so they are correctly "BY-NC" now.

Querying the materials endpoint now with an empty search (routed to s3) and a filter expression (routed to mongo) will yield the same count of "BY-NC" entries:

>>> with MPRester(<API_KEY>, use_document_model=False, monty_decode=False) as mpr:
...     mats = mpr.materials.search()
>>> len(list(filter(lambda x: x["builder_meta"]["license"] == "BY-NC", mats)))
45608
>>> with MPRester(<API_KEY>) as mpr:
...     projected_mats = mpr.materials.search(fields=["builder_meta"])
>>> len(list(filter(lambda x: x.builder_meta.license == "BY-NC", projected_mats)))
45608

tsmathis avatar Jul 09 '25 01:07 tsmathis

Thanks Tyler! Really appreciate the rapid response too.

Short of re-downloading the open data, is there an easy way to search for affected materials? No problem if there is not.

mkhorton avatar Jul 09 '25 06:07 mkhorton

That would still require running a query with the client, I think: get the GNoME material IDs by doing something similar to what I pasted above (just include material_id in the fields param), slice your local dataset using those "BY-NC" material IDs, and split the resulting set on the builder_meta.license field.

I would say that would just be an academic exercise, though, to find which materials in your local dataset have the incorrect license field. If you only care about having the data with the correct fields, just re-downloading is of course easier.
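
The slice-and-split step described above could be sketched roughly as follows. This assumes the local dataset is a list of plain dicts carrying material_id and builder_meta.license (similar to what the client returns with use_document_model=False) and that the list of GNoME material IDs has already been fetched. find_mislabeled is a hypothetical helper and the IDs are placeholders.

```python
def find_mislabeled(local_docs, gnome_ids):
    """Split local docs that belong to GNoME by their stored license.

    Returns (correct, mislabeled): IDs whose local license is "BY-NC",
    and IDs whose local license is anything else. Docs whose ID is not
    in `gnome_ids` are ignored. (Hypothetical helper.)
    """
    gnome_ids = set(gnome_ids)
    correct, mislabeled = [], []
    for doc in local_docs:
        if doc["material_id"] not in gnome_ids:
            continue
        if doc["builder_meta"]["license"] == "BY-NC":
            correct.append(doc["material_id"])
        else:
            mislabeled.append(doc["material_id"])
    return correct, mislabeled


# Toy data: placeholder IDs, not real materials.
gnome_ids = ["mp-A", "mp-B"]
local_docs = [
    {"material_id": "mp-A", "builder_meta": {"license": "BY-NC"}},
    {"material_id": "mp-B", "builder_meta": {"license": "BY-C"}},
    {"material_id": "mp-C", "builder_meta": {"license": "BY-C"}},
]
correct, mislabeled = find_mislabeled(local_docs, gnome_ids)
print(correct)     # ['mp-A']
print(mislabeled)  # ['mp-B']
```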

tsmathis avatar Jul 09 '25 18:07 tsmathis