api icon indicating copy to clipboard operation
api copied to clipboard

mpr.materials.elasticity.search() does not return all results

Open fxcoudert opened this issue 1 year ago • 3 comments

Python version

Python 3.10.14

Pymatgen version

2024.8.9

Operating system version

macOS 14

Current behavior

data = mpr.materials.elasticity.search(fields=["material_id", "formula_pretty", "structure", "elastic_tensor"])
print("Number of materials retrieved:", len(data))

You can see that it expects 12392 results (in line with what the website shows, roughly) but only downloads 10733.

Capture d’écran 2024-09-11 à 15 24 30

Expected Behavior

The search should download all available records.

Minimal example

from pymatgen.ext.matproj import MPRester
mpr = MPRester(mpapikey)
data = mpr.materials.elasticity.search(fields=["material_id", "formula_pretty", "structure", "elastic_tensor"])
print("Number of materials retrieved:", len(data))

fxcoudert avatar Sep 11 '24 13:09 fxcoudert

moving this to the api repo since it's not an issue with pymatgen

janosh avatar Sep 11 '24 13:09 janosh

Thanks for reporting this @fxcoudert. Good catch! I can reproduce the issue and am looking into it.

tschaume avatar Sep 11 '24 18:09 tschaume

@fxcoudert 1659 out of 12392 entries are deprecated and the API client skips them for empty queries while including them in the count for the progress bar. If you add a query on poisson_ratio for instance, deprecated entries are included and would need to be post-filtered out, though. @munrojm might remember why we went this route.

tschaume avatar Sep 11 '24 21:09 tschaume

We released a new database version (2024-11-14) yesterday along with mp-api==0.44.0 which fixes this issue. The client now returns all documents when no query is provided regardless of their deprecation status and ignores the field argument. The full elasticity collection currently contains 13074 entries.

tschaume avatar Dec 13 '24 21:12 tschaume

Hi @tschaume I have a question. I wanted to quickly check materials in the next database version:

>>> from pymatgen.ext.matproj import MPRester
>>> mpr = MPRester(apikey)
>>> mp_data = mpr.materials.summary.search(fields=["material_id", "formula", "formula_pretty", "nelements", "structure", "theoretical", "symmetry"])
Retrieving SummaryDoc documents: 306376it [02:55, 1743.71it/s]                                                                      
>>> mpr.get_database_version()
'2024.11.14'
>>> len(mp_data)
306376

But that's 306376 materials, whereas the website says there are 153,188. On the other hand, all results are returned twice:

>>> names = [str(x.material_id) for x in mp_data]
>>> len(names)
306376
>>> len(set(names))
153188
>>> len(set(names)) * 2
306376

Would you have any idea why all records are duplicated? This is with mp-api 0.41.2, so is the older version of mp-api not compatible with newer database versions?

fxcoudert avatar Dec 15 '24 10:12 fxcoudert

Two new findings:

  • mp-api==0.44.0 appears to be incompatible with pymatgen==2024.8.9. When I updated only mp-api from my previous environment, I got:
AttributeError: materials is not an attribute of this implementation of MPRester, which only supports functionalityused by 80% of users. If you are looking for the full functionality MPRester, pls install the mp-api .
  • When I use mp-api==0.44.0 with pymatgen==2024.11.13 then I obtain the expected number of materials.
>>> mpr = MPRester(apikey)
>>> mp_data = mpr.materials.summary.search(fields=["material_id", "formula", "formula_pretty", "nelements", "structure", "theoretical", "symmetry"])
/Users/fx/miniforge3/envs/matten/lib/python3.10/site-packages/mp_api/client/core/client.py:509: UserWarning: Ignoring `fields` argument: All fields are always included when no query is provided.
  warnings.warn(
Retrieving SummaryDoc documents: 163097it [06:24, 424.71it/s]                                                                       
>>> len(mp_data)
163097
>>> sum(1 for x in mp_data if not x.deprecated)
153188

If database version and mp-api versions are locked together, I would suggest to emit a warning or even error. Stability of API and code are really important for scientific research, and having a workflow and environment that worked in previous version, and suddenly starts to silently give wrong results, is going to make reproducibility difficult.

fxcoudert avatar Dec 15 '24 10:12 fxcoudert

PS: minor cosmetic issue in the latest version. You can see in this that the progress bar expects 153188 documents (i.e., the number of non-deprecated records), but it will actually return 163097 documents.

>>> mp_data = mpr.materials.summary.search(fields=["material_id", "formula", "formula_pretty", "nelements", "structure", "theoretical", "symmetry"])
Retrieving SummaryDoc documents:   3%|█▌                                                    | 4401/153188 [00:07<02:00, 1238.42it/s]

So there is a mismatch somewhere between the query that calculates the number of records, and the final query that is run.

fxcoudert avatar Dec 15 '24 10:12 fxcoudert

Hi @fxcoudert, thanks for following up and documenting your checks!

Would you have any idea why all records are duplicated? This is with mp-api 0.41.2, so is the older version of mp-api not compatible with newer database versions?

Yes, it's generally best to stay up to date with the mp-api version. The data and the client are somewhat coupled \ since full downloads (i.e. search() without query arguments) are rerouted directly to S3. In addition to the data updates, there were also some minor bugs in the mp-api client that I fixed for 0.44.0 and newer.

mp-api==0.44.0 appears to be incompatible with pymatgen==2024.8.9

I'd have to take a closer look but this might be a pymatgen issue and my general recommendation would be to upgrade pymatgen anyways :) Glad to hear that 2024.11.13 works as expected.

If database version and mp-api versions are locked together, I would suggest to emit a warning or even error.

We've implemented a version check like this in the past to warn the user. It might not trigger as expected. I'll see if I can improve it.

So there is a mismatch somewhere between the query that calculates the number of records, and the final query that is run.

Also correct. Since v0.44.0, the direct downloads from S3 include all deprecated tasks and the progress bar is updated as objects/files are loaded into memory. The initial count for the total of the progress bar, however, comes from the database and doesn't include the deprecated tasks by default. I've released v0.45.0rc3 which should fix the issue.

tschaume avatar Dec 16 '24 21:12 tschaume