mpr.materials.elasticity.search() does not return all results
Python version
Python 3.10.14
Pymatgen version
2024.8.9
Operating system version
macOS 14
Current behavior
data = mpr.materials.elasticity.search(fields=["material_id", "formula_pretty", "structure", "elastic_tensor"])
print("Number of materials retrieved:", len(data))
You can see that it expects 12392 results (in line with what the website shows, roughly) but only downloads 10733.
Expected Behavior
The search should download all available records.
Minimal example
from pymatgen.ext.matproj import MPRester
mpr = MPRester(mpapikey)
data = mpr.materials.elasticity.search(fields=["material_id", "formula_pretty", "structure", "elastic_tensor"])
print("Number of materials retrieved:", len(data))
moving this to the api repo since it's not an issue with pymatgen
Thanks for reporting this @fxcoudert. Good catch! I can reproduce the issue and am looking into it.
@fxcoudert 1659 out of 12392 entries are deprecated and the API client skips them for empty queries while including them in the count for the progress bar. If you add a query on poisson_ratio for instance, deprecated entries are included and would need to be post-filtered out, though. @munrojm might remember why we went this route.
We released a new database version (2024-11-14) yesterday along with mp-api==0.44.0 which fixes this issue. The client now returns all documents when no query is provided regardless of their deprecation status and ignores the field argument. The full elasticity collection currently contains 13074 entries.
Hi @tschaume I have a question. I wanted to quickly check materials in the next database version:
>>> from pymatgen.ext.matproj import MPRester
>>> mpr = MPRester(apikey)
>>> mp_data = mpr.materials.summary.search(fields=["material_id", "formula", "formula_pretty", "nelements", "structure", "theoretical", "symmetry"])
Retrieving SummaryDoc documents: 306376it [02:55, 1743.71it/s]
>>> mpr.get_database_version()
'2024.11.14'
>>> len(mp_data)
306376
But that's 306376 materials, whereas the website says there are 153,188. On the other hand, all results are returned twice:
>>> names = [str(x.material_id) for x in mp_data]
>>> len(names)
306376
>>> len(set(names))
153188
>>> len(set(names)) * 2
306376
Would you have any idea why all records are duplicated? This is with mp-api 0.41.2, so is the older version of mp-api not compatible with newer database versions?
Two new findings:
-
mp-api==0.44.0appears to be incompatible withpymatgen==2024.8.9. When I updated onlymp-apifrom my previous environment, I got:
AttributeError: materials is not an attribute of this implementation of MPRester, which only supports functionalityused by 80% of users. If you are looking for the full functionality MPRester, pls install the mp-api .
- When I use
mp-api==0.44.0withpymatgen==2024.11.13then I obtain the expected number of materials.
>>> mpr = MPRester(apikey)
>>> mp_data = mpr.materials.summary.search(fields=["material_id", "formula", "formula_pretty", "nelements", "structure", "theoretical", "symmetry"])
/Users/fx/miniforge3/envs/matten/lib/python3.10/site-packages/mp_api/client/core/client.py:509: UserWarning: Ignoring `fields` argument: All fields are always included when no query is provided.
warnings.warn(
Retrieving SummaryDoc documents: 163097it [06:24, 424.71it/s]
>>> len(mp_data)
163097
>>> sum(1 for x in mp_data if not x.deprecated)
153188
If database version and mp-api versions are locked together, I would suggest to emit a warning or even error. Stability of API and code are really important for scientific research, and having a workflow and environment that worked in previous version, and suddenly starts to silently give wrong results, is going to make reproducibility difficult.
PS: minor cosmetic issue in the latest version. You can see in this that the progress bar expects 153188 documents (i.e., the number of non-deprecated records), but it will actually return 163097 documents.
>>> mp_data = mpr.materials.summary.search(fields=["material_id", "formula", "formula_pretty", "nelements", "structure", "theoretical", "symmetry"])
Retrieving SummaryDoc documents: 3%|█▌ | 4401/153188 [00:07<02:00, 1238.42it/s]
So there is a mismatch somewhere between the query that calculates the number of records, and the final query that is run.
Hi @fxcoudert, thanks for following up and documenting your checks!
Would you have any idea why all records are duplicated? This is with mp-api 0.41.2, so is the older version of mp-api not compatible with newer database versions?
Yes, it's generally best to stay up to date with the mp-api version. The data and the client are somewhat coupled \ since full downloads (i.e. search() without query arguments) are rerouted directly to S3. In addition to the data updates, there were also some minor bugs in the mp-api client that I fixed for 0.44.0 and newer.
mp-api==0.44.0 appears to be incompatible with pymatgen==2024.8.9
I'd have to take a closer look but this might be a pymatgen issue and my general recommendation would be to upgrade pymatgen anyways :) Glad to hear that 2024.11.13 works as expected.
If database version and mp-api versions are locked together, I would suggest to emit a warning or even error.
We've implemented a version check like this in the past to warn the user. It might not trigger as expected. I'll see if I can improve it.
So there is a mismatch somewhere between the query that calculates the number of records, and the final query that is run.
Also correct. Since v0.44.0, the direct downloads from S3 include all deprecated tasks and the progress bar is updated as objects/files are loaded into memory. The initial count for the total of the progress bar, however, comes from the database and doesn't include the deprecated tasks by default. I've released v0.45.0rc3 which should fix the issue.