[Bug]: Yb structures are not being pulled with MPRester()
Code snippet
from mp_api.client import MPRester

all_element_lists = [['C', 'O', 'Yb']]
all_docs = []
with MPRester() as mpr:
    for elements in all_element_lists:
        print(elements)
        entries = mpr.get_entries_in_chemsys(elements)
        mpids_raw = [entry.data["material_id"] for entry in entries]
        mpids_raw = ",".join(mpids_raw)
        docs = mpr.materials.summary.search(
            material_ids=mpids_raw,
            energy_above_hull=(0, 0.01),
            # deprecated=True,
            fields=["material_id", "structure", "band_gap"],
        )
        all_docs.extend(docs)
print(f"Pulled {len(all_docs)} total docs")
for doc in all_docs:
    mpid = doc.material_id
    formula = doc.structure.composition.reduced_formula
    print(f"{mpid}: {formula}")
What happened?
I am attempting to pull Yb structures from the Materials Project using MPRester(). Although structures for the other elements (C, O) are pulled just fine, no Yb structures are returned at all. Setting deprecated=True, to check whether the Yb structures on MP are deprecated, still returns no Yb structures.
After discussion with @Andrew-S-Rosen, we think this might have something to do with the recomputed Yb systems not making their way to the API.
Version
0.45.3 (mp-api)
Which OS?
- [ ] MacOS
- [x] Windows
- [ ] Linux
Log output
['C', 'O', 'Yb']
Retrieving ThermoDoc documents: 100%|█████████████████████████████████████████████████| 105/105 [00:00<00:00, 4685126.81it/s]
Retrieving SummaryDoc documents: 100%|███████████████████████████████████████████████████| 19/19 [00:00<00:00, 866214.96it/s]
Pulled 19 total docs
mp-990424: C
mp-169: C
mp-2516584: C
mp-569304: C
mp-569416: C
mp-48: C
mp-606949: C
mp-1009490: O2
mp-12957: O2
mp-611836: O2
mp-723285: O2
mp-568286: C
mp-568363: C
mp-937760: C
mp-11725: CO2
mp-644607: CO2
mp-1190699: CO2
mp-20066: CO2
mp-1077906: CO2
The get_entries_in_chemsys function only pulls GGA_GGA+U entries by default. Try:
entries = mpr.get_entries_in_chemsys(["C", "O", "Yb"], additional_criteria={"thermo_types": ["GGA_GGA+U", "R2SCAN"]})
Using your same print loop:
Retrieving ThermoDoc documents: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 157/157 [00:00<00:00, 5827484.32it/s]
Retrieving SummaryDoc documents: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:00<00:00, 993387.79it/s]
>>> for doc in all_docs:
...     mpid = doc.material_id
...     formula = doc.structure.composition.reduced_formula
...     print(f"{mpid}: {formula}")
...
mp-12957: O2
mp-1009490: O2
mp-1524462: O2
mp-2204849: O2
mp-611836: O2
mp-723285: O2
mp-972364: Yb
mp-1187875: Yb
mp-568286: C
mp-568363: C
mp-937760: C
mp-3347313: C
mp-990424: C
mp-169: C
mp-2516584: C
mp-569304: C
mp-569416: C
mp-48: C
mp-606949: C
mp-937760: C
mp-1100: YbC2 <--
mp-1215848: Yb2C <--
mp-1077906: CO2
mp-11725: CO2
mp-644607: CO2
mp-1190699: CO2
mp-20066: CO2
mp-2814: Yb2O3 <--
mp-990424: C
mp-169: C
mp-2516584: C
mp-569304: C
mp-569416: C
mp-48: C
mp-606949: C
mp-1077906: CO2
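A quick way to sanity-check a result like this is to compare the requested elements against the elements that actually show up in the returned formulas. The sketch below works on plain formula strings only. `missing_elements` is a hypothetical helper (not part of mp-api), and the regex assumes simple reduced formulas like the ones printed by the loop above.

```python
import re

def missing_elements(requested, formulas):
    """Return the requested elements that appear in none of the formulas."""
    found = set()
    for formula in formulas:
        # An element symbol is one uppercase letter plus an optional lowercase letter.
        found.update(re.findall(r"[A-Z][a-z]?", formula))
    return set(requested) - found

# Distinct formulas from the default query vs. the query that adds R2SCAN.
default_formulas = ["C", "O2", "CO2"]
with_r2scan = ["C", "O2", "CO2", "Yb", "YbC2", "Yb2C", "Yb2O3"]

print(missing_elements(["C", "O", "Yb"], default_formulas))  # {'Yb'}
print(missing_elements(["C", "O", "Yb"], with_r2scan))       # set()
```

An empty set means every requested element is represented in at least one returned structure.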
We can make it clearer in the function signature that this is the default behavior.
@tschaume, I dug a bit and found where defaulting to GGA_GGA+U entries only got obfuscated. We can discuss later.
Thanks for the update, @tsmathis.
To make sure I understand, it seems like there are four ThermoTypes: GGA_GGA+U, GGA_GGA+U_R2SCAN, R2SCAN, and UNKNOWN. Because the original Yb-containing structures were deprecated, they do not have GGA_GGA+U entries anymore. But they do have thermochemistry from the R2SCAN level of theory due to the recompute effort. Is this correct? And right now, the get_entries_in_chemsys is only returning entries with GGA or GGA+U calculations by default (presumably this really means PBE and PBE+U, but excludes PBEsol even though it is a GGA)?
There is the possibility of four, but in reality there are only three, since we only build the three hulls (compatibility, mixing scheme, and r2SCAN):
db.thermo.aggregate([{$project: {_id: 0, thermo_type: 1}}, {$group: {_id: "$thermo_type", count: {$sum: 1}}}])
[
{ _id: 'GGA_GGA+U', count: 153167 },
{ _id: 'R2SCAN', count: 82664 },
{ _id: 'GGA_GGA+U_R2SCAN', count: 164332 }
]
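For anyone less familiar with MongoDB aggregations: the pipeline above simply counts thermo documents per thermo_type. A rough Python equivalent is sketched below on a few made-up records. The records, the placeholder material IDs, and the `count_by_thermo_type` name are illustrative assumptions, not real database contents.

```python
from collections import Counter

def count_by_thermo_type(docs):
    """Count documents per thermo_type, like a $group stage with $sum: 1."""
    return Counter(doc["thermo_type"] for doc in docs)

# Made-up records; only the thermo_type field matters here.
docs = [
    {"material_id": "mp-a", "thermo_type": "GGA_GGA+U"},
    {"material_id": "mp-a", "thermo_type": "GGA_GGA+U_R2SCAN"},
    {"material_id": "mp-b", "thermo_type": "R2SCAN"},
    {"material_id": "mp-b", "thermo_type": "GGA_GGA+U_R2SCAN"},
]

counts = count_by_thermo_type(docs)
for thermo_type, count in counts.items():
    print(thermo_type, count)
```

A thermo type that never occurs (UNKNOWN, in the real counts above) simply has no group, which is why only three show up.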
Yes to all your other points except the last question. The calcs (run types) that are considered when choosing entries to include with a material are the "blessed calcs":
class BlessedCalcs(BaseModel, populate_by_name=True):
    GGA: ComputedStructureEntry | None = None
    GGA_U: ComputedStructureEntry | None = Field(None, alias="GGA+U")
    PBESol: ComputedStructureEntry | None = Field(None, alias="PBEsol")
    SCAN: ComputedStructureEntry | None = None
    R2SCAN: ComputedStructureEntry | None = Field(None, alias="r2SCAN")
    HSE: ComputedStructureEntry | None = None
I would have to run some checks to get actual counts of the materials with SCAN, PBEsol, and HSE entries, though.
But materials without at least one GGA, GGA+U, or R2SCAN calc don't get through (run type check).
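As a rough illustration of that rule (this is not the actual builder code), the check amounts to asking whether a material has at least one calculation with a GGA, GGA+U, or r2SCAN run type. The function name, the set name, and the list-of-strings input are assumptions made for this sketch; the run type labels follow the aliases in BlessedCalcs above.

```python
# Run types that let a material through, per the rule described above.
REQUIRED_RUN_TYPES = {"GGA", "GGA+U", "r2SCAN"}

def passes_run_type_check(run_types):
    """Return True if at least one run type is GGA, GGA+U, or r2SCAN."""
    return any(run_type in REQUIRED_RUN_TYPES for run_type in run_types)

print(passes_run_type_check(["PBEsol", "r2SCAN"]))  # True
print(passes_run_type_check(["PBEsol", "HSE"]))     # False
```

Under this rule, a material with only PBEsol, SCAN, or HSE calcs would be excluded, even though those run types can still appear among the blessed calcs of materials that do pass.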
Added explicit default for additional_criteria in 9e28c72
Thanks, folks.
What is the motivation behind having get_entries_in_chemsys filter by thermo types in the first place? Naively, it seems like the function should have no connection to thermochemistry at all. I had expected it to simply return every single entry in that chemical system by default. My concern is that the current behavior may cause a lot of confusion for end-users, but of course changing the default is equally problematic.
This might be a carry-over from when the function was migrated from the legacy rester. I'm not sure what the reasoning for the default filter was - it's not in get_entries, only in get_entries_in_chemsys. I'd be fine with making the two the same but, as you said, changing the default might be problematic for (some) users. Could be a good discussion for our next foundation meeting? 😄
I will make a note! Ultimately, I think it may be worth making the breaking change here simply because the inconsistency is not obvious. If I didn't know about it, I feel like most users aren't going to know either. If we change the default, at least it will produce more entries (that users can downselect) rather than fewer, so perhaps it is not a major concern in terms of the breaking change. Especially as MP becomes more r2SCAN-heavy, the legacy GGA/GGA+U entries may have less relevance.
For reference, in case it's helpful, here is the provenance of the additional_criteria param from when I followed the blames earlier:
https://github.com/materialsproject/api/commit/2292fc1a37fc8e3dc301653f1a6848b3749ba00b [ln 887 - 893]-> https://github.com/materialsproject/api/commit/72eafaa4dcbf03acef0adc0662567fd7b0b9b594 [ln 911 on]-> https://github.com/materialsproject/api/commit/e298b178ef8bddd1bd363791f102311d49ff8048
+1 re: the future concern about the default behavior as the distribution of entry types skews toward r2SCAN relative to previous levels of theory.
You have my blessing in the end. And thanks for the history!