api [Bug]: Yb structures are not being pulled with MPRester()

Code snippet

from mp_api.client import MPRester

all_element_lists = [['C', 'O', 'Yb']]

all_docs = []
with MPRester() as mpr:
    for elements in all_element_lists:
        print(elements)
        entries = mpr.get_entries_in_chemsys(elements)
        mpids_raw = [entry.data["material_id"] for entry in entries]
        mpids_raw = ",".join(mpids_raw)

        docs = mpr.materials.summary.search(
            material_ids=mpids_raw,
            energy_above_hull=(0,0.01),
            # deprecated=True,
            fields=["material_id", "structure", "band_gap"],
        )

        all_docs.extend(docs)


print(f"Pulled {len(all_docs)} total docs")

for doc in all_docs:
    mpid = doc.material_id
    formula = doc.structure.composition.reduced_formula
    print(f"{mpid}: {formula}")

What happened?

I am attempting to pull Yb structures from the Materials Project using MPRester(). Although structures for the other elements (C, O) are pulled just fine, no Yb-containing structures are returned at all. Setting deprecated=True, to check whether the Yb structures on MP are deprecated, still returns no Yb structures.

After discussion with @Andrew-S-Rosen, we think this might have something to do with the recomputed Yb systems not making their way to the API.

Version

0.45.3 (mp-api)

Which OS?

[ ] MacOS
[x] Windows
[ ] Linux

Log output

['C', 'O', 'Yb']
Retrieving ThermoDoc documents: 100%|█████████████████████████████████████████████████| 105/105 [00:00<00:00, 4685126.81it/s]
Retrieving SummaryDoc documents: 100%|███████████████████████████████████████████████████| 19/19 [00:00<00:00, 866214.96it/s]
Pulled 19 total docs
mp-990424: C
mp-169: C
mp-2516584: C
mp-569304: C
mp-569416: C
mp-48: C
mp-606949: C
mp-1009490: O2
mp-12957: O2
mp-611836: O2
mp-723285: O2
mp-568286: C
mp-568363: C
mp-937760: C
mp-11725: CO2
mp-644607: CO2
mp-1190699: CO2
mp-20066: CO2
mp-1077906: CO2

Jul 31 '25 14:07 blaked8619

The get_entries_in_chemsys function only pulls GGA_GGA+U entries by default. Try: entries = mpr.get_entries_in_chemsys(["C", "O", "Yb"], additional_criteria={"thermo_types": ["GGA_GGA+U", "R2SCAN"]})

Using your same print loop:

Retrieving ThermoDoc documents: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 157/157 [00:00<00:00, 5827484.32it/s]
Retrieving SummaryDoc documents: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:00<00:00, 993387.79it/s]
>>> for doc in all_docs:
...     mpid = doc.material_id
...     formula = doc.structure.composition.reduced_formula
...     print(f"{mpid}: {formula}")
...
mp-12957: O2
mp-1009490: O2
mp-1524462: O2
mp-2204849: O2
mp-611836: O2
mp-723285: O2
mp-972364: Yb
mp-1187875: Yb
mp-568286: C
mp-568363: C
mp-937760: C
mp-3347313: C
mp-990424: C
mp-169: C
mp-2516584: C
mp-569304: C
mp-569416: C
mp-48: C
mp-606949: C
mp-937760: C
mp-1100: YbC2  <--
mp-1215848: Yb2C  <--
mp-1077906: CO2
mp-11725: CO2
mp-644607: CO2
mp-1190699: CO2
mp-20066: CO2
mp-2814: Yb2O3  <--
mp-990424: C
mp-169: C
mp-2516584: C
mp-569304: C
mp-569416: C
mp-48: C
mp-606949: C
mp-1077906: CO2
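
Some IDs appear more than once in the listing above (for example mp-990424), presumably because the same material has an entry under more than one thermo type. A minimal sketch of the suggested fix with the IDs de-duplicated before the summary query is below. The helper name fetch_summary_docs is hypothetical and not part of mp-api; the thermo_types default mirrors the suggestion above, and the IDs are passed as a list rather than a comma-joined string.

```python
def fetch_summary_docs(mpr, elements, thermo_types=("GGA_GGA+U", "R2SCAN")):
    """Hypothetical helper: query entries for a chemsys across several
    thermo types, then fetch summary docs for the unique material IDs."""
    entries = mpr.get_entries_in_chemsys(
        elements,
        additional_criteria={"thermo_types": list(thermo_types)},
    )
    # dict.fromkeys drops repeated IDs while keeping first-seen order
    material_ids = list(
        dict.fromkeys(str(entry.data["material_id"]) for entry in entries)
    )
    if not material_ids:
        # Avoid issuing an unfiltered summary search
        return []
    return mpr.materials.summary.search(
        material_ids=material_ids,
        energy_above_hull=(0, 0.01),
        fields=["material_id", "structure", "band_gap"],
    )
```

It would be called inside the existing context manager, e.g. docs = fetch_summary_docs(mpr, ["C", "O", "Yb"]) within the with MPRester() as mpr: block.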

Jul 31 '25 15:07 tsmathis

We can make that default behavior explicit in the function signature.

@tschaume, I dug a bit and found where the GGA_GGA+U-only default got obscured. We can discuss later.

Jul 31 '25 16:07 tsmathis

Thanks for the update, @tsmathis.

To make sure I understand, it seems like there are four ThermoTypes: GGA_GGA+U, GGA_GGA+U_R2SCAN, R2SCAN, and UNKNOWN. Because the original Yb-containing structures were deprecated, they do not have GGA_GGA+U entries anymore. But they do have thermochemistry at the R2SCAN level of theory due to the recompute effort. Is this correct? And right now, get_entries_in_chemsys only returns entries with GGA or GGA+U calculations by default (presumably this really means PBE and PBE+U, and excludes PBEsol even though it is a GGA)?

Jul 31 '25 16:07 Andrew-S-Rosen

Four are possible, but in reality there are only three, since we only build three hulls (compatibility, mixing scheme, and r2SCAN):

db.thermo.aggregate([{$project: {_id: 0, thermo_type: 1}}, {$group: {_id: "$thermo_type", count: {$sum: 1}}}])
[
  { _id: 'GGA_GGA+U', count: 153167 },
  { _id: 'R2SCAN', count: 82664 },
  { _id: 'GGA_GGA+U_R2SCAN', count: 164332 }
]
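
For readers less familiar with the Mongo shell, the pipeline above just groups documents by thermo_type and counts them. A rough Python equivalent over plain dicts (a sketch only; count_by_thermo_type is a hypothetical name, and the input is assumed to be dicts with a "thermo_type" key rather than real database documents):

```python
from collections import Counter


def count_by_thermo_type(docs):
    """Count documents per thermo_type, like the $group/$sum stage above."""
    return Counter(doc["thermo_type"] for doc in docs)
```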

Yes to all your other points except the last question. The calcs (run types) considered when choosing entries to include with a material are the blessed calcs:

class BlessedCalcs(BaseModel, populate_by_name=True):
    GGA: ComputedStructureEntry | None = None
    GGA_U: ComputedStructureEntry | None = Field(None, alias="GGA+U")
    PBESol: ComputedStructureEntry | None = Field(None, alias="PBEsol")
    SCAN: ComputedStructureEntry | None = None
    R2SCAN: ComputedStructureEntry | None = Field(None, alias="r2SCAN")
    HSE: ComputedStructureEntry | None = None

We would have to run some checks to get actual counts of the materials with SCAN, PBEsol, and HSE entries, though.

However, materials without at least one GGA, GGA+U, or R2SCAN calc don't get through the run type check.
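
The rule as stated can be sketched as below. This is an illustration of the described behavior only, not the actual builder code; the function and constant names are hypothetical, and the upper-casing is an assumption made so that the "r2SCAN" alias and the "R2SCAN" field name compare equal.

```python
# Run types of which a material needs at least one (per the comment above)
REQUIRED_RUN_TYPES = frozenset({"GGA", "GGA+U", "R2SCAN"})


def passes_run_type_check(run_types):
    """Return True if any of the material's run types is GGA, GGA+U, or R2SCAN."""
    return any(rt.upper() in REQUIRED_RUN_TYPES for rt in run_types)
```

Under this sketch, a material with only PBEsol and HSE calcs would be filtered out, while one with PBEsol and r2SCAN calcs would pass.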

Jul 31 '25 17:07 tsmathis

Added explicit default for additional_criteria in 9e28c72

Jul 31 '25 20:07 tschaume

Thanks, folks.

What is the motivation behind having get_entries_in_chemsys filter by thermo types in the first place? Naively, it seems like the function should have no connection to thermochemistry at all. I had expected it to simply return every single entry in that chemical system by default. My concern is that the current behavior may cause a lot of confusion for end-users, but of course changing the default is equally problematic.

Jul 31 '25 20:07 Andrew-S-Rosen

This might be a carry-over from when the function was migrated from the legacy rester. I'm not sure what the reasoning for the default filter was - it's not in get_entries, only in get_entries_in_chemsys. I'd be fine with making the two the same but, as you said, changing the default might be problematic for (some) users. Could be a good discussion for our next foundation meeting? 😄

Jul 31 '25 20:07 tschaume

I will make a note! Ultimately, I think it may be worth making the breaking change here simply because the inconsistency is not obvious. If I didn't know about it, I feel like most users aren't going to know either. If we change the default, at least it will produce more entries (that users can downselect) rather than fewer, so perhaps it is not a major concern in terms of the breaking change. Especially as MP becomes more r2SCAN-heavy, the legacy GGA/GGA+U entries may have less relevance.

Jul 31 '25 21:07 Andrew-S-Rosen

Provenance of the additional_criteria param, for reference if it's helpful, from when I followed the blames earlier:

https://github.com/materialsproject/api/commit/2292fc1a37fc8e3dc301653f1a6848b3749ba00b [ln 887 - 893]
-> https://github.com/materialsproject/api/commit/72eafaa4dcbf03acef0adc0662567fd7b0b9b594 [ln 911 on]
-> https://github.com/materialsproject/api/commit/e298b178ef8bddd1bd363791f102311d49ff8048

+1 on the future concern about the default behavior as the distribution of entry types shifts from previous levels of theory toward r2SCAN.

Jul 31 '25 21:07 tsmathis

You have my blessing in the end. And thanks for the history!

Jul 31 '25 21:07 Andrew-S-Rosen