
Add Deltalake query support

Open tsmathis opened this issue 1 month ago • 7 comments

should just work:tm:

tsmathis avatar Oct 23 '25 15:10 tsmathis

some notes:

1. ~~depends on #1021~~
2. ~~could supersede #974~~
3. ~~this will need to be addressed for the progress bar to be accurate in the case where `has_gnome_access=False`:~~ https://github.com/materialsproject/api/blob/aee0f8c117e01e514604b4c5996f144a4c3b560d/mp_api/client/core/client.py#L573-L579 ~~the count can be retrieved from s3, but the `COUNT(*) ... WHERE NOT IN ...` is slow~~
4. wasn't sure how to emit messages to the user; warnings might not be the best choice: https://github.com/materialsproject/api/blob/aee0f8c117e01e514604b4c5996f144a4c3b560d/mp_api/client/core/client.py#L532-L535 https://github.com/materialsproject/api/blob/aee0f8c117e01e514604b4c5996f144a4c3b560d/mp_api/client/core/client.py#L631-L636
5. on the fence about whether MPDataset should inherit the user's choice of `use_document_model` or default to False; it's extra overhead when True: https://github.com/materialsproject/api/blob/aee0f8c117e01e514604b4c5996f144a4c3b560d/mp_api/client/core/client.py#L642
6. re: document models, wasn't sure if making an MPDataDoc model was the right route, so the emmet model is just passed through for now.
7. @esoteric-ephemera, is this how coercing user input to AlphaIDs should go? Do you want to do something different? https://github.com/materialsproject/api/blob/aee0f8c117e01e514604b4c5996f144a4c3b560d/mp_api/client/routes/materials/tasks.py#L34
8. ~~Is MPAPIClientSettings the right place for these? Not sure if the user has the ability to adjust these if needed:~~ https://github.com/materialsproject/api/blob/aee0f8c117e01e514604b4c5996f144a4c3b560d/mp_api/client/core/settings.py#L90-L102
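On note 4, if warnings feel like the wrong fit, a module-level logger is a common alternative for informational client messages; a minimal sketch (the logger name and helper function are illustrative, not part of the actual client):

```python
import logging

# Hypothetical logger name; not necessarily the client's actual logger.
logger = logging.getLogger("mp_api.client")

def notify_user(message: str) -> None:
    # Unlike warnings.warn, logging doesn't deduplicate repeated messages,
    # and users can silence or redirect it with standard logging config.
    logger.info(message)
```

This keeps user-facing messages out of the warnings machinery, which is really meant for "you may be doing something wrong" signals.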

tsmathis avatar Oct 23 '25 15:10 tsmathis

Ah, and based on the failing test for trajectories, I assumed returning the pymatgen object was correct; should the dict be returned instead? @esoteric-ephemera https://github.com/materialsproject/api/blob/aee0f8c117e01e514604b4c5996f144a4c3b560d/mp_api/client/routes/materials/tasks.py#L57

tsmathis avatar Oct 23 '25 15:10 tsmathis

@tsmathis I think the API was set up to return the jsanitized trajectory info: https://github.com/materialsproject/emmet/blob/3447c5af4746d539f1f4faf26b97715cb119c85d/emmet-api/emmet/api/routes/materials/tasks/query_operators.py#L73

Either way, yeah, I guess it returned the as_dict, but we don't need to keep to that paradigm

For the AlphaID, to handle both the no-prefix/separator ("aaaaaaft") and with-prefix/separator ("mp-aaaaaaft") cases, either of these should work, but I can also just save the "padded identifier" as an attr on it to make this cleaner; I'll do that in the PR you linked:

"a"*(x._padlen-len(x._identifier)) + x._identifier

or

```python
if (alpha := AlphaID(task_id, padlen=8))._separator:
    padded = str(alpha).rsplit(alpha._separator)[-1]
else:
    padded = str(alpha)
```

esoteric-ephemera avatar Oct 23 '25 16:10 esoteric-ephemera

> For the AlphaID, to handle either the no prefix/separator ("aaaaaaft") and with prefix/separator ("mp-aaaaaaft") cases, both of these should work, but I can also just save the "padded identifier" as an attr on it to make this cleaner - I'll do that in the PR you linked:

Either option works for me, just want to make sure I stick to the intended usage (edit: or that we're at least consistent across the client)

> Either way yeah I guess it returned the as_dict but we don't need to keep with that paradigm

Was going to say we could stick to whatever the frontend was expecting, but looking now, the frontend doesn't even use the tasks.get_trajectory(...) function, so it will need to be rewritten either way. The frontend does end up making a dataframe from the trajectory dict, so just returning the dict is probably best
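If the dict return wins out, the frontend's dataframe construction stays a one-liner; a sketch with a made-up payload (these field names are illustrative, not the actual trajectory schema):

```python
import pandas as pd

# Illustrative trajectory dict; the real one would come from tasks.get_trajectory(...)
trajectory = {
    "ionic_step": [0, 1, 2],
    "energy": [-10.2, -10.5, -10.6],
}
df = pd.DataFrame(trajectory)
```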

tsmathis avatar Oct 23 '25 17:10 tsmathis

Codecov Report

:x: Patch coverage is 42.10526% with 77 lines in your changes missing coverage. Please review.

:white_check_mark: Project coverage is 66.10%. Comparing base (52a3c57) to head (b2a832f).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| mp_api/client/core/client.py | 26.38% | 53 Missing :warning: |
| mp_api/client/core/utils.py | 47.82% | 24 Missing :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #1023      +/-   ##
==========================================
- Coverage   67.29%   66.10%   -1.20%     
==========================================
  Files          50       50              
  Lines        2770     2894     +124     
==========================================
+ Hits         1864     1913      +49     
- Misses        906      981      +75     
```

:umbrella: View full report in Codecov by Sentry.


codecov-commenter avatar Oct 23 '25 17:10 codecov-commenter

good catches @tschaume

you're obviously free to keep adding changes for testing, but you can also just ping me if you want me to update things as changes come in upstream

tsmathis avatar Nov 05 '25 18:11 tsmathis

And FYI, tests are mostly good locally with my endpoint set to preview. These two fail regardless once the env gets reset during testing:

```
FAILED tests/test_mprester.py::TestMPRester::test_get_api_key_endpoint_from_env_var - KeyError: 'access_controlled_batch_ids'
FAILED tests/test_mprester.py::TestMPRester::test_get_default_api_key_endpoint - KeyError: 'access_controlled_batch_ids'
```

Once the new deployment is live that will go away, and since this is a feature PR I'm not too concerned about the tests atm.

But I can also change the heartbeat meta ping to be more fault tolerant: use .get(...) and return nulls by default rather than directly accessing those keys :shrug:
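The fault-tolerant version could look roughly like this (the function name is hypothetical; the key name is taken from the failing tests above):

```python
def batch_ids_from_heartbeat(heartbeat_meta: dict):
    # dict.get returns None instead of raising KeyError when the
    # deployed API doesn't expose this key yet.
    return heartbeat_meta.get("access_controlled_batch_ids")
```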

tsmathis avatar Nov 05 '25 18:11 tsmathis

Working great for me! Full task download in ~6 min which is crazy compared to before

General discussion question: now that we're working with just bare AlphaIDs (e.g., aaaaaaft), we may need to manually insert prefixes by endpoint, right? Once the various materials endpoints are delta-backed, they should just get mp-. Do we want to manually insert an mpt- prefix or task- for the tasks?

esoteric-ephemera avatar Nov 12 '25 17:11 esoteric-ephemera

I think for the core tasks collection we're going prefix-less, correct? All the others will get prefixes at parse/build time.

tsmathis avatar Nov 12 '25 18:11 tsmathis

re: the iterating, indexing into the local dataset, etc

I am a little conflicted about what the best route for the python-like implementation/behavior of the local MPDatasets should be, mainly because as soon as we leave arrow-land we're neutering the performance that can be achieved.

As an example: Regardless of how we do the iteration behavior, this is dog water:

```python
# doesn't work currently, would have to update iterating to match Aaron's review comment first
>>> tasks = mpr.materials.tasks.search()
>>> non_metallic_r2scan_structures = [
...     x.structure
...     for x in tasks
...     if x.output.bandgap > 0 and x.run_type == "r2SCAN"
... ]
```

compared to:

```python
>>> import pyarrow.compute as pc
>>> tasks_ds = tasks.pyarrow_dataset
>>> expr = (pc.field(("output", "bandgap")) > 0) & (pc.field("run_type") == "r2SCAN")
>>> non_metallic_r2scan_structures = tasks_ds.to_table(columns=["structure"], filter=expr)
```

which is sub-second execution on my machine

I am obviously biased on this front since I am comfortable with arrow's usage patterns, and I'm not sure the average client user would be willing to go down that route. Ideally, though, we should be guiding users toward a "golden path".

tsmathis avatar Nov 12 '25 23:11 tsmathis

Yeah it's hard to say what's best in this case. We'd probably want to prioritize user experience across endpoints, or just throw a specialized warning for full task retrieval that the return type is different

If pandas is a no-op from parquet (not sure if that's also true for the dataset or just an individual table/array), then that could be a viable alternative? Feels like pandas will be more familiar than arrow datasets

esoteric-ephemera avatar Nov 13 '25 01:11 esoteric-ephemera