Change how electrode flow retrieves charge density data
Small change in the electrode insertion workflow to avoid needing large object storage: Currently there's a get_charge_density_job function which does not point its output to e.g. GridFS. That causes problems when trying to store a CHGCAR in normal mongo stores
I've moved the get_charge_density_job within get_inserted_structures so that the charge density is not stored when generating inserted structures (it still may be stored in the user's GridFS from the static calc)
Another solution would be just adding a data=[Chgcar] kwarg to the get_charge_density_job function, but I think it makes more sense to remove get_charge_density_job
@esoteric-ephemera does it make sense to add the data to tge datastore for other parts of the workflow? It will surely fail at some point
It should work in this setup because the charge density is only used in one part of the flow (get_inserted_structures). If the user wanted to retrieve their charge densities for analysis, the host structure static calculation should store them in the datastore as well (this static is required in the electrode flow)
Maybe @jmmshn has thoughts on this - don't want to break the flow logic
@esoteric-ephemera, copying the file over was kind of a fireworks-related hack. When this was added, whenever there was a large object that was the input of the job. The UI that rendered the fireworks wf would try to render the entire thing which basically crashed it every time. So you might want to confirm that this is no longer the case.
Thanks for the suggestions @janosh, wasn't aware of the type hinting for Callable!
Thanks, @jmmshn. This is probably a safer approach then, since get_inserted_structures no longer takes VolumetricData as input, just the path to the previous calc and a function to parse it
Codecov Report
Attention: Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
Project coverage is 4.13%. Comparing base (
4244da9) to head (59b7f26).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/atomate2/common/jobs/electrode.py | 0.00% | 3 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #1055 +/- ##
==========================================
- Coverage 72.82% 4.13% -68.69%
==========================================
Files 187 187
Lines 13637 13634 -3
Branches 1370 1370
==========================================
- Hits 9931 564 -9367
- Misses 3161 13039 +9878
+ Partials 545 31 -514
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/atomate2/common/jobs/electrode.py | 0.00% <0.00%> (-84.54%) |
:arrow_down: |
Thanks for the suggestions @janosh, wasn't aware of the type hinting for
Callable!Thanks, @jmmshn. This is probably a safer approach then, since
get_inserted_structuresno longer takesVolumetricDataas input, just the path to the previous calc and a function to parse it
@esoteric-ephemera , it looks like current version of get_charge_density_job
https://github.com/materialsproject/atomate2/blob/95ea0600e00bcded2a0cb6cfe6d190fa6a980c39/src/atomate2/common/jobs/electrode.py#L303
breaks for large objects because it need to send the entire charge density to the jobstore.
I think this can be fixed by adding data=True but we will run into the same problem with jobflow + Firework,
since you now have the following charge density in the input:
https://github.com/materialsproject/atomate2/blob/95ea0600e00bcded2a0cb6cfe6d190fa6a980c39/src/atomate2/common/jobs/electrode.py#L97C9-L97C16
So I don't see any way of not causing the problem with fireworks where the object is deserialized during viewing of the job in the webgui. Maybe it's best to move on completely from fireworks anyways.
So I think the choices are:
- Go back to the old implementation where the charge density is never "sent" between jobs so it never appears as in input.
- Just forget about any problems with view in fireworks... I have not used it in months and month and we should just worry about jobflow-remote.
I think we are already down the path of 2 mostly anyways so I can just complete refactor out the get_charge_density_job and send the charge density from static calculation directly.
Hey @jmmshn, the way I've rewritten it in the PR shouldn't hit either issue you've identified - it just slightly tweaks the electrode flow to take the path of the charge density file and a function as args. Should not have issues with storing large objects in Mongo, nor with rendering large objects in the fireworks webgui
We're still using fireworks, and I don't see a strong case to store the charge density in a database. Even in that case, the user can write a custom get_charge_density function that pulls from a database directly
So in my testing the following job fails with a "Document too large" error.
https://github.com/materialsproject/atomate2/blob/95ea0600e00bcded2a0cb6cfe6d190fa6a980c39/src/atomate2/common/jobs/electrode.py#L303-L318
My current understanding of what is happening is that:
get_charge_density_job calls the VASP version of the function:
https://github.com/materialsproject/atomate2/blob/95ea0600e00bcded2a0cb6cfe6d190fa6a980c39/src/atomate2/vasp/flows/electrode.py#L58-L73
So it will produced a document that contains a VolumetricData/Chgcar object.
It looks like the fact that this result does not have a data=[Chgcar, VolumetricData] in the decorator is the reason I'm seeing the failures.
PS: Thanks for the fast reply!
Yeah exactly what I've observed - adding Chgcar and VolumetricData is to the data field of job should work as well, but is there a need for storing the charge density?
but is there a need for storing the charge density? I think that is for the case where the different nodes cannot see the same file system? As I understand, you will have to store the charge density somewhere to be able to send it to the next job. That might be a somewhat common usage case if people use more heterogenous compute which is where I think jobflowremote is going.
That is essentially the tradeoff, this is a substantial amount of new data though so it might make sense to make
- "copy file over" the default behavior
- "send object via jobstore" an optional behavior
OK - so in the case of heterogeneous compute for a VASP job, we'd have to use something like static_job.output.vasp_objects["aeccar0"] to access the AECCAR0 without making a direct query to the job store. I haven't tried to see if that's possible but I would expect it is?
Either way, I think this is worth looking at in a separate PR - the flow as it currently exists in atomate2 doesn't support heterogeneous compute, this PR is really just to ensure that the flow doesn't fail from mongodb object size limitations
OK everything makes sense now. I agree. I think I was a bit confused cuz I thought the thing was already removed and somehow made it's way back in. But I I now see that this PR was not merged. Sorry for the confusion on my part!
@esoteric-ephemera, I added some minor fixes to the job naming/numbering and I try help get this PR passed.
Thanks for fixing this and shelping me clear up my confusion. I think I made a mistaking while writing the original wf and missed a place where the data is serialized to the jobstore. I'll fix those among other things once this PR is merged!
No worries @jmmshn, that all sounds good to me
Hi @esoteric-ephemera, we fixed the abinit tests, I think this should work now.
Hey @JaGeo , @janosh , @utf , should be ready for review. Thanks @jmmshn for input on this!
@JaGeo would you mind taking a quick look over the docs changes and see if that's what you had in mind for #1151 and #1152?
The openff tests seem to be OK now with the newer dependency stack, but I marked a few places with TODOs that will have to eventually be updated. Whenever you get a chance @orionarcher, would be super helpful if you could look these over
Thank you @esoteric-ephemera ! Those updates are great! (as usual).