pangeo-forge-recipes

[DOC] Warn about difficulty of pulling data from HPC

Open TomNicholas opened this issue 8 months ago • 15 comments

I just had a long and useful chat with @jbusecke, who corrected several misconceptions I had about what pangeo-forge was and how to use it. One misconception in particular was that I assumed it would be relatively easy to pull data from an HPC system into Dataflow. I now understand that this is definitely not the case, and I will be lucky if NCAR supports uploading data via Globus or even an FTP server. :smiling_face_with_tear:

I think this is important context for understanding what pangeo forge can and can't do, as I think many users will be in the same position as me: "I have a simulation dataset sat on HPC, and I want to convert it to ARCO Zarr data in the cloud". It was not at all obvious to me that the main intended use case for pangeo-forge was pulling data that is already available publicly.

Could we find some way to document this better on the pangeo-forge-recipes documentation? Maybe also with some current recommendations as to what to do in this situation? I understand that this is not yet a solved problem.

TomNicholas avatar Oct 20 '23 22:10 TomNicholas

Thanks for raising this Tom. The Pangeo Forge Pulls Data header is probably the most natural place to mention this.

Perhaps an easy first step would be promoting the Tip admonition to Important or some other "more grave" category, and fleshing out the text there a bit more?

Maybe cross-linking this section to the FAQs would also surface it better. (And possibly moving the FAQs into the Getting Started section makes sense as well.)

cisaacstern avatar Oct 20 '23 22:10 cisaacstern

I did not see / appreciate that bit of the docs! That already covers a lot of what I had in mind. Maybe explicitly pointing out that HPC systems generally do not implement ways of accessing data over URLs would be an improvement?

TomNicholas avatar Oct 20 '23 22:10 TomNicholas

Sounds good to me! PRs welcome! 😄

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/4aae78fa39746f9e5c9870e2b6b2ff63f27c0eba/docs/composition/file_patterns.md?plain=1#L28-L49
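For concreteness, here is a rough sketch of the kind of pattern that docs section describes (the URLs are hypothetical; the key point is that the format function must return something fsspec can open, whether an `https://` URL or a local HPC path visible to wherever the recipe runs):

```python
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

def make_path(time: str) -> str:
    # Hypothetical source: works because the files are reachable over HTTP.
    # A local HPC path like f"/glade/scratch/me/sim/{time}.nc" would only work
    # if the recipe executes somewhere that can actually see that filesystem.
    return f"https://data.example.org/simulation/{time}.nc"

pattern = FilePattern(make_path, ConcatDim("time", keys=["2020-01", "2020-02"]))
```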

cisaacstern avatar Oct 20 '23 22:10 cisaacstern

@TomNicholas for your specific case, are you able to push data from your HPC to a GCS bucket? If so the recipe could be written against the GCS cache.
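If so, something as simple as this (a sketch with a hypothetical bucket and path, run from a node with outbound network access and GCS credentials) would populate the cache:

```python
import gcsfs

# Hypothetical bucket/paths: copy the raw files from HPC scratch into a GCS
# "cache" location that the recipe is then written against.
fs = gcsfs.GCSFileSystem()
fs.put("/glade/scratch/me/simulation/", "gs://my-project-cache/simulation/", recursive=True)
```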

cisaacstern avatar Oct 21 '23 03:10 cisaacstern

Would this not benefit from the daskrunner? That would allow scaling out on HPC, reading from the local filesystem (which pangeo-forge already supports because fsspec supports it) and then writing it out either to the cloud (via individual user credentials) or to the filesystem again?

yuvipanda avatar Oct 21 '23 04:10 yuvipanda

@cisaacstern I think so, but that might be really slow if I can only push from a login node.

Also the dataset needs to end up in an AWS bucket (AWS Open Data program), so am I likely to face large egress charges moving data from GCS to AWS?

Would this not benefit from the daskrunner?

Hopefully! Although I need to find out if the NCAR compute nodes can actually write out to the public cloud, otherwise I can't scale out because I would be limited to writing from a login node.

TomNicholas avatar Oct 21 '23 05:10 TomNicholas

I think a larger question here is 'how cloud-specific is pangeo-forge?'. The primary reason I got involved in the project (and started working on https://github.com/pangeo-forge/pangeo-forge-runner) is that I want it to be neither cloud-provider specific nor even require a cloud at all. This is why I pushed to move all 'submission' and 'status' code out of the pangeo-forge-orchestrator (which was cloud and public-GitHub specific) into runner. So while I agree that currently it's still tied to the cloud, I think with runner it need not be. Especially with daskrunner, I think it can work just as well on HPC systems as it does in the cloud, although the specifics of how it is configured have to be different.

One primary question is 'what is the equivalent of object storage on HPC systems?'. I think if you're running recipes on HPC systems that don't have any object storage deployed, the closest equivalent is probably whatever 'fast' scratch setup they have (something like Lustre mounted over NFS perhaps?). It'll be slower than running on a cloud provider with object storage, but faster than reaching out to S3 from within the HPC center. So the pattern for running this on HPC would be that the source data is on the fast local filesystem, the destination data should also be on a fast local filesystem, and then after you are done you can publish out to the public cloud for external use. I think the 'final publishing' use case of public cloud object storage should be treated as different from the 'intermediate output' use case. If the job is running on the cloud, then the 'intermediate output' location and the 'final publishing' location can be the same. On HPC systems I think these would be different.
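To make that concrete, the storage side of such an HPC run might look roughly like this (assuming pangeo-forge-runner's storage configuration; the class names and scratch paths below are illustrative, not prescriptive):

```python
# traitlets config for pangeo-forge-runner (illustrative, assumed config names)
c.TargetStorage.fsspec_class = "fsspec.implementations.local.LocalFileSystem"
c.TargetStorage.root_path = "/scratch/myuser/pangeo-forge/output"   # the "intermediate output"

c.InputCacheStorage.fsspec_class = "fsspec.implementations.local.LocalFileSystem"
c.InputCacheStorage.root_path = "/scratch/myuser/pangeo-forge/cache"

# "Final publishing" to public object storage then happens as a separate step after the run.
```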

All that said, until Dask Runner lands on Beam properly, on HPC systems too you'd be limited to non-scale out performance anyway. But I do believe that means that for meaningful HPC use, the best way to push that forward is to get daskrunner to completion.

yuvipanda avatar Oct 21 '23 06:10 yuvipanda

so am I likely to face large egress charges moving data from GCS to AWS?

Potentially? I think it would depend on specifics but worth being cautious about.

if the NCAR compute nodes can actually write out to the public cloud, otherwise I can't scale out because I would be limited to writing from a login node.

As @yuvipanda helpfully observes, in a future DaskRunner world, it may very well make sense for Pangeo Forge work to happen entirely within the HPC storage context, with the final "publish" step happening as a subsequent forklifting of the pre-built ARCO data from the HPC filesystem to the cloud bucket. In this case, the question becomes, how does anyone working on this NCAR HPC ever efficiently move data to the cloud?

All that said, until Dask Runner lands on Beam properly, on HPC systems too you'd be limited to non-scale out performance anyway. But I do believe that means that for meaningful HPC use, the best way to push that forward is to get daskrunner to completion.

Thanks for this thoughtful reflection @yuvipanda. I agree. And this resonates with what @rabernat and I discussed yesterday: namely, that the DaskRunner represents a very important (possibly indispensable) on-ramp for the scientific community. My evolving understanding is that the DaskRunner as currently released in Beam implements only Map and GroupByKey, with Side Inputs and Combiners being two of the main remaining features necessary for us to actually run our current recipes; xref https://github.com/pangeo-forge/pangeo-forge-runner/pull/109#issuecomment-1771627000.

cisaacstern avatar Oct 21 '23 17:10 cisaacstern

Thanks for the comments @yuvipanda !

how does anyone working on this NCAR HPC ever efficiently move data to the cloud?

They have Globus, it turns out, which hopefully I can use. Happy to contribute to integrating this with pangeo-forge, as @jbusecke suggested to me yesterday. I am expecting to need to do several more of these types of data moving tasks, from HPC to cloud.

make sense for Pangeo Forge work to happen entirely within the HPC storage context, with the final "publish" step happening as a subsequent forklifting

So basically I run the recipe on HPC, doing any data transformation to a temporary intermediate state on the HPC system itself (hopefully in parallel by using the daskrunner), then at the end write the result out to Cloud (AWS or Google) using Globus? How does this fit with the "pangeo-forge only pulls data" idea?

TomNicholas avatar Oct 21 '23 17:10 TomNicholas

So basically I run the recipe on HPC, doing any data transformation to a temporary intermediate state on the HPC system itself (hopefully in parallel by using the daskrunner), then at the end write the result out to Cloud (AWS or Google) using Globus?

Correct. To be more explicit, pangeo-forge puts the end ARCO result on the HPC system itself, and then after that you can use any system to move it to the cloud.

How does this fit with the "pangeo-forge only pulls data" idea?

I don't actually know what this means! Can you explain this a little more?

yuvipanda avatar Oct 22 '23 03:10 yuvipanda

I don't actually know what this means! Can you explain this a little more?

I was referring to this passage in the docs:

https://pangeo-forge.readthedocs.io/en/latest/composition/file_patterns.html#pangeo-forge-pulls-data

TomNicholas avatar Oct 22 '23 03:10 TomNicholas

@TomNicholas in this case your Pangeo Forge pipeline will be pulling data from the HPC filesystem.

cisaacstern avatar Oct 22 '23 03:10 cisaacstern

@TomNicholas in this case your Pangeo Forge pipeline will be pulling data from the HPC filesystem.

Correct. There will be no external cloud involvement at all from the pangeo-forge perspective. The compute is dask, and it's pulling data from the HPC system's filesystems and putting data back there. The final 'push' to cloud object storage doesn't involve pangeo forge at all (although we should provide documentation on how to do it)

yuvipanda avatar Oct 22 '23 03:10 yuvipanda

Hey folks, super useful discussion. I want to throw another use-case in the mix here: For m2lines I am trying to move data from a GFDL HPC, but I DO NOT HAVE ACCESS to run anything on that machine. For any HPC side tasks I rely on collaborators, and their available time is a bottleneck!

While I appreciate the push towards making this work from 'within' the HPC (and I very much agree with the need for the daskrunner!), I think that the Globus route offers a potentially much more generalizable method of ingestion from HPC centers in the short term.

Given that I will not be able to 'upload from within', working on the alternative workflow below seems like a better opportunity for collaboration between folks here?

Proposed workflow

  • Someone with access to the HPC creates a Globus collection (with credentials, since unfortunately none of the centers allow anonymous public access, so a truly public collection does not work). This is significantly easier to ask people for than expecting them to install and run pgf, beam, and daskrunner.
  • Work on an `OpenWithGlobus` stage that uses the Globus API to authenticate and download files to a cloud cache (see the sketch after this list).
    • I would need to spend some more time here to confirm this is possible.
  • At this point we would be in regular PGF land in the cloud, unless I have misunderstood something here.
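
A very rough sketch of what that stage might do under the hood, assuming the `globus_sdk` Transfer API (endpoint IDs and paths are hypothetical; untested, per the caveat above):

```python
import globus_sdk

def fetch_via_globus(token, source_endpoint, dest_endpoint, src_path, dst_path):
    # Hypothetical helper: submit a transfer from the HPC collection to a
    # cloud-side Globus endpoint (e.g. an S3/GCS connector) acting as the cache.
    tc = globus_sdk.TransferClient(authorizer=globus_sdk.AccessTokenAuthorizer(token))
    tdata = globus_sdk.TransferData(tc, source_endpoint, dest_endpoint)
    tdata.add_item(src_path, dst_path)
    task = tc.submit_transfer(tdata)
    return task["task_id"]  # could then be polled with tc.task_wait(task_id)
```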

The one awkward part here is the need for someone to actually 'put' the files into a Globus collection, which kinda violates the principle of PGF, but I think it can just be seen as a 'flaky' data source. The ease of use and the unified (Globus) API for this IMO outweigh that.

I want to stress that I think this is complementary to the above approach (which we also need for using beam beyond data ingestion). I think ultimately we should implement both, but for mostly selfish reasons 😁, I would prefer the 'outside of HPC' method.

jbusecke avatar Oct 23 '23 17:10 jbusecke

The final 'push' to cloud object storage doesn't involve pangeo forge at all (although we should provide documentation on how to do it)

But this could also be a pangeo-forge stage, right? Even if it is just a single-worker dummy stage uploading the data, I think for reproducibility reasons it would be good to include it in the pipeline.
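
Something like this, perhaps (sketch only; it assumes the upstream step hands this stage the path of the finished store, and the bucket name is hypothetical):

```python
import apache_beam as beam
import s3fs

def publish(store_path: str) -> str:
    # Hypothetical final stage: copy the ARCO store built on the HPC
    # filesystem into the public AWS bucket.
    fs = s3fs.S3FileSystem()
    dest = "s3://my-open-data-bucket/" + store_path.rstrip("/").split("/")[-1]
    fs.put(store_path, dest, recursive=True)
    return dest

# ... | "Publish to cloud" >> beam.Map(publish)
```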

jbusecke avatar Oct 23 '23 17:10 jbusecke