
Google Batch provider needs to support --mount

FreshAirTonight opened this issue 9 months ago • 5 comments

Thank you for rolling out the new version with updated support for Batch. It appears that the GCS bucket mount is not supported yet, or am I missing something?

FreshAirTonight avatar May 07 '24 17:05 FreshAirTonight

Hi @FreshAirTonight! Thanks for asking - no, mount is not yet supported for the google-batch provider. It will likely be part of the next release. I will report back here when it is available.

wnojopra avatar May 07 '24 17:05 wnojopra

Hi @wnojopra Thank you for the answer. The reason I am asking is that I am interested in the new machine types with the Nvidia L4 GPU, which are available with google-batch but not with google-cls-v2. Is there any plan to add support for G2 machine types to google-cls-v2?
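For context, this is roughly the kind of submission I have in mind (project, bucket, and image names are placeholders, and I'm not sure whether --accelerator-type is needed on top of a G2 machine type):

```bash
dsub \
  --provider google-batch \
  --project my-project \
  --regions us-central1 \
  --machine-type g2-standard-4 \
  --accelerator-type nvidia-l4 \
  --accelerator-count 1 \
  --image my-cuda-image \
  --logging gs://my-bucket/logs \
  --command 'nvidia-smi'
```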

FreshAirTonight avatar May 07 '24 18:05 FreshAirTonight

It's unlikely we'll get that to work in google-cls-v2 for two reasons:

  1. The Lifesciences API (google-cls-v2 provider) is going away in favor of the Batch API (google-batch) provider.
  2. I don't think there's anything dsub can do to support additional machine types. That sounds like something the Lifesciences API controls.

I'm actually not familiar with the Nvidia L4 being unavailable with the Lifesciences API. Is there any documentation or anything else that helps explain?

wnojopra avatar May 07 '24 18:05 wnojopra

I tried the Nvidia L4 with the Lifesciences API, and here is the error message I got:

```
Error: validating pipeline: unsupported accelerator: "nvidia-l4"
```

But I understand that there is not much motivation to make any change to the expiring API. Thank you for your explanation.

FreshAirTonight avatar May 07 '24 19:05 FreshAirTonight

@FreshAirTonight, I ran into this recently as well. @wnojopra is 100% correct that the Lifesciences API is deprecated and only has about a year before EOL, so we should all be moving to Batch anyway. I don't think there is anything that can be done within dsub to work around it if the underlying API doesn't support it.

The API deprecation date was in July last year, which was barely 4 months after the L4s were released (March) and only 2 months after the L4s and their dedicated G2 accelerated VMs reached general availability on GCP. I couldn't find any specific docs on whether they are supported or not, but I think it is fair to assume that they just didn't bother adding support for either (or both) to the API, since they already knew it would be going away anyway...

lm-jkominek avatar May 10 '24 14:05 lm-jkominek

Hi @FreshAirTonight

We released v0.4.12 yesterday, which includes support for mounting GCS buckets with the google-batch provider.

When you get the chance, can you verify if it resolves your issues?
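If it helps, here is a minimal sketch of the kind of invocation to try (project, bucket, and logging path are placeholders; as I recall, the mount point is exposed to the command via the environment variable named on the left-hand side of the `=`):

```bash
dsub \
  --provider google-batch \
  --project my-project \
  --regions us-central1 \
  --mount RESOURCES=gs://my-input-bucket \
  --logging gs://my-bucket/logs \
  --command 'ls "${RESOURCES}"' \
  --wait
```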

wnojopra avatar May 31 '24 16:05 wnojopra

@wnojopra Thank you very much for this release! It has addressed the main issues I had: the mount option works, and I can access the L4 machine types with Google Batch.

I noticed two minor issues in my tests:

(1) dstat and ddel threw error messages like the following:

```
  File "/home/${username}/anaconda3/lib/python3.9/site-packages/dsub/providers/google_batch_operations.py", line 93, in get_create_time
    return _pad_timestamps(op.create_time.rfc3339())
AttributeError: rfc3339
```

There seems to be an issue with the time string formatting.

(2) On the Google Batch web interface, "Memory per task" and "Core per task" show only "1.95 GB" and "2 vCPU", even though the machine type has much higher specifications. This happens regardless of whether I use "MIN_CORES=8 MIN_RAM=32" in my job submission.

(screenshot: gcp_batch)

FreshAirTonight avatar May 31 '24 21:05 FreshAirTonight

That's great, thanks @FreshAirTonight! On your issues:

1. I've been trying to reproduce this one but haven't seen it so far. What I have been able to figure out is that the create_time field is a proto.datetime_helpers.DatetimeWithNanoseconds, which has an rfc3339 method. But I'm not sure which dependency this comes from. Could you please run pip list and show me the output? I'd be interested to see what versions you're running and what differs from mine. In particular, these three might be culprits (these are the versions I have):

   ```
   $ pip list | grep proto
   googleapis-common-protos 1.63.0
   proto-plus               1.23.0
   protobuf                 4.25.3
   ```

2. I noticed the exact same thing, and raised the issue with the Batch API team.

   They say the per-task resource requirements are treated as an intention, which Batch uses to calculate how many tasks could fit into a VM. But tasks are free to use all resources once they are on the VM.

   I was able to confirm this with my own testing: I submitted a task with an n2-standard-4 machine type, checked /proc/meminfo, and saw ~15 GiB of memory available (a rough way to reproduce that check is sketched below).
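For anyone who wants to reproduce that check, here's a sketch (project and bucket are placeholders):

```bash
# Submit a trivial task and inspect the resources actually visible on the VM.
dsub \
  --provider google-batch \
  --project my-project \
  --regions us-central1 \
  --machine-type n2-standard-4 \
  --logging gs://my-bucket/logs \
  --command 'grep MemTotal /proc/meminfo && nproc' \
  --wait
```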

wnojopra avatar May 31 '24 21:05 wnojopra

@wnojopra I got the same versions on the three packages you mentioned:

```
$ pip list | grep proto
googleapis-common-protos      1.63.0
proto-plus                    1.23.0
protobuf                      4.25.3
```

I realized that the issue was caused by another version of protobuf. I have purged that package and now the issue is gone.
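For anyone hitting the same AttributeError, one quick way to check which protobuf the interpreter actually resolves (in my case an extra copy on the path was shadowing the one pip reported):

```bash
# Print the version and file path of the protobuf module that Python imports.
python -c "import google.protobuf as p; print(p.__version__, p.__file__)"
```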

FreshAirTonight avatar May 31 '24 23:05 FreshAirTonight

> 2. I noticed the exact same thing, and raised the issue with the Batch API team.
>
>    They say the per-task resource requirements are treated as an intention, which Batch uses to calculate how many tasks could fit into a VM. **But tasks are free to use all resources once they are on the VM.**
>
>    I was able to confirm this with my own testing: I submitted a task with an n2-standard-4 machine type, checked /proc/meminfo, and saw ~15 GiB of memory available.

@wnojopra, just to be sure - this means that the Batch web interface will display the low per-task specs, but it will actually honor the per-job resource requirements on the backend?

lm-jkominek avatar Jun 03 '24 15:06 lm-jkominek

> @wnojopra, just to be sure - this means that the Batch web interface will display the low per-task specs, but it will actually honor the per-job resource requirements on the backend?

Can't say for all use cases, but this is true in my case when G2 machine types were used. My job would fail if only the 1.95 GB of memory reported by the Batch web interface were actually available to it. Frankly, the Google Batch web interface needs some improvement to avoid confusing its users.

FreshAirTonight avatar Jun 03 '24 16:06 FreshAirTonight

> I realized that the issue was caused by another version of protobuf. I have purged that package and now the issue is gone.

That's great to hear! I'll close off this issue now. I'll also check to see if I can require a specific version of protobuf for the next release.

> @wnojopra, just to be sure - this means that the Batch web interface will display the low per-task specs, but it will actually honor the per-job resource requirements on the backend?

dsub currently submits one Batch Job for each dsub task, and the Batch team has confirmed with me that tasks are free to use all resources once they are on the VM. So I believe the answer to your question is effectively yes.

wnojopra avatar Jun 03 '24 17:06 wnojopra

One last comment here for others who may have issues with google-batch and protobuf: I was able to reproduce an issue with protobuf 3.18.0. With 3.19.0, things seem to be working fine again.

I do see that many newer versions have been released since then. In the next release of dsub, we will require 3.19.0 < protobuf < 5.26.0.
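Until that release is out, one way to pin it manually (the bounds here are taken from this comment, not from dsub's published requirements):

```bash
pip install --upgrade 'protobuf>3.19.0,<5.26.0'
```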

wnojopra avatar Jun 03 '24 18:06 wnojopra