python-docs-samples
Add Gemma Flex Template example
Description
Adds a Gemma Flex Template example and an e2e test running on Dataflow. This code example is similar to #11284, but uses a PyTorch model and deploys as a flex template. The e2e test will need model weights staged to GCS, like the streaming Gemma example.
Note: Before submitting a pull request, please open an issue for discussion if you are not associated with Google.
Checklist
- [ ] I have followed Sample Guidelines from AUTHORING_GUIDE.MD
- [ ] README is updated to include all relevant information
- [ ] Tests pass: `nox -s py-3.9` (see Test Environment Setup)
- [ ] Lint pass: `nox -s lint` (see Test Environment Setup)
- [ ] These samples need a new API enabled in testing projects to pass (let us know which ones)
- [ ] These samples need new/updated env vars in testing projects set to pass (let us know which ones)
- [ ] This sample adds a new sample directory, and I updated the CODEOWNERS file with the codeowners for this sample
- [ ] This sample adds a new Product API, and I updated the Blunderbuss issue/PR auto-assigner with the codeowners for this sample
- [ ] Please merge this PR for me once it is approved
Not sure what happened on the kokoro test run: the test target passed, but the test execution as a whole was killed right after.
Not sure what's happening here; the test passes, but the test session gets killed consistently. @engelke can you take a look?
@kweinmeister Still need an approving review from someone if you could take a look
If the complaint is with the model handler code I don't think it's too much of a change to cut that code in favor of linking to the source instead.
Debugging the tests: the output shows it's a timeout, but the tests are successful(?)
collecting ... collected 1 item
e2e_test.py::test_pipeline_dataflow PASSED [100%]
-- generated xml file: /workspace/dataflow/gemma-flex-template/sponge_log.xml --
======================== 1 passed in 3642.73s (1:00:42) ========================
nox > Session py-3.10 was successful.
err: signal: killed
The kokoro config is set to a max of 60 min (config), and you've configured the test to have a 5400s (90 minute) timeout.
A similar issue reported in https://github.com/GoogleCloudPlatform/python-docs-samples/issues/4609.
At a guess: the image is created each time (~20 mins) and it takes time for the job to start (~20 mins), the success message isn't being received in that window, and thus the system has only ~20 minutes of waiting left before it times out.
How long is this entire e2e test expected to take, and is the 90-minute wait there intentional? Something else will need to be updated for that decorator to be respected.
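For context, a per-test cap along these lines is usually expressed with a timeout marker. A minimal sketch, assuming the pytest-timeout plugin (the sample may use its own conftest helpers instead); the 5400s value is the one quoted above:

```python
import pytest


# Sketch only: caps this single e2e test at 5400 s via pytest-timeout.
# The outer kokoro limit still applies and wins if it is shorter.
@pytest.mark.timeout(5400)
def test_pipeline_dataflow() -> None:
    ...  # launch the flex template job and wait for the result messages
```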
As for the e2e test timeout: in early testing, runs were hovering around the hour mark (some slightly under, some slightly over), so it definitely needs to be over an hour. Building the container + running the job as an invocation from a flex template takes substantial time, so we may need a little more room on the kokoro timeout.
> Building the container + running the job as an invocation from a flex template takes substantial time.
You can bring this down significantly by not including the model and the GPU software in the flex template image. This is a scenario where having two separate images, one for the flex template launcher and one for the custom worker container, would be better. Care should be taken to build the images with the same set of dependencies, which can be accomplished with requirements files and/or constraints files.
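A minimal sketch of the split-image idea (project, region, and image URI below are placeholders, not values from this sample): the flex template spec points at a small launcher image, while the heavy GPU/model worker image is passed separately via the `--sdk_container_image` pipeline option, so the two can be built independently from shared requirements/constraints files.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, and image URI, for illustration only.
# The launcher image is referenced from the flex template spec; the worker
# image with the GPU stack and model dependencies is set here.
options = PipelineOptions(
    [
        "--runner=DataflowRunner",
        "--project=my-project",
        "--region=us-central1",
        "--sdk_container_image=us-docker.pkg.dev/my-project/repo/gemma-worker:latest",
    ]
)
```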
Also, we can speed up launch time by downloading the model into the SDK worker container from GCS during container startup, instead of shipping it inside the container. Currently, this can be done by using a custom entrypoint like https://github.com/liferoad/beamllm/blob/main/containers/ollama/entrypoint.sh; eventually we will have a Beam API for that.
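A minimal sketch of the download-at-startup approach in Python, assuming the weights are staged under a GCS prefix (bucket, prefix, and destination paths below are hypothetical):

```python
import os

from google.cloud import storage


def download_model_weights(bucket: str, prefix: str, dest_dir: str) -> None:
    """Copy staged model weights from GCS to local disk before model load.

    Sketch only: no retries or checksum verification, and the arguments are
    placeholders for wherever the weights are actually staged.
    """
    client = storage.Client()
    for blob in client.list_blobs(bucket, prefix=prefix):
        if blob.name.endswith("/"):  # skip "directory" placeholder objects
            continue
        local_path = os.path.join(dest_dir, os.path.relpath(blob.name, prefix))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        blob.download_to_filename(local_path)


if __name__ == "__main__":
    download_model_weights("my-bucket", "gemma/weights", "/models/gemma")
```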
Including the model in the container is less prone to runtime errors, but slower for the short term.
What are the next steps on this PR?
@jrmccluskey, are you still working on changes, or are you waiting for a review? If the latter, please say explicitly that you have addressed previous comments, since the current PR status is "Changes requested".
I see that tests are still failing -- did we figure out how to increase the time limits?
To be clear, comments have been addressed and I am waiting for a review. As far as the timeout, the kokoro config linked above can be updated to run longer; however, I was holding off on updating that since it's a repo-wide timeout. I suppose I should go ahead and update that just for the sake of having a green run on the PR.
@glasnt it appears you might have to submit a CLA for the github user id you've been using here, see: https://github.com/GoogleCloudPlatform/python-docs-samples/pull/11881/checks?check_run_id=27283317137
Resolved.
The Python 3.10 CI checks might have succeeded, apart from cleanup:
Job has terminated in state FAILED: Workflow job: 2024-07-10_11_09_14-9934820475476190981 failed. Please ensure you have permission to access the job and the `--region` flag, us-central1, matches the job's region.
https://btx.cloud.google.com/invocations/0be77b16-c664-4214-b5bf-b70c2588330c/log
Same problem we've seen with some runs here:
Not too much of a surprise; if every test ran at around the same time, quota would be tight.
After looking at the system logs for the Dataflow workers in one of the earlier runs, it looks like the workers don't have enough disk space to load the container image and model.
Handler for GET /v1.41/images/get returned error: write /var/lib/docker/tmp/docker-export-3281602760/0e6537f85f3ebad7a4b5af8385d234950c2861657142f4f53123b65c153127fe/layer.tar: no space left on device
GPU images are large, and the model is several GB as well. Everything needs to fit on disk. Also, from previous experiments with LLMs on Dataflow, each worker might need space for an additional copy of the model weights on disk, depending on how the model is loaded.
It keeps crashing and retrying indefinitely due to "no space left on device" until we reach the test timeout.
Try increasing the worker machines' disk size with `--disk_size_gb`.
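For example, a sketch of the option as it would appear in the pipeline flags (100 GB is an illustrative value, sized so the GPU container image and the several-GB model weights both fit):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: request larger worker boot disks. 100 GB is illustrative only.
options = PipelineOptions(
    [
        "--runner=DataflowRunner",
        "--disk_size_gb=100",
    ]
)
```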
Startup of the worker pool in zone us-central1-a failed to bring up any of the desired 1 workers. Please refer to https://cloud.google.com/dataflow/docs/guides/common-errors#worker-pool-failure for help troubleshooting. ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS: Instance 'dataflow-gemma-flex-templ-07171029-mk9a-harness-7qn5' creation failed: The zone 'projects/python-docs-samples-tests/zones/us-central1-a' does not have enough resources available to fulfill the request. '(resource type:compute)'.
Looks like a quota issue?
What are next steps here? This PR has been hanging for a while
We need the tests to pass in order to merge.
Is it still quota issues or something else?
=================================== FAILURES ===================================
____________________________ test_pipeline_dataflow ____________________________
Traceback (most recent call last):
File "/workspace/dataflow/gemma-flex-template/.nox/py-3-10/lib/python3.10/site-packages/google/api_core/retry/retry_unary.py", line 144, in retry_target
result = target()
File "/workspace/dataflow/conftest.py", line 167, in pubsub_wait_for_messages
messages = [m.message.data.decode("utf-8") for m in response.received_messages]
File "/workspace/dataflow/conftest.py", line 167, in
messages = [m.message.data.decode("utf-8") for m in response.received_messages]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 3: invalid start byte
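One way the comprehension at conftest.py:167 could be made tolerant of non-UTF-8 payloads while debugging (a sketch only; per the later comment, the actual fix ended up on the publishing side):

```python
# Debugging sketch: avoid crashing on non-UTF-8 bytes by replacing them.
messages = [
    m.message.data.decode("utf-8", errors="replace")
    for m in response.received_messages
]
```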
I've tried some experiments with changing the region and with the decode error. These may need to be updated or reverted.
This should now run green with the encoding issues resolved. Since I worked so much on this one, I'd want someone other than me to approve this PR.
Sure enough, actually green! Thank you for the debugging work @glasnt !
I made some changes to simplify the code a bit; hopefully the tests stay green. I think the issue with the base64 was that Pub/Sub expects bytes and it was being passed a string. The local runner failed on this, but I suspect the Dataflow runner was implicitly converting it to base64.
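A sketch of the publish-side shape described here, using the Pub/Sub client library directly (project and topic IDs are placeholders; the pipeline itself presumably writes via Beam's WriteToPubSub, which likewise expects bytes):

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders

# Message data must be bytes; encode explicitly rather than passing a str
# and relying on the runner to coerce or base64-encode it.
future = publisher.publish(topic_path, data="gemma response".encode("utf-8"))
print(future.result())  # message ID once the publish succeeds
```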
Looks like the workflow was running during the GitHub outage last night and failed on the git clone step.
LGTM, tests are passing but take 50+ minutes to run. We can merge, but it would be nice to optimize this further.
A majority of this is the image build, which is blocked on network; but as long as it's <60m, we're OK.