data-on-eks
Persistent bug during dp-bert-large-pretrain example
Description
I'm unable to run the trainium-inferentia BERT pretrain example. The following error shows up during the build:
```
Traceback (most recent call last):
  File "/home/ec2-user/.local/bin/torchx", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/cli/main.py", line 116, in main
    run_main(get_sub_cmds(), argv)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/cli/main.py", line 112, in run_main
    args.func(args)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/cli/cmd_run.py", line 248, in run
    self._run(runner, args)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/cli/cmd_run.py", line 208, in _run
    app_handle = runner.run_component(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/runner/api.py", line 186, in run_component
    return self.schedule(dryrun_info)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/runner/api.py", line 278, in schedule
    app_id = sched.schedule(dryrun_info)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/schedulers/kubernetes_scheduler.py", line 593, in schedule
    resp = self._custom_objects_api().create_namespaced_custom_object(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 231, in create_namespaced_custom_object
    return self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)  # noqa: E501
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 354, in create_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '9ea0bf3e-2327-45ae-aefa-0965f38155ff', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '344539cc-94a8-443b-82c3-e6ffd6feb173', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'bb6f7b9d-abeb-49b1-bcda-ff2bc8c180bf', 'Date': 'Tue, 23 Jan 2024 22:43:20 GMT', 'Content-Length': '232'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validatejob.volcano.sh\" denied the request: unable to find job queue: queues.scheduling.volcano.sh \"test\" not found;","code":400}
```
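The traceback is long, but the actionable detail is buried in the JSON response body at the end. A quick sketch (plain Python; the body string is copied from the error above) to pull the admission-webhook message out:

```python
import json

# HTTP response body copied from the 400 error above.
body = (
    '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",'
    '"message":"admission webhook \\"validatejob.volcano.sh\\" denied the request: '
    'unable to find job queue: queues.scheduling.volcano.sh \\"test\\" not found;",'
    '"code":400}'
)

status = json.loads(body)
print(status["status"], status["code"])  # Failure 400
print(status["message"])                 # the part worth reading: the queue "test" is missing
```

So the Kubernetes API server itself is fine; the request is being rejected by Volcano's `validatejob.volcano.sh` admission webhook because no job queue named `test` exists.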
The EKS Data blueprint used: [Trainium on EKS](https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/trainium)
I re-initialized the project several times, both in Cloud9 and on my local system, with the same result. I also re-ran the Terraform `./install.sh` script.
Versions
- ai-ml/trainium-inferentia: 39e790ce0d4e45979d1374a86b2030e55a838441
- Terraform version: Terraform v1.5.5 on linux_amd64
- Provider version(s): Terraform v1.5.5 on linux_amd64
Reproduction Code [Required]
```bash
cd ai-ml/trainium-inferentia/examples/dp-bert-large-pretrain
chmod +x 2-bert-pretrain-precompile.sh
./2-bert-pretrain-precompile.sh
```
Workspace used: Cloud9, following along with this [EKS Workshop](https://www.eksworkshop.com/docs/introduction/setup/your-account/)
List steps in order that led up to the issue you encountered
```bash
cd data-on-eks/ai-ml/trainium/ && chmod +x install.sh
./install.sh
cd ai-ml/trainium-inferentia/examples/dp-bert-large-pretrain
chmod +x 1-bert-pretrain-build-image.sh
./1-bert-pretrain-build-image.sh
kubectl exec -i -t -n default aws-cli-cmd-shell -c app -- sh -c "clear; (bash || ash || sh)"
yum install tar
cd /data
aws s3 cp s3://neuron-s3/training_datasets/bert_pretrain_wikicorpus_tokenized_hdf5/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar . --no-sign-request
chmod 744 bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar
tar xvf bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar
```
Expected behavior
The BERT pretraining model builds successfully.
Actual behavior
The same traceback as shown in the Description above: the request fails with HTTP 400 because the `validatejob.volcano.sh` admission webhook cannot find the Volcano job queue `test`.
Terminal Output Screenshot(s)
Additional context
Trainium on EKS blueprint
Thanks for raising the issue. I will try this blueprint and update the issue with my findings.
Just noticed that the error body indicates the job queue is missing for Volcano:

```
admission webhook "validatejob.volcano.sh" denied the request: unable to find job queue: queues.scheduling.volcano.sh "test" not found
```

Try running `kubectl apply` on the YAML manifest below, which creates the namespace and the Volcano queue, and then run the shell script (`2-bert-pretrain-precompile.sh`) again.
```yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: test
# Volcano dedicated queue for ml-team-a
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test
spec:
  reclaimable: false
  weight: 1
```
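As a sanity check on why this manifest fixes the error: the admission webhook rejects the job unless a `Queue` object exists with the exact name the job submits to. A minimal sketch (the manifest is inlined as a string, and the expected queue name `test` is taken from the error message; a real check would query the cluster instead):

```python
import re

# The manifest from the comment above, inlined for a quick offline check.
manifest = """
---
apiVersion: v1
kind: Namespace
metadata:
  name: test
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test
spec:
  reclaimable: false
  weight: 1
"""

expected_queue = "test"  # the queue name the failed job was submitted to

# Split the multi-document manifest and find the Queue document.
docs = manifest.split("---")
queue_doc = next(d for d in docs if "kind: Queue" in d)

# First "name:" in the Queue document is its metadata.name.
name = re.search(r"name:\s*(\S+)", queue_doc).group(1)
assert name == expected_queue, f"manifest defines queue {name!r}, job expects {expected_queue!r}"
print("queue name matches:", name)
```

In short: the queue name in the manifest must match the queue the TorchX job targets, or the `validatejob.volcano.sh` webhook will deny it again.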
We can update the blueprint if this works.
Hi @vara-bonthu ,
I successfully added and applied the manifest YAML file. The `unable to find job queue: queues.scheduling.volcano.sh "test" not found` error is no longer showing up, but I'm still not able to run the `bert-compile` pods.
I'm attaching screenshots to show my results:
It seems you've made good progress. The BERT large distributed training blueprint uses Managed Node Groups, so you'll need to set the minimum and desired node counts to `2`. These values can be updated in the `variables.tf` file, at the lines defining the nodes' minimum and desired values.
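For illustration, the relevant entries in `variables.tf` might look like the following; the variable names here are hypothetical, since the exact lines differ between blueprint versions:

```hcl
# Hypothetical sketch; check variables.tf in the blueprint for the actual names.
variable "trn1_32xl_min_size" {
  description = "Minimum number of trn1.32xlarge nodes in the managed node group"
  type        = number
  default     = 2 # BERT-large distributed pretraining needs two nodes
}

variable "trn1_32xl_desired_size" {
  description = "Desired number of trn1.32xlarge nodes in the managed node group"
  type        = number
  default     = 2
}
```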
After making these adjustments, run `terraform apply`. This will provision two `trn1.32xlarge` instances; ensure that your account has access to this instance type.
Upon completion, you should observe the pending pods transitioning to the running state.
I've noticed some gaps in the documentation that need updating. Thanks for validating, and a PR for these missing steps would be appreciated. Thank you!
@Gall-oDrone This blueprint has been recently updated with the fixes. Check out the latest PR here: https://github.com/awslabs/data-on-eks/pull/435
This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove the stale label or comment, or this issue will be closed in 10 days.
Issue closed due to inactivity.