data-on-eks
Persistent bug during dp-bert-large-pretrain example
Description
I'm unable to run the trainium-inferentia BERT pretrain example. The following error shows up during the build:
```
Traceback (most recent call last):
  File "/home/ec2-user/.local/bin/torchx", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/cli/main.py", line 116, in main
    run_main(get_sub_cmds(), argv)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/cli/main.py", line 112, in run_main
    args.func(args)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/cli/cmd_run.py", line 248, in run
    self._run(runner, args)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/cli/cmd_run.py", line 208, in _run
    app_handle = runner.run_component(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/runner/api.py", line 186, in run_component
    return self.schedule(dryrun_info)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/runner/api.py", line 278, in schedule
    app_id = sched.schedule(dryrun_info)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/torchx/schedulers/kubernetes_scheduler.py", line 593, in schedule
    resp = self._custom_objects_api().create_namespaced_custom_object(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 231, in create_namespaced_custom_object
    return self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)  # noqa: E501
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 354, in create_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 279, in POST
    return self.request("POST", url,
  File "/home/ec2-user/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '9ea0bf3e-2327-45ae-aefa-0965f38155ff', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '344539cc-94a8-443b-82c3-e6ffd6feb173', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'bb6f7b9d-abeb-49b1-bcda-ff2bc8c180bf', 'Date': 'Tue, 23 Jan 2024 22:43:20 GMT', 'Content-Length': '232'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validatejob.volcano.sh\" denied the request: unable to find job queue: queues.scheduling.volcano.sh \"test\" not found;","code":400}
```
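The traceback is long, but the actionable detail is buried in the JSON response body at the end. A quick sketch (plain Python; the body string is copied from the error above) to pull the admission-webhook message out:

```python
import json

# HTTP response body copied from the 400 error above.
body = (
    '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",'
    '"message":"admission webhook \\"validatejob.volcano.sh\\" denied the request: '
    'unable to find job queue: queues.scheduling.volcano.sh \\"test\\" not found;",'
    '"code":400}'
)

status = json.loads(body)
print(status["status"], status["code"])  # Failure 400
print(status["message"])                 # the part worth reading: the queue "test" is missing
```

So the Kubernetes API server itself is fine; the request is being rejected by Volcano's `validatejob.volcano.sh` admission webhook because no job queue named `test` exists.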
The EKS Data blueprint used: [Trainium on EKS](https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/trainium)
I re-initialized the project several times, both in Cloud9 and on my local system, with the same result. I also re-ran the Terraform `./install.sh` script.
Versions
- ai-ml/trainium-inferentia: 39e790ce0d4e45979d1374a86b2030e55a838441
- Terraform version: Terraform v1.5.5 on linux_amd64
- Provider version(s): Terraform v1.5.5 on linux_amd64
Reproduction Code [Required]
```bash
cd ai-ml/trainium-inferentia/examples/dp-bert-large-pretrain
chmod +x 2-bert-pretrain-precompile.sh
./2-bert-pretrain-precompile.sh
```
Workspace used: Cloud9, following along with this [EKS Workshop](https://www.eksworkshop.com/docs/introduction/setup/your-account/)
List steps in order that led up to the issue you encountered
```bash
cd data-on-eks/ai-ml/trainium/ && chmod +x install.sh
./install.sh
cd ai-ml/trainium-inferentia/examples/dp-bert-large-pretrain
chmod +x 1-bert-pretrain-build-image.sh
./1-bert-pretrain-build-image.sh
kubectl exec -i -t -n default aws-cli-cmd-shell -c app -- sh -c "clear; (bash || ash || sh)"
yum install tar
cd /data
aws s3 cp s3://neuron-s3/training_datasets/bert_pretrain_wikicorpus_tokenized_hdf5/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar . --no-sign-request
chmod 744 bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar
tar xvf bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar
```
Expected behavior
The BERT pretraining model builds successfully.
Actual behavior
The same traceback as shown in the Description above: the request fails with HTTP 400 because the `validatejob.volcano.sh` admission webhook cannot find the Volcano job queue `test`.
Terminal Output Screenshot(s)
Additional context
Trainium on EKS blueprint
Thanks for raising the issue. I will try this blueprint and update the issue with my findings.
Just noticed that the error body indicates the job queue is missing for Volcano:

```
admission webhook "validatejob.volcano.sh" denied the request: unable to find job queue: queues.scheduling.volcano.sh "test" not found
```

Try running `kubectl apply` on the YAML manifest below, which creates the namespace and the Volcano queue, and then run the shell script (`2-bert-pretrain-precompile.sh`) again.
```yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: test
# Volcano dedicated queue for ml-team-a
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test
spec:
  reclaimable: false
  weight: 1
```
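As a sanity check on why this manifest fixes the error: the admission webhook rejects the job unless a `Queue` object exists with the exact name the job submits to. A minimal sketch (the manifest is inlined as a string, and the expected queue name `test` is taken from the error message; a real check would query the cluster instead):

```python
import re

# The manifest from the comment above, inlined for a quick offline check.
manifest = """
---
apiVersion: v1
kind: Namespace
metadata:
  name: test
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test
spec:
  reclaimable: false
  weight: 1
"""

expected_queue = "test"  # the queue name the failed job was submitted to

# Split the multi-document manifest and find the Queue document.
docs = manifest.split("---")
queue_doc = next(d for d in docs if "kind: Queue" in d)

# First "name:" in the Queue document is its metadata.name.
name = re.search(r"name:\s*(\S+)", queue_doc).group(1)
assert name == expected_queue, f"manifest defines queue {name!r}, job expects {expected_queue!r}"
print("queue name matches:", name)
```

In short: the queue name in the manifest must match the queue the TorchX job targets, or the `validatejob.volcano.sh` webhook will deny it again.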
We can update the blueprint if this works.
Hi @vara-bonthu ,
I successfully added and applied the manifest YAML file. The `unable to find job queue: queues.scheduling.volcano.sh "test" not found` error is no longer showing up, but I'm still not able to run the `bert-compile` pods.
I'm attaching screenshots to show my results:
It seems you've made good progress. The BERT large distributed training blueprint uses Managed Node Groups, so you'll need to set the minimum and desired node counts to `2`. These values can be updated in the `variables.tf` file, at the lines defining the nodes' minimum and desired values.
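For illustration, the relevant entries in `variables.tf` might look like the following; the variable names here are hypothetical, since the exact lines differ between blueprint versions:

```hcl
# Hypothetical sketch; check variables.tf in the blueprint for the actual names.
variable "trn1_32xl_min_size" {
  description = "Minimum number of trn1.32xlarge nodes in the managed node group"
  type        = number
  default     = 2 # BERT-large distributed pretraining needs two nodes
}

variable "trn1_32xl_desired_size" {
  description = "Desired number of trn1.32xlarge nodes in the managed node group"
  type        = number
  default     = 2
}
```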
After making these adjustments, run `terraform apply`. This will provision two `trn1.32xlarge` instances; ensure that your account has access to this instance type.
Upon completion, you should observe the pending pods transitioning to the running state.
I've noticed some gaps in the documentation that need updating. Thanks for validating, and a PR for these missing steps would be appreciated. Thank you!
@Gall-oDrone This blueprint has been recently updated with the fixes. Check out the latest PR here: https://github.com/awslabs/data-on-eks/pull/435
This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove the stale label or comment, or this issue will be closed in 10 days.
Issue closed due to inactivity.