Gautam Kumar

Results 31 issues of Gautam Kumar

**Description of your changes:** Adding the scenario when training jobs are stopped. Fixes #6465

```
PASSED tests/unit_tests/tests/robomaker/test_robomaker_simulation_job_spec.py::RoboMakerSimulationJobSpecTestCase::test_minimum_required_args
PASSED tests/unit_tests/tests/train/test_train_component.py::TrainingComponentTestCase::test_after_job_completed
PASSED tests/unit_tests/tests/train/test_train_component.py::TrainingComponentTestCase::test_create_training_job
PASSED tests/unit_tests/tests/train/test_train_component.py::TrainingComponentTestCase::test_cw_logs
PASSED tests/unit_tests/tests/train/test_train_component.py::TrainingComponentTestCase::test_do_sets_name
PASSED tests/unit_tests/tests/train/test_train_component.py::TrainingComponentTestCase::test_empty_hyperparameters
PASSED tests/unit_tests/tests/train/test_train_component.py::TrainingComponentTestCase::test_first_party_algorithm
PASSED...
```

size/S

The script mentioned in https://github.com/pytorch/examples/tree/master/imagenet provides a good guideline for single-node training, but it lacks good documentation on distributed training across multiple nodes. I tried to use two...

distributed

Getting the warning

```
Warning: rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
```

This API version is used in multiple places, specifically in dex; we need...
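As a rough illustration of the change the warning asks for, the sketch below rewrites the deprecated `apiVersion` line in a manifest's text. The helper name `upgrade_rbac_api_version` is hypothetical and not part of any Kubernetes or Kubeflow tooling. It assumes the manifest is plain YAML text with an unquoted `apiVersion:` value and that no other fields need to change.

```python
import re
from typing import Tuple

# API versions taken from the warning text; everything else here is illustrative.
OLD_API = "rbac.authorization.k8s.io/v1beta1"
NEW_API = "rbac.authorization.k8s.io/v1"

# Match a whole `apiVersion:` line that uses the deprecated version.
_API_LINE = re.compile(
    r"^([ \t]*apiVersion:[ \t]*)" + re.escape(OLD_API) + r"[ \t]*$",
    re.MULTILINE,
)


def upgrade_rbac_api_version(manifest_text: str) -> Tuple[str, int]:
    """Return the updated manifest text and the number of lines rewritten."""
    return _API_LINE.subn(lambda m: m.group(1) + NEW_API, manifest_text)


if __name__ == "__main__":
    sample = (
        "apiVersion: rbac.authorization.k8s.io/v1beta1\n"
        "kind: ClusterRoleBinding\n"
    )
    updated, count = upgrade_rbac_api_version(sample)
    print(count)
    print(updated, end="")
```

Running a helper like this over each manifest and reviewing the resulting diff by hand is safer than a blind search-and-replace across the repository.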

Running notebook with nb_conda version 2.2.1 leads to:
[W 00:36:26.479 NotebookApp] [nb_conda] JSON parse fail: Extra data: line 2 column 1 (char 77)
[E 00:36:26.480 NotebookApp] Uncaught exception POST /conda/environments/python3/packages/install...

Customer request: https://github.com/aws/amazon-sagemaker-operator-for-k8s/issues/118

enhancement
question

Use case: A customer's k8s cluster can serve as a launch pad to submit a training job to SM, or even to deploy a model to SM for inference.

We want Kubeflow to stop using static credentials entirely and rely only on IRSA. 1. Figure out the IAM user usage in Kubeflow components. 2. Replace them with...

enhancement

**Is your feature request related to a problem? Please describe.** We will need to include the kubeflow-training SDK and a few more SDKs that might be present for other components like Katib...

enhancement

**Is your feature request related to a problem? Please describe.** Customer feedback: Kale seems to be a very useful feature to quickly spin up pipelines right out of a notebook...

enhancement