Hongshan Li

Results 10 issues of Hongshan Li

Link to the notebook: https://github.com/aws/amazon-sagemaker-examples/blob/master/reinforcement_learning/rl_deepracer_robomaker_coach_gazebo/deepracer_rl.ipynb
Error: Exception encountered at "In [3]":
ModuleNotFoundError Traceback (most recent call last) in
      9 sys.path.append("common")
     10 sys.path.append("./src")
---> 11 from misc import get_execution_role,...

Hello, I am trying to use hiddenlayer to draw a PyTorch model, but I get an error coming from ONNX: ``` --------------------------------------------------------------------------- TypeError Traceback (most recent call last) /home/ubuntu/mstar/scripts/rlfh/visualization.ipynb Cell...

### Describe the problem the feature is intended to solve When building tfx r2.8-rc0 with mkl support, I see the following issue: ``` ERROR: /root/.cache/bazel/_bazel_root/c206fe4b7a49887ed31d86472abc6776/external/org_tensorflow/tensorflow/core/common_runtime/BUILD:1739:11: Couldn't build file external/org_tensorflow/tensorflow/core/common_runtime/_objs/threadpool_device/threadpool_device.o: C++...

type:build/install
stat:awaiting response

## Bug Report Encountered errors when building tensorflow_model_server with r2.7. ### System information - **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**: Using [tensorflow/serving:latest-devel-gpu](https://hub.docker.com/layers/tensorflow/serving/latest-devel-gpu/images/sha256-82e045dfcc1c3bed0c8dffa8e979d79a74002ede6ce1b003c845abe7ea1c7c55?context=explore) - **TensorFlow Serving installed from (source...

stat:awaiting response
type:bug

Update NCCL to 2.11.4 for PyTorch 1.11.0

build
pytorch
Size:S

Link to the notebook: https://github.com/aws/amazon-sagemaker-examples/blob/master/step-functions-data-science-sdk/automate_model_retraining_workflow/automate_model_retraining_workflow.ipynb
Error: Exception encountered at "In [8]":
InvalidInputException Traceback (most recent call last) in
     22 WorkerType='Standard',
     23 NumberOfWorkers=2,
---> 24 Timeout=60
     25 )
/opt/conda/lib/python3.7/site-packages/botocore/client.py...

The TensorBoard profiler test [here](https://github.com/aws/deep-learning-containers/blob/master/test/dlc_tests/container_tests/bin/testTensorFlow) assumes that the profile logs are saved in `"/logs/train/plugins/profile"`. This, however, might change across TensorFlow releases. As an example, before tf2.7, the profiler logs are...
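One way to avoid depending on a fixed path is to search the log root for the profiler directory instead. The snippet below is only a minimal sketch of that idea: the helper name `find_profile_dirs` is hypothetical, and it assumes the profiler keeps writing into a directory named `profile` inside a `plugins` directory somewhere under the log root. It does not reproduce the exact layout of any particular TensorFlow release.

```python
from pathlib import Path


def find_profile_dirs(log_root):
    """Return every `plugins/profile` directory under log_root, sorted.

    Hypothetical helper: it assumes only that profiler output lives in a
    directory named `profile` whose parent is named `plugins`, at any depth.
    """
    root = Path(log_root)
    return sorted(
        p
        for p in root.rglob("profile")
        if p.is_dir() and p.parent.name == "plugins"
    )


if __name__ == "__main__":
    # "/logs" is the log root implied by the path quoted in the issue.
    for profile_dir in find_profile_dirs("/logs"):
        print(profile_dir)
```

A test could then assert that the returned list is non-empty rather than checking one hard-coded location.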

Hello Niklas, I have a question about reproducing SGPT's results. On the [mteb leaderboard](https://huggingface.co/spaces/mteb/leaderboard), the 125M-weightedmean-msmarco-specb-bitfit model achieves 12.21 NDCG@10 on SCIDOCS. However, I wasn't able to reproduce the result following...

**Describe the bug** While converting a sharded ZeRO-3 checkpoint of a LLaVA-style multimodal model, I got the following error: """ Traceback (most recent call last): File "/scratch/hongshal/code/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 551, in...

bug
training

So far, I have not found a minimal example of resuming training from a universal checkpoint. The only example of using a universal checkpoint is [here](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing) which is buried under layer...

enhancement