training-operator

[Feedback] (the dataset download link gets 403 error) docs/components/training/user-guides/pytorch.md

itay-nvn-nv opened this issue 1 year ago • 5 comments

issue:

following this guide: https://www.kubeflow.org/docs/components/training/user-guides/pytorch/

which is using this image:

gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0

that attempts to download this file:

http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz

but as of today, requesting this link returns an HTTP 403 (Forbidden) status.

here you can see the expected output for this image: https://developer-qa.nvidia.com/blog/gpu-containers-runtime/#:~:text=Try%20running%20the%20MNIST%20training%20example%20included%20with%20the%20container%3A

suggestions:

  1. use links from this mirror instead, which is hosted on GitHub and will probably be more reliable: https://github.com/fgnt/mnist
  2. allow the links to these files to be provided via env vars, to avoid hardcoding links that may go dead at some point.
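A minimal sketch of suggestion 2, assuming a hypothetical `MNIST_MIRRORS` environment variable (a comma-separated list of base URLs) that the training script would read at startup. No existing Kubeflow image reads this variable; the default below is just the original link reported in this issue.

```python
import os
from typing import List

# Hypothetical env var name, not read by any current Kubeflow image.
ENV_VAR = "MNIST_MIRRORS"
# Default taken from the original (now 403-ing) link reported in this issue.
DEFAULT_MIRRORS = ["http://yann.lecun.com/exdb/mnist/"]

FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]


def get_mirrors() -> List[str]:
    """Return base URLs from the env var, or the default list if it is unset/empty."""
    raw = os.environ.get(ENV_VAR, "")
    mirrors = [m.strip() for m in raw.split(",") if m.strip()]
    # Normalise so every base URL ends with exactly one slash.
    return [m.rstrip("/") + "/" for m in (mirrors or DEFAULT_MIRRORS)]


def candidate_urls(filename: str) -> List[str]:
    """All URLs to try for one file, in mirror order."""
    return [base + filename for base in get_mirrors()]


if __name__ == "__main__":
    for name in FILES:
        print(name, candidate_urls(name))
```

In this sketch, setting the variable replaces the default list entirely rather than prepending to it, so a dead default link is never contacted once a working mirror is configured.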

notes: I assume this link is hardcoded in a script that the Dockerfile for this image runs. I found several references to this link across the Kubeflow GitHub org: https://github.com/search?q=org%3Akubeflow%20%22train-images-idx3-ubyte.gz%22&type=code but couldn't trace the Dockerfile used to build this image, nor determine which of these scripts it uses.

itay-nvn-nv, Nov 20 '24

tested with this image: kubeflow/pytorch-dist-mnist:latest (latest tag, pushed at 22/11/2024) https://hub.docker.com/r/kubeflow/pytorch-dist-mnist/tags

the download links were switched to a public S3 bucket (the image now fetches FashionMNIST rather than MNIST), and the download process completes:

Using distributed PyTorch with gloo backend
World Size: 2. Rank: 1
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ../data/FashionMNIST/raw/train-images-idx3-ubyte.gz

  0%|          | 65536/26421880 [00:00<01:12, 365219.76it/s]
100%|██████████| 26421880/26421880 [00:01<00:00, 16889476.36it/s]
Extracting ../data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ../data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw/train-labels-idx1-ubyte.gz

  0%|          | 0/29515 [00:00<?, ?it/s]
100%|██████████| 29515/29515 [00:00<00:00, 325193.23it/s]
Extracting ../data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ../data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz

  1%|▏         | 65536/4422102 [00:00<00:12, 361558.72it/s]
100%|██████████| 4422102/4422102 [00:00<00:00, 6085832.68it/s]
Extracting ../data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ../data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ../data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
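The log above shows a download-then-extract step for each of the four files. A minimal sketch of that sequence, extended with the mirror fallback from suggestion 1, assuming the caller passes an ordered list of base URLs. `download_and_extract` and its `fetch` parameter are hypothetical names (the injectable `fetch` only exists so the sketch can be exercised without network access); this is not the code inside the Kubeflow image.

```python
import gzip
import os
import urllib.request
from typing import Callable, List


def http_fetch(url: str) -> bytes:
    """Fetch a URL; a 403 or other HTTP error raises urllib.error.HTTPError."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def download_and_extract(
    filename: str,
    mirrors: List[str],
    raw_dir: str,
    fetch: Callable[[str], bytes] = http_fetch,  # hypothetical hook, injectable for testing
) -> str:
    """Try each mirror in order; save the first successful .gz and decompress it.

    Assumes `filename` ends in ".gz". Returns the path of the extracted file.
    """
    os.makedirs(raw_dir, exist_ok=True)
    errors = []
    for base in mirrors:
        url = base.rstrip("/") + "/" + filename
        try:
            payload = fetch(url)
        except OSError as exc:  # URLError and HTTPError both subclass OSError
            errors.append(f"{url}: {exc}")
            continue
        gz_path = os.path.join(raw_dir, filename)
        with open(gz_path, "wb") as f:
            f.write(payload)
        out_path = gz_path[: -len(".gz")]
        # Decompress next to the archive, like the "Extracting ..." step in the log.
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            dst.write(src.read())
        return out_path
    raise RuntimeError("all mirrors failed:\n" + "\n".join(errors))
```

Because a 403 surfaces as `urllib.error.HTTPError`, a dead primary link like the one reported here would fall through to the next mirror instead of aborting the job; only when every mirror fails does the sketch raise.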

FYI, this new image should replace these 2 old images, which are currently used in a lot of the examples across the repo:

gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0 (latest tag, pushed at 07/03/2019) https://console.cloud.google.com/gcr/images/kubeflow-ci/global/pytorch-dist-mnist_test

gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0 (latest tag, pushed at 03/03/2019) https://console.cloud.google.com/gcr/images/kubeflow-ci/global/pytorch-dist-mnist-test

itay-nvn-nv, Nov 23 '24

/assign @itaynvn-runai

varodrig, Jan 12 '25

The PR is waiting for approval.

varodrig, Jan 19 '25

/area trainer

varodrig, Mar 11 '25

/transfer trainer cc @andreyvelich

varodrig, Mar 11 '25

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot], Jun 09 '25

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot], Jun 29 '25