Cannot load EMNIST dataset

Open davidshen84 opened this issue 10 months ago • 5 comments

Short description: Failed to load the EMNIST dataset.

Environment information

  • Operating System: Linux

  • Python version: 3.9

  • tensorflow-datasets/tfds-nightly version: 4.9.4

  • tensorflow/tf-nightly version: 12.6.1

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)? ✅

Reproduction instructions

import tensorflow_datasets as tfds

tfds.load("emnist", split=["train"])

Expected behavior: The EMNIST dataset is loaded successfully.

Additional context

NonMatchingChecksumError: Artifact https://www.itl.nist.gov/iaui/vip/cs_links/EMNIST/gzip.zip, downloaded to /root/tensorflow_datasets/downloads/itl.nist.gov_iaui_vip_cs_links_EMNIST_gzipi4VnNviDSrfd9Zju6qv40flc3wr22t8ldulNStS6tmk.zip.tmp.8cdbd18c3c7144529f0a2a11d1829c60/itl, has wrong checksum:
* Expected: UrlInfo(size=535.73 MiB, checksum='fb9bb67e33772a9cc0b895e4ecf36d2cf35be8b709693c3564cea2a019fcda8e', filename='gzip.zip')
* Got: UrlInfo(size=110.12 KiB, checksum='bfd529724d06f22872f32d6649561a57fd25ec17ea51d6f2ad24b96ea0519c34', filename='itl')
To debug, see: https://www.tensorflow.org/datasets/overview#fixing_nonmatchingchecksumerror

I tried to download the file directly using the link https://www.itl.nist.gov/iaui/vip/cs_links/EMNIST/gzip.zip, but I got redirected to the NIST homepage. I think the link is outdated.
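
The size mismatch in the error (110.12 KiB received versus 535.73 MiB expected) is consistent with the redirect: the download was most likely a web page rather than the archive. A minimal sketch for checking this on any downloaded file, assuming only the Python standard library; `describe_download` is a hypothetical helper name and is not part of TFDS:

```python
import zipfile
from pathlib import Path


def describe_download(path):
    """Return (size_in_bytes, looks_like_zip) for a downloaded file.

    Hypothetical helper for illustration; not a TFDS function.
    """
    p = Path(path)
    # zipfile.is_zipfile checks for the ZIP end-of-central-directory record,
    # so an HTML page saved under a .zip name is reported as False.
    return p.stat().st_size, zipfile.is_zipfile(str(p))
```

Calling it on the file left in the downloads folder would show whether a real archive was fetched. A result of a few hundred KiB together with `False` points at a redirect page, not at a corrupted archive.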

davidshen84 avatar Apr 07 '24 02:04 davidshen84

@davidshen84 Well spotted, thanks for opening the issue! It seems the URL (https://www.itl.nist.gov/iaui/vip/cs_links/EMNIST/gzip.zip) now redirects to https://www.nist.gov/itl which causes the problem.

Did you find the actual link?

marcenacp avatar Apr 11 '24 09:04 marcenacp

I cannot find a direct download link anywhere online. According to this page, https://www.nist.gov/itl/products-and-services/emnist-dataset, contacting the author is the only way to get the dataset.

davidshen84 avatar Apr 11 '24 11:04 davidshen84

Sorry, I just found that on that NIST page, the "original MNIST dataset" link points to the EMNIST dataset. 😅

Can you check if TF can still handle that file?

Thank you

davidshen84 avatar Apr 12 '24 00:04 davidshen84

This should be the new emnist dataset URL: https://biometrics.nist.gov/cs_links/EMNIST/gzip.zip

davidshen84 avatar Apr 16 '24 09:04 davidshen84

Did you fix the error? If you have solved it, can you tell me how? I also get the same error :(

minchan0410 avatar May 07 '24 14:05 minchan0410

No. They just need to update the URL. In the meantime, I think you can manually download the archive and put it in the download folder. TF will then skip the download and thus avoid this bug.
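
A rough sketch of that manual workaround, under stated assumptions: the archive has already been downloaded by hand (for example from the biometrics.nist.gov URL given earlier in this thread), and the expected SHA-256 is the one printed in the error message. `sha256_of` and `place_in_download_dir` are hypothetical helper names, not TFDS functions. Whether TFDS actually reuses a file placed this way has not been verified here.

```python
import hashlib
import shutil
from pathlib import Path

# Expected checksum copied from the NonMatchingChecksumError message above.
EXPECTED_SHA256 = "fb9bb67e33772a9cc0b895e4ecf36d2cf35be8b709693c3564cea2a019fcda8e"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def place_in_download_dir(archive, download_dir, cache_name, expected_sha256):
    """Copy `archive` to download_dir/cache_name if its checksum matches.

    Hypothetical helper; the target name is an assumption supplied by the
    caller, not something this function derives from TFDS.
    """
    actual = sha256_of(archive)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch: expected {expected_sha256}, got {actual}")
    target = Path(download_dir) / cache_name
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(archive, target)
    return target
```

For this issue, `download_dir` would presumably be /root/tensorflow_datasets/downloads and `cache_name` the file name from the error message without the `.tmp.…` suffix. Both are read off the error output and are assumptions, since the cached name is derived from the old URL and may change once the URL is updated.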

davidshen84 avatar May 07 '24 22:05 davidshen84

Hello, https://github.com/tensorflow/datasets/pull/5401, which should solve the issue, is now merged! Starting tomorrow, the change will be available in tfds-nightly.

ccl-core avatar May 08 '24 14:05 ccl-core

Thank you both for letting us know. It was helpful!! :)

minchan0410 avatar May 08 '24 14:05 minchan0410