TFDS hangs when downloading/loading WIT, seems related to Apache Beam and google.cloud.bigquery_storage_v1
Short description: I am trying to use the WIT dataset. With the download=True flag set, the program seems to hang, or at least takes a very long time to run, without printing any information about what it is doing.
The exact code I am using is:
import tensorflow_datasets as tfds

subset, subset_info = tfds.load(
    name="wit",
    split="val",
    shuffle_files=False,
    download=True,
    as_supervised=False,
    data_dir=args.source_dataset_dir,
    with_info=True,
)
Environment information
- Operating System: Ubuntu 22.04
- Python version: 3.8.13
- tf-estimator-nightly 2.10.0.dev2022070408 pypi_0 pypi
- tf-nightly 2.10.0.dev20220704 pypi_0 pypi
- tfds-nightly 4.6.0.dev202207080047 pypi_0 pypi
- Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)? Yes
Reproduction instructions
import tensorflow_datasets as tfds

subset, subset_info = tfds.load(
    name="wit",
    split="val",
    shuffle_files=False,
    download=True,
    as_supervised=False,
    data_dir="/tmp/tfds/data",
    with_info=True,
)
Link to logs: https://gist.github.com/AntreasAntoniou/73feea24699c60bc94411ad81c73f8d2
Expected behavior: The code should download the required dataset and load it. Instead, it downloads the dataset and then appears to hang. I understand the notice that datasets using Apache Beam may take a long time to load, but the current behaviour gives no indication of whether anything is being processed or whether the run is stuck. I let this run for 12 hours on an 8-core AMD CPU hoping that some feedback would come through, but none ever did.
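To help tell a slow run from a stuck one, something along these lines might be useful. It is only a rough sketch, not part of TFDS: the helper names run_with_stack_dumps and load_wit_val are made up for illustration, and the 10-minute interval is an arbitrary assumption. It relies on the standard-library faulthandler module to print the stack of every Python thread at a fixed interval while the load is running.

```python
import faulthandler
import sys


def run_with_stack_dumps(fn, interval_s=600, stream=sys.stderr):
    """Call fn() while dumping all thread stacks every interval_s seconds.

    Hypothetical helper: repeated dumps that show the same frames suggest
    the process is stuck; changing frames suggest it is still making progress.
    """
    # The stream must be a real file object with a file descriptor.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=stream)
    try:
        return fn()
    finally:
        # Always stop the watchdog, even if fn() raises.
        faulthandler.cancel_dump_traceback_later()


def load_wit_val(data_dir="/tmp/tfds/data"):
    """Hypothetical wrapper around the tfds.load call from the report."""
    # Imported lazily so the helper above can be used without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(
        name="wit",
        split="val",
        shuffle_files=False,
        download=True,
        as_supervised=False,
        data_dir=data_dir,
        with_info=True,
    )


# Usage (not executed here):
# subset, subset_info = run_with_stack_dumps(load_wit_val)
```

The dumps only cover Python frames in the current process, so any work done in separate worker processes would not show up here.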