Pushing a large dataset on the hub consistently hangs
Describe the bug
After building a large dataset locally, I use the recommended `.push_to_hub` approach to upload it, and after pushing a few shards it consistently hangs. This has happened more than 40 times over the past week. Despite my best efforts to catch the hang, kill the process, and restart, it wastes an enormous amount of time, so I am reporting it here and asking for help.
I already tried installing hf_transfer, but it doesn't support byte-file uploads, so I uninstalled it.
Reproduction
import multiprocessing as mp
import pathlib
from math import ceil

import datasets
import numpy as np
from tqdm.auto import tqdm

from tali.data.data import select_subtitles_between_timestamps
from tali.utils import load_json

tali_dataset_dir = "/data/"

if __name__ == "__main__":
    full_dataset = datasets.load_dataset(
        "Antreas/TALI", num_proc=mp.cpu_count(), cache_dir=tali_dataset_dir
    )

    def data_generator(set_name, percentage: float = 1.0):
        dataset = full_dataset[set_name]

        for item in tqdm(dataset):
            video_list = item["youtube_content_video"]
            video_list = np.random.choice(
                video_list, int(ceil(len(video_list) * percentage))
            )
            if len(video_list) == 0:
                continue
            captions = item["youtube_subtitle_text"]
            captions = select_subtitles_between_timestamps(
                subtitle_dict=load_json(
                    captions.replace("/data/", tali_dataset_dir)
                ),
                starting_timestamp=0,
                ending_timestamp=100000000,
            )
            for video_path in video_list:
                temp_path = video_path.replace("/data/", tali_dataset_dir)
                video_path_actual: pathlib.Path = pathlib.Path(temp_path)

                if video_path_actual.exists():
                    # Read the video bytes with a context manager so the file
                    # handle is closed promptly.
                    with open(video_path_actual, "rb") as video_file:
                        item["youtube_content_video"] = video_file.read()
                    item["youtube_subtitle_text"] = captions
                    yield item

    train_generator = lambda: data_generator("train", percentage=0.1)
    val_generator = lambda: data_generator("val")
    test_generator = lambda: data_generator("test")

    train_data = datasets.Dataset.from_generator(
        train_generator,
        num_proc=mp.cpu_count(),
        writer_batch_size=5000,
        cache_dir=tali_dataset_dir,
    )

    val_data = datasets.Dataset.from_generator(
        val_generator,
        writer_batch_size=5000,
        num_proc=mp.cpu_count(),
        cache_dir=tali_dataset_dir,
    )

    test_data = datasets.Dataset.from_generator(
        test_generator,
        writer_batch_size=5000,
        num_proc=mp.cpu_count(),
        cache_dir=tali_dataset_dir,
    )

    dataset = datasets.DatasetDict(
        {
            "train": train_data,
            "val": val_data,
            "test": test_data,
        }
    )

    successful_completion = False

    while not successful_completion:
        try:
            dataset.push_to_hub(repo_id="Antreas/TALI-small", max_shard_size="5GB")
            successful_completion = True
        except Exception as e:
            print(e)
Logs
Pushing dataset shards to the dataset hub: 33%|██████████████████████████████████████▎ | 7/21 [24:33<49:06, 210.45s/it]
Error while uploading 'data/val-00007-of-00021-6b216a984af1a4c8.parquet' to the Hub.
Pushing split train to the Hub.
Resuming upload of the dataset shards.
Pushing dataset shards to the dataset hub: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46/46 [42:10<00:00, 55.01s/it]
Pushing split val to the Hub.
Resuming upload of the dataset shards.
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00, 1.55ba/s]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.51s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.39ba/s]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:30<00:00, 30.19s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.28ba/s]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:24<00:00, 24.08s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.42ba/s]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.97s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.49ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.54ba/s]
Upload 1 LFS files:   0%|          | 0/1 [04:42<?, ?it/s]
Pushing dataset shards to the dataset hub: 52%|████████████████████████████████████████████████████████████▏ | 11/21 [17:23<15:48, 94.82s/it]
That's where it got stuck
System info
- huggingface_hub version: 0.15.1
- Platform: Linux-5.4.0-147-generic-x86_64-with-glibc2.35
- Python version: 3.10.11
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /root/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: Antreas
- Configured git credential helpers: store
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.1.0.dev20230606+cu121
- Jinja2: 3.1.2
- Graphviz: N/A
- Pydot: N/A
- Pillow: 9.5.0
- hf_transfer: N/A
- gradio: N/A
- numpy: 1.24.3
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: /root/.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: /root/.cache/huggingface/assets
- HF_TOKEN_PATH: /root/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
Hi @AntreasAntoniou , sorry to know you are facing this issue. To help debugging it, could you tell me:
- What is the total dataset size?
- Is it always failing on the same shard or is the hanging problem happening randomly?
- Were you able to save the dataset as parquet locally? This would help us determine if the problem comes from the upload or the file generation.
I'm cc-ing @lhoestq who might have some insights from a datasets perspective.
One trick that can also help is to check the traceback when you kill your Python process: it will show where in the code it was hanging.
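For reference, one way to capture such a traceback without killing the process is the stdlib faulthandler module, which can periodically dump every thread's stack (the 600-second interval below is an arbitrary choice, not from this thread):

```python
import faulthandler
import sys

# Dump every thread's traceback to stderr after each 600 s interval; if the
# upload hangs, the last dump shows which lock the worker threads are stuck on.
faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)

# ... run dataset.push_to_hub(...) here ...

# Cancel the periodic dump once the upload finishes.
faulthandler.cancel_dump_traceback_later()
```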
Right. So I tried the trick @lhoestq suggested. Here is where things seem to hang:
Error while uploading 'data/train-00120-of-00195-466c2dbab2eb9989.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.15s/ba]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:52<00:00, 52.12s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.08s/ba]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:45<00:00, 45.54s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.08s/ba]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.03s/ba]
Upload 1 LFS files:   0%|          | 0/1 [21:27:35<?, ?it/s]
Pushing dataset shards to the dataset hub: 63%|█████████████████████████████████████████████████████████████▎ | 122/195 [23:37:11<14:07:59, 696.98s/it]
^CError in sys.excepthook:
Traceback (most recent call last):
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1699, in print
extend(render(renderable, render_options))
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1335, in render
yield from self.render(render_output, _options)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render
for render_output in iter_render:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/constrain.py", line 29, in __rich_console__
yield from console.render(self.renderable, child_options)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render
for render_output in iter_render:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/panel.py", line 220, in __rich_console__
lines = console.render_lines(renderable, child_options, style=style)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1371, in render_lines
lines = list(
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 292, in split_and_crop_lines
for segment in segments:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render
for render_output in iter_render:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/padding.py", line 97, in __rich_console__
lines = console.render_lines(
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1371, in render_lines
lines = list(
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 292, in split_and_crop_lines
for segment in segments:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1335, in render
yield from self.render(render_output, _options)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render
for render_output in iter_render:
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 611, in __rich_console__
segments = Segments(self._get_syntax(console, options))
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 668, in __init__
self.segments = list(segments)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 674, in _get_syntax
lines: Union[List[Text], Lines] = text.split("\n", allow_blank=ends_on_nl)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/text.py", line 1042, in split
lines = Lines(
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/containers.py", line 70, in __init__
self._lines: List["Text"] = list(lines)
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/text.py", line 1043, in <genexpr>
line for line in self.divide(flatten_spans()) if line.plain != separator
File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/text.py", line 385, in plain
if len(self._text) != 1:
KeyboardInterrupt
Original exception was:
Traceback (most recent call last):
File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
for obj in iterable:
File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
yield _result_or_cancel(fs.pop())
File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
return fut.result(timeout)
File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 453, in result
self._condition.wait(timeout)
File "/opt/conda/envs/main/lib/python3.10/threading.py", line 320, in wait
waiter.acquire()
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/TALI/tali/scripts/validate_dataset.py", line 127, in <module>
train_dataset.push_to_hub(repo_id="Antreas/TALI-base", max_shard_size="5GB")
File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/dataset_dict.py", line 1583, in push_to_hub
repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parquet_shards_to_hub(
File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5275, in _push_parquet_shards_to_hub
_retry(
File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 282, in _retry
return func(*func_args, **func_kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 826, in _inner
return fn(self, *args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3205, in upload_file
commit_info = self.create_commit(
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 826, in _inner
return fn(self, *args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2680, in create_commit
upload_lfs_files(
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/_commit_api.py", line 353, in upload_lfs_files
thread_map(
File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 49, in _executor_map
with PoolExecutor(max_workers=max_workers, initializer=tqdm_class.set_lock,
File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 649, in __exit__
self.shutdown(wait=True)
File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/thread.py", line 235, in shutdown
t.join()
File "/opt/conda/envs/main/lib/python3.10/threading.py", line 1096, in join
self._wait_for_tstate_lock()
File "/opt/conda/envs/main/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
if lock.acquire(block, timeout):
KeyboardInterrupt
@Wauplin
What is the total dataset size?
There are three variants, and the random hanging happens on all three. The sizes are 2TB, 1TB, and 200GB.
Is it always failing on the same shard or is the hanging problem happening randomly?
It seems to be very much random: restarting can move past the previous hang, only to hit a new one later, or sometimes none at all.
Were you able to save the dataset as parquet locally? This would help us determine if the problem comes from the upload or the file generation.
Yes. The dataset seems to be locally stored as parquet.
Hmm, it looks like an issue with the tqdm lock. Maybe you can try updating tqdm?
I am using the latest version of tqdm
⬢ [Docker] ❯ pip install tqdm --upgrade
Requirement already satisfied: tqdm in /opt/conda/envs/main/lib/python3.10/site-packages (4.65.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
I tried to catch the hanging issue in action again:
Pushing dataset shards to the dataset hub: 65%|█████████████████████████████████████████████████████████████████▊ | 127/195 [2:28:02<1:19:15, 69.94s/it]
Error while uploading 'data/train-00127-of-00195-3f8d036ade107c27.parquet' to the Hub.
Pushing split train to the Hub.
Pushing dataset shards to the dataset hub: 64%|████████████████████████████████████████████████████████████████▏ | 124/195 [2:06:10<1:12:14, 61.05s/it]^C
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /TALI/tali/scripts/validate_dataset.py:127 in <module> │
│ │
│ 124 │ │
│ 125 │ while not succesful_competion: │
│ 126 │ │ try: │
│ ❱ 127 │ │ │ train_dataset.push_to_hub(repo_id="Antreas/TALI-base", max_shard_size="5GB") │
│ 128 │ │ │ succesful_competion = True │
│ 129 │ │ except Exception as e: │
│ 130 │ │ │ print(e) │
│ │
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/dataset_dict.py:1583 in push_to_hub │
│ │
│ 1580 │ │ for split in self.keys(): │
│ 1581 │ │ │ logger.warning(f"Pushing split {split} to the Hub.") │
│ 1582 │ │ │ # The split=key needs to be removed before merging │
│ ❱ 1583 │ │ │ repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parq │
│ 1584 │ │ │ │ repo_id, │
│ 1585 │ │ │ │ split=split, │
│ 1586 │ │ │ │ private=private, │
│ │
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:5263 in │
│ _push_parquet_shards_to_hub │
│ │
│ 5260 │ │ │
│ 5261 │ │ uploaded_size = 0 │
│ 5262 │ │ shards_path_in_repo = [] │
│ ❱ 5263 │ │ for index, shard in logging.tqdm( │
│ 5264 │ │ │ enumerate(itertools.chain([first_shard], shards_iter)), │
│ 5265 │ │ │ desc="Pushing dataset shards to the dataset hub", │
│ 5266 │ │ │ total=num_shards, │
│ │
│ /opt/conda/envs/main/lib/python3.10/site-packages/tqdm/std.py:1178 in __iter__ │
│ │
│ 1175 │ │ time = self._time │
│ 1176 │ │ │
│ 1177 │ │ try: │
│ ❱ 1178 │ │ │ for obj in iterable: │
│ 1179 │ │ │ │ yield obj │
│ 1180 │ │ │ │ # Update and possibly print the progressbar. │
│ 1181 │ │ │ │ # Note: does not call self.update(1) for speed optimisation. │
│ │
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:5238 in │
│ shards_with_embedded_external_files │
│ │
│ 5235 │ │ │ │ for shard in shards: │
│ 5236 │ │ │ │ │ format = shard.format │
│ 5237 │ │ │ │ │ shard = shard.with_format("arrow") │
│ ❱ 5238 │ │ │ │ │ shard = shard.map( │
│ 5239 │ │ │ │ │ │ embed_table_storage, │
│ 5240 │ │ │ │ │ │ batched=True, │
│ 5241 │ │ │ │ │ │ batch_size=1000, │
│ │
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:578 in wrapper │
│ │
│ 575 │ │ else: │
│ 576 │ │ │ self: "Dataset" = kwargs.pop("self") │
│ 577 │ │ # apply actual function │
│ ❱ 578 │ │ out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) │
│ 579 │ │ datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou │
│ 580 │ │ for dataset in datasets: │
│ 581 │ │ │ # Remove task templates if a column mapping of the template is no longer val │
│ │
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:543 in wrapper │
│ │
│ 540 │ │ │ "output_all_columns": self._output_all_columns, │
│ 541 │ │ } │
│ 542 │ │ # apply actual function │
│ ❱ 543 │ │ out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) │
│ 544 │ │ datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou │
│ 545 │ │ # re-apply format to the output │
│ 546 │ │ for dataset in datasets: │
│ │
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:3073 in map │
│ │
│ 3070 │ │ │ │ │ leave=False, │
│ 3071 │ │ │ │ │ desc=desc or "Map", │
│ 3072 │ │ │ │ ) as pbar: │
│ ❱ 3073 │ │ │ │ │ for rank, done, content in Dataset._map_single(**dataset_kwargs): │
│ 3074 │ │ │ │ │ │ if done: │
│ 3075 │ │ │ │ │ │ │ shards_done += 1 │
│ 3076 │ │ │ │ │ │ │ logger.debug(f"Finished processing shard number {rank} of {n │
│ │
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:3464 in _map_single │
│ │
│ 3461 │ │ │ │ │ │ │ │ buf_writer, writer, tmp_file = init_buffer_and_writer() │
│ 3462 │ │ │ │ │ │ │ │ stack.enter_context(writer) │
│ 3463 │ │ │ │ │ │ │ if isinstance(batch, pa.Table): │
│ ❱ 3464 │ │ │ │ │ │ │ │ writer.write_table(batch) │
│ 3465 │ │ │ │ │ │ │ else: │
│ 3466 │ │ │ │ │ │ │ │ writer.write_batch(batch) │
│ 3467 │ │ │ │ │ │ num_examples_progress_update += num_examples_in_batch │
│ │
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_writer.py:567 in write_table │
│ │
│ 564 │ │ │ writer_batch_size = self.writer_batch_size │
│ 565 │ │ if self.pa_writer is None: │
│ 566 │ │ │ self._build_writer(inferred_schema=pa_table.schema) │
│ ❱ 567 │ │ pa_table = pa_table.combine_chunks() │
│ 568 │ │ pa_table = table_cast(pa_table, self._schema) │
│ 569 │ │ if self.embed_local_files: │
│ 570 │ │ │ pa_table = embed_table_storage(pa_table) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyboardInterrupt
I'm on my phone so can't help that much. What I'd advise is to save_to_disk if it's not already done and then upload the files/folder to the Hub separately. You can find what you need in the upload guide. It might not help find the exact issue for now, but at least it can unblock you.
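A minimal sketch of that workaround (the function and argument names here are illustrative, not from the thread), assuming the dataset has already been written out with save_to_disk:

```python
from huggingface_hub import HfApi


def upload_saved_dataset(local_dir: str, repo_id: str) -> None:
    """Upload a folder produced by Dataset.save_to_disk as a dataset repo."""
    api = HfApi()
    # Create the repo if needed; exist_ok makes the call idempotent.
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    # upload_folder commits the directory contents; re-running it after a
    # failed attempt skips LFS files the Hub already has.
    api.upload_folder(
        folder_path=local_dir,
        repo_id=repo_id,
        repo_type="dataset",
    )
```

This decouples shard generation from the upload, so a failed upload can simply be re-run without regenerating anything.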
In your last stacktrace it interrupted while embedding external content, in case your dataset is made of images or audio files that live on your disk. Is that the case?
Yeah, the dataset has images, audio, video and text.
It may be related to https://github.com/apache/arrow/issues/34455: are you using ArrayND features?
Also, what's your pyarrow version? Could you try updating to >= 12.0.1?
I was using pyarrow == 12.0.0
I am not explicitly using ArrayND features, unless the hub API automatically converts my files to such.
I have now updated to pyarrow == 12.0.1 and am retrying.
You can also try reducing the max_shard_size: Parquet sometimes has a hard time working with data bigger than 2GB.
So, updating pyarrow seems to help. It still throws errors here and there, but I can retry when that happens; it's better than hanging.
However, I am a bit confused about something. I have uploaded my datasets, but while earlier I could see all three sets, now I can only see one. What's going on? https://huggingface.co/datasets/Antreas/TALI-base
I have seen this happen before as well, so I deleted and reuploaded, but this dataset is way too large for me to do that again.
It's a bug on our side, I'll update the dataset viewer ;)
Thanks for reporting !
Apparently this happened because of bad modifications in the README.md split metadata.
I fixed them in this PR: https://huggingface.co/datasets/Antreas/TALI-base/discussions/1
@lhoestq It's a bit odd that when uploading a dataset one split at a time ("train", "val", "test"), push_to_hub overwrites the README and removes differently named splits from previous commits. That is, you push "val" and all is well; then you push "test", and the "val" entry disappears from the README, while the data remain intact.
Also, I just found another related issue, one of the many that make things hang or fail when pushing to the Hub. In the following code:
train_generator = lambda: data_generator("train", percentage=1.0)
val_generator = lambda: data_generator("val")
test_generator = lambda: data_generator("test")

train_data = datasets.Dataset.from_generator(
    train_generator,
    num_proc=mp.cpu_count(),
    writer_batch_size=5000,
    cache_dir=tali_dataset_dir,
)

val_data = datasets.Dataset.from_generator(
    val_generator,
    writer_batch_size=5000,
    num_proc=mp.cpu_count(),
    cache_dir=tali_dataset_dir,
)

test_data = datasets.Dataset.from_generator(
    test_generator,
    writer_batch_size=5000,
    num_proc=mp.cpu_count(),
    cache_dir=tali_dataset_dir,
)

print("Pushing TALI-large to hub")

dataset = datasets.DatasetDict(
    {"train": train_data, "val": val_data, "test": test_data}
)
successful_completion = False

while not successful_completion:
    try:
        dataset.push_to_hub(repo_id="Antreas/TALI-large", max_shard_size="2GB")
        successful_completion = True
    except Exception as e:
        print(e)
Things keep failing in the push_to_hub step, at random places, with the following error:
Pushing dataset shards to the dataset hub: 7%|██████████▋ | 67/950 [42:41<9:22:37, 38.23s/it]
Error while uploading 'data/train-00067-of-00950-a4d179ed5a593486.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.81ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.20s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.48ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.30s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.39ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.52s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.47ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.39s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.26ba/s]
Upload 1 LFS files: 0%| | 0/1 [16:38<?, ?it/s]
Pushing dataset shards to the dataset hub: 7%|███████████▎ | 71/950 [44:37<9:12:28, 37.71s/it]
Error while uploading 'data/train-00071-of-00950-72bab6e5cb223aee.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.18ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.94s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.36ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.67s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.57ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.16s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.68ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:09<00:00, 9.63s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.36ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.67s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.37ba/s]
Upload 1 LFS files: 0%| | 0/1 [16:39<?, ?it/s]
Pushing dataset shards to the dataset hub: 8%|████████████ | 76/950 [46:21<8:53:08, 36.60s/it]
Error while uploading 'data/train-00076-of-00950-b90e4e3b433db179.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.21ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:25<00:00, 25.40s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.56ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.40s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.49ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.53s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.27ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.25s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.42ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.03s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.39ba/s]
Upload 1 LFS files: 0%| | 0/1 [16:39<?, ?it/s]
Pushing dataset shards to the dataset hub: 9%|████████████▊ | 81/950 [48:30<8:40:22, 35.93s/it]
Error while uploading 'data/train-00081-of-00950-84b0450a1df093a9.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.18ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.65s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.92ba/s]
Upload 1 LFS files: 0%| | 0/1 [16:38<?, ?it/s]
Pushing dataset shards to the dataset hub: 9%|█████████████ | 82/950 [48:55<8:37:57, 35.80s/it]
Error while uploading 'data/train-00082-of-00950-0a1f52da35653e08.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.31ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:26<00:00, 26.29s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.42ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.57s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.64ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.35s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.64ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.74s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.31ba/s]
Upload 1 LFS files: 0%| | 0/1 [16:40<?, ?it/s]
Pushing dataset shards to the dataset hub: 9%|█████████████▋ | 86/950 [50:48<8:30:25, 35.45s/it]
Error while uploading 'data/train-00086-of-00950-e1cc80dd17191b20.parquet' to the Hub.
I have a while loop that forces retries, but it seems that the progress itself is randomly getting lost as well. Any ideas on how to improve this? It has been blocking me for way too long.
Should I build the parquet manually and then push manually as well? If I do things manually, how can I ensure my dataset works properly with "stream=True"?
Thank you for your help and time.
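For reference, the retry loop I mentioned amounts to something like this (a simplified sketch; the real one wraps dataset.push_to_hub, and the attempt count and delays here are illustrative):

```python
import time


def push_with_retries(push_fn, max_attempts: int = 10, base_delay: float = 5.0):
    """Call push_fn until it succeeds, with exponential backoff.

    push_fn is whatever performs the upload (e.g. a lambda wrapping
    dataset.push_to_hub); on repeated failure the last error is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return push_fn()
        except Exception as err:
            # Log the chained cause too, not just the wrapper error.
            print(f"attempt {attempt} failed: {err!r} (cause: {err.__cause__!r})")
            if attempt == max_attempts:
                raise
            # Back off: 5s, 10s, 20s, ... capped at 5 minutes.
            time.sleep(min(base_delay * 2 ** (attempt - 1), 300.0))
```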
@lhoestq It's a bit odd that when uploading a dataset one split at a time ("train", "val", "test"), push_to_hub overwrites the README and removes differently named splits from previous commits. I.e., you push "val" and all is well; then you push "test", and the "val" entry disappears from the README, while the data remain intact.
Hmm, this shouldn't happen. What code did you run exactly? And which version of datasets are you using?
I have a while loop that forces retries, but it seems that the progress itself is randomly getting lost as well. Any ideas on how to improve this? It has been blocking me for way too long.
Could you also print the cause of the error (e.__cause__)? Or show the full stack trace when the error happens?
This would give more details about why it failed and would help investigate.
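A minimal illustration of what that looks like; the RuntimeError/ConnectionError pair below just stands in for the "Error while uploading ..." exception and its HTTP-level cause:

```python
import traceback


def describe(err: BaseException) -> str:
    # Collect the error, its chained cause, and the full stack trace.
    parts = [f"error: {err!r}", f"cause: {err.__cause__!r}"]
    parts.append(
        "".join(traceback.format_exception(type(err), err, err.__traceback__))
    )
    return "\n".join(parts)


try:
    try:
        raise ConnectionError("Max retries exceeded")
    except ConnectionError as low_level:
        raise RuntimeError("Error while uploading shard to the Hub") from low_level
except RuntimeError as err:
    report = describe(err)
    print(report)
```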
Should I build the parquet manually and then push manually as well? If I do things manually, how can I ensure my dataset works properly with "stream=True"?
Parquet is supported out of the box ^^
If you want to make sure it works as expected you can try locally first:
ds = load_dataset("path/to/local", streaming=True)
@lhoestq @AntreasAntoniou I transferred this issue to the datasets repository as the questions and answers are more related to this repo. Hope it can help other users find the bug and fixes more easily (like updating tqdm and pyarrow, or setting a lower max_shard_size).
~For the initial "pushing large dataset consistently hangs" issue, I still think it's best to try save_to_disk first and then upload it manually/with a script (see upload_folder). It's not the most satisfying solution, but at least it would confirm where the problem comes from.~
EDIT: removed suggestion about saving to disk first (see https://github.com/huggingface/datasets/issues/5990#issuecomment-1607186914).
@lhoestq @AntreasAntoniou I transferred this issue to the datasets repository as the questions and answers are more related to this repo. Hope it can help other users find the bug and fixes more easily (like updating https://github.com/huggingface/datasets/issues/5990#issuecomment-1607120204 and https://github.com/huggingface/datasets/issues/5990#issuecomment-1607120278 or https://github.com/huggingface/datasets/issues/5990#issuecomment-1607120328).
thanks :)
For the initial "pushing large dataset consistently hangs"-issue, I still think it's best to try to save_to_disk first and then upload it manually/with a script (see upload_folder). It's not the most satisfying solution but at least it would confirm from where the problem comes from.
As I've already said in other discussions, I would not recommend pushing files saved with save_to_disk to the Hub, but rather saving to parquet shards and uploading those instead. The Hub does not support datasets saved with save_to_disk, which is meant for disk only.
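A sketch of that approach (Dataset.shard and Dataset.to_parquet are real datasets APIs; the shard naming here mimics the Hub convention but omits the content-fingerprint suffix that push_to_hub appends):

```python
import pathlib


def shard_filename(split: str, index: int, num_shards: int) -> str:
    # e.g. "train-00005-of-00023.parquet"; the Hub's own shards also carry
    # a trailing fingerprint such as -84b0450a1df093a9, omitted here.
    return f"{split}-{index:05d}-of-{num_shards:05d}.parquet"


def save_parquet_shards(dataset, out_dir: str, split: str, num_shards: int):
    """Write `dataset` (a datasets.Dataset) as parquet shards under out_dir."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(num_shards):
        path = out / shard_filename(split, i, num_shards)
        dataset.shard(num_shards=num_shards, index=i).to_parquet(str(path))
        paths.append(path)
    return paths
```

The resulting folder can then be pushed in one call with huggingface_hub's upload_folder (repo_type="dataset", path_in_repo="data"), which should resume more gracefully since files already present on the Hub are not re-uploaded.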
As I've already said in other discussions, I would not recommend pushing files saved with save_to_disk to the Hub but save to parquet shards and upload them instead. The Hub does not support datasets saved with save_to_disk, which is meant for disk only.
Well noted, thanks. That part was not clear to me :)
Sorry for not replying in a few days, I was on leave. :)
So, here is more information about the error that causes some of the delay:
Pushing Antreas/TALI-tiny to hub
Attempting to push to hub
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:24<00:00, 4.06s/ba]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:24<00:00, 4.15s/ba]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:26<00:00, 4.45s/ba]
/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/lfs.py:310: UserWarning: hf_transfer is enabled but does not support uploading from bytes or BinaryIO, falling back to regular upload
warnings.warn(
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:25<00:00, 4.26s/ba]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:27<00:00, 4.58s/ba]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:24<00:00, 4.10s/ba]
Pushing dataset shards to the dataset hub: 22%|████████████████████████▎ | 5/23 [52:23<3:08:37, 628.74s/it]
Exception: Error while uploading 'data/train-00005-of-00023-e224d901fd65e062.parquet' to the Hub., with stacktrace: <traceback object at 0x7f745458d0c0>, and type: <class 'RuntimeError'>, and
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url:
/lfs.huggingface.co/repos/7c/d3/7cd385d9324302dc13e3986331d72d9be6fa0174c63dcfe0e08cd474f7f1e8b7/3415166ae28c0beccbbc692f38742b8dea2c197f5c805321104e888d21d7eb90?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230627%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230627T003349Z&X-Amz-Expires=86400&X-Amz-Signature=5a12ff96f2
91f644134170992a6628e5f3c4e7b2e7fc3e940b4378fe11ae5390&X-Amz-SignedHeaders=host&partNumber=1&uploadId=JSsK8r63XSF.VlKQx3Vf8OW4DEVp5YIIY7LPnuapNIegsxs5EHgM1p4u0.Nn6_wlPlQnvxm8HKMxZhczKE9KB74t0etB
oLcxqBIvsgey3uXBTZMAEGwU6y7CDUADiEIO&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.
One issue is that the upload does not resume from the shard it failed on. It often restarts from a much earlier shard; e.g., if it failed on shard 192/250, it will continue from, say, 53/250, and this behaviour appears almost random.
Are you using a proxy of some sort?
I am using a kubernetes cluster built into a university VPN.
So, other than the random connection drops here and there, any idea why the progress does not continue where it left off?
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 10.79ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.65ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.39ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.04ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.52ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 12.28ba/s]
Pushing dataset shards to the dataset hub: 20%|██████████████████████ | 75/381 [1:34:39<6:26:11, 75.72s/it]
Exception: Error while uploading 'data/train-00075-of-00381-1614bc251b778766.parquet' to the Hub., with stacktrace: <traceback object at 0x7fab6d9a4980>, and type: <class 'RuntimeError'>, and
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url:
/lfs.huggingface.co/repos/3b/31/3b311464573d8d63b137fcd5b40af1e7a5b1306843c88e80372d0117157504e5/ed8dae933fb79ae1ef5fb1f698f5125d3e1c02977ac69438631f152bb3bfdd1e?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-
Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230629%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230629T053004Z&X-Amz-Expires=86400&X-Amz-Signature=da2b26270edfd6d0
d069c015a5a432031107a8664c3f0917717e5e40c688183c&X-Amz-SignedHeaders=host&partNumber=1&uploadId=2erWGHTh3ICqBLU_QvHfnygZ2tkMWbL0rEqpJdYohCKHUHnfwMjvoBIg0TI_KSGn4rSKxUxOyqSIzFUFSRSzixZeLeneaXJOw.Qx8
zLKSV5xV7HRQDj4RBesNve6cSoo&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 12.09ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 11.51ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 10.77ba/s]
Pushing dataset shards to the dataset hub: 20%|██████████████████████▋ | 77/381 [1:32:50<6:06:34, 72.35s/it]
Exception: Error while uploading 'data/train-00077-of-00381-368b2327a9908aab.parquet' to the Hub., with stacktrace: <traceback object at 0x7fab45b27f80>, and type: <class 'RuntimeError'>, and
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url:
/lfs.huggingface.co/repos/3b/31/3b311464573d8d63b137fcd5b40af1e7a5b1306843c88e80372d0117157504e5/9462ff2c5e61283b53b091984a22de2f41a2f6e37b681171e2eca4a998f979cb?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-
Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230629%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230629T070510Z&X-Amz-Expires=86400&X-Amz-Signature=9ab8487b93d443cd
21f05476405855d46051a0771b4986bbb20f770ded21b1a4&X-Amz-SignedHeaders=host&partNumber=1&uploadId=UiHX1B.DcoAO2QmIHpWpCuNPwhXU_o1dsTkTGPqZt1P51o9k0yz.EsFD9eKpQMwgAST3jOatRG78I_JWRBeLBDYYVNp8r0TpIdeSg
eUg8uwPZOCPw9y5mWOw8MWJrnBo&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.
Pushing dataset shards to the dataset hub: 8%|████████▋ | 29/381 [27:39<5:50:03, 59.67s/it]
Map:  36%|█████▏               | 1000/2764 [00:35<00:34, 51.63 examples/s]
Map:  72%|██████████▏          | 2000/2764 [00:40<00:15, 49.06 examples/s]
Map: 100%|█████████████████████| 2764/2764 [00:56<00:00, 48.82 examples/s]
Pushing dataset shards to the dataset hub:   8%|████████▉   | 30/381 [28:35<5:43:03, 58.64s/it]
Pushing dataset shards to the dataset hub:   8%|█████████▎  | 31/381 [29:40<5:52:18, 60.40s/it]
Pushing dataset shards to the dataset hub:   8%|█████████▌  | 32/381 [30:46<6:02:20, 62.29s/it]
This is actually the issue that wastes the most time for me, and I need it fixed. Please advise on how I can go about it.
Notice how the progress goes from 77/381 back to 30/381.
If any shard is missing on the Hub, it will be re-uploaded. It looks like the 30th shard was missing on the Hub in your case.
It also means that the other files up to the 77th that were successfully uploaded won't be uploaded again.
cc @mariosasko who might know better
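In the meantime, one way to sanity-check which shards actually made it to the Hub is to compare the repo's file list against the expected shard names (a sketch; the filename pattern follows the logs above, where the trailing fingerprint varies, so we match on the prefix only):

```python
def missing_shards(repo_files, split: str, num_shards: int):
    """Return indices of parquet shards not yet present in the repo.

    repo_files is the flat file list of the dataset repo; shard files look
    like data/train-00081-of-00950-84b0450a1df093a9.parquet.
    """
    present = set()
    for i in range(num_shards):
        prefix = f"data/{split}-{i:05d}-of-{num_shards:05d}"
        if any(f.startswith(prefix) for f in repo_files):
            present.add(i)
    return sorted(set(range(num_shards)) - present)
```

In practice you would pass HfApi().list_repo_files(repo_id, repo_type="dataset") from huggingface_hub as repo_files.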