
Pushing a large dataset on the hub consistently hangs

Open AntreasAntoniou opened this issue 1 year ago • 45 comments

Describe the bug

Once I have locally built a large dataset that I want to push to the Hub, I use the recommended .push_to_hub approach to get the dataset onto the Hub, and after pushing a few shards it consistently hangs. This has happened over 40 times over the past week, and despite my best efforts to catch it in the act, kill the process, and restart, it has been extremely time consuming -- so I came to you to report this and to seek help.

I already tried installing hf_transfer, but it doesn't support byte file uploads, so I uninstalled it.

Reproduction

import multiprocessing as mp
import pathlib
from math import ceil

import datasets
import numpy as np
from tqdm.auto import tqdm

from tali.data.data import select_subtitles_between_timestamps
from tali.utils import load_json

tali_dataset_dir = "/data/"

if __name__ == "__main__":
    full_dataset = datasets.load_dataset(
        "Antreas/TALI", num_proc=mp.cpu_count(), cache_dir=tali_dataset_dir
    )

    def data_generator(set_name, percentage: float = 1.0):
        dataset = full_dataset[set_name]

        for item in tqdm(dataset):
            video_list = item["youtube_content_video"]
            video_list = np.random.choice(
                video_list, int(ceil(len(video_list) * percentage))
            )
            if len(video_list) == 0:
                continue
            captions = item["youtube_subtitle_text"]
            captions = select_subtitles_between_timestamps(
                subtitle_dict=load_json(
                    captions.replace(
                        "/data/",
                        tali_dataset_dir,
                    )
                ),
                starting_timestamp=0,
                ending_timestamp=100000000,
            )

            for video_path in video_list:
                temp_path = video_path.replace("/data/", tali_dataset_dir)
                video_path_actual: pathlib.Path = pathlib.Path(temp_path)

                if video_path_actual.exists():
                    item["youtube_content_video"] = open(video_path_actual, "rb").read()
                    item["youtube_subtitle_text"] = captions
                    yield item

    train_generator = lambda: data_generator("train", percentage=0.1)
    val_generator = lambda: data_generator("val")
    test_generator = lambda: data_generator("test")

    train_data = datasets.Dataset.from_generator(
        train_generator,
        num_proc=mp.cpu_count(),
        writer_batch_size=5000,
        cache_dir=tali_dataset_dir,
    )

    val_data = datasets.Dataset.from_generator(
        val_generator,
        writer_batch_size=5000,
        num_proc=mp.cpu_count(),
        cache_dir=tali_dataset_dir,
    )

    test_data = datasets.Dataset.from_generator(
        test_generator,
        writer_batch_size=5000,
        num_proc=mp.cpu_count(),
        cache_dir=tali_dataset_dir,
    )

    dataset = datasets.DatasetDict(
        {
            "train": train_data,
            "val": val_data,
            "test": test_data,
        }
    )
    succesful_competion = False
    while not succesful_competion:
        try:
            dataset.push_to_hub(repo_id="Antreas/TALI-small", max_shard_size="5GB")
            succesful_competion = True
        except Exception as e:
            print(e)

Logs

Pushing dataset shards to the dataset hub:  33%|██████████████████████████████████████▎                                                                            | 7/21 [24:33<49:06, 210.45s/it]
Error while uploading 'data/val-00007-of-00021-6b216a984af1a4c8.parquet' to the Hub.                                                                                                               
Pushing split train to the Hub.                                                                                                                                                                    
Resuming upload of the dataset shards.                                                                                                                                                             
Pushing dataset shards to the dataset hub: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46/46 [42:10<00:00, 55.01s/it]
Pushing split val to the Hub.                                                                                                                                                                      
Resuming upload of the dataset shards.                                                                                                                                                             
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.55ba/s]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.51s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.39ba/s]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:30<00:00, 30.19s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.28ba/s]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:24<00:00, 24.08s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.42ba/s]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.97s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.49ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00,  1.54ba/s^
Upload 1 LFS files:   0%|                                                                                                                                                    | 0/1 [04:42<?, ?it/s]
Pushing dataset shards to the dataset hub:  52%|████████████████████████████████████████████████████████████▏                                                      | 11/21 [17:23<15:48, 94.82s/it]

That's where it got stuck

System info

- huggingface_hub version: 0.15.1
- Platform: Linux-5.4.0-147-generic-x86_64-with-glibc2.35
- Python version: 3.10.11
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /root/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: Antreas
- Configured git credential helpers: store
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.1.0.dev20230606+cu121
- Jinja2: 3.1.2
- Graphviz: N/A
- Pydot: N/A
- Pillow: 9.5.0
- hf_transfer: N/A
- gradio: N/A
- numpy: 1.24.3
- ENDPOINT: https://huggingface.co
- HUGGINGFACE_HUB_CACHE: /root/.cache/huggingface/hub
- HUGGINGFACE_ASSETS_CACHE: /root/.cache/huggingface/assets
- HF_TOKEN_PATH: /root/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False

AntreasAntoniou avatar Jun 10 '23 14:06 AntreasAntoniou

Hi @AntreasAntoniou, sorry to hear you are facing this issue. To help debug it, could you tell me:

  • What is the total dataset size?
  • Is it always failing on the same shard or is the hanging problem happening randomly?
  • Were you able to save the dataset as parquet locally (e.g. as sketched below)? This would help us determine if the problem comes from the upload or the file generation.
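
A minimal sketch of that last check, assuming the DatasetDict named dataset built in the reproduction above (the output directory is hypothetical):

from pathlib import Path

out_dir = Path("/data/parquet_check")  # hypothetical output directory
out_dir.mkdir(parents=True, exist_ok=True)

# Dataset.to_parquet writes a split to a single parquet file, so this exercises
# the file generation step without involving the Hub upload at all.
for split_name, split in dataset.items():
    split.to_parquet(str(out_dir / f"{split_name}.parquet"))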

I'm cc-ing @lhoestq who might have some insights from a datasets perspective.

Wauplin avatar Jun 13 '23 09:06 Wauplin

One trick that can also help is to check the traceback when you kill your python process: it will show where in the code it was hanging
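
A related option that avoids killing the process is the standard-library faulthandler module; a minimal sketch (Unix-only, to be added at the top of the script):

import faulthandler
import signal

# Register a handler so that `kill -USR1 <pid>` from another shell prints the
# current stack of every thread to stderr without terminating the process.
faulthandler.register(signal.SIGUSR1, all_threads=True)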

lhoestq avatar Jun 13 '23 16:06 lhoestq

Right. So I did the trick @lhoestq suggested. Here is where things seem to hang

Error while uploading 'data/train-00120-of-00195-466c2dbab2eb9989.parquet' to the Hub.                                                                                                     
Pushing split train to the Hub.                                                                                                                                                            
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.15s/ba]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:52<00:00, 52.12s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.08s/ba]
Upload 1 LFS files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:45<00:00, 45.54s/it]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.08s/ba]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.03s/ba]
Upload 1 LFS files:   0%|                                                                                                                                        | 0/1 [21:27:35<?, ?it/s]
Pushing dataset shards to the dataset hub:  63%|█████████████████████████████████████████████████████████████▎                                    | 122/195 [23:37:11<14:07:59, 696.98s/it]
^CError in sys.excepthook:                                                                                                                                                                 
Traceback (most recent call last):                                                                                                                                                         
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1699, in print                                                                                            
    extend(render(renderable, render_options))                                                                                                                                             
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1335, in render                                                                                           
    yield from self.render(render_output, _options)                                                                                                                                        
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render                                                                                           
    for render_output in iter_render:                                                                                                                                                      
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/constrain.py", line 29, in __rich_console__                                                                                 
    yield from console.render(self.renderable, child_options)                                                                                                                              
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render                                                                                           
    for render_output in iter_render:                                                                                                                                                      
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/panel.py", line 220, in __rich_console__                                                                                    
    lines = console.render_lines(renderable, child_options, style=style)                                                                                                                   
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1371, in render_lines                                                                                     
    lines = list(                                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 292, in split_and_crop_lines                                                                              
    for segment in segments:                                                                                                                                                               
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render                                                                                           
    for render_output in iter_render:                                                                                                                                                      
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/padding.py", line 97, in __rich_console__                                                                                   
    lines = console.render_lines(                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1371, in render_lines                                                                                     
    lines = list(                                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 292, in split_and_crop_lines                                                                              
    for segment in segments:                                                                                                                                                               
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1335, in render                                                                                           
    yield from self.render(render_output, _options)                                                                                                                                        
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/console.py", line 1331, in render                                                                                           
    for render_output in iter_render:                                                                                                                                                      
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 611, in __rich_console__                                                                                   
    segments = Segments(self._get_syntax(console, options))                                                                                                                                
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/segment.py", line 668, in __init__                                                                                          
    self.segments = list(segments)                                                                                                                                                         
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/syntax.py", line 674, in _get_syntax                                                                                        
    lines: Union[List[Text], Lines] = text.split("\n", allow_blank=ends_on_nl)                                                                                                             
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/text.py", line 1042, in split                                                                                               
    lines = Lines(                                                                                                                                                                         
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/containers.py", line 70, in __init__                                                                                        
    self._lines: List["Text"] = list(lines)                                                                                                                                                
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/text.py", line 1043, in <genexpr>                                                                                           
    line for line in self.divide(flatten_spans()) if line.plain != separator                                                                                                               
  File "/opt/conda/envs/main/lib/python3.10/site-packages/rich/text.py", line 385, in plain                                                    
    if len(self._text) != 1:                                                                                                                                                               
KeyboardInterrupt                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                            
Original exception was:                                                                                                                                                                                                                                                                        
Traceback (most recent call last):                                                                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map                                                                                                                                                                               
    return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))                                                                                                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__                                                                                                                                                                                                 
    for obj in iterable:                                                                                                                                                                   
  File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator                                                                                                                                                                                         
    yield _result_or_cancel(fs.pop())                                                                                                                                                      
  File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel                                                                                                                                                                                       
    return fut.result(timeout)                                                                                                                                                             
  File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 453, in result                                                                                                                                                                                                  
    self._condition.wait(timeout)                                                                                                                                                                                                           
  File "/opt/conda/envs/main/lib/python3.10/threading.py", line 320, in wait                                                                                                                                                                                                                   
    waiter.acquire()                                                                                                                                                                                                                        
KeyboardInterrupt                                                                                                                                                                                                                                                                              
                                                                                                                      
During handling of the above exception, another exception occurred:                                                                                                                                                                                                                            
                                                                                                                                                                                                                                            
Traceback (most recent call last):                                                                                                                                                                                                                                                             
  File "/TALI/tali/scripts/validate_dataset.py", line 127, in <module>                                                                                                            
    train_dataset.push_to_hub(repo_id="Antreas/TALI-base", max_shard_size="5GB")                                                                                                                                                                                                               
  File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/dataset_dict.py", line 1583, in push_to_hub                                                                                                                                                                                                                                                      
    repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parquet_shards_to_hub(                                                                                                                                                                                             
  File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 5275, in _push_parquet_shards_to_hub                                                                                                                                                                                                                                     
    _retry(                                                                                                                                                                                                                                                                                    
  File "/opt/conda/envs/main/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 282, in _retry                                                                                                                                                                                                                                                        
    return func(*func_args, **func_kwargs)                                                                                                                                                                                                                                                     
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn                                                                                                                                                                                                                                             
    return fn(*args, **kwargs)                                                                                                                 
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 826, in _inner                                                                                                                                                                                                                                                           
    return fn(self, *args, **kwargs)                                                                                                                                                                                                                                                           
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3205, in upload_file                                                                                                                                                                                                                                                     
    commit_info = self.create_commit(                                  
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn                                                                                                                                                                                                                                             
    return fn(*args, **kwargs)                                                           
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 826, in _inner                                                                                                                                                                                                                                                           
    return fn(self, *args, **kwargs)                                   
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2680, in create_commit                                                                                                                                                                                                                                                   
    upload_lfs_files(                                                                    
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn                                                                                                                                                                                                                                             
    return fn(*args, **kwargs)                                                           
  File "/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/_commit_api.py", line 353, in upload_lfs_files                                                                                                                                                                                                                                            
    thread_map(                                                                          
  File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 69, in thread_map                                                                                                                                                                                                                                                       
    return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)                                                                                                       
  File "/opt/conda/envs/main/lib/python3.10/site-packages/tqdm/contrib/concurrent.py", line 49, in _executor_map                                                                                                                                                                                                                                                    
    with PoolExecutor(max_workers=max_workers, initializer=tqdm_class.set_lock,                                                                                                   
  File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/_base.py", line 649, in __exit__                                                                                                                                                                                                                                                                     
    self.shutdown(wait=True)                                                             
  File "/opt/conda/envs/main/lib/python3.10/concurrent/futures/thread.py", line 235, in shutdown                                                                                                                                                                                                                                                                    
    t.join()                                                                             
  File "/opt/conda/envs/main/lib/python3.10/threading.py", line 1096, in join                                                                                                     
    self._wait_for_tstate_lock()                                                         
  File "/opt/conda/envs/main/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock                                                                                                                                                                                                                                                                      
    if lock.acquire(block, timeout):                                                     
KeyboardInterrupt                                                                        

AntreasAntoniou avatar Jun 16 '23 17:06 AntreasAntoniou

@Wauplin

What is the total dataset size?

There are three variants, and the random hanging happens on all three. The sizes are 2TB, 1TB, and 200GB.

Is it always failing on the same shard or is the hanging problem happening randomly?

It seems to be very much random: restarting can sometimes get past the previous hang, only to hit a new one, and sometimes it doesn't.

Were you able to save the dataset as parquet locally? This would help us determine if the problem comes from the upload or the file generation.

Yes. The dataset seems to be locally stored as parquet.

AntreasAntoniou avatar Jun 16 '23 17:06 AntreasAntoniou

Hmm, it looks like an issue with the TQDM lock. Maybe you can try updating TQDM?

lhoestq avatar Jun 16 '23 17:06 lhoestq

I am using the latest version of tqdm

⬢ [Docker] ❯ pip install tqdm --upgrade
Requirement already satisfied: tqdm in /opt/conda/envs/main/lib/python3.10/site-packages (4.65.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

AntreasAntoniou avatar Jun 16 '23 20:06 AntreasAntoniou

I tried to catch the hanging issue in action again:

Pushing dataset shards to the dataset hub:  65%|█████████████████████████████████████████████████████████████████▊                                   | 127/195 [2:28:02<1:19:15, 69.94s/it]                                               
Error while uploading 'data/train-00127-of-00195-3f8d036ade107c27.parquet' to the Hub.                                                                                                                                                    
Pushing split train to the Hub.                                                                                                                                                                                                           
Pushing dataset shards to the dataset hub:  64%|████████████████████████████████████████████████████████████████▏                                    | 124/195 [2:06:10<1:12:14, 61.05s/it]C^[^C^C^C                                      
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮                                                                                                                                      
│ /TALI/tali/scripts/validate_dataset.py:127 in <module>                                           │                                                                                                                                      
│                                                                                                  │                                                                                                                                      
│   124 │                                                                                          │                                                                                                                                      
│   125 │   while not succesful_competion:                                                         │                                                                                                                                      
│   126 │   │   try:                                                                               │                                                                                                                                      
│ ❱ 127 │   │   │   train_dataset.push_to_hub(repo_id="Antreas/TALI-base", max_shard_size="5GB")   │                                                                                                                                      
│   128 │   │   │   succesful_competion = True                                                     │                                                                                                                                      
│   129 │   │   except Exception as e:                                                             │                                                                                                                                      
│   130 │   │   │   print(e)                                                                       │                                                                                                                                      
│                                                                                                  │                                                                                                                                      
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/dataset_dict.py:1583 in push_to_hub   │                                                                                                                                      
│                                                                                                  │                                                                                                                                      
│   1580 │   │   for split in self.keys():                                                         │                                                                                                                                      
│   1581 │   │   │   logger.warning(f"Pushing split {split} to the Hub.")                          │                                                                                                                                      
│   1582 │   │   │   # The split=key needs to be removed before merging                            │                                                                                                                                      
│ ❱ 1583 │   │   │   repo_id, split, uploaded_size, dataset_nbytes, _, _ = self[split]._push_parq  │                                                                                                                                      
│   1584 │   │   │   │   repo_id,                                                                  │                                                                                                                                      
│   1585 │   │   │   │   split=split,                                                              │                                                                                                                                      
│   1586 │   │   │   │   private=private,                                                          │                                                                                                                                      
│                                                                                                  │                                                                                                                                      
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:5263 in              │                                                                                                                                      
│ _push_parquet_shards_to_hub                                                                      │                                                                                                                                      
│                                                                                                  │                                                                                                                                      
│   5260 │   │                                                                                     │                                                                                                                                      
│   5261 │   │   uploaded_size = 0                                                                 │                                                                                                                                      
│   5262 │   │   shards_path_in_repo = []                                                          │                                                                                                                                      
│ ❱ 5263 │   │   for index, shard in logging.tqdm(                                                 │                                                                                                                                      
│   5264 │   │   │   enumerate(itertools.chain([first_shard], shards_iter)),                       │                                                                                                                                      
│   5265 │   │   │   desc="Pushing dataset shards to the dataset hub",                             │                                                                                                                                      
│   5266 │   │   │   total=num_shards,                                                             │                                                                                                                                      
│                                                                                                  │                                                                                                                                      
│ /opt/conda/envs/main/lib/python3.10/site-packages/tqdm/std.py:1178 in __iter__                   │                                                                                                                                      
│                                                                                                  │                                                                                                                                      
│   1175 │   │   time = self._time                                                                 │                                                                                                                                      
│   1176 │   │                                                                                     │                                                                                                                                      
│   1177 │   │   try:                                                                              │
│ ❱ 1178 │   │   │   for obj in iterable:                                                          │
│   1179 │   │   │   │   yield obj                                                                 │
│   1180 │   │   │   │   # Update and possibly print the progressbar.                              │
│   1181 │   │   │   │   # Note: does not call self.update(1) for speed optimisation.              │
│                                                                                                  │
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:5238 in              │
│ shards_with_embedded_external_files                                                              │
│                                                                                                  │
│   5235 │   │   │   │   for shard in shards:                                                      │
│   5236 │   │   │   │   │   format = shard.format                                                 │
│   5237 │   │   │   │   │   shard = shard.with_format("arrow")                                    │
│ ❱ 5238 │   │   │   │   │   shard = shard.map(                                                    │
│   5239 │   │   │   │   │   │   embed_table_storage,                                              │
│   5240 │   │   │   │   │   │   batched=True,                                                     │
│   5241 │   │   │   │   │   │   batch_size=1000,                                                  │
│                                                                                                  │
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:578 in wrapper       │
│                                                                                                  │
│    575 │   │   else:                                                                             │
│    576 │   │   │   self: "Dataset" = kwargs.pop("self")                                          │
│    577 │   │   # apply actual function                                                           │
│ ❱  578 │   │   out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                │                                         
│    579 │   │   datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou  │                                         
│    580 │   │   for dataset in datasets:                                                          │                                         
│    581 │   │   │   # Remove task templates if a column mapping of the template is no longer val  │                                         
│                                                                                                  │                                         
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:543 in wrapper       │                                         
│                                                                                                  │                                         
│    540 │   │   │   "output_all_columns": self._output_all_columns,                               │                                         
│    541 │   │   }                                                                                 │                                         
│    542 │   │   # apply actual function                                                           │                                                                  
│ ❱  543 │   │   out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                │                                                                  
│    544 │   │   datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [ou  │                                                                  
│    545 │   │   # re-apply format to the output                                                   │                                                                  
│    546 │   │   for dataset in datasets:                                                          │                                                                  
│                                                                                                  │                                                                  
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:3073 in map          │                                                                  
│                                                                                                  │                                                                  
│   3070 │   │   │   │   │   leave=False,                                                          │                                                                  
│   3071 │   │   │   │   │   desc=desc or "Map",                                                   │                                                                  
│   3072 │   │   │   │   ) as pbar:                                                                │                                                                  
│ ❱ 3073 │   │   │   │   │   for rank, done, content in Dataset._map_single(**dataset_kwargs):     │                                                                  
│   3074 │   │   │   │   │   │   if done:                                                          │                                                                  
│   3075 │   │   │   │   │   │   │   shards_done += 1                                              │                                                                                                     
│   3076 │   │   │   │   │   │   │   logger.debug(f"Finished processing shard number {rank} of {n  │                                                                                                     
│                                                                                                  │                                                                                                     
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_dataset.py:3464 in _map_single  │                                                                                                     
│                                                                                                  │                                                                                                     
│   3461 │   │   │   │   │   │   │   │   buf_writer, writer, tmp_file = init_buffer_and_writer()   │                                                                                                     
│   3462 │   │   │   │   │   │   │   │   stack.enter_context(writer)                               │                                                                                                     
│   3463 │   │   │   │   │   │   │   if isinstance(batch, pa.Table):                               │                                                                                                     
│ ❱ 3464 │   │   │   │   │   │   │   │   writer.write_table(batch)                                 │                                                                                                     
│   3465 │   │   │   │   │   │   │   else:                                                         │                                                                                                     
│   3466 │   │   │   │   │   │   │   │   writer.write_batch(batch)                                 │                                                                                                     
│   3467 │   │   │   │   │   │   num_examples_progress_update += num_examples_in_batch             │                                                                                                     
│                                                                                                  │                                                                                                     
│ /opt/conda/envs/main/lib/python3.10/site-packages/datasets/arrow_writer.py:567 in write_table    │                                                                                                     
│                                                                                                  │                                                                                                     
│   564 │   │   │   writer_batch_size = self.writer_batch_size                                     │                                                                                                     
│   565 │   │   if self.pa_writer is None:                                                         │                                                                                                     
│   566 │   │   │   self._build_writer(inferred_schema=pa_table.schema)                            │                                                                                                     
│ ❱ 567 │   │   pa_table = pa_table.combine_chunks()                                               │                                                                                                     
│   568 │   │   pa_table = table_cast(pa_table, self._schema)                                      │                                                                                                     
│   569 │   │   if self.embed_local_files:                                                         │                                                                                                     
│   570 │   │   │   pa_table = embed_table_storage(pa_table)                                       │                                                                                                     
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯                                                                                                     
KeyboardInterrupt                                                                                   

AntreasAntoniou avatar Jun 17 '23 06:06 AntreasAntoniou

I'm on my phone so I can't help that much. What I'd advise is to save_to_disk if it's not already done and then upload the files/folder to the Hub separately. You can find what you need in the upload guide. It might not help find the exact issue for now, but at least it can unblock you.
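
A minimal sketch of that workaround, assuming the DatasetDict named dataset from the reproduction above (the local path is hypothetical; HfApi.upload_folder comes from huggingface_hub):

from huggingface_hub import HfApi

# Persist the dataset locally first, then push the resulting folder to the Hub
# with huggingface_hub directly, bypassing push_to_hub.
dataset.save_to_disk("/data/TALI-small-arrow")

api = HfApi()
api.upload_folder(
    folder_path="/data/TALI-small-arrow",
    repo_id="Antreas/TALI-small",
    repo_type="dataset",
)

Note that a folder saved with save_to_disk is in Arrow format and is meant to be reloaded with datasets.load_from_disk rather than load_dataset.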

Wauplin avatar Jun 17 '23 06:06 Wauplin

In your last stack trace it was interrupted while embedding external content, which happens in case your dataset is made of images or audio files that live on your disk. Is that the case?

lhoestq avatar Jun 17 '23 14:06 lhoestq

Yeah, the dataset has images, audio, video and text.

AntreasAntoniou avatar Jun 17 '23 15:06 AntreasAntoniou

It may be related to https://github.com/apache/arrow/issues/34455: are you using ArrayND features?

Also, what's your pyarrow version? Could you try updating to >= 12.0.1?
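
A quick sketch of both checks, assuming the DatasetDict named dataset from the reproduction above:

import datasets
import pyarrow

print(pyarrow.__version__)  # the suggestion above applies if this prints < 12.0.1

# Check whether any column is declared with an ArrayND feature type.
array_nd_types = (datasets.Array2D, datasets.Array3D, datasets.Array4D, datasets.Array5D)
for name, feature in dataset["train"].features.items():
    if isinstance(feature, array_nd_types):
        print(f"{name} uses {type(feature).__name__}")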

lhoestq avatar Jun 19 '23 11:06 lhoestq

I was using pyarrow == 12.0.0

I am not explicitly using ArrayND features, unless the hub API automatically converts my files to such.

AntreasAntoniou avatar Jun 20 '23 04:06 AntreasAntoniou

I have now updated to pyarrow == 12.0.1 and retrying

AntreasAntoniou avatar Jun 20 '23 04:06 AntreasAntoniou

You can also try reducing the max_shard_size - sometimes parquet has a hard time working with data bigger than 2GB.
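
For example, a sketch of the same call with a smaller shard size:

# Smaller shards keep each parquet file, and therefore each individual upload,
# well below the 2GB range.
dataset.push_to_hub(repo_id="Antreas/TALI-small", max_shard_size="500MB")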

lhoestq avatar Jun 20 '23 13:06 lhoestq

So, updating pyarrow seems to help. It can still throw errors here and there, but I can retry when that happens; it's better than hanging.

However, I am a bit confused about something. I have uploaded my datasets, but while earlier I could see all three sets, now I can only see one. What's going on? https://huggingface.co/datasets/Antreas/TALI-base

I have seen this happen before as well; back then I deleted and re-uploaded, but this dataset is way too large for me to do that.

AntreasAntoniou avatar Jun 21 '23 06:06 AntreasAntoniou

It's a bug on our side, I'll update the dataset viewer ;)

Thanks for reporting !

lhoestq avatar Jun 21 '23 09:06 lhoestq

Apparently this happened because of bad modifications in the README.md split metadata.

I fixed them in this PR: https://huggingface.co/datasets/Antreas/TALI-base/discussions/1

lhoestq avatar Jun 21 '23 09:06 lhoestq

@lhoestq It's a bit odd that when uploading a dataset one split at a time ("train", "val", "test"), the push_to_hub function overwrites the README and removes the differently named splits from previous commits. I.e., you push "val" and all is well; then you push "test", and the "val" entry disappears from the README, while the data itself remains intact.

AntreasAntoniou avatar Jun 22 '23 20:06 AntreasAntoniou

Also, I just found another related issue - one of the many that make things hang or fail when pushing to the Hub.

In the following code:

train_generator = lambda: data_generator("train", percentage=1.0)
val_generator = lambda: data_generator("val")
test_generator = lambda: data_generator("test")

train_data = datasets.Dataset.from_generator(
    train_generator,
    num_proc=mp.cpu_count(),
    writer_batch_size=5000,
    cache_dir=tali_dataset_dir,
)

val_data = datasets.Dataset.from_generator(
    val_generator,
    writer_batch_size=5000,
    num_proc=mp.cpu_count(),
    cache_dir=tali_dataset_dir,
)

test_data = datasets.Dataset.from_generator(
    test_generator,
    writer_batch_size=5000,
    num_proc=mp.cpu_count(),
    cache_dir=tali_dataset_dir,
)

print("Pushing TALI-large to hub")

dataset = datasets.DatasetDict(
    {"train": train_data, "val": val_data, "test": test_data}
)
succesful_competion = False

while not succesful_competion:
    try:
        dataset.push_to_hub(repo_id="Antreas/TALI-large", max_shard_size="2GB")
        succesful_competion = True
    except Exception as e:
        print(e)

Things keep failing in the push_to_hub step, at random places, with the following error:

Pushing dataset shards to the dataset hub:   7%|██████████▋                                                                                                                                            | 67/950 [42:41<9:22:37, 38.23s/it]
Error while uploading 'data/train-00067-of-00950-a4d179ed5a593486.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.81ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.20s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.48ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.30s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.39ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.52s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.47ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.39s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.26ba/s]
Upload 1 LFS files:   0%|                                                                                                                                                                                           | 0/1 [16:38<?, ?it/s]
Pushing dataset shards to the dataset hub:   7%|███████████▎                                                                                                                                           | 71/950 [44:37<9:12:28, 37.71s/it]
Error while uploading 'data/train-00071-of-00950-72bab6e5cb223aee.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.18ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.94s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.36ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.67s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.57ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.16s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.68ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:09<00:00,  9.63s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.36ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.67s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.37ba/s]
Upload 1 LFS files:   0%|                                                                                                                                                                                           | 0/1 [16:39<?, ?it/s]
Pushing dataset shards to the dataset hub:   8%|████████████                                                                                                                                           | 76/950 [46:21<8:53:08, 36.60s/it]
Error while uploading 'data/train-00076-of-00950-b90e4e3b433db179.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.21ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:25<00:00, 25.40s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.56ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.40s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.49ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:23<00:00, 23.53s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.27ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.25s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.42ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.03s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.39ba/s]
Upload 1 LFS files:   0%|                                                                                                                                                                                           | 0/1 [16:39<?, ?it/s]
Pushing dataset shards to the dataset hub:   9%|████████████▊                                                                                                                                          | 81/950 [48:30<8:40:22, 35.93s/it]
Error while uploading 'data/train-00081-of-00950-84b0450a1df093a9.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.18ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.65s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.92ba/s]
Upload 1 LFS files:   0%|                                                                                                                                                                                           | 0/1 [16:38<?, ?it/s]
Pushing dataset shards to the dataset hub:   9%|█████████████                                                                                                                                          | 82/950 [48:55<8:37:57, 35.80s/it]
Error while uploading 'data/train-00082-of-00950-0a1f52da35653e08.parquet' to the Hub.
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.31ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:26<00:00, 26.29s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.42ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.57s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.64ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.35s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.64ba/s]
Upload 1 LFS files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.74s/it]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.31ba/s]
Upload 1 LFS files:   0%|                                                                                                                                                                                           | 0/1 [16:40<?, ?it/s]
Pushing dataset shards to the dataset hub:   9%|█████████████▋                                                                                                                                         | 86/950 [50:48<8:30:25, 35.45s/it]
Error while uploading 'data/train-00086-of-00950-e1cc80dd17191b20.parquet' to the Hub.

I have a while loop that forces retries, but it seems that the progress itself is randomly getting lost as well. Any ideas on how to improve this? It has been blocking me for way too long.

Should I build the parquet files manually and then push them manually as well? If I do things manually, how can I ensure my dataset works properly with "streaming=True"?

Thank you for your help and time.

AntreasAntoniou avatar Jun 22 '23 20:06 AntreasAntoniou

@lhoestq It's a bit odd that when uploading a dataset one split at a time ("train", "val", "test"), the push_to_hub function overwrites the README and removes the differently named splits from previous commits. I.e., you push "val" and all is well; then you push "test", and the "val" entry disappears from the README, while the data itself remains intact.

Hmm, this shouldn't happen. What code did you run exactly? And which version of datasets are you using?

lhoestq avatar Jun 23 '23 08:06 lhoestq

I have a while loop that forces retries, but it seems that the progress itself is randomly getting lost as well. Any ideas on how to improve this? It has been blocking me for way too long.

Could you also print the cause of the error (e.__cause__)? Or show the full stack trace when the error happens? This would give more details about why it failed and would help investigate.

lhoestq avatar Jun 23 '23 08:06 lhoestq
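
As a minimal sketch of that suggestion, the retry loop from the script above could log the cause and the full traceback on each failure (this reuses the dataset variable from the earlier snippet; the repo id is the one from this thread):

import traceback

successful_completion = False
while not successful_completion:
    try:
        dataset.push_to_hub(repo_id="Antreas/TALI-large", max_shard_size="2GB")
        successful_completion = True
    except Exception as e:
        # Surface the underlying cause and the full stack trace, not just the message.
        print(f"Push failed: {e}")
        print(f"Cause: {e.__cause__}")
        traceback.print_exc()
        print("Retrying")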

Should I build the parquet files manually and then push them manually as well? If I do things manually, how can I ensure my dataset works properly with "streaming=True"?

Parquet is supported out of the box ^^

If you want to make sure it works as expected, you can try locally first:

from datasets import load_dataset

ds = load_dataset("path/to/local", streaming=True)

lhoestq avatar Jun 23 '23 08:06 lhoestq

@lhoestq @AntreasAntoniou I transferred this issue to the datasets repository as the questions and answers are more related to this repo. Hope it can help other users find the bug and fixes more easily (like updating tqdm and pyarrow or setting a lower max_shard_size).

~For the initial "pushing large dataset consistently hangs" issue, I still think it's best to try to save_to_disk first and then upload it manually/with a script (see upload_folder). It's not the most satisfying solution but at least it would confirm where the problem comes from.~

EDIT: removed suggestion about saving to disk first (see https://github.com/huggingface/datasets/issues/5990#issuecomment-1607186914).

Wauplin avatar Jun 26 '23 10:06 Wauplin

@lhoestq @AntreasAntoniou I transferred this issue to the datasets repository as the questions and answers are more related to this repo. Hope it can help other users find the bug and fixes more easily (like updating https://github.com/huggingface/datasets/issues/5990#issuecomment-1607120204 and https://github.com/huggingface/datasets/issues/5990#issuecomment-1607120278 or https://github.com/huggingface/datasets/issues/5990#issuecomment-1607120328).

thanks :)

For the initial "pushing large dataset consistently hangs" issue, I still think it's best to try to save_to_disk first and then upload it manually/with a script (see upload_folder). It's not the most satisfying solution but at least it would confirm where the problem comes from.

As I've already said in other discussions, I would not recommend pushing files saved with save_to_disk to the Hub; save them as parquet shards and upload those instead. The Hub does not support datasets saved with save_to_disk, which is meant for local disk use only.

lhoestq avatar Jun 26 '23 10:06 lhoestq

As I've already said in other discussions, I would not recommend pushing files saved with save_to_disk to the Hub; save them as parquet shards and upload those instead. The Hub does not support datasets saved with save_to_disk, which is meant for local disk use only.

Well noted, thanks. That part was not clear to me :)

Wauplin avatar Jun 26 '23 10:06 Wauplin
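
As a concrete, hedged sketch of the "parquet shards + manual upload" route: this assumes the DatasetDict named dataset from the reproduction script, and the shard count, output folder, and repo id below are placeholders rather than values used in this thread.

from pathlib import Path

from huggingface_hub import HfApi

out_dir = Path("parquet_shards")
out_dir.mkdir(parents=True, exist_ok=True)
num_shards = 100  # pick this so each shard stays comfortably below ~2GB

for split, ds in dataset.items():
    for index in range(num_shards):
        # Write each shard of each split to its own parquet file.
        shard = ds.shard(num_shards=num_shards, index=index)
        shard.to_parquet(str(out_dir / f"{split}-{index:05d}-of-{num_shards:05d}.parquet"))

# Upload the prepared folder to the dataset repo in one go.
HfApi().upload_folder(
    repo_id="Antreas/TALI-large",
    repo_type="dataset",
    folder_path=str(out_dir),
)

Double-checking locally with load_dataset(..., streaming=True), as suggested earlier in the thread, is still worthwhile before relying on the uploaded files.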

Sorry for not replying for a few days, I was on leave. :)

So, here is more information about the error that causes some of the delay:

Pushing Antreas/TALI-tiny to hub
Attempting to push to hub
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:24<00:00,  4.06s/ba]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:24<00:00,  4.15s/ba]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:26<00:00,  4.45s/ba]
/opt/conda/envs/main/lib/python3.10/site-packages/huggingface_hub/lfs.py:310: UserWarning: hf_transfer is enabled but does not support uploading from bytes or BinaryIO, falling back to regular upload
  warnings.warn(
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:25<00:00,  4.26s/ba]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:27<00:00,  4.58s/ba]
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:24<00:00,  4.10s/ba]
Pushing dataset shards to the dataset hub:  22%|████████████████████████▎                                                                                       | 5/23 [52:23<3:08:37, 628.74s/it]
Exception: Error while uploading 'data/train-00005-of-00023-e224d901fd65e062.parquet' to the Hub., with stacktrace: <traceback object at 0x7f745458d0c0>, and type: <class 'RuntimeError'>, and 
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: 
/lfs.huggingface.co/repos/7c/d3/7cd385d9324302dc13e3986331d72d9be6fa0174c63dcfe0e08cd474f7f1e8b7/3415166ae28c0beccbbc692f38742b8dea2c197f5c805321104e888d21d7eb90?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230627%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230627T003349Z&X-Amz-Expires=86400&X-Amz-Signature=5a12ff96f2
91f644134170992a6628e5f3c4e7b2e7fc3e940b4378fe11ae5390&X-Amz-SignedHeaders=host&partNumber=1&uploadId=JSsK8r63XSF.VlKQx3Vf8OW4DEVp5YIIY7LPnuapNIegsxs5EHgM1p4u0.Nn6_wlPlQnvxm8HKMxZhczKE9KB74t0etB
oLcxqBIvsgey3uXBTZMAEGwU6y7CDUADiEIO&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.

One issue is that the uploading does not continue from the chunk where it failed. It often continues from a much older chunk; e.g. if it failed on chunk 192/250, it will continue from, say, 53/250, and this behaviour appears almost random.

AntreasAntoniou avatar Jun 27 '23 05:06 AntreasAntoniou

Are you using a proxy of some sort?

lhoestq avatar Jun 27 '23 09:06 lhoestq

I am using a Kubernetes cluster inside a university VPN.

AntreasAntoniou avatar Jun 28 '23 01:06 AntreasAntoniou

So, other than the random connection drops here and there, any idea why the progress does not continue where it left off?

Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 10.79ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.65ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.39ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.04ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 13.52ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 12.28ba/s]
Pushing dataset shards to the dataset hub:  20%|██████████████████████                                                                                          | 75/381 [1:34:39<6:26:11, 75.72s/it]
Exception: Error while uploading 'data/train-00075-of-00381-1614bc251b778766.parquet' to the Hub., with stacktrace: <traceback object at 0x7fab6d9a4980>, and type: <class 'RuntimeError'>, and 
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: 
/lfs.huggingface.co/repos/3b/31/3b311464573d8d63b137fcd5b40af1e7a5b1306843c88e80372d0117157504e5/ed8dae933fb79ae1ef5fb1f698f5125d3e1c02977ac69438631f152bb3bfdd1e?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-
Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230629%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230629T053004Z&X-Amz-Expires=86400&X-Amz-Signature=da2b26270edfd6d0
d069c015a5a432031107a8664c3f0917717e5e40c688183c&X-Amz-SignedHeaders=host&partNumber=1&uploadId=2erWGHTh3ICqBLU_QvHfnygZ2tkMWbL0rEqpJdYohCKHUHnfwMjvoBIg0TI_KSGn4rSKxUxOyqSIzFUFSRSzixZeLeneaXJOw.Qx8
zLKSV5xV7HRQDj4RBesNve6cSoo&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 12.09ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 11.51ba/s]
Creating parquet from Arrow format: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [00:02<00:00, 10.77ba/s]
Pushing dataset shards to the dataset hub:  20%|██████████████████████▋                                                                                         | 77/381 [1:32:50<6:06:34, 72.35s/it]
Exception: Error while uploading 'data/train-00077-of-00381-368b2327a9908aab.parquet' to the Hub., with stacktrace: <traceback object at 0x7fab45b27f80>, and type: <class 'RuntimeError'>, and 
cause: HTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: 
/lfs.huggingface.co/repos/3b/31/3b311464573d8d63b137fcd5b40af1e7a5b1306843c88e80372d0117157504e5/9462ff2c5e61283b53b091984a22de2f41a2f6e37b681171e2eca4a998f979cb?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-
Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGO27GPWFUO%2F20230629%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230629T070510Z&X-Amz-Expires=86400&X-Amz-Signature=9ab8487b93d443cd
21f05476405855d46051a0771b4986bbb20f770ded21b1a4&X-Amz-SignedHeaders=host&partNumber=1&uploadId=UiHX1B.DcoAO2QmIHpWpCuNPwhXU_o1dsTkTGPqZt1P51o9k0yz.EsFD9eKpQMwgAST3jOatRG78I_JWRBeLBDYYVNp8r0TpIdeSg
eUg8uwPZOCPw9y5mWOw8MWJrnBo&x-id=UploadPart (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2426)')))
Push failed, retrying
Attempting to push to hub
Pushing split train to the Hub.
Pushing dataset shards to the dataset hub:   8%|████████▋                                                                                                         | 29/381 [27:39<5:50:03, 59.67s/it]
Map:  36%|█████████████…                               | 1000/2764 [00:35<00:34, 51.63 examples/s]
Map:  72%|███████████████████████████…                 | 2000/2764 [00:40<00:15, 49.06 examples/s]
Map:  72%|███████████████████████████…                 | 2000/2764 [00:55<00:15, 49.06 examples/s]
Map: 100%|██████████████████████████████████████████…  | 2764/2764 [00:56<00:00, 48.82 examples/s]
Pushing dataset shards to the dataset hub:   8%|████████▉…                                        | 30/381 [28:35<5:43:03, 58.64s/it]
Pushing dataset shards to the dataset hub:   8%|█████████▎…                                       | 31/381 [29:40<5:52:18, 60.40s/it]
Pushing dataset shards to the dataset hub:   8%|█████████▌…                                       | 32/381 [30:46<6:02:20, 62.29s/it]
Map:  36%|█████████████…                               | (log truncated)

This is actually the issue that wastes the most time for me, and I need it fixed. Please advise on how I can go about fixing it.

Notice how the progress goes from 77/381 back down to 30/381.

AntreasAntoniou avatar Jun 29 '23 07:06 AntreasAntoniou

If any shard is missing on the Hub, it will re-upload it. It looks like the 30th shard was missing on the Hub in your case.

It also means that the other files up to the 77th that were successfully uploaded won't be uploaded again.

cc @mariosasko who might know better

lhoestq avatar Jun 29 '23 09:06 lhoestq
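
For what it's worth, a quick way to check which shards have actually landed on the Hub between retries is to list the repo files; the repo id below is the one from the logs above, and the data/train- prefix assumes the default shard layout that push_to_hub produces (as seen in the error messages in this thread):

from huggingface_hub import HfApi

files = HfApi().list_repo_files("Antreas/TALI-large", repo_type="dataset")
train_shards = sorted(f for f in files if f.startswith("data/train-") and f.endswith(".parquet"))
print(f"{len(train_shards)} train shards currently on the Hub")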