
Process hanging forever before the end

HugoLaurencon opened this issue 1 year ago • 13 comments

Hi. When downloading many images, I often notice that the job stops making progress near the end. With less than 1% of the images left to download, nothing gets written to the logs for hours and the job never finishes, so I have to kill it manually. Is there an option to finish the job automatically if I don't mind skipping the last few images that cause the process to hang? Thanks

HugoLaurencon avatar Apr 20 '23 13:04 HugoLaurencon
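There is no built-in option for this that I know of; as a workaround, the download can be run in a child process that gets killed after a wall-clock deadline. A minimal sketch, where the input file, output folder, and deadline are all placeholders:

import multiprocessing

import img2dataset


def run_download():
    img2dataset.download(
        url_list="my_urls.parquet",  # placeholder input list
        input_format="parquet",
        output_folder="out",
        output_format="webdataset",
    )


if __name__ == "__main__":
    p = multiprocessing.Process(target=run_download)
    p.start()
    p.join(timeout=6 * 3600)  # give the job at most 6 hours
    if p.is_alive():
        # Accept losing the last stragglers; the final shard may be partial.
        p.terminate()
        p.join()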

Same here. I ran the img2dataset command line, and near the end (monitoring the network showed almost nothing being received) the process just would not exit:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 27344 23008 ? Ss Aug02 0:02 python3 /home/admin/img2dataset_wordir/oss_uploader.py
root 6640 0.0 0.0 0 0 ? Z 10:37 0:00 [python3]
root 7519 1.3 0.2 5987436 381776 ? Sl 10:38 0:21 /usr/bin/python3 /usr/local/bin/img2dataset /tmp/0174f175a8d04872
root 7586 0.0 0.0 14876 11300 ? S 10:38 0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import
root 7587 0.6 0.1 2859996 200612 ? Sl 10:38 0:10 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7588 19.3 0.8 19582796 1162308 ? Sl 10:38 5:01 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7589 19.8 0.8 19576436 1117928 ? Sl 10:38 5:09 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7590 10.4 0.8 19588684 1147524 ? Sl 10:38 2:43 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7591 20.5 0.9 19582968 1312364 ? Sl 10:38 5:20 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7592 17.2 0.9 19582052 1287668 ? Sl 10:38 4:29 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7593 13.9 0.8 19582236 1163680 ? Sl 10:38 3:37 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7594 20.1 0.8 19583208 1123748 ? Sl 10:38 5:14 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 7595 15.4 0.9 19577776 1242608 ? Sl 10:38 4:01 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 21257 0.0 0.0 0 0 ? Z 10:42 0:00 [python3]
root 25042 6.0 0.5 19579944 678640 ? Sl 10:42 1:17 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 28048 0.4 0.1 2858944 200792 ? Sl 10:43 0:05 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 35885 0.3 0.1 2858968 200660 ? Sl 10:44 0:04 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;
root 36838 0.0 0.0 0 0 ? Z 10:45 0:00 [python3]
root 36908 0.0 0.0 0 0 ? Z 10:45 0:00 [python3]
root 36944 0.0 0.0 0 0 ? Z 10:45 0:00 [python3]
root 37049 0.0 0.0 0 0 ? Z 10:45 0:00 [python3]
root 91781 580 0.0 2602164 101144 ? Rl 11:04 0:05 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main;

zwsjink avatar Aug 03 '23 03:08 zwsjink
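A way to see where a wedged run like this is blocked: arm faulthandler before starting the download, then send the process a signal to dump every thread's stack. A sketch under the assumption of a Unix host; the input path is a placeholder, and only the main process is covered (the workers img2dataset spawns would each need the same treatment):

import faulthandler
import signal

import img2dataset

# After this, `kill -USR1 <pid>` prints every thread's traceback to stderr
# without terminating the run (Unix only, main process only).
faulthandler.register(signal.SIGUSR1, all_threads=True)

if __name__ == "__main__":
    img2dataset.download(
        url_list="my_urls.parquet",  # placeholder
        input_format="parquet",
        output_folder="out",
    )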

Same issue here

joon612 avatar Sep 08 '23 06:09 joon612

Any information on what is different about your environment and causing this?

rom1504 avatar Sep 08 '23 06:09 rom1504

I am downloading datacomp_1b on Azure Batch nodes, Ubuntu 20.04 LTS, size standard_d4a_v4.

import img2dataset

img2dataset.download(
    url_list=str(metadata_dir),
    image_size=args.image_size,  # 512
    output_folder=str(shard_dir),
    processes_count=args.processes_count,  # 4
    thread_count=args.thread_count,  # 64
    resize_mode=args.resize_mode,  # keep_ratio_largest
    resize_only_if_bigger=not args.no_resize_only_if_bigger,  # not False
    encode_format=args.encode_format,  # jpg
    output_format=args.output_format,  # webdataset
    input_format="parquet",
    url_col="url",
    caption_col="text",
    bbox_col=bbox_col,  # face_bboxes
    save_additional_columns=["uid"],
    number_sample_per_shard=10000,
    oom_shard_count=8,
    retries=args.retries,  # 2
    enable_wandb=args.enable_wandb,  # False
    wandb_project=args.wandb_project,  # datacomp
)

joon612 avatar Sep 08 '23 06:09 joon612

I can reproduce it on roughly 15% of my download runs.

joon612 avatar Sep 14 '23 06:09 joon612

Found a possible cause: a parquet file in the shards was broken (it could not be read).

joon612 avatar Sep 15 '23 06:09 joon612
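A quick way to test for that cause is to try reading every input parquet file and report the ones that fail; a minimal sketch assuming pyarrow is installed (the glob pattern is a placeholder):

import glob

import pyarrow.parquet as pq

for path in sorted(glob.glob("metadata/*.parquet")):
    try:
        # Full read; pq.ParquetFile(path) is a cheaper metadata-only check.
        pq.read_table(path)
    except Exception as e:
        print(f"broken parquet file: {path}: {e}")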

Same issue here, and if I interrupt the hanging process the resulting data is unusable; feeding it to a torch DataLoader raises the following error:

  File "/usr/lib/python3.8/tarfile.py", line 686, in read
    raise ReadError("unexpected end of data")
tarfile.ReadError: unexpected end of data

Does anyone know of any combination of parameters to prevent this hanging?

https://github.com/rom1504/img2dataset/issues/402

coleridge72 avatar Mar 16 '24 14:03 coleridge72
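Independently of preventing the hang, the truncated shard can be found directly by scanning every tar with the standard tarfile module; a minimal sketch (the output directory is a placeholder):

import glob
import tarfile

for path in sorted(glob.glob("out/*.tar")):
    try:
        with tarfile.open(path) as tf:
            for _ in tf:  # iterating forces every member header to be read
                pass
    except tarfile.ReadError as e:
        print(f"truncated shard: {path}: {e}")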

You can delete any partial tar by checking whether it has a .json file next to it.

rom1504 avatar Mar 16 '24 15:03 rom1504
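A sketch of that cleanup, assuming completed shards get a companion <shard>_stats.json file next to the tar, as recent img2dataset versions write (the output directory is a placeholder):

import glob
import os

for tar_path in glob.glob("out/*.tar"):
    stats_path = tar_path[: -len(".tar")] + "_stats.json"
    if not os.path.exists(stats_path):
        # No stats file means the shard never finished writing.
        print(f"removing partial shard {tar_path}")
        os.remove(tar_path)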

hmm I can't see any json files, only:

> du -sh small_10k/*
2.9M    small_10k/00000.parquet
427M    small_10k/00000.tar

fwiw: this is the error I get on killing the script (I tried running with processes_count=1 but still no luck):

UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown

  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.11/multiprocessing/pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "/usr/lib/python3.11/multiprocessing/queues.py", line 364, in get
    with self._rlock:
  File "/usr/lib/python3.11/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()

coleridge72 avatar Mar 16 '24 16:03 coleridge72

Sounds like your issue is completely different from the current one, which is about successful runs that get stuck at the end.

I advise you to open a new issue with more information about your environment and the command you are running.

rom1504 avatar Mar 16 '24 17:03 rom1504

Same here. Observed hanging at the end while trying to download laion400m.

krishnansr avatar Apr 02 '24 00:04 krishnansr

Please provide any information you have to help figure that out. One more person reporting the same thing does not help much

rom1504 avatar Apr 02 '24 07:04 rom1504

Of course. I have a clone with custom changes that runs distributed downloads across a cluster of nodes. The output_format was 'files' for my use case, so I didn't really have any tars.

Interestingly, another job that downloaded coyo-700m completed successfully without hanging.

I'm planning to get to the bottom of it with a small parquet file later this week.

krishnansr avatar Apr 02 '24 19:04 krishnansr