
High Initial RAM Usage Leads to Crashes

Open Sypherd opened this issue 2 years ago • 2 comments

I've been downloading select URLs from LAION-400M, LAION-5B, and SBU, and have noticed a significant spike in RAM usage on startup that crashes instances with <=32GB RAM, such as AWS's c6i.4xlarge. Once img2dataset is running, however, RAM usage stays very low. I'd love it if we could somehow mitigate that initial spike so that lower-RAM instances can be used throughout. Here's a screenshot from wandb.ai showing the initial spike on a 64GB instance: [image]

Sypherd avatar Aug 09 '23 14:08 Sypherd

Here's another sample from a crashed c6i.4xlarge instance, where we can see available process memory approach 0 before the crash: [image] https://user-images.githubusercontent.com/50557586/259449237-e6467005-19a0-4748-a96d-2b00bac37eef.png Maybe the cause of the crashes is something else, but I have not yet been able to run img2dataset on a c6i.4xlarge instance.

Sypherd avatar Aug 09 '23 14:08 Sypherd

Interesting. I think that's due to how the parquet file is processed (in the reader file). That's probably easy enough to fix.


rom1504 avatar Aug 09 '23 18:08 rom1504