memery
"DataLoader worker exited unexpectedly"
Related to #13, in the sense that this issue is made worse by the indexing process not being resumable.
When indexing a large directory containing various types of files (including 69,834 images), I get this error:
Traceback (most recent call last):
File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.9/multiprocessing/queues.py", line 113, in get
if not self._poll(timeout):
File "/usr/lib/python3.9/multiprocessing/connection.py", line 262, in poll
return self._poll(timeout)
File "/usr/lib/python3.9/multiprocessing/connection.py", line 429, in _poll
r = wait([self], timeout)
File "/usr/lib/python3.9/multiprocessing/connection.py", line 936, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.9/selectors.py", line 416, in select
fd_event_list = self._selector.poll(timeout)
File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 541564) is killed by signal: Killed.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/rob/.local/bin/memery", line 8, in <module>
sys.exit(__main__())
File "/home/rob/.local/lib/python3.9/site-packages/memery/cli.py", line 30, in __main__
app()
File "/usr/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/home/rob/.local/lib/python3.9/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/rob/.local/lib/python3.9/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/rob/.local/lib/python3.9/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/rob/.local/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/rob/.local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/lib/python3.9/site-packages/typer/main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "/home/rob/.local/lib/python3.9/site-packages/memery/cli.py", line 17, in recall
ranked = memery.core.queryFlow(path, query=query)
File "/home/rob/.local/lib/python3.9/site-packages/memery/core.py", line 59, in queryFlow
dbpath, treepath = indexFlow(root)
File "/home/rob/.local/lib/python3.9/site-packages/memery/core.py", line 31, in indexFlow
new_embeddings = image_encoder(crafted_files, device)
File "/home/rob/.local/lib/python3.9/site-packages/memery/encoder.py", line 18, in image_encoder
for images, labels in tqdm(img_loader):
File "/home/rob/.local/lib/python3.9/site-packages/tqdm/std.py", line 1133, in __iter__
for obj in iterable:
File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
idx, data = self._get_data()
File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1034, in _get_data
success, data = self._try_get_data()
File "/home/rob/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 885, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 541564) exited unexpectedly
I'm guessing that the DataLoader process is being killed by the Linux OOM killer? I have no idea what I can do about that, though.
Let me know if there's any other information that would help.
Adding img_loader.num_workers = 0 to the image_encoder function doesn't really help; it gets to the same point and just says 'Killed':
48%|███████████████████████████▌ | 264/546 [17:10<05:46, 1.23s/it]
Killed
This happens at image 264, just as it did when num_workers was the default value. It would be good if I could find out which image number 264 is, but I don't see an easy way to do that.
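To be concrete, here is the kind of logging I mean, as a rough sketch only: I haven't dug into memery's encoder.py, so the dataset and loader construction below (a torchvision ImageFolder with a fixed-size resize) are assumptions rather than memery's actual code. The idea is to run the loader single-process and print which files each batch covers, so the one that blows up can be identified:

# Sketch only: assumes an ImageFolder-style dataset where dataset.samples is an
# ordered list of (path, class) pairs; memery's real dataset may differ.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
dataset = datasets.ImageFolder('/path/to/images', transform=preprocess)
loader = DataLoader(dataset, batch_size=128, num_workers=0)  # single-process for easier debugging

for batch_idx, (images, labels) in enumerate(loader):
    # With shuffle=False (the default), batch N covers samples N*batch_size onwards,
    # so the file paths for the failing batch can be recovered from the dataset.
    start = batch_idx * loader.batch_size
    paths = [path for path, _ in dataset.samples[start:start + len(images)]]
    print(f'batch {batch_idx}: {paths[0]} .. {paths[-1]}')
    # ... encode the batch here ...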
Wow, I've never seen this before. Have you watched your RAM increase during this process? Does it top out your memory?
Definitely need better support for resuming the index. I'm planning to convert the database dictionary to an SQLite file; maybe that will help.
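Roughly the shape I have in mind, as a sketch only (the table and function names here are made up, not memery's actual schema): key each embedding by file path, so an interrupted index run can skip anything already stored.

# Hypothetical resumable index store; schema and names are illustrative only.
import sqlite3
import numpy as np

con = sqlite3.connect('memery_index.db')
con.execute('CREATE TABLE IF NOT EXISTS embeddings (path TEXT PRIMARY KEY, vector BLOB)')

def remaining(paths):
    """Return only the paths that do not have a stored embedding yet."""
    done = {row[0] for row in con.execute('SELECT path FROM embeddings')}
    return [p for p in paths if p not in done]

def save(path, vector):
    """Store one embedding, overwriting any previous row for the same path."""
    blob = np.asarray(vector, dtype=np.float32).tobytes()
    con.execute('INSERT OR REPLACE INTO embeddings VALUES (?, ?)', (path, blob))
    con.commit()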
Oh interesting question! I logged the memory usage over time and got this:
https://i.imgur.com/ldJIlUb.png
The X axis is seconds. It looks like each image uses a different amount of memory, and some take a lot. The spike at around 1400 seconds is big but the system can handle it, and then the one at the end tries to use ~38 GB of memory(?) and the OOM killer takes it out. That's image 264. So either that image is very large, or it's broken in a way that makes memery leak memory like crazy. Either way, I'd like memery to tell me the path of the file it's working on, so I can figure this image out.
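(For anyone who wants to capture a similar trace: a minimal sampler like the one below, run in a second terminal, is enough to produce that kind of plot. psutil is not a memery dependency, just a convenient way to read system memory, so this is only one way to do it.)

# Minimal memory sampler: writes total used RAM to a CSV once per second
# until interrupted. Requires psutil (pip install psutil); not part of memery.
import csv
import time
import psutil

with open('memory_log.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['seconds', 'used_bytes'])
    start = time.time()
    while True:
        writer.writerow([round(time.time() - start), psutil.virtual_memory().used])
        f.flush()
        time.sleep(1)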
I'm afraid that it's probably batch 264 rather than a specific image. The data loader is a PyTorch feature I'm using specifically for batch processing, so it's working through your ~70,000 images in only 546 batches.
It absolutely should log the current working file if it's going to crash, though; you're right.
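The progress bar numbers line up with that, for what it's worth. Assuming a batch size of 128 (a guess on my part, but it's the one that matches the 546-batch total), batch 264 would cover roughly images 33,792 through 33,919:

# Back-of-envelope check; batch_size=128 is an assumption that happens to
# match the tqdm total of 546 batches for 69,834 images.
import math

n_images, batch_size = 69834, 128
print(math.ceil(n_images / batch_size))    # 546 batches
start = 264 * batch_size
print(start, start + batch_size - 1)       # 33792 33919: the images in batch 264 (0-indexed)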
Ah, that makes sense. Does that mean I could maybe fix it in this instance by reducing the batch size?
That could work!
It still doesn't explain why that one batch uses so much memory, but it could get you through the bottleneck for now.
I think I hard-coded some batch-size and number-of-workers variables that should actually be flexible based on hardware 😬
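Something like this is what I mean, as a sketch only; the real image_encoder in encoder.py doesn't look exactly like this, and the CLIP-style encode_image call is an assumption about the model interface:

# Hypothetical configurable version of the encoder loop; names approximate
# memery's encoder.py but are not the actual code.
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

def image_encoder(model, dataset, device, batch_size=128, num_workers=4):
    # Expose batch_size and num_workers so they can be turned down on low-RAM machines.
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    embeddings = []
    with torch.no_grad():
        for images, labels in tqdm(loader):
            embeddings.append(model.encode_image(images.to(device)).cpu())
    return torch.cat(embeddings)

# e.g. image_encoder(model, dataset, device, batch_size=32, num_workers=2)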
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/deepfates/memery/issues/22#issuecomment-911445712, or unsubscribe https://github.com/notifications/unsubscribe-auth/AN7DJVA25GKCE2UPQAWG6TDT747IFANCNFSM5C3L33IQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.