Ross Wightman
Update: it looks like it may be related to copying state between processes in the default dataloader setup, where `persistent_workers=False` and new workers are created each epoch. If I set `persistent_workers=True`...
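For context, a minimal sketch of the loader setup I mean (shard pattern and pipeline are placeholders, not the actual training config):

```python
import webdataset as wds
from torch.utils.data import DataLoader

# Placeholder shard pattern, not the real dataset
urls = "train-{000000..000999}.tar"

dataset = wds.WebDataset(urls).shuffle(1000)

loader = DataLoader(
    dataset,
    batch_size=None,   # samples are yielded as-is, batching handled elsewhere
    num_workers=4,
    # Keep workers alive across epochs instead of re-forking them,
    # so per-worker pipeline state isn't re-copied every epoch.
    persistent_workers=True,
)
```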
@tmbdev ugly as it is, it might be worth continuing to support use of `WDS_EPOCH` in the run() of detshuffle; it would work in all cases... it could be used...
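Roughly what I have in mind (a simplified sketch, not the actual detshuffle implementation; the buffered shuffle is collapsed to a full-list shuffle for brevity):

```python
import os
import random

class detshuffle:
    """Simplified sketch of a deterministic shuffle stage."""

    def __init__(self, seed=0, epoch=-1):
        self.seed = seed
        self.epoch = epoch

    def run(self, src):
        # Fall back to the WDS_EPOCH env var when it's set; env vars are
        # inherited by freshly forked workers, so this still works when
        # persistent_workers=False and worker state is reset each epoch.
        env_epoch = os.environ.get("WDS_EPOCH")
        if env_epoch is not None:
            epoch = int(env_epoch)
        else:
            self.epoch += 1
            epoch = self.epoch
        rng = random.Random(self.seed + epoch)
        buf = list(src)
        rng.shuffle(buf)
        yield from buf
```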
I'm also having issues reading datasets from gs buckets with wds via gopen/pipe (currently on main branch 0.2.3). I'm training on TPU VM instances, 8 train processes, 4 workers per...
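The shards are read through the `pipe:` scheme, roughly like this (bucket and path are placeholders):

```python
import webdataset as wds

# gopen runs the shell command after "pipe:" and streams its stdout,
# so each shard read is a `gsutil cat` subprocess.
urls = "pipe:gsutil cat gs://my-bucket/shards/train-{000000..000999}.tar"
dataset = wds.WebDataset(urls)
```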
@rom1504 that's a good work-around for now, thx. I've enabled warn_and_continue so I can track the occurrences and will see how that goes. If it's happening too frequently I'll likely...
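In case it's useful, this is roughly how I've wired in the handler (simplified from the actual pipeline; the shard path is a placeholder):

```python
import webdataset as wds

urls = "pipe:gsutil cat gs://my-bucket/shards/train-{000000..000999}.tar"

# warn_and_continue logs the exception and skips the failing shard/sample
# instead of raising, so a transient transport error doesn't kill the epoch.
dataset = (
    wds.WebDataset(urls, handler=wds.warn_and_continue)
    .decode("pil", handler=wds.warn_and_continue)
)
```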
@rom1504 I've set that up and had it running; it keeps chugging along now, but it's concerning how many failures I see and the variation ... failing at pretty...
@rom1504 yes, I believe TFDS is using curl (c-lib) directly in C++ code. Some transport errors are handled as retries (not sure if all retries are logged or just some)....
@rom1504 I'm going to try the streaming cp; retries are enabled by default (6 with exponential backoff), but I don't think that applies to cat...
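The streaming cp form just swaps the command inside the pipe URL (paths are placeholders):

```python
# cat: plain streaming read of the object
urls_cat = "pipe:gsutil cat gs://my-bucket/shards/train-{000000..000999}.tar"

# cp to stdout ("-"): same stream, but via gsutil's copy path, where
# retries (default 6, with exponential backoff) are documented to apply
urls_cp = "pipe:gsutil cp gs://my-bucket/shards/train-{000000..000999}.tar -"
```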
@rom1504 FYI, gsutil cp doesn't behave any differently. Using curl CLI doesn't look particularly fun. There is a streaming blob open in the Python google-storage API now, so I might...
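If I go that route, a sketch of what I mean by the streaming blob open (using the google-cloud-storage client; names/paths are placeholders, and wiring it into gopen is left out):

```python
from google.cloud import storage

def gs_stream(url):
    """Open a gs:// object as a streaming, file-like reader.

    Minimal sketch; retry/timeout tuning and integration with
    webdataset's gopen handlers are omitted.
    """
    assert url.startswith("gs://")
    bucket_name, _, blob_name = url[len("gs://"):].partition("/")
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    # blob.open("rb") returns a file-like object that reads in chunks
    # rather than downloading the whole object up front.
    return blob.open("rb")

# Example (placeholder path):
# stream = gs_stream("gs://my-bucket/shards/train-000000.tar")
```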
Also looking at this more, not sure if this is a main branch issue, but I have a rather insane number of gsutil processes lying around (don't seem to be...
Update:
* There definitely seems to be a process leak for the gsutil launches; overnight I accumulated 2000+ gsutil-related processes on one cloud machine
* I ran a quick...