Moritz Gunz

Results: 133 comments by Moritz Gunz

> I don’t think it is related to network issues to be honest, this machine is on our enterprise network.

Could it be something gets firewalled off inside your enterprise...

I'm thinking about the implications of this right now. Do we deal with ONNX export of RF models at all? How would using `F.scaled_dot_product_attention` affect this? Does it need an...

I switched back to torch 2.5 from 2.6 and it did not happen again so far. I will be looking out for this.

I think torch 2.6 somehow causes this. I can make this issue go away if I switch back to torch 2.5 on the same GPU node.

@Stefanwuu Does the error go away for you if you set this parameter via your RETURNN config? If so, please file another PR that sets `backend = "cpu:gloo,cuda:nccl"` as default...
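As a rough illustration of what the `"cpu:gloo,cuda:nccl"` string expresses: it maps each device type to a communication backend, and it is the kind of value accepted by the `backend` argument of `torch.distributed.init_process_group`. The helper below, `parse_backend_spec`, is hypothetical and is not part of PyTorch or RETURNN; it only makes the device-to-backend mapping explicit.

```python
def parse_backend_spec(spec):
    """Split a 'device:backend,device:backend' string into a dict.

    Hypothetical helper for illustration only; PyTorch does its own parsing.
    """
    mapping = {}
    for entry in spec.split(","):
        device, sep, backend = entry.strip().partition(":")
        if not sep or not device or not backend:
            raise ValueError(f"expected 'device:backend', got {entry!r}")
        mapping[device] = backend
    return mapping


if __name__ == "__main__":
    # CPU tensors would go through gloo, CUDA tensors through nccl.
    print(parse_backend_spec("cpu:gloo,cuda:nccl"))
```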

This is the relevant code from the stack trace where the read fails:
https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/managers.py#L554-L570

The read goes over a `connection.Pipe()`, which communicates over unix sockets:
https://github.com/python/cpython/blob/26d24eeb90d781e381b97d64b4dcb1ee4dd891fe/Lib/multiprocessing/connection.py#L552-L556

and the address comes...

I just got bitten by the same error in a training not using the new dataset or caching mechanism.

Got the address. It seems reproducible for me on g-16. I don't yet know the root cause of the strange behavior. `tempfile._get_default_tempdir` returns the project folder, i.e. `/home/mgunz/setups/2024-06-24--[redacted]`, leading to...
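For anyone wanting to check this on their own node: per the Python documentation, `tempfile` tries the `TMPDIR`, `TEMP` and `TMP` environment variables, then platform-specific directories, and only as a last resort the current working directory. The sketch below uses only the public `tempfile.gettempdir()` and the documented `tempfile.tempdir` cache variable; `resolve_tempdir` is a hypothetical helper name, and it says nothing about why the fallback is reached on g-16.

```python
import os
import tempfile


def resolve_tempdir(env_tmpdir=None):
    """Return the directory tempfile would choose, optionally with TMPDIR overridden."""
    old_env = os.environ.get("TMPDIR")
    old_cached = tempfile.tempdir
    try:
        if env_tmpdir is not None:
            os.environ["TMPDIR"] = env_tmpdir
        # Clear the cached choice so the candidate search runs again.
        tempfile.tempdir = None
        return tempfile.gettempdir()
    finally:
        # Restore the environment and the cached value.
        if old_env is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = old_env
        tempfile.tempdir = old_cached


if __name__ == "__main__":
    print("cwd:             ", os.getcwd())
    print("default temp dir:", resolve_tempdir())
```

If the two printed paths are identical, none of the earlier candidates was usable and the working directory was picked as the fallback.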

> If that is the case

Yes, this is the case I'm thinking of. I think for simplicity I'll use a single, global lock first (otherwise we have a list...
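A minimal sketch of the single-global-lock idea, under assumptions: the names `_GLOBAL_LOCK`, `_cache` and `get_or_create` are invented for illustration, and the actual resource being protected in the discussion is not shown here. One lock serializes all access, which is simple to reason about at the cost of unrelated keys waiting on each other.

```python
import threading

# One lock for everything: simple, but unrelated keys block each other.
_GLOBAL_LOCK = threading.Lock()
_cache = {}


def get_or_create(key, factory):
    """Return the cached value for key, creating it at most once."""
    with _GLOBAL_LOCK:
        if key not in _cache:
            _cache[key] = factory(key)
        return _cache[key]


if __name__ == "__main__":
    created = []

    def factory(key):
        created.append(key)
        return object()

    threads = [
        threading.Thread(target=get_or_create, args=("a", factory))
        for _ in range(8)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("factory calls:", len(created))
```

Finer-grained locking can be introduced later if contention on the single lock turns out to matter.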

> Btw, how often does this problem occur? How many people are affected by this? I thought many people were already using DistributeFilesDataset/FileCache since a while, and since the last...