James Lamb

1151 comments by James Lamb

@StrikerRUS I'm really really excited to say that I now have a reliable reproducible example of this problem! Or at least, I have one reproducible example that always produces "Socket...

One thing I noticed while doing this, with more logs...it seems like `LGBM_NetworkFree()` is only ever getting called on the machine that is rank 0. For example, on a successful...

For anyone subscribed to this issue, I THINK I may have found the root cause, and it might be possible to fix this without needing to mark some Dask tests...

I think so! I haven't seen this error since that PR was merged. Let's close this for now.

I'm not sure if it's the same as the original issue that started this, but it might be, and since we have this open already... in the last few days,...

just saw this again, on macOS `sdist` job 😭 Again, only ranking tests.
```text
=========================== short test summary info ============================
FAILED tests/python_package_test/test_dask.py::test_ranker[voting-rf-None-array] - lightgbm.basic.LightGBMError: Socket recv error, Connection reset by...
```

Saw this again, on the macOS `regular` job: https://github.com/microsoft/LightGBM/actions/runs/15058250502/job/42328442247?pr=6893#step:3:4546 I think there are some macOS-specific issues with distributed training.

Just saw this again on 2 macOS jobs on Azure DevOps:
```text
Exception: "LightGBMError('Socket recv error, Connection reset by peer (code: 54)')"
...
FAILED tests/python_package_test/test_dask.py::test_ranker[voting-dart-None-array]
```
https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=17802&view=logs&j=52ea7a35-6bb0-5680-4def-de74fd7388e2&t=3a5da3c1-f658-52cd-3d24-edc674c4d20e

This just happened again on a macOS Azure DevOps job. Same as all the above... Dask, ranking, voting parallel learner. https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=17828&view=logs&j=52ea7a35-6bb0-5680-4def-de74fd7388e2&t=3a5da3c1-f658-52cd-3d24-edc674c4d20e

And another, this time on GitHub Actions 😭
```text
FAILED tests/python_package_test/test_dask.py::test_ranker[data-rf-group1-array] - lightgbm.basic.LightGBMError: Socket recv error, Connection reset by peer (code: 54)
= 1 failed, 1467 passed, 15 skipped, 8...
```
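
The failure reports above all name parametrized `test_ranker` cases, so one way to check whether a single configuration dominates is to tally the pieces of each test ID. The following is only a rough sketch: the helper names (`parse_failure`, `tally`) are hypothetical, and it assumes the first two hyphen-separated fields of the ID are the tree learner and the boosting type, which is inferred from the IDs quoted in these comments rather than confirmed from the test suite.

```python
import re
from collections import Counter

# "FAILED" lines copied from the reports quoted in the comments above.
FAILED_LINES = [
    "FAILED tests/python_package_test/test_dask.py::test_ranker[voting-rf-None-array]",
    "FAILED tests/python_package_test/test_dask.py::test_ranker[voting-dart-None-array]",
    "FAILED tests/python_package_test/test_dask.py::test_ranker[data-rf-group1-array]",
]

# Matches "::test_name[param-id]" in a pytest short-summary line.
ID_PATTERN = re.compile(r"::(?P<test>\w+)\[(?P<params>[^\]]+)\]")


def parse_failure(line):
    """Return the test name and the first two ID fields, or None if no ID is found."""
    match = ID_PATTERN.search(line)
    if match is None:
        return None
    # Assumption: fields are hyphen-separated and none contains a hyphen itself.
    fields = match.group("params").split("-")
    if len(fields) < 2:
        return None
    return {
        "test": match.group("test"),
        "tree_learner": fields[0],   # assumed position
        "boosting_type": fields[1],  # assumed position
    }


def tally(lines, key):
    """Count how often each value of `key` appears across the parsed lines."""
    parsed = (parse_failure(line) for line in lines)
    return Counter(item[key] for item in parsed if item is not None)


if __name__ == "__main__":
    for key in ("test", "tree_learner", "boosting_type"):
        print(key, dict(tally(FAILED_LINES, key)))
```

Splitting on `-` is deliberately naive; a parameter value containing a hyphen would need a stricter parser.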