Socket Connect Error:Connection refused

Open kitty-eu-org opened this issue 4 years ago • 6 comments

error

When I was using spark xgboost, an error occurred:

retry connect to ip(retry time 1): [ip]
retry connect to ip(retry time 1): [ip]
retry connect to ip(retry time 1): [ip]
retry connect to ip(retry time 1): [ip]
retry connect to ip(retry time 1): [ip]
retry connect to ip(retry time 1): [ip]
retry connect to ip(retry time 1): [ip]
retry connect to ip(retry time 1): [ip]
retry connect to ip(retry time 1): [ip]
retry connect to ip(retry time 1): [ip]
retry connect to ip(retry time 2): [ip]
retry connect to ip(retry time 2): [ip]
retry connect to ip(retry time 2): [ip]
retry connect to ip(retry time 2): [ip]
retry connect to ip(retry time 2): [ip]
retry connect to ip(retry time 2): [ip]
retry connect to ip(retry time 2): [ip]
retry connect to ip(retry time 2): [ip]
retry connect to ip(retry time 2): [ip]
retry connect to ip(retry time 2): [ip]
retry connect to ip(retry time 3): [ip]
retry connect to ip(retry time 3): [ip]
retry connect to ip(retry time 3): [ip]
retry connect to ip(retry time 3): [ip]
retry connect to ip(retry time 3): [ip]
retry connect to ip(retry time 3): [ip]
retry connect to ip(retry time 3): [ip]
retry connect to ip(retry time 3): [ip]
retry connect to ip(retry time 3): [ip]
retry connect to ip(retry time 3): [ip]
retry connect to ip(retry time 4): [ip]
retry connect to ip(retry time 4): [ip]
retry connect to ip(retry time 4): [ip]
retry connect to ip(retry time 4): [ip]
retry connect to ip(retry time 4): [ip]
retry connect to ip(retry time 4): [ip]
retry connect to ip(retry time 4): [ip]
retry connect to ip(retry time 4): [ip]
retry connect to ip(retry time 4): [ip]
retry connect to ip(retry time 4): [ip]
connect to (failed): [ip]
connect to (failed): [ip]
Socket Connect Error:Connection refused
Socket Connect Error:Connection refused

env

  • ubuntu
  • spark2.3.1
  • scala2.11
  • xgboost release 0.9.0
  • Firewall off
  • port 9091 is not occupied

I found a solution to this. Can I open a PR? I think this is a bug, because even tasks that are already training are affected by this error.

kitty-eu-org avatar Nov 25 '21 05:11 kitty-eu-org

The tracker was down for some reason. Please upgrade to the latest xgboost and open a new issue if the problem persists, preferably with a reproducible example.

trivialfis avatar Nov 27 '21 11:11 trivialfis

@trivialfis Hello, first, thank you for your reply. The latest version of xgboost still has the same problem. Because of tracker.py, it is very simple to reproduce: you only need to start the tracker service, connect to it with telnet ip port, and enter any request. This is not essentially a bug, but the error message is really not friendly. In addition, a task that is being trained relies on communication with the tracker. If someone sends an arbitrary TCP request to the tracker port of the cluster machine, the ongoing training task is terminated. I think this is undesirable. I encountered this problem when using xgboost-spark, and it was really distressing. In the end, I had to read most of the source code to solve it. I have seen similar questions on GitHub and Stack Overflow, but no one has given a solution.
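
The telnet reproduction described above can probably be scripted. The following is only a minimal sketch: `poke_port` is a hypothetical helper name, the payload is arbitrary, and the host and port in the usage comment are placeholders (9091 is the port mentioned in the env list). It does nothing xgboost-specific; it just opens a TCP connection and sends bytes that are not a worker handshake.

```python
import socket


def poke_port(host: str, port: int, payload: bytes = b"hello\r\n",
              timeout: float = 3.0) -> bool:
    """Connect to host:port and send bytes that are not a worker handshake.

    Returns True if the connection was accepted and the payload was sent,
    False if the connection could not be made (for example, refused).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(payload)
        return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Hypothetical usage against a running tracker (placeholder host name):
# poke_port("tracker-host", 9091)
```

If the behaviour is as described in this thread, a `True` result here would be followed by the tracker exiting and the workers printing the retry messages shown in the original report.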

kitty-eu-org avatar Nov 27 '21 11:11 kitty-eu-org

@trivialfis I made some changes in my forked code and added a friendlier message reporting that the tracker process has died, along with hints about possible causes. I hope you can consider my ideas; I would be very grateful:

  1. In the dmlc-core project, I modified tracker.py to fix the problem where the tracker process exits when another program connects to the port the tracker is listening on, and changed the message to a friendlier one so that users can troubleshoot more easily.
  2. In xgboost, I changed RabitTracker.java. I think that once the tracker server process dies, the entire Java program should also die, because all subsequent computation depends on the tracker process.
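
To illustrate the idea behind item 1, here is a minimal sketch of an accept loop that drops unexpected connections instead of exiting. It is not the actual change to tracker.py: the handshake format, the `MAGIC` value, and the names `recv_exact`, `is_valid_worker`, and `accept_workers` are all assumptions made for illustration.

```python
import logging
import socket
import struct

MAGIC = 0xFF99  # assumed handshake value, for illustration only


def recv_exact(conn: socket.socket, nbytes: int) -> bytes:
    """Read exactly nbytes from conn, or raise if the peer closes early."""
    buf = b""
    while len(buf) < nbytes:
        chunk = conn.recv(nbytes - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before sending %d bytes" % nbytes)
        buf += chunk
    return buf


def is_valid_worker(conn: socket.socket, timeout: float = 5.0) -> bool:
    """Return True only if the peer starts with the expected magic integer."""
    conn.settimeout(timeout)
    try:
        (magic,) = struct.unpack("@i", recv_exact(conn, 4))
    except OSError as exc:  # includes timeouts and early disconnects
        logging.warning("dropping connection: %s", exc)
        return False
    if magic != MAGIC:
        logging.warning("dropping connection: unexpected magic number %d", magic)
        return False
    return True


def accept_workers(server: socket.socket, n_workers: int) -> list:
    """Accept connections until n_workers valid workers have connected."""
    workers = []
    while len(workers) < n_workers:
        conn, _addr = server.accept()
        if is_valid_worker(conn):
            workers.append(conn)
        else:
            conn.close()  # ignore the stray client and keep serving
    return workers
```

The design choice in this sketch is that a stray client only costs a log line and a closed socket, so a training job that is already running is not affected by it.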

kitty-eu-org avatar Dec 02 '21 10:12 kitty-eu-org

Thanks for the detailed suggestions.

@wbo4958 Could you please help take a look into this?

trivialfis avatar Dec 02 '21 14:12 trivialfis

@hezhaozhao-git Could you put up a PR for your fix?

wbo4958 avatar Dec 03 '21 03:12 wbo4958

@hezhaozhao-git Could you put up a PR for your fix?

@wbo4958 Thank you for your reply. I have created a PR and look forward to it being merged.

kitty-eu-org avatar Dec 03 '21 04:12 kitty-eu-org