Dragonfly
Dragonfly copied to clipboard
偶现dfget下载失败
Question
现象:多个docker pull请求下载时,偶尔会出现某个节点下载失败。 场景:本轮请求13台设备在18:00发起docker pull请求,12台设备59-60s后下载成功,但是一台失败并且无限轮询持续请求,持续失败。 supernode log: 2020-01-02 18:02:31.521 INFO sign:2801 : gc peer: start to deal with peer: tst4-xxx.com-xxx.xx.73.65-1577958913537379557
2020-01-02 18:08:32.552 WARN sign:2801 : failed to get dfget task by dstCID(xxx.xx.73.65-10966-1577959254.445) and taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f), and the srcCID is xxx.xx.73.2-12305-1577959255.676, err: failed to get key xxx.xx.73.65-10966-1577959254.445@d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f from map: {"Code":0,"Msg":"data not found"} dfclient log: 2020-01-02 19:18:03.494 INFO sign:12305-1577959255.676 : pull piece task({"taskID":"d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","superNode":"xxx.xx.72.238:8002","dstCid":"xxx.xx.73.65-10966-1577959254.445","range":"167772160-171966463","result":503,"status":701,"pieceSize":4194304,"pieceNum":40}) result:{"code":602,"msg":"taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f) clientID(172.22.73.2-12305-1577959255.676): {"Code":9,"Msg":"peer should wait"}"} and sleep 1.327s
这里xxx.xx.73.65同样也是一台发起请求的节点,看到这里我的认知是supernode让xxx.xx.73.2从xxx.xx.73.65下载相应的piece,但是supernode在18:08:31时gc了节点的信息导致xxx.xx.73.2再来拉取信息时失败。
我想咨询下为什么会出现这种情况,这应该不算合理的情况吧,请问还需要提供什么信息吗?
我自己这边也结合源码定位
In fact, I think it's not expected. I will try to troubleshoot this with you. 😄
已经将supernode DEBUG开启了,在不断重试,之前的现象看log,dfget node其实一直在发起pullPieceTask请求,一直收到peer should wait,supernode gc前并没有记录,直到gc掉后才开始报错。
And could you please provide the version you used?
In fact, the logs you provided is not for the same peer. And it's hard to locate the problem. Now suppose that you find a peer failed to download the file, and here are a few steps that may help you.
- Check the dfclient.log and find the
sign
andtaskID
value which uniquely specify a task.
$ less ~/.small-dragonfly/logs/dfclient.log
2020-01-02 19:18:03.494 INFO sign: **12305-1577959255.676** :
"taskID":"**d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f**"
- Get all logs about the unsuccessful this download task by grep the
sign
value and analyze the abnormal behavior of this download task.
$ grep 12305-1577959255.676 ~/.small-dragonfly/logs/dfclient.log | less
- If the logs you got in the previous step always output the log
{"Code":9,"Msg":"peer should wait"}"} and sleep 1.327s
till the end. Then you need to troubleshoot with supernode logs which grep thetaskID
you got in the first step.
I hope the above steps will be helpful to you. If you need any specific help, please let me know. 😄
最近没有复现失败的场景,我就按照你说的方式把之前的log梳理下, 发起请求的dfget节点有13个,其中涉及到dfget请求失败的有两个,73.2是请求失败的ip,73.65是73.2请求其中一个piece的peer节点,72.238是supernode节点。
73.2 dfclient日志:
2020-01-02 18:01:26.744 INFO sign:12305-1577959255.676 : downloading piece:{"taskID":"d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","superNode":"xxx.xx.72.238:8002","dstCid":"cdnnode:xxx.xx.72.238~d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","range":"155189248-159383551","result":503,"status":701,"pieceSize":4194304,"pieceNum":37}
2020-01-02 18:01:26.744 INFO sign:12305-1577959255.676 : pull piece task({"taskID":"d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","superNode":"xxx.xx.72.238:8002","dstCid":"cdnnode:xxx.xx.72.238~d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","range":"155189248-159383551","result":503,"status":701,"pieceSize":4194304,"pieceNum":37}) result:{"code":602,"msg":"taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f) clientID(xxx.xx.73.2-12305-1577959255.676): {\"Code\":9,\"Msg\":\"peer should wait\"}"} and sleep 1.384s
2020-01-02 18:01:28.129 INFO sign:12305-1577959255.676 : pull rate result:20971520 cost:274.86µs2020-01-02 18:01:28.129 INFO sign:12305-1577959255.676 : downloading piece:{"taskID":"d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","superNode":"xxx.xx.72.238:8002","dstCid":"cdnnode:xxx.xx.72.238~d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","range":"163577856-167772159","result":503,"status":701,"pieceSize":4194304,"pieceNum":39}
2020-01-02 18:01:28.130 INFO sign:12305-1577959255.676 : pull piece task({"taskID":"d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","superNode":"xxx.xx.72.238:8002","dstCid":"cdnnode:xxx.xx.72.238~d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","range":"163577856-167772159","result":503,"status":701,"pieceSize":4194304,"pieceNum":39}) result:{"code":602,"msg":"taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f) clientID(xxx.xx.73.2-12305-1577959255.676): {\"Code\":9,\"Msg\":\"peer should wait\"}"} and sleep 1.243s
2020-01-02 18:01:29.374 WARN sign:12305-1577959255.676 : request piece result:{"code":602,"msg":"taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f) clientID(xxx.xx.73.2-12305-1577959255.676): {\"Code\":9,\"Msg\":\"peer should wait\"}"}
2020-01-02 18:01:29.374 INFO sign:12305-1577959255.676 : downloading piece:{"taskID":"d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","superNode":"xxx.xx.72.238:8002","dstCid":"xxx.xx.73.65-10966-1577959254.445","range":"167772160-171966463","result":503,"status":701,"pieceSize":4194304,"pieceNum":40}
2020-01-02 18:01:29.374 INFO sign:12305-1577959255.676 : pull piece task({"taskID":"d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","superNode":"xxx.xx.72.238:8002","dstCid":"xxx.xx.73.65-10966-1577959254.445","range":"167772160-171966463","result":503,"status":701,"pieceSize":4194304,"pieceNum":40}) result:{"code":602,"msg":"taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f) clientID(xxx.xx.73.2-12305-1577959255.676): {\"Code\":9,\"Msg\":\"peer should wait\"}"} and sleep 1.104s
2020-01-02 18:01:30.479 INFO sign:12305-1577959255.676 : pull piece task({"taskID":"d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","superNode":"xxx.xx.72.238:8002","dstCid":"xxx.xx.73.65-10966-1577959254.445","range":"167772160-171966463","result":503,"status":701,"pieceSize":4194304,"pieceNum":40}) result:{"code":602,"msg":"taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f) clientID(xxx.xx.73.2-12305-1577959255.676): {\"Code\":9,\"Msg\":\"peer should wait\"}"} and sleep 0.971s
2020-01-02 18:01:31.451 INFO sign:12305-1577959255.676 : pull piece task({"taskID":"d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","superNode":"xxx.xx.72.238:8002","dstCid":"xxx.xx.73.65-10966-1577959254.445","range":"167772160-171966463","result":503,"status":701,"pieceSize":4194304,"pieceNum":40}) result:{"code":602,"msg":"taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f) clientID(xxx.xx.73.2-12305-1577959255.676): {\"Code\":9,\"Msg\":\"peer should wait\"}"} and sleep 1.502s
2020-01-02 18:01:32.954 INFO sign:12305-1577959255.676 : pull piece task({"taskID":"d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","superNode":"xxx.xx.72.238:8002","dstCid":"xxx.xx.73.65-10966-1577959254.445","range":"167772160-171966463","result":503,"status":701,"pieceSize":4194304,"pieceNum":40}) result:{"code":602,"msg":"taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f) clientID(xxx.xx.73.2-12305-1577959255.676): {\"Code\":9,\"Msg\":\"peer should wait\"}"} and sleep 1.381s
从log看是请求最后一个piece(pieceNum:40)失败的,从时间看其实对最后一个piece的pull piece task从18:01:29就开始并且持续不断了,从supernode看log是下面情况:
2020-01-02 18:01:00.218 INFO sign:2801 : success update dfgetTask status to RUNNING with taskID: d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f clientID: xxx.xx.73.2-12305-1577959255.676
2020-01-02 18:01:01.585 INFO sign:2801 : success update dfgetTask status to RUNNING with taskID: d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f clientID: xxx.xx.73.2-12305-1577959255.676
2020-01-02 18:08:32.552 WARN sign:2801 : failed to get dfget task by dstCID(xxx.xx.73.65-10966-1577959254.445) and taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f), and the srcCID is xxx.xx.73.2-12305-1577959255.676, err: failed to get key xxx.xx.73.65-10966-1577959254.445@d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f from map: {"Code":0,"Msg":"data not found"}
2020-01-02 18:08:33.467 WARN sign:2801 : failed to get dfget task by dstCID(xxx.xx.73.65-10966-1577959254.445) and taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f), and the srcCID is xxx.xx.73.2-12305-1577959255.676, err: failed to get key xxx.xx.73.65-10966-1577959254.445@d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f from map: {"Code":0,"Msg":"data not found"}
2020-01-02 18:08:34.716 WARN sign:2801 : failed to get dfget task by dstCID(xxx.xx.73.65-10966-1577959254.445) and taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f), and the srcCID is xxx.xx.73.2-12305-1577959255.676, err: failed to get key xxx.xx.73.65-10966-1577959254.445@d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f from map: {"Code":0,"Msg":"data not found"}
2020-01-02 18:08:35.955 WARN sign:2801 : failed to get dfget task by dstCID(xxx.xx.73.65-10966-1577959254.445) and taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f), and the srcCID is xxx.xx.73.2-12305-1577959255.676, err: failed to get key xxx.xx.73.65-10966-1577959254.445@d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f from map: {"Code":0,"Msg":"data not found"}
2020-01-02 18:08:37.278 WARN sign:2801 : failed to get dfget task by dstCID(xxx.xx.73.65-10966-1577959254.445) and taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f), and the srcCID is xxx.xx.73.2-12305-1577959255.676, err: failed to get key xxx.xx.73.65-10966-1577959254.445@d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f from map: {"Code":0,"Msg":"data not found"}
dstcid指向的是73.65这个ip,我把这台设备的log也贴出来如下:
dfclient.log
5189248-159383551","result":503,"status":701,"pieceSize":4194304,"pieceNum":37}) result:{"code":602,"msg":"taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f) clientID(xxx.xx.73.71-11873-1577959255.126): {\"Code\":9,\"Msg\":\"peer should wait\"}"} and sleep 1.372s
2020-01-02 18:01:28.210 INFO sign:11873-1577959255.126 : downloading piece:{"taskID":"d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","superNode":"xxx.xx.72.238:8002","dstCid":"cdnnode:xxx.xx.72.238~d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","range":"163577856-167772159","result":503,"status":701,"pieceSize":4194304,"pieceNum":39}
2020-01-02 18:01:28.215 INFO sign:11873-1577959255.126 : pull piece task({"taskID":"d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","superNode":"xxx.xx.72.238:8002","dstCid":"cdnnode:xxx.xx.72.238~d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f","range":"163577856-167772159","result":503,"status":701,"pieceSize":4194304,"pieceNum":39}) result:{"code":602,"msg":"taskID(d92205d4ce2007422736d99a34c53d8263b157a8362c4894e1f926c5b406362f) clientID(xxx.xx.73.71-11873-1577959255.126): {\"Code\":9,\"Msg\":\"peer should wait\"}"} and sleep 1.370s
2020-01-02 18:01:29.586 INFO sign:11873-1577959255.126 : remaining piece to be written count:0
2020-01-02 18:01:29.616 INFO sign:11873-1577959255.126 : wait client writer finish cost:0.030,main qu size:1,client qu size:0
2020-01-02 18:01:29.616 INFO sign:11873-1577959255.126 : move src:/root/.small-dragonfly/data/54725503-c7e5-490b-be99-beb7cceaa197-11873-1577959255.126 to dst:/root/.small-dragonfly/dfdaemon/data/54725503-c7e5-490b-be99-beb7cceaa197 result:true cost:0.000
2020-01-02 18:01:29.616 INFO sign:11873-1577959255.126 : download successfully from dragonfly
2020-01-02 18:01:29.619 INFO sign:11873-1577959255.126 : download SUCCESS from supernode [xxx.xx.73.72:8002 xxx.xx.72.238:8002 xxx.xx.73.72:8002] cost:34.493s length:170724484
dfserver.log:
2020-01-02 18:00:44.245 INFO sign:10023-1577958904.607 : update total limit to 20971520
2020-01-02 18:00:44.256 INFO sign:10023-1577958904.607 : update total limit to 20971520
2020-01-02 18:00:44.261 INFO sign:10023-1577958904.607 : update total limit to 20971520
2020-01-02 18:00:44.267 INFO sign:10023-1577958904.607 : update total limit to 20971520
2020-01-02 18:00:55.128 INFO sign:10023-1577958904.607 : update total limit to 20971520
2020-01-02 18:04:04.664 INFO sign:10023-1577958904.607 : server gc, delete file:/root/.small-dragonfly/data/3504269d-a785-4f78-8b1a-2a5bbb2bee67-11759-1577959244.253.service
2020-01-02 18:04:04.667 INFO sign:10023-1577958904.607 : server gc, delete file:/root/.small-dragonfly/data/87f459a6-6004-47b5-aba5-af8a61c0073d-11763-1577959244.260.service
2020-01-02 18:04:04.668 INFO sign:10023-1577958904.607 : server gc, delete file:/root/.small-dragonfly/data/9e5bf398-047a-4590-9244-8dec81600fbc-11774-1577959244.265.service
2020-01-02 18:04:04.669 INFO sign:10023-1577958904.607 : server gc, delete file:/root/.small-dragonfly/data/f972a500-0293-42f6-aa7b-3f86a2a86b3b-11758-1577959244.243.service
2020-01-02 18:04:34.710 INFO sign:10023-1577958904.607 : server gc, delete file:/root/.small-dragonfly/data/54725503-c7e5-490b-be99-beb7cceaa197-11873-1577959255.126.service
2020-01-02 18:06:27.830 INFO sign:10023-1577958904.607 : no more task, peer server will stop...
2020-01-02 18:06:27.831 INFO sign:10023-1577958904.607 : peer server is shutdown.
PS:73.2从73.65请求pieceNum:40的piece分片,但是我在73.65 dfclient log中全局搜索那个时段并没有请求pieceNum:40的piece数据,请求完pieceNum:39后就下载成功了。
supernode 和 dfget 通过version看都是1.0.0版本
我也碰到了类似的问题,大概70个client 同时下载,会偶发某个client 卡住。
代码版本: 7894da121306b5e26db485e1f5ed11bff12422e9
supernode 日志: 2020-01-05 23:36:33.358 INFO sign:10112-1578238471.150 : downloading piece:{"taskID":"dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755","superNode":"11.163.182.164:8002","dstCid":"11.163.187.96-47517-1578238472.082","range":"13227786240-13243514879","result":503,"status":701,"pieceSize":15728640,"pieceNum":841} 2020-01-05 23:36:33.363 INFO sign:10112-1578238471.150 : pull piece task({"taskID":"dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755","superNode":"11.163.182.164:8002","dstCid":"11.163.187.96-47517-1578238472.082","range":"13227786240-13243514879","result":503,"status":701,"pieceSize":15728640,"pieceNum":841}) result:{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} and sleep 1.193s 2020-01-05 23:36:34.562 INFO sign:10112-1578238471.150 : pull piece task({"taskID":"dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755","superNode":"11.163.182.164:8002","dstCid":"11.163.187.96-47517-1578238472.082","range":"13227786240-13243514879","result":503,"status":701,"pieceSize":15728640,"pieceNum":841}) result:{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} and sleep 1.532s 2020-01-05 23:36:36.099 INFO sign:10112-1578238471.150 : pull piece task({"taskID":"dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755","superNode":"11.163.182.164:8002","dstCid":"11.163.187.96-47517-1578238472.082","range":"13227786240-13243514879","result":503,"status":701,"pieceSize":15728640,"pieceNum":841}) result:{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} and sleep 1.267s 2020-01-05 23:36:37.372 INFO sign:10112-1578238471.150 : pull piece task({"taskID":"dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755","superNode":"11.163.182.164:8002","dstCid":"11.163.187.96-47517-1578238472.082","range":"13227786240-13243514879","result":503,"status":701,"pieceSize":15728640,"pieceNum":841}) result:{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} and sleep 1.593s 2020-01-05 23:36:38.970 INFO sign:10112-1578238471.150 : pull piece task({"taskID":"dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755","superNode":"11.163.182.164:8002","dstCid":"11.163.187.96-47517-1578238472.082","range":"13227786240-13243514879","result":503,"status":701,"pieceSize":15728640,"pieceNum":841}) result:{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} and sleep 1.576s
dfget 日志 2020-01-05 15:43:07.151 WARN sign:48888 : failed to get dfget task by dstCID(11.163.187.96-47517-1578238472.082) and taskID(dd9697255d580b1fce938b2d55909 cb27196375df577560a7b5aa2d6f93dd755), and the srcCID is 100.82.23.44-10112-1578238471.150, err: failed to get key 11.163.187.96-47517-1578238472.082@dd96 97255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 from map: {"Code":0,"Msg":"data not found"} 2020-01-05 15:43:08.326 WARN sign:48888 : failed to get dfget task by dstCID(11.163.187.96-47517-1578238472.082) and taskID(dd9697255d580b1fce938b2d55909 cb27196375df577560a7b5aa2d6f93dd755), and the srcCID is 100.82.23.44-10112-1578238471.150, err: failed to get key 11.163.187.96-47517-1578238472.082@dd96 97255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 from map: {"Code":0,"Msg":"data not found"} 2020-01-05 15:43:09.370 WARN sign:48888 : failed to get dfget task by dstCID(11.163.187.96-47517-1578238472.082) and taskID(dd9697255d580b1fce938b2d55909 cb27196375df577560a7b5aa2d6f93dd755), and the srcCID is 100.82.23.44-10112-1578238471.150, err: failed to get key 11.163.187.96-47517-1578238472.082@dd96 97255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 from map: {"Code":0,"Msg":"data not found"} 2020-01-05 15:43:10.226 WARN sign:48888 : failed to get dfget task by dstCID(11.163.187.96-47517-1578238472.082) and taskID(dd9697255d580b1fce938b2d55909 cb27196375df577560a7b5aa2d6f93dd755), and the srcCID is 100.82.23.44-10112-1578238471.150, err: failed to get key 11.163.187.96-47517-1578238472.082@dd96 97255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 from map: {"Code":0,"Msg":"data not found"} 2020-01-05 15:43:11.103 WARN sign:48888 : failed to get dfget task by dstCID(11.163.187.96-47517-1578238472.082) and taskID(dd9697255d580b1fce938b2d55909 cb27196375df577560a7b5aa2d6f93dd755), and the srcCID is 100.82.23.44-10112-1578238471.150, err: failed to get key 11.163.187.96-47517-1578238472.082@dd96 97255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 from map: {"Code":0,"Msg":"data not found"} 2020-01-05 15:43:12.538 WARN sign:48888 : failed to get dfget task by dstCID(11.163.187.96-47517-1578238472.082) and taskID(dd9697255d580b1fce938b2d55909 cb27196375df577560a7b5aa2d6f93dd755), and the srcCID is 100.82.23.44-10112-1578238471.150, err: failed to get key 11.163.187.96-47517-1578238472.082@dd96 97255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 from map: {"Code":0,"Msg":"data not found"}
tcpdump抓包: GET /peer/task?srcCid=100.82.23.44-10112-1578238471.150&dstCid=11.163.187.96-47517-1578238472.082&range=13227786240-13243514879&result=503&status=701&taskId=dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 HTTP/1.1 User-Agent: fasthttp Host: 11.163.182.164:8002
HTTP/1.1 200 OK Content-Type: application/json Date: Sun, 05 Jan 2020 15:47:44 GMT Content-Length: 181
{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} GET /peer/task?srcCid=100.82.23.44-10112-1578238471.150&dstCid=11.163.187.96-47517-1578238472.082&range=13227786240-13243514879&result=503&status=701&taskId=dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 HTTP/1.1 User-Agent: fasthttp Host: 11.163.182.164:8002
HTTP/1.1 200 OK Content-Type: application/json Date: Sun, 05 Jan 2020 15:47:45 GMT Content-Length: 181
{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} GET /peer/task?srcCid=100.82.23.44-10112-1578238471.150&dstCid=11.163.187.96-47517-1578238472.082&range=13227786240-13243514879&result=503&status=701&taskId=dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 HTTP/1.1 User-Agent: fasthttp Host: 11.163.182.164:8002
HTTP/1.1 200 OK Content-Type: application/json Date: Sun, 05 Jan 2020 15:47:46 GMT Content-Length: 181
{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"}
我也碰到了类似的问题,大概70个client 同时下载,会偶发某个client 卡住。
代码版本: 7894da1
supernode 日志: 2020-01-05 23:36:33.358 INFO sign:10112-1578238471.150 : downloading piece:{"taskID":"dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755","superNode":"11.163.182.164:8002","dstCid":"11.163.187.96-47517-1578238472.082","range":"13227786240-13243514879","result":503,"status":701,"pieceSize":15728640,"pieceNum":841} 2020-01-05 23:36:33.363 INFO sign:10112-1578238471.150 : pull piece task({"taskID":"dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755","superNode":"11.163.182.164:8002","dstCid":"11.163.187.96-47517-1578238472.082","range":"13227786240-13243514879","result":503,"status":701,"pieceSize":15728640,"pieceNum":841}) result:{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} and sleep 1.193s 2020-01-05 23:36:34.562 INFO sign:10112-1578238471.150 : pull piece task({"taskID":"dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755","superNode":"11.163.182.164:8002","dstCid":"11.163.187.96-47517-1578238472.082","range":"13227786240-13243514879","result":503,"status":701,"pieceSize":15728640,"pieceNum":841}) result:{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} and sleep 1.532s 2020-01-05 23:36:36.099 INFO sign:10112-1578238471.150 : pull piece task({"taskID":"dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755","superNode":"11.163.182.164:8002","dstCid":"11.163.187.96-47517-1578238472.082","range":"13227786240-13243514879","result":503,"status":701,"pieceSize":15728640,"pieceNum":841}) result:{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} and sleep 1.267s 2020-01-05 23:36:37.372 INFO sign:10112-1578238471.150 : pull piece task({"taskID":"dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755","superNode":"11.163.182.164:8002","dstCid":"11.163.187.96-47517-1578238472.082","range":"13227786240-13243514879","result":503,"status":701,"pieceSize":15728640,"pieceNum":841}) result:{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} and sleep 1.593s 2020-01-05 23:36:38.970 INFO sign:10112-1578238471.150 : pull piece task({"taskID":"dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755","superNode":"11.163.182.164:8002","dstCid":"11.163.187.96-47517-1578238472.082","range":"13227786240-13243514879","result":503,"status":701,"pieceSize":15728640,"pieceNum":841}) result:{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} and sleep 1.576s
dfget 日志 2020-01-05 15:43:07.151 WARN sign:48888 : failed to get dfget task by dstCID(11.163.187.96-47517-1578238472.082) and taskID(dd9697255d580b1fce938b2d55909 cb27196375df577560a7b5aa2d6f93dd755), and the srcCID is 100.82.23.44-10112-1578238471.150, err: failed to get key 11.163.187.96-47517-1578238472.082@dd96 97255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 from map: {"Code":0,"Msg":"data not found"} 2020-01-05 15:43:08.326 WARN sign:48888 : failed to get dfget task by dstCID(11.163.187.96-47517-1578238472.082) and taskID(dd9697255d580b1fce938b2d55909 cb27196375df577560a7b5aa2d6f93dd755), and the srcCID is 100.82.23.44-10112-1578238471.150, err: failed to get key 11.163.187.96-47517-1578238472.082@dd96 97255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 from map: {"Code":0,"Msg":"data not found"} 2020-01-05 15:43:09.370 WARN sign:48888 : failed to get dfget task by dstCID(11.163.187.96-47517-1578238472.082) and taskID(dd9697255d580b1fce938b2d55909 cb27196375df577560a7b5aa2d6f93dd755), and the srcCID is 100.82.23.44-10112-1578238471.150, err: failed to get key 11.163.187.96-47517-1578238472.082@dd96 97255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 from map: {"Code":0,"Msg":"data not found"} 2020-01-05 15:43:10.226 WARN sign:48888 : failed to get dfget task by dstCID(11.163.187.96-47517-1578238472.082) and taskID(dd9697255d580b1fce938b2d55909 cb27196375df577560a7b5aa2d6f93dd755), and the srcCID is 100.82.23.44-10112-1578238471.150, err: failed to get key 11.163.187.96-47517-1578238472.082@dd96 97255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 from map: {"Code":0,"Msg":"data not found"} 2020-01-05 15:43:11.103 WARN sign:48888 : failed to get dfget task by dstCID(11.163.187.96-47517-1578238472.082) and taskID(dd9697255d580b1fce938b2d55909 cb27196375df577560a7b5aa2d6f93dd755), and the srcCID is 100.82.23.44-10112-1578238471.150, err: failed to get key 11.163.187.96-47517-1578238472.082@dd96 97255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 from map: {"Code":0,"Msg":"data not found"} 2020-01-05 15:43:12.538 WARN sign:48888 : failed to get dfget task by dstCID(11.163.187.96-47517-1578238472.082) and taskID(dd9697255d580b1fce938b2d55909 cb27196375df577560a7b5aa2d6f93dd755), and the srcCID is 100.82.23.44-10112-1578238471.150, err: failed to get key 11.163.187.96-47517-1578238472.082@dd96 97255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 from map: {"Code":0,"Msg":"data not found"}
tcpdump抓包: GET /peer/task?srcCid=100.82.23.44-10112-1578238471.150&dstCid=11.163.187.96-47517-1578238472.082&range=13227786240-13243514879&result=503&status=701&taskId=dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 HTTP/1.1 User-Agent: fasthttp Host: 11.163.182.164:8002
HTTP/1.1 200 OK Content-Type: application/json Date: Sun, 05 Jan 2020 15:47:44 GMT Content-Length: 181
{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} GET /peer/task?srcCid=100.82.23.44-10112-1578238471.150&dstCid=11.163.187.96-47517-1578238472.082&range=13227786240-13243514879&result=503&status=701&taskId=dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 HTTP/1.1 User-Agent: fasthttp Host: 11.163.182.164:8002
HTTP/1.1 200 OK Content-Type: application/json Date: Sun, 05 Jan 2020 15:47:45 GMT Content-Length: 181
{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} GET /peer/task?srcCid=100.82.23.44-10112-1578238471.150&dstCid=11.163.187.96-47517-1578238472.082&range=13227786240-13243514879&result=503&status=701&taskId=dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755 HTTP/1.1 User-Agent: fasthttp Host: 11.163.182.164:8002
HTTP/1.1 200 OK Content-Type: application/json Date: Sun, 05 Jan 2020 15:47:46 GMT Content-Length: 181
{"code":602,"msg":"taskID(dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755) clientID(100.82.23.44-10112-1578238471.150): {"Code":9,"Msg":"peer should wait"}"} 哥们,你是下载最后一个piece失败的吗?
@LeslieSun @planetkr Could you please help confirm if the unsuccessful peer node has an error like this. You can do that by exec grep 'failed to report piece' dfclient.log
.
2020-01-06 16:56:12.795 ERRO sign:20734-1578300875.919 : failed to report piece{taskid:dd9697255d580b1fce938b2d55909cb27196375df577560a7b5aa2d6f93dd755,range:2217738240-2233466879},err: dialing to the given TCP address timed out
If that, that means the dfget failed to report supernode that it has downloaded this piece and supernode will judge that this node has not been downloaded yet. So the above situation that you described will happen. 😢
And if you find that it is indeed as mentioned above, please let me know and I will start fixing this. THX.
@Starnop ,没有在dfclient.log中发现有failed to report piece
输出,不过另外一次测试发现一些log我无法从源码中找到合理的解释,我贴出来
2020-01-06 13:56:24.155 DEBU sign:6997 : start to process task(8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab) start 2020-01-06 13:56:24.155 INFO sign:6997 : success update dfgetTask status to RUNNING with taskID: 8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab clientID: xxx.xx.73.97-6796-1578290173.682 2020-01-06 13:56:24.155 DEBU sign:6997 : taskID(8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab) clientID(xxx.xx.73.97-6796-1578290173.682) get successful pieces: [] 2020-01-06 13:56:24.155 DEBU sign:6997 : start scheduler for taskID: 8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab clientID: xxx.xx.73.97-6796-1578290173.682 2020-01-06 13:56:24.155 DEBU sign:6997 : scheduler get available pieces [0] for taskID(8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab) clientID(xxx.xx.73.97-6796-1578290173.682) 2020-01-06 13:56:24.155 DEBU sign:6997 : scheduler get running pieces [] for taskID(8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab) clientID(xxx.xx.73.97-6796-1578290173.682) 2020-01-06 13:56:24.155 DEBU sign:6997 : scheduler get pieces [0] with prioritize for taskID(8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab) clientID(xxx.xx.73.97-6796-1578290173.682) 2020-01-06 13:56:24.155 DEBU sign:6997 : scheduler: success to get peerIDs([]) pieceNum(0) taskID(8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab), clientID(xxx.xx.73.97-6796-1578290173.682) 2020-01-06 13:56:24.155 DEBU sign:6997 : success to update ClientProgress taskID(8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab) srcCID(xxx.xx.73.97-6796-1578290173.682) dstPID(tst4-xuanwu-1.prod.yiran.com-xxx.xx.72.238-1578225491848925872) pieceNum(0) pieceStatus(0) with result: true2020-01-06 13:56:24.156 DEBU sign:6997 : get pieces request: &{DfgetTaskStatus:STARTED DstPID: PieceRange: PieceResult:INVALID} with taskID(8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab) and clientID(xxx.xx.73.97-6796-1578290173.682)2020-01-06 13:56:24.156 DEBU sign:6997 : success to get dfgetTask: &{CID:xxx.xx.73.97-6796-1578290173.682 CallSystem: Dfdaemon:true Path:/peer/file/a3375b2b-ef1e-4c15-add0-cb8f3e15af81-6796-1578290173.682 PeerID:tst4-xuanwu-10.prod.yiran.com-xxx.xx.73.97-1578290174188380008 PieceSize:4194304 Status:RUNNING SupernodeIP:xxx.xx.72.238 TaskID:8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab}
我大致描述下,dfget在请求资源后一直发起pullPieceTask请求,在supernode未成功下载好第一个piece前dfclient.log中日志大致都是这样pull piece task({"taskID":"8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab","superNode":"xxx.xx.72.238:8002","dstCid":"","range":"","result":502,"status":700,"pieceSize":0,"pieceNum":0}) result:{"code":602,"msg":"clientSucCount:0,cdnSucCount:0: {\"Code\":9,\"Msg\":\"peer should wait\"}"} and sleep 1.072s
此时是没有任何问题的,但是当supernode下载完第一个piece时,在响应用户请求时log有点问题,在输出success to update ClientProgress taskID
后(此时已经查找到了available piece),应该还有log输出才对,这段log对应的代码片段如下:
pieceResult, err := tm.schedulerMgr.Schedule(ctx, task.ID, clientID, dfgetTask.PeerID) if err != nil { logrus.Errorf("scheduler failed,%v", err) return false, nil, err } tm.metrics.scheduleDurationMilliSeconds.WithLabelValues(peer.IP.String()).Observe(timeutils.SinceInMilliseconds(startTime)) logrus.Debugf("get scheduler result length(%d) with taskID(%s) and clientID(%s)", len(pieceResult), task.ID, clientID)
其中logrus.Errorf("scheduler failed,%v", err)
是我自己添加的,并且变更了,后面pullPieceTask请求都有这条日志的打印,从代码看无论成功失败都会有日志输出才对,但是在这里此次pullPieceTask的请求log已经没有其他输出了,接着下一毫秒就开始处理下一次pullPieceTask的请求了,这次请求又是没有任何available状态的piece,这又是另一个不合理的情况,因为dfget会休眠一段时间(并没有休眠1ms的dfclient记录),这里我额外统计了dfclient.log中发起pullPieceTask请求次数,是269次,而supernode中处理次数是271次,多出了两次,我在supernode中也确实发现有两个请求的间隔时间非常短暂,我把supernode中间隔很短的log贴出来
2020-01-06 13:56:24.155 DEBU sign:6997 : get pieces request: &{DfgetTaskStatus:STARTED DstPID: PieceRange: PieceResult:INVALID} with taskID(8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab) and clientID(xxx.xx.73.97-6796-1578290173.682) 2020-01-06 13:56:24.156 DEBU sign:6997 : get pieces request: &{DfgetTaskStatus:STARTED DstPID: PieceRange: PieceResult:INVALID} with taskID(8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab) and clientID(xxx.xx.73.97-6796-1578290173.682) 2020-01-06 13:56:25.365 DEBU sign:6997 : get pieces request: &{DfgetTaskStatus:STARTED DstPID: PieceRange: PieceResult:INVALID} with taskID(8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab) and clientID(xxx.xx.73.97-6796-1578290173.682) 2020-01-06 13:56:25.366 DEBU sign:6997 : get pieces request: &{DfgetTaskStatus:STARTED DstPID: PieceRange: PieceResult:INVALID} with taskID(8815caad97127ceac74a986b16a8f083fc7fee4c2b5e2a864d95d7cd8ff6b6ab) and clientID(xxx.xx.73.97-6796-1578290173.682)
第一次间隔时间很短的两次请求貌似dfget是收到了第二个的响应,导致一个问题是supernode已经将piece0的设置为running状态了,但是dfget并没有下载。
PS:本次下载的文件只有piece0和piece1两个分片,其实piece1的情况和piece0是一样的,上面贴出两次间隔时间很短的第二次其实是piece1下载完成时处理pullPieceTask请求时出现的,log比较相似我就不贴了
我也遇到了类似问题,通过 grep dfclient.log | grep ERRO
发现似乎是卡在了从某个peer那里拉piece(piece所在的peer挂了),目前还不清楚dfclient如何处理某个peer挂掉的情况,但眼下因为一个peer挂导致整个下载挂掉实在不是很友好。
我也遇到了,我这大概有100台左右的node,从3台supernode下载一个40G+的文件, 经常出现卡在某一个piece位置,出现卡主。只能等待超时,然后再次重试。 日志大概为: 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.17.103-1611009001772493664) from peerProgress: failed to get key 1.1.17.103-1611009001772493664 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.4.86-1611009001812312823) from peerProgress: failed to get key 1.1.4.86-1611009001812312823 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.4.87-1611009001915801132) from peerProgress: failed to get key 1.1.4.87-1611009001915801132 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.17.114-1611009002135243305) from peerProgress: failed to get key 1.1.17.114-1611009002135243305 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.17.95-1611009001285817745) from peerProgress: failed to get key 1.1.17.95-1611009001285817745 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.5.70-1611009001308879335) from peerProgress: failed to get key 1.1.5.70-1611009001308879335 from map: {"Code":0,"Msg":"data not found"}
我也遇到了,我这大概有100台左右的node,从3台supernode下载一个40G+的文件, 经常出现卡在某一个piece位置,出现卡主。只能等待超时,然后再次重试。 日志大概为: 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.17.103-1611009001772493664) from peerProgress: failed to get key 1.1.17.103-1611009001772493664 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.4.86-1611009001812312823) from peerProgress: failed to get key 1.1.4.86-1611009001812312823 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.4.87-1611009001915801132) from peerProgress: failed to get key 1.1.4.87-1611009001915801132 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.17.114-1611009002135243305) from peerProgress: failed to get key 1.1.17.114-1611009002135243305 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.17.95-1611009001285817745) from peerProgress: failed to get key 1.1.17.95-1611009001285817745 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.5.70-1611009001308879335) from peerProgress: failed to get key 1.1.5.70-1611009001308879335 from map: {"Code":0,"Msg":"data not found"}
今天解决一个小问题,可能对大家有帮助 supernode error log: 2021/01/21 14:59:58 http: Accept error: accept tcp 0.0.0.0:8002: accept4: too many open files; retrying in 1s 2021/01/21 14:59:59 http: Accept error: accept tcp 0.0.0.0:8002: accept4: too many open files; retrying in 1s 2021/01/21 15:00:00 http: Accept error: accept tcp 0.0.0.0:8002: accept4: too many open files; retrying in 1s 2021/01/21 15:00:01 http: Accept error: accept tcp 0.0.0.0:8002: accept4: too many open files; retrying in 1s 2021/01/21 15:00:02 http: Accept error: accept tcp 0.0.0.0:8002: accept4: too many open files; retrying in 1s 2021/01/21 15:00:03 http: Accept error: accept tcp 0.0.0.0:8002: accept4: too many open files; retrying in 1s
当前的open file 是 65535 lsof -p $(pgrep supernode)| wc -l 查看到打开文件数 在 65537
结果显而易见,将openfile 调整到 655350 目前看下载正常,持续关注中。。。
我也遇到了,我这大概有100台左右的node,从3台supernode下载一个40G+的文件, 经常出现卡在某一个piece位置,出现卡主。只能等待超时,然后再次重试。 日志大概为: 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.17.103-1611009001772493664) from peerProgress: failed to get key 1.1.17.103-1611009001772493664 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.4.86-1611009001812312823) from peerProgress: failed to get key 1.1.4.86-1611009001812312823 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.4.87-1611009001915801132) from peerProgress: failed to get key 1.1.4.87-1611009001915801132 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.17.114-1611009002135243305) from peerProgress: failed to get key 1.1.17.114-1611009002135243305 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.17.95-1611009001285817745) from peerProgress: failed to get key 1.1.17.95-1611009001285817745 from map: {"Code":0,"Msg":"data not found"} 2021-01-19 06:31:59.419 ERRO sign:22492 : failed to delete peerID(1.1.5.70-1611009001308879335) from peerProgress: failed to get key 1.1.5.70-1611009001308879335 from map: {"Code":0,"Msg":"data not found"}
今天解决一个小问题,可能对大家有帮助 supernode error log: 2021/01/21 14:59:58 http: Accept error: accept tcp 0.0.0.0:8002: accept4: too many open files; retrying in 1s 2021/01/21 14:59:59 http: Accept error: accept tcp 0.0.0.0:8002: accept4: too many open files; retrying in 1s 2021/01/21 15:00:00 http: Accept error: accept tcp 0.0.0.0:8002: accept4: too many open files; retrying in 1s 2021/01/21 15:00:01 http: Accept error: accept tcp 0.0.0.0:8002: accept4: too many open files; retrying in 1s 2021/01/21 15:00:02 http: Accept error: accept tcp 0.0.0.0:8002: accept4: too many open files; retrying in 1s 2021/01/21 15:00:03 http: Accept error: accept tcp 0.0.0.0:8002: accept4: too many open files; retrying in 1s
当前的open file 是 65535 lsof -p $(pgrep supernode)| wc -l 查看到打开文件数 在 65537
结果显而易见,将openfile 调整到 655350 目前看下载正常,持续关注中。。。
我想 我解决了,其他的同学参考下看是否能解决, 上面的openfile是一个
另外就是 ,1 个 source 多个 supernode 出现peer被gc的情况, 然后把所有的 supernode都改成suroce, 做成单次下载从同一个IP上(supernode+source)去下载, 这样基本没出现错误。 1source + n supernode集群 修改为 n source + n supernode集群。(并且下载时 指定source 和 supernode是同一个)
目前看已经能正常下载,带来的问题是要做n个source 并且有数据同步问题。
期待官方同学 查查为啥。
Is this problem solved?
I also encountered this problem