volume-balancer
infinite loop of "subdir already exists, skipping this one"
Hi, I'm just testing your awesome tool on our cluster. It worked fine on one datanode for a while; then I gracefully shut it down and it finished all of its directory moves. But when I tried to launch it a second time (because there was still some balancing left to do), it always fell into an infinite loop of the messages below, no matter what threshold or concurrency I specified:
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir58 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir60 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir12 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir15 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir28 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir45 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir26 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir32 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir38 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir43 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir50 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
Can you please look into this?
Thanks!
Maybe I should mention that I changed the hadoop version in pom.xml to
<hadoop.version>2.3.0-cdh5.1.3</hadoop.version>
and that I'm running a CDH 5.1.3 cluster on CentOS 6.5. Each DataNode has 14 disks assigned to HDFS.
It seems to me that this only happens when the first destination path constructed in a given balancer run already exists; if the collision happens later in the run, the program handles it fine. This theory is also backed by the fact that the first run was nowhere near finished, yet after being interrupted and started again it showed the bogus behavior described above. If that's the case, the fix might be fairly simple; I may look into it when I get a chance.
The problem occurs when the first subdirectory is not "subdir0" but, say, subdir7. Determining the name of the next subdirectory from the number of existing subdirectories is not a good idea. Here is a patch for that: https://gist.github.com/Karry/0c9bb2b23d7431ac6850
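For what it's worth, here is a minimal, hypothetical sketch of the idea behind the fix (this is not the gist's or the project's actual code): derive the next subdir index from the highest existing index rather than from the directory count, so a finalized tree that starts at, say, subdir7 can no longer produce a destination name that already exists.

// Hypothetical sketch only, not VolumeBalancer's real implementation.
// Count-based naming ("subdir" + numberOfChildren) collides as soon as the
// existing indices are not contiguous from 0; max-based naming does not.
import java.io.File;

public class NextSubdirSketch {

    private static final String SUBDIR_PREFIX = "subdir";

    /** Returns the name of the next free subdirN under the given finalized directory. */
    static String nextSubdirName(File finalizedDir) {
        int maxIndex = -1;
        File[] children = finalizedDir.listFiles();
        if (children != null) {
            for (File child : children) {
                String name = child.getName();
                if (child.isDirectory() && name.startsWith(SUBDIR_PREFIX)) {
                    try {
                        int index = Integer.parseInt(name.substring(SUBDIR_PREFIX.length()));
                        maxIndex = Math.max(maxIndex, index);
                    } catch (NumberFormatException ignored) {
                        // not a subdirN directory, skip it
                    }
                }
            }
        }
        return SUBDIR_PREFIX + (maxIndex + 1);
    }

    public static void main(String[] args) {
        // e.g. java NextSubdirSketch /data/21/dfs/dn/current/<blockpool>/current/finalized
        System.out.println(nextSubdirName(new File(args[0])));
    }
}

With this approach, a finalized directory containing only subdir7 and subdir63 yields subdir64 as the next destination, instead of colliding with an existing name.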
Your patch seems to be working, thanks a lot :+1: