infinite loop of "subdir already exists, skipping this one"

Open dwatzke opened this issue 10 years ago • 4 comments

Hi, I'm just testing your awesome tool on our cluster. It worked fine on one datanode for a bit, then I shut it down gracefully and it finished all dir moves etc. But when I tried to launch it a second time (because there was still some balancing to do), it always fell into an infinite loop of these messages, regardless of the threshold or concurrency I specified:

14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir58 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir60 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir12 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir15 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir28 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir45 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir26 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir32 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir38 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir43 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63
14/10/21 10:34:04 WARN datanode.VolumeBalancer: /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63 already exists, skipping this one.
14/10/21 10:34:04 INFO datanode.VolumeBalancer: Scheduled move from /data/10/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir50 to /data/21/dfs/dn/current/BP-789145406-5.45.60.18-1401435993279/current/finalized/subdir63

Can you please look into this?

Thanks!

dwatzke, Oct 21 '14 08:10

Maybe I should mention that I changed the hadoop version in pom.xml to

<hadoop.version>2.3.0-cdh5.1.3</hadoop.version>

and that I'm running the CDH 5.1.3 cluster on CentOS 6.5. Each DataNode has 14 disks assigned for HDFS.

dwatzke, Oct 21 '14 09:10

It seems to me that it only happens if the first destination path constructed in a particular balancer run already exists, because if that happens later in the run, the program seems to handle it well. This theory is also backed by the fact that the first run seemed to be nowhere near finished, but after it was interrupted and run again, the bogus behavior described above occurred. If that's the case, it might be pretty simple to fix; maybe I'll look into it when I get a chance.

dwatzke, Oct 23 '14 10:10

The problem occurs when the first subdirectory is not "subdir0" but subdir7. Determining the name of the next subdirectory from the number of existing subdirectories is not a good idea. Here is a patch for that: https://gist.github.com/Karry/0c9bb2b23d7431ac6850
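
To illustrate why count-based naming can collide, here is a minimal sketch. It is not the code from the gist or from VolumeBalancer itself; the class `NextSubdirName` and both methods are hypothetical names made up for this example. It assumes subdirectories are named `subdir<N>` and compares deriving the next name from the number of existing entries with scanning for the first unused index.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the actual VolumeBalancer code or the linked patch.
public class NextSubdirName {
    static final String PREFIX = "subdir";

    // Count-based naming: only safe if the existing names are exactly
    // subdir0 .. subdir(n-1) with no gaps.
    static String nextByCount(List<String> existing) {
        return PREFIX + existing.size();
    }

    // Scan-based naming: returns the first subdir<N> that is not already in use.
    static String nextFree(List<String> existing) {
        Set<String> taken = new HashSet<>(existing);
        int i = 0;
        while (taken.contains(PREFIX + i)) {
            i++;
        }
        return PREFIX + i;
    }

    public static void main(String[] args) {
        // subdir0 is missing, so the numbering does not start at zero.
        List<String> existing = Arrays.asList("subdir1", "subdir2", "subdir3");
        System.out.println(nextByCount(existing)); // prints "subdir3", which already exists
        System.out.println(nextFree(existing));    // prints "subdir0"
    }
}
```

With a gap in the numbering, the count-based name points at a directory that is already there, and it stays the same on every retry because the count never changes. That would be consistent with the repeated "already exists, skipping this one" messages above, although whether this is the exact mechanism in the tool is an assumption.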

Karry, Oct 29 '14 14:10

your patch seems to be working, thanks a lot :+1:

dwatzke, Oct 30 '14 08:10