hadoop icon indicating copy to clipboard operation
hadoop copied to clipboard

HDFS-16179. Update loglevel for BlockManager#chooseExcessRedundancySt…

Open tomscut opened this issue 3 years ago • 16 comments

JIRA: HDFS-16179

private void chooseExcessRedundancyStriped(BlockCollection bc,
    final Collection<DatanodeStorageInfo> nonExcess,
    BlockInfo storedBlock,
    DatanodeDescriptor delNodeHint) {
  ...
  // cardinality of found indicates the expected number of internal blocks
  final int numOfTarget = found.cardinality();
  final BlockStoragePolicy storagePolicy = storagePolicySuite.getPolicy(
      bc.getStoragePolicyID());
  final List<StorageType> excessTypes = storagePolicy.chooseExcess(
      (short) numOfTarget, DatanodeStorageInfo.toStorageTypes(nonExcess));
  if (excessTypes.isEmpty()) {
    LOG.warn("excess types chosen for block {} among storages {} is empty",
        storedBlock, nonExcess);
    return;
  }
  ...
}

IMO, here is just detecting excess StorageType and setting the log level to debug has no effect.

We have a cluster that uses the EC policy to store data. The current log level is WARN here, and in about 50 minutes, 286,093 logs are printed, which can cause other important logs to drown out.

image

image

tomscut avatar Aug 19 '21 03:08 tomscut

LGTM

Thanks @ayushtkn for your review.

tomscut avatar Aug 19 '21 05:08 tomscut

:broken_heart: -1 overall

Vote Subsystem Runtime Logfile Comment
+0 :ok: reexec 1m 4s Docker mode activated.
_ Prechecks _
+1 :green_heart: dupname 0m 0s No case conflicting files found.
+0 :ok: codespell 0m 1s codespell was not available.
+1 :green_heart: @author 0m 0s The patch does not contain any @author tags.
-1 :x: test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 :green_heart: mvninstall 37m 48s trunk passed
+1 :green_heart: compile 1m 25s trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 :green_heart: compile 1m 15s trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 :green_heart: checkstyle 1m 0s trunk passed
+1 :green_heart: mvnsite 1m 22s trunk passed
+1 :green_heart: javadoc 0m 56s trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 :green_heart: javadoc 1m 27s trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 :green_heart: spotbugs 3m 19s trunk passed
+1 :green_heart: shadedclient 19m 7s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 :green_heart: mvninstall 1m 15s the patch passed
+1 :green_heart: compile 1m 21s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 :green_heart: javac 1m 21s the patch passed
+1 :green_heart: compile 1m 7s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 :green_heart: javac 1m 7s the patch passed
+1 :green_heart: blanks 0m 0s The patch has no blanks issues.
+1 :green_heart: checkstyle 0m 53s the patch passed
+1 :green_heart: mvnsite 1m 17s the patch passed
+1 :green_heart: javadoc 0m 48s the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04
+1 :green_heart: javadoc 1m 23s the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
+1 :green_heart: spotbugs 3m 21s the patch passed
+1 :green_heart: shadedclient 18m 46s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 :x: unit 354m 15s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 :green_heart: asflicense 0m 37s The patch does not generate ASF License warnings.
451m 10s
Reason Tests
Failed junit tests hadoop.hdfs.server.datanode.TestBlockScanner
hadoop.hdfs.server.namenode.ha.TestEditLogTailer
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3313/1/artifact/out/Dockerfile
GITHUB PR https://github.com/apache/hadoop/pull/3313
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 0cb16d9f8243 4.15.0-142-generic #146-Ubuntu SMP Tue Apr 13 01:11:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / a121470e5b175de09f0697d510e9b26612b2f6e1
Default Java Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3313/1/testReport/
Max. process+thread count 2110 (vs. ulimit of 5500)
modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3313/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

hadoop-yetus avatar Aug 19 '21 11:08 hadoop-yetus

Hi @jojochuang @ferhui , could you please also review the code? Thanks a lot.

tomscut avatar Aug 20 '21 00:08 tomscut

@tomscut Thanks for contribution. I'm confused here. The log is necessary or not. If it's necessary, many logs as you said drown out other logs. If it's not necessary, I think DEBUG level is OK.

ferhui avatar Aug 20 '21 01:08 ferhui

@tomscut Thanks for contribution. I'm confused here. The log is necessary or not. If it's necessary, many logs as you said drown out other logs. If it's not necessary, I think DEBUG level is OK.

Thanks @ferhui for your review. I think this method is just detecting redundancies and deleting them, but there are no redundancies in most cases. IMO, the log is not necessary here.

tomscut avatar Aug 20 '21 01:08 tomscut

@tomscut Thanks for comments.

but there are no redundancies in most cases

I see that if shouldProcessExtraRedundancy returns true, It will enter this method. If no redundancies, it will not enter this method, is it right? Are there any reasons that it prints lots of logs, maybe there is a hidden reason ?

ferhui avatar Aug 20 '21 03:08 ferhui

@tomscut Thanks for comments.

but there are no redundancies in most cases

I see that if shouldProcessExtraRedundancy returns true, It will enter this method. If no redundancies, it will not enter this method, is it right? Are there any reasons that it prints lots of logs, maybe there is a hidden reason ?

Thanks @ferhui for your comments and careful consideration. I found this issue HDFS-9876, and we can refer to it.

Hi @Jing9 , could you please take a look at this. Thanks.

tomscut avatar Aug 20 '21 05:08 tomscut

There are no so many logs you mentioned in our cluster of thousands of nodes. So I guess there is a hidden reason behind of it. If we change it to DEBUG level, maybe we neglect some information.

ferhui avatar Aug 20 '21 06:08 ferhui

There are no so many logs you mentioned in our cluster of thousands of nodes. So I guess there is a hidden reason behind of it. If we change it to DEBUG level, maybe we neglect some information.

Do you also use EC policy? We had a cold data cluster of 350 nodes, all directories are using EC policy, and we ran into this problem.

tomscut avatar Aug 20 '21 06:08 tomscut

Yes, EC policy is used.

ferhui avatar Aug 20 '21 06:08 ferhui

Yes, EC policy is used.

Thank you very much for your quick reply. There may be other reasons, we will take a closer look.

tomscut avatar Aug 20 '21 07:08 tomscut

@tomscut Hi, does this WARN log be printed when only writing EC files ? This WARN logs also appeared in our cluster without writting any files, but not as many as you said. I found that the block in the WARN log belongs to the file written a long time ago. So, I have some guesses:

  • is there a daemon thread calling this method?
  • or other conditions trigger this method?

Here is our 3-hour running log. image

whbing avatar Aug 23 '21 05:08 whbing

@tomscut Hi, does this WARN log be printed when only writing EC files ? This WARN logs also appeared in our cluster without writting any files, but not as many as you said. I found that the block in the WARN log belongs to the file written a long time ago. So, I have some guesses:

  • is there a daemon thread calling this method?
  • or other conditions trigger this method?

Here is our 3-hour running log. image

Thanks @whbing for your comments. I found those logs were printed after completeFile. Triggered by FSDirWriteFileOp#completeFileInternal(). Our hadoop version is 3.1.0.

  private static boolean completeFileInternal(
      FSNamesystem fsn, INodesInPath iip,
      String holder, Block last, long fileId)
      throws IOException {
    (...)
    fsn.finalizeINodeFileUnderConstruction(src, pendingFile,
        Snapshot.CURRENT_STATE_ID, true);
    (...)
    return true;
  }

image

tomscut avatar Aug 23 '21 06:08 tomscut

@tomscut Hi, does this WARN log be printed when only writing EC files ? This WARN logs also appeared in our cluster without writting any files, but not as many as you said. I found that the block in the WARN log belongs to the file written a long time ago. So, I have some guesses:

  • is there a daemon thread calling this method?
  • or other conditions trigger this method?

Here is our 3-hour running log. image

Thanks @whbing for your comments. I found those logs were printed after completeFile. Triggered by FSDirWriteFileOp#completeFileInternal().

  private static boolean completeFileInternal(
      FSNamesystem fsn, INodesInPath iip,
      String holder, Block last, long fileId)
      throws IOException {
    (...)
    fsn.finalizeINodeFileUnderConstruction(src, pendingFile,
        Snapshot.CURRENT_STATE_ID, true);
    (...)
    return true;
  }

image

@tomscut Hi, does this WARN log be printed when only writing EC files ? This WARN logs also appeared in our cluster without writting any files, but not as many as you said. I found that the block in the WARN log belongs to the file written a long time ago. So, I have some guesses:

  • is there a daemon thread calling this method?
  • or other conditions trigger this method?

Here is our 3-hour running log. image

Yes, those WARN logs were printed only writing EC files. Because the logs printed in BlockManager#chooseExcessRedundancyStriped().

tomscut avatar Aug 23 '21 06:08 tomscut

@ferhui @whbing I found if balancer is started, the log will be printed. If balancer is stopped, the log will not be printed.

tomscut avatar Sep 02 '21 00:09 tomscut

@ferhui @whbing I found if balancer is started, the log will be printed. If balancer is stopped, the log will not be printed.

@tomscut Hi, Do you find the root cause? We encountered the same issue after balancer started.

Neilxzn avatar Aug 03 '22 08:08 Neilxzn