HDFS-16793. [SBN read] ObserverNN failed to select streaming inputStream from JournalNode

Open ZanderXu opened this issue 3 years ago • 3 comments

Description of PR

In our prod environment, we encountered a case where the observer namenode failed to select streaming inputStreams and hit a timeout exception. The related code is below:

@Override
public void selectInputStreams(Collection<EditLogInputStream> streams,
    long fromTxnId, boolean inProgressOk,
    boolean onlyDurableTxns) throws IOException {
  if (inProgressOk && inProgressTailingEnabled) {
    ...
  }
  // Timeout here.
  selectStreamingInputStreams(streams, fromTxnId, inProgressOk,
      onlyDurableTxns);
}

After looking into the code, I found that the JournalNode performs one very expensive and redundant operation: it scans all of the edits of the last in-progress segment, with disk IO. The related code is below:

public List<RemoteEditLog> getRemoteEditLogs(long firstTxId,
    boolean inProgressOk) throws IOException {
  File currentDir = sd.getCurrentDir();
  List<EditLogFile> allLogFiles = matchEditLogs(currentDir);
  List<RemoteEditLog> ret = Lists.newArrayListWithCapacity(
      allLogFiles.size());
  for (EditLogFile elf : allLogFiles) {
    if (elf.hasCorruptHeader() || (!inProgressOk && elf.isInProgress())) {
      continue;
    }
    // Here.
    if (elf.isInProgress()) {
      try {
        elf.scanLog(getLastReadableTxId(), true);
      } catch (IOException e) {
        LOG.error("got IOException while trying to validate header of " +
            elf + ".  Skipping.", e);
        continue;
      }
    }
    if (elf.getFirstTxId() >= firstTxId) {
      ret.add(new RemoteEditLog(elf.firstTxId, elf.lastTxId,
          elf.isInProgress()));
    } else if (elf.getFirstTxId() < firstTxId && firstTxId <= elf.getLastTxId()) {
      // If the firstTxId is in the middle of an edit log segment. Return this
      // anyway and let the caller figure out whether it wants to use it.
      ret.add(new RemoteEditLog(elf.firstTxId, elf.lastTxId,
          elf.isInProgress()));
    }
  }
  
  Collections.sort(ret);
  
  return ret;
} 

Expensive:

  • This scan operation reads all of the edits of the in-progress segment from disk.

Redundant:

  • This scan operation only finds the lastTxId of the in-progress segment.
  • But the caller, getEditLogManifest(long sinceTxId, boolean inProgressOk) in Journal.java, ignores the lastTxId of the in-progress segment and instead uses getHighestWrittenTxId() as the lastTxId of the in-progress segment that it returns to the namenode.
  • So, the scan operation is redundant.
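
To make the redundancy argument concrete, here is a minimal toy model in Java. It is only a sketch under stated assumptions: Segment, buildManifest, sampleSegments, and totalEditsRead are hypothetical names invented for this illustration and are not Hadoop classes or methods. It assumes, as the description states, that the manifest builder overwrites the lastTxId of an in-progress segment with the highest written txid, and it compares the manifest produced with and without the scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only; none of these types are Hadoop classes. */
public class InProgressScanSketch {

  /** Hypothetical stand-in for one edit log segment stored on a JournalNode. */
  public static final class Segment {
    final long firstTxId;
    long lastTxId;             // stays -1 for an in-progress segment until scanned
    final boolean inProgress;
    final long[] txIdsOnDisk;  // simulates the edits stored in the segment file
    int editsRead = 0;         // counts simulated reads

    Segment(long firstTxId, long lastTxId, boolean inProgress, long[] txIdsOnDisk) {
      this.firstTxId = firstTxId;
      this.lastTxId = lastTxId;
      this.inProgress = inProgress;
      this.txIdsOnDisk = txIdsOnDisk;
    }

    /** Simulates the scan: visits every edit just to learn the last txid. */
    void scan() {
      for (long txId : txIdsOnDisk) {
        editsRead++;
        lastTxId = txId;
      }
    }
  }

  /** Builds manifest entries, overriding lastTxId for in-progress segments. */
  public static List<String> buildManifest(List<Segment> segments,
      long highestWrittenTxId, boolean scanInProgress) {
    List<String> manifest = new ArrayList<>();
    for (Segment s : segments) {
      if (s.inProgress && scanInProgress) {
        s.scan();
      }
      // Assumption from the description: the scanned lastTxId is not used here.
      long last = s.inProgress ? highestWrittenTxId : s.lastTxId;
      manifest.add(s.firstTxId + "-" + last + (s.inProgress ? "(in-progress)" : ""));
    }
    return manifest;
  }

  /** One finalized segment (txids 1..3) and one in-progress segment (txids 4..8). */
  public static List<Segment> sampleSegments() {
    return Arrays.asList(
        new Segment(1, 3, false, new long[] {1, 2, 3}),
        new Segment(4, -1, true, new long[] {4, 5, 6, 7, 8}));
  }

  /** Sums the simulated reads across all segments. */
  public static int totalEditsRead(List<Segment> segments) {
    int total = 0;
    for (Segment s : segments) {
      total += s.editsRead;
    }
    return total;
  }

  public static void main(String[] args) {
    List<Segment> scanned = sampleSegments();
    List<Segment> unscanned = sampleSegments();
    List<String> withScan = buildManifest(scanned, 8, true);
    List<String> withoutScan = buildManifest(unscanned, 8, false);
    System.out.println("same manifest: " + withScan.equals(withoutScan));
    System.out.println("edits read with scan: " + totalEditsRead(scanned));
    System.out.println("edits read without scan: " + totalEditsRead(unscanned));
  }
}
```

In this model the scan cost grows with the number of edits in the in-progress segment, while the manifest handed back to the caller does not depend on the scan at all. The real getRemoteEditLogs also compares firstTxId against the segment's lastTxId, which this sketch deliberately leaves out.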

If end users enable the Observer Read feature, the latency of tailing edits from the JournalNode is very important, whether in the normal path or the fallback path.

I found no comments explaining this scan logic, either in the code or in HDFS-6634, which added it.

The only effect I can see is that it checks the in-progress segment for corruption. But the namenode can already handle a corrupted in-progress segment.

ZanderXu avatar Oct 05 '22 04:10 ZanderXu

:broken_heart: -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:----------|--------:|:--------|:--------|
| +0 :ok: | reexec | 27m 52s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | | _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 44m 9s | | trunk passed |
| +1 :green_heart: | compile | 1m 43s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 1m 40s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 20s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 43s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 20s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 41s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 47s | | trunk passed |
| +1 :green_heart: | shadedclient | 26m 28s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 30s | | the patch passed |
| +1 :green_heart: | compile | 1m 33s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 1m 33s | | the patch passed |
| +1 :green_heart: | compile | 1m 24s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 24s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 2s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 30s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 0s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 40s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 44s | | the patch passed |
| +1 :green_heart: | shadedclient | 27m 29s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| -1 :x: | unit | 355m 59s | /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 57s | | The patch does not generate ASF License warnings. |
| | | 507m 6s | | |

| Reason | Tests |
|:-------|:------|
| Failed junit tests | hadoop.hdfs.TestDFSInotifyEventInputStreamKerberized |
| | hadoop.hdfs.server.mover.TestMover |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4971/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4971 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 94ea6c8de82b 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / dade7665fe054a45f14cf2966f3f0f8bd09dcaee |
| Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4971/1/testReport/ |
| Max. process+thread count | 2308 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4971/1/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

hadoop-yetus avatar Oct 05 '22 13:10 hadoop-yetus

Thanks @ZanderXu for reporting this. The changes make sense. Can you look at the test failures as well? Thanks.

hotcodemacha avatar Oct 12 '22 17:10 hotcodemacha

Thanks @ZanderXu for reporting this. The changes make sense. Can you look at the test failures as well? Thanks.

@ashutoshcipher Thanks for your review and the reminder. I have fixed the failed UT testWithKerberizedCluster. The other failed UT, TestMover, passes locally and is not related to this patch.

ZanderXu avatar Oct 13 '22 06:10 ZanderXu

:broken_heart: -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:----------|--------:|:--------|:--------|
| +0 :ok: | reexec | 1m 35s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | | _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 42m 42s | | trunk passed |
| +1 :green_heart: | compile | 1m 57s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 1m 33s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 20s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 48s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 27s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 38s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 58s | | trunk passed |
| +1 :green_heart: | shadedclient | 28m 47s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 39s | | the patch passed |
| +1 :green_heart: | compile | 1m 41s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 1m 41s | | the patch passed |
| +1 :green_heart: | compile | 1m 28s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 28s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 6s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 38s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 2s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 32s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 4m 3s | | the patch passed |
| +1 :green_heart: | shadedclient | 27m 45s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 :green_heart: | unit | 364m 36s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 10s | | The patch does not generate ASF License warnings. |
| | | 491m 20s | | |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4971/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4971 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 84ac63d4e941 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 5f1de83d50d8245b92ebdbeb2dd20a5d462286c1 |
| Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4971/2/testReport/ |
| Max. process+thread count | 2088 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4971/2/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

hadoop-yetus avatar Oct 13 '22 14:10 hadoop-yetus