hive icon indicating copy to clipboard operation
hive copied to clipboard

HIVE-26339: HIVE-26047 Related LIKE pattern issues

Open ryukobayashi opened this issue 2 years ago • 1 comments

What changes were proposed in this pull request?

Vectorized LIKE udf is taking proportionately higher time depending on the size of input string in UDF. And, I found a problem with some regular expressions.

Why are the changes needed?

To support filter condition based on input data.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added testcase as part of PR.

ryukobayashi avatar Jan 11 '24 08:01 ryukobayashi

Quality Gate Passed Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

12 New issues
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

sonarqubecloud[bot] avatar Jan 11 '24 11:01 sonarqubecloud[bot]

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.

github-actions[bot] avatar Mar 12 '24 00:03 github-actions[bot]

@deniskuzZ I have corrected what you pointed out, so could you please check it?

ryukobayashi avatar Mar 21 '24 02:03 ryukobayashi

@ryukobayashi, found a bunch of issues, please check if we can replace with the below snippet

    private boolean matches(String pattern) {
      UDFLike.PatternType lastType = UDFLike.PatternType.NONE;
      int length = pattern.length();
      char lastChar = 0;

      for (int i = 0; i < length; i++) {
        char n = pattern.charAt(i);
        if (n == '_' && lastChar != '\\') { // such as "a_bc"
          return false;
        
        } else if (n == '%') {
          if (i == 0) { // such as "%abc"
            lastType = UDFLike.PatternType.END;
          
          } else if (i < length - 1) {
            if (lastChar != '\\') { // such as "a%bc"
              lastType = UDFLike.PatternType.CHAINED;
            }
          } else {
            if (lastChar != '\\') {
              if (lastType == UDFLike.PatternType.END) { // such as "%abc%"
                lastType = UDFLike.PatternType.MIDDLE;
              } else if (lastType != UDFLike.PatternType.CHAINED) {
                lastType = UDFLike.PatternType.BEGIN; // such as "abc%"
              }
            }
          }
        }
        lastChar = n;
      }
      return type == lastType;
    }

looks like we get Complex type only when "_" is present, is that correct?

    private String get(String pattern) {
      int startIndex = 0;
      int endIndex = pattern.length();

      switch (type) {
        case BEGIN:
          endIndex--;
          break;
        case END:
          startIndex++;
          break;
        case MIDDLE:
          startIndex++;
          endIndex--;
          break;
      }
      return pattern.substring(startIndex, endIndex);
    }

    public Checker apply(String pattern) {
      if (matches(pattern)) {
        try {
          return checker.getConstructor(String.class).newInstance(
              get(pattern));
        } catch (Exception e) {
          throw new IllegalArgumentException("unable to initialize Checker");
        }
      }

      return null;
    }

deniskuzZ avatar Apr 08 '24 14:04 deniskuzZ

@deniskuzZ

found a bunch of issues, please check if we can replace with the below snippet

I will try it.

looks like we get Complex type only when "_" is present, is that correct?

Yes, it hasn't changed from the original logic.

ryukobayashi avatar Apr 10 '24 09:04 ryukobayashi

@deniskuzZ I have fixed it as per your suggestion. But, we did not integrate it with UDFLike because it would require major fixes to the logic. Please let me know if you think it should be fixed.

ryukobayashi avatar Apr 16 '24 03:04 ryukobayashi

@deniskuzZ I have fixed it as per your suggestion. But, we did not integrate it with UDFLike because it would require major fixes to the logic. Please let me know if you think it should be fixed.

hi @ryukobayashi, since we directly identify the pattern, why do we need multiple matcher factories, can we have just 1 factory?

public class FilterStringColLikeStringScalar extends AbstractFilterStringColLikeStringScalar {
  private static final long serialVersionUID = 1L;

  private static final List<CheckerFactory> CHECKER_FACTORIES = ImmutableList.of(
    pattern -> {
      UDFLikePattern udfLike = UDFLikePattern.matcher(pattern);
      try {
        return udfLike.checker.getConstructor(String.class).newInstance(
          udfLike.format(pattern));
      } catch (Exception e) {
        throw new IllegalArgumentException("unable to initialize Checker");
      }
    });

  public FilterStringColLikeStringScalar() {
    super();
  }

  public FilterStringColLikeStringScalar(int colNum, byte[] likePattern) {
    super(colNum, null);
    super.setPattern(new String(likePattern, StandardCharsets.UTF_8));
  }

  @Override
  protected List<CheckerFactory> getCheckerFactories() {
    return CHECKER_FACTORIES;
  }

  private enum UDFLikePattern {
    BEGIN(BeginChecker.class),
    END(EndChecker.class),
    MIDDLE(MiddleChecker.class),
    NONE(NoneChecker.class),
    CHAINED(ChainedChecker.class),
    COMPLEX(ComplexChecker.class);

    Class<? extends Checker> checker;

    UDFLikePattern(Class<? extends Checker> checker) {
      this.checker = checker;
    }
    
    private static UDFLikePattern matcher(String pattern) {
      UDFLikePattern lastType = NONE;
      int length = pattern.length();
      char lastChar = 0;

      for (int i = 0; i < length; i++) {
        char n = pattern.charAt(i);
        if (n == '_' && lastChar != '\\') { // such as "a_bc"
          return COMPLEX;
        } else if (n == '%') {
          if (i == 0) { // such as "%abc"
            lastType = END;
          } else if (i < length - 1) {
            if (lastChar != '\\') { // such as "a%bc"
              lastType = CHAINED;
            }
          } else {
            if (lastChar != '\\') {
              if (lastType == END) { // such as "%abc%"
                lastType = MIDDLE;
              } else if (lastType != CHAINED) {
                lastType = BEGIN; // such as "abc%"
              }
            }
          }
        }
        lastChar = n;
      }
      return lastType;
    }

    private String format(String pattern) {
      int startIndex = 0;
      int endIndex = pattern.length();

      switch (this) {
        case BEGIN:
          endIndex--;
          break;
        case END:
          startIndex++;
          break;
        case MIDDLE:
          startIndex++;
          endIndex--;
          break;
        case COMPLEX:
          return "^" + UDFLike.likePatternToRegExp(pattern) + "$";
      }

      return pattern.substring(startIndex, endIndex);
    }
  }
}

deniskuzZ avatar Apr 16 '24 07:04 deniskuzZ

@deniskuzZ This is original and I don't know why multiple matcher factories are needed. As you opinion, one factory should be no problem. So, I fixed.

ryukobayashi avatar Apr 16 '24 08:04 ryukobayashi

@deniskuzZ This is original and I don't know why multiple matcher factories are needed. As you opinion, one factory should be no problem. So, I fixed.

👍 nice, 1 last thing, I forgot to post it, could we please override format method per enum

private enum UDFLikePattern {
    BEGIN(BeginChecker.class) {
      @Override
      String format(String pattern) {
        return pattern.substring(0, pattern.length() - 1);
      }
    },
    END(EndChecker.class) {
      @Override
      String format(String pattern) {
        return pattern.substring(1);
      }
    },
    MIDDLE(MiddleChecker.class) {
      @Override
      String format(String pattern) {
        return pattern.substring(1, pattern.length() - 1);
      }
    },
    COMPLEX(ComplexChecker.class) {
      @Override
      String format(String pattern) {
        return "^" + UDFLike.likePatternToRegExp(pattern) + "$";
      }
    },
    CHAINED(ChainedChecker.class),
    NONE(NoneChecker.class);

    Class<? extends Checker> checker;

    UDFLikePattern(Class<? extends Checker> checker) {
      this.checker = checker;
    }

    String format(String pattern) {
      return pattern;
    }

deniskuzZ avatar Apr 16 '24 18:04 deniskuzZ

@deniskuzZ Thanks for review. I fixed.

ryukobayashi avatar Apr 17 '24 04:04 ryukobayashi

Quality Gate Passed Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

sonarqubecloud[bot] avatar Apr 19 '24 02:04 sonarqubecloud[bot]

@deniskuzZ Thanks! test passed.

ryukobayashi avatar Apr 19 '24 04:04 ryukobayashi