drill icon indicating copy to clipboard operation
drill copied to clipboard

DRILL-4573: Zero copy LIKE, REGEXP_MATCHES, SUBSTR

Open jcmcote opened this issue 8 years ago • 21 comments

All the functions using the java.util.regex.Matcher are currently creating Java string objects to pass into the matcher.reset(). However this creates unnecessary copy of the bytes and a Java string object. The matcher uses a CharSequence, so instead of making a copy we can create an adapter from the DrillBuffer to the CharSequence interface. Gains of 25% in execution speed are possible when going over VARCHAR of 36 chars. The gain will be proportional to the size of the VARCHAR.

jcmcote avatar Apr 02 '16 03:04 jcmcote

Except for the comments, LGTM. Do you run the unit test ?

hsuanyi avatar Apr 10 '16 02:04 hsuanyi

And can you share the details regarding the performance improvement? For example, how the setting and data look like? Observed % of improvements?

hsuanyi avatar Apr 10 '16 03:04 hsuanyi

I have updated the JIRA ticket with the latest patch. Using this approach you can expect 25% less time to execute the like UDF. I've added details in the JIRA ticket. All tests pass.

jcmcote avatar Apr 10 '16 18:04 jcmcote

Just wondering when this patch will be included in the drill code base. I updated the JIRA issue with a new patch file. Thank you

jcmcote avatar Apr 23 '16 01:04 jcmcote

Yes, we'll get it in there. I was thinking that Hsuan was moving this along. Let me see if one of us can get this in. The contribution is very much appreciated. Sorry that people haven't been more helpful.

jacques-n avatar Apr 23 '16 01:04 jacques-n

I guess it slipped through the cracks. No worries.

I did not convert the "isnumeric" UDF to use a CharSequence wrapper since I don't really understand how the templating mechanism works. How difficult would it be to change isnumeric to use a wrapper?

jcmcote avatar Apr 23 '16 01:04 jcmcote

@jcmcote Sorry for the delay. Let me run some tests and commit it if it passes the tests. Your contribution is highly appreciated.

hsuanyi avatar Apr 23 '16 02:04 hsuanyi

just following up, did the tests all pass?

jcmcote avatar Apr 26 '16 22:04 jcmcote

Yes, it passed the tests. I will push it shortly.

hsuanyi avatar Apr 26 '16 22:04 hsuanyi

Just pushed to master; Thanks for the contributions.

hsuanyi avatar Apr 27 '16 00:04 hsuanyi

Hey Sean,

You did not apply the correct patch. I made the code review changes you asked and updated the patch in the jira ticket. But somehow the code that made it into the git repo is the initial patch.

Have a look at the one in the jira ticket you will see it has comments in the CharSequenceWrapper.

jcmcote avatar Apr 29 '16 01:04 jcmcote

Hi @jcmcote, I see. But can you do us a favor. Regex_replace()'s behavior changes after your patch https://issues.apache.org/jira/browse/DRILL-4645

Can you help fix this? And would you mind merging the missing piece with the new patch?

hsuanyi avatar Apr 29 '16 16:04 hsuanyi

Hey @hsuanyi ok, fixed the regex_replace and added a test case. There's a new patch on the ticket. Please apply it and commit.

thanks

jcmcote avatar May 05 '16 22:05 jcmcote

@hsuanyi just following up to see if the patch was applied.

jcmcote avatar May 10 '16 02:05 jcmcote

@jcmcote I'll review your new patch if it's ready.

Can you please submit the new patch as a new PR (or maybe I miss some part of conversation and just could not find the link)?

Thanks for your contribution to Drill.

jinfengni avatar May 12 '16 18:05 jinfengni

I have attached the patch to the Jira ticket.

I'm not sure of the usual process.. but from what I was reading we can create patches and attache them to Jira tickets.

On Thu, May 12, 2016 at 2:10 PM, Jinfeng Ni [email protected] wrote:

@jcmcote https://github.com/jcmcote I'll review your new patch if it's ready.

Can you please submit the new patch as a new PR (or maybe I miss some part of conversation and just could not find the link)?

Thanks for your contribution to Drill.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/apache/drill/pull/458#issuecomment-218839576

jcmcote avatar May 12 '16 23:05 jcmcote

@jcmcote ,

(copy from my comment in DRILL-4573)

I re-visited your first patch. Looks like that the change you made would cause incorrect result when the input string is a udf-8 with each character consisting of multiple bytes. In particular, the original implementation would encode the byte array with udf-8 (which is the default encoding in drill). However, in your CharSequenceWrapper, you will treat each byte as a character. This will cause incorrect result for case when a character is represented by >1 bytes.

For instance, look at the following example, the first query of regexp_matches will produce wrong result.

{code:sql} select regexp_matches('München', 'München') res3 from (values(1)); +--------+ | res3 | +--------+ | false | +--------+ 1 row selected (0.148 seconds) 0: jdbc:drill:zk=local> select regexp_matches('abc', 'abc') from (values(1)); +---------+ | EXPR$0 | +---------+ | true | +---------+ 1 row selected (0.189 seconds) 0: jdbc:drill:zk=local> select 'München' = 'München' res1 from (values(1)); +-------+ | res1 | +-------+ | true | +-------+ 1 row selected (0.186 seconds) {code:sql}

Here is the result for 1st query, without your patch

{code:sql} select regexp_matches('München', 'München') res3 from (values(1)); +-------+ | res3 | +-------+ | true | +-------+ {code:sql}

I think you should modify CharSequenceWrapper, so that the encoding method is honored.

jinfengni avatar May 17 '16 21:05 jinfengni

I know where the issue lies. It's related to the mapping between the unicode encoded bytes from the drillbuffer and the conversion to chars. I'll will address the mapping issue and get back to you.

jcmcote avatar May 19 '16 00:05 jcmcote

@jcmcote , I opened DRILL-4688 to track this incorrect regression. Please use DRILL-4688 to submit either a patch or pull request to address this issue.

As part of the verification for this issue, QA plan to add a set of test suite to cover the cases of multi-byte characters. Previously, there is no such test coverage and hence we are not able to catch this regression.

jinfengni avatar May 19 '16 17:05 jinfengni

@jcmcote , just wanna to check the current status of your latest fix.

FYI, you may consider using couple of existing methods [1] [2], when you decode the byte arrays as UDF8 characters, in CharaSequenceWrapper.

[1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctionUtil.java#L77

[2] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctionUtil.java#L28

jinfengni avatar May 25 '16 17:05 jinfengni

@jcmcote Did you get a chance to follow up on @jinfengni 's comment on this PR?

kkhatua avatar Jan 09 '18 19:01 kkhatua