OpenMLDB icon indicating copy to clipboard operation
OpenMLDB copied to clipboard

built-in function: Support `locate(substr, str[, pos])` function

Open jingchen2222 opened this issue 3 years ago • 8 comments

Is your feature request related to a problem? Please describe.

locate(substr, str[, pos])

Returns the position of the first occurrence of substr in str after position pos. The given pos and return value are 1-based.

Examples:

> SELECT locate('bar', 'foobarbar');
 4
> SELECT locate('bar', 'foobarbar', 5);
 7

Reference:

https://spark.apache.org/docs/2.3.0/api/sql/#locate

Describe the solution you'd like

We do provide a built-in function development guide for you. Please check OpenMLDB Build-In Function Develop Guide or the Chinese version for help.

Please make sure you:

  • [ ] Add built-in C++ function in src/udf/udf.h and src/udf/udf.cc
  • [ ] Register function to default library from function void DefaultUdfLibrary::IniStringUdf() in src/udf/default_udf_library.cc.
  • [ ] Add related unit test in src/codegen/udf_ir_builder_test.cc
  • [ ] Documenting function

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

jingchen2222 avatar Dec 01 '21 23:12 jingchen2222

Hello, I would like some clarification on the parameter and return types of this function. I've written the function as 1 but I am aware that it is likely something like 2 or 3. Please let me know so I can commit my changes.

  1. unsigned int locate(std::string substr, std::string str, unsigned int pos =0 ) {..}
  2. int32_t locate( hybridse::codec::StringRef substr, hybridse::codec::StringRef str,int32_t pos =0 ) {...}
  3. int32_t locate(string substr,string str,int32_t pos =0 ) {...}

Mouray-Hutchinson avatar Jan 24 '22 02:01 Mouray-Hutchinson

Hi Mouray-Hutchinson, very glad that you would like to help. Firstly, I would like to remove the SELECT POSITION('bar' IN 'foobarbar') since it might be conflict with our IN expression. Maybe we can simply support locate('bar', 'foobarbar') and locate('bar', 'foobarbar', 5) for this issue. Secondly, we had better implement a function looks like int32_t locate( hybridse::codec::StringRef substr, hybridse::codec::StringRef str,int32_t pos =0 ) in udf.cc

Thank you again. Please feel free to reach out to me regarding any questions or updates.

jingchen2222 avatar Jan 24 '22 03:01 jingchen2222

@jingchen2222 two questions

  1. What should the function return if the sub string is not found? I have it returning -1
  2. If I used

t32_t locate( hybridse::codec::StringRef substr, hybridse::codec::StringRef str,int32_t pos =0 )

then is the line following correct syntax for within the test? Test File Location: udf_ir_builder_test.cc

CheckUdf<Nullable<StringRef>, Nullable<StringRef>,Nullable<int32_t>>(udf_name, StringRef("all"), StringRef("helloAll"),int32_t(-1) );

CheckUdf<Nullable<StringRef>, Nullable<StringRef>,Nullable<int32_t>>(udf_name, StringRef("abcd"), StringRef("abcdefg"),int32_t(1) );

Mouray-Hutchinson avatar Jan 26 '22 05:01 Mouray-Hutchinson

locate

@Mouray-Hutchinson Sorry for the late reply.

Locate(substr, str)

Returns the position of the first occurrence of substring substr in string str. This function can be regarded as a variant of Locate(substr, str, position=1)

Locate(substr, str, position)

Locate(substr, str, position) Returns the position of the first occurrence of substring substr in string str, starting at position pos.

The rules are:

  • Returns 0 if substr is not in str.
  • Returns NULL if any argument is NULL.
  • The string is case-sensitive

Locate will be implemented in a way very similar to Mysql except that our string comparison is case-sensitive. Check https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_locate for more details.

> SELECT LOCATE('bar', 'foobarbar');
-- 4

> SELECT LOCATE('bar', 'foobarbar', 5);
-- 7



-- Returns 0 if substr is not in str.
> SELECT LOCATE('xbar', 'foobar');
-- 0  

-- String is case-sensitive
> SELECT LOCATE('all', 'helloAll', 5);
-- 0

-- Returns NULL if any argument is NULL.
> SELECT LOCATE('bar', NULL);
-- NULL
> SELECT LOCATE(NULL,'foobar');
-- NULL


jingchen2222 avatar Feb 09 '22 22:02 jingchen2222

Hi Mouray-Hutchinson, very glad that you would like to help. Firstly, I would like to remove the SELECT POSITION('bar' IN 'foobarbar') since it might be conflict with our IN expression. Maybe we can simply support locate('bar', 'foobarbar') and locate('bar', 'foobarbar', 5) for this issue. Secondly, we had better implement a function looks like int32_t locate( hybridse::codec::StringRef substr, hybridse::codec::StringRef str,int32_t pos =0 ) in udf.cc

Thank you again. Please feel free to reach out to me regarding any questions or updates.

@Mouray-Hutchinson The position should be 1 by default. Secondly, we had better implement a function looks like int32_t locate( hybridse::codec::StringRef substr, hybridse::codec::StringRef str,int32_t pos =1 ) in udf.cc

jingchen2222 avatar Feb 09 '22 22:02 jingchen2222

hi @Mouray-Hutchinson do you have any progress for this? we are looking forward to your contribution. thanks.

lumianph avatar Feb 22 '22 07:02 lumianph

I'd like to tackle it, please assign to me~

mangoGoForward avatar Aug 08 '22 01:08 mangoGoForward

I'd like to tackle it, please assign to me~

thank you !!

lumianph avatar Aug 09 '22 07:08 lumianph

I tried but failed, if anyone others want to contribute, please pick it up

mangoGoForward avatar Oct 11 '22 02:10 mangoGoForward