manticoresearch icon indicating copy to clipboard operation
manticoresearch copied to clipboard

Ability to specify stopword list as part of create table command

Open kroky opened this issue 10 months ago • 5 comments

Is your feature request related to a problem? Please describe. Sometimes web server executed scripts have hard time finding a safe directory to write the stopwords list file that is also readable by the manticore process. We have this problem in stock Virtualmin/Tiki setup where anything under /home/user is only user-readable and system tmp directory is sticky, which introduces delete problems with the stopwords files. Furthermore, when we frequently rebuild our indexes, it is inconvenient to keep track of the stopwords file and make sure to delete it after the corresponding index is deleted.

Describe the solution you'd like Can we specify the stopword list as part of the table creation command (both sql or http json)? It could be a string of space-delimited words, a csv line, a json-encoded string or whatever you decide. Having the ability to specify the list at table creation time and forgetting about any corresponding files after that will be quite a relief.

Describe alternatives you've considered Currently, we are creating the stopwords list file on demand, specifying that on table creation and trying to remove it when index is removed. However, we run into readability and sticky-flag delete problems.

Additional context It is totally fine to impose a max length limit to the stopword list, so longer lists require a file while shorter lists might be passed as part of the command.

kroky avatar Apr 10 '24 13:04 kroky

Thanks for the issue @kroky.

The related task is https://github.com/manticoresoftware/manticoresearch/issues/2083 where a problem with backing up via mysqldump/restoring external files was revealed.

One solution is to make it possible to do:

create table 
  ...
  stopwords='a; the; smth'
  wordforms='running > run; ran > run'
  exceptions='AT&T > AT&T; MS Windows => ms windows'

Internally, it can still use external files created/updated automatically on create table / alter table. It's just important that SHOW CREATE TABLE uses the above format. It should also simplify things like:

  • creating a table with stopwords/exceptions/wordforms remotely
  • updating wordforms
  • mysqldumping remotely
  • handling permission issues

What we need to ensure in this case is proper escaping, since exceptions, wordforms, and stopwords can include ; and other characters that may be sensitive in this context (in the context of a configuration file or an SQL command). Too long values is another thing to think about.

sanikolaev avatar May 01 '24 10:05 sanikolaev

Yes, that's a great idea @sanikolaev ! Using csv with ; as separator and normal csv escaping procedures should probably work here. I assume queries should already be utf8 encoded as specifying special characters in these lists will be common.

kroky avatar May 02 '24 06:05 kroky

There shouldn't be a problem with 2/3/4 byte characters. Example:

mysql> drop table if exists t; create table t(f text); insert into t values(0, 'A ç 汉 🚀'); select * from t;
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t(f text)
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t values(0, 'A ç 汉 🚀')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
select * from t
--------------

+---------------------+---------------+
| id                  | f             |
+---------------------+---------------+
| 1515858028807585797 | A ç 汉 🚀       |
+---------------------+---------------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 0ms ---

➜  ~ curl -s 0:9308/sql\?mode=raw -d "query=insert%20into%20t%20values%280%2C%20%27A%20%C3%A7%20%E6%B1%89%20%F0%9F%9A%80%27%29%3B"|jq .
[
  {
    "total": 1,
    "error": "",
    "warning": ""
  }
]

➜  ~ curl -s 0:9308/sql\?mode=raw -d "query=select%20%2A%20from%20t"
[{
"columns":[{"id":{"type":"long long"}},{"f":{"type":"string"}}],
"data":[
{"id":1515858028807585798,"f":"A ç 汉 🚀"},
{"id":1515858028807585797,"f":"A ç 汉 🚀"}
],
"total":2,

Discussed with the team. We're inclined to implement the following behavior changes:

  • exceptions, wordforms, stopwords, and hitless_words should be able to accept not only paths to files to take values from, but the values directly.
  • The general syntax is:
    exceptions/wordforms/stopwords/hitless_words = '<values>'
    
  • <values>:
    • must be separated with ;
    • there can be spaces around ;
    • > or => can be used to separate elements of a pair of values, i.e., from => to or from > to
    • there can be spaces around > and =>
    • =, >, and ; are to be escaped with \ to be used as non-special characters, e.g., AT\=&T\>\; => AT&T; smthelse > smth meaning AT=T>; => AT&T AND smthelse => smth
  • This behavior change applies only to RT mode (CREATE TABLE). We won't support it in plain mode (in a configuration file).
  • SHOW CREATE TABLE should return not paths, but the values.
  • The create table functionality can be implemented in Buddy, but it doesn't seem to be a big deal to do it right in the daemon.

We'll have another round of discussion to confirm the above.

sanikolaev avatar May 02 '24 10:05 sanikolaev

Discussed more. Updated spec:

  • exceptions, wordforms, stopwords, and hitless_words should be able to accept not only paths to files to take values from, but the values directly.
  • The general syntax is:
    exceptions/wordforms/stopwords/hitless_words = '<values>'
    
  • <values>:
    • must be separated with ;
    • there can be spaces around ;
    • > or => can be used to separate elements of a pair of values, i.e., from => to or from > to
    • there can be spaces around > and =>
    • =, >, and ;, ~ are to be escaped with \ to be used as non-special characters (rel. issue https://github.com/manticoresoftware/manticoresearch/issues/2146), e.g., AT\=&T\>\; => AT&T; smthelse > smth meaning AT=T>; => AT&T AND smthelse => smth
  • This behavior change applies only to RT mode (CREATE TABLE). We won't support it in plain mode (in a configuration file).
  • SHOW CREATE TABLE should return not paths, but the values.
  • The create table functionality can be implemented in Buddy, but it doesn't seem to be a big deal to do it right in the daemon.

sanikolaev avatar May 07 '24 06:05 sanikolaev

Blocked by https://github.com/manticoresoftware/manticoresearch/issues/2146

sanikolaev avatar May 08 '24 08:05 sanikolaev

Blocked by https://github.com/manticoresoftware/manticoresearch/issues/2146

Unblocked

sanikolaev avatar May 31 '24 04:05 sanikolaev