
Slow performance SNIPPETS on agent index

Open daikoz opened this issue 3 years ago • 3 comments

Hi,

To simplify the issue, I have two indexes:

....  
  
index WebPages  
{  
    type  = distributed  
  
    local = WebPages0  
    local = WebPages1  
    local = WebPages2  
    local = WebPages3  
}  
  
index WebPagesLB  
{  
     type             = distributed  
     agent_persistent = sphinxserver:9312:WebPages  
     ha_strategy      = nodeads  
}  
  

When I execute SNIPPET on WebPages, the result time is ~40ms:

SELECT Id, SNIPPET(Body, QUERY()) FROM WebPages WHERE MATCH('modele');
  
Now, I execute SNIPPET on WebPagesLB and the result time is 1.2s!!!  
  

SELECT Id, SNIPPET(Body, QUERY()) FROM WebPagesLB WHERE MATCH('modele');

If I remove the SNIPPET call, the result time is the same.

sphinxserver is localhost.

Why?

daikoz avatar Jun 04 '21 18:06 daikoz

➤ Sergey Nikolaev commented:

I can't reproduce it like this:

snikolaev@dev:~$ cat csv_dist.conf 
source src { 
    type = csvpipe 
    csvpipe_command = for n in `seq 1 100000`; do echo -n "$n,"; echo $n|md5sum|head -c 10; echo; done 
#    csvpipe_field = f 
    csvpipe_field_string = f 
} 
 
index idx1 { 
    type = plain 
    source = src 
    path = idx1 
    dict = keywords 
    access_plain_attrs = mlock 
    access_blob_attrs = mlock 
    access_doclists = mlock 
    access_hitlists = mlock 
    min_infix_len = 2 
#    stored_fields = f 
} 
 
index idx2:idx1 { 
    path = idx2 
} 
 
index idx3:idx1 { 
    path = idx3 
} 
 
index idx4:idx1 { 
    path = idx4 
} 
 
index dist { 
    type = distributed 
    local = idx1 
    local = idx2 
    local = idx3 
    local = idx4 
} 
 
index distp { 
    type = distributed 
    agent_persistent = localhost:9316:dist 
    ha_strategy      = nodeads 
} 
 
searchd { 
    listen = 127.0.0.1:9315:mysql41 
    listen = 127.0.0.1:9316 
    log = sphinx_min.log 
    pid_file = /home/snikolaev/9315.pid 
    binlog_path = 
    qcache_max_bytes = 0 
} 
mysql> SELECT Id, SNIPPET(f,  QUERY()) FROM distp WHERE MATCH('*ab*') limit 0; show meta; 
Empty set (0.01 sec) 
 
+---------------+-------+ 
| Variable_name | Value | 
+---------------+-------+ 
| total         | 1000  | 
| total_found   | 10700 | 
| time          | 0.010 | 
| keyword[0]    | *ab*  | 
| docs[0]       | 13700 | 
| hits[0]       | 13700 | 
+---------------+-------+ 
6 rows in set (0.00 sec) 
 
mysql> SELECT Id, SNIPPET(f,  QUERY()) FROM dist WHERE MATCH('*ab*') limit 0; show meta; 
Empty set (0.01 sec) 
 
+---------------+-------+ 
| Variable_name | Value | 
+---------------+-------+ 
| total         | 1000  | 
| total_found   | 10700 | 
| time          | 0.004 | 
| keyword[0]    | *ab*  | 
| docs[0]       | 13700 | 
| hits[0]       | 13700 | 
+---------------+-------+ 
6 rows in set (0.00 sec) 

Please provide a reproducible case. Feel free to upload your indexes and config to our ftp - https://mnt.cr/ftp

githubmanticore avatar Jun 09 '21 05:06 githubmanticore

I uploaded two files to the FTP:

  • maticore.conf
  • data.zip : the .spX

You can reproduce the issue on Debian 10 and Manticore 3.6.0 96d61d8bf@210504 release

For testing, you can modify /etc/hosts to redirect SERVERX to localhost: agent_persistent = SERVER1:9312|SERVER2:9312|SERVER3:9312|SERVER4:9312:WebPages

daikoz avatar Jun 09 '21 22:06 daikoz

Thank you! I could reproduce the issue on our side. I could also reproduce:

mysql> SELECT Id, SNIPPET(Body, QUERY()) FROM WebPagesLB WHERE MATCH('modele');
ERROR 1064 (42000): index WebPagesLB: agent localhost:9312: agent has 32-bit docids; no longer supported

sanikolaev avatar Jun 10 '21 04:06 sanikolaev

➤ Aleksey N. Vinogradov commented:

That is because of the implicit limit for remotes. For local agents the default limit is 20; for remotes it is 1000. So when you query the balancer, it sends the request to a mirror with an internal max_matches=1000, then retrieves ALL of those matches and returns you 20 (or whatever limit is set). By default we are tuned to deal with aggregations: if you want something like avg() over several different agents, or even count/count(distinct), we need many matches to be precise. But the same code path is used even for a single mirror, where such behavior looks too harsh.
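
To make the difference concrete, here is a toy Python model of the explanation above. It is only a sketch, not Manticore code: the function name `matches_materialized` and both constants are made up for illustration, and it assumes that SNIPPET() is evaluated for every match a mirror sends back and that the client limit does not exceed 1000.

```python
# Toy model only (hypothetical names, not Manticore internals).
# It counts how many matches have to be produced before the result
# is trimmed down to what the client asked for.

LOCAL_DEFAULT_LIMIT = 20       # default limit for a local query, per the comment above
REMOTE_IMPLICIT_LIMIT = 1000   # implicit limit sent to a remote mirror, per the comment above


def matches_materialized(total_found: int,
                         client_limit: int = LOCAL_DEFAULT_LIMIT,
                         via_agent: bool = False) -> int:
    """Return how many matches are built before trimming to client_limit.

    Assumption: client_limit <= REMOTE_IMPLICIT_LIMIT.
    """
    requested = REMOTE_IMPLICIT_LIMIT if via_agent else client_limit
    # A node cannot return more matches than it actually found.
    return min(total_found, requested)


if __name__ == "__main__":
    found = 10700  # total_found from the SHOW META output earlier in the thread
    print("local :", matches_materialized(found))
    print("remote:", matches_materialized(found, via_agent=True))
```

Under these assumptions the balancer path builds 50 times as many rows (1000 instead of 20) for the same 20 rows delivered to the client, which would fit an expensive per-row expression such as SNIPPET() dominating the response time.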

githubmanticore avatar Mar 22 '23 13:03 githubmanticore