Manticore crashes with `signal 11` when inserting data

Open yharahuts opened this issue 1 year ago • 6 comments

Describe the bug: Manticore crashes when inserting a large amount of data into an index.

Manticore is running in RT mode with the following tables:

CREATE TABLE redacted_aaa (
id bigint,
entity_id text stored,
name text indexed,
description text indexed,
notes text indexed,
number text indexed,
holder text indexed,
is_deleted bool,
schema string attribute,
source string attribute,
attributes json
) min_infix_len='2' index_exact_words='1' charset_table='non_cjk, U+47->0' min_word_len='2' blend_chars='@, /, +, -, ., _' blend_mode='trim_none, trim_both, skip_pure' morphology='stem_ru' min_stemming_len='3' expand_keywords='1'

Data is inserted in (rather large?) batches of 500 records per insert, and the whole dataset contains about 100M rows split across 1-3 indexes. The crash happens randomly: the data can load without any problems at all, or the server can crash ~1-2% of the way in, at a random row.

Since this is a production instance, I'm afraid I cannot share our datasets or test multiple (older?) Manticore versions.

To Reproduce: the steps to reproduce the behavior are:

  1. Create the tables;
  2. Insert large dataset via batches;
  3. Randomly get a crash;

Expected behavior: It should not crash.

Describe the environment:

  • searchd -v: Manticore 6.2.12 dc5144d35@230822
  • Running in official manticore docker image manticoresearch/manticore:6.2.12

Messages from log files: `docker logs` shows the following:

rt: table redacted_aaa: diskchunk 7(8), segments 30  saved in 20.478723 (20.479030) sec, RAM saved/new 127530839/2099120 ratio 0.950000 (soft limit 127506841, conf limit 134217728)
rt: table redacted_bbb: diskchunk 5(6), segments 31  saved in 5.438967 (5.440291) sec, RAM saved/new 127699405/232936 ratio 0.950000 (soft limit 127506841, conf limit 134217728)
# ~20-30 of similar lines 
Crash!!! Handling signal 11
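
As an aid to reading the `rt:` lines above, here is a small sketch that pulls the numeric fields out of one such line. The regular expression is written against the two lines quoted in this report only and is an assumption about the format, not a documented log grammar; `parse_rt_save` is a hypothetical helper name.

```python
import re

# Pattern derived from the log lines quoted in this issue (an assumption,
# not an official format description).
RT_SAVE = re.compile(
    r"rt: table (?P<table>\S+): diskchunk (?P<chunk>\d+)\(\d+\), "
    r"segments (?P<segments>\d+)\s+saved in (?P<secs>[\d.]+) .*?"
    r"RAM saved/new (?P<saved>\d+)/(?P<new>\d+) ratio (?P<ratio>[\d.]+) "
    r"\(soft limit (?P<soft>\d+), conf limit (?P<conf>\d+)\)"
)


def parse_rt_save(line: str):
    """Return the fields of an 'rt: table ... saved' line, or None if no match."""
    m = RT_SAVE.search(line)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "table": d["table"],
        "chunk": int(d["chunk"]),
        "segments": int(d["segments"]),
        "secs": float(d["secs"]),
        "saved": int(d["saved"]),
        "new": int(d["new"]),
        "ratio": float(d["ratio"]),
        "soft": int(d["soft"]),
        "conf": int(d["conf"]),
    }
```

In both quoted lines the conf limit is 134217728 bytes (128 MiB), and the soft limit equals that value multiplied by the ratio 0.95 and truncated (127506841). The saved RAM figure is slightly above the soft limit in each case, which suggests these are ordinary RAM chunk flushes rather than errors.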

After that it restarts with:

binlog: replaying log /var/lib/manticore/binlog/binlog.011
binlog: table redacted_aaa: recovered from tid 23457 to tid 23487
binlog: table redacted_bbb: recovered from tid 5380 to tid 5418
binlog: replay stats: 316 commits; 0 updates, 0 reconfigure; 0 pq-add; 0 pq-delete; 0 pq-add-delete, 2 tables
binlog: finished replaying /var/lib/manticore/binlog/binlog.011; 85.2 MB in 0.551 sec
binlog: finished replaying total 1 in 0.552 sec

Additional context: While writing this issue, I came up with two ideas:

  • ~~decrease batch size to maybe 50 rows per single insert;~~ didn't help
  • try to turn off binlog (we do not need it anyway I think);

I'll try both options, but since the crash happens randomly, I can't guarantee I'll be able to tell whether they helped.

Any advice is greatly appreciated.

indextool --check on both indexes returns:

# other chunks gives same output
checking disk chunk, extension 16, 16(17)...
WARNING: secondary library not loaded; secondary index(es) disabled
checking schema...
checking dictionary...
checking data...
checking rows...
checking attribute blocks index...
checking kill-list...
checking docstore...
checking dead row map...
checking doc-id lookup...
check passed, 196.3 sec elapsed
check passed, 196.3 sec elapsed

yharahuts avatar Feb 27 '24 10:02 yharahuts

I can't reproduce a crash in 6.2.12 with the loading script based on your schema:

#!/usr/bin/php
<?php
if (count($argv) < 5) die("Usage: ".__FILE__." <batch size> <concurrency> <docs> <multiplier>\n");

// This function waits for an idle mysql connection for the $query, runs it and exits
function process($query) {
    global $all_links;
    global $requests;
    foreach ($all_links as $k=>$link) {
        if (@$requests[$k]) continue;
        mysqli_query($link, $query, MYSQLI_ASYNC);
        @$requests[$k] = microtime(true);
        return true;
    }
    do {
        $links = $errors = $reject = array();
        foreach ($all_links as $link) {
            $links[] = $errors[] = $reject[] = $link;
        }
        $count = @mysqli_poll($links, $errors, $reject, 0, 1000);
        if ($count > 0) {
            foreach ($links as $j=>$link) {
                $res = @mysqli_reap_async_query($links[$j]);
                foreach ($all_links as $i=>$link_orig) if ($all_links[$i] === $links[$j]) break;
                if ($link->error) {
                    echo "ERROR: {$link->error}\n";
                    if (!mysqli_ping($link)) {
                        echo "ERROR: mysql connection is down, removing it from the pool\n";
                        unset($all_links[$i]); // remove the original link from the pool
                        unset($requests[$i]); // and from the $requests too
                    }
                    return false;
                }
                if ($res === false and !$link->error) continue;
                if (is_object($res)) {
                    mysqli_free_result($res);
                }
                $requests[$i] = microtime(true);
                mysqli_query($link, $query, MYSQLI_ASYNC); // making next query
                return true;
            }
        };
    } while (true);
    return true;
}

$all_links = [];
$requests = [];
$c = 0;
for ($i=0;$i<$argv[2];$i++) {
    $m = @mysqli_connect('127.0.0.1', '', '', '', 9306);
    if (mysqli_connect_error()) die("Cannot connect to Manticore\n");
    $all_links[] = $m;
}

// init
mysqli_query($all_links[0], "drop table if exists redacted_aaa");
mysqli_query($all_links[0], "CREATE TABLE redacted_aaa ( id bigint, entity_id text stored, name text indexed, description text indexed, notes text indexed, number text indexed, holder text indexed, is_deleted bool, schema string attribute, source string attribute, attributes json ) min_infix_len='2' index_exact_words='1' charset_table='non_cjk, U+47->0' min_word_len='2' blend_chars='@, /, +, -, ., _' blend_mode='trim_none, trim_both, skip_pure' morphology='stem_ru' min_stemming_len='3' expand_keywords='1'");

$batch = [];
$query_start = "insert into redacted_aaa(id, entity_id, name, description, notes, number, holder, is_deleted, schema, source, attributes) values ";

echo "preparing...\n";
$error = false;
$cache_file_name = '/tmp/'.md5($query_start).'_'.$argv[1].'_'.$argv[3];
$c = 0;
if (!file_exists($cache_file_name)) {
    $batches = [];
    while ($c < $argv[3]) {
      $batch[] = "($c, '1234567890', 'john smith', 'Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum', 'Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum', '0123456789', 'holder1 holder2 holder3', 'true', '{\"a\": 123, \"b\": 345}', 'source', '{\"a\": 123, \"b\": {\"c\": 1.2, \"d\": true}}')";
      $c++;
      if (floor($c/1000) == $c/1000) echo "\r".($c/$argv[3]*100)."%       ";
      if (count($batch) == $argv[1]) {
        $batches[] = $query_start.implode(',', $batch);
        $batch = [];
      }
    }
    if ($batch) $batches[] = $query_start.implode(',', $batch);
    file_put_contents($cache_file_name, serialize($batches));
} else {
    echo "found in cache $cache_file_name\n";
    $batches = unserialize(file_get_contents($cache_file_name));
}

$batchesMulti = [];
for ($n=0;$n<$argv[4];$n++) $batchesMulti = array_merge($batchesMulti, $batches);
$batches = $batchesMulti;

echo "querying...\n";

$t = microtime(true);

foreach ($batches as $batch) {
  if (!process($batch)) die("ERROR\n");
}

// wait until all the workers finish
do {
  $links = $errors = $reject = array();
  foreach ($all_links as $link)  $links[] = $errors[] = $reject[] = $link;
  $count = @mysqli_poll($links, $errors, $reject, 0, 100);
} while (count($all_links) != count($links) + count($errors) + count($reject));

echo "finished inserting\n";
echo "Total time: ".(microtime(true) - $t)."\n";
echo round($argv[3] * $argv[4] / (microtime(true) - $t))." docs per sec\n";

even with the higher concurrency of 8:

# php ~/load_1891.php 500 8 10000000 1
preparing...
100%       querying...
finished inserting
Total time: 832.53445100784
12012 docs per sec
mysql> select count(*) from redacted_aaa;
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (0.00 sec)

There was a somewhat similar issue https://github.com/manticoresoftware/manticoresearch/issues/1458#issuecomment-1790605768 which has already been fixed. I suggest you check if the crash persists in the latest dev version - https://mnt.cr/dev/nightly

You can also try modifying the script until it triggers the crash, so that we can reproduce it on our end and fix it.

sanikolaev avatar Feb 29 '24 18:02 sanikolaev

@sanikolaev it happens very randomly: I can load 100 GB of data without any problems at all, or hit the crash on a 15 GB dataset at a random point.

It is just like your comment on that issue:

> It's also very unstable: sometimes the provided script works fine for the whole night, sometimes it crashes in a minute after started.

I'm currently testing the manticoresearch/manticore:dev image, but I will need some (rather long) time to test with various data and confirm that this is a duplicate and that it is fixed.

yharahuts avatar Feb 29 '24 18:02 yharahuts

I don't know if it helps, but I managed to get into a similar state with two vector fields and the columnar engine in the same table.

MirosOwners avatar Mar 16 '24 13:03 MirosOwners

@MirosOwners Do you mean in the same table as in the script here https://github.com/manticoresoftware/manticoresearch/issues/1891#issuecomment-1971707674 ?

sanikolaev avatar Mar 16 '24 14:03 sanikolaev

It stopped crashing on this index with the dev version, but started to crash on another index. This time the logs are clean: Manticore just dies and starts again as if nothing happened.

Edit: as far as I can see, it just slowly consumes all available memory. Any ideas how to debug this?

I've tried running `FLUSH RAMCHUNK` during inserts, but no luck.
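
One way to confirm the slow memory growth described above is to sample the resident set size of the `searchd` process while the load runs. The following is a minimal Linux-only sketch, not something proposed in this thread: it assumes access to `/proc` for the process (for example from inside the container), and `parse_vm_rss_kib` and `sample_rss` are hypothetical helper names.

```python
import re
import time
from pathlib import Path


def parse_vm_rss_kib(status_text: str):
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def sample_rss(pid: int, interval: float = 5.0, samples: int = 10):
    """Yield (unix_time, rss_kib) pairs for the given pid at a fixed interval."""
    for _ in range(samples):
        text = Path(f"/proc/{pid}/status").read_text()
        yield time.time(), parse_vm_rss_kib(text)
        time.sleep(interval)
```

Logging these samples alongside the insert progress shows whether memory rises steadily with the number of inserted rows or jumps at specific moments such as a disk chunk save.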

yharahuts avatar Apr 12 '24 15:04 yharahuts

> Edit: as far as I can see, it just slowly consumes all available memory. Any ideas how to debug this?

@yharahuts So it doesn't crash in the dev version, but just an OOM occurs?

sanikolaev avatar Apr 24 '24 09:04 sanikolaev