
fix: fixed integer overflow in ExpUnrolledLinkedList for large datasets

Open mdashti opened this issue 3 months ago • 7 comments

What

Changes ExpUnrolledLinkedList::block_num from u16 to u32 to prevent integer overflow when indexing large datasets. The structure now supports up to ~4 billion blocks (128 TB) instead of just 65,535 blocks (2.1 GB).
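The capacity figures above can be sanity-checked with a back-of-the-envelope calculation (a sketch assuming every block saturates at the 32 KiB cap):

```rust
/// Total bytes addressable for a given maximum block count, assuming
/// every block saturates at the 32 KiB cap described in this PR.
fn capacity_bytes(max_blocks: u64) -> u64 {
    max_blocks * 32 * 1024
}

fn main() {
    // Old limit: block_num was a u16 -> at most 65,535 blocks (~2.1 GB).
    assert_eq!(capacity_bytes(u16::MAX as u64), 2_147_450_880);

    // New limit: block_num is a u32 -> 2^32 blocks = 2^47 bytes = 128 TiB.
    let new_limit = capacity_bytes(u32::MAX as u64 + 1);
    assert_eq!(new_limit >> 40, 128);
}
```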

Why

Users were experiencing index creation failures with the error "mid > len" when creating BM25 indexes on tables with large integer arrays (100k rows × 6,700 elements = 660M operations). This required ~103,000 blocks, exceeding the u16::MAX limit of 65,535, causing:

  • Integer overflow in release builds → memory corruption → "mid > len" errors
  • Direct overflow panic in debug builds → "attempt to add with overflow"
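Both failure modes boil down to the same u16 wraparound; a standalone sketch of the arithmetic (not tantivy's actual code):

```rust
fn main() {
    let block_num: u16 = u16::MAX; // counter already at 65,535 blocks

    // Release builds: `+ 1` silently wraps to 0, so new blocks alias old
    // ones and the list metadata is corrupted ("mid > len" downstream).
    assert_eq!(block_num.wrapping_add(1), 0);

    // Debug builds: the same `+ 1` panics with
    // "attempt to add with overflow". checked_add surfaces it as a value:
    assert_eq!(block_num.checked_add(1), None);

    // After widening the counter to u32, the increment succeeds.
    assert_eq!((block_num as u32).checked_add(1), Some(65_536));
}
```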

How

  1. Changed block_num type: u16 → u32 (supports 65,536× more blocks)
  2. Added safety measures:
    • Overflow protection with checked_add() in increment_num_blocks()
    • Metadata corruption detection with assert!() in read_to_end()
  3. Maintained compatibility: Block sizes still cap at 32 KB; only the count limit increased
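A minimal sketch of steps 1 and 2 (the real struct has more fields; only `block_num` and `increment_num_blocks()` are named in this PR, the rest is illustrative):

```rust
struct ExpUnrolledLinkedList {
    block_num: u32, // step 1: widened from u16
    // ... remaining fields elided in this sketch
}

impl ExpUnrolledLinkedList {
    fn increment_num_blocks(&mut self) {
        // Step 2: checked_add turns a silent wraparound into a loud
        // panic instead of memory corruption.
        self.block_num = self
            .block_num
            .checked_add(1)
            .expect("ExpUnrolledLinkedList: block count overflowed u32");
    }
}

fn main() {
    let mut list = ExpUnrolledLinkedList { block_num: u16::MAX as u32 };
    list.increment_num_blocks(); // crosses the old u16 limit without issue
    assert_eq!(list.block_num, 65_536);
}
```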

Tests

Added 8 tests to verify the fix.

mdashti avatar Nov 18 '25 17:11 mdashti

@mdashti Which means you have gigantic segments. The recommended max size for a segment is 10 million docs. Do you have a strong reason for this segment size?

fulmicoton avatar Dec 01 '25 11:12 fulmicoton

@fulmicoton That's right. But this is our users' data, and it can have any shape. We've advised them to improve their usage, but it's best if we don't panic on this issue.

mdashti avatar Dec 01 '25 22:12 mdashti

But this is our users' data, and it can have any shape.

Sorry I think I misunderstood the problem. I thought it was caused by a super high number of rows.

I'd like to merge this but it comes with a performance regression.

~/git/tantivy (paradedb-paradedb/fix-overflow-issue*) » cargo bench --bench index-bench                   paul.masurel@COMP-QMLQQJH2R1
   Compiling tantivy-stacker v0.6.0 (/Users/paul.masurel/git/tantivy/stacker)
    Building [=======================> ] 244/248: tantivy-stacker
   Compiling tantivy-columnar v0.6.0 (/Users/paul.masurel/git/tantivy/columnar)
   Compiling tantivy v0.26.0 (/Users/paul.masurel/git/tantivy)
    Finished `bench` profile [optimized + debuginfo] target(s) in 17.77s
     Running benches/index-bench.rs (target/release/deps/index_bench-fa60c5a819b439e8)
index-hdfs/only-indexed-no-commit
                        time:   [232.87 ms 236.64 ms 240.73 ms]
                        thrpt:  [88.768 MiB/s 90.300 MiB/s 91.763 MiB/s]
                 change:
                        time:   [+14.745% +17.293% +19.919%] (p = 0.00 < 0.05)
                        thrpt:  [-16.611% -14.744% -12.851%]
                        Performance has regressed.
Benchmarking index-hdfs/only-indexed-with-commit: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 6.9s, or reduce sample count to 10.
index-hdfs/only-indexed-with-commit
                        time:   [324.49 ms 327.59 ms 331.02 ms]
                        thrpt:  [64.555 MiB/s 65.231 MiB/s 65.854 MiB/s]
                 change:
                        time:   [+11.808% +14.003% +15.959%] (p = 0.00 < 0.05)
                        thrpt:  [-13.762% -12.283% -10.561%]
                        Performance has regressed.
Found 2 outliers among 20 measurements (10.00%)

fulmicoton avatar Dec 02 '25 20:12 fulmicoton

Can you share more detail about the problem?

These ids are local to a single posting list. They rapidly saturate to a size of 32 KB, so the structure should support posting lists of around 2 GB. How do you end up with a single posting list being this long? Is it because you encode positions and the token is very frequent?

fulmicoton avatar Dec 02 '25 20:12 fulmicoton

The performance regression was just caused by lack of inlining.
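For readers following along, this is the usual shape of such a fix (illustrative only, not the actual commit; the function name here is hypothetical): annotating the small hot function so it can be inlined across the tantivy-stacker → tantivy crate boundary.

```rust
// Without #[inline], a tiny function exported from one crate is not
// inlined into another crate's hot loop (absent LTO), which can cost
// double-digit percentages on tight indexing paths.
#[inline]
pub fn next_block_num(block_num: u32) -> u32 {
    block_num.checked_add(1).expect("block count overflow")
}

fn main() {
    assert_eq!(next_block_num(41), 42);
}
```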

fulmicoton avatar Dec 02 '25 20:12 fulmicoton

@mdashti I do not have permission to push changes to your upstream repo. You have been invited to this repo. Next time, can you push to a dev branch in tantivy or give me write permissions to push extra commits to your repo?

(@stuhood same situation)

fulmicoton avatar Dec 02 '25 20:12 fulmicoton

Can you share more detail about the problem?

These ids are local to a single posting list. They rapidly saturate to a size of 32 KB, so the structure should support posting lists of around 2 GB. How do you end up with a single posting list being this long? Is it because you encode positions and the token is very frequent?

Here's the original report on ParadeDB:

We're running into a weird indexing problem. We've tried to flatten some of our lists into a list_id column in our main cmptbl_full table. This worked on our DEVELOP environment and the parade index built fine, but on PROD, where we have much more list data, we kept encountering one of two errors, either

FATAL: server conn crashed?
SSL connection has been closed unexpectedly

or

ERROR: mid > len
CONTEXT: parallel worker

Making an index without list_id worked fine, and making an index with only list_id still triggered the above two errors. Is this something that you've run into before?

with this index definition:

CREATE INDEX cmptbl_full_new_idx ON public.cmptbl_full_new USING bm25 (contact_id, ent_domain, ent_industry, ent_sector, ent_sub_sectors, ent_name, ent_shorthand_name, contact_business_email, contact_canonical_shorthand_name, contact_first_name, contact_full_name, contact_job_title, contact_last_name, contact_mobile_phone, ent_id, employee_rank, revenue_rank, ent_emp_rev_details, ent_locations_details, contact_job_details, contact_locations_details, contact_confirmed_connect_date, list_id) WITH (key_field=contact_id, text_fields='{ "ent_domain": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"}, "ent_industry": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}}, "ent_sector": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}}, "ent_name": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"}, "ent_shorthand_name": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"}, "contact_business_email": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"}, "contact_canonical_shorthand_name": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"}, "contact_first_name": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"}, "contact_full_name": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"}, "contact_job_title": {"fast": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"}, "contact_last_name": {"fast": true, "tokenizer": {"ascii_folding": true, 
"lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"}, "contact_mobile_phone": {"fast": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "raw"}, "normalizer": "lowercase"} }', numeric_fields='{ "ent_id": {"indexed": true}, "employee_rank": {"indexed": true}, "revenue_rank": {"indexed": true}, "list_id": {"indexed": true} }', boolean_fields='{}', json_fields='{ "ent_emp_rev_details": {"fast": true, "indexed": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "lowercase"}}, "ent_locations_details": {"fast": true, "indexed": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "lowercase"}}, "ent_sub_sectors": {"fast": true, "indexed": true, "tokenizer": {"lowercase": true, "remove_long": 255, "type": "lowercase"}}, "contact_job_details": {"fast": true, "indexed": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "lowercase"}}, "contact_locations_details": { "fast": true, "indexed": true, "tokenizer": {"ascii_folding": true, "lowercase": true, "remove_long": 255, "type": "lowercase"}} }', range_fields='{}', datetime_fields='{ "contact_confirmed_connect_date": {"indexed": true} }');

and here's the query:

CREATE INDEX cmptbl_full_old_list_id_idx ON cmptbl_full_old USING bm25 (contact_id, list_id) WITH ( key_field=contact_id, numeric_fields='{"list_id": {"indexed": true}}' );

ERROR: XX000: mid > len
CONTEXT: parallel worker
LOCATION: column_operation.rs:100

SELECT array_length(list_id, 1) as array_len, pg_column_size(list_id) as size_bytes, list_id[1:10] as first_10_elements FROM cmptbl_full_old WHERE list_id IS NOT NULL ORDER BY array_length(list_id, 1) DESC LIMIT 10;

 array_len | size_bytes | first_10_elements
-----------+------------+-------------------------------------------------------
      6700 |      26820 | {407,434,638,769,1184,1202,1781,2012,2108,2270}
      6640 |      26580 | {935,1259,1383,1411,1418,1624,1881,2113,2728,2965}
      6639 |      26576 | {407,434,638,769,1184,1202,1683,1781,2012,2270}
      6635 |      26560 | {1086,1901,2105,4181,4213,4894,5168,5221,68213,68222}
      6635 |      26560 | {871,1067,1683,2204,2345,2805,2572,3356,4311,4557}
      6628 |      26532 | {1901,2845,2848,3836,4213,5221,68213,7274,7462,8076}
      6627 |      26528 | {935,1259,1383,1418,1881,2345,2728,2965,3164,3172}
      6623 |      26512 | {1086,2845,2848,3836,4213,5221,68213,7274,7462,7557}
      6606 |      26444 | {564,1634,1683,1856,2022,2405,2805,2572,2988,3730}
      6604 |      26436 | {407,434,638,769,1184,1202,1781,2012,2270,2718}

mdashti avatar Dec 03 '25 00:12 mdashti

@fulmicoton @PSeitz any more comments?

mdashti avatar Dec 10 '25 18:12 mdashti

This increased memory consumption per unique term by 4 bytes (12 → 16 bytes for ExpUnrolledLinkedList).
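The jump from 12 to 16 bytes (rather than 12 → 14) is an alignment effect: widening one u16 field to u32 forces padding. A sketch with hypothetical field names (the real ExpUnrolledLinkedList layout differs, but the alignment arithmetic is the same):

```rust
#[repr(C)]
struct BlockNumU16 {
    head: u32,
    tail: u32,
    len: u16,
    block_num: u16, // before this PR: 4 + 4 + 2 + 2 = 12 bytes
}

#[repr(C)]
struct BlockNumU32 {
    head: u32,
    tail: u32,
    len: u16,       // 2 bytes of padding follow to align the u32
    block_num: u32, // after this PR: 4 + 4 + 2 + pad(2) + 4 = 16 bytes
}

fn main() {
    assert_eq!(std::mem::size_of::<BlockNumU16>(), 12);
    assert_eq!(std::mem::size_of::<BlockNumU32>(), 16);
}
```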

PSeitz avatar Dec 17 '25 00:12 PSeitz