FULL_TEXT index on LIKE queries shows no performance improvement or regression

Open tae898 opened this issue 1 month ago • 1 comments

Summary

FULL_TEXT indexes on text fields do not improve performance for LIKE '%term%' queries, and in some cases cause performance regression compared to sequential scans.

Environment

ArcadeDB Version: 25.9.1
Dataset: MovieLens (ml-latest: 62,423 movies, 33M+ ratings)
Index Type: FULL_TEXT on Movie.genres field
Query Pattern: SELECT count(*) FROM Movie WHERE genres LIKE '%Action%'

Expected Behavior

FULL_TEXT indexes should improve or at least not degrade performance for text search queries.

Actual Behavior

FULL_TEXT index causes 10.8% performance regression on COUNT queries with LIKE:

Performance Results

Query: Count ALL Action movies (LIKE, no LIMIT)

Without index: 58.7ms ± 8.4ms
With FULL_TEXT index: 65.1ms ± 18.0ms
Result: 0.9x speedup (10.8% SLOWER)

Query: Find movies by genre (LIKE with LIMIT 10)

Without index: 1.0ms ± 1.0ms
With FULL_TEXT index: 0.8ms ± 0.3ms
Result: 1.3x speedup (marginal improvement)

Comparison with Other Indexes

For comparison, LSM_TREE indexes on the same dataset show massive improvements:

Rating(userId): 14,836x speedup ✅
Rating(movieId): 107.9x speedup ✅
Movie(movieId): 58.1x speedup ✅

Test Code

-- Create FULL_TEXT index
CREATE INDEX ON Movie (genres) FULL_TEXT

-- Query that shows no improvement
SELECT count(*) as count FROM Movie WHERE genres LIKE '%Action%'

Dataset: Genres field contains pipe-delimited values like "Action|Comedy|Sci-Fi"

Analysis

The FULL_TEXT index appears to add overhead without providing optimization for LIKE '%term%' pattern matching:

Tokenization mismatch: FULL_TEXT tokenizes "Action|Comedy" into separate terms
Pattern matching: LIKE '%Action%' is substring search, not token search
Index overhead: Query planner may attempt to use index inefficiently

Questions

Should LIKE queries automatically use FULL_TEXT indexes when available?
Is there alternative syntax (e.g., SEARCH function) that properly utilizes FULL_TEXT indexes?
Should the query planner avoid using FULL_TEXT indexes for LIKE patterns?

Reproduction

Full reproduction code available in Python bindings example: https://github.com/humemai/arcadedb-embedded-python/blob/main/bindings/python/examples/04_csv_import_documents.py

The example demonstrates:

CSV import with automatic schema inference
Index creation after bulk import
Performance comparison with/without indexes
Result validation

Request

Please clarify:

Expected behavior of FULL_TEXT indexes with LIKE queries
Recommended syntax for full-text search on tokenized fields
Whether this is expected behavior or a query optimizer issue

Oct 25 '25 18:10 tae898

I'm going to work on this issue for the next release, thanks @tae898 for all the details.

Dec 09 '25 19:12 lvca