FULL_TEXT index on LIKE queries shows no performance improvement or regression
Summary
FULL_TEXT indexes on text fields do not improve performance for LIKE '%term%' queries, and in some cases cause performance regression compared to sequential scans.
Environment
- ArcadeDB Version: 25.9.1
- Dataset: MovieLens (ml-latest: 62,423 movies, 33M+ ratings)
- Index Type: FULL_TEXT on
Movie.genresfield - Query Pattern:
SELECT count(*) FROM Movie WHERE genres LIKE '%Action%'
Expected Behavior
FULL_TEXT indexes should improve or at least not degrade performance for text search queries.
Actual Behavior
FULL_TEXT index causes 10.8% performance regression on COUNT queries with LIKE:
Performance Results
Query: Count ALL Action movies (LIKE, no LIMIT)
- Without index: 58.7ms ± 8.4ms
- With FULL_TEXT index: 65.1ms ± 18.0ms
- Result: 0.9x speedup (10.8% SLOWER)
Query: Find movies by genre (LIKE with LIMIT 10)
- Without index: 1.0ms ± 1.0ms
- With FULL_TEXT index: 0.8ms ± 0.3ms
- Result: 1.3x speedup (marginal improvement)
Comparison with Other Indexes
For comparison, LSM_TREE indexes on the same dataset show massive improvements:
Rating(userId): 14,836x speedup ✅Rating(movieId): 107.9x speedup ✅Movie(movieId): 58.1x speedup ✅
Test Code
-- Create FULL_TEXT index
CREATE INDEX ON Movie (genres) FULL_TEXT
-- Query that shows no improvement
SELECT count(*) as count FROM Movie WHERE genres LIKE '%Action%'
Dataset: Genres field contains pipe-delimited values like "Action|Comedy|Sci-Fi"
Analysis
The FULL_TEXT index appears to add overhead without providing optimization for LIKE '%term%' pattern matching:
- Tokenization mismatch: FULL_TEXT tokenizes
"Action|Comedy"into separate terms - Pattern matching:
LIKE '%Action%'is substring search, not token search - Index overhead: Query planner may attempt to use index inefficiently
Questions
- Should
LIKEqueries automatically use FULL_TEXT indexes when available? - Is there alternative syntax (e.g.,
SEARCHfunction) that properly utilizes FULL_TEXT indexes? - Should the query planner avoid using FULL_TEXT indexes for
LIKEpatterns?
Reproduction
Full reproduction code available in Python bindings example: https://github.com/humemai/arcadedb-embedded-python/blob/main/bindings/python/examples/04_csv_import_documents.py
The example demonstrates:
- CSV import with automatic schema inference
- Index creation after bulk import
- Performance comparison with/without indexes
- Result validation
Request
Please clarify:
- Expected behavior of FULL_TEXT indexes with LIKE queries
- Recommended syntax for full-text search on tokenized fields
- Whether this is expected behavior or a query optimizer issue
I'm going to work on this issue for the next release, thanks @tae898 for all the details.