arcadedb icon indicating copy to clipboard operation
arcadedb copied to clipboard

FULL_TEXT index on LIKE queries shows no performance improvement or regression

Open tae898 opened this issue 1 month ago • 1 comments

Summary

FULL_TEXT indexes on text fields do not improve performance for LIKE '%term%' queries, and in some cases cause performance regression compared to sequential scans.

Environment

  • ArcadeDB Version: 25.9.1
  • Dataset: MovieLens (ml-latest: 62,423 movies, 33M+ ratings)
  • Index Type: FULL_TEXT on Movie.genres field
  • Query Pattern: SELECT count(*) FROM Movie WHERE genres LIKE '%Action%'

Expected Behavior

FULL_TEXT indexes should improve or at least not degrade performance for text search queries.

Actual Behavior

FULL_TEXT index causes 10.8% performance regression on COUNT queries with LIKE:

Performance Results

Query: Count ALL Action movies (LIKE, no LIMIT)

  • Without index: 58.7ms ± 8.4ms
  • With FULL_TEXT index: 65.1ms ± 18.0ms
  • Result: 0.9x speedup (10.8% SLOWER)

Query: Find movies by genre (LIKE with LIMIT 10)

  • Without index: 1.0ms ± 1.0ms
  • With FULL_TEXT index: 0.8ms ± 0.3ms
  • Result: 1.3x speedup (marginal improvement)

Comparison with Other Indexes

For comparison, LSM_TREE indexes on the same dataset show massive improvements:

  • Rating(userId): 14,836x speedup
  • Rating(movieId): 107.9x speedup
  • Movie(movieId): 58.1x speedup

Test Code

-- Create FULL_TEXT index
CREATE INDEX ON Movie (genres) FULL_TEXT

-- Query that shows no improvement
SELECT count(*) as count FROM Movie WHERE genres LIKE '%Action%'

Dataset: Genres field contains pipe-delimited values like "Action|Comedy|Sci-Fi"

Analysis

The FULL_TEXT index appears to add overhead without providing optimization for LIKE '%term%' pattern matching:

  1. Tokenization mismatch: FULL_TEXT tokenizes "Action|Comedy" into separate terms
  2. Pattern matching: LIKE '%Action%' is substring search, not token search
  3. Index overhead: Query planner may attempt to use index inefficiently

Questions

  1. Should LIKE queries automatically use FULL_TEXT indexes when available?
  2. Is there alternative syntax (e.g., SEARCH function) that properly utilizes FULL_TEXT indexes?
  3. Should the query planner avoid using FULL_TEXT indexes for LIKE patterns?

Reproduction

Full reproduction code available in Python bindings example: https://github.com/humemai/arcadedb-embedded-python/blob/main/bindings/python/examples/04_csv_import_documents.py

The example demonstrates:

  • CSV import with automatic schema inference
  • Index creation after bulk import
  • Performance comparison with/without indexes
  • Result validation

Request

Please clarify:

  1. Expected behavior of FULL_TEXT indexes with LIKE queries
  2. Recommended syntax for full-text search on tokenized fields
  3. Whether this is expected behavior or a query optimizer issue

tae898 avatar Oct 25 '25 18:10 tae898

I'm going to work on this issue for the next release, thanks @tae898 for all the details.

lvca avatar Dec 09 '25 19:12 lvca