rideshare icon indicating copy to clipboard operation
rideshare copied to clipboard

Degraded full text search performance on larger datasets w/ partial matching

Open jpcamara opened this issue 1 year ago • 3 comments

I was experimenting with table size from the "Implementing Full Text Search (FTS)" chapter and I found that at 25 million rows + prefix: true, the queries start to take a big perf hit (3-5 seconds).

So for instance:

  • Modify db/scripts/bulk_load.sh to use FROM GENERATE_SERIES(1, 25000000) seq; instead of FROM GENERATE_SERIES(1, 10000000) seq;
  • Using the generated column:
pg_search_scope :search_by_full_name,
  against: :searchable_full_name, # stored generated column tsvector
  using: {
    tsearch: {
      dictionary: 'english',
      tsvector_column: 'searchable_full_name'
    }
  }

On this ☝🏼, the performance is pretty good. I see anywhere from 80ms to 150ms.

Benchmark.ms { User.search_by_full_name("ja").limit(5).to_a }
  User Load (72.9ms)  SELECT "users"."id", "users"."first_name", "users"."last_name", "users"."email", "users"."type", "users"."created_at", "users"."updated_at", "users"."password_digest", "users"."trips_count", "users"."drivers_license_number", "users"."searchable_full_name" FROM "users" INNER JOIN (SELECT "users"."id" AS pg_search_id, (ts_rank(("users"."searchable_full_name"), (to_tsquery('english', ''' ' || 'ja' || ' ''')), 0)) AS rank FROM "users" WHERE (("users"."searchable_full_name") @@ (to_tsquery('english', ''' ' || 'ja' || ' ''')))) AS pg_search_7dfb4cf67742cb0660305e ON "users"."id" = pg_search_7dfb4cf67742cb0660305e.pg_search_id ORDER BY pg_search_7dfb4cf67742cb0660305e.rank DESC, "users"."id" ASC LIMIT $1  [["LIMIT", 5]]
 => 114.21500006690621

But if I modify it to use prefix: true, so I can match on partial words, the performance takes a nosedive:

pg_search_scope :search_by_full_name,
  against: :searchable_full_name, # stored generated column tsvector
  using: {
    tsearch: {
      dictionary: 'english',
      tsvector_column: 'searchable_full_name',
      prefix: true
    }
  }
Benchmark.ms { User.search_by_full_name("ja").limit(5).to_a }
  User Load (3771.9ms)  SELECT "users"."id", "users"."first_name", "users"."last_name", "users"."email", "users"."type", "users"."created_at", "users"."updated_at", "users"."password_digest", "users"."trips_count", "users"."drivers_license_number", "users"."searchable_full_name" FROM "users" INNER JOIN (SELECT "users"."id" AS pg_search_id, (ts_rank(("users"."searchable_full_name"), (to_tsquery('english', ''' ' || 'ja' || ' ''' || ':*')), 0)) AS rank FROM "users" WHERE (("users"."searchable_full_name") @@ (to_tsquery('english', ''' ' || 'ja' || ' ''' || ':*')))) AS pg_search_7dfb4cf67742cb0660305e ON "users"."id" = pg_search_7dfb4cf67742cb0660305e.pg_search_id ORDER BY pg_search_7dfb4cf67742cb0660305e.rank DESC, "users"."id" ASC LIMIT $1  [["LIMIT", 5]]
 => 3779.0429999586195

At 10 million records, even with prefix: true, the performance stays pretty good. But as I ramped it up, around 25 million rows things get pretty bad and i'm curious if others have ideas on how to tune this for Rideshare?

jpcamara avatar Jan 17 '24 13:01 jpcamara