rideshare Degraded full text search performance on larger datasets w/ partial matching

Degraded full text search performance on larger datasets w/ partial matching

Open jpcamara opened this issue 1 year ago • 3 comments

I was experimenting with table size from the "Implementing Full Text Search (FTS)" chapter and I found that at 25 million rows + prefix: true, the queries start to take a big perf hit (3-5 seconds).

So for instance:

Modify db/scripts/bulk_load.sh to use FROM GENERATE_SERIES(1, 25000000) seq; instead of FROM GENERATE_SERIES(1, 10000000) seq;
Using the generated column:

pg_search_scope :search_by_full_name,
  against: :searchable_full_name, # stored generated column tsvector
  using: {
    tsearch: {
      dictionary: 'english',
      tsvector_column: 'searchable_full_name'
    }
  }

On this ☝🏼, the performance is pretty good. I see anywhere from 80ms to 150ms.

Benchmark.ms { User.search_by_full_name("ja").limit(5).to_a }
  User Load (72.9ms)  SELECT "users"."id", "users"."first_name", "users"."last_name", "users"."email", "users"."type", "users"."created_at", "users"."updated_at", "users"."password_digest", "users"."trips_count", "users"."drivers_license_number", "users"."searchable_full_name" FROM "users" INNER JOIN (SELECT "users"."id" AS pg_search_id, (ts_rank(("users"."searchable_full_name"), (to_tsquery('english', ''' ' || 'ja' || ' ''')), 0)) AS rank FROM "users" WHERE (("users"."searchable_full_name") @@ (to_tsquery('english', ''' ' || 'ja' || ' ''')))) AS pg_search_7dfb4cf67742cb0660305e ON "users"."id" = pg_search_7dfb4cf67742cb0660305e.pg_search_id ORDER BY pg_search_7dfb4cf67742cb0660305e.rank DESC, "users"."id" ASC LIMIT $1  [["LIMIT", 5]]
 => 114.21500006690621

But if I modify it to use prefix: true, so I can match on partial words, the performance takes a nosedive:

pg_search_scope :search_by_full_name,
  against: :searchable_full_name, # stored generated column tsvector
  using: {
    tsearch: {
      dictionary: 'english',
      tsvector_column: 'searchable_full_name',
      prefix: true
    }
  }

Benchmark.ms { User.search_by_full_name("ja").limit(5).to_a }
  User Load (3771.9ms)  SELECT "users"."id", "users"."first_name", "users"."last_name", "users"."email", "users"."type", "users"."created_at", "users"."updated_at", "users"."password_digest", "users"."trips_count", "users"."drivers_license_number", "users"."searchable_full_name" FROM "users" INNER JOIN (SELECT "users"."id" AS pg_search_id, (ts_rank(("users"."searchable_full_name"), (to_tsquery('english', ''' ' || 'ja' || ' ''' || ':*')), 0)) AS rank FROM "users" WHERE (("users"."searchable_full_name") @@ (to_tsquery('english', ''' ' || 'ja' || ' ''' || ':*')))) AS pg_search_7dfb4cf67742cb0660305e ON "users"."id" = pg_search_7dfb4cf67742cb0660305e.pg_search_id ORDER BY pg_search_7dfb4cf67742cb0660305e.rank DESC, "users"."id" ASC LIMIT $1  [["LIMIT", 5]]
 => 3779.0429999586195

At 10 million records, even with prefix: true, the performance stays pretty good. But as I ramped it up, around 25 million rows things get pretty bad and i'm curious if others have ideas on how to tune this for Rideshare?

Jan 17 '24 13:01 jpcamara

rideshare rideshare copied to clipboard

Degraded full text search performance on larger datasets w/ partial matching

rideshare
rideshare copied to clipboard