rideshare
rideshare copied to clipboard
Degraded full text search performance on larger datasets w/ partial matching
I was experimenting with table size from the "Implementing Full Text Search (FTS)" chapter and I found that at 25 million rows + prefix: true
, the queries start to take a big perf hit (3-5 seconds).
So for instance:
- Modify
db/scripts/bulk_load.sh
to useFROM GENERATE_SERIES(1, 25000000) seq;
instead ofFROM GENERATE_SERIES(1, 10000000) seq;
- Using the generated column:
pg_search_scope :search_by_full_name,
against: :searchable_full_name, # stored generated column tsvector
using: {
tsearch: {
dictionary: 'english',
tsvector_column: 'searchable_full_name'
}
}
On this ☝🏼, the performance is pretty good. I see anywhere from 80ms to 150ms.
Benchmark.ms { User.search_by_full_name("ja").limit(5).to_a }
User Load (72.9ms) SELECT "users"."id", "users"."first_name", "users"."last_name", "users"."email", "users"."type", "users"."created_at", "users"."updated_at", "users"."password_digest", "users"."trips_count", "users"."drivers_license_number", "users"."searchable_full_name" FROM "users" INNER JOIN (SELECT "users"."id" AS pg_search_id, (ts_rank(("users"."searchable_full_name"), (to_tsquery('english', ''' ' || 'ja' || ' ''')), 0)) AS rank FROM "users" WHERE (("users"."searchable_full_name") @@ (to_tsquery('english', ''' ' || 'ja' || ' ''')))) AS pg_search_7dfb4cf67742cb0660305e ON "users"."id" = pg_search_7dfb4cf67742cb0660305e.pg_search_id ORDER BY pg_search_7dfb4cf67742cb0660305e.rank DESC, "users"."id" ASC LIMIT $1 [["LIMIT", 5]]
=> 114.21500006690621
But if I modify it to use prefix: true
, so I can match on partial words, the performance takes a nosedive:
pg_search_scope :search_by_full_name,
against: :searchable_full_name, # stored generated column tsvector
using: {
tsearch: {
dictionary: 'english',
tsvector_column: 'searchable_full_name',
prefix: true
}
}
Benchmark.ms { User.search_by_full_name("ja").limit(5).to_a }
User Load (3771.9ms) SELECT "users"."id", "users"."first_name", "users"."last_name", "users"."email", "users"."type", "users"."created_at", "users"."updated_at", "users"."password_digest", "users"."trips_count", "users"."drivers_license_number", "users"."searchable_full_name" FROM "users" INNER JOIN (SELECT "users"."id" AS pg_search_id, (ts_rank(("users"."searchable_full_name"), (to_tsquery('english', ''' ' || 'ja' || ' ''' || ':*')), 0)) AS rank FROM "users" WHERE (("users"."searchable_full_name") @@ (to_tsquery('english', ''' ' || 'ja' || ' ''' || ':*')))) AS pg_search_7dfb4cf67742cb0660305e ON "users"."id" = pg_search_7dfb4cf67742cb0660305e.pg_search_id ORDER BY pg_search_7dfb4cf67742cb0660305e.rank DESC, "users"."id" ASC LIMIT $1 [["LIMIT", 5]]
=> 3779.0429999586195
At 10 million records, even with prefix: true
, the performance stays pretty good. But as I ramped it up, around 25 million rows things get pretty bad and i'm curious if others have ideas on how to tune this for Rideshare?