pg_search icon indicating copy to clipboard operation
pg_search copied to clipboard

`SELECT DISTINCT` triggers ActiveRecord::StatementInvalid in v1.0.0

Open monfresh opened this issue 10 years ago • 20 comments

I just upgraded to 1.0.0 from 0.7.9 and I'm now getting a bunch of failures for tests that perform a pg_search query. I looked at the Changelog and the Readme, but I don't see any instructions for upgrading from 0.7.9 to 1.0.0 or any mention of breaking changes, so I assumed everything should still work.

Here is the specific error I am getting:

ActiveRecord::StatementInvalid:
       PG::InvalidColumnReference: ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list
       LINE 1: ...ocations"."id" = pg_search.pg_search_id  ORDER BY pg_search....
                                                                    ^
       : SELECT  DISTINCT "locations".* FROM "locations" INNER JOIN (SELECT "locations"."id" AS pg_search_id, (ts_rank(("locations"."tsv_body"), (to_tsquery('english', ''' ' || 'changeme' || ' ''' || ':*')), 0)) AS rank FROM "locations" WHERE ((("locations"."tsv_body") @@ (to_tsquery('english', ''' ' || 'changeme' || ' ''' || ':*'))))) pg_search ON "locations"."id" = pg_search.pg_search_id  ORDER BY pg_search.rank DESC, "locations"."id" ASC LIMIT 30 OFFSET 0

And here is the query: https://github.com/codeforamerica/ohana-api/blob/master/app/models/concerns/search.rb#L26-35

Thoughts?

monfresh avatar Apr 19 '15 18:04 monfresh

In 1.0.0, I moved the search relevance calculation into a subquery, which I think is what triggered your bug. From this point forward, I plan on following Semantic Versioning 2.0.0, so a change to the MAJOR version represents a breaking change, as in the change from 0.x to 1.x.

It looks like we should add test coverage for the SELECT DISTINCT case to pg_search. I think that is what isn't currently working with the subquery. For now, I think a quick workaround might be to chain .select("pg_search.rank", "locations.id") onto the end of that particular call, or to add those two fields to any existing .select() calls you may have.

Please let me know if that works for you. In the meantime I will try to think of a more generic solution.

I'm going to change the title of this issue so to reflect that it's about SELECT DISTINCT.

Thanks for the bug report!

nertzy avatar Apr 19 '15 22:04 nertzy

I'm also having an issue with the new subquery. We are using activerecord_any_of to join pg_search scopes with other conditions via OR instead of AND and the subquery breaks the OR logic.

Is there any way to retain the old non-subquery method add an option to use it when desired?

aaronchi avatar Apr 20 '15 16:04 aaronchi

@aaronchi currently, no. Actually if you could open a separate issue for what you're seeing that would be helpful too so that we can try to add it to the test coverage.

I still haven't found the holy grail of a way to express the complexity of the pg_search query in a way that doesn't clobber other chained scopes. The subquery solution fixes a lot of the longstanding bugs, but adds some new ones as well, apparently.

I'd prefer not to add back the old query style if possible, but I'm still open to the idea.

nertzy avatar Apr 20 '15 17:04 nertzy

@nertzy Thanks for the details and the suggestion. Adding .select("pg_search.rank", "locations.id") to the end of my scope that uses pg_search only fixed a few search queries, but introduced a new error in other queries.

The search function in my API allows various parameters to be passed in, and the errors occur when the keyword parameter (which uses pg_search) is used alone or combined with other parameters.

Searches that only included the keyword parameter were the ones that resulted in the PG::InvalidColumnReference error. When I add your suggested fix to the keyword scope, like this:

scope :keyword, ->(keyword) { keyword_search(keyword).select("pg_search.rank", "locations.id") }

those same tests now fail with a different error:

Failure/Error: get api_search_index_url(keyword: 'food', subdomain: ENV['API_SUBDOMAIN'])
     ActiveModel::MissingAttributeError:
       missing attribute: organization_id

I also discovered another error caused by pg_search 1.0 that isn't resolved by your suggested fix. For a search that includes keyword and category parameters, I get the following error with or without your suggested fix:

Failure/Error: get api_search_index_url(category: 'Jobs', keyword: 'jobs', subdomain: ENV['API_SUBDOMAIN'])
     ActiveRecord::StatementInvalid:
       PG::UndefinedTable: ERROR:  invalid reference to FROM-clause entry for table "categories"
       LINE 1: ...N "locations"."id" = pg_search.pg_search_id WHERE "categorie...
                                                                    ^
       HINT:  Perhaps you meant to reference the table alias "categories_services_2".
       : SELECT  DISTINCT MAX("locations"."updated_at") FROM "locations" INNER JOIN "services" 
"services_locations" ON "services_locations"."location_id" = "locations"."id" INNER JOIN 
"categories_services" "services_categories_services_join" ON 
"services_categories_services_join"."service_id" = "services_locations"."id" INNER JOIN "categories" 
"categories_services_2" ON "categories_services_2"."id" = 
"services_categories_services_join"."category_id" INNER JOIN (SELECT "locations"."id" AS 
pg_search_id, (ts_rank(("locations"."tsv_body"), (to_tsquery('english', ''' ' || 'jobs' || ' ''' || ':*')), 0)) AS rank 
FROM "locations" INNER JOIN "services" ON "services"."location_id" = "locations"."id" INNER JOIN 
"categories_services" ON "categories_services"."service_id" = "services"."id" INNER JOIN "categories" 
ON "categories"."id" = "categories_services"."category_id" WHERE "categories"."name" = 'Jobs' AND 
((("locations"."tsv_body") @@ (to_tsquery('english', ''' ' || 'jobs' || ' ''' || ':*'))))) pg_search ON 
"locations"."id" = pg_search.pg_search_id WHERE "categories"."name" = 'Jobs' LIMIT 30 OFFSET 0

Note that I don't have a table called categories_services_2. That is definitely something new pg_search is doing because in version 0.7.9, the query does not try to use categories_services_2:

SELECT  DISTINCT MAX("locations"."updated_at") FROM "locations" INNER JOIN "services" ON 
"services"."location_id" = "locations"."id" INNER JOIN "categories_services" ON 
"categories_services"."service_id" = "services"."id" INNER JOIN "categories" ON "categories"."id" = 
"categories_services"."category_id" WHERE "categories"."name" = 'Jobs' AND ((("locations"."tsv_body") 
@@ (to_tsquery('english', ''' ' || 'jobs' || ' ''' || ':*')))) LIMIT 30 OFFSET 0

Note how much longer the query is in 1.0 versus 0.7.9.

monfresh avatar Apr 30 '15 02:04 monfresh

@monfresh could you try out the latest master? The fix for #235 might affect this use case as well.

nertzy avatar Apr 30 '15 22:04 nertzy

I am getting the exact same mistake as @monfresh when performing a chained SELECT DISTINCT query, even after updating pg_search to version 1.0.2. The workaround suggested by @nertzy worked, but it would still be good to investigate the error and find a generic solution.

The chaining query that results in the error is Product.find_by_store(store).search_by_name(name), and the workaround was chaining .select("pg_search.rank", "products.*") to it. Here are the details of the model and error message:

class Product < ActiveRecord::Base
  include PgSearch

  has_many :area_category_products
  has_many :area_categories, through: :area_category_products

  pg_search_scope :search_by_name, against: :name,
                                   using: { tsearch: { prefix: true,
                                                       any_word: true } },
                                   ignoring: :accents

  def self.find_by_store(store)
    Product.joins(area_categories: :area).where(areas: { store_id: store.id })
      .uniq
  end
ActiveRecord::StatementInvalid:
       PG::InvalidColumnReference: ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list
       LINE 1: ..._search_id WHERE "areas"."store_id" = 1  ORDER BY pg_search....
                                                                    ^
       : SELECT DISTINCT "products".* FROM "products" INNER JOIN "area_category_products" ON "area_category_products"."product_id" = "products"."id" INNER JOIN "area_categories" ON "area_categories"."id" = "area_category_products"."area_category_id" INNER JOIN "areas" ON "areas"."id" = "area_categories"."area_id" INNER JOIN (SELECT "products"."id" AS pg_search_id, (ts_rank((to_tsvector('simple', unaccent(coalesce("products"."name"::text, '')))), (to_tsquery('simple', ''' ' || unaccent('prod') || ' ''' || ':*')), 0)) AS rank FROM "products" WHERE (((to_tsvector('simple', unaccent(coalesce("products"."name"::text, '')))) @@ (to_tsquery('simple', ''' ' || unaccent('prod') || ' ''' || ':*'))))) pg_search ON "products"."id" = pg_search.pg_search_id WHERE "areas"."store_id" = 1  ORDER BY pg_search.rank DESC, "products"."id" ASC

LauraAntuna avatar May 06 '15 13:05 LauraAntuna

@nertzy I wanted to give you an update on this issue from the current master branch. If I don't change my current code, I get a similar error as the one I originally reported, but it's now referencing pg_search_locations.pg_search_id whereas before it was pg_search.pg_search_id:

Failure/Error: get api_search_index_url(keyword: 'jobs', per_page: 1, subdomain: ENV['API_SUBDOMAIN'])
     ActiveRecord::StatementInvalid:
       PG::InvalidColumnReference: ERROR:  for SELECT DISTINCT, ORDER BY expressions must appear in select list
       LINE 1: ..."id" = pg_search_locations.pg_search_id  ORDER BY pg_search_...
                                                                    ^
       : SELECT  DISTINCT "locations".* FROM "locations" INNER JOIN (SELECT "locations"."id" AS pg_search_id, (ts_rank(("locations"."tsv_body"), (to_tsquery('english', ''' ' || 'jobs' || ' ''' || ':*')), 0)) AS rank FROM "locations" WHERE ((("locations"."tsv_body") @@ (to_tsquery('english', ''' ' || 'jobs' || ' ''' || ':*'))))) pg_search_locations ON "locations"."id" = pg_search_locations.pg_search_id  ORDER BY pg_search_locations.rank DESC, "locations"."id" ASC LIMIT 1 OFFSET 0

This means that I need to change your original workaround to .select("pg_search_locations.rank", "locations.id")

It seems like the workaround is now specific to the model where the pg_search_scope is defined.

When I do that, I get the same error I reported in my previous update:

Failure/Error: get api_search_index_url(keyword: 'jobs', per_page: 1, subdomain: ENV['API_SUBDOMAIN'])
     ActiveModel::MissingAttributeError:
       missing attribute: organization_id

If I select all of the columns in the Locations table, as suggested by @LauraAntuna, then all my tests pass:

.select("pg_search_locations.rank", "locations.*")

Another thing I noticed is that the SQL is now longer. In version 0.7.9 of the gem, the SQL generated was:

SELECT DISTINCT "locations".*, ((ts_rank(("locations"."tsv_body"), (to_tsquery('english', ''' ' || 'job' || ' ''' || ':*')), 0))) AS pg_search_rank FROM "locations" WHERE ((("locations"."tsv_body") @@ (to_tsquery('english', ''' ' || 'job' || ' ''' || ':*'))))  ORDER BY pg_search_rank DESC, "locations"."id" ASC

Now, the SQL adds an INNER JOIN when there wasn't one before:

SELECT DISTINCT pg_search_locations.rank, locations.* FROM "locations" INNER JOIN (SELECT "locations"."id" AS pg_search_id, (ts_rank(("locations"."tsv_body"), (to_tsquery('english', ''' ' || 'job' || ' ''' || ':*')), 0)) AS rank FROM "locations" WHERE ((("locations"."tsv_body") @@ (to_tsquery('english', ''' ' || 'job' || ' ''' || ':*'))))) pg_search_locations ON "locations"."id" = pg_search_locations.pg_search_id  ORDER BY pg_search_locations.rank DESC, "locations"."id" ASC

I haven't tested it yet on a large dataset to see if there's a performance impact.

monfresh avatar May 31 '15 00:05 monfresh

I have the same issue using 1.0.4, and @LauraAntuna's solution works for me.

j-dexx avatar Jul 28 '15 08:07 j-dexx

For me, the solution was to add the with_pg_search_rank scope. I wonder if it's possible to detect this automatically when distinct is present.

travisdahlke avatar Sep 21 '15 15:09 travisdahlke

With 1.0.4 I had to add pg_search_series.rank to SELECTand switching to 1.0.5 needs pg_search_9cda38ea3a2bd78cbaecd4.rank, what is this stamp 9cda38ea3a2bd78cbaecd4, does it change between queries?

mateuszpalyz avatar Dec 06 '15 15:12 mateuszpalyz

@mateuszpalyz, just use the with_pg_search_rank scope, like this: Post.search_by_title('some title').with_pg_search_rank.uniq. See scope source here (v.1.0.5) if interested, it just adds proper selects as suggested above. Adding it automatically is definately a good idea!

ikido avatar Dec 30 '15 09:12 ikido

I'm using version 1.0.6 of this gem and I get: "undefined method `with_pg_search_rank' for #\u003cCompanyFriendshipLocation::ActiveRecord_Relation:0x007f906e728160\u003e"

victorhazbun avatar Oct 21 '16 00:10 victorhazbun

+1 .uniq kills the query

jchatel avatar Nov 22 '16 02:11 jchatel

@ikido's solution works perfectly, it should be documented or something.

mzahidriaz-tr avatar Apr 12 '17 02:04 mzahidriaz-tr

This is still an issue in 2.1.4. Using .uniq is a solid workaround, but it's not a true solution because there is a performance loss when moving the dedupe workload from Postgres to Ruby. Chaining .with_pg_search_rank.distinct fixed everything else, but broke when adding .count to the end of the query:

   ActiveRecord::StatementInvalid:
       PG::SyntaxError: ERROR:  syntax error at or near "AS"
       LINE 1: ...tegories.*, pg_search_a6216ea03e578f212dd604.rank AS pg_sear...
                                                                    ^
       : SELECT COUNT(DISTINCT categories.*, pg_search_a6216ea03e578f212dd604.rank AS pg_search_rank) FROM "categories" INNER JOIN "expert_categories" ON "expert_categories"."category_id" = "categories"."id" INNER JOIN "experts" ON "experts"."id" = "expert_categories"."expert_id" INNER JOIN "cities" ON "cities"."id" = "experts"."city_id" INNER JOIN (SELECT "categories"."id" AS pg_search_id, (ts_rank((setweight(to_tsvector('english', coalesce("categories"."name"::text, '')), 'A') || setweight(to_tsvector('english', coalesce("categories"."search_synonyms"::text, '')), 'B')), (to_tsquery('english', ''' ' || 'painting' || ' ''' || ':*')), 0)) AS rank FROM "categories" WHERE (((setweight(to_tsvector('english', coalesce("categories"."name"::text, '')), 'A') || setweight(to_tsvector('english', coalesce("categories"."search_synonyms"::text, '')), 'B')) @@ (to_tsquery('english', ''' ' || 'painting' || ' ''' || ':*'))))) AS pg_search_a6216ea03e578f212dd604 ON "categories"."id" = pg_search_a6216ea03e578f212dd604.pg_search_id WHERE "experts"."city_id" = $1 AND "experts"."status" = $2

We could use Ruby to count, but we'll continue to live with the .uniq workaround in fear that this would be a footgun to developers using scope for other purposes.

LilyReile avatar Mar 25 '19 20:03 LilyReile

@ikido's solution worked fine!

chrislabarge avatar May 21 '19 14:05 chrislabarge

Another workaround that works well with distinct, count and pg_search is to reorder(''). It looks like the 'order by rank' part is problematic, so if ordering by pg_search's rank isn't required, then this workaround would be a great fit.

Example, Post.search_by_title('some title').reorder('').distinct.count

HusseinElMotayam avatar Oct 18 '19 12:10 HusseinElMotayam

thx @HusseinElMotayam Post.search_by_title('some title').reorder('').distinct works fine and doesn't kill query like .uniq

IgrekYY avatar Jun 23 '20 07:06 IgrekYY

@HusseinElMotayam solution fixed it for me. many thanks

GMolini avatar Jan 11 '23 14:01 GMolini

Still seeing the same issue if trying to chain multiple search scopes together. Example Post.search_by_title('some title').search_by_body('some text').with_pg_search_rank.distinct will produce #<ActiveRecord::StatementInvalid: PG::InvalidColumnReference: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list

Looking into a fix, but haven't came up with anything yet. Thanks for any help!

adam-onboardpro avatar May 09 '23 19:05 adam-onboardpro