Adding support for Threshold, Limit, and Order Arguments

Open sebscholl opened this issue 2 years ago • 7 comments

This pull request adds three keyword arguments to the nearest_neighbors method:

order

movie = Movie.find_by(name: "Star Wars (1977)")
# Order all results by the neighbor_distance column in descending order
movie.nearest_neighbors(:factors, distance: "inner_product", order: { neighbor_distance: :desc })

limit

movie = Movie.find_by(name: "Star Wars (1977)")
# Limit the results to 3 records
movie.nearest_neighbors(:factors, distance: "inner_product", limit: 3)

threshold

movie = Movie.find_by(name: "Star Wars (1977)")
# Only return records where the neighbor_distance is greater than or equal to 0.9
movie.nearest_neighbors(:factors, distance: "inner_product", threshold: { gte: 0.9 })

Multiple Options

All options can be used at the same time or separately.

movie = Movie.find_by(name: "Star Wars (1977)")

# Only return 5 records where the neighbor_distance is greater than or equal to 0.9 in descending order
movie.nearest_neighbors(
  :factors,
  distance: "inner_product", 
  limit: 5,
  threshold: { gte: 0.9 },
  order: { neighbor_distance: :desc }
)

These options modify the SQL statement generated by ActiveRecord. All original test suites are intact and passing, and new tests cover the new options.

sebscholl avatar Sep 02 '23 16:09 sebscholl

Hi @sebscholl, thanks for the PR.

  1. nearest_neighbors currently returns a relation, so you can limit with limit(n) or first(n).
  2. Results are currently ordered by distance. However, if you have a default scope on the model, that'll take precedence.
  3. For thresholds, you can use where("(embedding <#> ?) * -1 > ?", vector, 0.9) or filter in memory with select { |v| v.neighbor_distance > 0.9 }. I may add an option for this at some point, but want to think more about the design.
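As a rough illustration of points 1 and 3, the in-memory threshold filter and first(n) can be chained. The sketch below does not touch a database: FakeMovie is a hypothetical Struct standing in for the records that nearest_neighbors returns (which expose neighbor_distance), and the distance values are made up for the example.

```ruby
# Hypothetical stand-in for records returned by nearest_neighbors.
# Real records expose neighbor_distance; this Struct exists only for illustration.
FakeMovie = Struct.new(:name, :neighbor_distance)

# Made-up sample values, in the order a query might return them.
results = [
  FakeMovie.new("A", 0.97),
  FakeMovie.new("B", 0.93),
  FakeMovie.new("C", 0.88),
  FakeMovie.new("D", 0.95)
]

# Threshold: keep only records whose neighbor_distance exceeds 0.9 (point 3).
above_threshold = results.select { |v| v.neighbor_distance > 0.9 }

# Limit: take the first two of the remaining records (point 1).
top_two = above_threshold.first(2)

puts top_two.map(&:name).inspect
```

Filtering in memory loads every candidate row first, so the where(...) form is likely the better fit when the table is large.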

ankane avatar Sep 24 '23 19:09 ankane

Makes sense. Do you think it would be helpful to add this info to the docs (e.g., where("(embedding <#> ?) * -1 > ?", vector, 0.9)), or would you prefer to sit tight until you have more clarity on the design? Lmk, and I can make an update if it would help.

sebscholl avatar Sep 26 '23 12:09 sebscholl

@ankane I believe that when using a class method like Movie.nearest_neighbors(embedding, my_gen_embedding, ...), the results are not ordered by distance. Instead I'm getting ORDER BY "text_nodes"."id" ASC LIMIT $1 on these queries. I'm encountering this exact problem, so I'll implement the query manually as a workaround for now.

P.S. I set Movie.unscoped {} but am still getting ORDER BY ID. AFAIK there is no way to order by distance with the gem.

P.P.S. I set .order(Arel.sql("neighbor_distance DESC")), but that wasn't applied to the query, which still orders by ID.
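For anyone attempting the manual workaround mentioned above, one possible starting point is to build the ORDER BY fragment by hand. This is only a sketch: vector_literal, distance_order_sql, and OPERATORS are hypothetical helpers, not part of the neighbor gem, and the column name and vector are placeholders.

```ruby
# Hypothetical helpers for building a pgvector ORDER BY fragment by hand.
# None of these names come from the neighbor gem.

# pgvector distance operators, keyed by distance type.
OPERATORS = { euclidean: "<->", cosine: "<=>", inner_product: "<#>" }.freeze

# Build a pgvector literal such as [1.0,2.0,3.0] from a Ruby array.
# Float() raises on non-numeric input, so arbitrary strings never reach the SQL.
def vector_literal(vector)
  "[" + vector.map { |x| Float(x) }.join(",") + "]"
end

# Return a fragment like: embedding <=> '[1.0,2.0,3.0]'
def distance_order_sql(column, vector, distance)
  operator = OPERATORS.fetch(distance) do
    raise ArgumentError, "Unsupported distance type: #{distance}"
  end
  # Allow only plain identifiers, because the column is interpolated into SQL.
  unless column.to_s.match?(/\A[a-z_][a-z0-9_]*\z/)
    raise ArgumentError, "Unsafe column name: #{column}"
  end
  "#{column} #{operator} '#{vector_literal(vector)}'"
end

puts distance_order_sql(:embedding, [1, 2, 3], :cosine)
```

With ActiveRecord, the fragment could then be passed along the lines of Movie.reorder(Arel.sql(distance_order_sql(:embedding, my_gen_embedding, :cosine))), since reorder replaces any ordering already on the relation. Whether that resolves the ORDER BY id behavior described above has not been verified here.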

gvkhna avatar Oct 22 '23 22:10 gvkhna

Regarding thresholds, if others are working on this, here is some relevant code I came up with:

def filter_by_within_distance(scope)
  return scope unless @params[:within_distance] && @params[:distance_type]

  distance_type = @params[:distance_type].to_sym

  # Determine the correct pgvector operator based on distance type
  operator = case distance_type
             when :euclidean
               "<->"
             when :cosine
               "<=>"
             when :inner_product
               "<#>"
             else
               raise ArgumentError, "Unsupported distance type: #{@params[:distance_type]}"
             end

  # NOTE: the column name is interpolated directly into the SQL string,
  # so @params[:search_vector_column] must come from a trusted source.
  condition_pattern = if distance_type == :inner_product
                        # <#> returns the negative inner product, so multiply by -1
                        "((#{@params[:search_vector_column]} #{operator} '[?]') * -1) < ?"
                      else
                        "(#{@params[:search_vector_column]} #{operator} '[?]') < ?"
                      end

  scope.where(condition_pattern, @query_vector, @params[:within_distance])
end

Some nuances to take note of:

  • When used within a Rails order clause, I needed to wrap the vector in single quotes and the square bracket to end up with valid SQL: '[?]'
  • This is elementary but not obvious at first to those new to pgvector: whichever distance type you use for the ordering, you need to use the same operator (<->, <=>, <#>) in the condition that limits results by distance.
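To make the operator nuance concrete, here is a plain-Ruby sketch of what each pgvector operator computes: <-> is Euclidean distance, <=> is cosine distance (1 minus cosine similarity), and <#> is the negative inner product, which is why the code above multiplies by -1 in the inner-product branch. The helper names are illustrative only, and the two vectors are arbitrary.

```ruby
# Plain-Ruby equivalents of the pgvector operators, for illustration only.

# <-> : Euclidean (L2) distance
def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

def inner_product(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# <#> : negative inner product (smaller means more similar)
def negative_inner_product(a, b)
  -inner_product(a, b)
end

# <=> : cosine distance, i.e. 1 - cosine similarity
def cosine_distance(a, b)
  norm = ->(v) { Math.sqrt(inner_product(v, v)) }
  1.0 - inner_product(a, b) / (norm.call(a) * norm.call(b))
end

a = [1.0, 2.0]
b = [3.0, 4.0]

puts "euclidean (<->):              #{euclidean_distance(a, b)}"
puts "cosine (<=>):                 #{cosine_distance(a, b)}"
puts "negative inner product (<#>): #{negative_inner_product(a, b)}"
puts "inner product ((<#>) * -1):   #{negative_inner_product(a, b) * -1}"
```

Because the three operators return values on different scales, a threshold chosen for one distance type does not carry over to another.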

vestedpr-dev avatar Apr 22 '24 15:04 vestedpr-dev

@ankane Any thoughts on how best to support ordering by distance?

ccurtisj avatar Apr 27 '24 11:04 ccurtisj