neighbor
Adding support for Threshold, Limit, and Order Arguments
This pull request adds three keyword arguments to the nearest_neighbors method:
order
movie = Movie.find_by(name: "Star Wars (1977)")
# Order all results by the neighbor_distance column in descending order
movie.nearest_neighbors(:factors, distance: "inner_product", order: { neighbor_distance: :desc })
limit
movie = Movie.find_by(name: "Star Wars (1977)")
# Limit the results to 3 records
movie.nearest_neighbors(:factors, distance: "inner_product", limit: 3)
threshold
movie = Movie.find_by(name: "Star Wars (1977)")
# Only return records where the neighbor_distance is greater than or equal to 0.9
movie.nearest_neighbors(:factors, distance: "inner_product", threshold: { gte: 0.9 })
Multiple Options
All options can be used at the same time or separately.
movie = Movie.find_by(name: "Star Wars (1977)")
# Only return 5 records where the neighbor_distance is greater than or equal to 0.9 in descending order
movie.nearest_neighbors(
  :factors,
  distance: "inner_product",
  limit: 5,
  threshold: { gte: 0.9 },
  order: { neighbor_distance: :desc }
)
These options modify the SQL statement generated by ActiveRecord. All original test suites are intact and passing, and new tests were added to cover the new options.
Hi @sebscholl, thanks for the PR.
- nearest_neighbors currently returns a relation, so you can limit with limit(n) or first(n).
- Results are currently ordered by distance. However, if you have a default scope on the model, that'll take precedence.
- For thresholds, you can use where("(embedding <#> ?) * -1 > ?", vector, 0.9) or filter in memory with select { |v| v.neighbor_distance > 0.9 }. I may add an option for this at some point, but want to think more about the design.
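As a rough illustration of the in-memory option mentioned above, here is a minimal sketch. It uses a hypothetical FakeMovie struct as a stand-in for the records that nearest_neighbors returns, so it runs without a database; the names and distance values are made up for the example.

```ruby
# Hypothetical stand-in for records returned by nearest_neighbors.
# Real results are model instances that expose neighbor_distance.
FakeMovie = Struct.new(:name, :neighbor_distance)

# Made-up results, already ordered as the caller received them.
results = [
  FakeMovie.new("A", 0.97),
  FakeMovie.new("B", 0.93),
  FakeMovie.new("C", 0.85),
  FakeMovie.new("D", 0.40)
]

# Threshold: keep only records whose neighbor_distance exceeds 0.9.
above_threshold = results.select { |v| v.neighbor_distance > 0.9 }

# Limit: take the first n of the remaining records.
top_one = above_threshold.first(1)
```

Filtering in memory loads every candidate row first, so for large tables the where-based condition is likely the better fit.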
Makes sense. Do you think it would be helpful to add this info to the docs (e.g., where("(embedding <#> ?) * -1 > ?", vector, 0.9)), or would you prefer to sit tight until you have more clarity on the design? Lmk, and I can make an update if it would help.
@ankane I believe that when using the class method, like Movie.nearest_neighbors(embedding, my_gen_embedding, ...), the results are not ordered by distance. Instead I'm getting ORDER BY "text_nodes"."id" ASC LIMIT $1 on these queries. I'm encountering this exact problem, so I'll implement the query manually as a workaround for now.
P.S. I tried Movie.unscoped {} but am still getting ORDER BY ID. AFAIK there is no way to order by distance with the gem.
P.P.S. I also tried .order(Arel.sql("neighbor_distance DESC")), but it wasn't applied to the query, which is still ordered by ID.
Regarding thresholds, if others are working on this, here is some relevant code I came up with:
def filter_by_within_distance(scope)
  return scope unless @params[:within_distance] && @params[:distance_type]

  distance_type = @params[:distance_type].to_sym

  # Determine the correct pgvector operator based on distance type
  operator = case distance_type
             when :euclidean
               "<->"
             when :cosine
               "<=>"
             when :inner_product
               "<#>"
             else
               raise ArgumentError, "Unsupported distance type: #{@params[:distance_type]}"
             end

  # Note: the column name is interpolated into the SQL, so it must come from a trusted source
  condition_pattern = if distance_type == :inner_product
                        # <#> returns the negative inner product, so multiply by -1
                        "((#{@params[:search_vector_column]} #{operator} '[?]') * -1) < ?"
                      else
                        "(#{@params[:search_vector_column]} #{operator} '[?]') < ?"
                      end

  scope.where(condition_pattern, @query_vector, @params[:within_distance])
end
Some nuances to take note of:
- When used within a Rails order clause, I needed to wrap the vector in single quotes and the square bracket to end up with valid SQL: '[?]'
- This is elementary but not obvious at first to those new to pgvector: the threshold condition must use the same distance operator (<->, <=>, <#>) as the distance type you're ordering by.
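To make the operator pairing concrete, here is a minimal sketch that only builds the SQL condition string, with no database involved. DISTANCE_OPERATORS, distance_expression, and threshold_condition are hypothetical names for this example, not part of the neighbor gem, and the column name is assumed to be a trusted identifier.

```ruby
# Hypothetical mapping from distance type to pgvector operator.
DISTANCE_OPERATORS = {
  euclidean: "<->",
  cosine: "<=>",
  inner_product: "<#>"
}.freeze

# Builds the distance expression for a trusted column name.
# <#> returns the negative inner product, so it is multiplied by -1.
def distance_expression(column, distance_type)
  type = distance_type.to_sym
  operator = DISTANCE_OPERATORS.fetch(type) do
    raise ArgumentError, "Unsupported distance type: #{distance_type}"
  end
  expr = "#{column} #{operator} '[?]'"
  type == :inner_product ? "((#{expr}) * -1)" : "(#{expr})"
end

# Builds a "distance below threshold" condition with two bind placeholders:
# the first for the vector values, the second for the threshold.
def threshold_condition(column, distance_type)
  "#{distance_expression(column, distance_type)} < ?"
end
```

The resulting string could then be passed to where along with the query vector and the threshold, in the same way as the method above.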
@ankane Any thoughts on how best to support ordering by distance?