Could you please offer an example of doing semantic search using embeddings?

Open kubecutle opened this issue 2 years ago • 3 comments

The official documentation has an example in Python: https://platform.openai.com/docs/guides/embeddings. I am not seeing how this great Java library could help achieve the same, especially this line:

df['similarities'] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))

kubecutle avatar Feb 08 '23 00:02 kubecutle

Java supports lambdas, but I'm guessing you are requesting a feature to access embeddings from the endpoint: https://api.openai.com/v1/embeddings

cryptoapebot avatar Feb 08 '23 00:02 cryptoapebot

I guess it starts with the cosine_similarity() function. Is there an equivalent in the openai-java library? The other part is really my lack of understanding of the nature of df. It is constructed with this code segment:

```
import pandas as pd
import numpy as np

datafile_path = "data/fine_food_reviews_with_embeddings_1k.csv"

df = pd.read_csv(datafile_path)
df["embedding"] = df.embedding.apply(eval).apply(np.array)
```

First a CSV file is loaded, then I don't understand what they do with it and how to do the same with openai-java.

kubecutle avatar Feb 08 '23 01:02 kubecutle

I'm just sort of going off the cuff here.

Create a Java class with the fields in the CSV.

class FoodReview {
  Integer index;
  String ProductId;
  String UserId;
  Integer Score;
  String Summary;
  String Text;
  String Combined; // Title and Body both labeled & separated by ';'
  ArrayList<Double> embedding;

  // constructor w/ String[]
}

ArrayList<FoodReview> reviews = new ArrayList<FoodReview>();

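// UTF_8 instead of US_ASCII: the ASCII decoder throws on any non-ASCII byte in the review text
try (BufferedReader br = Files.newBufferedReader(pathToFile, StandardCharsets.UTF_8)) {
  br.readLine(); // skip the CSV header row
  String line;
  while ((line = br.readLine()) != null) { // read the next line on every pass, otherwise the loop never ends
    String[] attributes = line.split(","); // Or use CSVReader here as you might have issues with the last field "[1,2,3]"
    FoodReview fr = new FoodReview(attributes);
    reviews.add(fr);
  }
}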

df['embedding'] is simply grabbing the embedding field from the dataframe (df). That's equivalent to reading the value of the named field in Java.
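The Python line `df.embedding.apply(eval).apply(np.array)` turns each embedding cell, which the CSV stores as text such as "[0.1, -0.2, 0.3]", into a numeric array. Here is a rough Java sketch of the same step. It assumes the embedding cell is wrapped in double quotes because it contains commas, which you should check against your file. `EmbeddingCsv`, `splitCsvLine` and `parseEmbedding` are made-up helper names, not part of openai-java.

```java
import java.util.ArrayList;
import java.util.List;

class EmbeddingCsv {

    // Splits one CSV line on commas that sit outside double quotes.
    // A doubled quote ("") inside a quoted field is read as a literal quote.
    // Limitation: fields that span several lines are not handled.
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // escaped quote
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field has no trailing comma
        return fields;
    }

    // Converts text such as "[0.1, -0.2, 0.3]" into a double[].
    static double[] parseEmbedding(String text) {
        String inner = text.trim();
        if (inner.startsWith("[")) inner = inner.substring(1);
        if (inner.endsWith("]")) inner = inner.substring(0, inner.length() - 1);
        if (inner.trim().isEmpty()) return new double[0];
        String[] parts = inner.split(",");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i].trim());
        }
        return values;
    }
}
```

If the review text itself contains line breaks inside quotes, a real CSV library is the safer choice; the hand-rolled splitter is only meant to show why a plain `split(",")` breaks on the embedding column.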

You will now have all the values in your reviews list. From there you can do a for-each or stream() through the list. Call the OpenAI embeddings endpoint for the text you are searching with, and it'll return a list, e.g. [0.5, 0.959, 0.33]. Each review also has its stored embedding, fr.getEmbedding(), which returns a list, e.g. [0.33, 0.875, 0.55]. (If a review had no stored embedding, you would request one for fr.getCombined().) Now you can compute the cosine similarity of the two lists: https://stackoverflow.com/questions/520241/how-do-i-calculate-the-cosine-similarity-of-two-vectors
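As far as I know, cosine similarity is plain vector math and not something openai-java ships, so it's easy to write yourself. A minimal sketch, assuming both embeddings are held as `double[]` of equal length; `VectorMath` is just an illustrative name:

```java
class VectorMath {

    // cos(a, b) = (a . b) / (|a| * |b|), ranging from -1 to 1.
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            // Cosine is undefined for a zero vector; this sketch returns 0 as a convention.
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With the example lists above you would call `VectorMath.cosineSimilarity(queryEmbedding, reviewEmbedding)` once per review, which is what the Python `apply(lambda x: cosine_similarity(x, embedding))` does for every row.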

After looping through all of them, you can find the FoodReview object that is the closest and do something with it. Alternatively, you can keep a Map from each object to its similarity, e.g. HashMap<FoodReview, Double> mymap, and sort the entries by value: https://www.geeksforgeeks.org/sorting-a-hashmap-according-to-values/
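Sorting the map by value can be done with streams. A sketch, assuming the similarities have already been computed and stored in a map; `SimilarityRanking` and `topN` are hypothetical names, and the method is generic so it works with `FoodReview` keys or anything else:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class SimilarityRanking {

    // Returns up to n entries with the highest similarity, best match first.
    static <T> List<Map.Entry<T, Double>> topN(Map<T, Double> similarityByItem, int n) {
        return similarityByItem.entrySet().stream()
                .sorted(Map.Entry.<T, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

For example, `SimilarityRanking.topN(mymap, 3)` would give the three closest reviews. Since `FoodReview` does not override equals/hashCode, the map keys on object identity, which is fine here because each review object is created once.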

cryptoapebot avatar Feb 08 '23 02:02 cryptoapebot