pdfGPT
Is universal-sentence-encoder suitable only for English, or for multiple languages?
The most significant point of this work is that it uses universal-sentence-encoder instead of OpenAI embeddings. For English text, the embeddings seem good, judging by the KNN results built on them. For other languages, the results seem poor.
You are right. This work's main novelty is improving the embeddings by using Universal Sentence Encoder and pairing it with KNN to retrieve the best top-K results. Universal Sentence Encoder has not been tested extensively with languages other than English, so more needs to be explored. This code was written around January 2023, when this kind of interaction with PDF files was relatively new. There are now many advanced libraries for this work, but I am not sure why everyone is so obsessed with OpenAI embeddings. I have tested them in various scenarios and they gave poor results. I will upgrade this or create a new SOTA for multilingual document embeddings when I get time. Right now I am caught up with other things in life.
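For readers following along, here is a minimal sketch of the embed-then-KNN idea described above. It is not the project's actual code: the brute-force cosine ranking, the names `top_k`, `toy_embed`, and `load_use_embedder`, and the choice of the `universal-sentence-encoder/4` model are all assumptions made for illustration. The embedder is passed in as a plain function so the ranking logic can be tried without downloading a model.

```python
import math
from typing import Callable, List, Sequence, Tuple

# An embedder maps a list of texts to one vector per text (hypothetical alias).
Embedder = Callable[[List[str]], List[Sequence[float]]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, chunks: List[str], embed: Embedder,
          k: int = 3) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    vectors = embed(chunks + [query])
    query_vec = vectors[-1]
    scored = [(cosine(query_vec, vec), chunk)
              for vec, chunk in zip(vectors[:-1], chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def load_use_embedder() -> Embedder:
    """Assumption: tensorflow and tensorflow_hub are installed.

    Imported lazily so the rest of this sketch runs without them.
    """
    import tensorflow_hub as hub
    model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    return lambda texts: model(texts).numpy().tolist()


def toy_embed(texts: List[str]) -> List[List[int]]:
    """Stand-in embedder (letter counts), only for exercising top_k."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return [[t.lower().count(ch) for ch in alphabet] for t in texts]
```

Swapping `toy_embed` for `load_use_embedder()` should give the English-only behaviour discussed in this thread. A proper nearest-neighbour index would replace the linear scan on larger documents.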
I seem to have seen multilingual encoders in the Universal Sentence Encoder collection (https://tfhub.dev/google/collections/universal-sentence-encoder/1). I'm in much the same situation as 1991Troy, and I'm going to try different packages such as these. The sentence-transformers library also has encoders built for measuring similarity between texts, which I honestly found not bad when I used them in one of my projects.
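One rough way to compare candidate encoders before committing to one is to check whether each translated sentence lands nearest to its own English counterpart. The sketch below is only an illustration under stated assumptions: `cross_language_accuracy` and `load_sentence_transformer` are hypothetical helper names, and `paraphrase-multilingual-MiniLM-L12-v2` is just one multilingual sentence-transformers model picked as an example, not something tested in this thread.

```python
import math
from typing import Callable, List, Sequence, Tuple

# An embedder maps a list of texts to one vector per text (hypothetical alias).
Embedder = Callable[[List[str]], List[Sequence[float]]]


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cross_language_accuracy(pairs: List[Tuple[str, str]],
                            embed: Embedder) -> float:
    """Fraction of translations whose nearest source sentence is their own pair.

    `pairs` holds (source_sentence, translated_sentence) tuples.
    """
    if not pairs:
        raise ValueError("need at least one sentence pair")
    source_vecs = embed([src for src, _ in pairs])
    translated_vecs = embed([tr for _, tr in pairs])
    hits = 0
    for i, tr_vec in enumerate(translated_vecs):
        # Index of the source sentence closest to this translation.
        best = max(range(len(source_vecs)),
                   key=lambda j: _cosine(tr_vec, source_vecs[j]))
        hits += int(best == i)
    return hits / len(pairs)


def load_sentence_transformer(
        name: str = "paraphrase-multilingual-MiniLM-L12-v2") -> Embedder:
    """Assumption: the sentence-transformers package is installed.

    Imported lazily so the scoring function above runs without it.
    """
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(name)
    return lambda texts: model.encode(texts).tolist()
```

A score near 1.0 on a few dozen sentence pairs from your own documents would suggest the encoder aligns the two languages reasonably well. A low score on the same pairs would be consistent with the poor non-English results reported above.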