pdfGPT universal-sentence-encoder suitable for only English or multi languages?

Open 1991Troy opened this issue 2 years ago • 1 comments

The most significant point of this work is its use of universal-sentence-encoder instead of OpenAI embeddings. I used it for English embeddings, and judging by the KNN results based on those embeddings, it seems good. I also used it for embeddings in other languages, and there it does not seem good.

May 23 '23 09:05 1991Troy

You are right. This work's main novelty is improving the embeddings by using Universal Sentence Encoder and marrying it with KNN to get the best top-K results. Universal Sentence Encoder has not been tested extensively with languages other than English, so more needs to be explored. This code was written around January 2023, when all this interaction with PDF files was relatively new. Now there are many advanced libraries for this work, but I am not sure why everyone is so obsessed with OpenAI embeddings. I have tested them in various scenarios and they give poor results. I will upgrade this or create a new SOTA for multilingual document embeddings when I get time. Right now I am caught up with other things in life.

May 23 '23 12:05 bhaskatripathi
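
For readers following the thread, here is a minimal sketch of the "embeddings plus KNN top-K" idea described above. It is not the pdfGPT implementation. The names `top_k_chunks` and `toy_embed` are hypothetical, and `toy_embed` is a bag-of-words stand-in so the example runs without any model download. A real sentence encoder would be passed in as `embed` instead, assuming its output is first converted to lists of floats.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# Assumption: an encoder maps a batch of texts to one vector per text.
Embedder = Callable[[List[str]], List[List[float]]]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query: str, chunks: Sequence[str], embed: Embedder, k: int = 3
) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    # Embed chunks and query in one batch so they share the same vector space.
    vectors = embed(list(chunks) + [query])
    chunk_vecs, query_vec = vectors[:-1], vectors[-1]
    scored = [
        (cosine_similarity(query_vec, vec), chunk)
        for vec, chunk in zip(chunk_vecs, chunks)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in encoder: word counts over the batch vocabulary."""
    vocab = sorted({word for text in texts for word in text.lower().split()})
    return [
        [float(text.lower().split().count(word)) for word in vocab]
        for text in texts
    ]


if __name__ == "__main__":
    chunks = [
        "the cat sat on the mat",
        "stock prices fell sharply",
        "a dog chased the cat",
    ]
    for score, chunk in top_k_chunks("cat mat", chunks, toy_embed, k=2):
        print(f"{score:.3f}  {chunk}")
```

Keeping the encoder as a parameter means the retrieval step stays the same when one embedding model is swapped for another, which is the comparison this thread is about.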

I seem to recall seeing multilingual encoders in the Universal Sentence Encoder collection (https://tfhub.dev/google/collections/universal-sentence-encoder/1). I'm in much the same situation as 1991Troy, and I'm going to try different packages such as these. The sentence-transformers library also has encoders built to measure similarity between texts, which I honestly found not bad when I used it in one of my projects.

Jun 22 '23 13:06 FlowlionAI
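
One rough way to compare encoders on a non-English language is to check whether translated sentence pairs land closer together than unrelated pairs. The sketch below is an illustration under stated assumptions, not something taken from pdfGPT. `translation_pair_margin` and `load_multilingual_encoder` are hypothetical helper names, and the model name `paraphrase-multilingual-MiniLM-L12-v2` is just one example of a multilingual sentence-transformers model. The loader imports the library lazily, so the scoring function can be tried with any encoder.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Assumption: an encoder maps a batch of texts to one vector per text.
Encoder = Callable[[List[str]], List[List[float]]]


def load_multilingual_encoder(
    model_name: str = "paraphrase-multilingual-MiniLM-L12-v2",
) -> Encoder:
    """Wrap a sentence-transformers model as a plain Encoder.

    Requires the sentence-transformers package; imported lazily so the rest
    of this sketch works without it.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)

    def encode(texts: List[str]) -> List[List[float]]:
        # model.encode returns one row per input text.
        return [[float(x) for x in row] for row in model.encode(texts)]

    return encode


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def translation_pair_margin(
    encode: Encoder, pairs: Sequence[Tuple[str, str]]
) -> float:
    """Mean similarity of matched translations minus mean of mismatched ones.

    A clearly positive margin suggests the encoder places a sentence near its
    translation; a margin near zero suggests it does not.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to form mismatched examples")
    source_vecs = encode([source for source, _ in pairs])
    target_vecs = encode([target for _, target in pairs])
    matched, mismatched = [], []
    for i, source_vec in enumerate(source_vecs):
        for j, target_vec in enumerate(target_vecs):
            bucket = matched if i == j else mismatched
            bucket.append(cosine(source_vec, target_vec))
    return sum(matched) / len(matched) - sum(mismatched) / len(mismatched)
```

With the package installed, `translation_pair_margin(load_multilingual_encoder(), pairs)` could be run on a small list of hand-written (English, other-language) sentence pairs and compared against the same score from another encoder.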
