semantic-text-similarity

Performance on Quora QA data set

Open: Chandrak1907 opened this issue 4 years ago • 1 comment

I used this model on the Quora QA data set (http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv). The performance of the model is below:

|                  | Model_output = 0 | Model_output = 1 |
|------------------|------------------|------------------|
| is_duplicate = 0 | 218,328          | 36,696           |
| is_duplicate = 1 | 72,739           | 76,524           |
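For reference, the four counts above can be turned into the usual summary metrics. This is a minimal sketch that only does arithmetic on the numbers reported in this post; the function name `summarize` is made up for illustration, and "positive" is assumed to mean `is_duplicate = 1`.

```python
# Counts copied from the crosstab in this post
# (rows: is_duplicate, columns: model_output).
tn, fp = 218_328, 36_696   # is_duplicate = 0: predicted 0, predicted 1
fn, tp = 72_739, 76_524    # is_duplicate = 1: predicted 0, predicted 1


def summarize(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 for the positive class."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)   # share of predicted duplicates that are real
    recall = tp / (tp + fn)      # share of real duplicates that were found
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


metrics = summarize(tn, fp, fn, tp)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall is the weakest of the four here: roughly half of the true duplicates fall below the cut-off used in the code.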

Do you have any suggestions for improving the performance of the model?

Code is here:

    import pandas as pd
    from semantic_text_similarity.models import WebBertSimilarity
    from semantic_text_similarity.models import ClinicalBertSimilarity

    web_model = WebBertSimilarity(device='cuda', batch_size=10)  # defaults to GPU prediction

    web_model.predict([("She won an olympic gold medal", "The women is an olympic champion")])

    # Quora

    def check_score(row):
        # predict() takes a list of pairs; score one pair and take the single result
        return web_model.predict([(row['question1'], row['question2'])])[0]

    t2 = pd.read_csv("./quora_duplicate_questions.tsv", sep='\t')
    t3 = t2.dropna().copy()  # copy so the new column is not assigned on a slice of t2
    t3['model_score'] = t3.apply(check_score, axis=1)
    t3.to_csv("./t3_Jan10.csv", index=False)

    t3 = pd.read_csv("./t3_Jan10.csv")
    # Mean score for non-duplicates and for duplicates
    t3[t3.is_duplicate == 0]['model_score'].mean()
    t3[t3.is_duplicate == 1]['model_score'].mean()

    # Label a pair as duplicate when its score exceeds 3.71, then tabulate
    t3['model_output'] = 0
    t3.loc[t3.model_score > 3.71, 'model_output'] = 1
    pd.crosstab(t3.is_duplicate, t3.model_output)
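The code above uses a fixed cut-off of 3.71 without saying how it was chosen. One option, sketched here under assumptions, is to sweep candidate cut-offs and keep the one with the best F1 on labelled pairs. The helper names, the toy scores, and the 0 to 5 candidate range are all made up for illustration and are not part of the library.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 for the rule 'predict duplicate when score > threshold'."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s > threshold and y == 1)
    fp = sum(1 for s, y in pairs if s > threshold and y == 0)
    fn = sum(1 for s, y in pairs if s <= threshold and y == 1)
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0 to avoid dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels, candidates):
    """Return the candidate with the highest F1 (first one wins on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Toy data, invented for this example only.
toy_scores = [0.5, 1.2, 2.8, 3.1, 3.9, 4.4]
toy_labels = [0, 0, 1, 0, 1, 1]
candidates = [i / 10 for i in range(51)]  # assumed score range: 0.0 to 5.0

threshold = best_threshold(toy_scores, toy_labels, candidates)
print(threshold, f1_at_threshold(toy_scores, toy_labels, threshold))
```

With the data frame from the post, the inputs would be `t3.model_score.tolist()` and `t3.is_duplicate.tolist()`. Picking the cut-off on one portion of the data and reporting the crosstab on another avoids an optimistic estimate.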

Chandrak1907, Jan 13 '20 02:01

Fine-tune on your task-specific data. Best of luck!
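The reply does not say how to fine-tune, and the library's training interface is not shown in this thread. As a first step that is independent of any training API, the labelled pairs can be split so that part of the data is held out for evaluation. The sketch below is one way to do that with the standard library; `stratified_split` and the toy rows are hypothetical, and only the column names come from the original post.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, held_out_fraction=0.2, seed=0):
    """Split rows into (train, held_out), keeping the label ratio in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, held_out = [], []
    for group in by_label.values():
        rng.shuffle(group)  # shuffle within each label group
        cut = int(round(len(group) * held_out_fraction))
        held_out.extend(group[:cut])
        train.extend(group[cut:])
    return train, held_out


# Toy rows, invented for this example; real rows would come from the TSV file.
toy_rows = [
    {"question1": f"q{i}a", "question2": f"q{i}b", "is_duplicate": i % 2}
    for i in range(10)
]
train_rows, held_out_rows = stratified_split(toy_rows, "is_duplicate")
print(len(train_rows), len(held_out_rows))
```

The model would then be fine-tuned on `train_rows` only, with the crosstab computed on `held_out_rows`.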

AndriyMulyar, Jan 13 '20 02:01