NeuralCitationNetwork

Dataset configuration criteria

Open janguck opened this issue 6 years ago • 10 comments

I downloaded the dataset from "https://psu.app.box.com/v/refseer", which you mentioned, and it contains about 110 million records. You used 4,549,267 records (training: 4,258,383 (up to 2012), validation: 141,957 (2013), test: 148,927 (2014 onward)). Can you tell me the criteria you used to select them? For example, did you drop records that have no title or year?

Thanks.

Best regards.

janguck avatar Mar 04 '19 07:03 janguck

I inserted all the documents into MongoDB, then performed some preprocessing. Here are the notes/commands I used to prepare the data.

Notes_for_NCN.pdf

tebesu avatar Mar 07 '19 03:03 tebesu

Thank you for the excellent answer. I have one more question: is the train/validation/test split based on the year of the citing paper or of the cited paper?

Thanks. Best regards.

janguck avatar Mar 11 '19 12:03 janguck

I split the data by the citing paper year.
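A minimal sketch of such a split, assuming the year boundaries quoted earlier in this thread (up to 2012 for training, 2013 for validation, 2014 onward for testing). The record layout and the names `assign_split`, `split_counts`, `citing_year`, and `context_id` are hypothetical and are not taken from the actual preprocessing notes.

```python
from collections import Counter

# Split sizes as reported in this thread; they sum to the stated total.
REPORTED_COUNTS = {"train": 4_258_383, "valid": 141_957, "test": 148_927}
REPORTED_TOTAL = 4_549_267


def assign_split(citing_year):
    """Map the year of the citing paper to a split name."""
    if citing_year <= 2012:
        return "train"
    if citing_year == 2013:
        return "valid"
    return "test"  # 2014 and later


def split_counts(records):
    """Count how many citation contexts fall into each split."""
    return Counter(assign_split(r["citing_year"]) for r in records)


# Toy records (hypothetical layout: one dict per citation context).
records = [
    {"context_id": 1, "citing_year": 2005},
    {"context_id": 2, "citing_year": 2012},
    {"context_id": 3, "citing_year": 2013},
    {"context_id": 4, "citing_year": 2014},
    {"context_id": 5, "citing_year": 2016},
]

counts = split_counts(records)
print(dict(counts))
print(sum(REPORTED_COUNTS.values()) == REPORTED_TOTAL)
```

Running the same counting step over the preprocessed data and comparing the result with `REPORTED_COUNTS` is one way to check whether a reproduction matches the reported split sizes.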

tebesu avatar Mar 12 '19 16:03 tebesu

Thanks for your answer.

However, the data from 'https://psu.app.box.com/v/refseer' contains only 112903 citations with year > 2012, so the validation and test sets do not match the citation counts you reported.

SELECT count(*) FROM kdd2019.citations where kdd2019.citations.year>2012;

Thanks

janguck avatar Mar 13 '19 00:03 janguck

What do you mean? After preprocessing?

tebesu avatar Mar 18 '19 03:03 tebesu

No, I just loaded the .sql file as is. Is that where the problem is?

janguck avatar Mar 19 '19 01:03 janguck

I believe there are some problems with the SQL file, so I did some preprocessing and then inserted the data into MongoDB.

Take a look at https://github.com/harrywy/NPM

tebesu avatar Mar 19 '19 02:03 tebesu

Hi @tebesu, I'm having trouble obtaining the same training/validation/test sets as described in the paper. Do you maybe have a list of the citation context IDs from the SQL dumps that were used in the experiments?

zoranmedic avatar Dec 16 '19 11:12 zoranmedic

@zoranmedic

It should be in the dataset I provided.

tebesu avatar Dec 21 '19 21:12 tebesu

@tebesu Right, I found it, thanks!

zoranmedic avatar Dec 26 '19 15:12 zoranmedic