Is there a snorkel_labels_train.xlsx file anywhere?

Open jambo6 opened this issue 2 years ago • 5 comments

I'd like to utilise these labels for another project. It seems the folder

snorkeling/disease_gene/disease_associates_gene/data/sentences

should also have snorkel_labels_train.xlsx to go along with its test and dev files. Does this exist and if so is there any chance of getting access?

jambo6 avatar Oct 29 '21 14:10 jambo6

> should also have snorkel_labels_train.xlsx to go along with its test and dev files. Does this exist and if so is there any chance of getting access?

This folder only contains sentences that were hand-labeled for this project. The train version isn't available, as it is supposed to consist of all the remaining documents within PubTator. The resulting file would be too big for GitHub to host on LFS (max file size is 2GB).

Currently, the main way to get those sentences is to download a snapshot of PubTator Central and extract the sentences into a database. Alternatively, I have a snapshot of the database used for this project that you could import (118GB); however, we would need to figure out how to transfer a file that large. My overall recommendation is the first option, as it gives you the most current version for whichever project you are working on.
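A rough sketch of what the first option could look like, assuming a PubTator-format text dump (`PMID|t|title`, `PMID|a|abstract`, then tab-separated annotation lines with start, end, mention, and type). This is not the pipeline used in this repository: the function names, the SQLite table layout, and the naive punctuation-based sentence splitter are all hypothetical, and the splitter will break on abbreviations such as "e.g.".

```python
import re
import sqlite3


def parse_pubtator(text):
    """Yield (pmid, passage, mentions) for each document in PubTator-format text."""
    for block in re.split(r"\n\s*\n", text.strip()):
        pmid, title, abstract, mentions = None, "", "", []
        for line in block.splitlines():
            head = line.split("|", 2)
            if len(head) == 3 and head[0].isdigit() and head[1] in ("t", "a"):
                pmid = head[0]
                if head[1] == "t":
                    title = head[2]
                else:
                    abstract = head[2]
                continue
            cols = line.split("\t")
            # Annotation line: pmid, start, end, mention, type[, identifier]
            if len(cols) >= 5 and cols[1].isdigit() and cols[2].isdigit():
                mentions.append((int(cols[1]), int(cols[2]), cols[3], cols[4]))
        if pmid is not None:
            # Annotation offsets count the title, one separator character, then the abstract.
            yield pmid, title + " " + abstract, mentions


def sentence_spans(passage):
    """Naive splitter (assumption): a sentence ends at ., ! or ? followed by whitespace."""
    start = 0
    for match in re.finditer(r"[.!?]\s+", passage):
        yield start, match.start() + 1
        start = match.end()
    if start < len(passage):
        yield start, len(passage)


def load_sentences(text, conn):
    """Store each sentence with the entity types mentioned inside it (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentence "
        "(pmid TEXT, start INTEGER, text TEXT, entity_types TEXT)"
    )
    for pmid, passage, mentions in parse_pubtator(text):
        for start, end in sentence_spans(passage):
            types = sorted(
                {etype for (s, e, _, etype) in mentions if s >= start and e <= end}
            )
            conn.execute(
                "INSERT INTO sentence VALUES (?, ?, ?, ?)",
                (pmid, start, passage[start:end], ",".join(types)),
            )
    conn.commit()
```

Candidate sentences for a relation such as Disease Associates Gene could then be selected by filtering on rows whose `entity_types` contains both `Disease` and `Gene`.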

danich1 avatar Nov 01 '21 15:11 danich1

I was after the hand labelled train/dev/test sentences to bolster my dataset for a similar RE project, not the entire pubtator db. Would it be okay for me to use these and if so, is there a straightforward method to download just these sentences with hand labellings?

jambo6 avatar Nov 02 '21 12:11 jambo6

> I was after the hand labelled train/dev/test sentences to bolster my dataset for a similar RE project, not the entire pubtator db. Would it be okay for me to use these and if so, is there a straightforward method to download just these sentences with hand labellings?

Sure. I can't guarantee that train.xlsx exists or has many annotated sentences, but here are quick links to the data currently available:

- Compound Treats Disease Train
- Compound Treats Disease Dev
- Compound Treats Disease Test

- Disease Associates Gene Dev
- Disease Associates Gene Test

- Gene interacts Gene Train
- Gene interacts Gene Dev
- Gene interacts Gene Test

Compound binds Gene would take me a while to get to you, so let me know if you need it.

danich1 avatar Nov 02 '21 15:11 danich1

So are there no hand-crafted labels for Disease Associates Gene Train?

jambo6 avatar Nov 04 '21 12:11 jambo6

I forgot to upload it to this repository, but here is the file you requested: Disease Associates Gene Train

danich1 avatar Nov 04 '21 18:11 danich1