Input files for pgxmine

Open rykovan opened this issue 2 years ago • 4 comments

@jakelever could you please elaborate on how to prepare input files for pgxmine? There is a pubmed_26736037.bioc.xml file in the "example" folder, but it is not clear how you obtained it. I tried to use the BioText project, but its output doesn't contain that file. What am I doing wrong?

$ snakemake --cores 1 downloaded.flag
$ snakemake --cores 1 converted.flag
$ snakemake --cores 1 pubtator_downloaded.flag
$ snakemake --cores 1 pubtator.flag
$ cd biocxml
$ grep '26736037' *.xml #nothing
$ cd ../pubtator
$ grep '26736037' *.xml #nothing

I'm trying to run pgxmine with one of the files output by BioText, but the result is empty:

$ ls -l example1
pubmed_test.bioc.xml -> ../../biotext/pubtator/pmc_baseline.oa_comm_xml.PMC008xxxxxx.baseline.2022-09-03_36.bioc.xml
$ python findPGxSentences.py --inBioc example1/pubmed_test.bioc.xml \
    --filterTermsFile pgx_filter_terms.txt \
    --outBioc example1/pubmed_test.sentences.bioc.xml

$ python getRelevantMeSH.py --inBioc example1/pubmed_test.bioc.xml \
    --outJSONGZ example1/pubmed_test.mesh.json.gz

$ python createKB.py \
    --trainingFiles data/annotations.variant_star_rs.bioc.xml,data/annotations.variant_other.bioc.xml \
    --inBioC example1/pubmed_test.sentences.bioc.xml \
    --selectedChemicals data/selected_chemicals.json \
    --dbsnp data/dbsnp_selected.tsv \
    --variantStopwords stopword_variants.txt \
    --genes data/gene_names.tsv \
    --relevantMeSH example1/pubmed_test.mesh.json.gz  \
    --outKB example1/pubmed_test.kb.tsv

$ python filterAndCollate.py \
    --inData example1 \
    --outUnfiltered example1/mini_unfiltered.tsv \
    --outCollated example1/mini_collated.tsv \
    --outSentences example1/mini_sentences.tsv

Output:

+ python findPGxSentences.py --inBioc example1/pubmed_test.bioc.xml --filterTermsFile pgx_filter_terms.txt --outBioc example1/pubmed_test.sentences.bioc.xml
Found 0 candidate sentences
+ python getRelevantMeSH.py --inBioc example1/pubmed_test.bioc.xml --outJSONGZ example1/pubmed_test.mesh.json.gz
Loaded PMIDs from corpus file...
Searching for MeSH terms in:  ['Adolescent', 'Adult', 'Aged', 'Birth Cohort', 'Child', 'Child, Preschool', 'Infant', 'Infant, Newborn', 'Middle Aged', 'Pediatrics', 'Young Adult']

Found 0 PubMed ID(s) with relevant MeSH terms
+ python createKB.py --trainingFiles data/annotations.variant_star_rs.bioc.xml,data/annotations.variant_other.bioc.xml --inBioC example1/pubmed_test.sentences.bioc.xml --selectedChemicals data/selected_chemicals.json --dbsnp data/dbsnp_selected.tsv --variantStopwords stopword_variants.txt --genes data/gene_names.tsv --relevantMeSH example1/pubmed_test.mesh.json.gz --outKB example1/pubmed_test.kb.tsv
Loaded chemical, gene and variant data
Loaded mesh PMIDs for pediatric/adult terms
Creating classifier for star_rs
Predicted 0 association(s) for star_rs variants
Creating classifier for other
Predicted 0 association(s) for other variants
+ python filterAndCollate.py --inData example1 --outUnfiltered example1/mini_unfiltered.tsv --outCollated example1/mini_collated.tsv --outSentences example1/mini_sentences.tsv
Found 1 PubMed files
Found 0 PMC files
0 records filtered to 0 sentences and collated to 0 chemical/variant associations
Written to example1/mini_sentences.tsv and example1/mini_collated.tsv
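
One way to narrow down a "Found 0 candidate sentences" result is to check whether the input file contains any documents at all, and whether any passage mentions a term of interest. The sketch below is only a rough diagnostic and is not part of pgxmine: it assumes the standard BioC XML layout (`document` elements containing `passage` elements with a `text` child), the function name `summarize_bioc` is hypothetical, and plain case-insensitive substring matching is a stand-in for whatever findPGxSentences.py actually does with pgx_filter_terms.txt.

```python
import xml.etree.ElementTree as ET


def summarize_bioc(path_or_file, terms):
    """Count documents, passages, and passages mentioning any of `terms`.

    Hypothetical helper: assumes the usual BioC layout of
    <collection><document><passage><text>...</text></passage></document>.
    """
    lowered = [t.lower() for t in terms if t.strip()]
    docs = passages = hits = 0
    # iterparse yields each element once its closing tag has been read
    for _event, elem in ET.iterparse(path_or_file, events=("end",)):
        if elem.tag == "passage":
            passages += 1
            # A passage without a <text> child is treated as empty text
            text = (elem.findtext("text") or "").lower()
            if any(t in text for t in lowered):
                hits += 1
        elif elem.tag == "document":
            docs += 1
            elem.clear()  # drop the finished document's children
    return {"documents": docs, "passages": passages, "passages_with_terms": hits}
```

Calling it as `summarize_bioc("example1/pubmed_test.bioc.xml", ["warfarin"])` (the term is just an illustration) would show whether the file is empty or not in the expected layout (`documents` is 0) or simply contains no matching text (`passages_with_terms` is 0).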

rykovan, Oct 28 '22 21:10

I tried to run it with snakemake as described in the README, but it creates empty (header-only) pgxmine/test_working/pgxmine_* files:

$ MODE=test snakemake --cores 1

Running it in full mode also produces empty files:

$ MODE=full BIOTEXT=../biotext/biocxml snakemake --cores 10
$ cat working/pgxmine_sentences.tsv | wc -l
1

rykovan, Oct 30 '22 09:10

Hey, I've added some documentation on how the single test file was created. I've also merged the example and test_data folders so there is a single test file used by run_example and the Snakemake script. Please give it another try.

Your full run should produce results (and should take a very long time to run). I put in an error check to make sure that the expected input files are present, so hopefully that will help too.
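
For readers hitting the same empty output, a check of this kind could look roughly like the sketch below. It is not the code in the pgxmine repository: the function name `check_inputs`, the `*.bioc.xml` pattern, and the rule that empty files do not count are all assumptions made for illustration.

```python
from pathlib import Path


def check_inputs(input_dir, pattern="*.bioc.xml"):
    """Return the non-empty input files, or raise if none can be found.

    Hypothetical helper: the file pattern is an assumption, not pgxmine's.
    """
    directory = Path(input_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"Input directory not found: {directory}")
    # Zero-byte files are ignored because they cannot contain any documents
    files = sorted(p for p in directory.glob(pattern) if p.stat().st_size > 0)
    if not files:
        raise RuntimeError(f"No non-empty files matching {pattern!r} in {directory}")
    return files
```

Failing early like this turns a silent header-only output into an explicit error that points at the missing or empty inputs.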

jakelever, Nov 04 '22 11:11

Thank you for your effort! Do you know why the CI/CD test is failing?

rykovan, Nov 04 '22 12:11

@jakelever any updates on this? Apparently something is not working properly.

rykovan, Nov 11 '22 19:11