pgxmine
Input files for pgxmine
@jakelever could you please elaborate on how to prepare input files for pgxmine? There is a pubmed_26736037.bioc.xml file in the "example" folder, but it is not clear how you obtained it. I tried the BioText project, but its output doesn't contain that file. What am I doing wrong?
$ snakemake --cores 1 downloaded.flag
$ snakemake --cores 1 converted.flag
$ snakemake --cores 1 pubtator_downloaded.flag
$ snakemake --cores 1 pubtator.flag
$ cd biocxml
$ grep '26736037' *.xml #nothing
$ cd ../pubtator
$ grep '26736037' *.xml #nothing
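A structured alternative to the grep commands above is to walk each BioC XML file and compare document identifiers directly. The following is a minimal sketch, not part of pgxmine or BioText. The function name `find_pmid` is hypothetical, and it assumes the PubMed ID appears either as the text of the document's `<id>` element or in an `<infon key="pmid">` element, which may not hold for every file BioText produces.

```python
import glob
import os
import xml.etree.ElementTree as ET


def find_pmid(directory, pmid):
    """Return the BioC XML files in `directory` that contain a document for `pmid`.

    Assumption: the PMID is stored in <document><id> or in an
    <infon key="pmid"> element somewhere inside the document.
    """
    matches = []
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        with open(path, "rb") as handle:
            # Stream the file so large archives are not loaded at once.
            for _, elem in ET.iterparse(handle, events=("end",)):
                if elem.tag != "document":
                    continue
                doc_id = (elem.findtext("id") or "").strip()
                infon_ids = [
                    (infon.text or "").strip()
                    for infon in elem.iter("infon")
                    if infon.get("key") == "pmid"
                ]
                found = doc_id == pmid or pmid in infon_ids
                elem.clear()  # release the memory held by this document
                if found:
                    matches.append(path)
                    break
    return matches


if __name__ == "__main__":
    for hit in find_pmid("biocxml", "26736037"):
        print(hit)
```

An empty result here would confirm that the document is absent from the downloaded files, rather than merely formatted in a way grep does not match.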
I'm trying to run pgxmine on one of the files output by BioText, but the result is empty:
$ ls -l example1
pubmed_test.bioc.xml -> ../../biotext/pubtator/pmc_baseline.oa_comm_xml.PMC008xxxxxx.baseline.2022-09-03_36.bioc.xml
$ python findPGxSentences.py --inBioc example1/pubmed_test.bioc.xml \
--filterTermsFile pgx_filter_terms.txt \
--outBioc example1/pubmed_test.sentences.bioc.xml
$ python getRelevantMeSH.py --inBioc example1/pubmed_test.bioc.xml \
--outJSONGZ example1/pubmed_test.mesh.json.gz
$ python createKB.py \
--trainingFiles data/annotations.variant_star_rs.bioc.xml,data/annotations.variant_other.bioc.xml \
--inBioC example1/pubmed_test.sentences.bioc.xml \
--selectedChemicals data/selected_chemicals.json \
--dbsnp data/dbsnp_selected.tsv \
--variantStopwords stopword_variants.txt \
--genes data/gene_names.tsv \
--relevantMeSH example1/pubmed_test.mesh.json.gz \
--outKB example1/pubmed_test.kb.tsv
$ python filterAndCollate.py \
--inData example1 \
--outUnfiltered example1/mini_unfiltered.tsv \
--outCollated example1/mini_collated.tsv \
--outSentences example1/mini_sentences.tsv
Output:
+ python findPGxSentences.py --inBioc example1/pubmed_test.bioc.xml --filterTermsFile pgx_filter_terms.txt --outBioc example1/pubmed_test.sentences.bioc.xml
Found 0 candidate sentences
+ python getRelevantMeSH.py --inBioc example1/pubmed_test.bioc.xml --outJSONGZ example1/pubmed_test.mesh.json.gz
Loaded PMIDs from corpus file...
Searching for MeSH terms in: ['Adolescent', 'Adult', 'Aged', 'Birth Cohort', 'Child', 'Child, Preschool', 'Infant', 'Infant, Newborn', 'Middle Aged', 'Pediatrics', 'Young Adult']
Found 0 PubMed ID(s) with relevant MeSH terms
+ python createKB.py --trainingFiles data/annotations.variant_star_rs.bioc.xml,data/annotations.variant_other.bioc.xml --inBioC example1/pubmed_test.sentences.bioc.xml --selectedChemicals data/selected_chemicals.json --dbsnp data/dbsnp_selected.tsv --variantStopwords stopword_variants.txt --genes data/gene_names.tsv --relevantMeSH example1/pubmed_test.mesh.json.gz --outKB example1/pubmed_test.kb.tsv
Loaded chemical, gene and variant data
Loaded mesh PMIDs for pediatric/adult terms
Creating classifier for star_rs
Predicted 0 association(s) for star_rs variants
Creating classifier for other
Predicted 0 association(s) for other variants
+ python filterAndCollate.py --inData example1 --outUnfiltered example1/mini_unfiltered.tsv --outCollated example1/mini_collated.tsv --outSentences example1/mini_sentences.tsv
Found 1 PubMed files
Found 0 PMC files
0 records filtered to 0 sentences and collated to 0 chemical/variant associations
Written to example1/mini_sentences.tsv and example1/mini_collated.tsv
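Since the first step already reports "Found 0 candidate sentences", it may help to check whether the input file contains any text at all, and whether any passage mentions a term of interest. The sketch below is only a rough diagnostic. The function `summarize_bioc` is hypothetical, the terms are supplied by the caller, and the plain case-insensitive substring match is an assumption that does not reproduce whatever matching findPGxSentences.py performs.

```python
import xml.etree.ElementTree as ET


def summarize_bioc(path, terms):
    """Count documents, passages with text, and passages mentioning any term."""
    # Assumption: simple lowercase substring matching is good enough
    # for a first sanity check of the input file.
    lowered = [t.strip().lower() for t in terms if t.strip()]
    counts = {"documents": 0, "passages_with_text": 0, "passages_with_term": 0}
    with open(path, "rb") as handle:
        for _, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag != "document":
                continue
            counts["documents"] += 1
            for passage in elem.iter("passage"):
                text = (passage.findtext("text") or "").lower()
                if not text.strip():
                    continue
                counts["passages_with_text"] += 1
                if any(term in text for term in lowered):
                    counts["passages_with_term"] += 1
            elem.clear()  # release the memory held by this document
    return counts


if __name__ == "__main__":
    # "warfarin" and "CYP2C9" are illustrative terms only.
    print(summarize_bioc("example1/pubmed_test.bioc.xml", ["warfarin", "CYP2C9"]))
```

If `passages_with_text` is zero, the problem lies in the input file itself. If text is present but nothing matches, the filter terms or the entity annotations expected by the pipeline are the more likely cause.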
I also tried running it with Snakemake, as described in the README, but it creates empty (header-only) pgxmine/test_working/pgxmine_* files:
$ MODE=test snakemake --cores 1
Running it in full mode also produces empty files:
$ MODE=full BIOTEXT=../biotext/biocxml snakemake --cores 10
$ cat working/pgxmine_sentences.tsv | wc -l
1
Hey, I've added some documentation on how the single test file was created. I've also merged the example and test_data folders so there is a single test file used by run_example and the Snakemake script. Give it another try.
Your full run shouldn't be producing no results (and it should take a very long time to run). I've added an error check that verifies the expected input files are present, so hopefully that will help too.
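For illustration only, a check of this kind could look like the sketch below. It is not the code added to the repository. The function name `check_inputs` and the `*.bioc.xml` pattern are assumptions made for the example.

```python
import glob
import os


def check_inputs(directory, pattern="*.bioc.xml"):
    """Return the matching input files, or raise if none can be found."""
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"Input directory not found: {directory}")
    files = sorted(glob.glob(os.path.join(directory, pattern)))
    if not files:
        # Fail early instead of silently producing header-only output files.
        raise RuntimeError(f"No files matching {pattern} in {directory}")
    return files
```

Failing early like this turns an empty result table into an explicit error that points at the input directory.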
Thank you for your effort! Do you know why the CI/CD test is failing?
@jakelever any updates on this? Apparently something is not working properly.