Variables in the training data missing in newdata

Open karamveerverma37 opened this issue 8 months ago • 7 comments

Hi, I am trying to run REPTILE with the pre-trained model mm_model_coreMarks.reptile using methylation data. Is there any issue with how I generate the bw files? I have methylation base-call BED files containing chromosome, start, end, and methylation rate. I converted them into a bw file using the following commands:

    awk '{printf "%s\t%d\t%d\t%2.3f\n" , $1,$2,$3,$4}' myBed.bed > myFile.bedgraph
    sort -k1,1 -k2,2n myFile.bedgraph > myFile_sorted.bedgraph
    bedGraphToBigWig myFile_sorted.bedgraph myChrom.sizes myBigWig.bw
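As a sanity check on the bedGraph before conversion: bedGraphToBigWig expects the input to be sorted by chromosome and start, with no overlapping intervals. The following is a minimal sketch of such a check, assuming four whitespace-separated columns as in the commands above. The function name `check_bedgraph` and the sample rows are illustrative only and are not part of REPTILE or the UCSC tools.

```python
def check_bedgraph(lines):
    """Return a list of problems found in bedGraph lines (chrom, start, end, value)."""
    problems = []
    prev_chrom, prev_start, prev_end = None, -1, -1
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        chrom, start, end, _value = line.split()[:4]
        start, end = int(start), int(end)
        if end <= start:
            problems.append(f"line {n}: end <= start")
        if chrom == prev_chrom:
            if start < prev_start:
                problems.append(f"line {n}: not sorted by start")
            elif start < prev_end:
                problems.append(f"line {n}: overlaps previous interval")
        prev_chrom, prev_start, prev_end = chrom, start, end
    return problems


# Made-up rows: the second interval overlaps the first.
sample = [
    "chr1\t100\t101\t0.750",
    "chr1\t100\t102\t0.500",
    "chr1\t300\t301\t1.000",
]
print(check_bedgraph(sample))
```

In practice the lines would come from the sorted file, for example `check_bedgraph(open("myFile_sorted.bedgraph"))`. An empty list means none of these three problems were found.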

I tried the Meth epimark alone as well as all four marks (H3K4me1 etc.) given for the mm_model_coreMarks.reptile model. The output of REPTILE_preprocess.py is the file preprocessed.region_with_epimark.tsv, which looks like this:

    chr   start    end      id         Meth_E4  H3K4me1_E4  H3K4me3_E4  H3K27ac_E4
    chr1  0        2000     bin_0      0.0      0.0         0.0         0.0
    chr1  100      2100     bin_1      0.0      0.0         0.0         0.0
    chr1  200      2200     bin_2      0.0      0.0         0.0         0.0
    chr1  300      2300     bin_3      0.0      0.0         0.0         0.0
    chr1  400      2400     bin_4      0.0      0.0         0.0         0.0
    chr1  500      2500     bin_5      0.0      0.0         0.0         0.0
    chr1  600      2600     bin_6      0.0      0.0         0.0         0.0
    chr1  700      2700     bin_7      0.0      0.0         0.0         0.0
    chr1  800      2800     bin_8      0.0      0.0         0.0         0.0
    chr1  900      2900     bin_9      0.0      0.0         0.0         0.0
    chr1  1000     3000     bin_10     0.0      0.0         0.0         0.0
    .
    .
    chr1  3211200  3213200  bin_32112  5.0      5.0         5.0         5.0
    chr1  3211300  3213300  bin_32113  5.0      5.0         5.0         5.0
    chr1  3211400  3213400  bin_32114  5.0      5.0         5.0         5.0
    chr1  3211500  3213500  bin_32115  4.0      4.0         4.0         4.0
    chr1  3211600  3213600  bin_32116  3.3      3.3         3.3         3.3
    chr1  3211700  3213700  bin_32117  2.54545  2.54545     2.54545     2.54545
    chr1  3211800  3213800  bin_32118  2.69231  2.69231     2.69231     2.69231
    chr1  3211900  3213900  bin_32119  3.0      3.0         3.0         3.0
    chr1  3212000  3214000  bin_32120  2.85714  2.85714     2.85714     2.85714
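In the rows shown, all four epimark columns carry identical values, which may mean the same bw file was read for every mark. A minimal sketch for flagging identical columns follows. It assumes the file is tab-delimited and that the first four columns are `chr`, `start`, `end`, `id` as in the header above. The helper name `identical_mark_columns` is hypothetical, and the sample reuses three rows from the output.

```python
import csv
import io


def identical_mark_columns(tsv_text, id_columns=("chr", "start", "end", "id")):
    """Return pairs of epimark columns whose values match on every row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    marks = [c for c in reader.fieldnames if c not in id_columns]
    columns = {m: [] for m in marks}
    for row in reader:
        for m in marks:
            columns[m].append(row[m])
    pairs = []
    for i, a in enumerate(marks):
        for b in marks[i + 1:]:
            if columns[a] == columns[b]:
                pairs.append((a, b))
    return pairs


# Three rows copied from the preprocessed output, joined with tabs.
rows = [
    ["chr", "start", "end", "id", "Meth_E4", "H3K4me1_E4", "H3K4me3_E4", "H3K27ac_E4"],
    ["chr1", "3211500", "3213500", "bin_32115", "4.0", "4.0", "4.0", "4.0"],
    ["chr1", "3211600", "3213600", "bin_32116", "3.3", "3.3", "3.3", "3.3"],
    ["chr1", "3211700", "3213700", "bin_32117", "2.54545", "2.54545", "2.54545", "2.54545"],
]
sample_tsv = "\n".join("\t".join(r) for r in rows)
for a, b in identical_mark_columns(sample_tsv):
    print(f"{a} and {b} are identical")
```

If every pair is reported, as happens with these rows, it is worth checking which bw file each mark points to in the data info file.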

Now when I run the compute score command:

    REPTILE_compute_score.R -i data_info_file2 -m mm_model_coreMarks.reptile -a tmp/mm39_w2kb_s100bp_preprocessed.region_with_epimark.tsv -s E4 -o tmp/E4__compute_pred

I get the following error:

    Error in predict.randomForest(reptile_classifier, epimark, type = "prob") :
      variables in the training data missing in newdata
    Calls: reptile_predict_genome_wide ... reptile_predict_one_mode -> predict -> predict.randomForest
    Execution halted

Is there a trained model available that uses only DNA methylation data to predict enhancers? Note: I tried both genome-wide and region-specific prediction.
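For context, `predict.randomForest` raises this error when the new data lacks one or more predictor columns that the forest was trained on. The sketch below only illustrates that comparison. The feature names are placeholders, not the variables actually stored in mm_model_coreMarks.reptile, and `missing_variables` is a hypothetical helper.

```python
def missing_variables(training_variables, newdata_columns):
    """Return training variables that are absent from the new data, in training order."""
    available = set(newdata_columns)
    return [v for v in training_variables if v not in available]


# Placeholder names only; the real list would come from the trained model.
training_variables = ["feature_a", "feature_b", "feature_c"]
newdata_columns = ["feature_a", "feature_c", "extra_column"]

print(missing_variables(training_variables, newdata_columns))
```

Any name returned here corresponds to a column the model would need but cannot find. Extra columns in the new data do not trigger the error.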

karamveerverma37, May 31 '24 16:05