FEELnc
`$proc` argument in `FEELnc_codpot.pl` (number of processes) not mentioned in help
Hello there!
I noticed that the `$proc` argument can be passed to `FEELnc_codpot.pl`, which then runs KmerInShort with a number of processes equal to `$proc`:
https://github.com/tderrien/FEELnc/blob/8ea728ebbe0b17a090fa23c8361168f1729fb7f7/scripts/FEELnc_codpot.pl#L90-L114
However, this option is not mentioned in the `--help` output or the man page, which currently reads:
Usage:
FEELnc_codpot.pl -i transcripts.GTF -a known_mRNA.GTF -g genome.FA -l
known_lnc.GTF [options...]
Options:
General:
--help Print this help
--man Open man page
--verbosity Level of verbosity
Mandatory arguments:
-i,--infile=file.gtf/.fasta Specify the .GTF or .FASTA file (such as a cufflinks transcripts/merged .GTF or .FASTA file)
-a,--mRNAfile=file.gtf/.fasta Specify the annotation .GTF or .FASTA file (file of protein coding transcripts .GTF or .FASTA file)
Optional arguments:
-g,--genome=genome.fa Genome file or directory with chr files (mandatory if input is .GTF) [ default undef ]
-l,--lncRNAfile=file.gtf/.fasta Specify a known set of lncRNA for training .GTF or .FASTA [ default undef ]
-b,--biotype Only consider transcripts having this(these) biotype(s) from the reference annotation (e.g : -b transcript_biotype=protein_coding,pseudogene) [default undef i.e all transcripts]
-n,--numtx=undef Number of mRNA and lncRNA transcripts required for the training. mRNAs and lncRNAs numbers need to be separate by a ',': i.e. 1500,1000 for 1500 mRNAs and 1000 lncRNAs. For all the annotation, let it blank [ default undef, all the two annotations ]
-r,--rfcut=[0-1] Random forest voting cutoff [ default undef i.e will compute best cutoff ]
--spethres=undef Two specificity threshold based on the 10-fold cross-validation, first one for mRNA and the second for lncRNA, need to be in ]0,1[ on separated by a ','
-k,--kmer=1,2,3,6,9,12 Kmer size list with size separate by ',' as string [ default "1,2,3,6,9,12" ], the maximum value for one size is '15'
-o,--outname={INFILENAME} Output filename [ default infile_name ]
--outdir="feelnc_codpot_out/" Output directory [ default "./feelnc_codpot_out/" ]
-m,--mode The mode of the lncRNA sequences simulation if no lncRNA sequences have been provided. The mode can be:
'shuffle' : make a permutation of mRNA sequences while preserving the 7mer count. Can be done on either FASTA and GTF input file;
'intergenic': extract intergenic sequences. Can be done *only* on GTF input file.
-s,--sizeinter=0.75 Ratio between mRNA sequence lengths and non coding intergenic region sequence lengths as, by default, ncInter = mRNA * 0.75
--learnorftype=3 Integer [0,1,2,3,4] to specify the type of longest ORF calculate [ default: 3 ] for learning data set.
If the CDS is annotated in the .GTF, then the CDS is considered as the longest ORF, whatever the --orftype value.
'0': ORF with start and stop codon;
'1': same as '0' and ORF with only a start codon, take the longest;
'2': same as '1' but with a stop codon;
'3': same as '0' and ORF with a start or a stop, take the longest (see '1' and '2');
'4': same as '3' but if no ORF is found, take the input sequence as ORF.
--testorftype=3 Integer [0,1,2,3,4] to specify the type of longest ORF calculate [ default: 3 ] for test data set. See --learnortype description for more informations.
--ntree Number of trees used in random forest [ default 500 ]
--percentage=0.1 Percentage of the training file use for the training of the kmer model. What remains will be used to train the random forest
Debug arguments:
--keeptmp=0 To keep the temporary files in a 'tmp' directory the outdir, by default don't keep it (0 value). Any other value than 0 will keep the temporary files
--verbosity=1 Integer [0,1,2]: which level of information that need to be print [ default 1 ]. Note that that printing is made on STDERR
--seed=1234 Use to fixe the seed value for the extraction of intergenic DNA region to get lncRNA like sequences and for the random forest [ default 1234 ]
Intergenic lncrna extraction:
-to be added
It would be great to add it to the documentation, so that people who are getting introduced to the tool can take advantage of its parallel processing capabilities.
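For example, once documented, users would know they can run something like the following (a minimal sketch; I'm assuming the flag is spelled `--proc` to match the variable name, so the exact spelling and any short alias should be confirmed against the GetOptions call linked above):

```bash
# Sketch: run the coding-potential step with KmerInShort using several processes.
# Assumption: the option is exposed as --proc (matching the $proc variable in the script).
FEELnc_codpot.pl -i transcripts.GTF -a known_mRNA.GTF -g genome.FA \
                 -l known_lnc.GTF --proc 4
```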
Cheers,
Dani