`$proc` argument in `FEELnc_codpot.pl` (number of processes) not mentioned in help

Open dgruano opened this issue 5 months ago • 0 comments

Hello there!

I noticed that a `$proc` argument can be passed to `FEELnc_codpot.pl`, which then runs KmerInShort with that number of processes.

https://github.com/tderrien/FEELnc/blob/8ea728ebbe0b17a090fa23c8361168f1729fb7f7/scripts/FEELnc_codpot.pl#L90-L114

However, it is not mentioned in the `--help` output or the man page, which currently reads:

Usage:
    FEELnc_codpot.pl -i transcripts.GTF -a known_mRNA.GTF -g genome.FA -l
    known_lnc.GTF [options...]

Options:
  General:
      --help                                Print this help
      --man                                 Open man page
      --verbosity                           Level of verbosity

  Mandatory arguments:
      -i,--infile=file.gtf/.fasta           Specify the .GTF or .FASTA file  (such as a cufflinks transcripts/merged .GTF or .FASTA file)
      -a,--mRNAfile=file.gtf/.fasta         Specify the annotation .GTF or .FASTA file  (file of protein coding transcripts .GTF or .FASTA file)

  Optional arguments:
      -g,--genome=genome.fa                 Genome file or directory with chr files (mandatory if input is .GTF) [ default undef ]
      -l,--lncRNAfile=file.gtf/.fasta       Specify a known set of lncRNA for training .GTF or .FASTA  [ default undef ]
      -b,--biotype                          Only consider transcripts having this(these) biotype(s) from the reference annotation (e.g : -b transcript_biotype=protein_coding,pseudogene) [default undef i.e all transcripts]
      -n,--numtx=undef                      Number of mRNA and lncRNA transcripts required for the training. mRNAs and lncRNAs numbers need to be separate by a ',': i.e. 1500,1000 for 1500 mRNAs and 1000 lncRNAs. For all the annotation, let it blank [ default undef, all the two annotations ]
      -r,--rfcut=[0-1]                      Random forest voting cutoff [ default undef i.e will compute best cutoff ]
      --spethres=undef                      Two specificity threshold based on the 10-fold cross-validation, first one for mRNA and the second for lncRNA, need to be in ]0,1[ on separated by a ','
      -k,--kmer=1,2,3,6,9,12                Kmer size list with size separate by ',' as string [ default "1,2,3,6,9,12" ], the maximum value for one size is '15'
      -o,--outname={INFILENAME}             Output filename [ default infile_name ]
      --outdir="feelnc_codpot_out/"         Output directory [ default "./feelnc_codpot_out/" ]
      -m,--mode                             The mode of the lncRNA sequences simulation if no lncRNA sequences have been provided. The mode can be:
                                                    'shuffle'   : make a permutation of mRNA sequences while preserving the 7mer count. Can be done on either FASTA and GTF input file;
                                                    'intergenic': extract intergenic sequences. Can be done *only* on GTF input file.
      -s,--sizeinter=0.75                   Ratio between mRNA sequence lengths and non coding intergenic region sequence lengths as, by default, ncInter = mRNA * 0.75
      --learnorftype=3                      Integer [0,1,2,3,4] to specify the type of longest ORF calculate [ default: 3 ] for learning data set.
                                            If the CDS is annotated in the .GTF, then the CDS is considered as the longest ORF, whatever the --orftype value.
                                                    '0': ORF with start and stop codon;
                                                    '1': same as '0' and ORF with only a start codon, take the longest;
                                                    '2': same as '1' but with a stop codon;
                                                    '3': same as '0' and ORF with a start or a stop, take the longest (see '1' and '2');
                                                    '4': same as '3' but if no ORF is found, take the input sequence as ORF.
      --testorftype=3                       Integer [0,1,2,3,4] to specify the type of longest ORF calculate [ default: 3 ] for test data set. See --learnortype description for more informations.
      --ntree                               Number of trees used in random forest [ default 500 ]
      --percentage=0.1                      Percentage of the training file use for the training of the kmer model. What remains will be used to train the random forest

  Debug arguments:
      --keeptmp=0                           To keep the temporary files in a 'tmp' directory the outdir, by default don't keep it (0 value). Any other value than 0 will keep the temporary files
      --verbosity=1                         Integer [0,1,2]: which level of information that need to be print [ default 1 ]. Note that that printing is made on STDERR
      --seed=1234                           Use to fixe the seed value for the extraction of intergenic DNA region to get lncRNA like sequences and for the random forest [ default 1234 ]

  Intergenic lncrna extraction:
            -to be added

It would be great to document it, so that people who are new to the tool can take advantage of its parallel processing capabilities.
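To make the request concrete, here is a minimal, standalone sketch of how a documented option could look, using `Getopt::Long` and `Pod::Usage`. It is not taken from `FEELnc_codpot.pl`: the option spelling `-p,--proc`, the default of `1`, the help wording, and the script name `proc_option_sketch.pl` are all assumptions on my part, so please check them against the `GetOptions` call in the lines linked above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long;
use Pod::Usage;

=head1 SYNOPSIS

proc_option_sketch.pl [options...]

=head1 OPTIONS

=over 8

=item B<-p,--proc=1>

Number of processes used to run KmerInShort [ default 1 ]

=back

=cut

# ASSUMPTION: the option name ('p|proc') and the default (1) are placeholders,
# not copied from FEELnc_codpot.pl.
my $proc = 1;
my $help = 0;

# Parse the command line; on an unknown option, print usage and exit with status 2.
GetOptions(
    'p|proc=i' => \$proc,
    'help|?'   => \$help,
) or pod2usage(2);

# With --help, print the SYNOPSIS and OPTIONS sections from the POD above.
pod2usage(-verbose => 1) if $help;

# Stand-in for the real work: only report the value that would be handed on.
print STDERR "KmerInShort would be run with $proc process(es)\n";
```

Running this sketch with `--help` prints the POD entry, which is the piece that seems to be missing from the current help and man page.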

Cheers,

Dani

dgruano • Aug 01 '25 08:08