
In overrepresented sequence analysis, the loop condition looks like it should be i<=len-step instead of i<len-step

Open yanlifeng opened this issue 3 years ago • 0 comments

Hi! I am using the latest fastp and found the following in state.cpp:

    // do overrepresentation analysis for 1 of every 100 reads
    if(mOptions->overRepAnalysis.enabled) {
        if(mReads % mOptions->overRepAnalysis.sampling == 0) {
            const int steps[5] = {10, 20, 40, 100, min(150, mEvaluatedSeqLen-2)};
            for(int s=0; s<5; s++) {
                int step = steps[s];
                for(int i=0; i<len-step; i++) {
                    string seq = r->mSeq->substr(i, step);
                    if(mOverRepSeq.count(seq)>0) {
                        mOverRepSeq[seq]++;
                        for(int p = i; p < seq.length() + i && p < mEvaluatedSeqLen; p++) {
                            mOverRepSeqDist[seq][p]++;
                        }
                        i+=step;
                    }
                }
            }
        }
    }

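In the line for(int i=0; i<len-step; i++), the condition looks like it should be i<=len-step rather than i<len-step. With <, the last window of the read (the one starting at i = len-step) is never examined, and a read whose length equals step is never checked at all. This seems to cause the number of hot sequences found during preprocessing to end up as 0.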
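To make the off-by-one concrete, here is a minimal sketch that enumerates the windows visited under each bound. The helper name windows and its inclusive flag are hypothetical and are not part of fastp; only the loop bound mirrors the snippet above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not fastp code): collect every window of length `step`
// that the loop would visit in `seq`.
// inclusive == false mirrors the quoted bound  i <  len-step
// inclusive == true  mirrors the proposed bound i <= len-step
std::vector<std::string> windows(const std::string& seq, int step, bool inclusive) {
    assert(step > 0);
    std::vector<std::string> out;
    const int len = static_cast<int>(seq.length());
    // Index of the last window start that the chosen bound allows.
    const int last = inclusive ? len - step : len - step - 1;
    for (int i = 0; i <= last; i++) {
        out.push_back(seq.substr(i, step));
    }
    return out;
}
```

For a 10-base read and step 10, the exclusive bound visits no window at all, while the inclusive bound visits exactly one (the whole read). For a 6-base read and step 4, the exclusive bound visits 2 windows and the inclusive bound visits 3, the extra one being the final 4 bases.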

Incidentally, why is i advanced by step (i+=step) after a match? Is it because counted over-represented sequences must not overlap?
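One detail related to this question: because the for loop's own i++ still runs after i+=step, the start position moves forward by step+1 after a match, not by step. The sketch below reproduces only that control flow. The helper countMatches and its extraSkip parameter are hypothetical, the bound i<=len-step is an assumption chosen to isolate the skip behaviour, and whether the extra base is intended is not something the quoted code states.

```cpp
#include <string>

// Hypothetical helper (not fastp code): count matches of `target` in `read`
// using the same control flow as the quoted loop. `extraSkip` is the amount
// added to i inside the loop body after a match.
int countMatches(const std::string& read, const std::string& target, int extraSkip) {
    const int step = static_cast<int>(target.length());
    const int len = static_cast<int>(read.length());
    int count = 0;
    for (int i = 0; i <= len - step; i++) {
        if (read.compare(i, step, target) == 0) {
            count++;
            i += extraSkip;  // the loop's own i++ still runs after this
        }
    }
    return count;
}
```

With read "ACACAC" and target "AC", extraSkip = 2 (the analogue of i+=step) counts 2 matches, because the match starting right after the first one is stepped over. extraSkip = 1 (the analogue of i+=step-1) counts all 3 non-overlapping matches.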

Thank you!

yanlifeng, Jul 10 '22 08:07