
Got negative values on weighted histogram plot

najohink opened this issue on Sep 22, 2023 · 7 comments

Hello,

I am using NanoComp v1.23.1 and got a weird plot after filtering my input fastq files (see the attached screenshot).

When I ran the same command on input fastq files that were not filtered, I got normal plots. But after filtering my fastq files to keep only 1-27 kb reads, I now get negative values in the weighted plots. Is this "normal"?

Can you also explain the difference between weighted and normalized?

best, S

najohink · Sep 22 '23

I forgot to add the photo of the unfiltered fastq output plot:

[screenshot of the unfiltered output plot]

najohink · Sep 22 '23

I am very confused and will need to think about this.

wdecoster · Sep 26 '23

I filtered my dataset with FiltLong before running NanoComp and getting the weird result.

In the meantime, I figured out how to do what I wanted by running this:

import pickle
import numpy
from matplotlib import pyplot as plt

# load the per-read data that NanoComp saved as a pickle
df3 = pickle.load(open('barcode03_1-27kb_NanoComp-data.pickle', 'rb'))

# histogram of read lengths in 500 bp bins
bins = numpy.arange(0, 30000, 500)
h3 = numpy.histogram(df3['lengths'], bins=bins)
plt.bar(h3[1][:-1], height=h3[0], width=450)

# weight each bin by its midpoint length to approximate bases per bin
xdata3 = (h3[1][:-1] + h3[1][1:]) / 2
ydata3 = xdata3 * h3[0]
plt.bar(xdata3, ydata3, width=450)

# fraction of all bases coming from reads longer than 25 kb
ydata3[xdata3 > 25000].sum() / ydata3.sum()

I was interested in knowing what percentage of the total bases came from my full-length sequence. So I wanted to divide the ~26 kb bases by the total number of bases, but I also wanted to keep the weird long stuff out of the dataset, hence the filtering with FiltLong.
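For what it's worth, the same fraction can also be computed from the raw read lengths without binning; a minimal sketch, assuming the same pickle file and 'lengths' field as in the snippet above:

import pickle
import numpy

# load the same NanoComp pickle as above (file name taken from the snippet)
df3 = pickle.load(open('barcode03_1-27kb_NanoComp-data.pickle', 'rb'))
lengths = numpy.asarray(df3['lengths'])

# fraction of all sequenced bases contributed by reads longer than 25 kb,
# computed from the raw lengths rather than from 500 bp histogram bins
fraction = lengths[lengths > 25000].sum() / lengths.sum()
print(fraction)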

najohink · Sep 26 '23

Does the non-weighted plot look normal? I will explain what those options mean later, when I'm at the computer...

wdecoster · Sep 26 '23

Yes, the others look normal. Only the two weighted plots have negative values.

najohink · Sep 26 '23

So normalized means that every dataset in the plot adds up to 1, so that datasets with significant differences in yield can still be compared on read length. Without normalization, just the number of reads is used. Weighted means that instead of the number of reads per bin, the number of bases per bin is used (as is also the case in the MinKNOW interface). As such, a read of 25000 bases in the 24000-26000 bin will increase the count on the y-axis by 25000 rather than by just 1.
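In numpy terms, a minimal sketch of the difference (not NanoComp's actual plotting code; the toy lengths and bin width are made up purely for illustration):

import numpy

lengths = numpy.array([500, 1500, 2500, 25000])   # toy read lengths
bins = numpy.arange(0, 30000, 2000)

counts, edges = numpy.histogram(lengths, bins=bins)                 # plain: reads per bin
weighted, _ = numpy.histogram(lengths, bins=bins, weights=lengths)  # weighted: bases per bin
normalized = counts / counts.sum()                                  # normalized: fractions summing to 1

# the 25000-base read adds 1 to its bin in `counts`, but 25000 in `weighted`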

wdecoster · Sep 27 '23

Do you think it would be possible to share the data that caused this?

wdecoster · Sep 28 '23