
No such file or directory

number-25 opened this issue 5 years ago · 4 comments

Hi Dana,

I am having an issue with one particular file in my data set when I attempt to run it through TC.

Below are the commands:

python $TC --threads 6 --sam $SSAM --genome $REF --spliceJns $SPLICE --deleteTmp --outprefix EXT2_TC

Reading genome ..............................
Reading genome ..............................
Reading genome ..............................
cat: 'TC_tmp//.sam': No such file or directory
cat: 'TC_tmp//.fa': No such file or directory
cat: 'TC_tmp//.log': No such file or directory
cat: 'TC_tmp//.TElog': No such file or directory
Took 0:00:00 to combine all outputs.

I have attempted to clear the /tmp/ directory before trying this again (I notice pybedtools creates many files there), but it didn't help. After restarting my PC I got further (thinking the restart might clear the temporary files causing issues), yet the output files don't appear to be correct. This replicate has the largest file size of all, yet TC processed it very quickly and produced an output substantially smaller than the rest, whereas the others took 3+ hours. To be sure, I remapped the original file with minimap2 and tried once more. I have also tried without the --deleteTmp option.

Cheers Dean

number-25 avatar Feb 05 '20 04:02 number-25

Hi Dean, I'll try the trivial solution first: have you double-checked that the path you are providing in the $SSAM bash variable is correct? If that doesn't help, feel free to send me a sample of the SAM file in question at [email protected] and I'd be happy to take a look. Cheers, Dana

dewyman avatar Feb 05 '20 18:02 dewyman

Hi Dana,

The SAM file path is all good. I tried tweaking a few parameters, thinking that perhaps the program exits if the computational load is too high for the machine. I reduced the threads from 6 to 4, and the program appears to have run to completion. My main confusion now is that the output file sizes are, counter-intuitively, quite different. One replicate started at 5 GB and its output SAM from TC was 3.4 GB; my second replicate started at 23 GB, yet its output SAM was only 2.5 GB, smaller than even the first replicate. I am confused as to what may be happening here. Is TC culling upwards of ~90% of the data? Most replicates have mapping rates of ~85%, too.

Perhaps it's the indel sizes in my data? The number of mismatches?

Cheers Dean

number-25 avatar Feb 07 '20 01:02 number-25

Hi Dean, One of the drawbacks to the multithreading is that it does result in higher memory usage, which, as you remarked, can lead to a crash on large inputs. One way to mitigate this might be to pre-filter your SAM file and run TC on only the primary alignments (i.e. keep reads where the second column is 0 or 16). TranscriptClean doesn't correct unmapped reads or non-primary alignments, so it is possible that the file size difference you are seeing is related to that. In particular, the --canonOnly and --primaryOnly command-line options would be expected to decrease the size of the final output. For instance, if your mapping rate is 85% and you ran TC with one of these options enabled, then at least 15% of the reads in your input file would not be found in the output. But of course I would not typically expect a 23 GB to 2.5 GB drop unless the mapping rate/multimapping rate was really bad; that is a big reduction. Do you still have the tmp files from that run? Best, Dana

dewyman avatar Feb 07 '20 22:02 dewyman
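Dana's pre-filtering suggestion above can be sketched as a one-line awk filter that keeps header lines plus reads whose FLAG field (column 2) is 0 (forward strand) or 16 (reverse strand); the file names here are placeholders:

```shell
# Keep SAM header lines (starting with @) and primary, mapped alignments
# whose FLAG is 0 or 16. "input.sam" and "input.primary.sam" are examples.
awk '$1 ~ /^@/ || $2 == 0 || $2 == 16' input.sam > input.primary.sam
```

The same filtering could also be done with samtools if it is available, but the awk form matches the column-2 check described above exactly.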

Hi Dana,

Just some more info - the large replicate has ~26% non-primary alignments. The first smaller one has ~28%.
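For reference, a non-primary fraction like the ones quoted above can be computed directly from the SAM FLAG field: bit 0x100 marks secondary alignments and 0x800 marks supplementary alignments. A minimal pure-Python sketch (the function name and file handling are illustrative, not part of TC):

```python
def nonprimary_fraction(sam_path):
    """Fraction of alignment records that are secondary (flag bit 0x100)
    or supplementary (flag bit 0x800). Header lines (@...) are skipped."""
    total = nonprimary = 0
    with open(sam_path) as sam:
        for line in sam:
            if line.startswith("@"):
                continue
            flag = int(line.split("\t", 2)[1])  # FLAG is column 2
            total += 1
            if flag & (0x100 | 0x800):
                nonprimary += 1
    return nonprimary / total if total else 0.0
```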

The .TElog file for the large replicate indicates: 61907421 corrected, 658870 uncorrected.

I've got the tmp files; the contents of the directory /TC_tmp/split/un_corr_sams are below.

Also, I sent you the .sam via email.

uncorrected

Greatly appreciate the help! Dean

number-25 avatar Feb 10 '20 00:02 number-25