CARE
subsetting reads from large genome
It seems likely that the deeper the coverage, the better the read correction will work, but I'm working with a large genome (>20 Gb haploid complement), so RAM is a limiting factor: 50x coverage is more than 1 terabase of reads.

Would it be feasible to align all the reads to a draft (highly fragmented) genome assembly, divide the alignments into 10 sets of approximately equal size, extract the reads that map to each subset, and correct the subsets independently? Reads that don't align to the draft assembly could be included in all 10 subsets, in case they carry information missing from the draft, and reads whose supplementary or secondary alignments fall in a different subset than their primary alignment could be included in both. This should give the full depth of coverage for single-copy sequences, along with proportional representation of repetitive sequences, in each subset.
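For the "10 sets of approximately equal size" step, here is a minimal sketch of how I imagine balancing the subsets — a greedy bin-packing over per-contig mapped-base counts (e.g. parsed from `samtools idxstats`; the input dict and function name here are hypothetical, not anything CARE provides):

```python
from heapq import heappush, heappop

def partition_contigs(contig_bases, n_subsets=10):
    """Greedily assign contigs to n_subsets so mapped bases are balanced.

    contig_bases: dict of contig name -> mapped bases (hypothetical input,
    e.g. derived from `samtools idxstats` on the draft-assembly alignments).
    Returns a list of sets of contig names, one set per subset.
    """
    # min-heap of (current load in bases, subset index)
    heap = [(0, i) for i in range(n_subsets)]
    subsets = [set() for _ in range(n_subsets)]
    # place the largest contigs first, each into the currently lightest subset
    for name, bases in sorted(contig_bases.items(), key=lambda kv: -kv[1]):
        load, idx = heappop(heap)
        subsets[idx].add(name)
        heappush(heap, (load + bases, idx))
    return subsets
```

Each subset's contig list could then drive read extraction (e.g. `samtools view` on those regions), with the unmapped reads (`samtools view -f 4`) appended to every one of the 10 subsets before correction.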