Phasing into a single phaseset across a targeted region of interest
Hello,
I am interested in a 200kb region derived from a segmental duplication that includes two paralogous blocks with 98% identity. I have been targeting the region for nanopore sequencing with both cas9-mediated enrichment and adaptive sampling. I would love some help with parameterisation: sometimes PEPPER-Margin-DeepVariant is able to phase across the whole region, but at other times the region is split into many, many phasesets.
For example, here is the output from a single adaptive sampling run where it's able to phase into a 198kb phaseset:

However, in this output from a run with cas9-targeted reads I end up with 58 phasesets with large portions not covered at all.

This is after I adjusted some of the options as shown below; with the defaults, the region splits into 68 phasesets.
singularity exec --bind /usr/lib/locale/,/mainfs/cansci/ \
/mainfs/cansci/pepper_deepvariant_r0.8.sif \
run_pepper_margin_deepvariant call_variant \
-b "${BAM}" \
-f "${REF}" \
-o "${OUTPUT_DIR2}" \
-p "${OUTPUT_PREFIX2}" \
-t "${THREADS}" \
-r chr1:161440000-161700000 \
--phased_output \
--pepper_min_mapq 10 \
--pepper_min_snp_baseq 5 \
--pepper_min_indel_baseq 5 \
--pepper_snp_frequency 0.2 \
--pepper_insert_frequency 0.2 \
--pepper_delete_frequency 0.2 \
--pepper_min_coverage_threshold 10 \
--pepper_candidate_support_threshold 3 \
--pepper_snp_candidate_frequency_threshold 0.2 \
--pepper_indel_candidate_frequency_threshold 0.2 \
--ont_r9_guppy5_sup
Is there an explanation for this? Obviously the distribution of coverage is different for the cas9 vs adaptive sampling runs, but the cas9 runs have a lot more depth. These are just two examples where the library prep types are kept separate; for most of my samples I have combined data from cas9 and adaptive sampling runs. When I use nanopolish + whatshap on the cas9 reads, they can be phased into a single 198kb haploblock.
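In case it helps anyone compare runs, here is a minimal sketch of how the phasesets could be tallied from a phased VCF. It assumes the phased output carries the standard VCF `PS` FORMAT tag and phased (`|`) genotypes; the function name `phaseset_spans` and the example records are made up for illustration and are not part of the pipeline.

```python
def phaseset_spans(vcf_lines, sample_index=0):
    """Summarise phasesets as {(chrom, PS): (first_pos, last_pos, n_variants)}.

    Assumption: phased records have a '|' genotype and a PS FORMAT value.
    Unphased records and records without PS are skipped.
    """
    spans = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns
        chrom, pos = fields[0], int(fields[1])
        keys = fields[8].split(":")
        values = fields[9 + sample_index].split(":")
        sample = dict(zip(keys, values))
        ps = sample.get("PS", ".")
        if ps == "." or "|" not in sample.get("GT", ""):
            continue  # not assigned to a phaseset
        first, last, n = spans.get((chrom, ps), (pos, pos, 0))
        spans[(chrom, ps)] = (min(first, pos), max(last, pos), n + 1)
    return spans


# Hypothetical records, for illustration only (not real calls).
EXAMPLE = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t161440100\t.\tA\tG\t30\tPASS\t.\tGT:GQ:PS\t0|1:30:161440100",
    "chr1\t161440500\t.\tC\tT\t30\tPASS\t.\tGT:GQ:PS\t1|0:30:161440100",
    "chr1\t161500000\t.\tG\tA\t30\tPASS\t.\tGT:GQ\t1/1:30",
    "chr1\t161600000\t.\tT\tC\t30\tPASS\t.\tGT:GQ:PS\t0|1:30:161600000",
]

if __name__ == "__main__":
    for (chrom, ps), (first, last, n) in sorted(phaseset_spans(EXAMPLE).items()):
        print(f"{chrom}\tPS={ps}\t{first}-{last}\t{last - first + 1} bp\t{n} variants")
```

For a real file, the same function can be fed an open handle, e.g. `gzip.open(path, "rt")` for a bgzipped VCF. The span reported per phaseset runs from its first to its last phased variant, so gaps in coverage show up as breaks between consecutive spans.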
An additional question I had was about how variants are assigned as heterozygous or homozygous. In the second image you can see that many of the homozygous variants (light blue?) sit above positions which, judging by the colours in the raw read coverage track, look more like heterozygous sites.
Any help with optimising the parameters would be immensely appreciated.
Cheers!
Sarah