pore-c
pore-c copied to clipboard
Merging Salsa beds from multiple experiments
Hi Eoghan et al,
After sorting out my python env issues, I've managed to successfully run Pore-C-Snakemake through to its salsa bed outputs, and it seemed to work beautifully in that (good news!).
The bad news for my specific experiment is that the coverage I have in any one of my enzymes is too low to get much benefit in scaffolding out of SALSA - it runs, but doesn't really change the assembly contiguity considerably. I think this might be expected, since the throughput of the experiment wasn't great and if I map all my reads (split across 4 enzymes, each with different throughputs) the average coverage on my reference genome is 11X (with, however, only 3% not covered at all, with 90% of it covered by more than 5 reads). I can collect more data if I need to, but first I'd like to get the most out of the data I have at the moment (realizing that this is all very experimental).
As far as I can gather from the documentation and help files, PoreC & its snakemake wrapper is configured to run multiple experiments & different enzymes at once, but it gives each a separate bed output file when run with "to_salsa_bed". Is there any way to tell it to merge from multiple enzymes when writing a salsa bed file?
I tried to use samtools merge to merge the multiple bam files in mapping, then bamtobed as the salsa documentation suggests, but SALSA fails with these bed files, I guess because they don't have the form of "simulated" paired-end reads as poreC generates.
Thanks again for your consideration,
Regards, Chris L
PS - FYI in the Pore-C-Snakemake documentation there's one line
"snakemake --use-conda salsa2_bed"
that doesn't run, because it should be "salsa_bed". Nitpicky but it threw me for a moment.
Hi Chris,
The best way to merge multiple experiments is to generate your salsa bed through the pore-c pipeline. Once the individual bed files have been generated you can simply concatenate the bed files into a single file and sort by the read name column (4th column), please see the example below. Thereafter, you should be able to follow the salsa documentation
cat file1.salsa.bed file2.salsa.bed ... > combined.salsa.bed
sort -k 4 combined.salsa.bed > combined.salsa.sorted.bed
Hope this helps, Priyesh
Hi Priyesh,
Thanks for that. Just to be sure, the operation you suggest does result in some sorts that split paired end reads:
ctg252 130945 131216 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_10/1 30 - ctg135 432630 432888 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_10/2 26 + ctg252 130945 131216 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_11/1 30 - ctg113 139183 139981 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_1/1 154 - ctg277 164796 164972 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_11/2 4 - ctg287 211126 211272 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_12/1 20 + ctg135 432630 432888 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_12/2 26 + ctg252 130945 131216 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_1/2 30 - ctg287 211126 211272 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_13/1 20 + ctg277 164796 164972 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_13/2 4 - ctg135 432630 432888 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_14/1 26 + ctg277 164796 164972 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_14/2 4 -
Whereas this doesn't seem to be the case in the bed files created directly in the output of PoreC. Does order matter here?
Using a salsa file created in the manner you suggest, with all 4 restriction enzymes I used supplied via -e, doesn't really budge my assembly at all. I know the overall coverage is low - but I am a little surprised it doesn't increase the contiguity at least a little.
Hi Chris,
It appears the sort operation is splitting the pairs up. Since the bed file out of the PoreC pipeline is pre-sorted by name, you could omit the sorting step and go straight from concatenation to running the salsa pipeline.
I would expect there to be a change in the number of contigs as well as the continuity, could you compare the number of contigs with assembly.cleaned.fasta
and scaffolds_FINAL.fasta
both of those files should be in the output directory of salsa