P contigs of length 0
Hello, I noticed that after Falcon Unzip some of my p-contigs have length 0 (the sequence line is empty), even though haplotigs exist for the same contig. Sequence lengths for one example:
000870F [empty]
000870F_001 15026
000870F_002 152914
000870F_003 63913
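A length listing like the one above can be produced with a short script. The following is a minimal Python sketch, not part of FALCON: the function name fasta_lengths is hypothetical, the demo sequences are made-up toy data, and it assumes a plain uncompressed FASTA where the record ID is the first whitespace-delimited token of the header.

```python
import io


def fasta_lengths(handle):
    """Yield (record_id, sequence_length) for each record in a FASTA stream."""
    name = None
    length = 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length = 0
        elif name is not None:
            # Blank sequence lines add nothing, so empty records stay at 0.
            length += len(line)
    if name is not None:
        yield name, length


if __name__ == "__main__":
    # Toy data only: an empty primary contig followed by one haplotig.
    demo = io.StringIO(">000870F\n\n>000870F_001\nACGTAC\n")
    for record_id, n in fasta_lengths(demo):
        print(record_id, n if n else "[empty]")
```

For a real assembly, pass an open file handle (for example the one for 3-unzip/all_p_ctg.fa) instead of the StringIO object.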
Looking in the 3-unzip folder, ctg 870F has phased reads:
0 000870F 1 1 10 36 m161228_035316_42219_c101154552550000001823268507191790_s1_p0/103469/10714_36102
1 000870F 1 1 2 13 m54138_170625_065846/29360350/0_23161
2 000870F 1 1 9 32 m170104_032305_42219_c101154862550000001823268507191796_s1_p0/148504/775_31342
5 000870F 1 1 4 17 m170114_035841_42219_c101154482550000001823268507191790_s1_p0/85422/0_17582
7 000870F 1 1 10 65 m170112_110003_42219_c101154682550000001823268507191732_s1_p0/8381/0_22775
but it has no lines in all_p_ctg_edges.
The 3-unzip/0-phasing/000870F folder contains:
[dcopetti@pac /wing2/users/jzhang/work/bermuda/falcon1st/3-unzip/0-phasing/000870F]$ ls -lrth
total 48K
drwxr-xr-x 2 jzhang bioinfo 4.0K Aug 11 18:02 blasr
-rw-r--r-- 1 jzhang bioinfo 849 Aug 12 06:40 p_000870F.sh
drwxr-xr-x 6 jzhang bioinfo 61 Aug 12 06:40 mypwatcher
drwxr-xr-x 2 jzhang bioinfo 137 Aug 12 06:41 het_call
drwxr-xr-x 2 jzhang bioinfo 99 Aug 12 06:41 g_atable
drwxr-xr-x 2 jzhang bioinfo 108 Aug 12 06:41 get_phased_blocks
-rw-r--r-- 1 jzhang bioinfo 600 Aug 12 06:41 task.json
-rw-r--r-- 1 jzhang bioinfo 159 Aug 12 06:41 task.sh
-rw-r--r-- 1 jzhang bioinfo 207 Aug 12 06:41 run.sh
lrwxrwxrwx 1 jzhang bioinfo 108 Aug 12 06:41 pwatcher.dir -> /newwing/wing2/users/jzhang/work/bermuda/falcon1st/3-unzip/0-phasing/000870F/mypwatcher/jobs/P65c7e75853c6dd
-rw-r--r-- 1 jzhang bioinfo 17K Aug 12 06:41 phased_reads
-rw-r--r-- 1 jzhang bioinfo 0 Aug 12 06:41 run.sh.done
-rw-r--r-- 1 jzhang bioinfo 6.6K Aug 12 06:41 rid_to_phase.000870F
drwxr-xr-x 2 jzhang bioinfo 133 Aug 12 06:41 phasing
I have 13 contigs like this in this assembly and two in another, both produced with the Unzip step. Is there a way to recover those contigs? Thanks, Dario
Hi Dario,
It’s typical that there are fewer contigs present after running the unzip pipeline. This most often occurs in coverage-limited situations. For example, you might have a primary contig with sufficient coverage in the primary FALCON assembly, but after unzipping and partitioning the reads into two sets, there is not enough support to accurately call consensus on that contig. These contigs generally appear among the higher-numbered, shorter contigs, and typically represent a tiny fraction of the genome. We haven’t spent a lot of time investigating other reasons why some contigs don’t make it through the unzip pipeline; we tend to take that as “evidence” that the contig wasn’t highly supported in the first place and ignore them in the short term.
We realize there is an issue with 3-unzip/all_p_ctg.fa files that contain zero-length sequences breaking SMRTLink. Unfortunately that problem won’t be fixed in SMRTLink anytime soon, but I have a script that will quickly clean your FASTA so that it’s usable in SMRTLink.
If you’ve installed the virtualenv described here: http://pb-falcon.readthedocs.io/en/latest/quick_start.html
you should be able to pip install the falcon_tools repo into your existing FALCON (fc_env) virtualenv:
(fc_env)$ git clone https://github.com/gconcepcion/falcon_tools.git
(fc_env)$ cd falcon_tools
(fc_env)$ pip install ./
(fc_env)$ clean_fasta.py --help
You can also plot read length and overlap distributions with plot_distributions.py:
(fc_env)$ plot_distributions.py --debug /path/to/falcon/root
Hope this helps,
Greg
Thanks Greg, it definitely does.
For now I will remove the empty p-contigs and their corresponding haplotigs; together they sum to 1.3 Mb out of an assembly of more than 1.2 Gb. Could these lower-coverage contigs be allelic p-contigs?
In general, are the p-contig IDs after Unzip the same as they were before? Cheers
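The removal described above (empty p-contigs plus their haplotigs) can be sketched in Python. This is an illustration under stated assumptions, not the clean_fasta.py script from falcon_tools: all function names are hypothetical, the sequences are toy data, and it assumes each haplotig ID is the primary contig ID followed by an underscore and a number, as in 000870F_001 above.

```python
import io


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None and line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def drop_empty_primaries(p_records, h_records):
    """Remove zero-length primary contigs and the haplotigs derived from them.

    Assumption: a haplotig ID is '<primary ID>_<number>'.
    Returns (kept_primaries, kept_haplotigs, empty_primary_ids).
    """
    p_records = list(p_records)
    empty = {name for name, seq in p_records if not seq}
    kept_p = [(n, s) for n, s in p_records if n not in empty]
    kept_h = [(n, s) for n, s in h_records if n.split("_")[0] not in empty]
    return kept_p, kept_h, empty


def write_fasta(records, handle, width=60):
    """Write records as FASTA, wrapping sequence lines at `width` bases."""
    for name, seq in records:
        handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


if __name__ == "__main__":
    # Toy data only: one good primary contig and one empty one.
    p = io.StringIO(">000001F\nACGTACGT\n>000870F\n\n")
    h = io.StringIO(">000001F_001\nACGT\n>000870F_001\nGGCC\n")
    kept_p, kept_h, empty = drop_empty_primaries(read_fasta(p), read_fasta(h))
    print("dropped primaries:", sorted(empty))
    out = io.StringIO()
    write_fasta(kept_p + kept_h, out)
    print(out.getvalue(), end="")
```

On real data, replace the StringIO objects with open handles for 3-unzip/all_p_ctg.fa and the matching haplotig FASTA, and write the kept records to new files rather than overwriting the originals.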