
P contigs of length 0

Open dcopetti opened this issue 8 years ago • 2 comments

Hello, I noticed that after FALCON-Unzip some of my p-contigs have length 0 (the sequence line is empty), even though haplotigs exist for the same contig. Counting the lengths:

000870F [empty]
000870F_001    15026
000870F_002    152914
000870F_003    63913
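A length listing like the one above can be produced with a few lines of Python. This is only a minimal sketch, not part of FALCON: the function name `fasta_lengths` is hypothetical, and it assumes a plain FASTA input where the record ID is the first whitespace-delimited token of the header.

```python
import io

def fasta_lengths(handle):
    """Yield (record_id, sequence_length) for each record in a FASTA stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Record ID = first token of the header (assumption, see lead-in).
            name = (line[1:].split() or [""])[0]
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length

if __name__ == "__main__":
    # Tiny in-memory example; a real run would open e.g. all_p_ctg.fa instead.
    demo = io.StringIO(">000870F\n\n>000870F_001\nACGT\nAC\n")
    for record_id, n in fasta_lengths(demo):
        print(record_id, n if n else "[empty]", sep="\t")
```

Records whose length is 0 are the ones reported as `[empty]`.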

Looking in the 3-unzip folder, ctg 870F has phased reads:

0 000870F 1 1 10 36 m161228_035316_42219_c101154552550000001823268507191790_s1_p0/103469/10714_36102
1 000870F 1 1 2 13 m54138_170625_065846/29360350/0_23161
2 000870F 1 1 9 32 m170104_032305_42219_c101154862550000001823268507191796_s1_p0/148504/775_31342
5 000870F 1 1 4 17 m170114_035841_42219_c101154482550000001823268507191790_s1_p0/85422/0_17582
7 000870F 1 1 10 65 m170112_110003_42219_c101154682550000001823268507191732_s1_p0/8381/0_22775

but no lines in all_p_ctg_edges

the 3-unzip/0-phasing/000870F folder contains:

[dcopetti@pac /wing2/users/jzhang/work/bermuda/falcon1st/3-unzip/0-phasing/000870F]$ ls -lrth
total 48K
drwxr-xr-x 2 jzhang bioinfo 4.0K Aug 11 18:02 blasr
-rw-r--r-- 1 jzhang bioinfo  849 Aug 12 06:40 p_000870F.sh
drwxr-xr-x 6 jzhang bioinfo   61 Aug 12 06:40 mypwatcher
drwxr-xr-x 2 jzhang bioinfo  137 Aug 12 06:41 het_call
drwxr-xr-x 2 jzhang bioinfo   99 Aug 12 06:41 g_atable
drwxr-xr-x 2 jzhang bioinfo  108 Aug 12 06:41 get_phased_blocks
-rw-r--r-- 1 jzhang bioinfo  600 Aug 12 06:41 task.json
-rw-r--r-- 1 jzhang bioinfo  159 Aug 12 06:41 task.sh
-rw-r--r-- 1 jzhang bioinfo  207 Aug 12 06:41 run.sh
lrwxrwxrwx 1 jzhang bioinfo  108 Aug 12 06:41 pwatcher.dir -> /newwing/wing2/users/jzhang/work/bermuda/falcon1st/3-unzip/0-phasing/000870F/mypwatcher/jobs/P65c7e75853c6dd
-rw-r--r-- 1 jzhang bioinfo  17K Aug 12 06:41 phased_reads
-rw-r--r-- 1 jzhang bioinfo    0 Aug 12 06:41 run.sh.done
-rw-r--r-- 1 jzhang bioinfo 6.6K Aug 12 06:41 rid_to_phase.000870F
drwxr-xr-x 2 jzhang bioinfo  133 Aug 12 06:41 phasing

I have 13 contigs like this in this assembly and two in another, in both cases produced by the Unzip step. Is there a way to recover those contigs? Thanks, Dario

dcopetti avatar Aug 28 '17 07:08 dcopetti

Hi Dario,

It’s typical that there are fewer contigs present after running the unzip pipeline. This most often occurs in coverage-limited situations: for example, you might have a primary contig with sufficient coverage in the primary FALCON assembly, but after unzipping and partitioning the reads into two sets there is not enough support to accurately call consensus on that contig. These contigs generally appear among the higher-numbered, shorter contigs and typically represent a tiny fraction of the genome. We haven’t spent a lot of time investigating other reasons why some contigs don’t make it through the unzip pipeline; we just tend to take that as “evidence” that it wasn’t a highly supported contig in the first place and ignore them in the short term.

We realize there is an issue with 3-unzip/all_p_ctg.fa files that contain zero-length sequences breaking SMRTLink. Unfortunately that problem won’t be fixed in SMRTLink anytime soon, but I have a script that will quickly clean your FASTA for you so that it’s usable in SMRTLink.

If you’ve installed the virtualenv here: http://pb-falcon.readthedocs.io/en/latest/quick_start.html

You should be able to pip install the falcon_tools repo into your existing FALCON (fc_env) virtualenv

(fc_env)$ git clone https://github.com/gconcepcion/falcon_tools.git
(fc_env)$ cd falcon_tools
(fc_env)$ pip install ./
(fc_env)$ clean_fasta.py --help
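For readers who only need the zero-length records removed, the idea can be sketched in plain Python. This is not the implementation of clean_fasta.py, whose behavior is not shown in this thread; the function names `read_fasta` and `drop_empty_records` and the 60-column line width are assumptions made for illustration.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs; header excludes the leading '>'."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def drop_empty_records(src, dst, width=60):
    """Copy FASTA records from src to dst, skipping zero-length sequences.

    Returns the headers of the skipped records so they can be reviewed.
    """
    skipped = []
    for header, seq in read_fasta(src):
        if not seq:
            skipped.append(header)
            continue
        dst.write(f">{header}\n")
        for i in range(0, len(seq), width):
            dst.write(seq[i:i + width] + "\n")
    return skipped

if __name__ == "__main__":
    # In-memory example; real use would pass open file handles instead.
    src = io.StringIO(">000870F\n\n>000871F\nACGT\n")
    dst = io.StringIO()
    print("skipped:", drop_empty_records(src, dst))
```

Returning the skipped headers keeps a record of which p-contigs were dropped, which is useful if their haplotigs need to be handled afterwards.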

You can also plot read-length and overlap distributions with plot_distributions.py:

(fc_env)$ plot_distributions.py --debug /path/to/falcon/root

Hope this helps,

Greg

gconcepcion avatar Aug 28 '17 17:08 gconcepcion

Thanks Greg, It definitely does.

For now I will remove the empty p-contigs and their corresponding haplotigs; together they sum to 1.3 Mb of an assembly of more than 1.2 Gb. Maybe these lower-coverage contigs are allelic p-contigs?
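Selecting the haplotigs that belong to an empty p-contig can be sketched as follows. It assumes the naming seen in the listing earlier in this thread, where a haplotig ID is the p-contig ID followed by an underscore and a number (for example 000870F_001); the helper names are hypothetical.

```python
def parent_p_contig(record_id):
    """Return the p-contig ID for a haplotig ID such as '000870F_001'.

    A p-contig ID without an underscore is returned unchanged.
    """
    return record_id.split("_", 1)[0]

def ids_to_keep(record_ids, empty_p_contigs):
    """Drop every ID whose parent p-contig is in empty_p_contigs."""
    empty = set(empty_p_contigs)
    return [rid for rid in record_ids if parent_p_contig(rid) not in empty]

if __name__ == "__main__":
    all_ids = ["000870F", "000870F_001", "000870F_002", "000871F", "000871F_001"]
    print(ids_to_keep(all_ids, ["000870F"]))
```

The resulting ID list can then be used to subset both the p-contig and the haplotig FASTA files.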

In general, after Unzip are the IDs of the p-contigs maintained the same as they were before? Cheers

dcopetti avatar Aug 29 '17 07:08 dcopetti