SubPhaser icon indicating copy to clipboard operation
SubPhaser copied to clipboard

sg.config configuration (The homology of the de novo assembled genome is not known)

Open hushaoqiang opened this issue 2 years ago • 24 comments

Hello, thank you very much for your SubPhaser software, which is very useful for genotyping. I used HIFI and HIC to assemble the genome of Fragaria x ananassa but I do not know the homology of the 28 scaffolds. Should I put all the chromosomes in the sg.config file in one line? I am looking forward to your reply.

hushaoqiang avatar Apr 28 '22 03:04 hushaoqiang

No. It may be very easy to know the homology by aligning your chromosomes to the genome of Fragaria vesca using tools such as minimap2 etc. You may be able to get a figure like below:

fan-fve

So the sg.config configuration can be set like:

1-1 1-2 1-3 1-4
2-1 2-2 2-3 2-4
...

zhangrengang avatar Apr 28 '22 03:04 zhangrengang

Another way for you, the HiC contact heatmap can also show diagonal signals between homeologous chromosomes like: hic-heatmap So the sg.config configuration can be set like:

chr01a chr01b chr01c chr01d
chr02a chr02b chr02c chr02d
...

Note that the HiC heatmap should not be filtered by mapping qualtiy of reads.

zhangrengang avatar Apr 28 '22 03:04 zhangrengang

谢谢老师您的解答,我将使用您所推荐的方式来做同源分析。另外还有个问题请教一下老师您,为了说详细点,我就是用中文了。 老师,使用Camarosa草莓的基因组序列想要试运行一下,基因组文件是发表的序列。配置文件内容: Fvb1-1 Fvb1-2 Fvb1-3 Fvb1-4 Fvb2-1 Fvb2-2 Fvb2-3 Fvb2-4 Fvb3-1 Fvb3-2 Fvb3-3 Fvb3-4 Fvb4-1 Fvb4-2 Fvb4-3 Fvb4-4 Fvb5-1 Fvb5-2 Fvb5-3 Fvb5-4 Fvb6-1 Fvb6-2 Fvb6-3 Fvb6-4 Fvb7-1 Fvb7-2 Fvb7-3 Fvb7-4

运行命令是:subphaser -i F_ana_Camarosa_6-28-17_hardmasked.fasta -c sg.config 但是会出现: 22-04-28 18:21:39 [INFO] Consistent with subgenome assignment: 16 (6.93%); potential exchange: 149 (64.50%) 22-04-28 18:21:39 [INFO] Output: /home/hsq/hhh/subphaser/kmls/phase-results/k15_q200_f2.bin.enrich 22-04-28 18:21:39 [INFO] ###Step: LTR 22-04-28 18:21:39 [INFO] Identifying LTR-RTs by ['ltr_harvest'] 22-04-28 18:21:45 [INFO] Check point file: /home/hsq/hhh/subphaser/kmls/tmp/LTR.scn.ok exists; skip this step 22-04-28 18:21:45 [INFO] 8186 LTRs identified 22-04-28 18:21:45 [INFO] Check point file: /home/hsq/hhh/subphaser/kmls/tmp/LTR.tesort.ok exists; skip this step 22-04-28 18:21:45 [INFO] By TEsorter, 190 (2.3%) are classified as LTRs, of which 0 (0.0%) are intact with complete protein domains 22-04-28 18:21:45 [INFO] After filtering, 186 / 8186 (2.3%) LTRs retained 22-04-28 18:21:45 [INFO] Outputing coordinate - LTR maps to /home/hsq/hhh/subphaser/kmls/phase-results/k15_q200_f2.ltr.bin.count 22-04-28 18:21:45 [INFO] Start Pool with 32 process(es) 22-04-28 18:21:45 [INFO] Processed 0 sequences Traceback (most recent call last): File "/home/hsq/miniconda3/envs/SubPhaser/bin/subphaser", line 33, in sys.exit(load_entry_point('subphaser==1.2.5', 'console_scripts', 'subphaser')()) File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 779, in main pipeline.run() File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 516, in run ltr_bedlines, enrich_ltr_bedlines = self.step_ltr(d_kmers) if not self.disable_ltr else ([],[]) File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 563, in step_l tr Seqs.map_kmer3([ltrfile], d_kmers, fout=fout, k=self.k, ncpu=self.ncpu, File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/Seqs.py", line 97, in map_kmer3 logger.info('{} ({:.2%}) sequences contain subgenome-specific kmers'.format(mapped_seqs, mapped_seqs/i)) ZeroDivisionError: division by zero

k15_q200_f2.kmer_pca.pdf是成功生成的。生成的pca图与老师您附表中的图不一样。 请问一下老师我是不是需要调整参数或者其他设置有什么问题。 期待老师您的回复

hushaoqiang avatar Apr 28 '22 10:04 hushaoqiang

Do not use hardmasked genome sequences, because subphaser is based on repeated sequences (TEs in most cases) in fact.

zhangrengang avatar Apr 28 '22 10:04 zhangrengang

老师,请问您对开麦罗莎草莓做分析时(Fig. S27 Subgenome phasing with SubPhaser in the Fragaria x ananassa genome.),使用的文件是哪一个啊。

hushaoqiang avatar Apr 28 '22 11:04 hushaoqiang

F_ana_Camarosa_6-28-17.fasta.gz. But you may not get the identical results but similar results.

zhangrengang avatar Apr 28 '22 11:04 zhangrengang

老师,我刚刚又去GDR网站看了一下,我只看到了 F_ana_Camarosa_6-28-17_hardmasked.fasta.gz跟[F_ana_Camarosa_6-28-17_unmasked.fasta.gz两个文件,请问一下您告诉我的F_ana_Camarosa_6-28-17.fasta.gz文件这个在哪个链接下载啊,麻烦老师您了。

hushaoqiang avatar Apr 28 '22 11:04 hushaoqiang

F_ana_Camarosa_6-28-17_unmasked.fasta.gz

zhangrengang avatar Apr 28 '22 11:04 zhangrengang

老师,已成功运行,4套亚基因组成功分型。图非常好看,感谢老师您的耐心指导与为科研工作者所开发的SubPhaser软件。

hushaoqiang avatar Apr 28 '22 13:04 hushaoqiang

Teacher, I have successfully found homologous chromosomes using MUMmer software, and then performed genotyping using SubPhaser. The results are very good. Thank you!

hushaoqiang avatar Apr 30 '22 10:04 hushaoqiang

Great. Thanks for your feedback.

zhangrengang avatar Apr 30 '22 11:04 zhangrengang

Hi @zhangrengang, Thanks for this excellent software. I have been using this for my newly assembled allopolyploidy genome (2n = 4x = 40). The genome has two subgenomes, but I don't know the homologous pairs. So as you suggested in previous comments, I used mummer and minimapper2 to find the homologous pairs and assigned them in the configurations file.

But after running the SubPhaser with default parameters I got 19 chr in SG1 and only one chr in SG2. Idially, it should be 10 chr in each subgenome. image The differential k-mer heatmap also shows the similar result: image

Can you suggest why it happend? Am I approaching rightly?

kashiff007 avatar Jun 05 '22 12:06 kashiff007

@kashiff007 It seems that your genome is contig-level, but not chromosome-level?

zhangrengang avatar Jun 05 '22 12:06 zhangrengang

Yes, It is a newly assembled genome and from cytometry, we speculate that the largest 20 contigs are full genome length.

kashiff007 avatar Jun 05 '22 12:06 kashiff007

@kashiff007 Can you show the homoelogous relationship such as dot plot, and your config file? How much percent do the largest 20 contigs account for the whole genome?

zhangrengang avatar Jun 05 '22 12:06 zhangrengang

image

here y-axis is my new genome and when we align with another closely related species Oropetium (x-asix), id gives the following plot. You can see each of the x-axis has two aligned positions in y-axis, which tells that these are homologous.

Based on this I made a config file which looks like: ptg000016l ptg000003l ptg000013l ptg000004l ptg000022l ptg000015l ptg000011l ptg000007l ptg000012l ptg000006l ptg000019l ptg000008l ptg000017l ptg000018l ptg000002l ptg000010l ptg000001l ptg000005l ptg000020l ptg000014l Each row has two contigs (assuming homologous chromosomes) separated by space.

Total genoem size is 587 mb (587146212) and top 20 contigs are abount 573mb (573812465). or 97.72% of the total genome.

kashiff007 avatar Jun 05 '22 12:06 kashiff007

@kashiff007 It seems no problem. Do you have evidence for allo?

zhangrengang avatar Jun 05 '22 13:06 zhangrengang

I don't have evidence for allo. My primary job was to separate the subgenemes so I followed this thread. The circos looks like this: circos

kashiff007 avatar Jun 05 '22 13:06 kashiff007

is it something to do with clustering?

kashiff007 avatar Jun 05 '22 13:06 kashiff007

@kashiff007 If there are no much assembly errors (swtich errors, etc,), your genome is likely to be an auto-polyploid or be heavily recombined, which can not by phased by subphaser. You can try to change the parameters, such as setting -q 100 or reducing the number of chromosome sets like:

ptg000016l ptg000003l
ptg000013l ptg000004l
ptg000022l ptg000015l
ptg000011l ptg000007l
ptg000012l 
ptg000006l
ptg000019l 
...
ptg000020l 
ptg000014l

But these may also not work.

zhangrengang avatar Jun 05 '22 13:06 zhangrengang

OK, I appreciate the explanation.

kashiff007 avatar Jun 05 '22 13:06 kashiff007