Discrepancy protein assessment compleasm vs BUSCO v5.8.3
Hi, I am studying a marine mammal, and did a completeness assessment with BUSCO v5.8.3 and with compleasm. With the genome, I got a score from BUSCO of ------------------------------------------------------------------------------------------- |Results from dataset carnivora_odb12 | ------------------------------------------------------------------------------------------- |C:98.5%[S:97.1%,D:1.4%],F:1.0%,M:0.5%,n:13727,E:8.3% | |13516 Complete BUSCOs (C) (of which 1122 contain internal stop codons) | |13328 Complete and single-copy BUSCOs (S) | |188 Complete and duplicated BUSCOs (D) | |140 Fragmented BUSCOs (F) | |71 Missing BUSCOs (M) | |13727 Total BUSCO groups searched | ------------------------------------------------------------------------------------------- We were pretty impressed since we got a both C score of 98.5 and a S score of 97.1 for just an assembly with PacBio.
The compleasm score was pretty similar. S:98.67%, 13544 D:1.01%, 138 F:0.16%, 22 I:0.00%, 0 M:0.17%, 23 N:13727
Which looks a little better, compleasm don't have a Complete score, which might be misleading, yet a S score, of 98.67, and a lower duplicated score as well as a lower fragmented and missing score.
I heard that gene prediction tends to look better. I did work with a bat genome, and the BUSCO score got better after gene prediction, and I used the newly released eviann software from @alekseyzimin. I was really excited, because he released incredible software, using sophisticated statistical analysis, and its platform uses protein and RNA-seq data. The structural annotation found 23k genes while its sister species has about 22k, so pretty close since I used more data than the authors of the sister species gene prediction.
The shocking result was the score with the proteins. Busco, was pretty similar ---------------------------------------------------- |Results from dataset carnivora_odb12 | ---------------------------------------------------- |C:92.4%[S:30.5%,D:62.0%],F:1.8%,M:5.8%,n:13727 | |12688 Complete BUSCOs (C) | |4183 Complete and single-copy BUSCOs (S) | |8505 Complete and duplicated BUSCOs (D) | |242 Fragmented BUSCOs (F) | |797 Missing BUSCOs (M) | |13727 Total BUSCO groups searched | ---------------------------------------------------- Nonetheless, kinda disappointed because the C complete was lower as well as the S score, but the shocking score is the D and M score. I can't believe the M score from the genome got that higher. And compleasm wasn't any better. A S:25.95%, 3562, lower than BUSCO, a higher D:66.53%, 9132, a higher F:1.84%, 253, but a lower M:5.68%, 780.
Any comments on this discrepancy?
Also, would it be possible to include the name of the filename in the output? I ran a loop with several genomes, and the log is confusing because it doesn't state the filename
Searching for miniprot in the path where compleasm.py is located
Searching for miniprot in the current execution path
Searching for miniprot in $PATH
Searching for hmmsearch in the path where compleasm.py is located
Searching for hmmsearch in the current execution path
Searching for hmmsearch in $PATH
miniprot execute command:
/data/common/juanpablo.aguilar/apps/miniconda3/envs/compleasm/bin/miniprot
lineage: carnivora_odb12
hmmsearch execute command:
/data/common/juanpablo.aguilar/apps/miniconda3/envs/compleasm/bin/hmmsearch
S:98.67%, 13544
D:1.01%, 138
F:0.16%, 22
I:0.00%, 0
M:0.17%, 23
N:13727
Thus, with a loop to run everything one can't tell which file was at that iteration, and it is also confusing and is weird that, the output, for protein, or genome, kinda gets stuck with "Searching for hmmsearch in $PATH" and doesn't mention anything about progress or what it is running until a long time when it outputs the results.
Hi @desmodus1984
The logic behind the protein mode is to compare protein sequences against the BUSCO database using HMMER search, in order to determine how many of the proteins are single-copy complete, duplicated, fragmented, or missing. While both BUSCO and compleasm follow essentially the same approach, the thresholds used to define these categories (e.g., single-complete vs duplicated) may differ slightly, and they may also apply different filtering strategies. As a result, the final outputs are always a little bit different.
Thank you for raising the issue about looped runs for multiple genomes. The current solution is to run different genomes under separate directories and run them via own job script. I understand this is not very convenient, and I agree it’s time to improve this functionality. I'll fix this as soon as possible.
Hello,
what I have faced recently is a discrepancy on the duplicated genes when running BUSCO (48) vs Compleasm (128). Also, this is the first time it has happened and I have worked with 6 different plant genomes so far! I have a .prot.fasta file, created using BRAKER3 and filtered by the longest isoform gff3 (AGAT) and for duplicated sequences using seqkit!
This is the output of BUSCO v6.0.0 with viridiplantae_odb12:
This is the output of compleasm v0.2.7 with viridiplantae_odb12:
I do agree that they can have different outputs given the different filtering stringencies, but I thought that this was a big difference! And again, the Complete, Fragment and Missing scores are similar, it's only the duplicated ones that are weird. I also looked at the "full_table.tsv" output of Compleasm and looked at sequences that were considered "Duplicated" based on the same database protein ID (col 1).
The alignment did not look great, and when I blasted the sequences individually, some of them had a similar match (Like cytochrome P450), but others did not. I also checked for some of these proteins localizations and they are generally spread out through the pseudo-chromosomes. I was wondering if the protein headers that I assigned on the .prot.fasta file were misleading, or something else on my end is going wrong. I couldn't narrow it down!
Anyways, Compleasm has been GREAT!!!!!! Just wondering what could be generating this weird differences!
Best,
Y
I assume this is the genome mode. If yes, compleasm writes the miniprot output somewhere. It would be good to first check the alignment of these genes.
It's the protein mode!
Interestingly, the genome also has a low number of "Duplicated" copies, that's why I was really surprise with the 15% that I got!
This is the genome mode:
@yanarizzieri
Could you try the dev branch of compleasm? You can directly download the updated compleasm.py script and run it within your existing environment.
The key difference between how compleasm and BUSCO handle duplicates is that BUSCO applies additional filtering strategies. For example, removing all duplicate matches and discarding any matches that fall below 85% of the top match score for each BUSCO.
I’ve added some similar filtering logic in the dev branch, but I’m not sure how well it performs on your dataset. Any feedback would be greatly appreciated!
Hey Neng,
same results as the previous compleasm! The dev branch has more lines that the other one (0.2.7) so I THINK that I am using the corect version!
Also, the duplicated genes seem to have similar matches:
Hi @huangnengCSU
Hope you are doing well after Thanksgiving. I wanted to ask you a question. Does compleasm properly detects in-frame stop codons?
I want to do comparative genomics of two sister species and I want to have an annotation as-complete-as-possible .
I wanted to share some results that I have had with BUSCO. I have two sister species, and for species A, the genome completeness was 99.1% and the proteome/annotation completeness was lower 95%. For the other one, the genome completeness was 98.6% while the proteome completeness was even lower 89.1%.
The author of EviAnn added the option to start with an initial gff, and I thought, that I could "reuse BUSCO output" for this. I concatenated the gffs of the single and multiple BUSCOS, and then I validated it with gffread, orfanage, and agat. I counted the number of genes/mRNAs,
grep -c '[[:space:]]mRNA[[:space:]]' LW2.HiC.BUSCO-multi.gff
and the results were confusing because the genome BUSCO complete were 13533 genes/mRNAs, but after merging the gff, there were 13721; the gff of only single-BUSCOs, matched the BUSCO single score: 13360, while the multiple didn't BUSCO:173/mRNA:361, which is expected.
Then, I checked for protein errors with gffread. BUSCO outputs this E-value, the number of internal stop-codons. Gffread can check various features of the input gff like presence of START/STOP codons, and in-frame STOP codons and also check CDS phase. I assessed individually, the merged-gff of the single-copy vs the multi-copy for START/STOP and in-frame STOP codons, and gffread didn't detect any in-frame STOP in the multi-copy, however, there were a lot in the single-copy, after discarding genes with in-frame, the total decreased from 13360 to 10786, while the BUSCO E was 383. Thus, according to gffread, there are 10 times more in-frame stop codons than what BUSCO reported. They are valid in the sense that they have START/STOP codons, but many of them have in-frame STOP codons.
How does compleasm deals with "orthologs" with in-frame stop codons? For comparing genes, detecting orthologs I am worried that this can be a problem that BUSCO kinda partially pointed out but was actually bigger.
Hope compleasm can do a better assessment. Looking forward to your comments, ideas, and suggestions.
Thanks
@desmodus1984
Compleasm does not perform stop-codon detection itself. It relies on miniprot to align protein sequences to the genome and to produce the translated protein sequences.
If you would like more details on how stop codons are handled during alignment or translation, I recommend raising the question under the miniprot repository, where you can get a more accurate answer.