GECCO
[Question] Documentation - Gecco use cases for 'annotation', downstream 'antismash'
Hi @althonos
I have some questions pertaining to the documentation. I know you mention some documentation here and also have a disclaimer.
Before I ask my questions: I think there is a bug, or something wrong in the help text, for -vvv (verbose debugging). I do not think that -vvv is working. Does it stand for "very very verbose"?
- When I invoke it, it causes the program to exit:
gecco -vvv run --genome GENOME.fasta -o gecco_GENOME >& verbose_GENOME_gecco.txt &
- However, the same command works if I change -vvv to -vv.
Here is the relevant gecco --help text; it states that -vvv shows debug information:
gecco --help
Parameters:
    -h, --help       show the message for ``gecco`` or
                     for a given subcommand.
    -q, --quiet      silence any output other than errors
                     (-qq silences everything).
    -v, --verbose    increase verbosity (-v is minimal,
                     -vv is verbose, and -vvv shows
                     debug information).
    -V, --version    show the program version and exit.
I have some questions/feature requests:
- When do you use the gecco annotate command and what is the purpose of it?
- In what scenarios does one use gecco for downstream post-processing with antismash? I could not understand the use case for it from the preprint.
- I am assuming you would have done a downstream BiG-SLiCE process with your datasets. As a feature request or enhancement, it would be nice to have gecco outputs (or scripts) in a compatible way for BiG-SLiCE.
- I do also note that you mention here to write our own scripts to make it compatible for BiG-SLiCE.
Parameters - Cluster Detection:
    -c, --cds <N>            the minimum number of coding sequences a
                             valid cluster must contain. [default: 3]
    -m <m>, --threshold <m>  the probability threshold for cluster
                             detection. Default depends on the
                             post-processing method (0.4 for gecco,
                             0.6 for antismash).
    --postproc <method>      the method to use for cluster validation
                             (antismash or gecco). [default: gecco]
Hi @tamuanand
I do not think that the -vvv is working.
Yes, this is an old option and it doesn't work anymore; I just forgot to remove it from the help text. There are just three verbosity levels now (nothing, -v and -vv). I've fixed the help message, but we have yet to publish the next release with that fix.
When do you use the gecco annotate command and what is the purpose of it
I added this command to make it easier to create training data: it creates the feature tables that are then used with gecco embed and gecco train. It basically runs the ORF detection and HMM annotation stages. If you don't plan to re-train GECCO yourself, this command won't be of much interest to you.
In what scenarios does one use gecco for downstream post-processing with antismash
Well, none really. You'd probably want to use them to complement one another, as they will give you different putative clusters (antiSMASH being very good at finding clusters close to known ones, GECCO being better at identifying novel architectures).
If you are confused about the --postproc option, it's not actually for post-processing antiSMASH results with GECCO or anything like that: it controls how we filter candidate cluster regions identified by the CRF (the antismash criterion being stricter, and requiring some domains that antiSMASH considers "biosynthetic" to be present in the candidate BGC).
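To make the filtering step concrete, here is a minimal Python sketch of the two settings shown in the help text: a default threshold that depends on --postproc (0.4 for gecco, 0.6 for antismash) and a minimum CDS count (default 3). This is not GECCO's actual code. The function names, the list-of-probabilities representation, and the use of the mean per-gene probability as the cluster score are all assumptions made for illustration, and the extra antismash requirement on "biosynthetic" domains is not modelled.

```python
from statistics import mean

# Default thresholds as documented in the `gecco run` help text.
DEFAULT_THRESHOLDS = {"gecco": 0.4, "antismash": 0.6}


def resolve_threshold(postproc="gecco", threshold=None):
    """Return the user-supplied threshold, or the documented default for `postproc`."""
    if postproc not in DEFAULT_THRESHOLDS:
        raise ValueError(f"unknown post-processing method: {postproc!r}")
    return DEFAULT_THRESHOLDS[postproc] if threshold is None else threshold


def keep_candidate(gene_probabilities, postproc="gecco", threshold=None, min_cds=3):
    """Decide whether a candidate region passes the (simplified) filters.

    ASSUMPTION: the region is scored by the mean of its per-gene
    probabilities; the real scoring rule may differ.
    """
    if len(gene_probabilities) < min_cds:
        return False  # too few coding sequences to be a valid cluster
    return mean(gene_probabilities) >= resolve_threshold(postproc, threshold)


# Hypothetical candidate with four genes and a mean probability of about 0.5.
candidate = [0.55, 0.48, 0.52, 0.45]
print(keep_candidate(candidate, postproc="gecco"))      # compared against 0.4
print(keep_candidate(candidate, postproc="antismash"))  # compared against 0.6
```

The same candidate can pass with one method and fail with the other, which is why the two settings yield different sets of reported clusters.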
I am assuming you would have done a downstream BiG-SLiCE process with your datasets
We actually didn't, as we didn't find BiG-SLiCE scalable enough for our dataset: it doesn't support heavily distributed computations and requires annotating all of the BGCs with hmmscan (which couldn't be done on our HPC cluster).
I do also note that you mention here to write our own scripts to make it compatible for BiG-SLiCE
I am currently writing a dedicated command to help get results into BiG-SLiCE, but everything needed is already there in the GenBank "structured comments" of the output.
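For anyone who wants to script this in the meantime, here is a minimal Python sketch that pulls structured comments out of the COMMENT block of a GenBank record using only the standard library. It assumes the usual NCBI layout (a `##<Name>-START##` line, `key :: value` lines, then `##<Name>-END##`). The block name `GECCO-Data` and the field names in the sample record are placeholders for illustration and should be checked against a real GECCO output file.

```python
import re


def parse_structured_comments(genbank_text):
    """Return {block_name: {key: value}} for every structured comment found."""
    blocks = {}
    current = None       # dict of the block being read, if any
    in_comment = False   # True while inside the COMMENT section
    for line in genbank_text.splitlines():
        if line.startswith("COMMENT"):
            in_comment = True
            content = line[len("COMMENT"):].strip()
        elif in_comment and line.startswith(" "):
            content = line.strip()  # continuation line of the COMMENT section
        else:
            in_comment = False      # any other keyword ends the COMMENT section
            current = None
            continue
        start = re.fullmatch(r"##(.+)-START##", content)
        end = re.fullmatch(r"##(.+)-END##", content)
        if start:
            current = blocks.setdefault(start.group(1), {})
        elif end:
            current = None
        elif current is not None and "::" in content:
            key, value = content.split("::", 1)
            current[key.strip()] = value.strip()
    return blocks


# Hypothetical record: the block and field names are illustrative only.
SAMPLE = """\
LOCUS       example_cluster_1       1200 bp    DNA     linear   UNK
DEFINITION  hypothetical example record.
COMMENT     ##GECCO-Data-START##
            version          :: 0.0.0
            cluster_type     :: Unknown
            ##GECCO-Data-END##
FEATURES             Location/Qualifiers
//
"""

print(parse_structured_comments(SAMPLE))
```

The resulting dictionary can then be rewritten into whatever layout the downstream tool expects.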
Hi @althonos, I am not able to get the datasets.tsv file and the taxonomy folders. Are those supposed to be generated via the convert command?
I am not able to get the datasets.tsv file and the taxonomy folders. Are those supposed to be generated via the convert command?
BiG-SLiCE requires these files because of its expected input structure; GECCO cannot generate them for you.
Hi @althonos
Thanks for responding to my queries.
I have a follow-up query: you suggest using gecco as a complement to antiSMASH, gecco being better at identifying novel architectures and antiSMASH at finding known things.
My question: I am assuming gecco will still be able to find clusters close to known things as well, correct? Based on Fig. 3a of the preprint, is my understanding below correct for just the gecco vs antiSMASH comparison?
- gecco alone - 374,849
- gecco and antiSMASH intersection - 301,201 plus 75,048
- antiSMASH alone - 524,420
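Under that reading of Fig. 3a (which is only the asker's interpretation at this point in the thread), the per-tool totals follow from simple addition. The short Python sketch below just sums the figures listed above; the variable names are illustrative.

```python
# Counts as listed above, taken from the asker's reading of Fig. 3a.
gecco_only = 374_849
shared = 301_201 + 75_048   # the two intersection counts added together
antismash_only = 524_420

# Total per tool = clusters found by that tool alone + clusters found by both.
gecco_total = gecco_only + shared
antismash_total = antismash_only + shared

print(f"shared by both:  {shared:,}")
print(f"gecco total:     {gecco_total:,}")
print(f"antiSMASH total: {antismash_total:,}")
```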
Were the above done with antiSMASH 5.1 or 5.2?
The reason I ask is that the preprint at one point talks about antiSMASH 4.2. Is there any specific reason why 4.2 was used when 5.1 or 5.2 was already available?
The command-line implementation of antiSMASH v4.2.0
was then used to identify the coordinates of known BGCs in all selected contigs (using default
settings), and ORFs/domains that overlapped with the resulting known BGC regions were
removed from the feature table, yielding a final BGC-negative feature table for each
prokaryotic contig (Supplementary Figure S2).
Hi @althonos
I was wondering if you could elaborate on the above.
Thanks
@tamuanand: Figure 3a was done with antiSMASH 5.2.
We used antiSMASH 4.2 to mask the biosynthetic regions in our training data, because we prepared the sequences at a time when antiSMASH 5 was not available. We are in the process of improving our training set, which includes rebuilding our set of contigs, and for this we will use antiSMASH 5.2 as well.
Hi @althonos
antiSMASH 6 is now available. If you are planning to use antiSMASH, I would recommend using antiSMASH 6.0.