bgcflow
Feature: df_bgcs tables with all metadata
Proposal:
Create df_bgcs and df_gcfs tables in the processed/{project_name}/tables directory, with several metadata fields taken directly from antiSMASH results.
I think a general table with several BGC metadata fields would be valuable, and it should live in the main tables directory instead of the current for_cytoscape directory.
List of extra metadata:
I think some extra columns would be beneficial. I am adding a few below and am looking for more recommendations:
- antiSMASH-based prediction of MIBIG-similar BGCs from knownclusterblast results, with % similarity (sometimes antiSMASH finds known clusters that BiGSCAPE misses)
- whether a BGC is on contig edge (very useful)
- number of genes in BGC and size of the BGC in KB
- number of A-domains, when present
- number of core biosynthetic genes
- path to BGC gbk file
- assigned GCF with 0.3 cutoff
- number of genomes where the GCF is present
- Whether BGC is known based on BiGSCAPE
- Whether BGC is known based on antiSMASH
- BiGSLICE family assignment
- BGCs in BiGSLICE model
Need anything more - @matinnuhamunada ?
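To make a couple of the proposed columns concrete, here is a minimal sketch of how the BGC size in KB and the number of genomes per GCF could be derived. The record fields (`bgc_id`, `genome_id`, `gcf_0.30`, `start`, `end`) and the function name are hypothetical and only for illustration; they are not BGCFlow's actual schema or code.

```python
from collections import defaultdict

# Hypothetical per-BGC records; the field names are assumptions for illustration.
bgcs = [
    {"bgc_id": "g1.region001", "genome_id": "g1", "gcf_0.30": 101, "start": 0, "end": 45000},
    {"bgc_id": "g2.region003", "genome_id": "g2", "gcf_0.30": 101, "start": 1000, "end": 31000},
    {"bgc_id": "g2.region007", "genome_id": "g2", "gcf_0.30": 202, "start": 500, "end": 20500},
]


def build_bgc_table(records):
    """Return one row per BGC with two derived columns added."""
    # Collect the distinct genomes in which each GCF occurs.
    genomes_per_gcf = defaultdict(set)
    for rec in records:
        genomes_per_gcf[rec["gcf_0.30"]].add(rec["genome_id"])

    rows = []
    for rec in records:
        row = dict(rec)  # copy so the input records stay unchanged
        row["size_kb"] = (rec["end"] - rec["start"]) / 1000
        row["gcf_genome_count"] = len(genomes_per_gcf[rec["gcf_0.30"]])
        rows.append(row)
    return rows


table = build_bgc_table(bgcs)
for row in table:
    print(row["bgc_id"], row["size_kb"], row["gcf_genome_count"])
```

The other columns (contig edge, knownclusterblast hits, BiGSLICE assignments) would be joined onto the same per-BGC rows in a similar way, keyed on the BGC identifier.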
This looks perfect! Some of the data in this table can answer questions that @EVBAST and @tilmweber discussed this morning. Adding URLs to the MIBIG hits has proven useful for end users too.
THANKS @matinnuhamunada and @OmkarSaMo
Hi, this issue will be addressed in the 0.6.1 release. Because the table is huge, it will be stored in .parquet format and loaded into DuckDB (instead of SQLite): https://github.com/NBChub/bgcflow/tree/dev-0.5.1