CAMISIM

Is profile-based design with custom genomes possible?

Open CassandraHjo opened this issue 1 year ago • 11 comments

I want to use my own FASTA files as input genomes for the simulation, and I am wondering whether profile-based community design is possible with custom genomes, i.e. without using genomes from NCBI. Would it be an option to provide my own genome sequence collection in the profile-based design, as I can in the de novo design?

CassandraHjo avatar Dec 04 '23 13:12 CassandraHjo

Actually this is possible, yes! You will need to use the -ar/--additional-references option. This file has to be tab-separated with the 4 columns NCBI_ID, Scientific_name, genome_path and novelty_category (without a header). The NCBI ID is required for mapping the scientific name from the profile to your genome; for the novelty category you can just use known_strain.

If you do not want to use genomes from the NCBI at all, you will additionally need to use the -ref/--reference-genomes option and point it to an empty file, so that CAMISIM does not use the default reference list in addition to the one provided by you. The command could look like this:

./metagenome_from_profile --additional-references /your/reference/file.tsv --reference-genomes /path/to/empty/file.tsv -p /your/profile.biom
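For illustration, the 4-column reference file and the empty reference list could be generated with a short Python script. This is only a sketch: the function name `write_reference_files`, the output file names, and the pairing of the two example genomes with these taxa are hypothetical, and you would substitute the NCBI IDs and scientific names that actually apply to your genomes.

```python
import csv
from pathlib import Path


def write_reference_files(entries, reference_tsv, empty_tsv):
    """Write a header-less, tab-separated reference file plus an empty file.

    entries: iterable of (ncbi_id, scientific_name, genome_path) tuples.
    """
    reference_tsv = Path(reference_tsv)
    with reference_tsv.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for ncbi_id, scientific_name, genome_path in entries:
            # Store absolute paths so the working directory does not matter.
            absolute_path = str(Path(genome_path).resolve())
            # Columns: NCBI_ID, Scientific_name, genome_path, novelty_category
            writer.writerow([ncbi_id, scientific_name, absolute_path, "known_strain"])
    # Empty file to pass to --reference-genomes.
    Path(empty_tsv).write_text("")
    return reference_tsv


# Hypothetical example: two custom genomes assigned to known species.
entries = [
    (562, "Escherichia coli", "genomes/MAG0001.fasta"),
    (1423, "Bacillus subtilis", "genomes/MAG0002.fasta"),
]
write_reference_files(entries, "additional_references.tsv", "empty_references.tsv")
```

The two resulting files would then be passed to --additional-references and --reference-genomes respectively.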

AlphaSquad avatar Dec 04 '23 17:12 AlphaSquad

I am not interested in gsa or pooled_gsa. Do I still need to find a unique NCBI ID or can I just use 2?

For example, can an entry in reference_file.tsv look like this?

2 MAG0001 genomes/MAG0001.fasta known_strain

CassandraHjo avatar Dec 12 '23 09:12 CassandraHjo

You don't need to provide these NCBI IDs then, so yes it could look like this, though just to be safe I'd advise using absolute paths to your genomes.

AlphaSquad avatar Dec 12 '23 12:12 AlphaSquad

What does the biom file need to contain if an entry in the reference file looks like the one above?

CassandraHjo avatar Dec 12 '23 12:12 CassandraHjo

Actually, looking at the code right now, I was mistaken. For CAMISIM to work, every entry in the reference file needs to have a "correct" NCBI ID and scientific name. If you choose 2 as your taxonomy ID, CAMISIM assumes that all your input genomes are at the taxonomic level of superkingdom, and the mapping will not work. Additionally, the mapping from the BIOM profile to your genome is performed via the scientific name, so using MAG0001 will not work, as it will not be recognised as a scientific name.

The format of your BIOM profile should be similar to the mini.biom profile provided. The abundances are stored under data and the taxonomy in the same format as in mini.biom, i.e. the entries need the metadata and taxonomy keywords. QIIME usually produces these files in the correct format already.
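A quick way to check a profile for the keywords mentioned above is sketched below. It assumes the profile is a JSON-based BIOM file whose observations are listed under a `rows` key (as in the BIOM 1.0 layout); the function name `check_profile` and the example entries are made up for illustration, and mini.biom remains the reference for what CAMISIM actually expects.

```python
import json


def check_profile(profile):
    """Return a list of problems found in a BIOM-style dict (empty if none)."""
    problems = []
    if "data" not in profile:
        problems.append("missing 'data' (abundances)")
    for index, row in enumerate(profile.get("rows", [])):
        # metadata may be absent or null in BIOM files
        metadata = row.get("metadata") or {}
        if "taxonomy" not in metadata:
            problems.append(f"row {index} ({row.get('id')}) has no metadata/taxonomy")
    return problems


# Hypothetical in-memory example; for a real file use json.load(open(path)).
example = {
    "rows": [
        {"id": "OTU_1", "metadata": {"taxonomy": ["k__Bacteria", "g__Escherichia"]}},
        {"id": "OTU_2", "metadata": None},
    ],
    "data": [[0, 0, 12.0], [1, 0, 3.0]],
}
print(check_profile(example))
```

An empty list means the keywords are present; it does not guarantee that the scientific names will map to your genomes.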

AlphaSquad avatar Dec 12 '23 13:12 AlphaSquad

I do not have the NCBI ID for all my custom genomes. Is there another way to make the profile-based design work, or is the de novo design (which I am able to run) the best option?

CassandraHjo avatar Jan 04 '24 08:01 CassandraHjo

Since you do not use CAMISIM's option to download genomes, the de novo design might actually be best (and more accurate). To use the abundances from the input profile, you would need to provide them for your genomes via the distribution_file_paths option, as tab-separated files with the genome ID and the abundance from the BIOM file. Note that for the de novo design to work you will still need to provide NCBI taxonomy IDs, but if you do not plan on using the taxonomic profile gold standard, any valid NCBI ID should work (e.g. 2 for Bacteria).

AlphaSquad avatar Jan 04 '24 10:01 AlphaSquad

Do I need to change the phase in the config file if I am using the distribution_file_paths option?

CassandraHjo avatar Jan 08 '24 08:01 CassandraHjo

No, you should not need to change the phase; CAMISIM will automatically use the files if they are provided. Note that for multiple samples these need to be absolute paths, comma-separated without whitespace: distribution_file_paths=/path/to/sample1.tsv,/path/to/sample2.tsv

AlphaSquad avatar Jan 08 '24 09:01 AlphaSquad

Should the tsv files include headers?

CassandraHjo avatar Jan 08 '24 10:01 CassandraHjo

No, these do not need a header, just genome_ID and abundance, tab-separated.
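As a rough sketch, the per-sample abundance files and the matching config line could be produced like this. The function name `write_distribution_files`, the output directory, the sample names and the abundance values are all hypothetical, and the genome IDs are assumed to match the IDs used elsewhere in your CAMISIM config.

```python
from pathlib import Path


def write_distribution_files(samples, out_dir):
    """Write one header-less TSV per sample and return the config line.

    samples: dict mapping sample name -> {genome_id: abundance}.
    """
    out_dir = Path(out_dir).resolve()  # absolute paths, as required for multiple samples
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for sample_name, abundances in samples.items():
        path = out_dir / f"{sample_name}.tsv"
        # One line per genome: genome_ID <TAB> abundance, no header
        lines = [f"{genome_id}\t{abundance}" for genome_id, abundance in abundances.items()]
        path.write_text("\n".join(lines) + "\n")
        paths.append(str(path))
    # Comma-separated, no whitespace between the paths
    return "distribution_file_paths=" + ",".join(paths)


# Hypothetical abundances taken from a BIOM profile for two samples.
samples = {
    "sample1": {"MAG0001": 0.7, "MAG0002": 0.3},
    "sample2": {"MAG0001": 0.2, "MAG0002": 0.8},
}
config_line = write_distribution_files(samples, "distributions")
print(config_line)
```

The printed line can be pasted into the config file in place of the existing distribution_file_paths entry.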

AlphaSquad avatar Jan 08 '24 10:01 AlphaSquad