CAMISIM icon indicating copy to clipboard operation
CAMISIM copied to clipboard

Cover viral genomes

Open emiliomastriani opened this issue 2 years ago • 10 comments

Hello, I am planning to create my microbial-genomes population composed by viruses, bacteria and ribosomes to test our metagenomic pipeline. To achieve this objective, we are going to use a downloaded list of genomes and create the population moving from them (de-novo community design).

But we some opened question to address before starting our test:

  1. When we generate simulated data, will we have reads that cover all the viral genomes or will it be random?
  2. I mean, can we set it up so that we want each viral genome to have enough reads that correspond to the whole genome sequence? I don't know if that's clear.

Thank you Emilio

emiliomastriani avatar Jun 20 '22 05:06 emiliomastriani

Hi Emilio, that is definitely possible, but it might require some manual work beforehand. As far as I understood you want to use CAMISIM's option to generate genome sequences in a log-normal distribution and not set the abundances manually. This would cause random abundances assigned to genomes and it would be random whether there are enough sequences per virus. But CAMISIM actually has an option which was designed exactly for this problem: If you have a look at the default config, you see that on the bottom (commented out) there is a so-called "second community" with a "ratio" parameter. You would need to set the num_communities parameter to 2 and add the required options for the second community and a relatively high "ratio" parameter. The exact value depends on the number of viruses versus the number of bacteria and the total size of the data set. Let me know if that works or I can help further - there are, though a little more complex, ways to ensure that all viruses are covered at least 1x.

AlphaSquad avatar Jun 20 '22 07:06 AlphaSquad

Hello Adrian, thank you very much for your fast reply. I will spend some words to better describe my problem, then I am sure you will suggest me the best solution using CAMISIM. I have already downloaded a collection of 478 genomes: 63 (Viruses), 34 (Bacteria), 181 (SILVA ssr138), 200 (SILVA slr138) [total—> 478 genomes]. We looked for the exact composition (viral/bacterial) of the population to be more similar as possible to the real one (from the lab). My idea is to use the CAMISIM to produce synthetic data from those genomes to test your metagenomic pipeline will correctly classify the reads and guess the composition of the synthetic sample. We are concerned to have enough reads in the synthetic data to guarantee the genomic coverage. I hope it’s clear. Thank you, Emilio

emiliomastriani avatar Jun 20 '22 08:06 emiliomastriani

Please, can you give me some practical suggestion ? Thank you

emiliomastriani avatar Jun 20 '22 14:06 emiliomastriani

Hi, with this additional information we are getting closer, there still might be some variables though ;) If you have the exact composition (but not a BIOM profile), then I suggest to use the distribution_file_paths option, where you can provide files which contain this composition. Then you can set the number of bases of your data set, such that the lowest abundant virus gets enough reads for the whole genome to be covered. For example, if your lowest abundant virus is 0.1% of the data set and is 10,000 bases long, then your dataset should have at least 10 million bases (0.01 Gigabases). Since that does not absolutely guarantee that there will be enough reads, it is probably safest to add enough of a "buffer" in bases.

AlphaSquad avatar Jun 20 '22 15:06 AlphaSquad

Thank you Adrian for your explanation. It's clear, even if a little bit complicated. For example, what do you exactly mean by "enough buffer in bases"? Is there available a step-by-step guide illustrating this use of case? Do you think using the BIOM profile could give me some advantages (easiness of usage, guaranteed genome coverage, etc...)? Thank you very much for your kind support. Emilio

emiliomastriani avatar Jun 21 '22 01:06 emiliomastriani

Hello Adrian, I moved forward with my project and now I am facing with the following error:

emilio@frodo:~/CAMISIM$ python metagenomesimulation.py /home/emilio/KatrinaVirBactList/virus_bact_rib_mini_config.ini -p 16 -debug 2022-06-23 16:03:28 INFO: [MetagenomeSimulationPipeline] Metagenome simulation starting 2022-06-23 16:03:28 INFO: [MetagenomeSimulationPipeline] Validating Genomes 2022-06-23 16:03:28 INFO: [MetadataReader] Reading file: '/home/emilio/KatrinaVirBactList/Vir_Bact_Ribgenome_to_id.tsv' 2022-06-23 16:03:58 INFO: [MetadataReader 61355713354] Reading file: '/home/emilio/KatrinaVirBactList/virus_bacteria_ribo_metadata.tsv' 2022-06-23 16:03:58 INFO: [MetadataReader 31395689975] Reading file: '/home/emilio/CAMISIM/vir_bact_rib_out/internal/genome_locations.tsv' 2022-06-23 16:04:00 INFO: [NcbiTaxonomy] Building taxonomy tree... 2022-06-23 16:04:00 INFO: [NcbiTaxonomy] Reading 'nodes' file: '/tmp/tmpkd2w83zr/NCBI/nodes.dmp' 2022-06-23 16:04:12 INFO: [NcbiTaxonomy] Reading 'names' file: '/tmp/tmpkd2w83zr/NCBI/names.dmp' 2022-06-23 16:04:14 INFO: [NcbiTaxonomy] Reading 'merged' file: '/tmp/tmpkd2w83zr/NCBI/merged.dmp' 2022-06-23 16:04:14 INFO: [NcbiTaxonomy] Done (14s) 2022-06-23 16:04:14 DEBUG: [MetagenomeSimulationPipeline] Traceback (most recent call last): File "metagenomesimulation.py", line 80, in run_pipeline self.write_profile_gold_standard(meta_data_table, list_of_file_paths_distributions) File "metagenomesimulation.py", line 199, in write_profile_gold_standard taxonomic_profile.write_taxonomic_profile_from_abundance_files( File "/home/emilio/CAMISIM/scripts/ComunityDesign/taxonomicprofile.py", line 54, in write_taxonomic_profile_from_abundance_files self.write_taxonomic_profile( File "/home/emilio/CAMISIM/scripts/ComunityDesign/taxonomicprofile.py", line 89, in write_taxonomic_profile self._stream_taxonomic_profile(stream_output, genome_abundance, metadata_table, sample_id) File "/home/emilio/CAMISIM/scripts/ComunityDesign/taxonomicprofile.py", line 117, in _stream_taxonomic_profile genome_id_to_lineage = self._get_genome_id_to_lineage( File "/home/emilio/CAMISIM/scripts/ComunityDesign/taxonomicprofile.py", line 152, in _get_genome_id_to_lineage tax_id = genome_id_to_taxid[genome_id] KeyError: '1'

2022-06-23 16:04:14 ERROR: [MetagenomeSimulationPipeline] '1' in line 80 2022-06-23 16:04:14 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted 2022-06-23 16:04:14 INFO: [MetagenomeSimulationPipeline] Temporary data stored at: /tmp/tmp9tjzww9v

Please, can you help me to identify the problem? Find attached the configuration files I used Thank you very much. Emilio

virus_bacteria_ribo_metadata.txt virus_bact_rib_mini_config.txt Vir_Bact_Ribgenome_to_id.txt abundances01.txt

emiliomastriani avatar Jun 23 '22 08:06 emiliomastriani

Hi Emilio, with "buffer" I meant that if you calculate some value for the number of bases, it is probably best to add something like 50% more, just to be sure. But the best way is probably what you are doing right now, i.e. simulating a dataset and then test its properties. I checked your files and while they generally look fine, there are a few genomes which do not have a NCBI ID (e.g. GCVR01033941 394 or GFKJ01058510 395) which CAMISIM cannot handle. If you do not have a NCBI ID and don't care about the profiling gold standard, you can just use any valid NCBI ID, e.g. 1 (root), 12908 (unclassified sequences) or 28384 (other sequences).

Additionally, you used the novelty_category column for storing the lineages. It is possible that this does not cause problems since in principle CAMISIM accepts any string there, but the "best practice" here would be to add known_strain as novelty category since all of the sequences are database sequences (?)

Finally, if you restart CAMISIM, make sure that the output folder is empty.

AlphaSquad avatar Jun 23 '22 08:06 AlphaSquad

Thank you Adrian. I solved most of the problems thanks to your support. Anyway, I encountered this error on the last running:

emilio@frodo:~/CAMISIM$ python metagenomesimulation.py /home/emilio/KatrinaVirBactList/virus_bact_rib_mini_config_default.ini -p 16 -debug 2022-06-24 08:49:50 INFO: [MetagenomeSimulationPipeline] Metagenome simulation starting 2022-06-24 08:49:50 INFO: [MetagenomeSimulationPipeline] Validating Genomes 2022-06-24 08:49:50 INFO: [MetadataReader] Reading file: '/home/emilio/KatrinaVirBactList/Vir_Bact_Ribgenome_to_id.tsv' 2022-06-24 08:50:20 INFO: [MetagenomeSimulationPipeline] Design Communities 2022-06-24 08:50:20 INFO: [CommunityDesign] Drawing strains. 2022-06-24 08:50:20 INFO: [MetadataReader 31395689975] Reading file: '/home/emilio/KatrinaVirBactList/virus_bacteria_ribo_metadata.tsv' 2022-06-24 08:50:20 INFO: [MetadataReader 57864289696] Reading file: '/home/emilio/KatrinaVirBactList/Vir_Bact_Ribgenome_to_id.tsv' 2022-06-24 08:50:20 INFO: [CommunityDesign] Validating raw sequence files! 2022-06-24 08:50:52 INFO: [NcbiTaxonomy] Building taxonomy tree... 2022-06-24 08:50:52 INFO: [NcbiTaxonomy] Reading 'nodes' file: '/tmp/tmpmujfkb4x/NCBI/nodes.dmp' 2022-06-24 08:51:04 INFO: [NcbiTaxonomy] Reading 'names' file: '/tmp/tmpmujfkb4x/NCBI/names.dmp' 2022-06-24 08:51:06 INFO: [NcbiTaxonomy] Reading 'merged' file: '/tmp/tmpmujfkb4x/NCBI/merged.dmp' 2022-06-24 08:51:06 INFO: [NcbiTaxonomy] Done (14s) 2022-06-24 08:51:06 ERROR: [NcbiTaxonomy] Invalid taxid: '1803956' 2022-06-24 08:51:06 DEBUG: [MetagenomeSimulationPipeline] Traceback (most recent call last): File "metagenomesimulation.py", line 83, in run_pipeline genome_id_to_path_map, list_of_file_paths_distributions = self._design_community() File "metagenomesimulation.py", line 263, in _design_community self.write_profile_gold_standard(meta_data_table, list_of_file_paths_distribution) File "metagenomesimulation.py", line 199, in write_profile_gold_standard taxonomic_profile.write_taxonomic_profile_from_abundance_files( File "/home/emilio/CAMISIM/scripts/ComunityDesign/taxonomicprofile.py", line 54, in write_taxonomic_profile_from_abundance_files self.write_taxonomic_profile( File "/home/emilio/CAMISIM/scripts/ComunityDesign/taxonomicprofile.py", line 89, in write_taxonomic_profile self._stream_taxonomic_profile(stream_output, genome_abundance, metadata_table, sample_id) File "/home/emilio/CAMISIM/scripts/ComunityDesign/taxonomicprofile.py", line 117, in _stream_taxonomic_profile genome_id_to_lineage = self._get_genome_id_to_lineage( File "/home/emilio/CAMISIM/scripts/ComunityDesign/taxonomicprofile.py", line 155, in _get_genome_id_to_lineage tax_id = self._taxonomy.get_updated_taxid(tax_id) File "/home/emilio/CAMISIM/scripts/NcbiTaxonomy/ncbitaxonomy.py", line 146, in get_updated_taxid raise ValueError("Invalid taxid") ValueError: Invalid taxid

2022-06-24 08:51:06 ERROR: [MetagenomeSimulationPipeline] Invalid taxid in line 83 2022-06-24 08:51:06 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted 2022-06-24 08:51:06 INFO: [MetagenomeSimulationPipeline] Temporary data stored at: /tmp/tmpf_of5x2i

From the metafile: emilio@frodo:~/KatrinaVirBactList$ grep 1803956 virus_bacteria_ribo_metadata.tsv NC_001897.1 55 1803956 Human parechovirus

From the names.dmp: 1803956 | Human parechovirus | Human parechovirus | synonym |

I don't understand the reason of the error. Please, let me know how to solve it. Emilio

emiliomastriani avatar Jun 24 '22 00:06 emiliomastriani

Please Adrian, do you have more suggestions to solve my last problem? Thank you Emilio

emiliomastriani avatar Jun 25 '22 07:06 emiliomastriani

The entry from the names.dmp is the one CAMISIM stored in the temporary files (i.e. /tmp/tmpmujfkb4x/NCBI/names.dmp)? Unfortunately, internally CAMISIM sometimes falls back to an older taxonomy version which might not have this virus yet. The best suggestion I can give here is to either try a higher rank for this virus (e.g. 138954 for Parechovirus). But if that is missing from the older taxonomy it might still cause problems. Again, if you do not need the profiling gold standard (since you provide the composition anyway) then the taxonomy ID is not that important and you can just substitute it with one that is already present in older taxonomies, e.g. 10239 for Virus should be safe. I am sorry for the inconvenience here, we are hoping to fix these kind of problems in our updated CAMISIM version, but that might still take a while.

AlphaSquad avatar Jun 27 '22 07:06 AlphaSquad