CAMISIM
CAMISIM copied to clipboard
Guidance on abundance.tsv file and output
Dear authors,
I remember asking this question before, but when looking back, I still am having some last doubts I was hoping you could help me with.
This is a summary of organism, abundance.tsv, and genome size Salmonella bongori | 0.10 | 4487548 Salmonella enterica | 0.4 | 5028552 escherichia coli | 0.5 | 4643559
When I use CAMISIM to simulate my sample, I use the readmappings.tsv output file with grep -c to check my abundance. In this output file, I have exactly 10% of reads belonging o S bongori, 40% to S enterica, and 50 to e.coli
Is this expected behaviour? I read that the simulation is supposed to take genome length into consideration, so I expected that the amount of reads belonging to S.enterica would be inflated, since the genome is bigger. Why is this not the case?
Hm, that is strange, are you sure that these values fit exactly? The difference between S. bongori and S. enterica in genome size is only ~10%, so instead of 10% and 40% I would expect CAMISIM to simulated something like 9% and 41% of reads for both: If the coverage is ratio 4:1 between the two genomes and the genome size ratio is 1.1:1, then we expect a read ratio of 4.4:1 in reads between the two - which translates to ~40.75%:9.25% (here neglecting the third genome which is not exactly halfway between the two Salmonella)
Hi and thank you for your quick reply. Your calculations are also what I expected to happen.
When I use grep -c on my readmappings.tsv file, I find this file has a total of 1333338 lines, with the counts of lines and percentage: bongori | 133334 | 0.10000015 enterica | 666666 | 0.4000021 e.coli | 533338 | 0.49999775
So while not exact, my readmappings file does fit closer to the abundance I put in the abundance.tsv file, then what I would expact from making these caclulations myself. Is there a better way to check the fraction of reads simulated besides grep -c on the readmappings.tsv?
I think grep -c
on the readmappings file should be fine. I tried the same on a few of my datasets and the results were as I expected it (i.e. the larger genomes having far more reads than the smaller ones, even with lower abundance).
Which read simulator did you use?
wgsim, so error free simulation. Could this cause the behaviour?
I will have to look into it, but I have the strong suspicion that this is a bug in CAMISIM when using wgsim
. Could you run the same data set using ART and look if the distribution of reads is as you (and I :wink: ) would expect?
sure!
Hi again,
I ran the simulation again with:
readsim=tools/art_illumina-2.3.6/art_illumina error_profiles=tools/art_illumina-2.3.6/profiles samtools=tools/samtools-1.3/samtools profile=mbarc size=0.1 type=art fragments_size_mean=270 fragment_size_standard_deviation=27
This resulted in:
grep -c % | grep -c count | expected | |
---|---|---|---|
S bongori | 0.094 | 62884 | ~0.09 |
E coli | 0.490 | 326554 | ~0.49 |
S enterica | 0.416 | 277222 | ~0.42 |
666660 |
The expected row is what I (and you :) )expected my distribution to look like. It is a very good match when using art! For error free simulation, I think I can still use my simulated reads (my only goal is to use them together with kraken/bracken, and check if I can find the abundance simulated with bracken), since the simulated numbers are what I expected.
Okay, that is good to hear. It is still a bug/inconsistency for wgsim
(and likely for Nanosim
as well) which I will fix.
If you can use your data regardless that is great.