emg-viral-pipeline
Fix sankey plot visualization for undefined ranks
The Sankey plot visualization breaks down when taxonomic ranks are missing. This can be solved to some extent by introducing "unclassified" ranks based on the parent rank. However, this does not currently work properly at all levels. For example, the assembly.fasta test file
https://github.com/EBI-Metagenomics/emg-viral-pipeline/blob/master/nextflow/test/assembly.fasta
produces this Sankey:
whereas the correct one is:
As illustrated, the problem is that the subfamily Guernseyvirinae has neither a family nor an order rank, only the class _Caudoviricetes_. The current script introduces Unclassified Caudoviricetes to fill the order rank, but the family rank is then still missing and the arrangement is wrong (see first figure).
I think we can fix that by

a) introducing multiple "unclassified" (or better: "undefined"!) ranks
b) adding the rank level to the label (because we need unique labels)
For example, for Jerseyvirus we would then have in the Sankey:
Caudoviricetes --> Undefined Caudoviricetes (Order) --> Undefined Caudoviricetes (Family) --> Guernseyvirinae --> Jerseyvirus
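
The proposal could be sketched roughly as follows. This is a minimal illustration only, not the pipeline's actual script: the function name `fill_undefined_ranks`, the `RANKS` list, and the dict-based lineage format are all assumptions made for this example. Each missing rank gets a placeholder built from the nearest defined ancestor plus the rank name, so labels stay unique across levels.

```python
# Hypothetical rank order for this sketch; the real pipeline may use more ranks.
RANKS = ["class", "order", "family", "subfamily", "genus"]


def fill_undefined_ranks(lineage, ranks=RANKS):
    """Return one label per rank, filling gaps with 'Undefined <parent> (<Rank>)'.

    `lineage` maps rank names to taxon names; missing or empty ranks are filled.
    """
    filled = []
    last_defined = None  # nearest defined ancestor seen so far
    for rank in ranks:
        name = lineage.get(rank)
        if name:
            last_defined = name
            filled.append(name)
        elif last_defined is not None:
            # Rank level in the label keeps consecutive placeholders unique.
            filled.append(f"Undefined {last_defined} ({rank.capitalize()})")
        else:
            # No ancestor known yet (assumption: label by rank only).
            filled.append(f"Undefined ({rank.capitalize()})")
    return filled


jerseyvirus = {
    "class": "Caudoviricetes",
    "subfamily": "Guernseyvirinae",
    "genus": "Jerseyvirus",
}
print(" --> ".join(fill_undefined_ranks(jerseyvirus)))
```

Note that this sketch also fills missing ranks below the last defined taxon; whether that is desirable for the Sankey is a separate decision.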