
Fix sankey plot visualization for undefined ranks

Open hoelzer opened this issue 1 year ago • 0 comments

The Sankey plot visualization has difficulties when taxonomic ranks are missing. This can be solved to some extent by introducing "unclassified" ranks based on the parent rank. However, this is currently not working properly for all levels. For example, using the assembly.fasta test file:

https://github.com/EBI-Metagenomics/emg-viral-pipeline/blob/master/nextflow/test/assembly.fasta

produces this Sankey plot:

Screenshot 2023-07-14 at 16 59 21

but the correct plot is:

Screenshot 2023-07-14 at 17 01 56

As illustrated, the problem is that the subfamily _Guernseyvirinae_ has neither a family nor an order rank, only a class, _Caudoviricetes_. The current script introduces "Unclassified Caudoviricetes" to fill the order rank, but the family rank is still missing, so the arrangement is wrong (see first figure).

I think we can fix that by

a) introducing multiple "unclassified" (or better: "undefined"!) ranks
b) adding the rank level to the label (because we need unique labels)

For example, for Jerseyvirus we would then have in the Sankey:

Caudoviricetes --> Undefined Caudoviricetes (Order) --> Undefined Caudoviricetes (Family) --> Guernseyvirinae --> Jerseyvirus
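A minimal Python sketch of this idea, not the pipeline's actual script. The rank list `RANKS`, the function name `fill_undefined_ranks`, and the dict-based lineage format are all assumptions made for illustration. It also assumes padding should only happen between the first and the deepest defined rank.

```python
# Hypothetical rank order, highest to lowest; the real pipeline may use more ranks.
RANKS = ["class", "order", "family", "subfamily", "genus"]


def fill_undefined_ranks(lineage, ranks=RANKS):
    """Return Sankey node labels with gaps filled by 'Undefined <parent> (<Rank>)'.

    `lineage` maps rank name -> taxon name; missing or empty entries are gaps.
    Gaps are only filled between the first and the deepest defined rank.
    """
    defined = [i for i, rank in enumerate(ranks) if lineage.get(rank)]
    if not defined:
        return []

    labels = []
    parent = None
    for rank in ranks[defined[0]:defined[-1] + 1]:
        name = lineage.get(rank)
        if name:
            parent = name
            labels.append(name)
        else:
            # The rank goes into the label so that every placeholder is unique.
            labels.append(f"Undefined {parent} ({rank.capitalize()})")
    return labels


# The Jerseyvirus case from this issue: order and family are missing.
jerseyvirus = {
    "class": "Caudoviricetes",
    "subfamily": "Guernseyvirinae",
    "genus": "Jerseyvirus",
}
print(" --> ".join(fill_undefined_ranks(jerseyvirus)))
```

Because the rank name is part of each placeholder label, two consecutive gaps under the same parent stay distinct nodes, which is what keeps every taxon in its correct column.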

hoelzer, Jul 14 '23 15:07