sourmash
sourmash compare matrix plot matplotlib labels too large/overlapping
Running sourmash plot --pdf --labels example.npy with ~200 signatures gives plots where the labels are too large and therefore overlap.
Looking at https://github.com/sourmash-bio/sourmash/blob/latest/src/sourmash/fig.py it does not appear to alter the matplotlib default font sizes, but resources like https://stackoverflow.com/questions/3899980/how-to-change-the-font-size-on-a-matplotlib-plot suggest we might reduce the font size and/or increase the image size for larger datasets.
Is this a bug, or would your recommendation be to follow https://sourmash.readthedocs.io/en/latest/plotting-compare.html#Customizing-plots and customise the plot by writing a modified version of the sourmash/fig.py code?
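To make the font-size/image-size idea concrete, here is a minimal sketch of scaling both with the number of labels. It is not what `sourmash plot` does; `label_layout` and `plot_matrix` are hypothetical helper names, and the constants (0.15 inches per label, 3 to 10 point fonts) are arbitrary starting values to tune.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend: write files without needing a display
import matplotlib.pyplot as plt
import numpy as np


def label_layout(n_labels, inches_per_label=0.15, min_side=8.0,
                 min_font=3.0, max_font=10.0):
    """Hypothetical helper: pick a figure side (inches) and tick font size (points)."""
    side = max(min_side, n_labels * inches_per_label)
    # Each label gets roughly side / n_labels inches; 72 points per inch,
    # and use about 70% of the slot so neighbouring labels do not touch.
    font = 72.0 * side / n_labels * 0.7
    return side, min(max_font, max(min_font, font))


def plot_matrix(matrix, labels, out):
    """Draw a labelled heatmap of a square matrix; `out` is a path or file-like object."""
    side, font = label_layout(len(labels))
    fig, ax = plt.subplots(figsize=(side, side))
    image = ax.imshow(matrix, vmin=0.0, vmax=1.0, aspect="equal")
    ticks = np.arange(len(labels))
    ax.set_xticks(ticks)
    ax.set_xticklabels(labels, rotation=90, fontsize=font)
    ax.set_yticks(ticks)
    ax.set_yticklabels(labels, fontsize=font)
    fig.colorbar(image, ax=ax, fraction=0.046)
    fig.savefig(out, bbox_inches="tight")
    plt.close(fig)
```

With ~200 signatures this gives a figure about 30 inches on a side with labels around 7.5 points, which stays readable when zooming into a PDF; passing a filename ending in `.pdf` as `out` writes a PDF.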
sourmash plot could certainly use some love! It was one of the first things we implemented ~6 years ago, and (FBFW) has driven a lot of our citations... but we haven't upgraded it, ever. This was due to some combination of:
- I'm not a plot-focused person, and my general feeling has been that we should provide the raw data in convenient formats to support other people doing custom things with it.
- the most plot focused people on the sourmash team tend to be R programmers ;)
- the slow but progressive addition of functionality that supported many more sketches, more types of comparisons, and much better naming/renaming of sketches.
This is all me saying that it's never risen to the level of "gotta fix" but has definitely risen to the level of "hmmmm yeah we should really be doing something about that."
A few related thoughts and issues -
the R package, sourmashconsumr
sourmashconsumr https://github.com/sourmash-bio/sourmash/issues/2492 is an R package that has some nice viz.
sourmash plot isn't doing the right thing, I think
per https://github.com/sourmash-bio/sourmash/issues/2406, I appear to have mixed up my similarity and distance matrices.
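For reference, a minimal sketch of the usual convention when clustering a similarity matrix with SciPy: `linkage` expects distances, so the matrix is converted with `1 - similarity` first. This is not taken from `fig.py` or from the linked issue; `cluster_order` is a hypothetical name, and the sketch assumes a symmetric matrix with values in [0, 1].

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform


def cluster_order(similarity, method="single"):
    """Return the dendrogram leaf order for a symmetric similarity matrix."""
    # Convert similarity (1 = identical) into distance (0 = identical).
    distance = 1.0 - np.asarray(similarity, dtype=float)
    np.fill_diagonal(distance, 0.0)  # remove rounding noise on the diagonal
    # linkage wants the condensed (upper-triangle) form; squareform also
    # raises an error if the matrix is not symmetric.
    condensed = squareform(distance)
    return leaves_list(linkage(condensed, method=method))
```

Passing the similarity matrix straight to `linkage` would treat the most similar pairs as the most distant ones, which is the kind of mix-up described above.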
better label handling, plot annotation, etc
per https://github.com/sourmash-bio/sourmash/issues/2452, there are some good opportunities to make editing label names better (since I intuit that is a lot of what people want to do)
per https://github.com/sourmash-bio/sourmash/issues/2583 there are lots of opportunities to annotate dendrograms with more information
plugins are now a thing
per https://github.com/sourmash-bio/sourmash/issues/1353 and https://github.com/sourmash-bio/sourmash/pull/2438 in particular it would now be straightforward to experiment with other clustering and viz techniques all from within the relative safety of the sourmash command line.
this would permit the addition of dependencies that we don't want to add to core sourmash (for size and/or platform/install and/or support reasons) to support better output viz.
this is all to say... we just need someone who cares, or at least pointers to some good plots from other packages that we can steal ;). I know this is an active area, I just don't have a starting point!
That all makes sense. One-size-fits-all visualisation defaults are not easy.
additional thoughts -
- can easily make binders with R and Python scripts/notebooks that show loading & viz code and permit further customization
- also at the very least we can provide loading code that shows how this ties into viz examples
- might make sense to create examples/good default viz for ~10 genomes, ~100 genomes, and ~1000 genomes
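As a starting point for the loading code mentioned above, here is a minimal sketch in Python. It assumes the matrix is a NumPy binary file and that the labels sit next to it in a text file named `<matrix>.labels.txt`, one label per line; `load_compare` is a hypothetical name, so check the file naming against your own `sourmash compare` output.

```python
import numpy as np


def load_compare(matrix_path):
    """Load a square comparison matrix and its row/column labels.

    Assumes labels live in `matrix_path + ".labels.txt"`, one per line.
    """
    matrix = np.load(matrix_path)
    with open(matrix_path + ".labels.txt") as fp:
        labels = [line.strip() for line in fp if line.strip()]
    if matrix.shape != (len(labels), len(labels)):
        raise ValueError(
            f"matrix shape {matrix.shape} does not match {len(labels)} labels"
        )
    return matrix, labels
```

From there, `matrix` and `labels` can be handed to matplotlib, seaborn, or pandas, or written out as CSV for use from R.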
more from slack:
Christopher Gulvik: Fig 1c minimum spanning tree style in GrapeTree rocks, by [@jcarrico] and [@happykhan]. I've grown to appreciate it more and more for a broader audience than hierclust or phytrees to show outbreak or cluster data (SNPs, ANI, or cgMLST). The software that currently makes that style here reaches end of life this year.
The betterplot plugin would be a good place to add custom plotting code for very large plots.