pyani
Missing labels and captions in plots with default settings
The following workflow produced plots without labels or class colors (edited for brevity):
pyani index pyani_sample_genomes
pyani anim pyani_sample_genomes pyani_sample_genomes_anim
pyani plot --formats png,pdf pyani_sample_genomes_anim 1
The solution is to explicitly set labels and classes when calling pyani anim:
pyani index pyani_sample_genomes
pyani anim pyani_sample_genomes pyani_sample_genomes_anim \
--labels pyani_sample_genomes/labels.txt --classes pyani_sample_genomes/classes.txt
pyani plot --formats png,pdf pyani_sample_genomes_anim 1
Would it make sense to have --labels and --classes default to $DIR/labels.txt and $DIR/classes.txt, if present, when run on input directory $DIR?
If no labels are given, would it make sense to use the filename stems as the default labels?
(I'm also puzzled why the classes and labels are tied to the run; I expected pyani index $DIR to record them from $DIR/labels.txt and $DIR/classes.txt)
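The two defaults proposed above could be sketched roughly as follows. This is only an illustration, not pyani's actual code: the helper names `default_metadata_file` and `default_labels`, and the set of FASTA suffixes, are assumptions made for the example.

```python
from pathlib import Path
from typing import Dict, Optional

# Hypothetical helpers for illustration; none of this is pyani's real API.
# Assumed set of genome file extensions.
FASTA_SUFFIXES = {".fna", ".fasta", ".fa", ".fas"}


def default_metadata_file(indir: Path, name: str) -> Optional[Path]:
    """Return indir/name (e.g. labels.txt) if that file exists, else None."""
    candidate = indir / name
    return candidate if candidate.is_file() else None


def default_labels(indir: Path) -> Dict[str, str]:
    """Fallback labels: map each genome file's stem to itself."""
    return {
        path.stem: path.stem
        for path in sorted(indir.iterdir())
        if path.suffix in FASTA_SUFFIXES
    }
```

With something like this, a command run on $DIR without --labels could first try `default_metadata_file(indir, "labels.txt")` and, if that returns None, fall back to `default_labels(indir)`.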
That's expected behaviour, at the moment… v0.3.0a is in active development and should be considered unfinished.
Default behaviour might eventually turn out to use the hash, or the filestem, for labels (where not provided). My intention is that classes will be ignored if not specified.
Tying classes and labels to the run allows the user to rerun the same analysis with different labels and classes, e.g. for generating plots. It may be helpful to provide a way to override the database classes/labels specifically for an analysis, but at the moment, this is intended behaviour.
I agree that the hash would be another practical default for labels when not provided.
I don't yet understand why you would tie the class and label metadata to the comparison computation stage. I wouldn't want to recompute the comparisons (even with recovery mode) just to re-plot with different classes (e.g. sample sites, or sample year) or different labels (e.g. sampling method).
Can we supply the classes and labels to the plotting (and report?) commands?
The database stores the results from previous comparisons, so you don't need to recompute them.
Considering a use-case:
- I have a set of genomes I don't know how to classify
- I use ANI to generate a likely classification (labels/classes are arbitrary at this point)
- I plot the results of the analysis, and this tells me what the classes should be (and suggests labels, e.g. new species divisions)
Now I want to plot the results again, but with my new classes/labels. There are two obvious options:
- replot, but use a new classes/labels file specific to that plot
- "reanalyse" but with a new classes/labels file (this is computationally almost cost-free, as the results are stored)
Both will give the same file output. However, if you only redo the plot step, using a new class/label file, this gives an output that isn't consistent with the database (though you could make notes/log the changed labels/classes).
To have the database be "reproducible" such that plotting/writing tables for a particular run gives the same outputs each time with only the database as input, we'd need to capture the labels/classes files used for the plot, and remember that it's a combination of ANI run and plot command (and we could have arbitrarily many plots for a single run) that defines the output.
One goal is to have an interface onto the local database (Flask, or whatever is useful at the time), so that interactive plots can be produced, as well as those which are written statically to a file. These will get their information from the database. It makes sense in that context to have a "run" defined as the genomes + corresponding labels/classes. Changing labels/classes (keeping the same genomes) corresponds to another "run", in the same way that removing genomes, but keeping labels/classes, corresponds to another "run". When no new calculations are required, this is a straightforward database update in both cases.
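As a rough sketch of that "run" definition (not pyani's actual schema): the `Run` and `ComparisonStore` names below are hypothetical, and an in-memory dict stands in for the database. The point it illustrates is that a second run with the same genomes but different labels/classes is a distinct run, yet triggers no new pairwise comparisons.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Tuple

Pair = Tuple[str, str]


@dataclass
class Run:
    """Hypothetical model: a run is the genomes plus their labels/classes."""
    genomes: FrozenSet[str]
    labels: Dict[str, str] = field(default_factory=dict)
    classes: Dict[str, str] = field(default_factory=dict)


class ComparisonStore:
    """Stand-in for the database: caches one result per genome pair."""

    def __init__(self, compare: Callable[[str, str], float]) -> None:
        self._compare = compare
        self._results: Dict[Pair, float] = {}
        self.computed = 0  # number of comparisons actually calculated

    def results_for(self, run: Run) -> Dict[Pair, float]:
        """Return results for every pair in the run, computing only new pairs."""
        out: Dict[Pair, float] = {}
        for pair in combinations(sorted(run.genomes), 2):
            if pair not in self._results:
                self._results[pair] = self._compare(*pair)
                self.computed += 1
            out[pair] = self._results[pair]
        return out
```

Under this model, relabelling or dropping genomes creates a new `Run` object, while `ComparisonStore.computed` stays unchanged because every pair is already stored.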
Now, I do see the utility of providing a classes/labels file at the pyani plot step, but it breaks that definition of "run" being "genomes + their labels/classes" that I want to keep for the more advanced interaction with the database. For quick and dirty outputs I see the attraction of having --classes/--labels options in pyani plot. Maybe that's worth implementing - but I'm quite keen on enforcing that "run" definition.
That did clarify your design goals, thank you.
The "quick and dirty" option of --classes/--labels in pyani plot is attractive, especially while "reanalyse" remains somewhat slow (even in --recovery mode).
I should really write this stuff down somewhere ;)
This is another thing that should go into the documentation - the design goals and motivation for the database integration and how that affects the way we need to provide metadata for visualisation.
Which part of the documentation? Design goals and motivation sounds like wiki material; there is already a bit of text in indexing.rst that seems related to the use of class and label files discussed here.
> Which part of the documentation? Design goals and motivation sounds like wiki material;
It does.
> there is already a bit of text in indexing.rst that seems related to the use of class and label files discussed here.
Yes, there is. As ever there may be a judgement call involved to decide what is appropriately user-facing (so goes in ReadTheDocs) and what is "motivation/design detail" (so goes in the Wiki) - and some items may be represented, with different levels of detail perhaps, in both places.