Failure to align should maybe not be logged as [WARN] but as [INFO]?

Open corneliusroemer opened this issue 4 years ago • 4 comments

Here's a user who was confused by the fact that running Nextclade with example sequences outputs a warning:

[WARN] Nextclade: Warning: in sequence "USA/VA-DCLS-3814/2021": When processing gene "ORF6": Unable to align: no seed matches. Note that this gene will not be included in the results of the sequence. 

The user thought this was due to a mistake on their part, or that something was broken, when in fact this is intended behaviour.

Two possible resolutions:

  1. We should not include sequences that don't align in Nextclade test runs, otherwise it may falsely look like there's a problem with Nextclade and/or the installation
  2. We may want to change the logging level from WARN to INFO for this particular issue. WARN sounds like there's a problem with the software, but here the problem is with the data. Maybe the wording of the warning could be improved to make it clear the problem is with the sequence, not with the program.

Relevant excerpt from discussion

So! The bioinformatician of the lab worked with Nextflow and Docker to adapt the ARTIC pipeline for SARS-CoV-2 sequence analysis. I'm trying to keep the same strategy and use the Docker image in order to add Nextclade to this pipeline, but I'm really far from knowing how all of it works.

I first installed Docker, then pulled the image, downloaded the SARS-CoV-2 dataset and tried to run the analysis with the example sequences.

The error output was a succession of messages of this kind: [WARN] Nextclade: Warning: in sequence "USA/VA-DCLS-3814/2021": When processing gene "ORF6": Unable to align: no seed matches. Note that this gene will not be included in the results of the sequence.

I don't know why there is no match between the example dataset and the sequences. Do you know the reason? Anyway, I tried to run the analysis with my own sequences, inside a multi-FASTA file, with the command below:

Originally posted by @valentinelsra in https://github.com/nextstrain/nextclade/discussions/551#discussioncomment-1394858

corneliusroemer, Sep 28 '21 21:09

WARN sounds like there's a problem with the software

Never heard of it. I was mostly modelling this on dev tools, the things I am more familiar with, like compilers and linters. In these, a defect in a source file you feed in produces warnings and errors. If the file can be processed with limited success, it's a warning. If it cannot be processed, it's an error. If there is something wrong with the compiler itself, it typically just crashes, and that's what Nextclade does too. There is not much to warn about in that case.
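A minimal sketch of the compiler-style severity rule described above. This is illustrative only and not Nextclade's actual code: `Severity`, `classify`, and both parameters are hypothetical names introduced for the example.

```python
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    WARNING = "WARN"
    ERROR = "ERROR"


def classify(could_process: bool, defects: List[str]) -> Optional[Severity]:
    """Map the outcome of processing one input to a severity (hypothetical helper).

    - the input could not be processed at all      -> ERROR
    - processed, but with defects (partial result) -> WARNING
    - processed cleanly                            -> nothing to report
    """
    if not could_process:
        return Severity.ERROR
    if defects:
        return Severity.WARNING
    return None


# A sequence that was processed, but with one gene that failed to align,
# falls into the "limited success" case under this rule.
partial = classify(True, ['gene "ORF6": no seed matches'])
print(partial)
```

A fault in the tool itself is deliberately not a severity level in this sketch: under the rule described, the program crashes instead of logging.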

wording of the warning could be improved to make it clear the problem is with the sequence, not with the program.

Perhaps. It says "in sequence" already:

Warning
in sequence <name>
When processing gene <name>
Unable to align
no seed matches

What improvements would you suggest?

We should not include sequences that don't align in Nextclade test runs, otherwise it may falsely look like there's a problem with Nextclade and/or the installation

Not sure about that one. I would not worry about it too much. But I defer the decision to the science department :)

ivan-aksamentov, Sep 29 '21 22:09

Fair point. I just wouldn't apply software log levels in this case.

Warning should be reserved for a real problem. Not for QC issues. Failure to align is basically because of a QC issue. We also don't issue warnings when there's a frame shift, or when there are too many private mutations.

Clearer wording could be:

Unable to align gene <name> in sequence <name> due to no seed matches
Note: This is likely due to a quality problem with the provided sequence.

Maybe just adding a note that this is a problem with the sequence, not the software. And making it INFO rather than WARN would be quick fixes.
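To make the suggested wording concrete, here is a small sketch that fills the template proposed above with a gene name and a sequence name. The function name `format_alignment_failure` is made up for this example, and quoting the names is an assumption carried over from the current message. It does not reflect how Nextclade builds its messages internally.

```python
def format_alignment_failure(seq_name: str, gene_name: str) -> str:
    """Render the proposed two-line message (hypothetical helper)."""
    return (
        f'Unable to align gene "{gene_name}" in sequence "{seq_name}" '
        "due to no seed matches\n"
        "Note: This is likely due to a quality problem with the provided sequence."
    )


# Names taken from the warning quoted at the top of this issue.
message = format_alignment_failure("USA/VA-DCLS-3814/2021", "ORF6")
print(message)
```

Putting the sequence-quality note on its own line keeps the first line short while still saying that the installation is not at fault.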

I would rather treat these failures to align simply as a QC problem. Right now it could be understood as: there's a problem, we can't say anything here because of an issue with Nextclade.

corneliusroemer, Sep 29 '21 23:09

Warning should be reserved for a real problem. Not for QC issues. Failure to align is basically because of a QC issue. We also don't issue warnings when there's a frame shift, or when there are too many private mutations.

In this case it's a real problem:

 Note that this gene will not be included in the results of the sequence.

Historically, we did not report any problems with sequences in the console at all. But because the gene in this case is missing from the results, and in the case of a whole-sequence alignment failure the results for the entire sequence are missing, users asked us to provide warnings because they could not find out what was wrong. So that's what it reports.

My current feeling is that the warning severity is correct here, as I described in the compiler example. Warnings draw attention to faults in the input data, and that is what happens in this case. Specifically, we want to tell users not to look for the proteins and aa mutations for this gene, because they are not there.

However I like the suggestion of adding some more context:

 This is likely due to a quality problem with the provided sequence.

And making it INFO rather than WARN would be quick fixes.

INFO severity is hidden by default and this is what I have seen in most tools.
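As an illustration of that point, the sketch below uses Python's standard `logging` module, whose root logger also defaults to the WARNING threshold. It is not Nextclade's logger, and `run_with_verbosity` and the message texts are made up for the example. The same event logged at INFO disappears at the default threshold, while the WARNING version is printed.

```python
import io
import logging


def run_with_verbosity(level: int = logging.WARNING) -> str:
    """Emit one INFO and one WARNING record; return what would be printed."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("[%(levelname)s] %(message)s"))

    # A dedicated, non-propagating logger keeps the demo isolated
    # from any global logging configuration.
    logger = logging.getLogger(f"severity-demo-{level}")
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False

    logger.info("gene ORF6 could not be aligned (as INFO)")
    logger.warning("gene ORF6 could not be aligned (as WARNING)")
    return stream.getvalue()


default_output = run_with_verbosity()               # WARNING threshold
verbose_output = run_with_verbosity(logging.INFO)   # user asked for more detail
print(default_output, end="")
```

With a threshold like this, downgrading the message to INFO would mean most users never see that the gene is missing from their results unless they raise the verbosity.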

Now, if the warnings confuse beginners, that's a whole other conversation. In that case, perhaps we need to include only sequences that give no warnings.

But again, I am not too concerned, because after 10 minutes with Nextclade, once they have run it on their own data, they can probably figure that out too.

ivan-aksamentov, Sep 29 '21 23:09

Agree with most here. Maybe adding context/explanation would be all that's needed.

The biggest issue, I think, is that when users install Nextclade and run it for the first time with the example data, they are surprised to see a warning. That's not very user-friendly. By default, warnings should not appear if the installation is correct. This is what confuses users.

corneliusroemer, Sep 30 '21 00:09

The messages are improved in https://github.com/nextstrain/nextclade/pull/1099. Our current example sequences for SC2 produce no warnings (clean output), so Nextclade newcomers should hopefully not be confused anymore.

ivan-aksamentov, Jan 27 '23 21:01