htslib icon indicating copy to clipboard operation
htslib copied to clipboard

SAM flag filter specification strings and parsing & filtering API functions

Open jmarshall opened this issue 2 years ago • 13 comments

In https://github.com/samtools/samtools/pull/1442#issuecomment-871244137 I wrote:

(I can never remember which option I want to use and have to look it up each time. What I would really like to have for this is to come up with some [+-=].* syntax something like chmod(1) symbolic permissions so that all these options could be unified into a single -f/--flags option — at which point these legacy options would become somewhat moot.)

It turns out I was thinking of find … -perm [+-]mode syntax probably more than symbolic Unix permissions per se, but the point stands: that perhaps some other notation would solve the ‑f/ ‑‑ff/ ‑‑rf/etc plethora of confusing flag filtering options problem better than trying to come up with multiple more memorable overlapping option names.

Motivated by the renewed attempt to come up with a new set of option names in samtools/bcftools#1611, here is a specific proposal for that:

Flag specification strings are of the form #flag,…,flag,#flag,…,flag,…, i.e., a comma-separated list of flag items (each of which is either a numeric flag bitset, a symbolic flag bitset (/[pPuUrR12sxdS_]+/, as in sam.5 and samtools/htsjdk#232), or a single flag name (eg PAIRED)), each of which may be prefixed by a punctuation character (represented by # in the previous template) that controls which filter the subsequent items contribute to (until the next punctuation character):

Chr To pass filtering and be kept, the read must…
- & …have ALL of the flags prefixed by either - or &
! ^ …have NONE of the flags prefixed by either ! or ^
+ | …have ANY (at least one) of the flags prefixed by either + or |
~ …NOT have at least one of the flags prefixed by ~
= …have EXACTLY the flags prefixed by = (and no others)

(- and + are the same characters as used by find … -perm for similar tests, and & and | represent AND (all of these flags) and OR (any of these flags). ! and ^ are as per regexp and glob character classes. ~ suggests negative, but is a more tentative suggestion than the others.)

The various samtools/bcftools commands would then use sam_parse_flag_filter() with different initial mode arguments to parse their various flag filtering command line options. Hence for example the following would be equivalent:

samtools view -f '3|12^48~192' …

samtools view -f '&1,2' -f '|4,8' -f '^16,32' -f '~64,128' …

samtools view -f '&1,2,|4,8,^16,32,~64,128' …

samtools view -f 1 -f 2 --rf 4 --rf 8 -F 16 -F 32 -G 64 -G 128 …

samtools view --require-flags 3 --include-flags 12 --exclude-flags 48 -G 192 …

Is something like this better or worse than trying to come up with mnemonic long option names and mnemonic short option letters for all these possibilities?

IMHO such punctuation characters certainly have more mnemonic value than a plethora of even more cryptic ‑f/‑F/‑‑ff/‑‑rf/-G short options. Whether they are more mnemonic than ‑‑require‑flags/ ‑‑excl[ude]‑flags/ ‑‑incl[ude]‑flags/ ‑‑something‑flags long options is more of an open question: the long option names are quite distinctive from each other but as we've seen whether e.g. require means & or | is not quite as obvious as would be ideal.

jmarshall avatar Nov 25 '21 15:11 jmarshall

Right now I think we have simple cases (albeit hard to rember) or if we want utmost flexibility we have the expression language to be very precise what bits we want / don't want.

My gut tells me I'd prefer use of those over a new mini-language. (Although this may well be faster to execute.) I'll need to think more on this though.

jkbonfield avatar Nov 25 '21 15:11 jkbonfield

Agreed that now that there is an arbitrary expression language via view -e et al, ideally most of these bespoke flag options are obsolete and wouldn't need to exist.

But to the extent that they do… ‑f/‑F/‑‑ff/‑‑rf/-G and -f -&!^+|~= are both mini-languages for the simple cases. The question is which one is more mnemonic.

jmarshall avatar Nov 25 '21 15:11 jmarshall

I think the basic idea is reasonable, although as noted you can get the same result with the existing expression language. I'm not at all sure about the punctuation symbols though, which reminds me of "write-only" languages like regular expression syntaxes or APL derivatives. They also have a high risk of shell catastrophe if you forget to quote them properly. Something like -f all[<list>],any[<list>],none[<list>],not[<list>] where <list> is a list of flag specifiers as you already use might be a bit easier to understand and remember.

daviesrob avatar Dec 02 '21 11:12 daviesrob

By Jove, I think you've cracked it!

I think typing the close square bracket would be a bit of a pain and still at risk of quoting disasters, but

-f all:<list>,any:<list>,none:<list>,not:<list>

would work very nicely, perhaps keeping the punctuation characters as even shorter synonyms of these word specifiers that would be the canonical spellings of these specifiers.

I think not:… needs a bit more workshopping, as the distinction between the meanings of none: and not: will not be clear in isolation. (This is where the punctuation characters have an advantage, as ~ doesn't particularly have a misleading inherent meaning…)

jmarshall avatar Dec 02 '21 12:12 jmarshall

It's wordy, but the ambiguous nature of English words such as none vs not is why I suggested things like all-on, all-off, any-on, any-off in https://github.com/samtools/bcftools/pull/1611#issuecomment-977853409 (well actually I had set/unset instead of on/off, but the latter is shorter).

We also need to consider what "filter" means. My gut instinct when I see a filter is removing something - eg in the way that filter paper works, a fuel filter, a filter tip, etc. Admittedly in some contexts it's the reverse, such as a notch-filter for signals. Whatever we use it must be consistent, and this is where I feel the current terms are a bit of a mess as they go against (my) expectation of what a filter does. If it was "--require-flags none:<>,all:<> etc then it'd be explicit as to which direction it's applying.

jkbonfield avatar Dec 02 '21 14:12 jkbonfield

Indeed; perhaps notall:<list> or anynot:<list> (or accept either) would be clearer words to use than not: for the hard-to-name entry.

As for “filter”, this PR adds flag_filter routines to the htslib API that are modelled on the naming of hts_expr.h's filter routines. How it would be best described in the user-facing samtools.1 documentation is a separate question.

IMHO the f in samtools view -f stands for “flag” :smile:

jmarshall avatar Dec 02 '21 15:12 jmarshall

I've updated this to use the word prefixes as workshopped between Rob and me, thus:

samtools view -f all:3,any:12,none:48,anynot:192 …

# or equivalently…

samtools view -f 3 -f any:12 -f none:48 -f anynot:192 …

I'm torn between removing the punctuation codes and just having these, and keeping them for brevity in quick command line experimenting. I suspect trying to write the documentation will push me over to favouring just the word prefixes… :smile:

jmarshall avatar Dec 14 '21 11:12 jmarshall

To be clear, the purpose of this PR is to provide infrastructure that samtools and bcftools can use instead of needing their own code and multiple command line options to implement flag-based filtering. So if the maintainers like the approach, there would be followup samtools and bcftools PRs to use the new centralised facility.

This allows usage such as

samtools view -f all:3 -f any:12 …
bcftools mpileup --ff all:3,any:12 …
bcftools mpileup --ff all:3 --ff any:12 …

which I believe, and I hope the maintainers agree, is much more user-friendly and memorable than e.g. the --ls/--ns/--lu/--nu command line options proposed in PR samtools/bcftools#1611.

For this PR to proceed, it would be useful for the maintainers to comment on the direction: in particular, I am now leaning towards dropping the punctuation characters entirely in favour of the word prefixes as per https://github.com/samtools/htslib/pull/1361#issuecomment-993464145. If the maintainers are in favour of that approach, I will rework sam_parse_flag_filter()'s initial mode parameter etc accordingly and finish up the proposed documentation.

jmarshall avatar Jan 18 '22 13:01 jmarshall

Sorry, this was languishing as while in Draft we were assuming it was still in a state of flux. I assume by your comment you're happy with it now and want a review again. (Ie, I'll take another look soon.)

jkbonfield avatar Jan 18 '22 16:01 jkbonfield

For this PR to proceed, it would be useful for the maintainers to comment on the direction: in particular, I am now leaning towards dropping the punctuation characters entirely in favour of the word prefixes as per #1361 (comment). If the maintainers are in favour of that approach, I will rework sam_parse_flag_filter()'s initial mode parameter etc accordingly and finish up the proposed documentation.

That gets my thumbs up. I think I'd find the single character codes somewhat tricky to remember. It's like a nearly-regexp but not enough to be mnemonically unambiguous.

jkbonfield avatar Jan 19 '22 11:01 jkbonfield

What's the status of this, other that "draft"?

I had a play again this morning and think there are still some open questions. Eg:

  • Why does the external struct fields such as omit_any, but the parse API uses string "notall". We should be consistent.

  • Lack of testing. Could we get a test tool?

  • Lack of documentation. Eg "notall" string occurs nowhere in the header file, only in the source. Some of it needs more improvement too. Eg what does #flag mean? It didn't accept a literal # and it's not clear it means numeric given we mix real numbers with symbolic flag numbers (1 and 2). Some examples would help.

I wrote a tiny demo program just to explore, but I'm still struggling to understand some of it.

#include <stdio.h>                                                                                                      
#include "htslib/sam.h"                                                         
                                                                                
int main(int argc, char **argv) {                                               
    sam_flag_filter_t filt = {0};                                               
                                                                                
    printf("Parse=%d\n", sam_parse_flag_filter(argv[2], *argv[1], &filt));      
    printf("%d    all\t%x\n",     filt.has_all,      filt.all);                 
    printf("%d    none\t%x\n",    filt.has_none,     filt.none);                
    printf("%d    any\t%x\n",     filt.has_any,      filt.any);                 
    printf("%d    omitany\t%x\n", filt.has_omit_any, filt.omit_any);            
    printf("%d    exact\t%x\n",   filt.has_exact,    filt.exact);               
                                                                                
    return 0;                                                                   
}                                                                               

This does seem to imply though that there are issues with the parsing. Eg:

$ ./a.out + 'd,R,2'
Parse=0
0    all	0
0    none	0
1    any	422
0    omitany	0
0    exact	0

d, R and 2 are all listed in the documentation as symbolic names:

Flag specification strings are of the form #flag,...,flag,#flag,...,flag,..., i.e., a comma-separated list of flag items (each of which is either a numeric
flag bitset, a symbolic (/[pPuUrR12sxdS_]+/) flag bitset, or a single flag
name (eg "PAIRED")), each of which may be prefixed by a punctuation character
that controls which filter the subsequent items contribute to (until the next
punctuation character):

However 2 has been taken as the number 2 and not symbol 2. How to distinguish? It seems impossible.

Edit: I can use "REVERSE", "DUP" etc, but "READ2" fails. I think this is because the token ends after the D due to the isalpha or '_' check, but I haven't verified that yet.

jkbonfield avatar Apr 26 '22 11:04 jkbonfield

As an attempt to fix the READ1/READ2 bug, my first stab was this:

@@ -5014,8 +5018,16 @@ static unsigned long sam_parse_flag1(const char *s, char **endptr)
 
         case 'R':
             if (strcmp(name, "REVERSE") == 0) return BAM_FREVERSE;
-            else if (strcmp(name, "READ1") == 0) return BAM_FREAD1;
-            else if (strcmp(name, "READ2") == 0) return BAM_FREAD2;
+            else if (strcmp(name, "READ") == 0) {
+                if (*s == '1') {
+                    (*endptr)++;
+                    return BAM_FREAD1;
+                }
+                if (*s == '2') {
+                    (*endptr)++;
+                    return BAM_FREAD2;
+                }
+            }
             break;

It's a bit hacky, but it works. As expected, the issue is the first loop of for (; isalpha_c(*s) || *s == '_'; s++) truncates on "READ", leaving name and endptr incorrect.

My fix does cure this, but I'm thinking it's not the right one. Specifically if we do something like "UNMAP2REVERSE" then it's parsed as if it was "UNMAP,2,REVERSE" and gives us flag 0x16. I'm thinking it should be a parse error. Likely an isalnum above (starting from an isalpha for first char) instead would make these an error while also fixing the READ1/READ2 parsing.

jkbonfield avatar Apr 26 '22 12:04 jkbonfield

The status of this is draft, and as of back in January it is being refactored as discussed at the time. So don't bother reading the current draft's code too closely, as I'll push something rather different when I get back to it.

jmarshall avatar Aug 05 '22 15:08 jmarshall