htslib
htslib copied to clipboard
Proposal for an aux tagged field iterator API
As suggested in #1319, this adds new API functions for iterating through a BAM record's aux fields, inline accessor methods for field tag and type (or code can continue to use s-2
and *s
), and a variant of bam_aux_del()
that returns the (updated) iterator to the following field (for use in iterator-based loops that delete fields). IMHO there's no need for an additional bam_tag_iter_t
type, as the uint8_t*
pointer to aux fields is already all over this part of the API and there's no need for more iterator state anyway.
This draft rewrites bam_aux_get()
in terms of the new iterator functions. It's slightly debatable whether to do that (and remove the old version) or to leave it alone.
Tossing up between using the C++-style bam_aux_erase()
nomenclature or calling the new deletion function bam_aux_del1()
or so. I also have a version with an additional bam_aux_erase_range()
function that deletes several aux fields at once (so could be used to directly reimplement the way James did the deleting in samtools/samtools#516, which probably reduces the number of memmove()
calls so is 0.1% more efficient), but not totally sure the slight complication is worth it.
This somewhat parallels PR #1351, but it's a separate part of the API (aux fields vs. headers) so might as well be a separate PR. Needs some test cases once the basic approach has basic approval.
Closes #1319.
(See jmarshall/samtools@aefbf50cf18fef678beb2b1bf364cda68c2e174c for a draft of the samtools --keep-tag
/--remove-tag
functionality re-implemented via this API.)
I'll take complete silence as signalling approval of the basic approach :smile:
The new iterator-oriented field deletion functions are now bam_aux_remove()
and bam_aux_remove2()
, as …remove
is fairly widespread in HTSlib API function names. Test cases exercising the new API functions added.
@daviesrob et al: This is now ready for review.
I like this feature. Bar a few niggles, it looks solid.
Thanks
Bumping this again to see where we're at.
I've removed bam_aux_remove2(), which was amongst other things a pain to use, and reunified bam_aux_remove() accordingly. (See force push to https://github.com/samtools/htslib/commit/99954c125b47166945a751b3a0d1fa20284b4cac below.)
I'll add something else more useful for implementing stuff like keep_tag/remove_tag in its place.
I haven't refreshed my mind of the discussions and code yet, but was this something you were planning on doing still, and if so is it worth waiting for it or would you view it as part 2 of the API, to arrive in a subsequent PR? (If so I assume it's a purely additional thing and not something that would need changing of the existing API.)
~~Don't refresh your memory of the code as it stands in this PR. Back in April, I got most of the way through refactoring it as discussed. I may have time to finish that up in the next few days.~~
Oops, actually it is #1361 that has been substantially refactored from the current state of its PR.
I have an in-progress commit adding bam_aux_remove_if()
to the existing PR. I was planning on adding it to this PR (as there's some risk it might involve tweaks to the existing proposed addition), or it probably could be done as a Part II.
I may have time to finish that up in the next few days, so maybe put this one on your list to have another look at on Wednesday :smile:
Thanks. I did check if it was draft before commenting, as I had a recollection it was, but as you noted it's the samtools side that is the draft bit.
I'm refreshing my memory of what this does anyway as it'll help get my head in the zone for the upcoming changes.
(That force-push was just to bring this up to current develop unchanged. Further pushes will make actual changes.)
I've addressed the review comments to bam_aux_remove()
and think that function should be ready for merging.
I've added the proposed API for bam_aux_remove_if()
, which facilitates removing several aux fields at once, for review. This third commit is still in draft form, as the implementation needs testing and test cases; it's currently implemented similarly to James's versions over in samtools/sam_view.c, though I've been trying to implement a version that memmoves several kept fields at once.
Hence I've turned this PR back to draft, but feel free to (squash/)merge the first two commits.
Thanks. I'll review the first couple commits with a view to merging them in isolation.
Gah, I'll remove that last commit and bring bam_aux_remove_if()
back as a separate PR later — this one has been hanging around quite long enough!
So you can review this part 1 as a normal PR…
Hah beat me to it. I was going to squash and merge outside of this PR so as not to confuse your own branch (after testing it in anger with another samtools branch). Are you happy for me to squash and push over the top of your branch so this tidies itself up automagically on merge? (Or do it yourself - I don't care either way)
I think I know which other samtools branch that is :smile:
Feel free to (force-)push to this PR's branch. Or use the “Squash and merge” UI button in due course, which will turn the PR happily purple/merged too.
Yeah the draft sanitize function. We also discussed whether it belongs in htslib. My preference would be there and maybe enabled by an hts_set_opt
call so any program can read *AM records with some basic aligner (cough bwa) fix-ups applied and be sure it doesn't have to worry about silly things like bases aligning off the ends of chromosomes or unmapped data with silly mapping qualities and non-sensical cigar strings.
But for now it's a useful test of an external tool against this API too. :)
Ah frogger-it... meant squash and merge rather than just rebase. Oh well, it is as it is :)