htslib icon indicating copy to clipboard operation
htslib copied to clipboard

Proposal for an aux tagged field iterator API

Open jmarshall opened this issue 2 years ago • 7 comments

As suggested in #1319, this adds new API functions for iterating through a BAM record's aux fields, inline accessor methods for field tag and type (or code can continue to use s-2 and *s), and a variant of bam_aux_del() that returns the (updated) iterator to the following field (for use in iterator-based loops that delete fields). IMHO there's no need for an additional bam_tag_iter_t type, as the uint8_t* pointer to aux fields is already all over this part of the API and there's no need for more iterator state anyway.

This draft rewrites bam_aux_get() in terms of the new iterator functions. It's slightly debatable whether to do that (and remove the old version) or to leave it alone.

Tossing up between using the C++-style bam_aux_erase() nomenclature or calling the new deletion function bam_aux_del1() or so. I also have a version with an additional bam_aux_erase_range() function that deletes several aux fields at once (so could be used to directly reimplement the way James did the deleting in samtools/samtools#516, which probably reduces the number of memmove() calls so is 0.1% more efficient), but not totally sure the slight complication is worth it.

This somewhat parallels PR #1351, but it's a separate part of the API (aux fields vs. headers) so might as well be a separate PR. Needs some test cases once the basic approach has basic approval.

Closes #1319.

jmarshall avatar Nov 09 '21 13:11 jmarshall

(See jmarshall/samtools@aefbf50cf18fef678beb2b1bf364cda68c2e174c for a draft of the samtools --keep-tag/--remove-tag functionality re-implemented via this API.)

jmarshall avatar Nov 09 '21 13:11 jmarshall

I'll take complete silence as signalling approval of the basic approach :smile:

The new iterator-oriented field deletion functions are now bam_aux_remove() and bam_aux_remove2(), as …remove is fairly widespread in HTSlib API function names. Test cases exercising the new API functions added.

@daviesrob et al: This is now ready for review.

jmarshall avatar Apr 05 '22 12:04 jmarshall

I like this feature. Bar a few niggles, it looks solid.

Thanks

jkbonfield avatar Apr 06 '22 11:04 jkbonfield

Bumping this again to see where we're at.

I've removed bam_aux_remove2(), which was amongst other things a pain to use, and reunified bam_aux_remove() accordingly. (See force push to https://github.com/samtools/htslib/commit/99954c125b47166945a751b3a0d1fa20284b4cac below.)

I'll add something else more useful for implementing stuff like keep_tag/remove_tag in its place.

I haven't refreshed my mind of the discussions and code yet, but was this something you were planning on doing still, and if so is it worth waiting for it or would you view it as part 2 of the API, to arrive in a subsequent PR? (If so I assume it's a purely additional thing and not something that would need changing of the existing API.)

jkbonfield avatar Aug 05 '22 15:08 jkbonfield

~~Don't refresh your memory of the code as it stands in this PR. Back in April, I got most of the way through refactoring it as discussed. I may have time to finish that up in the next few days.~~

jmarshall avatar Aug 05 '22 15:08 jmarshall

Oops, actually it is #1361 that has been substantially refactored from the current state of its PR.

I have an in-progress commit adding bam_aux_remove_if() to the existing PR. I was planning on adding it to this PR (as there's some risk it might involve tweaks to the existing proposed addition), or it probably could be done as a Part II.

I may have time to finish that up in the next few days, so maybe put this one on your list to have another look at on Wednesday :smile:

jmarshall avatar Aug 05 '22 15:08 jmarshall

Thanks. I did check if it was draft before commenting, as I had a recollection it was, but as you noted it's the samtools side that is the draft bit.

I'm refreshing my memory of what this does anyway as it'll help get my head in the zone for the upcoming changes.

jkbonfield avatar Aug 05 '22 16:08 jkbonfield

(That force-push was just to bring this up to current develop unchanged. Further pushes will make actual changes.)

jmarshall avatar Aug 22 '22 12:08 jmarshall

I've addressed the review comments to bam_aux_remove() and think that function should be ready for merging.

I've added the proposed API for bam_aux_remove_if(), which facilitates removing several aux fields at once, for review. This third commit is still in draft form, as the implementation needs testing and test cases; it's currently implemented similarly to James's versions over in samtools/sam_view.c, though I've been trying to implement a version that memmoves several kept fields at once.

Hence I've turned this PR back to draft, but feel free to (squash/)merge the first two commits.

jmarshall avatar Aug 30 '22 12:08 jmarshall

Thanks. I'll review the first couple commits with a view to merging them in isolation.

jkbonfield avatar Aug 30 '22 14:08 jkbonfield

Gah, I'll remove that last commit and bring bam_aux_remove_if() back as a separate PR later — this one has been hanging around quite long enough!

So you can review this part 1 as a normal PR…

jmarshall avatar Aug 30 '22 14:08 jmarshall

Hah beat me to it. I was going to squash and merge outside of this PR so as not to confuse your own branch (after testing it in anger with another samtools branch). Are you happy for me to squash and push over the top of your branch so this tidies itself up automagically on merge? (Or do it yourself - I don't care either way)

jkbonfield avatar Aug 30 '22 14:08 jkbonfield

I think I know which other samtools branch that is :smile:

Feel free to (force-)push to this PR's branch. Or use the “Squash and merge” UI button in due course, which will turn the PR happily purple/merged too.

jmarshall avatar Aug 30 '22 14:08 jmarshall

Yeah the draft sanitize function. We also discussed whether it belongs in htslib. My preference would be there and maybe enabled by an hts_set_opt call so any program can read *AM records with some basic aligner (cough bwa) fix-ups applied and be sure it doesn't have to worry about silly things like bases aligning off the ends of chromosomes or unmapped data with silly mapping qualities and non-sensical cigar strings.

But for now it's a useful test of an external tool against this API too. :)

jkbonfield avatar Aug 30 '22 14:08 jkbonfield

Ah frogger-it... meant squash and merge rather than just rebase. Oh well, it is as it is :)

jkbonfield avatar Aug 30 '22 16:08 jkbonfield