
AdapterRemoval v3 Feedback

jfy133 opened this issue 2 years ago · 12 comments

Hi @MikkelSchubert, I decided to make a dedicated issue for general feedback from AdapterRemoval v3 testing, as I may find other points to discuss:

Version v3.0.0-pre: https://github.com/MikkelSchubert/adapterremoval/commit/344591c303da2b219f12c47f088c1fc573a84916

  • ~~Leaving in single reads with Ns~~

        --combined-output  
            If set, all reads are written to the same file(s), specified by
            --output1 and --output2 (--output1 only if --interleaved-output is
            not set). Discarded reads are replaced with a single 'N' with Phred
            score 0 [default: off].
    

    ~~While I used to do this, @ashildv was recently informed by the ENA that including 'discarded reads' as a single 'N' will not be accepted by their pipeline (it breaks, and the data gets rejected). Maybe it would be worth having e.g. 5 Ns or something (or removing them entirely)?~~ <- I realise I could just use the custom output options instead and make sure discarded reads go in a separate file (see the sketch after this list)

  • --singleton flag: for consistency, would it make sense to have --outputsingleton, since the other output flags (1, 2, merged) start with --output?

  • --settings FILE: could maybe be renamed, as the bulk of the JSON's contents is stats rather than the settings themselves

  • JSON output:

    • it would be nice for this to also include the physical number of records present in the resulting output files when reads are merged (as a separate value), roughly equivalent to 'retained reads' in v2.3.2. Currently the JSON only reports the number of output (passed) reads as if everything were unmerged. So, in addition to the passed, discarded and unidentified sections of the output JSON, having something like in_files or output_file would be nice, as it helps match an (unfamiliar) end-user's expectations between the file itself and the JSON report. However, I recognise that this could be complicated given the very flexible output system now.
    • It would be nice to have some documentation for what each value means. I've tried playing around, but I still can't work out how the various read entries in the JSON relate to each other and to what is in the final output FASTQ files.
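For reference, here is a minimal sketch of the workaround mentioned in the first point: route discarded reads to their own file rather than using --combined-output. The --output1/--output2 names come from the v3 pre-release help text quoted above, while --file1/--file2 and --discarded follow v2 naming, so the exact v3 spellings may differ:

    # keep passing pairs and discarded reads in separate files, so no
    # placeholder 'N' records end up in the FASTQs submitted to the ENA
    adapterremoval3 --file1 reads_R1.fastq.gz --file2 reads_R2.fastq.gz \
        --output1 trimmed_R1.fastq.gz --output2 trimmed_R2.fastq.gz \
        --discarded discarded.fastq.gz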

Initial tests completed; most of the above are quality-of-life issues, otherwise everything is working as expected :+1:

jfy133 · Nov 08 '21

Thank you for your comments!

I apologize for the lack of documentation, but as you can probably imagine I'd prefer to wait until I am closer to release (I should probably have labeled this 3.0-alpha rather than 3.0-pre for clarity. Not sure why I picked the latter, tbh). But this is also why now is a good time for you to mention any needed or nice-to-have features on your wishlist.

I'm already planning a few features such as filtering of low-complexity reads, detection of over-represented sequences, possibly better quality trimming (window based or Phred-based à la seqtk, maybe by default), trimming of poly-G tails, GC content distribution, a read-only QC mode so that I can save time by using ARv3 instead of FastQC, adapter identification in SE reads (might be a pipe dream), further optimizations, and a number of minor improvements. I'll add these to the changelog at some point, but prioritized listing breaking changes to begin with. I cannot promise that I'll get to everything, but I'll try!

  1. You are right that you can just combine the various output options to get what you want; the --combined-output was added based on issue #8, with the goal of having an interleaved file maintaining correct pairing that contained everything. However, this has caused problems before (see #44), so it was already on my radar, and if it causes problems with ENA then that is another reason to revisit it. Perhaps @kdm9 can comment on their use-case? (Apologies for the ping if this is no longer relevant to you!)

  2. The plan is to go over all command-line options and apply a more consistent naming scheme. I'm currently considering something like --output1, --output-merged, etc., but it is relatively low priority at the moment (mainly because I dread having to name things). The plan is to alias the old names so that this won't be a breaking change in itself.

  3. As above, this will probably be renamed to --output-json along with an --output-html option for a planned human-readable report akin to multiqc/fastqc/fastp.

JSON output:

  1. Having entries recording the number of records actually written is a good idea (see the sketch after this list). I chose to write everything in terms of input reads since it made it easier to follow where a given portion of your reads ended up (something that was fairly confusing in the old settings file), but it does mean that the numbers don't necessarily match the contents of the files. Having both numbers should hopefully make it more obvious. Listing the actual files a given read-type was written to is an excellent idea and shouldn't be too difficult to implement, though it will mean that the same filename can show up in multiple places.

  2. That's the plan. But I probably won't write proper documentation until I'm closer to release and have locked down the JSON layout (and everything else).
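As for the numbers in point 1, here is a purely hypothetical sketch of what such a layout could look like (the key names are illustrative, not the actual v3 report format). Each read-type section might report both the count in input-read terms and the records actually written, plus the destination files:

    {
        "merged": {
            "reads": 2000,
            "output_records": 1000,
            "output_files": ["example.merged.fastq.gz"]
        }
    }

Here "reads" counts in terms of input reads (two input reads collapse into one merged record), "output_records" matches what is physically in the file, and "output_files" may repeat a filename across sections when several read-types share an output.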

Cheers and thanks again!

MikkelSchubert · Nov 08 '21

Hi Mikkel,

No worries at all - completely understand about documentation. Given that most of my comments are quality-of-life points, it shows that everything is pretty stable already :+1:

Re: JSON output - yes, your logic makes sense, and indeed for 'power users' thinking in terms of input reads makes perfect sense, as they generally have a feeling for how everything is meant to work. I think it's OK even if the same filename appears in multiple places: as long as the user can manually do the maths to get everything to total up as they expect, it should be fine :+1:

The extra functionality sounds great, lots of cherries on the cake. Happy to test any of those out (poly-G tail trimming in particular sounds nice!) - feel free to ping me.

jfy133 · Nov 09 '21

Hi @MikkelSchubert, and sorry for the slow response. I no longer need truly paired files à la #8, as all the tools I use now support broken paired files like the one below[1]. I can't speak for others, but I suspect removing this behaviour is now acceptable.

Best, Kevin

[1]: To be clear, a broken paired file contains both reads of a pair where both exist, and single reads where the mate has been removed, collapsed, or is otherwise missing. Like the below (quality lines omitted for brevity):

    @1/1
    ACGT
    @1/2
    CCGT
    @2/1
    CACT
    @3/1
    GGCA
    @3/2
    CCAG

kdm9 · Nov 16 '21

Thank you very much for the response.

If that is the case then I probably will remove that option. If there turns out to be people who rely on it then I can reintroduce it later.

MikkelSchubert · Nov 16 '21

Hi @MikkelSchubert, is there anything else you would like tested? There is a discussion in our department about switching entirely to AdapterRemoval or fastp, and testing is ongoing. Given you have been overhauling a lot of things, I was wondering if the new version is ready and worth comparing instead (with poly-G trimming etc.)

jfy133 · Oct 21 '22

Hi @jfy133. Sorry for the lack of progress. I've had a lot of projects that needed my attention, which unfortunately meant that v3 has not progressed as far as I would have liked. However, I'm intending to release a proper alpha version within the next week or two, so that you (and anyone else interested) will have an almost feature-complete tool to test, and then hopefully wrap up the 3.0 release not too long after that (depending on feedback).

MikkelSchubert · Oct 24 '22

@MikkelSchubert no worries! You've said as much before; it was just the discussion in my department that prompted my nudge.

I'll let people know an alpha version of v3 is imminent, so they are ready for testing :+1:

jfy133 · Oct 25 '22

@jfy133 The first alpha release of V3 is available under releases: https://github.com/MikkelSchubert/adapterremoval/releases/tag/v3.0.0-alpha1

The linked documentation contains updated instructions for how to compile v3, and (preliminary) instructions on how to migrate from v2. There is also a pre-compiled binary, which should hopefully work on most 64-bit Linux systems, and which you are very welcome to try. I've attached the WIP container file that I used to build the static binary, in case you prefer building your own binaries and would like a static one.
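For anyone trying the pre-compiled binary, it should presumably just be a matter of unpacking and running it; the archive name below is a placeholder, so substitute the actual artifact attached to the release page:

    # hypothetical archive name; use the file from the release page
    tar xzf adapterremoval-3.0.0-alpha1-linux64.tar.gz
    ./adapterremoval3 --version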

Any and all feedback is greatly appreciated.

Cheers

Containerfile.txt

MikkelSchubert · Nov 07 '22

Hi Mikkel,

There was an internal testing comparison in our department with AR2 and a couple of other tools, and overall the performance/output was good/comparable :+1:.

Some docs feedback (I've not actually tried it myself yet): something that is not clear to me based on the docs is the behaviour of --post-trim-polyx. An example might be useful somewhere. What I mean is: do I just do --{pre,post}-trim-polyx G, and AR3 will then trim off any length of Gs? Or do I have to specify the exact number, --{pre,post}-trim-polyx GGGGG?

What I think confuses me is the phrasing of

Zero or more nucleotides may be specified after the option separated by spaces

If it's the former interpretation I mentioned above, I would maybe rephrase it to something like:

A comma-separated list of one or any combination of A, C, G, or T.

That would maybe make more sense to me (I think).

jfy133 · Nov 29 '22

Hi James,

Thank you very much for the feedback! Out of curiosity, what tools are you comparing ar3 to?

With regards to the --{pre,post}-trim-polyx option, yes, that should definitely be clarified.

The idea is that you can use adapterremoval3 --pre-trim-polyx G, adapterremoval3 --pre-trim-polyx A G, or some other combination of bases, to trim only those specific types of tails. Or you can use adapterremoval3 --pre-trim-polyx (i.e. without any explicit bases), which is equivalent to adapterremoval3 --pre-trim-polyx A C G T. The list of bases can also be written as a single parameter to make scripting simpler, e.g. --pre-trim-polyx ACGT.

The --{pre,post}-trim-polyx options are controlled by the --trim-polyx-threshold option, which sets the minimum length of tails to be trimmed (defaulting to 10 bp).
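Condensing the above into example commands (a sketch based on this description; the --file1 input flag is assumed here, following v2-style naming):

    # trim only poly-G tails prior to adapter trimming
    adapterremoval3 --file1 reads.fastq.gz --pre-trim-polyx G
    # trim poly-A and poly-G tails
    adapterremoval3 --file1 reads.fastq.gz --pre-trim-polyx A G
    # equivalent ways of trimming all four types of tails
    adapterremoval3 --file1 reads.fastq.gz --pre-trim-polyx
    adapterremoval3 --file1 reads.fastq.gz --pre-trim-polyx A C G T
    adapterremoval3 --file1 reads.fastq.gz --pre-trim-polyx ACGT
    # only trim tails of at least 15 bp (the default threshold is 10 bp)
    adapterremoval3 --file1 reads.fastq.gz --pre-trim-polyx G --trim-polyx-threshold 15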

Also, if you've been benchmarking the static binary I provided, then I'd suggest trying the one attached to this message instead. It includes a few significant optimizations that didn't make it into the first alpha, and replaces the slow musl allocator used when building on Alpine with mimalloc (a problem I unfortunately didn't notice at the time). This results in a nice speed increase and a higher max throughput as the number of threads is increased.

Best, Mikkel

adapterremoval-3.0.0-6b3098a9-linux64.tar.gz Containerfile.txt

MikkelSchubert · Nov 29 '22

Hi Mikkel

I wasn't involved in the testing (unfortunately I have other deadlines), however I believe it was AR2, fastp, and leeHom.

Ah, there is a --trim-polyx-threshold? This is probably what tripped me up, then! It isn't present in the HTML manpage in the manual docs from your previous upload (but I see it is in the --help of the latest binary). Knowing about it would have helped a lot with my confusion the first time around ;).


Otherwise, OK, this makes a bit more sense. Probably having those in examples.html would be sufficient, as that was my first port of call after not understanding the manpage on first reading.

I can ask the people doing the testing here to try your latest version again, but I think they've finished for now (they were not so concerned about speed as with trimming/merging performance).

jfy133 · Nov 29 '22

Good catch! I'll have to double-check that all options are listed. And adding an example command sounds like a good idea.

With regards to your colleagues, I don't think it is worthwhile to re-run anything if they were mainly looking at trimming/merging accuracy/sensitivity/etc. There are differences with this build, as one of the new optimizations also fixed a small corner case where good alignments would be missed (and I'll probably tweak it a bit more to handle some other edge cases better), but as far as I could tell it only affected a tiny fraction of reads.

Best, Mikkel

MikkelSchubert · Nov 29 '22