cutadapt
Change the read modification order
The docs mention that cutadapt modifies reads in a fixed order, but that order is not suitable for some applications.
For example, adapters are trimmed after quality trimming. If the read quality drops too fast, part of the adapter is removed during quality trimming (-q), leaving only a few bases (<=3), which may be too short to be recognized in the adapter trimming step (-a).
Another example is UMI or inline barcode trimming. These are fixed in length, so they can be removed with the -u argument. But they are upstream of the adapter sequence, so -a needs to be executed before -u.
Could you add an option to adjust the order of the modification steps?
Thanks!
Read modification order
The read modifications described above are applied in the following order to each read. Steps not requested on the command-line are skipped.
- Unconditional base removal with --cut
- Quality trimming (-q)
- Adapter trimming (-a, -b, -g and uppercase versions)
- Read shortening (--length)
- N-end trimming (--trim-n)
- Length tag modification (--length-tag)
- Read name suffix removal (--strip-suffix)
- Addition of prefix and suffix to read name (-x/--prefix and -y/--suffix)
- Read renaming according to --rename
- Replace negative quality values with zero (zero capping)
Ref: https://github.com/marcelm/cutadapt/issues/172#issuecomment-1062372817
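To make the effect of the order concrete, here is a minimal toy sketch of the UMI case from the request. It is not Cutadapt code: `cut`, `trim_3prime_adapter` and `apply_steps` are hypothetical helpers, the sequences are made up, and the adapter match is exact-only (no partial or error-tolerant matching, unlike the real tool).

```python
from typing import Callable, List

Step = Callable[[str], str]


def cut(n: int) -> Step:
    """Toy stand-in for --cut: drop n bases from the start (n > 0) or end (n < 0)."""
    def step(seq: str) -> str:
        if n > 0:
            return seq[n:]
        if n < 0:
            return seq[:n]
        return seq
    return step


def trim_3prime_adapter(adapter: str) -> Step:
    """Toy stand-in for -a: exact full-length match only (an assumption)."""
    def step(seq: str) -> str:
        index = seq.find(adapter)
        return seq[:index] if index >= 0 else seq
    return step


def apply_steps(seq: str, steps: List[Step]) -> str:
    """Apply the modification steps in exactly the order given."""
    for step in steps:
        seq = step(seq)
    return seq


# Made-up read layout: insert, then a 4-base UMI, then the 3' adapter.
INSERT = "ACGTACGTAC"
UMI = "TTGG"
ADAPTER = "AGATCGGAAG"
read = INSERT + UMI + ADAPTER

default_order = [cut(-4), trim_3prime_adapter(ADAPTER)]    # cut first
requested_order = [trim_3prime_adapter(ADAPTER), cut(-4)]  # adapter first

print(apply_steps(read, default_order))    # UMI and adapter remnant survive
print(apply_steps(read, requested_order))  # only the insert remains
```

With the cut applied first, the last four adapter bases are removed and the UMI stays in the read; with the adapter removed first, the same cut strips the UMI.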
Some notes.
To me, the biggest problem for things like this is how to design the user interface. A couple of ideas:
- Use sth. like --order=trim,cut,qtrim,rename
- Add an option --order that changes the command-line parsing so that read modifications are done in the order in which they are listed on the command line. It’s a bit difficult to get the order of the options out of argparse, though. Also, -a ACGT -q 10 -a GGTA would now have to result in two adapter trimming steps. Or what if someone wants to have two adapter trimming steps, but not do any quality trimming in between? -a ACGT -a GGTA would be ambiguous, so we need a no-op command-line option, something like --and or --then.
- Come up with a way to describe workflows in a YAML file or so. This could be much more flexible than the command-line interface.
- Implement an API so that users can write their own little scripts using Cutadapt functions. This could be almost as simple as the YAML file above in the best case.
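For the second idea, one possible way to recover the option order from argparse is a custom action that appends each option to a shared list as it is parsed. The sketch below is only an illustration under stated assumptions, not how Cutadapt parses its command line: `OrderedStep`, `build_parser` and `group_steps` are hypothetical names, the parser knows just three of the options, and --then is the no-op separator suggested above.

```python
import argparse


class OrderedStep(argparse.Action):
    """Record (dest, value) pairs in the order they appear on the command line."""

    def __call__(self, parser, namespace, values, option_string=None):
        steps = getattr(namespace, "steps", None)
        if steps is None:
            steps = []
            namespace.steps = steps
        steps.append((self.dest, values))


def build_parser():
    # Hypothetical demo parser; only a subset of options is modelled.
    parser = argparse.ArgumentParser(prog="ordered-demo")
    parser.set_defaults(steps=None)
    parser.add_argument("-a", dest="adapter", action=OrderedStep)
    parser.add_argument("-q", dest="qtrim", type=int, action=OrderedStep)
    parser.add_argument("-u", dest="cut", type=int, action=OrderedStep)
    # No-op separator: takes no value, only splits adjacent steps.
    parser.add_argument("--then", dest="then", nargs=0, action=OrderedStep)
    return parser


def group_steps(steps):
    """Merge adjacent options of the same kind into one step; --then splits them."""
    grouped = []
    previous = None
    for name, value in steps or []:
        if name != "then":
            if name == previous:
                grouped[-1][1].append(value)
            else:
                grouped.append((name, [value]))
        previous = name
    return grouped


parser = build_parser()
for argv in (
    ["-a", "ACGT", "-q", "10", "-a", "GGTA"],
    ["-a", "ACGT", "-a", "GGTA"],
    ["-a", "ACGT", "--then", "-a", "GGTA"],
):
    print(argv, "->", group_steps(parser.parse_args(argv).steps))
```

In this sketch, adjacent -a options are treated as one adapter trimming step with several adapters, and --then forces them into two consecutive steps. That is one possible reading of the ambiguity, not a decided behaviour.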
Thanks @marcelm. It sounds like solution 2 is the most balanced. YAML is also a good solution, but it would add some complexity.
Hi @marcelm, is this feature in the master branch now?
Hi @marcelm, I have been thinking about the API you mentioned in idea 4. It would be very interesting if the API could be chained into a pipeline, the way the pandas package does.
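A rough sketch of what I have in mind, modelled on the `pipe` method of pandas objects. Everything here is hypothetical and not part of Cutadapt: the `Read` class and the `cut` and `trim_n_ends` functions are made up for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Read:
    """Hypothetical immutable read record, not a Cutadapt class."""
    name: str
    sequence: str
    qualities: str

    def pipe(self, func, *args, **kwargs):
        """Call func(self, ...) and return its result, so steps can be chained."""
        return func(self, *args, **kwargs)


def cut(read: Read, n: int) -> Read:
    """Drop n bases from the start (n > 0) or from the end (n < 0)."""
    if n > 0:
        window = slice(n, None)
    elif n < 0:
        window = slice(None, n)
    else:
        window = slice(None)
    return replace(read, sequence=read.sequence[window],
                   qualities=read.qualities[window])


def trim_n_ends(read: Read) -> Read:
    """Strip N bases from both ends, keeping the qualities aligned."""
    seq = read.sequence
    start = len(seq) - len(seq.lstrip("N"))
    end = len(seq.rstrip("N"))
    return replace(read, sequence=seq[start:end],
                   qualities=read.qualities[start:end])


read = Read("r1", "NNACGTTTGGN", "IIIIIIIIIII")
# The user decides the order simply by the order of the chained calls.
result = read.pipe(trim_n_ends).pipe(cut, -4)
print(result.sequence)
```

Because every step takes a read and returns a new one, the order is simply the order in which the calls are chained.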
I’ll have a look after my vacation.
Hi @marcelm,
I would like to follow up: is there any update on this feature in the master branch? Thank you!