
Change the priority read modification order

Open y9c opened this issue 3 years ago • 7 comments

The docs mention that cutadapt modifies reads in a fixed order. That order is not suitable for some applications.

For example, adapters are trimmed after quality trimming. But if the read quality drops too fast, part of the adapter is removed during quality trimming (-q), leaving only a few adapter bases (<=3), which may be too short to be recognized in the adapter trimming step (-a).

Another example is UMI or inline barcode trimming. These are fixed in length, so they can be removed with the -u argument. But they sit upstream of the adapter sequence, so -a needs to be executed before -u.
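
The UMI case can be illustrated with a small toy model. This is only a sketch: the two functions below are simplified stand-ins written for this example, not Cutadapt code. In particular, the adapter search here is an exact full-length match, whereas Cutadapt itself also handles partial and error-tolerant matches. The adapter sequence, the UMI and its length are made-up example values.

```python
# Toy model, NOT Cutadapt's implementation: exact-match adapter removal only.
ADAPTER = "AGATCGGAAGAGC"  # assumed example adapter sequence
UMI_LENGTH = 6             # assumed UMI length


def trim_adapter_3prime(read: str, adapter: str) -> str:
    """Remove the first exact occurrence of the adapter and everything after it."""
    position = read.find(adapter)
    return read if position == -1 else read[:position]


def cut_3prime(read: str, n: int) -> str:
    """Remove n bases from the 3' end of the read."""
    return read[:-n] if n > 0 else read


insert = "ACGTACGTTTGACCA"
umi = "GATTAC"
read = insert + umi + ADAPTER  # layout: insert, then UMI, then adapter

# Fixed order: unconditional cut first, then adapter trimming.
# The cut removes adapter bases instead of the UMI.
cut_first = trim_adapter_3prime(cut_3prime(read, UMI_LENGTH), ADAPTER)

# Requested order: adapter trimming first, then the cut removes the UMI.
adapter_first = cut_3prime(trim_adapter_3prime(read, ADAPTER), UMI_LENGTH)

print("cut first:    ", cut_first)
print("adapter first:", adapter_first)
```

In this toy model only the second order recovers the bare insert. With the first order, the UMI stays in the read because the cut consumed the end of the adapter instead.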

Could you add an option to adjust the order of the modification steps?

Thanks!

Read modification order

The read modifications described above are applied in the following order to each read. Steps not requested on the command-line are skipped.

  1. Unconditional base removal with --cut
  2. Quality trimming (-q)
  3. Adapter trimming (-a, -b, -g and uppercase versions)
  4. Read shortening (--length)
  5. N-end trimming (--trim-n)
  6. Length tag modification (--length-tag)
  7. Read name suffix removal (--strip-suffix)
  8. Addition of prefix and suffix to read name (-x/--prefix and -y/--suffix)
  9. Read renaming according to --rename
  10. Replace negative quality values with zero (zero capping)

Ref: https://github.com/marcelm/cutadapt/issues/172#issuecomment-1062372817

y9c, Mar 09 '22 19:03

Some notes.

To me, the biggest problem for things like this is how to design the user interface. A couple of ideas:

  1. Use something like --order=trim,cut,qtrim,rename
  2. Add an option --order that changes the command-line parsing so that read modifications are done in the order in which they are listed on the command line. It’s a bit difficult to get the order of the options out of argparse, though. Also, -a ACGT -q 10 -a GGTA would now have to result in two adapter trimming steps. Or what if someone wants to have two adapter trimming steps, but not do any quality trimming in between? -a ACGT -a GGTA would be ambiguous, so we need a no-op command-line option, something like --and or --then.
  3. Come up with a way to describe workflows in a YAML file or so. This could be much more flexible than the command-line interface.
  4. Implement an API so that users can write their own little scripts using Cutadapt functions. This could be almost as simple as the YAML file above in the best case.
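
As a rough sketch of idea 2, the order of options can be recovered from argparse with a custom `argparse.Action` that appends each option to a shared list as it is parsed. Everything here is an assumption made for illustration: the `RecordOrder` class, the `group_steps` helper, the `steps` attribute and the `--then` separator are hypothetical and not part of Cutadapt. Only `-a`, `-q` and `-u` are modelled.

```python
import argparse


class RecordOrder(argparse.Action):
    """Append (dest, value) to namespace.steps in command-line order."""

    def __call__(self, parser, namespace, values, option_string=None):
        steps = getattr(namespace, "steps", None)
        if steps is None:
            steps = []
            namespace.steps = steps
        steps.append((self.dest, values))


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ordered-demo")
    parser.add_argument("-a", dest="adapter", action=RecordOrder)
    parser.add_argument("-q", dest="quality_cutoff", type=int, action=RecordOrder)
    parser.add_argument("-u", dest="cut", type=int, action=RecordOrder)
    # Hypothetical no-op separator that forces a new step to begin.
    parser.add_argument("--then", dest="then", nargs=0, action=RecordOrder)
    return parser


def group_steps(steps):
    """Merge consecutive -a options into one adapter step unless --then intervenes."""
    grouped = []
    merge_allowed = False
    for name, value in steps:
        if name == "then":
            merge_allowed = False  # explicit step boundary, produces no step itself
            continue
        if name == "adapter":
            if merge_allowed:
                grouped[-1][1].append(value)
            else:
                grouped.append(("adapter", [value]))
            merge_allowed = True
        else:
            grouped.append((name, value))
            merge_allowed = False
    return grouped


def parse_steps(argv):
    args = build_parser().parse_args(argv)
    # No recorded options means no "steps" attribute was ever created.
    return group_steps(getattr(args, "steps", None) or [])


print(parse_steps(["-a", "ACGT", "-q", "10", "-a", "GGTA"]))
print(parse_steps(["-a", "ACGT", "-a", "GGTA"]))
print(parse_steps(["-a", "ACGT", "--then", "-a", "GGTA"]))
```

The three calls correspond to the cases discussed above: two adapter steps separated by quality trimming, two adapters searched in a single step, and two separate adapter steps with nothing in between.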

marcelm, Mar 23 '22 14:03

Thanks @marcelm. It sounds like solution 2 is the more balanced option. YAML is also a good solution, but it would add some complexity.

y9c, Mar 29 '22 22:03

Hi @marcelm, I would like to know whether this feature is in the master branch now.

y9c, May 17 '22 05:05

Hi @marcelm, I have been thinking about the API you mentioned in idea 4. It would be very interesting if the API could be chained into a pipeline, the way the pandas package does it.
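
To make the pandas comparison concrete, here is a minimal sketch of what a chainable interface might look like, loosely modelled on `DataFrame.pipe`. The `ReadPipeline` class and the two step functions are hypothetical and exist only for this example. They are not an existing Cutadapt API, and they operate on plain sequence strings rather than real read objects.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Step = Callable[[str], str]


@dataclass(frozen=True)
class ReadPipeline:
    """Hypothetical chainable pipeline; NOT part of Cutadapt."""

    steps: List[Step] = field(default_factory=list)

    def pipe(self, step: Step) -> "ReadPipeline":
        # Return a new pipeline so that a partial pipeline can be reused.
        return ReadPipeline(self.steps + [step])

    def run(self, sequence: str) -> str:
        # Apply the steps in exactly the order in which they were chained.
        for step in self.steps:
            sequence = step(sequence)
        return sequence


def strip_trailing_n(sequence: str) -> str:
    """Remove N bases from the 3' end."""
    return sequence.rstrip("N")


def cut_5prime(n: int) -> Step:
    """Build a step that removes n bases from the 5' end."""
    return lambda sequence: sequence[n:]


pipeline = ReadPipeline().pipe(strip_trailing_n).pipe(cut_5prime(2))
result = pipeline.run("GGACGTNN")
print(result)
```

With an interface of this shape, the order of the modifications is simply the order of the `.pipe()` calls, so no separate ordering option would be needed.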

y9c, Jul 04 '22 20:07

I’ll have a look after my vacation.

marcelm, Jul 06 '22 14:07

Hi @marcelm,

I would like to follow up: is there any update on this feature in the master branch? Thank you!

y9c, Mar 08 '23 05:03