how to clip a variable *end* 5' adapter?
I have a library that has a variable-length ADAPTER at the 5' end of the read, but unlike what cutadapt expects, the variability is at the end of the ADAPTER. So, for example, both
ADAPTERsequence and ADAPsequence
are possible reads.
I tried using cutadapt with -g ADAPTER, but this didn't work for the second example, so I ended up building a FASTA file:
>ADAPTER-1
^ADAPTER
>ADAPTER-2
^ADAPTE
>ADAPTER-3
^ADAPT
>ADAPTER-4
^ADAP
>ADAPTER-5
^ADA
Each entry is anchored with ^ because I know that the read will start with the start of the ADAPTER.
I ran cutadapt with --no-indels because it complains that the different sequences are too similar.
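For anyone who wants to reproduce this workaround, here is a minimal sketch of how such a FASTA file could be generated instead of written by hand. The function name `truncated_adapter_fasta` and the `min_length` parameter are my own inventions for illustration, and "ADAPTER" is the same placeholder used above, not a real sequence.

```python
def truncated_adapter_fasta(adapter: str, min_length: int, name: str = "ADAPTER") -> str:
    """Return FASTA text with one anchored 5' adapter per allowed truncation.

    The adapter is shortened from its 3' end one base at a time, down to
    min_length bases. Each sequence gets a leading '^' so that it is
    treated as an anchored 5' adapter.
    """
    if not 1 <= min_length <= len(adapter):
        raise ValueError("min_length must be between 1 and len(adapter)")
    records = []
    # Longest variant first, numbered from 1.
    for i, length in enumerate(range(len(adapter), min_length - 1, -1), start=1):
        records.append(f">{name}-{i}\n^{adapter[:length]}\n")
    return "".join(records)


# Prints the five records ADAPTER-1 (^ADAPTER) through ADAPTER-5 (^ADA).
print(truncated_adapter_fasta("ADAPTER", min_length=3), end="")
```

If the output is saved as adapters.fasta (a file name assumed here), it can be passed to cutadapt with -g file:adapters.fasta.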
Long story short: it worked, but a) it feels wrong and b) I don't like the fact that I need to use --no-indels.
So, my questions are:
Am I missing something? And if not, do you think it would be difficult to include this kind of matching in cutadapt? Thanks!
Yeah, this is actually the only way to do this at the moment. Note that you can omit --no-indels and simply ignore the warning; there shouldn’t be any downsides to it in this case. Also, although it feels inefficient to provide multiple sequences like this, it shouldn’t be that bad because Cutadapt creates an index if you search for multiple anchored 5' adapters, so it doesn’t have to search for the adapters individually.
One reason this feature isn’t implemented is that no one has asked for it, as far as I can remember. But it also runs counter to a basic assumption that I made, which is that an adapter in principle always occurs in full; we only fail to see all of it because the read doesn’t extend far enough.
I’m happy to leave this issue open so that others who would also be interested can add their vote, but realistically, this isn’t going to happen for a while. It isn’t that difficult to implement algorithmically; the hardest part for me is coming up with the user interface. For example, I don’t know whether this would be a new command-line option or whether I’d have to add some syntax to the way adapters are specified.
Thanks @marcelm for your work and very fast response. As a fellow developer of OSS software, I know how difficult it can be to find the time to answer all the questions.
Regarding API: I think that a modifier at the modifiable end of the ADAPTER would work. e.g. ADAPTER# for 5' adapters and #ADAPTER for 3' adapters.
Regarding usage: I'm using cutadapt to remove a known part of a fusion in order to keep the part that it was fused to. And while the fused part should be of a certain length, the reagents involved are not 100% accurate which is why I need the flexibility.
Regarding --no-indels: If it were just a simple warning, I'd probably not add the --no-indels argument...but
- The warning was quite severe:
WARNING: The adapters are too similar. When creating the index, 346687 ambiguous sequences were found that cannot be assigned uniquely.
WARNING: For example, '<REDACTED>', when found in a read, would result in 24 matches for both ADAPTER-61 '<REDACTED>' and ADAPTER-62 '<REDACTED>'
WARNING: Reads with ambiguous sequence will *not* be trimmed.
and
- it was immediately followed by an exception:
Traceback (most recent call last):
File "/Users/yossifarjoun/micromamba/envs/cutadapt/bin/cutadapt", line 8, in <module>
sys.exit(main_cli())
^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/cutadapt/lib/python3.12/site-packages/cutadapt/cli.py", line 1148, in main_cli
main(sys.argv[1:])
File "/Users/yossifarjoun/micromamba/envs/cutadapt/lib/python3.12/site-packages/cutadapt/cli.py", line 1228, in main
pipeline = make_pipeline_from_args(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/cutadapt/lib/python3.12/site-packages/cutadapt/cli.py", line 939, in make_pipeline_from_args
modifiers.extend(
File "/Users/yossifarjoun/micromamba/envs/cutadapt/lib/python3.12/site-packages/cutadapt/cli.py", line 1081, in make_adapter_cutter
adapter_cutter2 = AdapterCutter(adapters2, times, action, allow_index)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/cutadapt/lib/python3.12/site-packages/cutadapt/modifiers.py", line 112, in __init__
self._regroup_into_indexed_adapters(adapters)
File "/Users/yossifarjoun/micromamba/envs/cutadapt/lib/python3.12/site-packages/cutadapt/modifiers.py", line 132, in _regroup_into_indexed_adapters
result.append(IndexedPrefixAdapters(prefix))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/cutadapt/lib/python3.12/site-packages/cutadapt/adapters.py", line 1503, in __init__
self._index = AdapterIndex(adapters, prefix=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/cutadapt/lib/python3.12/site-packages/cutadapt/adapters.py", line 1258, in __init__
self._lengths, self._index, self._ambiguous = self._make_index()
^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/cutadapt/lib/python3.12/site-packages/cutadapt/adapters.py", line 1412, in _make_index
del index[s]
~~~~~^^^
KeyError: '<REDACTED>'
ps. sorry for the redactions. I cannot share the details of the sequence that I'm trimming.
Maybe helpful as a start: The KeyError crash is fixed in Cutadapt 5.1, which I just released.
Thanks! I'll give it a try.
The KeyError crash is indeed fixed for me, even in version 5.0! Thanks!!