
Unexpected input file name changes output file name format in split_on_adapter

Open groodri opened this issue 3 years ago • 1 comments

When an input FASTX file name includes a dot (.) that is not part of the file extension (for example, testfile.1.fastq.gz), split_on_adapter treats .1.fastq.gz as the whole suffix instead of .fastq.gz. As a result, the output file is named testfile.fastq.gz instead of testfile.1_split.fastq.gz. This can break downstream steps in pipelines, because the output file name is not as expected when new naming schemes are introduced.

This is due to lines 123-126 in split_on_adapter.py.

For example:

>>> from natsort import natsorted
>>> from pathlib import Path
>>> fastxs = natsorted(list(Path('.').rglob('*.fastq*')), key=str)
>>> fastx = fastxs[2]
>>> fastx.name
'testfile.1.fastq.gz'
>>> fastx.with_name(fastx.name.replace('.fastq', '').replace('.gz', '') + '_split').with_suffix('.fastq.gz')
PosixPath('testfile.fastq.gz')
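The result above follows from how pathlib defines a suffix: everything after the last dot in the name. A minimal sketch of just that step, using the intermediate name from the example (the variable names here are illustrative only):

```python
from pathlib import PurePosixPath

# Intermediate name after stripping '.fastq' and '.gz' and appending '_split'
intermediate = PurePosixPath("testfile.1_split")

# pathlib treats everything after the last dot as the suffix
clobbered_suffix = intermediate.suffix

# with_suffix() replaces that suffix, so '.1_split' is lost
result = intermediate.with_suffix(".fastq.gz")

# A name without an extra dot has no suffix, so the extension is simply appended
unaffected = PurePosixPath("testfile_split").with_suffix(".fastq.gz")

print(clobbered_suffix, result, unaffected)
```

This is why file names without an extra dot are unaffected, while testfile.1.fastq.gz loses both the ".1" and the "_split".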

This can be solved by appending the extension directly instead of calling with_suffix:

>>> fastx.with_name(fastx.name.replace('.fastq', '').replace('.gz', '') + '_split.fastq.gz')
PosixPath('testfile.1_split.fastq.gz')

Essentially, the current code overwrites its own addition of '_split' when an unexpected "suffix" occurs. Accounting for these unexpected suffixes with the --pattern flag can be quite difficult (what pattern would work for this case, assuming the folder also contains files named .2.fastq.gz, ..., .600.fastq.gz?), so this seems a pertinent change.
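For pipelines that need to predict the output name, the same idea can be written as a small helper. This is only a sketch: `split_output_path` and `KNOWN_EXTENSIONS` are hypothetical names, not part of duplex-tools, and the helper assumes the output should keep the input's own extension rather than always using .fastq.gz.

```python
from pathlib import PurePosixPath

# Hypothetical list of extensions to recognise; not taken from duplex-tools
KNOWN_EXTENSIONS = (
    ".fastq.gz", ".fq.gz", ".fasta.gz", ".fa.gz",
    ".fastq", ".fq", ".fasta", ".fa",
)


def split_output_path(path, tag="_split"):
    """Insert `tag` before a known FASTX extension, keeping other dots intact."""
    path = PurePosixPath(path)
    for ext in KNOWN_EXTENSIONS:
        if path.name.endswith(ext):
            base = path.name[: -len(ext)]
            # Append the extension as plain text so no earlier dot is
            # mistaken for a suffix and overwritten
            return path.with_name(base + tag + ext)
    # Unknown extension: append the tag without guessing at a suffix
    return path.with_name(path.name + tag)


for name in ("testfile.1.fastq.gz", "data/testfile.600.fastq.gz", "reads.fastq"):
    print(split_output_path(name))
```

Because the extension is matched from the end of the name, files numbered .1 through .600 are all handled the same way without a custom pattern.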

File names that include non-suffix dots can occur for a variety of reasons, for example when FASTQ files are split into multiple files of N reads each for better memory management.

groodri, Jan 12 '23 10:01

Thanks @groodri, I would agree with you that it would be sensible to fix this.

In the meantime, if you need an immediate workaround (and for the benefit of other people who may need a fix), feel free to rename the files as shown below:

rename "s/testfile\./testfile_/" *.fastq.gz

onordesjo, Jan 12 '23 11:01