Feature request: control info-file column separator
I use the info-file quite often, either with awk one-liners or by importing the file into R. I have been using it with Illumina data, with great success. Now I am trying to use it with Nanopore data, but there is a catch: the fields in the Nanopore FASTQ header are also separated by tabs, just like the columns of the info-file. This makes parsing these files more troublesome. Would it be possible, in a future version, to include a way of choosing the column separator character for the info-file?
- Cutadapt v4.6 and Python version 3.10.3
- Installed with: miniconda
Interesting. Where do these tab characters come from? Looking at some of the Nanopore data that I have here, I don’t see any.
If possible, I try to avoid adding options to Cutadapt if I can just make the default behavior better. I’m wondering whether an alternative would be to replace all tab characters in the read header with a space character. I consider it a bug that this isn’t done at the moment because the output is an invalid TSV otherwise, as you found out.
samtools fastq with the -T flag generates tab-delimited fields.
Would quoting the name be an option?
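To illustrate what quoting could look like, here is a minimal sketch using Python's standard `csv` module. It is not how Cutadapt writes the info file; the helper names `write_info_row` and `read_info_row` and the example row are hypothetical. With `QUOTE_MINIMAL`, only fields that contain the delimiter (or a quote character) get wrapped in double quotes.

```python
import csv
import io


def write_info_row(fields):
    """Serialize one row as TSV, quoting only the fields that need it."""
    buf = io.StringIO()
    writer = csv.writer(
        buf, delimiter="\t", quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
    )
    writer.writerow(fields)
    return buf.getvalue()


def read_info_row(line):
    """Parse one quoted TSV line back into its fields."""
    return next(csv.reader(io.StringIO(line), delimiter="\t"))


# Hypothetical row: a read name with embedded tabs, then two more columns.
name = "read1\tst:Z:2024-10-17\tDS:Z:gpu:NVIDIA GeForce"
row = [name, "-1", "ACGT"]

line = write_info_row(row)
# The name survives the round trip, tabs included.
assert read_info_row(line) == row
```

A caveat with this approach: a quote-aware parser is needed on the reading side. A plain `awk -F'\t'` one-liner splits on every tab regardless of quotes, so quoting alone would not help that use case.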
@marcelm Here I copy the first line of a FASTQ file generated with the dorado basecaller v0.7 using the --emit-fastq option (so it is not a BAM file from which I later extracted the FASTQ). I was worried that replacing the tabs with spaces would take a lot of time, but it might be the best approach. I also realized that any other column separator character I can think of is also a valid quality-score character, so a different separator would cause even more unpredictable parsing issues. good_length.txt
Quoting from the read header you attached:
@54c591fa-c560-405b-bc82-b3cd603b84fc st:Z:2024-10-17T01:47:46.145+00:00 RG:Z:bf71225953a48fb33134df81086f6e5d64deeca6_dna_r10.4.1_e8.2_400bps_sup@v5.0.0 DS:Z:gpu:NVIDIA GeForce RTX 3070 Laptop GPU
This is apparently intended to be used by a read mapper to be added to its SAM output, such as with BWA-MEM’s -C option:
-C append FASTA/FASTQ comment to SAM output
This is kind of the inverse of samtools fastq -T that Ruben mentioned.
So if you want to let Cutadapt output an info file, manipulate the info file and then write back a FASTQ file that would still be usable in this way, then something needs to be done to the tabs that is reversible. Just replacing them with spaces won’t work because then they cannot be distinguished from spaces. Even in your example, there’s already a value NVIDIA GeForce RTX 3070 Laptop GPU that contains spaces.
I’m not sure what is best here. Maybe replace tab with backslash plus t ("\t")?
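A minimal sketch of such a reversible escaping scheme, under the assumption that backslashes are escaped as well (otherwise a header that already contains a literal backslash followed by `t` could not be told apart from an escaped tab). The function names `escape_name` and `unescape_name` are hypothetical and not part of Cutadapt.

```python
def escape_name(name: str) -> str:
    """Make a read name safe for a TSV column.

    Backslashes are doubled first, then each tab becomes the two
    characters backslash + "t".
    """
    return name.replace("\\", "\\\\").replace("\t", "\\t")


def unescape_name(escaped: str) -> str:
    """Invert escape_name by scanning left to right."""
    out = []
    i = 0
    while i < len(escaped):
        ch = escaped[i]
        if ch == "\\" and i + 1 < len(escaped):
            nxt = escaped[i + 1]
            # "\t" (two characters) becomes a tab; "\\" becomes one backslash.
            out.append("\t" if nxt == "t" else nxt)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


header = "read1\tDS:Z:gpu:NVIDIA GeForce RTX 3070 Laptop GPU"
escaped = escape_name(header)
assert "\t" not in escaped               # safe to store in a TSV column
assert unescape_name(escaped) == header  # and fully reversible
```

Existing spaces are left untouched, so a value such as `NVIDIA GeForce RTX 3070 Laptop GPU` stays distinguishable from the field boundaries.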
Thanks for your reply. I ended up replacing all tabs with spaces before feeding the FASTQ files to Cutadapt. Even though that cannot be reversed (if I needed to restore the headers to their original state), it will work for the rest of the pipeline.
Many thanks,
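For readers who want the same workaround, here is a minimal sketch of the preprocessing step. It is not the poster's actual command; it assumes strict four-line FASTQ records, and the function name `tabs_to_spaces_in_headers` is hypothetical.

```python
import io


def tabs_to_spaces_in_headers(lines):
    """Yield FASTQ lines with tabs in header lines replaced by spaces.

    Assumes strict four-line records, so every line whose zero-based
    index is a multiple of 4 is a header line.
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:
            line = line.replace("\t", " ")
        yield line


# Small in-memory demonstration; a real run would iterate over an open file.
fastq = io.StringIO("@read1\tst:Z:x\tDS:Z:gpu:NVIDIA GeForce\nACGT\n+\nIIII\n")
cleaned = "".join(tabs_to_spaces_in_headers(fastq))
assert "\t" not in cleaned
```

As noted in the thread, this is a one-way transformation: afterwards the former tabs cannot be distinguished from spaces that were already in the header.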