Support accepting stdin instead of a specific filepath for single ended data
A common use I have for hostile is to clean up and concatenate a directory of FASTQs prior to moving them off the instrument. It would be great if I could do all of this with a single one-liner such as:
cat *.fastq.gz | hostile clean --fastq1 - > combined_clean.fastq.gz
This would make my life easier. As it stands, hostile works absolutely fine, since I can just concatenate and then run the hostile clean command, but doing it all in a more pipe-centric way would be a nice-to-have!
Hey Sam, thanks for opening this. Agree it would be nice to have. When I get time I'll investigate whether it's straightforward and how much it improves performance. Can imagine it helping a fair bit on cheap cloud infra with slow block storage.
Hey Sam, sorry for the delay. I've just had a look and am deciding whether to implement it. Streaming the reads through the Python interpreter adds some overhead, but not as much as I expected, and it could be viable.
Firstly I'm planning to implement https://github.com/bede/hostile/issues/39 which should dramatically speed up the processing of one sample after another with Minimap2.
In the meantime, however, you could use the following hack to do what you need:
- Inside Hostile's container / conda env, run hostile clean as you would for a single file, but include the --debug flag:
$ hostile clean --debug --fastq1 tests/data/tuberculosis_1_1.fastq.gz --debug
20:30:20 DEBUG: clean_fastqs() threads=6
20:30:20 INFO: Hostile version 1.1.0. Mode: long read (Minimap2)
20:30:20 INFO: Found cached standard index human-t2t-hla
20:30:20 DEBUG: backend_cmds=["minimap2 -ax map-ont -m 40 --secondary no -t 6 '/Users/bede.constantinides/Library/Application Support/hostile/human-t2t-hla.fa.gz' '/Users/bede.constantinides/Research/Git/hostile/tests/data/tuberculosis_1_1.fastq.gz' | tee >(samtools view -F 2304 -c - > '/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.reads_in.txt') | samtools view -f 4 - | tee >(samtools view -F 2304 -c - > '/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.reads_out.txt') | samtools fastq --threads 4 -c 6 -0 '/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.clean.fastq.gz'"]
20:30:20 INFO: Cleaning…
20:30:59 DEBUG: path=PosixPath('/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.reads_in.txt') count=1
20:30:59 DEBUG: path=PosixPath('/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.reads_out.txt') count=1
20:30:59 INFO: Cleaning complete
[
{
"version": "1.1.0",
"aligner": "minimap2",
"index": "human-t2t-hla",
"options": [],
"fastq1_in_name": "tuberculosis_1_1.fastq.gz",
"fastq1_in_path": "/Users/bede.constantinides/Research/Git/hostile/tests/data/tuberculosis_1_1.fastq.gz",
"fastq1_out_name": "tuberculosis_1_1.clean.fastq.gz",
"fastq1_out_path": "/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.clean.fastq.gz",
"reads_in": 1,
"reads_out": 1,
"reads_removed": 0,
"reads_removed_proportion": 0.0
}
]
- Extract the bash pipeline from the stderr line containing DEBUG: backend_cmds=:
minimap2 -ax map-ont -m 40 --secondary no -t 6 '/Users/bede.constantinides/Library/Application Support/hostile/human-t2t-hla.fa.gz' '/Users/bede.constantinides/Research/Git/hostile/tests/data/tuberculosis_1_1.fastq.gz' | tee >(samtools view -F 2304 -c - > '/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.reads_in.txt') | samtools view -f 4 - | tee >(samtools view -F 2304 -c - > '/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.reads_out.txt') | samtools fastq --threads 4 -c 6 -0 '/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.clean.fastq.gz'
- Replace the input FASTQ path with - and pipe your FASTQs into minimap2's stdin:
cat *.fastq.gz | minimap2 -ax map-ont -m 40 --secondary no -t 6 '/Users/bede.constantinides/Library/Application Support/hostile/human-t2t-hla.fa.gz' - |…
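If it helps, the last two steps could be sketched as a pair of small shell helpers. This is only a rough sketch: it assumes the debug line keeps the format shown above (a single pipeline wrapped in [" and "]) and that the input path appears single-quoted inside it. The function names extract_backend_cmd and use_stdin_input are made up for illustration and are not part of Hostile.

```shell
# Print the pipeline found on the "DEBUG: backend_cmds=" line of a saved
# `hostile clean --debug` stderr log, read from stdin.
# Assumes the line ends with "] as in the log shown above.
extract_backend_cmd() {
    sed -n 's/^.*DEBUG: backend_cmds=\["\(.*\)"\]$/\1/p'
}

# Replace the first occurrence of the single-quoted input path with "-".
#   $1: pipeline string
#   $2: input FASTQ path exactly as it appears in the pipeline
# The values are passed through the environment so awk treats them literally.
use_stdin_input() {
    CMD="$1" TARGET="'$2'" awk 'BEGIN {
        cmd = ENVIRON["CMD"]
        target = ENVIRON["TARGET"]
        i = index(cmd, target)
        if (i == 0) { print cmd; exit }   # path not found: print unchanged
        print substr(cmd, 1, i - 1) "-" substr(cmd, i + length(target))
    }'
}

# Possible usage (not run here; the paths are placeholders):
#   hostile clean --debug --fastq1 /path/to/sample.fastq.gz 2> debug.log
#   cmd=$(extract_backend_cmd < debug.log)
#   use_stdin_input "$cmd" /path/to/sample.fastq.gz
```

It is probably safest to print and check the rewritten pipeline before running it. Note that it uses process substitution, so it needs bash, and the output and read-count paths are still the ones derived from the original sample name.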
This will be released in v2.0.0 soon
Released in 2.0.0