Support accepting stdin instead of a specific filepath for single ended data

Open BioWilko opened this issue 1 year ago • 2 comments

A common use I have for hostile is to clean up and concatenate a directory of FASTQs prior to moving them off the instrument. It would be great if I could do all of this with a single one-liner such as:

cat *.fastq.gz | hostile clean --fastq1 - > combined_clean.fastq.gz

This would make my life easier. As it stands, hostile works absolutely fine, since I can just concatenate and then run the hostile clean command, but doing it all in a more pipe-centric way would be a nice-to-have!
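
For reference, the current two-step approach described above might look like the following sketch. The function name `concat_fastqs` and the paths `run_dir` and `combined.fastq.gz` are illustrative and not part of hostile. It relies on the fact that concatenated gzip files form a valid multi-member gzip stream, so nothing needs to be decompressed first.

```shell
# Concatenate every gzipped FASTQ in a directory into a single file.
# $1: directory containing *.fastq.gz files
# $2: output path (keep it outside $1 so a rerun does not pick it up again)
concat_fastqs() {
  cat "$1"/*.fastq.gz > "$2"
}

# Then clean the combined file as usual (requires hostile on PATH):
#   concat_fastqs run_dir combined.fastq.gz
#   hostile clean --fastq1 combined.fastq.gz
```

The intermediate combined file is the extra write to disk that the stdin request aims to avoid.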

BioWilko avatar Apr 22 '24 12:04 BioWilko

Hey Sam, thanks for opening this. Agree it would be nice to have. When I get time I'll investigate whether it's straightforward and how much it improves performance. Can imagine it helping a fair bit on cheap cloud infra with slow block storage.

bede avatar Apr 22 '24 16:04 bede

Hey Sam, sorry for the delay. I've just had a look and am deciding whether to implement it. Streaming the reads through the Python interpreter adds some overhead, but not as much as I expected, and it could be viable.

Firstly I'm planning to implement https://github.com/bede/hostile/issues/39 which should dramatically speed up the processing of one sample after another with Minimap2.

In the meantime however, you could use the following hack to do what you need:

  1. Inside Hostile's container / conda env, run hostile clean as you would for a single file but include the --debug flag:
$ hostile clean --debug --fastq1 tests/data/tuberculosis_1_1.fastq.gz
20:30:20 DEBUG: clean_fastqs() threads=6
20:30:20 INFO: Hostile version 1.1.0. Mode: long read (Minimap2)
20:30:20 INFO: Found cached standard index human-t2t-hla
20:30:20 DEBUG: backend_cmds=["minimap2 -ax map-ont -m 40 --secondary no -t 6  '/Users/bede.constantinides/Library/Application Support/hostile/human-t2t-hla.fa.gz' '/Users/bede.constantinides/Research/Git/hostile/tests/data/tuberculosis_1_1.fastq.gz' | tee >(samtools view -F 2304 -c - > '/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.reads_in.txt') | samtools view -f 4 - | tee >(samtools view -F 2304 -c - > '/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.reads_out.txt') | samtools fastq --threads 4 -c 6 -0 '/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.clean.fastq.gz'"]
20:30:20 INFO: Cleaning…
20:30:59 DEBUG: path=PosixPath('/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.reads_in.txt') count=1
20:30:59 DEBUG: path=PosixPath('/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.reads_out.txt') count=1
20:30:59 INFO: Cleaning complete
[
  {
      "version": "1.1.0",
      "aligner": "minimap2",
      "index": "human-t2t-hla",
      "options": [],
      "fastq1_in_name": "tuberculosis_1_1.fastq.gz",
      "fastq1_in_path": "/Users/bede.constantinides/Research/Git/hostile/tests/data/tuberculosis_1_1.fastq.gz",
      "fastq1_out_name": "tuberculosis_1_1.clean.fastq.gz",
      "fastq1_out_path": "/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.clean.fastq.gz",
      "reads_in": 1,
      "reads_out": 1,
      "reads_removed": 0,
      "reads_removed_proportion": 0.0
  }
]
  2. Extract the bash pipeline from the stderr line with DEBUG: backend_cmds=:
minimap2 -ax map-ont -m 40 --secondary no -t 6  '/Users/bede.constantinides/Library/Application Support/hostile/human-t2t-hla.fa.gz' '/Users/bede.constantinides/Research/Git/hostile/tests/data/tuberculosis_1_1.fastq.gz' | tee >(samtools view -F 2304 -c - > '/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.reads_in.txt') | samtools view -f 4 - | tee >(samtools view -F 2304 -c - > '/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.reads_out.txt') | samtools fastq --threads 4 -c 6 -0 '/Users/bede.constantinides/Research/Git/hostile/tuberculosis_1_1.clean.fastq.gz'
  3. Replace the input FASTQ path with - and pipe your FASTQs into minimap2 stdin:
cat *.fastq.gz | minimap2 -ax map-ont -m 40 --secondary no -t 6  '/Users/bede.constantinides/Library/Application Support/hostile/human-t2t-hla.fa.gz' - |…
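
The extraction and substitution steps above could be scripted roughly as follows. This is a sketch under assumptions, not part of hostile: the helper names `extract_backend_cmd` and `to_stdin_pipeline` and the file name `hostile.log` are made up for illustration, and the `sed` pattern assumes the log line holds a single double-quoted command, as in the long read example above.

```shell
# Extraction sketch: print the pipeline found in a saved stderr log (read on stdin).
# Assumes the single-command, double-quoted list format shown in the log above.
extract_backend_cmd() {
  sed -n 's/^.*DEBUG: backend_cmds=\["\(.*\)"]$/\1/p'
}

# Substitution sketch: replace the first single-quoted occurrence of the
# input path with "-".
# $1: pipeline string; $2: input FASTQ path exactly as logged, without quotes.
# Prints nothing and returns non-zero if the path is not found.
to_stdin_pipeline() {
  PIPELINE="$1" QUOTED_PATH="'$2'" awk 'BEGIN {
    cmd = ENVIRON["PIPELINE"]; path = ENVIRON["QUOTED_PATH"]
    i = index(cmd, path)
    if (i == 0) exit 1
    print substr(cmd, 1, i - 1) "-" substr(cmd, i + length(path))
  }'
}

# Example usage (hostile.log and the sample path are placeholders):
#   hostile clean --debug --fastq1 sample.fastq.gz 2> hostile.log
#   cmd=$(extract_backend_cmd < hostile.log)
#   stdin_cmd=$(to_stdin_pipeline "$cmd" "/full/path/to/sample.fastq.gz") || exit 1
#   cat *.fastq.gz | bash -c "$stdin_cmd"
```

The pipeline uses `tee >(...)`, which is bash process substitution, so the resulting string needs to be run with bash rather than plain sh. The output and read-count file names inside the pipeline still refer to the original sample and may need editing by hand.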

bede avatar Jul 16 '24 19:07 bede

This will be released in v2.0.0 soon

bede avatar Dec 16 '24 18:12 bede

Released in 2.0.0

bede avatar Dec 19 '24 17:12 bede