hostile Metatranscriptome performance

Hey @bede,

Thanks for great tool! I've included it in my pipeline for host-microbiome dual transcriptomics, and our lab has been using it on a variety of projects.

I've been finding that it performs well for metagenomic-like simulated datasets, but for metatranscriptome-like simulated datasets, not all host reads are being removed.

I wonder if this is due to alternative splicing, and if inclusion of a splice-aware aligner as one of the options could address this. What do you think? Could you think of ways to use a different reference to avoid needing to add a splice-aware aligner?

Details on my simulated data can be found here. To summarize the results:

If I naively simulate genomic-like community data with human reads from the human pangenome project, then run a linear regression on percent_removed_reads ~ true_percent_host:
- $\beta$ = 0.999 (so almost flawless performance)
If I simulate transcriptome-like community data using Polyester, where human reads are coming from the rna.fna.gz file found here, then run a linear regression on percent_removed_reads ~ true_percent_host:
- $\beta$ = 0.856 (15% of host reads are missed)
If I create semisynthetic communities, where real RNA-seq reads from bacterial isolates + human colon chip samples are subsampled and combined in known quantities,, then run a linear regression on percent_removed_reads ~ true_percent_host:
- $\beta$ = 0.932 (7% of host reads are missed)

Thanks!

Feb 27 '25 21:02 sterrettJD

Hi John, Thanks for your feedback and suggestion! I really appreciate you sharing your results and methods like this. As you've observed, this approach was not intended for metatranscriptomic sequences, and your results are in line with expectation.

I see that you are simulating 150bp reads. I worry that 150bp reads lack the information content to reliably yield spliced alignments in a large repetitive genome like human. But I'd be interested to see how HISAT2 performs in practice, and whether it offers a worthwhile improvement over Hostile in Bowtie2 mode (which is what you should be using). Thanks for linking to rna.fna.gz – I'll look into HISAT2 and share my findings here.

On a side note, I hope to release an improved alternative human index for Hostile soon once benchmarking is complete.

Thanks

Mar 03 '25 10:03 bede

Thanks @bede !

To avoid adding another aligner, what do you think about the idea of creating a larger reference for host-rich metatranscriptomic data including that includes T2T + HLA + GRCh38 RNA?

Theoretically, that would increase the size of the index maybe 10-20%. It would allow for mapping with a non-splice aware aligner, and it would include:

all of the variation encompassed by the current database, (which could also be useful for introns/unspliced pre-RNA)
the difference splice variants in the GRCh38 reference transcriptome

If you think this is a viable option, I can create a version of that index and try it with my data. I'd also be interested to try it with your new alternate index.

Thanks, John

Mar 03 '25 23:03 sterrettJD

To avoid adding another aligner, what do you think about the idea of creating a larger reference for host-rich metatranscriptomic data including that includes T2T + HLA + GRCh38 RNA?

This should help and it would be interesting to see how much – please let me know if you try. I plan to release an improved index soon including at least some of GRCh38. I'll add a combination of CHM13v2.0 and GRCh38 alts to my benchmark – I've recently tested several composite references against real genomic data from diverse humans, albeit mainly long reads. Counterintuitively, making large redundant human references can often lead to fewer—not more—reads mapping due to repetitive k-mer masking behaviour that requires tuning out of read aligners. There is certainly in the order of ~0.1% additional sensitivity to be found here, and hopefully more in your case?

Mar 03 '25 23:03 bede

Hey @bede,

I didn't know that CHM13v2.0 has an RNA fasta (I thought it was just DNA), so I incorporated this by rebuilding the human-t2t-hla using the T2T RNA instead of DNA fasta from their database.

Quite unsurprisingly, this increases hostile's performance in simulated metatrascriptome/dual transcriptome data (but not metagenomic-like data). Here are the updates to my results from the previous comment:

If I naively simulate genomic-like community data with human reads from the human pangenome project, then run a linear regression on percent_removed_reads ~ true_percent_host:
- DNA index: β = 0.999 (almost flawless performance)
- RNA index: β = 0.104 (very bad performance - makes sense, since this is essentially genomic data getting filtered against the RNA index)
If I simulate transcriptome-like community data using Polyester, where human reads are coming from the rna.fna.gz file found here, then run a linear regression on percent_removed_reads ~ true_percent_host:
- DNA index: β = 0.856 (15% of host reads are missed)
- RNA index β = 0.976 (2.4% of host reads are missed)
If I create semisynthetic communities, where real RNA-seq reads from bacterial isolates + human colon chip samples are subsampled and combined in known quantities,, then run a linear regression on percent_removed_reads ~ true_percent_host:
- DNA index: β = 0.932 (7% of host reads are missed)
- RNA index: β = 0.965 (3.5% of host reads are missed)

Notably, the DNA index used here was masked against the argos985, but the RNA index was not masked. The README says that only 0.010% of the positions are masked by this, so I don't imagine it will make up for the ~12% increase in host read removal from using the RNA index.

I'm a bit busy this week, but I'll continue comparing this to:

RNA index masked against the argos985 genomes
using a splice aware aligner with the existing DNA indexes

I can keep you posted on how well those work, please let me know if you have any thoughts in the meantime. Based on this, do you think it's worth including default indexes built from GRCh/T2T RNA.fa in addition to the DNA.fa files?

Mar 17 '25 18:03 sterrettJD

Hi John, Thanks for sharing your data, I'm very interested albeit busy with a new job. If you are happy to do so, please keep me posted until I look into this properly.

Best, Bede

Apr 01 '25 11:04 bede

Hey @bede,

I've implemented HISAT2 filtering and testing into my fork of hostile. I just wrapped up a more busy phase of work, but now I'll be able to benchmark it on the host-rich metatranscriptome data. Would you like me to open a pull request so you could look over the implementation, or would you prefer for that to wait until after I've completed the benchmarks?

Thanks!

Apr 14 '25 14:04 sterrettJD

Hey @bede,

Overview

Just wanted to share the HISAT2 benchmarking results. As suspected, using HISAT2 on host-rich simulated metatranscriptome samples using the DNA T2T + HLA index performs better than Bowtie2 on either the DNA or RNA indexes.

There are a few discrepancies in my methods (listed below), but I think the patterns still hold. I wanted to go ahead and share the results justifying the inclusion of a splice-aware aligner such as HISAT2, so I can open my PR with HISAT2 included.

Do you agree that adding HISAT2 for metatranscriptomics host decontamination is justified based on these results? If so, I'll open a PR with the HISAT2 implementation and tests.

Notes

There are a couple of different parameters at play here, mostly because this has been a piecemeal process of slowly adding more options. I've listed them here:

Original DNA benchmarking was done using the argos985 masked index. I can add in masking on the other ones if you think it's worth it.
The latest comparison of bowtie2 vs HISAT2 used raw simulated samples without read QC (trimmomatic). This is denoted below using "TRIMMED" vs "NOT TRIMMED" at the end of each line. This didn't affect the purely synthetic samples, but it seemed to have a small effect on the percent reads removed in semisynthetic samples. I suspect that performance is better on the trimmed samples, so I'll be doing that to double check it all.

Results

If I simulate transcriptome-like community data using Polyester, where human reads are coming from the rna.fna.gz file found here, then run a linear regression on percent_removed_reads ~ true_percent_host:
- Hostile using bowtie2 with argos985 masked DNA index: β = 0.856 (15% of host reads are missed) (TRIMMED)
- Hostile using bowtie2 with unmasked RNA index: β = 0.976 (2.4% of host reads are missed) (TRIMMED)
- Hostile using bowtie2 with unmasked RNA index: β = 0.976 (2.4% of host reads are missed) (NOT TRIMMED)
- Hostile using HISAT2 (implemented on my fork) with unmasked DNA index: β = 1.003 (0.3% too many reads removed) (NOT TRIMMED)
If I create semisynthetic communities, where real RNA-seq reads from bacterial isolates + human colon chip samples are subsampled and combined in known quantities, then run a linear regression on percent_removed_reads ~ true_percent_host:
- Hostile using bowtie2 with argos985 masked DNA index: β = 0.932 (7% of host reads are missed) (TRIMMED)
- Hostile using bowtie2 with unmasked RNA index: β = 0.965 (3.5% of host reads are missed) (TRIMMED)
- Hostile using bowtie2 with unmasked RNA index: β = 0.947 (5.3% of host reads are missed) (NOT TRIMMED)
- Hostile using HISAT2 (implemented on my fork) with unmasked DNA index: β = 0.975 (2.5% of host reads are missed) (NOT TRIMMED)

You can see that from this image that HISAT2 removes a more accurate number of reads, as the HISAT2 slope is closer to 1 than the Bowtie2 slope. Bowtie2 shown here is using the RNA index. (Focus on synthetic and semisynthetic transcriptomes. "synthetic communities" is the metagenomic-like data mentioned in previous comments)

Regression summaries from the latest benchmarks included as a csv:

hostile_alt_lms.csv

The benchmarking workflow for HISAT2 can be found here

Apr 16 '25 18:04 sterrettJD

Hi John, This is great work with promising results, thanks for sharing. Well done for modifying Hostile to use HISAT2: Hostile was only briefly intended to accommodate many alignment backends, so the internals you've dealt with are not pretty – apologies!

I'm reluctant to endorse a major PR like this without comprehensively benchmarking it myself, in particular to understand its precision against large collections of simulated viral genomes for short and long reads. There are also the considerations of reference genome hosting, automatic index construction etc. I am happy to discuss further (here or over email, call etc if you prefer?), but I do need to prioritise other (inc. related) work before spending the time to consider this as a PR. A fork may serve you better in the short term, but happy to discuss.

A question that springs to mind for spliced alignment is whether have you have already tried selecting --local or --sensitive-local Bowtie2 modes in Hostile? While I advise against this in the general case, I've written instructions for doing this here: https://github.com/bede/hostile?tab=readme-ov-file#alignment-parameters

Bede

Apr 18 '25 13:04 bede

Hey @bede,

The internals were actually pretty friendly to work with!

I'd love to chat about more extensive benchmarking. If you have a more comprehensive benchmarking pipeline that you've used, I'd be happy to incorporate HISAT2 into that.

I can use my forked version in the meantime, but incorporating HISAT2 into a more official version of Hostile would be great for ease of use in my pipelines, so I'm happy to do a bit of the work to benchmark/test that.

I just emailed you at the email linked in the readme (from @protonmail.com).

Thanks! John

Apr 18 '25 19:04 sterrettJD