Add 2FAST2Q as an alternative to the MAGeCK count / Bowtie2 combo on the screening workflow
Description of feature
Hi,
I think it would be interesting to offer an alternative to the current MAGeCK count or Bowtie2 approach (step 2 of the screening workflow).
Let me introduce 2FAST2Q. Integrating this tool, which is now also available as an nf-core module (https://nf-co.re/modules/fast2q/), would extend the current pipeline so that it can:
- Accept reads with multiple features (such as 2 or more sgRNAs per read), and count these events.
- Search directly for features that are delimited by given upstream/downstream sequences. Building on this, optionally extract and count any newly found sequences without needing a library file.
- Directly configure and filter on sequence alignment mismatches (Hamming distance only) and on sequence Phred scores.
2FAST2Q would essentially act as the provider of the read count matrix file.
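To make the features above concrete, here is a minimal Python sketch of flank-delimited extraction combined with Hamming-distance matching. It is not 2FAST2Q's actual implementation: the function names, the flank sequences, and the rule of discarding ambiguous matches are all assumptions made for illustration.

```python
import re


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def extract_features(read: str, upstream: str, downstream: str) -> list:
    """Return every sequence found between the two flanks in one read."""
    pattern = re.escape(upstream) + r"([ACGTN]+?)" + re.escape(downstream)
    return re.findall(pattern, read)


def count_features(reads, library, upstream, downstream, max_mismatches=1):
    """Count library features over all reads, tolerating a few mismatches.

    `library` maps a (hypothetical) sgRNA name to its sequence.
    """
    counts = {name: 0 for name in library}
    for read in reads:
        # A single read may carry several features, each counted separately.
        for feature in extract_features(read, upstream, downstream):
            hits = [
                name
                for name, seq in library.items()
                if len(seq) == len(feature)
                and hamming(seq, feature) <= max_mismatches
            ]
            # Assumption: unmatched or ambiguous features are simply skipped.
            if len(hits) == 1:
                counts[hits[0]] += 1
    return counts
```

Calling `count_features(reads, library, "CACCG", "GTTTT", max_mismatches=1)` on reads that contain two flanked guides would increment two counters for the same read, which is the "multiple features per read" case.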
Some considerations would be:
- The output is in raw (unnormalized) read counts. CRISPRseq's downstream steps require normalized read counts.
- The output contains only sgRNA names (rows) and their counts per FASTQ file (columns). CRISPRseq's current count matrix file includes the sgRNA names AND the genes they are linked to.
I would be happy to work on this if this is a proposition you agree with.
Hi @afombravo, I am tagging @LaurenceKuhl, who is the maintainer of the screening subworkflow. Also, @matbonfanti is working on other pipeline features now, so I will tag him too.
I am not an expert on screenings, but I think it's a nice idea if you are willing to implement this 🙂 We currently run Bowtie2 if the library FASTA is provided, and afterwards we run MAGeCK count. As I understand it, your module would replace both steps. I think we would need to add a new parameter to specify which approach to follow. From your comment, I understand we would also need an extra step to normalize the counts and convert the output to the required format. This looks like a fairly big restructuring of the pipeline, so let's discuss the strategy first. But in any case, I think it's a nice addition.
Hi @afombravo,
Just as a side note, I think the "count" command of MAGeCK can be run directly on count matrices, in which case it only computes summary statistics and normalizes the dataset. So I think your alternative quantification could easily fit into the standard MAGeCK workflow by including this additional normalization step.
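For reference, the kind of normalization step being discussed can be sketched in a few lines of Python. This is a generic median-ratio normalization written for illustration only; it is not claimed to reproduce MAGeCK's output, and the function name and input layout are assumptions.

```python
import math
from statistics import median


def median_ratio_normalize(counts):
    """Normalize a raw count matrix with per-sample median-ratio size factors.

    counts: {sgRNA name: [count in sample 1, count in sample 2, ...]}
    Returns (normalized counts, size factors).
    """
    n_samples = len(next(iter(counts.values())))
    # Geometric mean per sgRNA, using only guides that are nonzero everywhere.
    geo_means = {
        name: math.exp(sum(math.log(c) for c in row) / n_samples)
        for name, row in counts.items()
        if all(c > 0 for c in row)
    }
    if not geo_means:
        raise ValueError("no sgRNA has nonzero counts in every sample")
    # Size factor per sample: median ratio of its counts to the geometric means.
    size_factors = [
        median(counts[name][j] / gm for name, gm in geo_means.items())
        for j in range(n_samples)
    ]
    normalized = {
        name: [c / s for c, s in zip(row, size_factors)]
        for name, row in counts.items()
    }
    return normalized, size_factors
```

With this approach, a sample sequenced twice as deeply as another gets roughly twice the size factor, so the same guide ends up with comparable values across samples.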
Regarding the target genes of each guide, these are usually contained in the standard mageck reference. Isn't this the case for the 2FAST2Q input as well?
Hi!
If MAGeCK can just normalize the dataset that would make life easier!
Regarding the output of 2FAST2Q, the first column is always the sgRNA name taken from the library file (or an auto-generated one if no library file is used), and the remaining columns hold the read counts, one column per sample. As an example, one could include the gene name in the sgRNA name (concatenating the two) and then split them again where required downstream.
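The concatenate-then-split idea could look roughly like the Python sketch below. It assumes a tab-separated count table and a `|` separator between sgRNA and gene name; both are illustrative choices and not the documented 2FAST2Q output format, and the function name is hypothetical.

```python
import csv
import io


def add_gene_column(count_table: str, sep: str = "|") -> str:
    """Split 'sgRNA<sep>gene' names into separate sgRNA and Gene columns.

    count_table: tab-separated text whose first column holds the combined name.
    """
    reader = csv.reader(io.StringIO(count_table), delimiter="\t")
    header = next(reader)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["sgRNA", "Gene"] + header[1:])
    for row in reader:
        if not row:
            continue  # ignore blank lines
        # Split on the last separator so sgRNA names may contain it themselves.
        sgrna, _, gene = row[0].rpartition(sep)
        if not sgrna:
            raise ValueError(f"no '{sep}' separator in name: {row[0]!r}")
        writer.writerow([sgrna, gene] + row[1:])
    return out.getvalue()
```

A row such as `sg1|TP53	10	20` would then become `sg1	TP53	10	20`, giving one column for the sgRNA name and one for its gene.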
I can also move this topic to slack, if you think it will be more productive there.