alevin-fry icon indicating copy to clipboard operation
alevin-fry copied to clipboard

Questions regarding barcodes with zero reads after deduplication

Open wmacnair opened this issue 2 years ago • 26 comments

Hi

I've been reading and working through various tutorials and bits of documentation, and have found the alevinQC package very helpful for understanding what is going on with the generate-permit-list step.

I'm a bit confused about the mapping step after the permit list step. It feels like there's a missing QC step here. I'm describing my understanding of the process, both to check I've got it right and in case it's helpful to someone else.

In the (very nice) alevinQC report, we get something like this for the white list, telling us which of the barcodes are most likely "real cells" rather than empty droplets: image

These reads are then mapped, and some of them end up not mapping to any genes at all. This gives us the actual counts for each white-listed barcode, and looks like this: image

I have a couple of questions:

  • The barcodes with 0 / very few genes assigned to them are useless, right? Shouldn't we also have a step to exclude them? For example a second knee step based on number of genes detected?
  • More out of curiosity, what is going on with these barcodes?? The plot below shows that we get many barcodes whose frequencies are similar to "real" cells, and even ones with super high frequencies (e.g. the highest frequency in this sample), where the reads do not map to any genes at all. Is this expected?

Thanks! Will

image

wmacnair avatar May 12 '22 09:05 wmacnair

Hi @wmacnair,

Thanks for the detailed question (and for including the nice graphics!). I will also cc @csoneson on my reply for her — always invaluable — take.

Regarding your first question; yes. There is often a desire to perform a filtering based on the final UMI counts of each cell. Note, that the UMI counts (after deduplication) can differ, sometimes dramatically, from the read counts before deduplication. Once you have loaded the expression matrix from alevin-fry into your analysis environment (say R), then you likely want to use some package like DropletUtils to do a cell filtering. In fact, @DongzeHE has directly contributed to DropletUtils the function for applying the CellRanger-like emptyDrops filtering (originally reverse-engineered from the CellRanger implementation by Alex Dobin). Of course, the challenge is that, to keep the processing both time and memory efficient, we must determine a putative set of barcodes prior to quantification, so that we can group together all reads that belong to the same cell (corrected barcode), which is what is done in the collate step.

Regarding your second question, it's a bit less clear what's going on with the number of detected genes. One question I have is, what count values are you using for gene detection? Is it possible that some of the cells have high unspliced gene content that is being discarded if using the default scRNA output_format when loading in the expression matrices? It is worth noting that the barcode counts for unmapped reads are propagated though the analysis, and are used to provide some of the output statistics that AlevinQC eventually uses. However, they are segregated in the initial mapping step, since unmapped reads cannot be used for quantifying genes. To explain the behavior you see, I can think of a couple of possible explanations off the top of my head (there may, of course, be more or better ones), in no particular order (1) There are some highly-sequenced empty/contaminated droplets that get a high read count but result in no expressed genes, (2) These cells have a high fraction of intronic reads which are being discarded in the default input mode, leading to a large count but no detected genes. It is worth noting that after quantification, the reads mapping to genes have been collapsed based on UMI. However, for unmapped reads, the UMIs cannot be reliably collapsed (since you need to know the gene from which they derive to collapse them). This means, it makes sense to compare the counts of reads deriving from mapped and unmapped reads per barcode initially, but it doesn't really make sense to compare the collapsed UMI count of mapped reads post-quantification to the read count of unmapped reads. I'll punt to @csoneson on what, particularly, is being displayed in the plots at the bottom of your post.

Best, Rob

rob-p avatar May 12 '22 14:05 rob-p

Hi @wmacnair 🙂

I'm a bit surprised by the detected gene knee plot. I have seen other datasets where there were cells with basically no detected genes, but there were usually maybe 2-10 such cells, not hundreds as it looks like in your case 🤔 . In those data sets, I was running alevin-fry with a splici index, so intronic reads would be counted. Looking at the data in more detail (e.g., via the cbTable output of alevinQC::readAlevinFryQC()), it seems there is extreme UMI duplication in those cells - there is a fair number of UMIs mapped, but only very few unique ones (and consequently very few genes). Is that the case also for you? Maybe extracting the reads corresponding to these cell barcodes and looking where they map on the genome would give some hints of what's going on and whether there really seems to be a high degree of duplication?

As for what's being shown in the plots, alevinQC doesn't read the count matrix, but only considers the metadata. The gene knee plot displays the NumGenesExpressed column from the feature dump. The last set of plots relate the CorrectedReads, DeduplicatedReads and NumGenesExpressed columns.

csoneson avatar May 12 '22 15:05 csoneson

Hi both, thanks for your detailed responses. I need to read them carefully, so it will take me a little while to respond to the tricky bits.

On the easy bit, that makes sense regarding the knee plots.

I think I would then suggest making this more explicit in the documentation. At present, the help page for generate-permit-list says 'it determines what cell barcodes should be associated with “true” cells', and it sounds like that's not quite right. Is that fair? I think it would be helpful to add a summary of what you've said here, and also say explicitly that most users will want to run something like DropletUtils::emptyDrops in addition. I think my confusion comes from expecting it to be like cellranger, where (hope I've got this right) there is also a knee step, but after that you're in principle good to go.

I will have a look at which genes are present in the barcodes that collapse to zero. I didn't make any choices around which values are used for gene detection - these are just alevinQC default outputs (as Charlotte has described).

For some additional context, here are some notes on what I've done:

  • This is single cell PBMC data.
  • I used the most recent annotations / genomes from gencode (specifically gencode.v40.primary_assembly.annotation_annots.txt.gz and gencode.v40.primary_assembly.annotation.gtf.gz).
  • I generated a splici index from these.
  • I ran the code at the bottom for each sample.

I'm a complete noob when it comes to mapping, so there are multiple things I'm unsure about:

  • Is this the right reference? A colleague who knows more about genomes than I do suggested this, but we could be wrong.
  • There are a very large number of pseudogenes in the reference, which could be expected but surprised me. Could this explain the weird collapsing barcodes?
  • The percentage of spliced reads seems low. If I compare S and U reads, it seems to be ~30% spliced. This is similar to what I saw in single nuclei data, and is quite a bit lower than I was expecting.

Hope this helps give some clues. I'll dig around in the genes for the collapsing barcodes and let you know what I find.

Will

part of my bsub job:

## map 
salmon alevin -lISR -1 $r1_fqs -2 $r2_fqs --chromiumV3 -i $idx_f -p $n_cores \
-o outs_alevin --tgMap $t2g_f --sketch --dumpFeatures

## find proper cells
alevin-fry generate-permit-list --input outs_alevin --expected-ori fw --output-dir outs_fry -k

## group barcodes
alevin-fry collate -i outs_fry -r outs_alevin -t $n_cores

## get counts
alevin-fry quant -t $n_cores -i outs_fry -o outs_fry -m $t2g_f -r cr-like --use-mtx

wmacnair avatar May 12 '22 16:05 wmacnair

A bit of further analysis, looking at all of our samples. It seems that we consistently get ~5% of permit-list barcodes ending up with 0 genes (ignoring two samples with v small permit lists).

    sample_id n_total n_zero pct_zero
 1:     CN003     141      2      1.4
 2:     CN008    1550     56      3.6
 3:     CN004    4420    193      4.4
 4:   EG001_P   19977    966      4.8
 5:   EG004_C   10269    508      4.9
 6:     CN005    8256    427      5.2
 7:   EG004_P   17866   1009      5.6
 8:     CN002    6367    361      5.7
 9:   EG002_C    3784    216      5.7
10:   EG003_C   15359    870      5.7
11:     CN007    7334    432      5.9
12:     CN001    1721    106      6.2
13:   EG001_C    3383    215      6.4
14:   EG002_P    9179    587      6.4
15:     CN006    2154    145      6.7
16:   EG003_P     186     21     11.3

I picked the sample with the largest number of zero-gene barcodes (EG004_P), and calculated medians for various measures recorded in cbTable. The number of mapped UMIs is very similar, and I think the 0s for umi count and deduplication rate are expected.

Is the higher mapping rate in zero-gene barcodes surprising? This is a chunk higher in the barcodes with zero genes.

   zero_genes quantile q_mapped_umis q_umi_count q_dedup q_mapping
1:      FALSE     0.25          3353         500  0.1452    0.6734
2:      FALSE     0.50          4645        1180  0.2864    0.7115
3:      FALSE     0.75          6678        1672  0.3072    0.7737
4:       TRUE     0.25          2751           0  0.0000    0.9549
5:       TRUE     0.50          4208           0  0.0000    0.9591
6:       TRUE     0.75          7073           0  0.0000    0.9625

What would be the easiest way to see where these highly duplicated barcodes map to in the genome?

Thanks! Will

wmacnair avatar May 13 '22 07:05 wmacnair

Hi @wmacnair,

Are any of these samples sharable? There are a few ways one might check what's going on. One would be to take the collated RAD file, and then use the view command to produce a text-formatted version. From there, you could grep the corrected cell barcode to get all of the records for that cell. This would produce a list of records whose raw UMI frequency could then be analyzed. Any thoughts on this @csoneson?

Best, Rob

rob-p avatar May 17 '22 02:05 rob-p

If you happen to have a bam file of reads mapped to the genome, I think you could extract the reads corresponding to a given cell barcode using e.g. https://github.com/10XGenomics/subset-bam. Regarding the higher mapping rate I don't know - if the reads are all coming from the same place maybe that does make sense (either they all map, or none of them maps). I'm a bit puzzled by the fact that the deduplicated UMI count is exactly zero (do I read the table correctly?) - in my examples there were few, but not zero, entries retained. 🤔

csoneson avatar May 17 '22 12:05 csoneson

Is it possible that in the QC aggregation that, somehow, unmapped UMIs are being considered as mapped?

rob-p avatar May 17 '22 13:05 rob-p

The nbrMappedUMI columns directly corresponds to the MappedReads column in the alevin-fry output, and the totalUMICount to the DeduplicatedReads one:

https://github.com/csoneson/alevinQC/blob/f406fd6b3f1e327de749e7dec989351137f07022/R/readAlevinFryQC.R#L182-L183

csoneson avatar May 17 '22 14:05 csoneson

One of @wmacnair's points that I've overlooked is the reference itself. It may indeed be worth considering using something like the 10X reference annotation. A large number of pseudogenes could result in a larger degree of multi-mapping, which, under the standard cr-like resolution strategy (as well as under the default strategy of other tools) will be discarded. Is this a standard / publicly-available PBMC dataset we could look at?

rob-p avatar May 17 '22 17:05 rob-p

Hi both, thanks again for taking the time!

Some answers / responses, slightly reordered to help me make sense of new information:

Are any of these samples sharable? Is this a standard / publicly-available PBMC dataset we could look at?

Sadly not, I think - they're human samples from a clinical study that's only just started.

I'm a bit puzzled by the fact that the deduplicated UMI count is exactly zero (do I read the table correctly?) - in my examples there were few, but not zero, entries retained.

Yes, these are exact zeros.

There are a few ways one might check what's going on. One would be to take the collated RAD file, and then use the view command to produce a text-formatted version. From there, you could grep the corrected cell barcode to get all of the records for that cell. If you happen to have a bam file of reads mapped to the genome, I think you could extract the reads corresponding to a given cell barcode using e.g. https://github.com/10XGenomics/subset-bam.

I'm not promising success, but I will have a go at this...!

One of @wmacnair's points that I've overlooked is the reference itself. It may indeed be worth considering using something like the 10X reference annotation. A large number of pseudogenes could result in a larger degree of multi-mapping, which, under the standard cr-like resolution strategy (as well as under the default strategy of other tools) will be discarded. Regarding the higher mapping rate I don't know - if the reads are all coming from the same place maybe that does make sense (either they all map, or none of them maps).

Thanks for these points. Here's my go at synthesising them into a possible explanation, with some questions / comments:

  • The reference transcriptome I'm using might have a lot of additional pseudogenes (because I'm using the primary_assembly file? I don't really understand the difference between gencode.v40.annotation.gtf.gz and gencode.v40.primary_assembly.annotation.gtf.gz.)
  • For this to explain our completely deduplicated barcodes, these strange barcodes must map pretty much exclusively to some ambiguous pseudogenes. Is this right? Would this also tend to mean that they mapped to a small number of genomic locations?

Things I can do to provide more information:

  • Rerun alevin-fry on this sample with the cellranger reference.
  • Try to work out where the "zero" reads are mapping to in the genome.

Let me know if there is anything else useful that I can look at. And also if you have any comments on the possible explanation.

Best Will

wmacnair avatar May 18 '22 07:05 wmacnair

I guess I would be a bit surprised if it was because of using the primary file rather than the basic gencode one - it adds 54 genes/59 transcripts (to the 61544/246624 in the annotation.gtf.gz one), and most of them appears to be annotated as protein coding. Maybe one other thing that you can try to pinpoint the cause would be to rerun the alevin-fry quant step with a different UMI resolution strategy (e.g. cr-like-em).

csoneson avatar May 18 '22 08:05 csoneson

Ok, thanks that's helpful for understanding the differences.

I've just been discussing this with a colleague and we're wondering if it could be the choice of gene biotype that is the issue, specifically the inclusion of pseudogenes. A possible explanation is:

  • I include pseudogenes in my reference transcriptome (compared to cellranger's reference which restricts to protein coding / lincRNA / antisense / IG + TR genes; cellranger build notes here).
  • Reads that map to both a gene and its v similar pseudogene end up being discarded due to multi-mapping, so the UMI counts for such genes are massively reduced, possibly to zero.
  • For this to explain the zero-read droplets, we would have to have some barcodes that only derive from one / few gene, all of which have corresponding v similar pseudogenes.

Does this seem plausible? And in general, what is your advice on including / excluding pseudogenes? (Either at the stage of making your reference transcriptome, or later.)

wmacnair avatar May 18 '22 08:05 wmacnair

I think that a situation like that could potentially explain what you see. @rob-p can provide more insight into the UMI deduplication, but I think that the gene/pseudogene pairs must also be very similar, so that there is a tie for the highest count gene for all the UMIs. Running with the cr-like-em UMI resolution would not discard these UMIs, so I think it would indicate if this is where the problem lies.

csoneson avatar May 18 '22 08:05 csoneson

Ah nice, something testable! I will try to rerun both with cr-like-em, and using the cellranger reference, and let you know the results.

wmacnair avatar May 18 '22 10:05 wmacnair

@rob-p, quick Q - could you point me towards the syntax for the view command? Thanks!

wmacnair avatar May 18 '22 12:05 wmacnair

@wmacnair,

Sure thing! So, first please make sure you grab the latest release (0.5.1) which fixed an options conflict between the top-level help and the flag for the view mode to include the header information. View looks like this.

❯alevin-fry view -h
alevin-fry-view 0.5.1
Avi Srivastava <[email protected]>
Hirak Sarkar <[email protected]>
Dongze He <[email protected]>
Mohsen Zakeri <[email protected]>
Rob Patro <[email protected]>
View a RAD file

USAGE:
    alevin-fry view [OPTIONS] --rad <RADFILE>

OPTIONS:
    -h, --help                Print help information
    -H, --header              flag for printing header
    -o, --output <RADFILE>    output plain-text-file file
    -r, --rad <RADFILE>       input RAD file
    -V, --version             Print version information

If you don't pass a file to -o, then it will print to stdout by default. So you could do something like:

alevin-fry view -r collated_rad | grep cell_barcode_of_interest > interesting_records.txt

to get all of the records corresponding to a barcode of interest.

rob-p avatar May 18 '22 12:05 rob-p

Hmm, cr-like-em doesn't make any difference to that zero tail...

With cr-like (i.e. same plot as above): image

With cr-like-em: image

The deduplication rate also appears to be tighter.

With cr-like: image

With cr-like-em: image

wmacnair avatar May 18 '22 12:05 wmacnair

Thanks for the view explanation @rob-p. I've done this for the zero-gene barcode with the highest number of reads associated to it. The last column of this output gives the ensemble ID, and counting by this we get:

                       ENSG      N
   1:  ENSG00000278996.1-I1 313651
   2:   ENSG00000280441.3-I 313651
   3: ENSG00000163046.16-I1    125
   4:   ENSG00000230876.8-I     87
   5:     ENST00000629969.1     34
  ---
1233:     ENST00000684490.1      1
1234:     ENST00000684676.1      1
1235:     ENST00000692836.1      1
1236:     ENST00000695035.1      1
1237:     ENST00000695038.1      1

i.e. 627302 out of 629498 reads for these barcodes are derived from two genes.

Any thoughts? I've had a look at those genes and I don't really get much from what I see...

wmacnair avatar May 18 '22 13:05 wmacnair

Ahhh are these huge counts multimappers?

rob-p avatar May 18 '22 13:05 rob-p

I'm not really sure what a multimapper is... I've had a look at those two transcripts, but couldn't really see any way to check if they were very similar.

Is that what you meant, or was it something else?

wmacnair avatar May 18 '22 13:05 wmacnair

So each entry output by view is a mapping record. But the same read can have many mapping records if the read matches more than one place. This would show up in the output as the same read id having multiple mappings to different genes. I think we also explicitly print a tag for the number of hits for each read (NH, I think).

rob-p avatar May 18 '22 13:05 rob-p

Ok, yes I think that is the case. These are outputs for the top 40 zero-gene barcodes, with the NH variable added (NH is dominated by 2 and 3, so I lumped the others together):

                     ensg N_total    NH:2   NH:3 Other
 1:  ENSG00000278996.1-I1 2838790 2687637 151040   113
 2:   ENSG00000280441.3-I 2838790 2687637 151040   113
 3: ENSG00000162669.16-I2  114504       0 114493    11
 4:   ENSG00000172572.7-I   30160       0  30082    78
 5: ENSG00000163046.16-I1    6301       0   6168   133
 6:   ENSG00000230876.8-I     665       1      0   664
 7:     ENST00000631211.1     339     313      0    26
 8:     ENST00000629969.1     314     313      0     1
 9: ENSG00000137571.11-I1     182       0    168    14
10:   ENSG00000268981.5-I     169       0    117    52
11:  ENSG00000008282.9-I2      78       0      0    78
12:  ENSG00000078328.23-I      53       3      3    47
13:   ENSG00000259053.1-I      50       3      1    46
14:  ENSG00000140025.16-I      49       3      1    45
15: ENSG00000015568.13-I8      47      12     12    23
16:  ENSG00000236332.2-I1      32       4      0    28
17:  ENSG00000150275.20-I      31       0      0    31
18:  ENSG00000198586.14-I      30       0      0    30
19: ENSG00000187323.12-I2      29       3      0    26
20:  ENSG00000153233.13-I      28       1      0    27

wmacnair avatar May 18 '22 13:05 wmacnair

Ok, so this suggests that most of these reads are mapping to > 1 distinct gene (and therefore being discarded). Also, most of them are mapping to (what looks like) the same subset of distinct genes. This doesn't quite explain why those counts don't show up in cr-like-em, but let's table that for a moment.

When a read maps to multiple locations, the rules for how it are resolved depend on what those locations are. If they are transcripts of the same gene (spliced) it will be assigned to that gene. If they are spliced and unspliced regions of the same gene, it will be assigned to that gene (as an ambiguous UMI). However, if the mapping is to transcripts of different genes, and (as @csoneson noted) neither gene has a larger number of reads displaying that UMI, then the UMI is not uniquely resolvable to a gene and will be discarded. The top two rows, at least, seem to fall into this category as there are two distinct lncRNA to which the reads map equally well.

So overall, one observation here is that, for these cells, it looks like the vast majority of mappings come from only a handful of genes (perhaps as few as 2) where almost all of the reads map to more than one gene.

rob-p avatar May 18 '22 14:05 rob-p

Ok, so I think I'm happy with how the mapping works, and why we end up excluding those reads.

The thing that's confusing me is why we end up with some barcodes mapping just to those two genes... I've just checked, and it's not just most of the reads, for almost all of the top 40 barcodes it's almost exclusively those two genes:

                     bc      N prop_top2
 1: CB:TGCCCTCCAATGGATC 166124     0.999
 2: CB:CCGACTTTCGTTCTTG 142004     0.999
 3: CB:GGTGCGTCTTGCGCCG 113253     0.999
 4: CB:CGTCACTACCTCCCCG 108055     0.999
 5: CB:AGGCCGTCCCTCTTAA 105101     0.999
 6: CB:GGTCCGTCTTGCGCCG  98084     0.999
 7: CB:TGGCGAATGCTTTCGC  98432     0.999
 8: CB:ATCATGGCCTCAGTTC  87362     0.999
 9: CB:TGGCAAATGCTTTCGC  84189     0.999
10: CB:CCCTCCTAATCATGGC  75437     0.999
11: CB:GATCGTCTTCGAACCT 211216     0.998
12: CB:GAGTCCGTCTTGCGCC 278231     0.998
13: CB:TGGCGCAATACGAATG  94752     0.998
14: CB:TTCGGTCTTGATTAAT 629498     0.997
15: CB:TGCCCTACAATGGATC 117496     0.996
16: CB:GTTACGACTTTTACTT 101171     0.996
17: CB:CGACTTTTACTTCCTC 149619     0.995
18: CB:GTCTTCGAACCTCCGA 136971     0.995
19: CB:GCAATCATACTCCCCC 130063     0.995
20: CB:GCTCTGGTCCGTCTTG  87547     0.995
21: CB:ATCCACCAACTAAGAA 219455     0.994
22: CB:TCAGCTTTGCAACCAT  93262     0.994
23: CB:CAACCATACTCCCCCC  93048     0.994
24: CB:GTAGGCCCCGCTTTCA  88253     0.994
25: CB:GTCTCGTTCGTTATCG 285555     0.993
26: CB:CGTTCGTTATCGGAAT 483959     0.992
27: CB:ACGTCAATTCCTTTAA 120961     0.992
28: CB:GCCTCTAGATAGTCAA  89414     0.991
29: CB:CCTCAGTTCCGAAAAC  88648     0.990
30: CB:GCGACCATACTCCCCC  88433     0.985
31: CB:TGAGCCGCAGGCTCCA 120459     0.984
32: CB:TGACTTTTACTTCCTC  83973     0.984
33: CB:CTCGGGTGGCTGAACG  60776     0.971
34: CB:AATCGGTAGTAGCGAC  92626     0.968
35: CB:GTGCGGTGTGTACAAA 153338     0.959
36: CB:AAGGCAGGGACTTAAT  69945     0.949
37: CB:GCACTTACTGGGAATT 161946     0.910
38: CB:TTGCCGTATCGTTCCG 107710     0.811
39: CB:GACCTGGGCGGGATTC  99845     0.782
40: CB:CTCGTACTGAGCAGGA 230899     0.680

What could be going on here, so that for a small but non-zero proportion of droplets they only map to these two v similar genes?

[just saw you had also noticed this in your edited comment!]

wmacnair avatar May 18 '22 14:05 wmacnair

Doesn't look like using the cellranger reference actually makes much difference. Here is the genes vs barcodes plot for the same sample, mapped to the cellranger reference:

image

wmacnair avatar May 18 '22 14:05 wmacnair

Hi @wmacnair,

Thanks for all of the useful diagnostic info. Regarding:

The thing that's confusing me is why we end up with some barcodes mapping just to those two genes... I've just checked, and it's not just most of the reads, for almost all of the top 40 barcodes it's almost exclusively those two genes:

The only way to really dig down into that is to try to look at the sequencing reads themselves that are inducing those mappings. Since the RAD file encodes where the reads maps but, for purposes of keeping the files compact and fast to parse, doesn't write down the reads themselves (like a SAM file). If you really want to track this down, you could certainly go back to the original FASTQ files, grep for the barcode of interest in the technical read to extract the read names, and then pull those read names out of the biological read file. Then you would have the actual sequence reads that are inducing these mappings.

rob-p avatar May 18 '22 15:05 rob-p

I just remembered that I still had this issue open, apologies...!

I've reread through the thread, and I remain confused about what was going on in my samples. I have moved on by (a) using the default 10x reference, rather than rolling my own from the newest Gencode data, and (b) using simpleaf.

It doesn't seem like much was gained from the discussion, apart from teaching me some basics of mapping 😅 However, reading through again it was clear to me that it would be great to have a small, purpose-built standalone R package for loading simpleaf / alevin-fry outputs into R.

At the moment you suggest using fishpond::loadFry. This works fine, but I find it a bit unsatisfying, partly because it's not the main purpose of the package, but also some additional functionality would be great.

For example, I would love to see a package, e.g. simpleafR, that does:

  • loading processed data from multiple directories in one function call
  • excluding barcodes with zero reads
  • produce either sce or Seurat objects as outputs
  • (perhaps an option would be to combine this with alevinQC?)

I'm aware that you must have a very long list of things to do, and you've probably already considered this as something you may do. So just consider this as an additional point of support for if and when you get it :)

Thanks! Will

wmacnair avatar Jun 30 '23 15:06 wmacnair

Hi @wmacnair,

Thanks for the thoughtful suggestions! Regarding alevinQC, we actually have another project in the works that is designed to make an even easier-to-use QC report. It's based on scran.js and bkana and so it runs entirely in the browser :). More on that soon.

Regarding your other suggestion, I'd like to pull @mikelove and @DongzeHE into this discussion. That is, I know that Mike likes to think of fishpond as a general catch-all package for lots of salmon/alevin/alevin-fry related functionality. However, if you think it would be useful to have a more focused / dedicated package for loading and doing basic processing of simpleaf output in R, then I think this is worth a broader discussion. I'll open a separate discussion item about this :).

Best, Rob

rob-p avatar Jun 30 '23 15:06 rob-p

We have some stuff in fishpond for import and processing, and welcome to consider other things to go in there. Just need docs and tests really. There’s even some new stuff for working with EC counts.

Recently we took out non-R code to move to a separate package for ease of installation. EDS package on Bioc (Avi’s) now has the original alevin data storage format which is read in with C++.

mikelove avatar Jul 01 '23 12:07 mikelove