sg-nex-data question about sequin concentration

Hi @cying111, thank you so much for putting this dataset together. I am interested in comparing different quantification methods using this dataset, and I am specifically interested in this sample:

SGNex_Hct116_directcDNA_replicate3_run2 which has RNA sequin Mix A at this concentration according to the sample spreadsheet: 1% RNA sequin Mix A v1.0 @3ng

I was wondering if you could help break down what the 1% and 3ng mean in this case?

In the RNA Mix A spreadsheet, under version 1, say there are these four quantities,

Mix A (version 1)
R1_11_1 | 161.132813
R1_11_2 | 80.5664063
R1_12_1 | 1.77720014
R1_12_2 | 28.4352022

Does 1% mean that 1% of these values are the number of corresponding reads found in the fastq file? I'm still confused about what 3ng refers to. Hope you can help clarify.

Thank you,

Sowmya

Nov 19 '24 18:11 sparthib

Hi @sparthib,

We are glad to know that you find this resource helpful! And I am very sorry for the late reply.

For the spike-in concentration, 1% means that the spike-in makes up 1% of the total RNA, so you can expect about 1% of the total sequencing reads for that sample to be spike-in reads. 3ng is the amount of spike-in RNA, so it should be 1% of the total mRNA amount for the sample.

Hope this clarifies your question!

Thank you!

Warm regards, Ying

Nov 29 '24 06:11 cying111
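The arithmetic in the reply above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes read counts are proportional to the concentrations listed in the mix spreadsheet, it uses only the four Mix A (version 1) values quoted earlier in this thread (the real mix has many more transcripts), and the function name `expected_spikein_cpm` is made up for this example.

```python
def expected_spikein_cpm(mix_conc, spikein_fraction=0.01):
    """Expected CPM per spike-in transcript, relative to all reads in a sample.

    mix_conc: transcript ID -> concentration from the mix spreadsheet.
    spikein_fraction: expected share of all reads that are spike-in (1% here).
    Assumption: reads are proportional to the listed concentrations.
    """
    total = sum(mix_conc.values())
    return {
        tid: conc / total * spikein_fraction * 1_000_000
        for tid, conc in mix_conc.items()
    }


# The four values quoted in the question, used purely as an illustration.
mix_a = {
    "R1_11_1": 161.132813,
    "R1_11_2": 80.5664063,
    "R1_12_1": 1.77720014,
    "R1_12_2": 28.4352022,
}

cpm = expected_spikein_cpm(mix_a)

# If 3 ng of spike-in is 1% of the RNA, the implied total is about 300 ng.
implied_total_ng = 3 / 0.01

for tid, value in cpm.items():
    print(f"{tid}\t{value:.1f}")
```

With a 1% spike-in share, the expected CPM values of all spike-in transcripts add up to 10,000, and the ratios between transcripts follow the ratios of their concentrations in the mix.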

Thanks for your response @cying111! Do you have suggestions on how to go about finding the true counts or CPM of the spike-ins or SIRVs in each of the samples?

I came across the SIRV-1 concentration calculator on the main README, and I am not sure I am using it right. It would be great if there were a pre-existing table with the true counts information for the spiked samples listed here.

Thanks, Sowmya

Dec 05 '24 16:12 sparthib

Hi @cying111, as a follow-up: I am going through the transcriptome-aligned BAM files, and I expected only 1% of the reads to be spike-ins. For example, I expected most of the transcripts that this sample (SGNex_Hct116_cDNA_replicate3_run3) aligns to to have Ensembl IDs, but it turns out they are all spike-ins? I'm not sure if I am misinterpreting this. Also, I see that a lot of the alignments here are secondary/supplementary. How would you suggest I go about calculating the CPM and comparing it against the known concentrations?
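One common way to handle secondary/supplementary alignments, sketched here as an assumption rather than as the SG-NEx authors' method, is to count only primary alignments so that each read contributes once, and then scale to counts per million. The sketch below reads SAM-formatted text lines and uses the flag bits defined in the SAM specification; the function names and the example records are hypothetical.

```python
from collections import Counter

# Flag bits defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def primary_counts(sam_lines):
    """Count primary, mapped alignments per reference sequence."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue  # keep one (primary) alignment per read
        counts[fields[2]] += 1
    return counts


def counts_to_cpm(counts):
    """Scale counts to counts per million primary alignments."""
    total = sum(counts.values())
    return {ref: n / total * 1_000_000 for ref, n in counts.items()}


# Hypothetical SAM records, for illustration only.
example = [
    "@SQ\tSN:R1_11_1\tLN:1000",
    "read1\t0\tR1_11_1\t1\t60\t10M\t*\t0\t0\t*\t*",
    "read1\t256\tR1_11_2\t1\t0\t10M\t*\t0\t0\t*\t*",    # secondary
    "read2\t16\tR1_11_2\t5\t60\t10M\t*\t0\t0\t*\t*",
    "read3\t2048\tR1_11_1\t9\t60\t10M\t*\t0\t0\t*\t*",  # supplementary
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",              # unmapped
]

counts = primary_counts(example)
cpm_observed = counts_to_cpm(counts)
```

The same filter can be applied upstream with `samtools view -F 0x904`, which excludes unmapped, secondary, and supplementary records before the lines reach a script like this. Whether to use all primary alignments or only spike-in alignments as the CPM denominator depends on what the known concentrations are being compared against.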

Additionally, I am trying to obtain the length of the transcript each of these alignments originates from. Should I calculate that from the length of the sequences in the transcriptome FASTA, or obtain it directly from the GTF file? (I'm assuming the length in the GTF file includes intronic regions?)
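On the length question, both routes can give the spliced length. In a GTF file the `transcript` feature spans introns, while summing the `exon` features of a transcript (coordinates are 1-based and inclusive) excludes them. A minimal sketch, assuming a standard GTF with `transcript_id` attributes and a FASTA whose header starts with the transcript ID; the function names and the `tx1` records are hypothetical:

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def fasta_lengths(fasta_lines):
    """Sequence length per record; the ID is the first word of the header."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def gtf_exon_lengths(gtf_lines):
    """Spliced transcript length: the sum of exon lengths per transcript_id."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # 'transcript' and 'gene' rows span introns; skip them
        start, end = int(fields[3]), int(fields[4])
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            lengths[match.group(1)] += end - start + 1  # 1-based, inclusive
    return dict(lengths)


# Hypothetical two-exon transcript, for illustration only.
gtf = [
    'chr1\tsrc\ttranscript\t1\t300\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";',
    'chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";',
    'chr1\tsrc\texon\t201\t300\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";',
]
fasta = [">tx1 example", "A" * 120, "C" * 80]

from_gtf = gtf_exon_lengths(gtf)
from_fasta = fasta_lengths(fasta)
```

In this toy case both approaches report 200 bases for `tx1`, whereas the `transcript` row alone would suggest 300. If the alignments were made against the transcriptome FASTA, using the FASTA lengths keeps the IDs and lengths consistent with the BAM references.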

Thanks!

Dec 30 '24 00:12 sparthib

Hi @cying111, thanks for updating the SIRV Set1 and 4 concentration files; they are very helpful. I see that the SIRV4 file has a concentration column, conc (ng/µl), which is not in the Set1 spreadsheet. I am taking this as the relative abundance for the sequins, which can be used to calculate the expected CPM as described in the paper.

For Set1, for each of the mixes (E0, E1, and E2), this would just be molecular weight (MW) times the fmol/µl, right? I would deeply appreciate your clarification on this.

Jun 23 '25 17:06 sparthib
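As a unit check on the question above: multiplying a molar concentration in fmol/µl by a molecular weight in g/mol gives a mass concentration once a factor of 1e-6 is applied to land in ng/µl (1 fmol = 1e-15 mol, 1 g = 1e9 ng). A minimal sketch, where the function name and the example numbers are hypothetical and not taken from the SIRV spreadsheets:

```python
def mass_conc_ng_per_ul(fmol_per_ul, mw_g_per_mol):
    """Convert a molar concentration (fmol/µl) to a mass concentration (ng/µl).

    1 fmol = 1e-15 mol and 1 g = 1e9 ng, so fmol * (g/mol) = 1e-6 ng.
    """
    return fmol_per_ul * mw_g_per_mol * 1e-6


def relative_abundance(values):
    """Normalise a dict of concentrations so that they sum to 1."""
    total = sum(values.values())
    return {key: value / total for key, value in values.items()}


# Hypothetical transcripts: (fmol/µl, molecular weight in g/mol).
example = {
    "txA": (1.0, 300_000.0),
    "txB": (1.0, 600_000.0),
}

by_mass = relative_abundance(
    {tid: mass_conc_ng_per_ul(fmol, mw) for tid, (fmol, mw) in example.items()}
)
by_moles = relative_abundance({tid: fmol for tid, (fmol, _) in example.items()})
```

The two hypothetical transcripts are equimolar, so their molar shares are equal, but the heavier one has twice the mass share. Which of the two is the right basis for an expected CPM depends on whether reads are assumed to track molecules or mass, so that choice is worth confirming against the paper.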