
Error in Aberrant Expression pipeline using external counts

Open SathyaDarmalinggam opened this issue 3 years ago • 15 comments

Hi there, I've been trying to include external counts in my aberrant expression analysis of whole blood samples. I keep running into an error at the mergeCounts step: "The row (genes) of the count matrices to be merged are not the same."

I've tried it with various external counts, e.g. GTEx and other in-house count matrices, but I keep running into the same error. I've also tried to debug it, and I made sure that the samples use the same genome build (GRCh38) and the same annotation file. I've also checked that similar Ensembl gene IDs are present in both my external count matrix and the count matrix generated from the BAM files.

Just an additional note, the pipeline runs smoothly without the external counts.

A related question: is it possible to use this pipeline on external count matrix without any bam files? Or does the pipeline only work when external counts are used in addition to bam files?

mergeCounts Error

Any help is much appreciated!

DROP version: 1.1.0
R version: 4.0.2

config.txt

SathyaDarmalinggam avatar Jan 09 '22 12:01 SathyaDarmalinggam

Are you trying to mix GTEx and in-house counts? I could imagine that it would throw errors if the gene annotations are not exactly the same. Looking at your config file, you only have one DROP group. Have you tried running it with only GTEx or only in-house counts?

Would it be possible to share the entire error message? That could also help us identify the problem. Thank you in advance

nickhsmith avatar Jan 10 '22 10:01 nickhsmith

Another possibility is that the external count tsv files were filtered beforehand, which would cause an issue. Does each of the external count matrices have exactly the same number of rows and exactly the same row names/gene IDs?
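A minimal sketch of that check, in Python with only the standard library. It assumes each count matrix is a tab-separated file (optionally gzipped) with a header row and the gene IDs in the first column; the function names are hypothetical and not part of DROP.

```python
import csv
import gzip


def read_gene_ids(path):
    """Return the first-column values (gene IDs) of a TSV count matrix, skipping the header."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # header row with the sample names
        return [row[0] for row in reader if row]


def compare_gene_ids(ids_a, ids_b):
    """Report whether two gene ID lists agree in length, content and order."""
    return {
        "same_length": len(ids_a) == len(ids_b),
        "same_set": set(ids_a) == set(ids_b),
        "same_order": ids_a == ids_b,
    }


if __name__ == "__main__":
    # Toy IDs only; replace with read_gene_ids("<your file>.tsv") for real data.
    print(compare_gene_ids(["geneA", "geneB", "geneC"], ["geneA", "geneC"]))
```

If `same_set` is false, the matrices were most likely built from different annotations or one of them was filtered.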

To answer your other question, the aberrantExpression pipeline should work with only external counts, as long as you indicate a minimum of 10 samples to run in the sample annotation table.
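A rough way to check the sample count per group before running, sketched in Python. It assumes the sample annotation is a tab-separated table with `RNA_ID` and `DROP_GROUP` columns and that `DROP_GROUP` may hold several comma-separated groups; verify these names against your own table. The helper name and the example rows are made up.

```python
import csv
import io

MIN_SAMPLES = 10  # minimum mentioned in this thread


def samples_per_group(handle):
    """Count distinct RNA_IDs per DROP_GROUP (assumed column names)."""
    groups = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        for group in row["DROP_GROUP"].split(","):
            group = group.strip()
            if group:
                groups.setdefault(group, set()).add(row["RNA_ID"])
    return {group: len(ids) for group, ids in groups.items()}


if __name__ == "__main__":
    # Made-up annotation rows; for a real table use: with open(path) as handle: ...
    example = io.StringIO(
        "RNA_ID\tDROP_GROUP\n"
        "sample1\tWB\n"
        "sample2\tWB,other\n"
    )
    for group, n in samples_per_group(example).items():
        status = "ok" if n >= MIN_SAMPLES else "too few samples"
        print(group, n, status)
```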

nickhsmith avatar Jan 10 '22 10:01 nickhsmith

Hi @nickhsmith , thanks for your quick reply. Here's the snakemake log.txt file with the error message. I then proceeded to do this to arrive at the error I posted above:

snakemake <- readRDS("PATH/.drop/tmp/AE/v38/WB/merge.Rds")
snakemake@input
source("./Scripts/AberrantExpression/pipeline/Counting/mergeCounts.R")

Just to clarify, I tried running this pipeline with 69 samples (BAM files) plus either a GTEx or an in-house external count matrix saved as a .tsv file. The various external count matrices I have tried have more than 56k gene IDs/rows.

Running the pipeline without the external counts works beautifully.

I will try running it with just the external counts as you suggested & will double check if the external count matrices have had the counts filtered.

SathyaDarmalinggam avatar Jan 10 '22 11:01 SathyaDarmalinggam

Hi Sathya. Can you please check that the row names of the following 2 files are identical?
PATH/Output_GTEX/processed_data/aberrant_expression/v38/counts/COV032_20_37_PAX.Rds
PATH/Input/GTEX_External.tsv

vyepez88 avatar Jan 10 '22 11:01 vyepez88

Hi @vyepez88 , looks like the 2 files are not identical.

identical(GTEX_list,Sample_list)
FALSE

all.equal(GTEX_list,Sample_list)
'Lengths (56200, 60649) differ (string compare on first 56200)''56200 string mismatches'

length(GTEX_list)
56200

length(Sample_list)
60649

Does this just mean that the annotation files are not the same, and that this is why I keep getting the error message?

SathyaDarmalinggam avatar Jan 10 '22 12:01 SathyaDarmalinggam

Yes, it means that the annotation file used to generate the GTEx count matrix is different from the one you are using to count your own samples. Did you generate the GTEx count matrix yourself, or did you download it? Moreover, we recommend using external matrices only if they were generated with DROP, as others could have been generated with other software and/or parameters, even if the annotation file is the same.
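To see how two gene ID lists differ, a set comparison can help. The Python sketch below is only a diagnostic under stated assumptions: the function names are hypothetical, the IDs are toy values, and stripping everything after the first "." is a rough way to ignore Ensembl version suffixes. A match after stripping only hints at different annotation releases; it does not make the matrices mergeable.

```python
def strip_version(gene_id):
    """Drop a version suffix, e.g. 'ENSG00000000003.14' -> 'ENSG00000000003' (rough normalisation)."""
    return gene_id.split(".", 1)[0]


def diff_gene_ids(local_ids, external_ids):
    """Summarise which gene IDs are unique to each list, with and without version suffixes."""
    local, external = set(local_ids), set(external_ids)
    local_nv = {strip_version(g) for g in local}
    external_nv = {strip_version(g) for g in external}
    return {
        "only_local": sorted(local - external),
        "only_external": sorted(external - local),
        "shared": len(local & external),
        "shared_ignoring_version": len(local_nv & external_nv),
    }


if __name__ == "__main__":
    # Toy IDs for illustration only.
    local = ["ENSG00000000003.14", "ENSG00000000005.5", "ENSG00000198888.2"]
    external = ["ENSG00000000003.15", "ENSG00000000005.5"]
    print(diff_gene_ids(local, external))
```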

vyepez88 avatar Jan 10 '22 12:01 vyepez88

Check whether those "extra genes" belong to chrM. I found that the "125 strand specific blood, build hg19, Baylor College of Medicine" dataset does not contain mitochondrial genes.
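One way to test this is to look up the chromosome of every gene that is missing from the external matrix. The Python sketch below assumes a standard GTF layout (tab-separated, feature type in column 3, a `gene_id "..."` attribute in column 9); the function names and the example records are made up.

```python
import re
from collections import Counter

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gene_chromosomes(gtf_lines):
    """Map gene_id -> chromosome using the 'gene' feature lines of a GTF."""
    chrom_of = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_ID_RE.search(fields[8])
        if match:
            chrom_of[match.group(1)] = fields[0]
    return chrom_of


def missing_by_chromosome(local_ids, external_ids, chrom_of):
    """Count, per chromosome, the genes present locally but absent from the external matrix."""
    missing = set(local_ids) - set(external_ids)
    return Counter(chrom_of.get(gene, "unknown") for gene in missing)


if __name__ == "__main__":
    # Made-up GTF records; for a real file use: with open(gtf_path) as handle: ...
    gtf = [
        "\t".join(["chr1", "src", "gene", "1", "100", ".", "+", ".", 'gene_id "geneA";']),
        "\t".join(["chrM", "src", "gene", "1", "100", ".", "+", ".", 'gene_id "geneM";']),
    ]
    print(missing_by_chromosome(["geneA", "geneM"], ["geneA"], gene_chromosomes(gtf)))
```

If nearly all missing genes fall on chrM, the external matrix was probably built without mitochondrial genes.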

marcDabad avatar Jan 17 '22 09:01 marcDabad

Were you able to clean up the annotations and get things to work?

nickhsmith avatar Feb 15 '22 09:02 nickhsmith

Hi @nickhsmith , unfortunately not. I tried various gene annotations but I always got the error at the same step.

For now I am processing raw data from another, similar cohort and will analyse the data together instead of providing external counts.

SathyaDarmalinggam avatar Feb 15 '22 09:02 SathyaDarmalinggam

I'm sorry to hear you are having trouble. Did you generate the external counts yourself (or through another method), or did you download them from the DROP external count data sets?

nickhsmith avatar Feb 15 '22 11:02 nickhsmith

I have a similar question about the external count file. Could you please provide an example of what the external count file should look like? Should it contain raw counts or FPKM values? Are there any specific checks that need to be done before it is included?

Also, may I include an external count file for the aberrant splicing module?

Thanks.

C.

capricy avatar Apr 01 '22 11:04 capricy

Hi, the external counts (containing the matrices needed for both expression and splicing) can be created using DROP with the rule: snakemake exportCounts

Furthermore, you have to fill a couple of parameters in the config file: https://gagneurlab-drop.readthedocs.io/en/latest/prepare.html#export-counts-dictionary

We hope to have the external count functionality for the splicing module running next week!

vyepez88 avatar Apr 01 '22 11:04 vyepez88

I would like to clarify my questions again:

  1. External count file: should it be a raw count matrix or an FPKM matrix? How do I check that the annotation version matches?
  2. Is there an example of an external count file, so that I know what the header and row names look like?
  3. snakemake exportCounts: should I run it before the AE module and then include the output in the config file? Where would I find its output?

Thanks.

C.

capricy avatar Apr 01 '22 11:04 capricy

  1. Raw counts. If you exported the counts using the same gtf file that you are using for the local samples, it should work. DROP checks that the number and names of genes match.
  2. You can execute drop demo and find an example under Data/external_geneCounts.tsv.gz
  3. You should run it before the AE module and include it in the sample annotation. You can find it under processed_results/exported_counts
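A quick sanity check that a count file holds raw counts rather than FPKM values is to confirm that every cell is a non-negative whole number. The Python sketch below assumes a tab-separated layout with sample names in the header and gene IDs in the first column; that layout has not been verified against the demo file, and the function name and example values are made up.

```python
import csv
import io


def non_raw_values(handle, max_report=5):
    """Return up to max_report (gene, sample, value) cells that are not non-negative integers."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    problems = []
    for row in reader:
        for sample, value in zip(header[1:], row[1:]):
            try:
                number = float(value)
            except ValueError:
                number = None  # empty or non-numeric cell
            if number is None or number < 0 or not number.is_integer():
                problems.append((row[0], sample, value))
                if len(problems) >= max_report:
                    return problems
    return problems


if __name__ == "__main__":
    # Made-up matrix; for a gzipped file use gzip.open(path, "rt") as the handle.
    example = io.StringIO("geneID\ts1\ts2\ngeneA\t10\t0\ngeneB\t3.75\t12\n")
    print(non_raw_values(example))
```

An empty result is consistent with raw counts; fractional values suggest a normalised matrix such as FPKM.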

vyepez88 avatar Apr 01 '22 12:04 vyepez88

We have updated the splicing module to accept external splicing counts, which increases the number of samples available! Please follow the latest documentation and update DROP to version 1.2.1.

nickhsmith avatar Jul 06 '22 15:07 nickhsmith