drop icon indicating copy to clipboard operation
drop copied to clipboard

Error in AberrantSplicing_pipeline_FRASER_08_extract_results_FraseR_R

Open wanqingshao opened this issue 1 year ago • 31 comments

Hello,

I encountered the following error in the AS module when running DROP version 1.3.3 with FRASER 2.0, 68 samples were run together. Could you help take a look, really appreciate any suggestions on how to fix this! Thanks in advance!

[1] "Results per junction extracted"
Tue Aug  1 08:15:41 2023: Collecting results for: jaccard (transcriptome-wide)
Tue Aug  1 08:16:16 2023: Process chunk: 1 for: jaccard
Tue Aug  1 08:19:58 2023: Process chunk: 2 for: jaccard
Tue Aug  1 08:23:40 2023: Process chunk: 3 for: jaccard
Tue Aug  1 08:27:19 2023: Process chunk: 4 for: jaccard

[1] "Results per gene extracted"
Loading required package: GenomicFeatures
preparing ...
calculating splice types ...
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'start': subscript contains NAs
Calls: annotatePotentialImpact ... normalizeSingleBracketSubscript -> NSBS -> NSBS -> .subscript_error
In addition: Warning message:
In min(distance(test_junction, refseq.genes), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
Execution halted

wanqingshao avatar Aug 04 '23 13:08 wanqingshao

Hi, can you tell us which gtf file are you using?

vyepez88 avatar Aug 04 '23 21:08 vyepez88

Hi,

Thank you so much for the reply, really appreciate the help!

I used GENCODE V39, below is the header from the gtf. It worked in module AE for the same set of samples. For AS, this gtf worked with other datasets in FRASER V1 through older version of DROP, have not tried running it with DROP v1.3.3 and FRASER V1.

##description: evidence-based annotation of the human genome (GRCh38), version 39 (Ensembl 105)
##provider: GENCODE
##contact: [email protected]
##format: gtf
##date: 2021-09-02

Wanqing

wanqingshao avatar Aug 05 '23 14:08 wanqingshao

Hi @wanqingshao , I came across a similar error myself recently, I'm looking into it. I'll come back to you when I know why this happens.

ischeller avatar Aug 07 '23 10:08 ischeller

Thank you @ischeller !

wanqingshao avatar Aug 07 '23 12:08 wanqingshao

Hello @ischeller and @vyepez88,

Wondering did you get a chance to look into this? Or is there any test that you would recommend me trying? Would switching to a different gene model gtf help?

Best, Wanqing

wanqingshao avatar Aug 14 '23 16:08 wanqingshao

Hi @wanqingshao , sorry for not coming back sooner. While I know that I saw that same error myself some time ago, I have not been able to reproduce it now with DROP 1.3.3. Maybe you could try to do a fresh install of DROP and see if this solves the issue for you? Otherwise, would you have any way to reproduce the behavior you observe? I don't think its related to the gtf, so I would not expect switching that to change. The other thing you could do is to comment out those lines of code in the results script, as they are not strictly needed (they add additional annotation to the results table of whether the outlier introns are in blacklist regions of the genome etc).

ischeller avatar Aug 17 '23 14:08 ischeller

Thanks @ischeller for the reply!

I'll try adding --keep-incomplete in the snakemake run to keep the fraser/fraser/results_gene_all.tsv output and see if I can do some debugging later, will keep you posted if anything comes out.

Many thanks! Wanqing

wanqingshao avatar Aug 17 '23 17:08 wanqingshao

Hi @wanqingshao , were you able to have a look at this and sort it out?

vyepez88 avatar Sep 14 '23 09:09 vyepez88

Hi @vyepez88,

Unfortunately no. It appears to be errored out in function annotatePotentialImpact during some steps in FRASER resultAnnotation.R. My input for this step has almost 2 million rows, it took 1-2 days of running time before hitting the problematic line. It seems that the way this function accesses txdb is restricted to only a single thread. I tried modifying resultAnnotations.R with some trycatch commands but couldn't really figure it out. I haven't been able to identify the problematic record in the data.

I ended up commenting out the following two lines in script 08_extract_results_FraseR.R as suggested by ischeller

res_junc_dt <- annotatePotentialImpact(result=res_junc_dt, txdb=txdb, fds=fds) res_genes_dt <- annotatePotentialImpact(result=res_genes_dt, txdb=txdb, fds=fds)

please let me know if you have any suggestions!

-Wanqing

wanqingshao avatar Sep 14 '23 13:09 wanqingshao

Hi, 2 million rows is a lot. Can you share with us the values of the aberrantSplicing dictionary of your config file?

vyepez88 avatar Sep 14 '23 13:09 vyepez88

filter is set to false, was trying to get everything then apply filter afterwards in downstream analysis.

aberrantSplicing:
  run: true
  groups:
  - fraser
  #- fraser_external
  recount: true
  longRead: false
  keepNonStandardChrs: false
  filter: false
  minExpressionInOneSample: 20
  quantileMinExpression: 10
  quantileForFiltering: 0.95
  minDeltaPsi: 0.05
  implementation: PCA
  padjCutoff: 1
  maxTestedDimensionProportion: 6
  genesToTest:
  FRASER_version: FRASER2
  deltaPsiCutoff: 0.05

wanqingshao avatar Sep 14 '23 13:09 wanqingshao

The problem might be that you're not restricting your results all, so you're extracting and annotating all junction-sample combinations. Can you please try after changing the following?

 padjCutoff: 0.1
 deltaPsiCutoff: 0.1

vyepez88 avatar Sep 14 '23 13:09 vyepez88

Thank you, I'll give this a try, and to confirm, in addition to changing padj and deltaPsi cutoffs, filter should also be set to ture? The documentation recommended true, but the default value from drop demo is false

wanqingshao avatar Sep 14 '23 13:09 wanqingshao

Hi @wanqingshao , yes, we definitely recommend filter: true. It is only set false in the demo as the test dataset there is very small to allow fast testing, so filtering is unnecessary there. But on real datasets not filtering makes the computation much slower, as filtering allows to remove many noisy junctions. Typically before filtering the fds objects have millions of rows, while afterwards they contain only hundreds of thousands of rows.

ischeller avatar Sep 14 '23 14:09 ischeller

Thank you @ischeller and @vyepez88 ! I'll add the filter as suggested. We plan to run DROP again with a larger dataset (including the samples in this initial test) in the next few weeks, I'll report back if the same error persist.

Many thanks for your help!

-Wanqing

wanqingshao avatar Sep 14 '23 14:09 wanqingshao

Hello!

I am experiencing the same issue with my current dataset. It consists of 57 samples (target enriched for 49 genes). I have tried various combinations of including only subsets of the samples and found that it seems to (not always but) most often work with smaller sets. I got it to work in smaller sets with each sample, so I can exclude that the reason is one sample being corrupt. Also, the error occurred both with filter true & false.

I am currently testing it on the development branch and got the analysis of other datasets of similar size to run nicely in the past.

Any ideas or suggestions on what to test or what data to provide would be greatly appreciated!

Thank you :) Florentine

recount: false
longRead: false
keepNonStandardChrs: false
filter: true (/false)
minExpressionInOneSample: 20
quantileMinExpression: 10
minDeltaPsi: 0.05
implementation: PCA
padjCutoff: 0.1
maxTestedDimensionProportion: 6
genesToTest: null
FRASER_version: "FRASER2"
deltaPsiCutoff : 0.1
quantileForFiltering: 0.75

scharf-f avatar Nov 28 '23 07:11 scharf-f

Hi Florentine, Can you please give us more details on your error? In which step is it failing? Also, if I understand correctly, you only have 57 samples x 49 genes?

vyepez88 avatar Dec 12 '23 15:12 vyepez88

Hello Vicente,

it seems to me to be the exact same error as indicated by the issue opener in the AberrantSplicing_pipeline_FRASER_08_extract_results_FraseR_R step:

-------------------- BAM parameters --------------------
class: ScanBamParam
bamFlag (NA unless specified):
bamSimpleCigar: FALSE
bamReverseComplement: FALSE
bamTag:
bamTagFilter:
bamWhich: 0 ranges
bamWhat:
bamMapqFilter: 0


preparing ...
calculating splice types ...
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'start': subscript contains NAs
Calls: annotatePotentialImpact ... normalizeSingleBracketSubscript -> NSBS -> NSBS -> .subscript_error
In addition: Warning message:
In min(distance(test_junction, refseq.genes), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
Execution halted

and

RuleException:
CalledProcessError in file /tmp/tmpw3oei168, line 157:
Command 'set -euo pipefail;  Rscript --vanilla /mnt/output/drop_dev/.snakemake/scripts/tmp59r6ko_d.08_extract_results_FraseR.R' returned non-zero exit status 1.
 File "/tmp/tmpw3oei168", line 157, in __rule_AberrantSplicing_pipeline_FRASER_08_extract_results_FraseR_R
 File "/root/miniconda3/envs/drop_dev/lib/python3.11/concurrent/futures/thread.py", line 58, in run

Yes, you are understanding correctly, it is 57 x 49.

Thank you very much and kind regards, Flo

scharf-f avatar Dec 13 '23 07:12 scharf-f

Hi Flo, I think the problem is that, probably due to the low number of samples and/or genes, there were no significant results. Can you try setting up the padjcutoff to 1? And maybe also the deltaPsiCutoff to 0.05 or lower? We'll fix DROP to catch that error.

vyepez88 avatar Dec 13 '23 12:12 vyepez88

Actually, we had already fixed this in the dev branch. So also installing the dev branch and rerunning the module should solve your problem

vyepez88 avatar Dec 13 '23 13:12 vyepez88

Hello Vicente,

thank you for your reply.

I have been doing the analyses in the dev branch because I encountered this (or another, not sure anymore) issue in the 1.3.3 version, which could be solved using dev.

I found that it seems like one specific sample is causing the error - if I repeat the analysis will all samples except for that specific one the analysis finished successfully. Do you have any idea on how to solve this (or get down to the bottom of the issue)?

Thanks again! Flo

scharf-f avatar Dec 13 '23 13:12 scharf-f

Super weird. Did you try with all the samples but modifying the cut-offs to include all the significant events?

vyepez88 avatar Dec 13 '23 13:12 vyepez88

I tried running the analysis with filter: true and filter: false and this did not change the outcome.

Additionally, I tested it with your suggested adaptations (deltaPsiCutoff: 0.01 and padjCutoff: 1) and still get the same error.

(I assume when the filter is set to false the adaptations in the cutoffs would anyway not change anything, right?)

scharf-f avatar Dec 14 '23 10:12 scharf-f

The filter parameter shouldn't have any effect in the cut-offs. Can you further lower the deltaPsiCutoff to 0?

vyepez88 avatar Dec 18 '23 14:12 vyepez88

I tried, it didn't change the result unfortunately.

scharf-f avatar Dec 20 '23 08:12 scharf-f

can you try the following?

library(FRASER)
fds <- loadFraserDataSet(file = '{root}/processed_results/aberrant_splicing/datasets/savedObjects/{DROP_GROUP}/fds-object.RDS')
res <- results(fds_mus[1:1000, 1:2], padjCutoff = 1, deltaPsiCutoff = 0)
res

Did you obtain any results?

vyepez88 avatar Dec 20 '23 09:12 vyepez88

Yes, I did:

GRanges object with 138 ranges and 16 metadata columns:
                       seqnames            ranges strand | sampleID hgncSymbol
                          <Rle>         <IRanges>  <Rle> |    <Rle>      <Rle>
  Intron Jaccard Index     chr1 10408141-10411417      + |      NK1       <NA>
  Intron Jaccard Index     chr1 17044889-17053947      + |     NK10       <NA>
  Intron Jaccard Index     chr1 15428010-15428592      + |     NK10       <NA>
  Intron Jaccard Index     chr1 10400573-10404160      + |     NK10       <NA>
  Intron Jaccard Index     chr1   9722357-9722527      + |      NK1       <NA>
                   ...      ...               ...    ... .      ...        ...
  Intron Jaccard Index     chr1 12202179-12206739      + |     NK10       <NA>
  Intron Jaccard Index     chr1 17024075-17027748      + |     NK10       <NA>
  Intron Jaccard Index     chr1 17024075-17027759      + |     NK10       <NA>
  Intron Jaccard Index     chr1 17050649-17053947      + |     NK10       <NA>
  Intron Jaccard Index     chr1 15426019-15428592      + |     NK10       <NA>
                          type    pValue   padjust psiValue  deltaPsi counts
                         <Rle> <numeric> <numeric>    <Rle> <numeric>  <Rle>
  Intron Jaccard Index jaccard  0.017551         1      0.2     -0.77      1
  Intron Jaccard Index jaccard  0.028675         1     0.07     -0.70      3
  Intron Jaccard Index jaccard  0.133430         1     0.02      0.02      1
  Intron Jaccard Index jaccard  0.155090         1     0.02      0.02      1
  Intron Jaccard Index jaccard  0.167130         1     0.29     -0.44      4
                   ...     ...       ...       ...      ...       ...    ...
  Intron Jaccard Index jaccard         1         1        0     -0.01      0
  Intron Jaccard Index jaccard         1         1        1      0.01     16
  Intron Jaccard Index jaccard         1         1        0     -0.01      0
  Intron Jaccard Index jaccard         1         1        0     -0.01      0
  Intron Jaccard Index jaccard         1         1        0      0.00      0
                       totalCounts meanCounts meanTotalCounts nonsplitCounts
                             <Rle>      <Rle>           <Rle>          <Rle>
  Intron Jaccard Index           5         10              12              4
  Intron Jaccard Index          41         13              32             38
  Intron Jaccard Index          47        0.5              35              3
  Intron Jaccard Index          53        0.5              30              0
  Intron Jaccard Index          14          5            12.5             10
                   ...         ...        ...             ...            ...
  Intron Jaccard Index           9          0             9.5              0
  Intron Jaccard Index          16         19              19              0
  Intron Jaccard Index          16          0              19              0
  Intron Jaccard Index          41          0              32             38
  Intron Jaccard Index          75          0              56              3
                       nonsplitProportion nonsplitProportion_99quantile
                                    <Rle>                         <Rle>
  Intron Jaccard Index                0.8                          0.79
  Intron Jaccard Index               0.93                          0.92
  Intron Jaccard Index               0.06                          0.06
  Intron Jaccard Index                  0                             0
  Intron Jaccard Index               0.71                          0.71
                   ...                ...                           ...
  Intron Jaccard Index                  0                             0
  Intron Jaccard Index                  0                             0
  Intron Jaccard Index                  0                             0
  Intron Jaccard Index               0.93                          0.92
  Intron Jaccard Index               0.04                          0.04
                       annotatedJunction            FDR_set
                                   <Rle>        <character>
  Intron Jaccard Index              none transcriptome-wide
  Intron Jaccard Index              none transcriptome-wide
  Intron Jaccard Index              none transcriptome-wide
  Intron Jaccard Index              none transcriptome-wide
  Intron Jaccard Index              none transcriptome-wide
                   ...               ...                ...
  Intron Jaccard Index              none transcriptome-wide
  Intron Jaccard Index              none transcriptome-wide
  Intron Jaccard Index              none transcriptome-wide
  Intron Jaccard Index              none transcriptome-wide
  Intron Jaccard Index              none transcriptome-wide
  -------
  seqinfo: 25 sequences from an unspecified genome

scharf-f avatar Dec 20 '23 09:12 scharf-f

then, can you please follow step by step the 08_extract_results_FraseR script in an R session? For that, first you will need to load the snakemake object by executing: snakemake <- readRDS(log_file) which should look like this: tmp_dir / "AS" / "{dataset}--{annotation}" / "08_results.Rds" Otherwise, we could set up a zoom.

vyepez88 avatar Dec 20 '23 09:12 vyepez88

Yes, I tried it. It crashes in the command res_junc_dt <- annotatePotentialImpact(result=res_junc_dt, txdb=txdb, fds=fds) with the error

preparing ...
calculating splice types ...
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'start': subscript contains NAs
In addition: Warning message:
In min(distance(test_junction, refseq.genes), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf

scharf-f avatar Dec 21 '23 08:12 scharf-f

Hi Flo, sorry this got lost after the holidays. Was it resolved? If not, can you please check that all is good with the GenomicRanges of res_junc_dt? e.g. all junctions have correct chr and valid start and end positions?

vyepez88 avatar Feb 12 '24 08:02 vyepez88