
Genes task failing when importing GTEx TPM file

Open · mattsolo1 opened this issue on Jun 23, 2022 · 11 comments

Running the genes task fails

./deployctl data-pipeline run --cluster <cluster> genes

Stack trace:

2022-06-22 17:06:59,249 - gnomad_data_pipeline - INFO - Running prepare_gtex_v7_expression_data (Output does not exist)
2022-06-22 17:06:59 Hail: WARN: file 'gs://gnomadev-data-pipeline-output/2022-06-22/2/external_sources/gtex/v7/GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz' is 1.8G
  It will be loaded serially (on one core) due to usage of the 'force' argument.
  If it is actually block-gzipped, either rename to .bgz or use the 'force_bgz'
  argument.
2022-06-22 17:07:08 Hail: INFO: Loading <StructExpression of type struct{transcript_id: str, gene_id: str, `GTEX-1117F-0226-SM-5GZZ7`: str, `GTEX-1117F-0426-SM-5EGHI`: str, ...etc, fields. Counts by type:
  str: 11690
2022-06-22 17:07:11 Hail: WARN: file 'gs://gnomadev-data-pipeline-output/2022-06-22/2/external_sources/gtex/v7/GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz' is 1.8G
  It will be loaded serially (on one core) due to usage of the 'force' argument.
  If it is actually block-gzipped, either rename to .bgz or use the 'force_bgz'
  argument.
Traceback (most recent call last):
  File "/tmp/645b5598df4e41dd9a0e24f748f6bb15/genes.py", line 325, in <module>
    run_pipeline(pipeline)
  File "/tmp/645b5598df4e41dd9a0e24f748f6bb15/pyfiles_zdd6fpyt.zip/data_pipeline/pipeline.py", line 197, in run_pipeline
  File "/tmp/645b5598df4e41dd9a0e24f748f6bb15/pyfiles_zdd6fpyt.zip/data_pipeline/pipeline.py", line 164, in run
  File "/tmp/645b5598df4e41dd9a0e24f748f6bb15/pyfiles_zdd6fpyt.zip/data_pipeline/pipeline.py", line 129, in run
  File "/tmp/645b5598df4e41dd9a0e24f748f6bb15/pyfiles_zdd6fpyt.zip/data_pipeline/data_types/gtex_tissue_expression.py", line 14, in prepare_gtex_expression_data
  File "<decorator-gen-1010>", line 2, in export
  File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/table.py", line 1098, in export
    Env.backend().execute(
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 104, in execute
    self._handle_fatal_error_from_backend(e, ir)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/backend.py", line 181, in _handle_fatal_error_from_backend
    raise err
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 31, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: MethodTooLargeException: Method too large: __C19580collect_distributed_array.__m19633split_InsertFields ()V

Java stack trace:
is.hail.relocated.org.objectweb.asm.MethodTooLargeException: Method too large: __C19580collect_distributed_array.__m19633split_InsertFields ()V
	at is.hail.relocated.org.objectweb.asm.MethodWriter.computeMethodInfoSize(MethodWriter.java:2087)
	at is.hail.relocated.org.objectweb.asm.ClassWriter.toByteArray(ClassWriter.java:489)
	at is.hail.lir.Emit$.apply(Emit.scala:217)
	at is.hail.lir.Classx.$anonfun$asBytes$4(X.scala:108)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
	at scala.collection.AbstractIterator.to(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
	at is.hail.lir.Classx.asBytes(X.scala:121)
	at is.hail.asm4s.ClassBuilder.classBytes(ClassBuilder.scala:357)
	at is.hail.asm4s.ModuleBuilder.$anonfun$classesBytes$1(ClassBuilder.scala:151)
	at is.hail.asm4s.ModuleBuilder.$anonfun$classesBytes$1$adapted(ClassBuilder.scala:151)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)
	at scala.collection.AbstractIterator.to(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1431)
	at is.hail.asm4s.ModuleBuilder.classesBytes(ClassBuilder.scala:152)
	at is.hail.expr.ir.EmitClassBuilder.resultWithIndex(EmitClassBuilder.scala:708)
	at is.hail.expr.ir.WrappedEmitClassBuilder.resultWithIndex(EmitClassBuilder.scala:170)
	at is.hail.expr.ir.WrappedEmitClassBuilder.resultWithIndex$(EmitClassBuilder.scala:170)
	at is.hail.expr.ir.EmitFunctionBuilder.resultWithIndex(EmitClassBuilder.scala:1115)
	at is.hail.expr.ir.Emit.$anonfun$emitI$225(Emit.scala:2337)
	at is.hail.expr.ir.IEmitCodeGen.map(Emit.scala:334)
	at is.hail.expr.ir.Emit.emitI(Emit.scala:2278)
	at is.hail.expr.ir.Emit.$anonfun$emitSplitMethod$1(Emit.scala:575)
	at is.hail.expr.ir.Emit.$anonfun$emitSplitMethod$1$adapted(Emit.scala:573)
	at is.hail.expr.ir.EmitCodeBuilder$.scoped(EmitCodeBuilder.scala:18)
	at is.hail.expr.ir.EmitCodeBuilder$.scopedVoid(EmitCodeBuilder.scala:28)
	at is.hail.expr.ir.EmitMethodBuilder.voidWithBuilder(EmitClassBuilder.scala:1048)
	at is.hail.expr.ir.Emit.emitSplitMethod(Emit.scala:573)
	at is.hail.expr.ir.Emit.emitInSeparateMethod(Emit.scala:590)
	at is.hail.expr.ir.Emit.emitI(Emit.scala:777)
	at is.hail.expr.ir.Emit.emitI$1(Emit.scala:614)
	at is.hail.expr.ir.Emit.$anonfun$emitVoid$26(Emit.scala:732)
	at is.hail.expr.ir.TableTextFinalizer.writeMetadata(TableWriter.scala:507)
	at is.hail.expr.ir.Emit.emitVoid(Emit.scala:732)
	at is.hail.expr.ir.Emit.emitVoid$1(Emit.scala:611)
	at is.hail.expr.ir.Emit.$anonfun$emitVoid$5(Emit.scala:628)
	at is.hail.expr.ir.Emit.$anonfun$emitVoid$5$adapted(Emit.scala:628)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at is.hail.expr.ir.Emit.$anonfun$emitVoid$4(Emit.scala:628)
	at is.hail.expr.ir.Emit.$anonfun$emitVoid$4$adapted(Emit.scala:627)
	at is.hail.expr.ir.EmitCodeBuilder$.scoped(EmitCodeBuilder.scala:18)
	at is.hail.expr.ir.EmitCodeBuilder$.scopedVoid(EmitCodeBuilder.scala:28)
	at is.hail.expr.ir.EmitMethodBuilder.voidWithBuilder(EmitClassBuilder.scala:1048)
	at is.hail.expr.ir.Emit.$anonfun$emitVoid$3(Emit.scala:627)
	at is.hail.expr.ir.Emit.$anonfun$emitVoid$3$adapted(Emit.scala:625)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at is.hail.expr.ir.Emit.emitVoid(Emit.scala:625)
	at is.hail.expr.ir.Emit$.$anonfun$apply$3(Emit.scala:70)
	at is.hail.expr.ir.Emit$.$anonfun$apply$3$adapted(Emit.scala:68)
	at is.hail.expr.ir.EmitCodeBuilder$.scoped(EmitCodeBuilder.scala:18)
	at is.hail.expr.ir.EmitCodeBuilder$.scopedVoid(EmitCodeBuilder.scala:28)
	at is.hail.expr.ir.EmitMethodBuilder.voidWithBuilder(EmitClassBuilder.scala:1048)
	at is.hail.expr.ir.Emit$.apply(Emit.scala:68)
	at is.hail.expr.ir.Compile$.apply(Compile.scala:78)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$1(CompileAndEvaluate.scala:50)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:50)
	at is.hail.expr.ir.CompileAndEvaluate$.evalToIR(CompileAndEvaluate.scala:30)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:30)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:15)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:13)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:416)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:452)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:69)
	at is.hail.utils.package$.using(package.scala:640)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:69)
	at is.hail.utils.package$.using(package.scala:640)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:58)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:310)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:449)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:448)
	at sun.reflect.GeneratedMethodAccessor113.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.95-513139587f57
Error summary: MethodTooLargeException: Method too large: __C19580collect_distributed_array.__m19633split_InsertFields ()V

At this stage of the pipeline, we're importing and exporting the GTEx transcript TPM file (https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz)

https://github.com/broadinstitute/gnomad-browser/blob/77f38723663fdecb92c479f1ed46b1c252363d77/data-pipeline/src/data_pipeline/data_types/gtex_tissue_expression.py#L10-L14

I ran this pipeline task on the same file several months ago with Hail 0.2.81, so something must have changed since then (we're now on 0.2.95). Perhaps there are now too many columns for Hail to import; there appears to be one column for every GTEx sample (e.g., GTEX-1117F-2826-SM-5GZXL: str, GTEX-1117F-2926-SM-5GZYI: str, etc.).
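For reference, this is roughly what the step does today. The sketch below is illustrative only (bucket paths and the output name are placeholders, not the exact pipeline code):

```python
import hail as hl

# Placeholder path; the real pipeline derives this from its configuration.
TPM_PATH = "gs://<bucket>/gtex/v7/GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz"

# The source file is gzipped but not block-gzipped, so force=True makes Hail
# read it serially on one core (hence the WARN lines in the log above).
ds = hl.import_table(TPM_PATH, force=True)

# The table ends up with one string field per GTEx sample (~11,690 fields).
ds = ds.key_by("transcript_id")

# Re-exporting the table is where MethodTooLargeException is thrown: Hail
# generates code per field, and ~11,690 fields overflows the JVM's 64 KB
# per-method bytecode limit.
ds.export("gs://<bucket>/tmp/transcript_tpms.tsv.bgz")
```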

Possible solutions

As a near-term workaround, we could adapt the pipeline to use one of the previous successful exports of the table from this step (gtex_v7_tissue_expression.ht), since it doesn't look like the source file has been updated anyway.

In parallel, file a bug report with Hail.
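The workaround would amount to something like this (the bucket path is a placeholder for wherever the previous export lives):

```python
import hail as hl

# Read a previous successful output of this step instead of regenerating it.
tissue_expression = hl.read_table("gs://<bucket>/gtex/v7/gtex_v7_tissue_expression.ht")
```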

mattsolo1 avatar Jun 23 '22 13:06 mattsolo1

@mattsolo1 FYI I am currently investigating this

phildarnowsky-broad avatar Jun 24 '22 17:06 phildarnowsky-broad

Btw, just noticed Hail 0.2.96 was released today... could be worth trying again after an update

mattsolo1 avatar Jun 24 '22 17:06 mattsolo1

OK, nice. I'll try that before I go any further on this, and I suppose we'd want to keep up to date on Hail in any case.


phildarnowsky-broad avatar Jun 24 '22 17:06 phildarnowsky-broad

No, same error with 0.2.96. On a related note, I had thought the Hail version might be specified in data-pipeline/requirements.txt, but it's not. Is that something we want, or is there a reason to leave it out?


phildarnowsky-broad avatar Jun 24 '22 18:06 phildarnowsky-broad

Yeah, makes sense to include the Hail version.

mattsolo1 avatar Jun 24 '22 20:06 mattsolo1

(note to self so I remember this come Monday morning)

What this step is fundamentally doing is taking tabular data represented in a Table and building a MatrixTable with all the same data, just organized differently. This export/re-import dance is a roundabout way of doing that, with the advantage that it leverages a lot of pre-existing code; however, it runs into the problem we see here.

I'm thinking there may be a more direct way to load this data into the MatrixTable we ultimately want. It might work to read everything into a Table as we do now, but then convert it into the MatrixTable without the export/import machinery, instead explicitly modelling the transformation as steps that can be parallelized (a sketch follows below). Or maybe we don't need the intermediate Table at all and can build our MatrixTable straight from the TSV file. In any case, I'll see whether and how these could be done in Hail.

I'm also concerned that, if we're running into this problem here, there may be similar problems in other parts of the pipeline where this MatrixTable or associated entities hit a resource limit. Hard to say without being able to get past this point in the pipeline, but it shouldn't surprise us if it happens.
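A sketch of that more direct Table-to-MatrixTable idea, using Table.to_matrix_table_row_major to pivot the wide table in place (untested; whether it also trips the code-generation limit at ~11,690 columns would need to be verified):

```python
import hail as hl

# Placeholder path, as in the sketch above.
TPM_PATH = "gs://<bucket>/gtex/v7/GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz"

ht = hl.import_table(TPM_PATH, force=True)
ht = ht.key_by("transcript_id")

# Every non-key field except gene_id is a per-sample TPM column.
sample_fields = [f for f in ht.row_value.dtype.fields if f != "gene_id"]

# Pivot: rows = transcripts, columns = samples, entries = TPM values
# (still strings at this point; they'd need a cast before use).
mt = ht.to_matrix_table_row_major(
    columns=sample_fields,
    entry_field_name="tpm",
    col_field_name="sample_id",
)
```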


phildarnowsky-broad avatar Jun 25 '22 16:06 phildarnowsky-broad

It looks like the crux of the problem is the number of columns: when I load the TSV and then throw away all but a handful of columns, this step succeeds and the pipeline continues.
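Something along these lines (not the exact code I ran; the sample names are borrowed from the log above):

```python
import hail as hl

# Placeholder path, as in the earlier sketches.
TPM_PATH = "gs://<bucket>/gtex/v7/GTEx_Analysis_2016-01-15_v7_RSEMv1.2.22_transcript_tpm.txt.gz"

ht = hl.import_table(TPM_PATH, force=True)

# Keep only a handful of the ~11,690 per-sample columns; with the table this
# narrow, the generated code stays under the method-size limit and the
# export succeeds.
keep = ["transcript_id", "gene_id", "GTEX-1117F-0226-SM-5GZZ7", "GTEX-1117F-0426-SM-5EGHI"]
ht = ht.select(*keep)
ht.export("gs://<bucket>/tmp/transcript_tpms_narrow.tsv.bgz")
```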

phildarnowsky-broad avatar Jun 27 '22 12:06 phildarnowsky-broad

Perhaps you could modify the genes pipeline task to use the output table from this step (gtex_v7_tissue_expression.ht) as an input?

Since the input GTEx file doesn't seem to change often, it may not be worth the effort of rewriting this pipeline step right now, especially since future releases of GTEx expression results may have a different format anyway.

And in the meantime, perhaps file a Hail bug report?

mattsolo1 avatar Jun 27 '22 14:06 mattsolo1

Per https://github.com/hail-is/hail/issues/11972, this is ultimately due to a known Hail bug with wide Tables that, for strategic reasons, will probably not be fixed. The Hail team suggested rewriting this task using import_matrix_table.
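For illustration, that suggestion would look something like this (untested; the path is a placeholder, and per the earlier Hail warning the gzipped source would likely need to be re-encoded as block gzip, or imported with force_bgz, before it can be read in parallel):

```python
import hail as hl

# import_matrix_table treats the per-sample columns as entries rather than as
# row fields, so the generated code no longer scales with the number of samples.
mt = hl.import_matrix_table(
    "gs://<bucket>/gtex/v7/transcript_tpm.tsv.bgz",
    row_fields={"transcript_id": hl.tstr, "gene_id": hl.tstr},
    row_key=["transcript_id"],
    entry_type=hl.tfloat64,
)
```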

phildarnowsky-broad avatar Jul 05 '22 13:07 phildarnowsky-broad

That makes sense. In that case, should we close the issue on their repo?

mattsolo1 avatar Jul 05 '22 14:07 mattsolo1

@mattsolo1 yeah that makes sense, will do

phildarnowsky-broad avatar Jul 05 '22 14:07 phildarnowsky-broad

This issue is effectively closed by #1178 and #1269, in that an intermediate Hail table is now hard-coded:

https://github.com/broadinstitute/gnomad-browser/blob/4e80462dc391448d8d188e3773ee746d07463a3e/data-pipeline/src/data_pipeline/pipelines/genes.py#L253

However, the genes pipeline is no longer reproducible, since it references our private data pipeline bucket.

@rileyhgrant @ch-kr could this hail table be moved to the gnomAD public bucket (the same one we host the downloads page from)?

mattsolo1 avatar Nov 13 '23 14:11 mattsolo1

that makes sense to me, I think the team and community would benefit from using this resource. one question: how often does this HT get updated? if we regularly replace it, then copying this into the public bucket might not be a good idea (we can't delete old data)

ch-kr avatar Nov 13 '23 14:11 ch-kr

> that makes sense to me, I think the team and community would benefit from using this resource. one question: how often does this HT get updated? if we regularly replace it, then copying this into the public bucket might not be a good idea (we can't delete old data)

The GTEx and pext Hail tables referenced here never get updated; they're each a specific release of those resources, and our pipeline just reshapes the data.

If we want newer versions of these resources, we'll need to build new Hail tables, but these two specific Hail tables, for these specific versions, will never be updated.

rileyhgrant avatar Nov 13 '23 15:11 rileyhgrant

thanks for the context! copying sounds good to me

ch-kr avatar Nov 13 '23 15:11 ch-kr