
Data_Preparation.ipynb cannot be opened

kong0706 opened this issue 3 years ago · 8 comments

As stated in the title, when I try to open Data_Preparation.ipynb, the web page shows "Sorry, something went wrong."

kong0706 · Aug 19 '21 02:08

To the best of my understanding, this is a GitHub issue: the server is failing to serve the contents of the notebook.

patronov · Aug 23 '21 06:08

File sent.

patronov · Aug 25 '21 16:08

@patronov When I use Jupyter Notebook to open Data_Preparation.ipynb, I get:

    Unreadable Notebook: /home/xzhang/projects/ReinventCommunity/notebooks/Data_Preparation.ipynb
    NotJSONError("Notebook does not appear to be JSON: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0...")
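For reference, a quick local check that the file on disk is not valid JSON (a minimal sketch; the path is the one from the error message above):

```python
# Hedged sketch: inspect the notebook file that Jupyter refuses to load.
import json

path = "/home/xzhang/projects/ReinventCommunity/notebooks/Data_Preparation.ipynb"

# A healthy .ipynb file is plain JSON and starts with b'{'; a corrupted or
# binary file (as in the error above, full of null bytes) will not.
with open(path, "rb") as fh:
    print(fh.read(32))

# json.load raises json.JSONDecodeError if the content is not valid JSON.
with open(path, encoding="utf-8") as fh:
    json.load(fh)
```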

xuzhang5788 · Aug 28 '21 07:08

Thanks for letting us know. It seems that the files got corrupted and there are also versioning inaccuracies. We will try to sort this out as soon as we can. In the meantime, please use 'reinvent.v3.0' instead of the old 'reinvent_shared.v2.1' that the tutorials wrongly point to.
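If it is unclear which environment a notebook kernel is actually running in, a quick check is shown below (a minimal sketch; it assumes 'reinvent.v3.0' and 'reinvent_shared.v2.1' are conda environment names, so the interpreter path should contain the expected name):

```python
# Hedged sketch: print the interpreter the kernel is using; the path should
# contain the expected environment name, e.g. ".../envs/reinvent.v3.0/bin/python".
import sys
print(sys.executable)
```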

patronov · Aug 30 '21 14:08

I ran into the same problem as well. Could you give us some guidance on how to prepare the data?

qili2244 · Aug 31 '21 02:08

I have updated the notebook file and it should open fine now. Could you please check and let me know if it still causes issues? Thanks.

patronov · Sep 08 '21 17:09

@patronov Thank you so much for the update. I can open Data_Preparation.ipynb now. However, when I tried to run it, I found that I needed to install pyspark and molvs, so I installed both with pip. Everything went well until section 2.1 Data purging. After I run the cell below:

    num_atoms_dist = chembl_annotated_df \
        .groupBy("num_atoms") \
        .agg(psf.count("num_atoms").alias("num")) \
        .withColumn("percent", psf.lit(100.0)*psf.col("num")/chembl_annotated_df.count()) \
        .sort("num_atoms", ascending=False) \
        .toPandas()

    num_atoms_dist.plot(x="num_atoms", y="percent", xlim=(0, 100), lw=3)

I got the following error message. I checked the version of py4j, which is 0.10.9.

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-...> in <module>
      2     .groupBy("num_atoms")
      3     .agg(psf.count("num_atoms").alias("num"))
----> 4     .withColumn("percent", psf.lit(100.0)*psf.col("num")/chembl_annotated_df.count())
      5     .sort("num_atoms", ascending=False)
      6     .toPandas()

~/miniconda3/envs/ReinventCommunity/lib/python3.7/site-packages/pyspark/sql/dataframe.py in count(self)
    662         2
    663         """
--> 664         return int(self._jdf.count())
    665
    666     def collect(self):

~/miniconda3/envs/ReinventCommunity/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306
   1307         for temp_arg in temp_args:

~/miniconda3/envs/ReinventCommunity/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

~/miniconda3/envs/ReinventCommunity/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o154.count. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 7.0 failed 1 times, most recent failure: Lost task 6.0 in stage 7.0 (TID 5212) (192.168.0.25 executor driver): java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) at org.apache.spark.sql.execution.columnar.ColumnBuilder$.ensureFreeSpace(ColumnBuilder.scala:160) at org.apache.spark.sql.execution.columnar.BasicColumnBuilder.appendFrom(ColumnBuilder.scala:71) at org.apache.spark.sql.execution.columnar.ComplexColumnBuilder.org$apache$spark$sql$execution$columnar$NullableColumnBuilder$$super$appendFrom(ColumnBuilder.scala:91) at org.apache.spark.sql.execution.columnar.NullableColumnBuilder.appendFrom(NullableColumnBuilder.scala:61) at org.apache.spark.sql.execution.columnar.NullableColumnBuilder.appendFrom$(NullableColumnBuilder.scala:54) at org.apache.spark.sql.execution.columnar.ComplexColumnBuilder.appendFrom(ColumnBuilder.scala:91) at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:104) at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:79) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299) at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423) at org.apache.spark.storage.BlockManager$$Lambda$953/1670082431.apply(Unknown Source) at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384) at org.apache.spark.rdd.RDD.iterator(RDD.scala:335) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261) at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) at org.apache.spark.rdd.RDD.collect(RDD.scala:1029) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390) at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3006) at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3005) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685) at org.apache.spark.sql.Dataset.count(Dataset.scala:3005) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: 
java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) at org.apache.spark.sql.execution.columnar.ColumnBuilder$.ensureFreeSpace(ColumnBuilder.scala:160) at org.apache.spark.sql.execution.columnar.BasicColumnBuilder.appendFrom(ColumnBuilder.scala:71) at org.apache.spark.sql.execution.columnar.ComplexColumnBuilder.org$apache$spark$sql$execution$columnar$NullableColumnBuilder$$super$appendFrom(ColumnBuilder.scala:91) at org.apache.spark.sql.execution.columnar.NullableColumnBuilder.appendFrom(NullableColumnBuilder.scala:61) at org.apache.spark.sql.execution.columnar.NullableColumnBuilder.appendFrom$(NullableColumnBuilder.scala:54) at org.apache.spark.sql.execution.columnar.ComplexColumnBuilder.appendFrom(ColumnBuilder.scala:91) at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:104) at org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer$$anon$1.next(InMemoryRelation.scala:79) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299) at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423) at org.apache.spark.storage.BlockManager$$Lambda$953/1670082431.apply(Unknown Source) at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384) at org.apache.spark.rdd.RDD.iterator(RDD.scala:335) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)

xuzhang5788 · Sep 08 '21 22:09

This error message indicates that you don't have enough memory allocated to the Spark process, or that your hardware doesn't have the requested amount of memory. It is best to look up how to dedicate a reasonable amount of RAM to pyspark, such that it doesn't exceed your actual resources.
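For anyone hitting the same OutOfMemoryError, the sketch below shows one common way to give the Spark driver more memory when pyspark runs locally inside a notebook. The memory values and the app name are placeholders, not recommendations; pick numbers that fit your machine, and restart the kernel first, because spark.driver.memory only takes effect if it is set before the JVM is started.

```python
# Hedged sketch: create the SparkSession with a larger driver heap.
# Adjust "8g"/"4g" to your available RAM; "data_preparation" is a placeholder name.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("data_preparation")
    .config("spark.driver.memory", "8g")          # driver heap; must be set before the JVM starts
    .config("spark.driver.maxResultSize", "4g")   # limit for results collected to the driver
    .getOrCreate()
)
```

If the notebook already builds its own SparkSession, the same config keys can be set there instead; alternatively, reducing the amount of data held in memory (for example by filtering the DataFrame before calling count() or toPandas()) also lowers the pressure on the heap.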

patronov · Sep 23 '21 23:09