Data_Preparation.ipynb cannot be opened
As stated in the title, when I try to open Data_Preparation.ipynb, the web page shows "Sorry, something went wrong."
To the best of my understanding, this is a GitHub issue: the server is failing to serve the contents of the notebook.
File sent.
@patronov When I use Jupyter Notebook to open Data_Preparation.ipynb, I get: `Unreadable Notebook: /home/xzhang/projects/ReinventCommunity/notebooks/Data_Preparation.ipynb NotJSONError("Notebook does not appear to be JSON: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0...")`
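As a quick sanity check that the file itself is corrupted (rather than Jupyter misbehaving), one can try to parse it as plain JSON, which is all an .ipynb file is. A minimal sketch, using my local path:

```python
# A healthy .ipynb is plain JSON, so json.load() should succeed;
# a zeroed-out file fails immediately with "Expecting value".
import json

path = "/home/xzhang/projects/ReinventCommunity/notebooks/Data_Preparation.ipynb"

with open(path, "rb") as fh:
    print(fh.read(16))   # b'\x00\x00\x00...' confirms the file is NUL-filled

with open(path, encoding="utf-8") as fh:
    json.load(fh)        # raises json.JSONDecodeError for the corrupted file
```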
Thanks for letting us know. It seems that the files got corrupted and there are also versioning inaccuracies. We will try to sort this out as soon as we can. In the meantime, please use the 'reinvent.v3.0' environment instead of the old 'reinvent_shared.v2.1', which is wrongly referenced in the tutorials.
I ran into the same problem. Could you give us some guidance on how to prepare the data?
I have updated the notebook file and it should open fine now. Could you please check and let me know if it still causes issues? Thanks.
@patronov
Thank you so much for the update. I can now open Data_Preparation.ipynb.
However, when I tried to run it, I found I needed to install pyspark and molvs, so I installed both with pip. Everything went well until section 2.1, Data purging. After I run the cell below:
```python
num_atoms_dist = (
    chembl_annotated_df
    .groupBy("num_atoms")
    .agg(psf.count("num_atoms").alias("num"))
    .withColumn("percent", psf.lit(100.0) * psf.col("num") / chembl_annotated_df.count())
    .sort("num_atoms", ascending=False)
    .toPandas()
)
num_atoms_dist.plot(x="num_atoms", y="percent", xlim=(0, 100), lw=3)
```
I get the following error message. I checked the version of py4j, which is 0.10.9.
```
Py4JJavaError                             Traceback (most recent call last)
      3     .agg(psf.count("num_atoms").alias("num"))
----> 4     .withColumn("percent", psf.lit(100.0)*psf.col("num")/chembl_annotated_df.count())
      5     .sort("num_atoms", ascending=False)
      6     .toPandas()

~/miniconda3/envs/ReinventCommunity/lib/python3.7/site-packages/pyspark/sql/dataframe.py in count(self)
    662         2
    663         """
--> 664         return int(self._jdf.count())
    665
    666     def collect(self):

~/miniconda3/envs/ReinventCommunity/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
--> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306
   1307         for temp_arg in temp_args:

~/miniconda3/envs/ReinventCommunity/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

~/miniconda3/envs/ReinventCommunity/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o154.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 7.0 failed 1 times, most recent failure: Lost task 6.0 in stage 7.0 (TID 5212) (192.168.0.25 executor driver): java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
    at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:390)
    at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3006)
    at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3005)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
    at org.apache.spark.sql.Dataset.count(Dataset.scala:3005)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>
```
This error message indicates that you don't have enough memory allocated to the process, or that your hardware simply doesn't have the requested amount. It is best to look into how to dedicate a reasonable amount of RAM to pyspark, such that it doesn't exceed your actual resources.
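For example, the driver memory can be set when the Spark session is first created at the top of the notebook. A minimal sketch, assuming a local-mode session; the 8g/4g values are assumptions to adjust to your machine:

```python
# In local mode the executors run inside the driver JVM, so
# spark.driver.memory is the setting that matters. It only takes effect
# when the JVM is launched, i.e. it must be set before any Spark action
# has run. The 8g/4g values below are assumptions; pick values that fit
# your hardware.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                           # use all local cores
    .config("spark.driver.memory", "8g")          # JVM heap for the driver
    .config("spark.driver.maxResultSize", "4g")   # cap on collected results
    .appName("Data_Preparation")
    .getOrCreate()
)
```

If a session has already been started in the notebook, stop it with `spark.stop()` and restart the kernel so that a new JVM is launched with the updated memory setting.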