
Use memory-safe functions

Open · samos123 opened this issue 2 years ago · 1 comment

@sam-h-bean reported a potential memory leak, and this PR is a blind shot at fixing it. It uses memory-safe functions to read the data and makes a copy of the InternalRow before using it. This is a similar mechanism to the one used by Log4j and in some of the included OSS sources.
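For context, here is a minimal sketch of the idea (not the actual WeaviateDataWriter code): Spark's DataSource V2 API may reuse the same InternalRow instance across write() calls, so a writer that buffers rows for batching must copy each row before storing it. The names CopyingDataWriter, batchSize, and RowsCommitted below are illustrative, not taken from this connector.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

import scala.collection.mutable.ArrayBuffer

// Marker commit message; the real connector defines its own.
case object RowsCommitted extends WriterCommitMessage

class CopyingDataWriter(batchSize: Int) extends DataWriter[InternalRow] {
  private val buffer = ArrayBuffer.empty[InternalRow]

  override def write(record: InternalRow): Unit = {
    // record.copy() materializes the row, so the buffered reference stays
    // valid even after Spark recycles the underlying InternalRow instance.
    buffer += record.copy()
    if (buffer.size >= batchSize) flush()
  }

  private def flush(): Unit = {
    // ... send the buffered rows to Weaviate here ...
    buffer.clear()
  }

  override def commit(): WriterCommitMessage = { flush(); RowsCommitted }
  override def abort(): Unit = buffer.clear()
  override def close(): Unit = ()
}
```

Without the copy, every buffered entry can end up pointing at the same recycled row, which would explain corrupted or null data surfacing later in batch processing, such as in the partition call below.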

Errors observed by @sam-h-bean:

java.lang.NullPointerException
	at scala.collection.mutable.ArrayOps$ofRef$.newBuilder$extension(ArrayOps.scala:202)
	at scala.collection.mutable.ArrayOps$ofRef.newBuilder(ArrayOps.scala:198)
	at scala.collection.TraversableLike.partition(TraversableLike.scala:449)
	at scala.collection.TraversableLike.partition$(TraversableLike.scala:448)
	at scala.collection.mutable.ArrayOps$ofRef.partition(ArrayOps.scala:198)
	at io.weaviate.spark.WeaviateDataWriter.writeBatch(WeaviateDataWriter.scala:45)
	at io.weaviate.spark.WeaviateDataWriter.write(WeaviateDataWriter.scala:27)
	at io.weaviate.spark.WeaviateDataWriter.write(WeaviateDataWriter.scala:15)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:452)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1731)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:490)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:391)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:169)
	at org.apache.spark.scheduler.Task.$anonfun$run$4(Task.scala:137)
	at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:125)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:137)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:96)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:902)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1697)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:905)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:760)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

and another, from the Weaviate server logs:

{"description":"An I/O timeout occurs when the request takes longer than the specified server-side timeout.","error":"write tcp 10.244.2.15:8080-\u003e10.224.0.6:47706: i/o timeout","hint":"Either try increasing the server-side timeout using e.g. '--write-timeout=600s' as a command line flag when starting Weaviate, or try sending a computationally cheaper request, for example by reducing a batch size, reducing a limit, using less complex filters, etc. Note that this error is only thrown if client-side and server-side timeouts are not in sync, more precisely if the client-side timeout is longer than the server side timeout.","level":"error","method":"POST","msg":"i/o timeout","path":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/v1/batch/objects","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"time":"2023-02-26T04:46:49Z"}

samos123 · Feb 25 '23 17:02

Great to see you again! Thanks for the contribution.

beep boop - the SeMI bot 👋🤖

PS:
Are you already a member of the Weaviate Slack channel?

weaviate-git-bot · Feb 25 '23 22:02