v7.0.0

johngrimes opened this issue 1 year ago · 11 comments

Resolves #1784.

johngrimes · Apr 17 '24 01:04

@piotrszul I had a go at upgrading to Spark 3.5.0, and it looks like there are some breaking changes to the Catalyst API 😞

https://github.com/aehrc/pathling/actions/runs/8714691134/job/23905347873#step:5:2093

johngrimes · Apr 17 '24 01:04

@piotrszul Would you mind taking a look at the problem with the benchmark runner hanging?

I think it may have something to do with the Spring Test integration that you wrote.

johngrimes · Apr 23 '24 09:04

These problems remain to be resolved before we release this:

  • [x] Benchmark runner hangs indefinitely
  • [x] Performance issues with reference encoding change
  • [x] Classpath issues for the Python library
  • [x] Issues with the UI

johngrimes · Apr 29 '24 01:04

I cannot reproduce the Python classpath issue on a clean build (everything seems to be working fine). @johngrimes, could you please check with a clean build?

piotrszul · Apr 29 '24 04:04

@piotrszul I think you're right; I may have been using an older build.

However, I have now fixed that and I am getting a new error:

:: loading settings :: url = jar:file:/opt/homebrew/Caskroom/miniconda/base/envs/notebooks/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/gri306/.ivy2/cache
The jars for the packages stored in: /Users/gri306/.ivy2/jars
au.csiro.pathling#library-runtime added as a dependency
io.delta#delta-spark_2.12 added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-d50713e6-dcb5-478b-9a89-6aa236b026cc;1.0
	confs: [default]
	found au.csiro.pathling#library-runtime;7.0.0-SNAPSHOT in local-m2-cache
	found io.delta#delta-spark_2.12;3.1.0 in local-m2-cache
	found io.delta#delta-storage;3.1.0 in local-m2-cache
	found org.antlr#antlr4-runtime;4.9.3 in local-m2-cache
	found org.apache.hadoop#hadoop-aws;3.3.4 in local-m2-cache
	found com.amazonaws#aws-java-sdk-bundle;1.12.262 in local-m2-cache
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in local-m2-cache
downloading file:/Users/gri306/.m2/repository/au/csiro/pathling/library-runtime/7.0.0-SNAPSHOT/library-runtime-7.0.0-SNAPSHOT.jar ...
	[SUCCESSFUL ] au.csiro.pathling#library-runtime;7.0.0-SNAPSHOT!library-runtime.jar (38ms)
downloading file:/Users/gri306/.m2/repository/io/delta/delta-spark_2.12/3.1.0/delta-spark_2.12-3.1.0.jar ...
	[SUCCESSFUL ] io.delta#delta-spark_2.12;3.1.0!delta-spark_2.12.jar (9ms)
downloading file:/Users/gri306/.m2/repository/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar ...
	[SUCCESSFUL ] org.apache.hadoop#hadoop-aws;3.3.4!hadoop-aws.jar (9ms)
downloading file:/Users/gri306/.m2/repository/io/delta/delta-storage/3.1.0/delta-storage-3.1.0.jar ...
	[SUCCESSFUL ] io.delta#delta-storage;3.1.0!delta-storage.jar (7ms)
downloading file:/Users/gri306/.m2/repository/org/antlr/antlr4-runtime/4.9.3/antlr4-runtime-4.9.3.jar ...
	[SUCCESSFUL ] org.antlr#antlr4-runtime;4.9.3!antlr4-runtime.jar (2ms)
downloading file:/Users/gri306/.m2/repository/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar ...
	[SUCCESSFUL ] com.amazonaws#aws-java-sdk-bundle;1.12.262!aws-java-sdk-bundle.jar (164ms)
downloading file:/Users/gri306/.m2/repository/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar ...
	[SUCCESSFUL ] org.wildfly.openssl#wildfly-openssl;1.0.7.Final!wildfly-openssl.jar (20ms)
:: resolution report :: resolve 24555ms :: artifacts dl 256ms
	:: modules in use:
	au.csiro.pathling#library-runtime;7.0.0-SNAPSHOT from local-m2-cache in [default]
	com.amazonaws#aws-java-sdk-bundle;1.12.262 from local-m2-cache in [default]
	io.delta#delta-spark_2.12;3.1.0 from local-m2-cache in [default]
	io.delta#delta-storage;3.1.0 from local-m2-cache in [default]
	org.antlr#antlr4-runtime;4.9.3 from local-m2-cache in [default]
	org.apache.hadoop#hadoop-aws;3.3.4 from local-m2-cache in [default]
	org.wildfly.openssl#wildfly-openssl;1.0.7.Final from local-m2-cache in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   7   |   7   |   7   |   0   ||   7   |   7   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-d50713e6-dcb5-478b-9a89-6aa236b026cc
	confs: [default]
	1 artifacts copied, 6 already retrieved (56724kB/51ms)
24/04/30 05:31:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/30 05:31:18 WARN SparkSession: Cannot use io.delta.sql.DeltaSparkSessionExtension to configure session extensions.
java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/analysis/UnresolvedLeafNode
	at java.base/java.lang.ClassLoader.defineClass1(Native Method)
	at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
	at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
	at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)
	at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)
	at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:421)
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:420)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
	at java.base/java.lang.ClassLoader.defineClass1(Native Method)
	at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
	at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
	at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)
	at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)
	at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:421)
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:420)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
	at java.base/java.lang.ClassLoader.defineClass1(Native Method)
	at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
	at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
	at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)
	at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)
	at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:421)
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:420)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
	at org.apache.spark.sql.delta.DeltaTableValueFunctions$.getTableValueFunctionInjection(DeltaTableValueFunctions.scala:58)
	at io.delta.sql.DeltaSparkSessionExtension.$anonfun$apply$14(DeltaSparkSessionExtension.scala:167)
	at io.delta.sql.DeltaSparkSessionExtension.$anonfun$apply$14$adapted(DeltaSparkSessionExtension.scala:165)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at io.delta.sql.DeltaSparkSessionExtension.apply(DeltaSparkSessionExtension.scala:165)
	at io.delta.sql.DeltaSparkSessionExtension.apply(DeltaSparkSessionExtension.scala:81)
	at org.apache.spark.sql.SparkSession$.$anonfun$applyExtensions$1(SparkSession.scala:1297)
	at org.apache.spark.sql.SparkSession$.$anonfun$applyExtensions$1$adapted(SparkSession.scala:1292)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$applyExtensions(SparkSession.scala:1292)
	at org.apache.spark.sql.SparkSession.<init>(SparkSession.scala:107)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.analysis.UnresolvedLeafNode
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
	... 56 more
24/04/30 05:31:19 WARN SimpleFunctionRegistry: The function date_diff replaced a previously registered function.
Traceback (most recent call last):
  File "/Users/gri306/Library/CloudStorage/OneDrive-CSIRO/Documents/Projects/Pathling/notebooks/test-data.py", line 9, in <module>
    result = data.extract(
             ^^^^^^^^^^^^^
  File "/Users/gri306/Code/pathling/lib/python/pathling/datasource.py", line 81, in extract
    return ExtractQuery(resource_type, columns, filters).execute(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gri306/Code/pathling/lib/python/pathling/query.py", line 87, in execute
    return resolved_data_source._wrap_df(jquery.execute())
                                         ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/notebooks/lib/python3.11/site-packages/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
                   ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/notebooks/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py", line 169, in deco
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/notebooks/lib/python3.11/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o76.execute.
: com.google.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/analysis/UnresolvedLeafNode
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
	at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
	at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
	at org.apache.spark.sql.delta.DeltaLog$.getDeltaLogFromCache$1(DeltaLog.scala:790)
	at org.apache.spark.sql.delta.DeltaLog$.apply(DeltaLog.scala:800)
	at org.apache.spark.sql.delta.DeltaLog$.apply(DeltaLog.scala:710)
	at org.apache.spark.sql.delta.DeltaLog$.forTable(DeltaLog.scala:672)
	at io.delta.tables.DeltaTable$.isDeltaTable(DeltaTable.scala:811)
	at io.delta.tables.DeltaTable.isDeltaTable(DeltaTable.scala)
	at au.csiro.pathling.io.FileSystemPersistence.exists(FileSystemPersistence.java:64)
	at au.csiro.pathling.io.Database.getMaybeNonExistentDeltaTable(Database.java:260)
	at au.csiro.pathling.io.Database.read(Database.java:135)
	at au.csiro.pathling.library.io.source.DatabaseSource.read(DatabaseSource.java:42)
	at au.csiro.pathling.fhirpath.ResourcePath.build(ResourcePath.java:116)
	at au.csiro.pathling.fhirpath.ResourcePath.build(ResourcePath.java:90)
	at au.csiro.pathling.extract.ExtractQueryExecutor.buildQuery(ExtractQueryExecutor.java:58)
	at au.csiro.pathling.library.query.QueryDispatcher.dispatch(QueryDispatcher.java:56)
	at au.csiro.pathling.library.query.ExtractQuery.execute(ExtractQuery.java:101)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/analysis/UnresolvedLeafNode
	at java.base/java.lang.ClassLoader.defineClass1(Native Method)
	at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
	at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
	at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)
	at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)
	at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:421)
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:420)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
	at java.base/java.lang.ClassLoader.defineClass1(Native Method)
	at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1012)
	at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
	at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)
	at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)
	at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:421)
	at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:420)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
	at org.apache.spark.sql.delta.DeltaAnalysis$$anonfun$apply$1.applyOrElse(DeltaAnalysis.scala:270)
	at org.apache.spark.sql.delta.DeltaAnalysis$$anonfun$apply$1.applyOrElse(DeltaAnalysis.scala:81)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$2(AnalysisHelper.scala:170)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$1(AnalysisHelper.scala:170)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:323)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning(AnalysisHelper.scala:168)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning$(AnalysisHelper.scala:164)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:160)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:159)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:31)
	at org.apache.spark.sql.delta.DeltaAnalysis.apply(DeltaAnalysis.scala:81)
	at org.apache.spark.sql.delta.DeltaAnalysis.apply(DeltaAnalysis.scala:76)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:222)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:91)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:219)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:211)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:211)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:228)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:224)
	at org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:173)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:224)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:188)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:182)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:182)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:209)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:208)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:76)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:202)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:202)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:201)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:76)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
	at org.apache.spark.sql.delta.DeltaLog.loadIndex(DeltaLog.scala:171)
	at org.apache.spark.sql.delta.Snapshot.$anonfun$protocolAndMetadataReconstruction$2(Snapshot.scala:225)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.delta.Snapshot.protocolAndMetadataReconstruction(Snapshot.scala:225)
	at org.apache.spark.sql.delta.Snapshot.x$1$lzycompute(Snapshot.scala:138)
	at org.apache.spark.sql.delta.Snapshot.x$1(Snapshot.scala:133)
	at org.apache.spark.sql.delta.Snapshot._metadata$lzycompute(Snapshot.scala:133)
	at org.apache.spark.sql.delta.Snapshot._metadata(Snapshot.scala:133)
	at org.apache.spark.sql.delta.Snapshot.metadata(Snapshot.scala:202)
	at org.apache.spark.sql.delta.Snapshot.toString(Snapshot.scala:416)
	at java.base/java.lang.String.valueOf(String.java:4218)
	at java.base/java.lang.StringBuilder.append(StringBuilder.java:173)
	at org.apache.spark.sql.delta.Snapshot.$anonfun$new$1(Snapshot.scala:419)
	at org.apache.spark.sql.delta.Snapshot.$anonfun$logInfo$1(Snapshot.scala:396)
	at org.apache.spark.internal.Logging.logInfo(Logging.scala:60)
	at org.apache.spark.internal.Logging.logInfo$(Logging.scala:59)
	at org.apache.spark.sql.delta.Snapshot.logInfo(Snapshot.scala:396)
	at org.apache.spark.sql.delta.Snapshot.<init>(Snapshot.scala:419)
	at org.apache.spark.sql.delta.SnapshotManagement.$anonfun$createSnapshot$2(SnapshotManagement.scala:503)
	at org.apache.spark.sql.delta.SnapshotManagement.createSnapshotFromGivenOrEquivalentLogSegment(SnapshotManagement.scala:628)
	at org.apache.spark.sql.delta.SnapshotManagement.createSnapshotFromGivenOrEquivalentLogSegment$(SnapshotManagement.scala:616)
	at org.apache.spark.sql.delta.DeltaLog.createSnapshotFromGivenOrEquivalentLogSegment(DeltaLog.scala:72)
	at org.apache.spark.sql.delta.SnapshotManagement.createSnapshot(SnapshotManagement.scala:496)
	at org.apache.spark.sql.delta.SnapshotManagement.createSnapshot$(SnapshotManagement.scala:489)
	at org.apache.spark.sql.delta.DeltaLog.createSnapshot(DeltaLog.scala:72)
	at org.apache.spark.sql.delta.SnapshotManagement.$anonfun$createSnapshotAtInitInternal$1(SnapshotManagement.scala:450)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.delta.SnapshotManagement.createSnapshotAtInitInternal(SnapshotManagement.scala:447)
	at org.apache.spark.sql.delta.SnapshotManagement.createSnapshotAtInitInternal$(SnapshotManagement.scala:444)
	at org.apache.spark.sql.delta.DeltaLog.createSnapshotAtInitInternal(DeltaLog.scala:72)
	at org.apache.spark.sql.delta.SnapshotManagement.$anonfun$getSnapshotAtInit$1(SnapshotManagement.scala:439)
	at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile(DeltaLogging.scala:140)
	at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile$(DeltaLogging.scala:138)
	at org.apache.spark.sql.delta.DeltaLog.recordFrameProfile(DeltaLog.scala:72)
	at org.apache.spark.sql.delta.SnapshotManagement.getSnapshotAtInit(SnapshotManagement.scala:434)
	at org.apache.spark.sql.delta.SnapshotManagement.getSnapshotAtInit$(SnapshotManagement.scala:433)
	at org.apache.spark.sql.delta.DeltaLog.getSnapshotAtInit(DeltaLog.scala:72)
	at org.apache.spark.sql.delta.SnapshotManagement.$init$(SnapshotManagement.scala:60)
	at org.apache.spark.sql.delta.DeltaLog.<init>(DeltaLog.scala:78)
	at org.apache.spark.sql.delta.DeltaLog$.$anonfun$apply$4(DeltaLog.scala:779)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:323)
	at org.apache.spark.sql.delta.DeltaLog$.$anonfun$apply$3(DeltaLog.scala:774)
	at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile(DeltaLogging.scala:140)
	at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile$(DeltaLogging.scala:138)
	at org.apache.spark.sql.delta.DeltaLog$.recordFrameProfile(DeltaLog.scala:598)
	at org.apache.spark.sql.delta.metering.DeltaLogging.$anonfun$recordDeltaOperationInternal$1(DeltaLogging.scala:133)
	at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:128)
	at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:117)
	at org.apache.spark.sql.delta.DeltaLog$.recordOperation(DeltaLog.scala:598)
	at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperationInternal(DeltaLogging.scala:132)
	at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:122)
	at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:112)
	at org.apache.spark.sql.delta.DeltaLog$.recordDeltaOperation(DeltaLog.scala:598)
	at org.apache.spark.sql.delta.DeltaLog$.createDeltaLog$1(DeltaLog.scala:773)
	at org.apache.spark.sql.delta.DeltaLog$.$anonfun$apply$5(DeltaLog.scala:791)
	at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
	... 29 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.analysis.UnresolvedLeafNode
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:587)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
	... 155 more

This is the code I am using:

from pathling import PathlingContext

# Create a Pathling context with Delta Lake support enabled.
pc = PathlingContext.create(enable_delta=True)

# Read the test data as Delta tables from the local test-data directory.
data = pc.read.delta(
    "/Users/gri306/Code/pathling/fhir-server/src/test/resources/test-data/parquet"
)

# Extract patient IDs, condition codings, and membership of an ECL-defined
# SNOMED CT value set, filtered to female patients.
result = data.extract(
    "Patient",
    columns=[
        "id",
        "reverseResolve(Condition.subject).code.coding",
        "reverseResolve(Condition.subject).code.memberOf('http://snomed.info/sct/32506021000036107/version/20231031?fhir_vs=ecl/^ 32570581000036105 : << 263502005 = << 90734009')",
    ],
    filters=["gender = 'female'"],
)

result.show(truncate=False)

johngrimes · Apr 29 '24 19:04

@johngrimes this code works for me on a clean build.

:: loading settings :: url = jar:file:/Users/szu004/miniconda3/envs/pathling-dev/lib/python3.8/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/szu004/.ivy2/cache
The jars for the packages stored in: /Users/szu004/.ivy2/jars
au.csiro.pathling#library-runtime added as a dependency
io.delta#delta-spark_2.12 added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c2e37d74-355b-4d99-a235-6b9dea2fffb2;1.0
	confs: [default]
	found au.csiro.pathling#library-runtime;7.0.0-SNAPSHOT in local-m2-cache
	found io.delta#delta-spark_2.12;3.1.0 in local-m2-cache
	found io.delta#delta-storage;3.1.0 in local-m2-cache
	found org.antlr#antlr4-runtime;4.9.3 in local-m2-cache
	found org.apache.hadoop#hadoop-aws;3.3.4 in local-m2-cache
	found com.amazonaws#aws-java-sdk-bundle;1.12.262 in local-m2-cache
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in local-m2-cache
:: resolution report :: resolve 113ms :: artifacts dl 10ms
	:: modules in use:
	au.csiro.pathling#library-runtime;7.0.0-SNAPSHOT from local-m2-cache in [default]
	com.amazonaws#aws-java-sdk-bundle;1.12.262 from local-m2-cache in [default]
	io.delta#delta-spark_2.12;3.1.0 from local-m2-cache in [default]
	io.delta#delta-storage;3.1.0 from local-m2-cache in [default]
	org.antlr#antlr4-runtime;4.9.3 from local-m2-cache in [default]
	org.apache.hadoop#hadoop-aws;3.3.4 from local-m2-cache in [default]
	org.wildfly.openssl#wildfly-openssl;1.0.7.Final from local-m2-cache in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   7   |   0   |   0   |   0   ||   7   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-c2e37d74-355b-4d99-a235-6b9dea2fffb2
	confs: [default]
	0 artifacts copied, 7 already retrieved (0kB/4ms)
24/04/30 10:30:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/30 10:30:50 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/04/30 10:30:51 WARN SimpleFunctionRegistry: The function date_diff replaced a previously registered function.
24/04/30 10:30:53 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/04/30 10:30:56 WARN CacheManager: Asked to cache already cached data.        
+------------------------------------+---------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id                                  |reverseResolve(Condition.subject).code.coding                              |reverseResolve(Condition.subject).code.memberOf('http://snomed.info/sct/32506021000036107/version/20231031?fhir_vs=ecl/^ 32570581000036105 : << 263502005 = << 90734009')|
+------------------------------------+---------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|10509002||'Acute bronchitis (disorder)'             |false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|444814009||'Viral sinusitis (disorder)'             |false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|72892002||'Normal pregnancy'                        |false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|156073000||'Fetus with unknown complication'        |false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|6072007||'Bleeding from anus'                       |false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|162864005||'Body mass index 30+ - obesity (finding)'|false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|55822004||Hyperlipidemia                            |false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|236077008||'Protracted diarrhea'                    |false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|444814009||'Viral sinusitis (disorder)'             |false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|444814009||'Viral sinusitis (disorder)'             |false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|15777000||Prediabetes                               |false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|19169002||'Miscarriage in first trimester'          |false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|38341003||Hypertension                              |false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|72892002||'Normal pregnancy'                        |false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|444814009||'Viral sinusitis (disorder)'             |false                                                                                                                                                                    |
|9360820c-8602-4335-8b50-c88d627a0c20|http://snomed.info/sct|94260004||'Secondary malignant neoplasm of colon'   |false                                                                                                                                                                    |
|7001ad9c-34d2-4eb5-8165-5fdc2147f469|http://snomed.info/sct|38341003||Hypertension                              |false                                                                                                                                                                    |
|7001ad9c-34d2-4eb5-8165-5fdc2147f469|http://snomed.info/sct|38822007||Cystitis                                  |false                                                                                                                                                                    |
|7001ad9c-34d2-4eb5-8165-5fdc2147f469|http://snomed.info/sct|162864005||'Body mass index 30+ - obesity (finding)'|false                                                                                                                                                                    |
|7001ad9c-34d2-4eb5-8165-5fdc2147f469|http://snomed.info/sct|70704007||'Sprain of wrist'                         |false                                                                                                                                                                    |
+------------------------------------+---------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 20 rows

piotrszul · Apr 30 '24 00:04

I suspect this may be an old version of PySpark in your test environment. Can you make sure you have 3.5.1 there:

(pathling-dev) szu004@CHICKEN-QR:tmp$ pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/
                        
Using Scala version 2.12.18, Java HotSpot(TM) 64-Bit Server VM, 17.0.10
Branch HEAD
Compiled by user heartsavior on 2024-02-15T11:24:58Z
Revision fd86f85e181fc2dc0f50a096855acf83a6cc5d9c
Url https://github.com/apache/spark

If this is not the case, then we need to investigate it on your machine, as I am unable to reproduce it locally.
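
For reference, the version can also be checked from within Python. A minimal sketch, assuming pyspark is importable in the same environment the notebook runs in:

import pyspark

# Should print "3.5.1". Delta 3.1.0 targets Spark 3.5, so an older 3.4.x
# PySpark in the environment would explain the missing Catalyst class.
print(pyspark.__version__)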

piotrszul · Apr 30 '24 00:04

Yep, that was it.

Thanks @piotrszul!

johngrimes · Apr 30 '24 00:04

@johngrimes how can I reproduce the UI issue?

piotrszul · Apr 30 '24 00:04

I've had another look, and I think the UI might be fine, if somewhat sensitive to "dependency management".

The problem I had was bringing the library API into a Spring Boot project, which (depending on how you set it up) can pin some of the transitive dependencies relating to the servlet implementations used by Spark to incompatible versions, causing them to fail.

I think we can tick this one off for now.

johngrimes · May 01 '24 05:05

@piotrszul I ran this candidate release on Databricks (Extract demo) and I'm getting this error. Any ideas?

SparkException: [FAILED_EXECUTE_UDF] User defined function (`subsumes (UDFRegistration$$Lambda$6922/0x00007f76cdb09830)`: (array<struct<id:string,system:string,version:string,code:string,display:string,userSelected:boolean>>, array<struct<id:void,system:string,version:void,code:string,display:void,userSelected:void,_fid:void>>, boolean) => boolean) failed due to: java.lang.IllegalArgumentException: Row or WrappedArray<Row> column expected in argument 0, but given: class scala.collection.compat.immutable.ArraySeq$ofRef,. SQLSTATE: 39000
Caused by: IllegalArgumentException: Row or WrappedArray<Row> column expected in argument 0, but given: class scala.collection.compat.immutable.ArraySeq$ofRef,
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 74.0 failed 4 times, most recent failure: Lost task 0.3 in stage 74.0 (TID 102) (ip-10-180-191-71.ap-southeast-2.compute.internal executor driver): org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] User defined function (`subsumes (UDFRegistration$$Lambda$6922/0x00007f76cdb09830)`: (array<struct<id:string,system:string,version:string,code:string,display:string,userSelected:boolean>>, array<struct<id:void,system:string,version:void,code:string,display:void,userSelected:void,_fid:void>>, boolean) => boolean) failed due to: java.lang.IllegalArgumentException: Row or WrappedArray<Row> column expected in argument 0, but given: class scala.collection.compat.immutable.ArraySeq$ofRef,. SQLSTATE: 39000
	at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:243)
	at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage17.hashAgg_doAggregateWithKeys_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage17.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:56)
	at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:92)

johngrimes · May 17 '24 07:05

I've fixed this: we were specifically expecting a WrappedArray in the subsumes UDF, but it seems that other collection types can be passed in when running on Databricks. I have broadened the expectation to Iterable.
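
To illustrate the shape of that change, here is a hypothetical Python analogue (the actual fix is in the Scala code behind the subsumes UDF; the function and names below are invented for illustration):

from collections.abc import Iterable

def decode_rows(value):
    # Previously the check was against a single concrete sequence type
    # (the analogue of expecting WrappedArray and then being handed an
    # ArraySeq.ofRef). Checking against the abstract interface instead
    # accepts whatever collection type the runtime passes in.
    if not isinstance(value, Iterable):
        raise ValueError(
            f"iterable expected in argument 0, but given: {type(value)}"
        )
    return list(value)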

johngrimes · May 20 '24 05:05