
sparks.driver.extraClassPath is not a valid Spark property

Open yush1ga opened this issue 1 year ago • 1 comment

Affected module Database Services -> Delta Lake (Hive Metadata Database)

Describe the bug jdbcDriverClassPath won't be loaded because sparks.driver.extraClassPath is not a valid Spark property. It should presumably be spark.driver.extraClassPath.
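For illustration, a minimal sketch of what the corrected property would look like when building a Spark session (this is not the OpenMetadata source; the app name and jar path are taken from the reproduction config below, and the key only takes effect when set before the driver JVM starts):

from pyspark.sql import SparkSession

# Minimal sketch, assuming the fix is the one-character rename:
# Spark only honors the exact key "spark.driver.extraClassPath";
# the misspelled "sparks.driver.extraClassPath" never reaches the
# driver JVM, so the MySQL JDBC jar is absent from its classpath.
jdbc_driver_class_path = "/opt/spark-3.5.0-bin-hadoop3/jars/mysql-connector-java-8.0.11.jar"

spark = (
    SparkSession.builder.master("local[*]")
    .appName("OpenMetadata")
    .config("spark.driver.extraClassPath", jdbc_driver_class_path)  # correct key
    .enableHiveSupport()
    .getOrCreate()
)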

To Reproduce Please run the following command in the ingest container of docker-compose.yml:

metadata ingest -c delta_lake_ingest.yml

with this delta_lake_ingest.yml:
source:
  type: deltalake
  serviceName: "test"
  serviceConnection:
    config:
      type: DeltaLake
      metastoreConnection:
        metastoreDb: jdbc:mysql://172.16.240.1:17306/hive_metastore?createDatabaseIfNotExist=true&useSSL=false
        username: xxxx
        password: xxxx
        driverName: com.mysql.cj.jdbc.Driver
        jdbcDriverClassPath: /opt/spark-3.5.0-bin-hadoop3/jars/mysql-connector-java-8.0.11.jar
      appName: OpenMetadata
  sourceConfig:
    config:
      type: DatabaseMetadata
      markDeletedTables: true
      includeTables: true
      includeViews: true
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  loggerLevel: DEBUG 
  openMetadataServerConfig:
    hostPort: "http://openmetadata-server:18085/api"
    authProvider: openmetadata
    securityConfig:
      jwtToken: "xxxx"
    storeServiceConnection: true 

Expected behavior The ingestion finishes normally.

Actual behavior Got Warning: Ignoring non-Spark config property: sparks.driver.extraClassPath, and then The specified datastore driver ("com.mysql.cj.jdbc.Driver") was not found.
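For context, my reading of the log (not something stated in the OpenMetadata docs): in a fresh Python process, PySpark launches the driver JVM through spark-submit and forwards builder configs as --conf flags; spark-submit drops any key that does not start with spark., printing exactly this warning. A standalone sketch of the failure mode (the jar path is illustrative):

from pyspark.sql import SparkSession

# The misspelled key is forwarded to spark-submit, which rejects it with
# "Warning: Ignoring non-Spark config property: sparks.driver.extraClassPath"
spark = (
    SparkSession.builder.master("local[*]")
    .config("sparks.driver.extraClassPath", "/tmp/mysql-connector.jar")  # typo key: dropped
    .getOrCreate()
)

# The real property was never set, so the driver classpath is unchanged
# and DataNucleus later fails to load com.mysql.cj.jdbc.Driver:
print(spark.conf.get("spark.driver.extraClassPath", "NOT SET"))  # -> NOT SET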

Log:

airflow@30ec1bf0bc33:/opt/airflow$ metadata ingest -c delta_lake_ingest.yml 
[2024-04-29 15:25:32] DEBUG    {metadata.Utils:execution_time_tracker:124} - GET executed in 0.01s
[2024-04-29 15:25:32] INFO     {metadata.OMetaAPI:server_mixin:66} - OpenMetadata client running with Server version [1.3.1] and Client version [1.3.1.3]
[2024-04-29 15:25:32] DEBUG    {metadata.Utils:execution_time_tracker:124} - GET executed in 0.0s
/home/airflow/.local/lib/python3.10/site-packages/pyspark/bin/load-spark-env.sh: line 68: ps: command not found
Warning: Ignoring non-Spark config property: sparks.driver.extraClassPath
:: loading settings :: url = jar:file:/home/airflow/.local/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/airflow/.ivy2/cache
The jars for the packages stored in: /home/airflow/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c2ede5d7-c832-49c7-b7dc-33cc9c326679;1.0
        confs: [default]
        found io.delta#delta-core_2.12;2.3.0 in central
        found io.delta#delta-storage;2.3.0 in central
        found org.antlr#antlr4-runtime;4.8 in central
:: resolution report :: resolve 193ms :: artifacts dl 9ms
        :: modules in use:
        io.delta#delta-core_2.12;2.3.0 from central in [default]
        io.delta#delta-storage;2.3.0 from central in [default]
        org.antlr#antlr4-runtime;4.8 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-c2ede5d7-c832-49c7-b7dc-33cc9c326679
        confs: [default]
        0 artifacts copied, 3 already retrieved (0kB/6ms)
24/04/29 15:25:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[2024-04-29 15:25:37] DEBUG    {metadata.Utils:execution_time_tracker:124} - GET executed in 0.02s
24/04/29 15:25:38 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/04/29 15:25:38 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
24/04/29 15:25:39 ERROR Datastore: Exception thrown creating StoreManager. See the nested exception
Error creating transactional connection factory
org.datanucleus.exceptions.NucleusException: Error creating transactional connection factory
        at org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:214)
        at org.datanucleus.store.AbstractStoreManager.<init>(AbstractStoreManager.java:162)
        at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:285)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606)
        at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
        at org.datanucleus.NucleusContextHelper.createStoreManagerForProperties(NucleusContextHelper.java:133)
        at org.datanucleus.PersistenceNucleusContextImpl.initialise(PersistenceNucleusContextImpl.java:422)
        at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:817)
        at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:334)
        at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:213)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at javax.jdo.JDOHelper$16.run(JDOHelper.java:1975)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at javax.jdo.JDOHelper.invoke(JDOHelper.java:1970)
        at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1177)
        at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:814)
        at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:702)
        at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:521)
        at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:550)
        at org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:405)
        at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342)
        at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:303)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:79)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:139)
        at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58)
        at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStoreForConf(HiveMetaStore.java:628)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMSForConf(HiveMetaStore.java:594)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:588)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:655)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:431)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:79)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:92)
        at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6902)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:162)
        at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1740)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:83)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:133)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
        at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3607)
        at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3659)
        at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3639)
        at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1563)
        at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1552)
        at org.apache.spark.sql.hive.client.Shim_v0_12.databaseExists(HiveShim.scala:609)
        at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$databaseExists$1(HiveClientImpl.scala:398)
        at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
        at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:298)
        at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
        at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
        at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:278)
        at org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:398)
        at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:223)
        at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:101)
        at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:223)
        at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
        at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:140)
        at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:54)
        at org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$1(HiveSessionStateBuilder.scala:69)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:121)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:121)
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listDatabases(SessionCatalog.scala:294)
        at org.apache.spark.sql.internal.CatalogImpl.listDatabases(CatalogImpl.scala:74)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:606)
        at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:330)
        at org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:203)
        ... 93 more
Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("com.mysql.cj.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.

Version:

  • OS: Ubuntu 22.04 (host OS)
  • Python version: 3.10.13
  • OpenMetadata version: 1.3.1
  • OpenMetadata Ingestion package version: 1.3.1.3

Additional context I first tried to connect to Delta Lake via the UI settings. The UI returned only the following error message; it would have been more helpful if it had included the stack trace :)

org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

yush1ga · Apr 29 '24 15:04

A related bug: this should be connection.metastoreConnection.username, not connection.metastoreConnection.metastoreDb https://github.com/open-metadata/OpenMetadata/blob/58992c2e24a380af7139666e3778590e53a47ea6/ingestion/src/metadata/ingestion/source/database/deltalake/connection.py#L74
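For readers without the repo open, an approximate paraphrase of the suspected bug and fix (illustrative, not a verbatim copy of the linked connection.py; the Hadoop/JDO property key and the builder variable are my assumptions about the surrounding code):

# Suspected bug: the metastore JDBC URL is passed where the username belongs
builder.config(
    "spark.hadoop.javax.jdo.option.ConnectionUserName",
    connection.metastoreConnection.metastoreDb,  # wrong field
)

# Suggested fix: pass the metastore username instead
builder.config(
    "spark.hadoop.javax.jdo.option.ConnectionUserName",
    connection.metastoreConnection.username,
)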

yush1ga · Apr 29 '24 16:04

@yush1ga thanks for raising the issue and providing the fix. Really appreciated

pmbrull · May 02 '24 16:05