
Multiple Spark session creation error in great_expectations.yml: adding multiple Spark sessions with SparkDFExecutionEngine raises Py4JJavaError

Open susreemohanty opened this issue 2 years ago • 5 comments

Describe the bug When adding multiple Spark sessions with SparkDFExecutionEngine in great_expectations.yml, I get a Py4JJavaError: An error occurred while calling o578.parquet, caused by java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.

To Reproduce Configure two Spark datasources with SparkDFExecutionEngine in great_expectations.yml (see the config below) and run a checkpoint against them.

Expected behavior Both datasources validate successfully without the SparkContext being stopped.

Environment (please complete the following information):

  • Operating System: Linux
  • Great Expectations Version: [e.g. 0.13.35]

```yaml
s3_spark_folderlevel:
  module_name: great_expectations.datasource
  execution_engine:
    module_name: great_expectations.execution_engine
    class_name: SparkDFExecutionEngine
    spark_config:
      spark.master: local[32]
      spark.port.maxRetries: 100
      spark.debug.maxToStringFields: 10000
      spark.driver.memory: 40g
      spark.executor.cores: 20
  data_connectors:
    configured_data_connector_name:
      class_name: ConfiguredAssetS3DataConnector
      module_name: great_expectations.datasource.data_connector
      bucket: ${s3_bucket_input}
      prefix: ${s3_folder_output_prefix}
      assets:
        alpha:
          default_regex:
            pattern: (.*).parquet
            group_names:
              - index
```

If there are two Spark sessions configured, it throws the stopped-SparkContext error. Can anyone look into this issue?
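A toy model of what may be happening (plain Python, no Spark required; all names here are invented for illustration), assuming the second engine restarts a shared session whose config differs from the first:

```python
class FakeSparkSession:
    """Minimal stand-in for a process-wide Spark session."""
    _active = None

    def __init__(self, config):
        self.config = config
        self.stopped = False

    def stop(self):
        self.stopped = True
        FakeSparkSession._active = None


def get_or_create(config):
    """Mimic an engine that restarts the session when configs differ."""
    active = FakeSparkSession._active
    if active is not None and active.config != config:
        active.stop()  # the first engine's session is torn down here
        active = None
    if active is None:
        active = FakeSparkSession(config)
        FakeSparkSession._active = active
    return active


session_a = get_or_create({"spark.driver.memory": "40g"})
session_b = get_or_create({"spark.driver.memory": "8g"})  # second datasource, different config
print(session_a.stopped)  # True: anything still holding session_a now fails
```

In this model, any object still bound to `session_a` after the second `get_or_create` is operating on a stopped session, which mirrors the "Cannot call methods on a stopped SparkContext" symptom above.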

susreemohanty avatar Dec 10 '21 17:12 susreemohanty

Hey @susreemohanty, thanks for opening this issue!

Would you mind providing a bit more information regarding your use-case here? Is there a particular reason why you're instantiating two separate Spark sessions in this config? I believe utilizing separate ExecutionEngines might get you to your intended solution but I'd love to know more about your situation to understand the full picture.

cdkini avatar Dec 13 '21 16:12 cdkini

Hi Chetan, thanks for your quick response. We have defined all our datasources in great_expectations.yml, and our checkpoints refer to those datasources. However, if we create the Spark session in a Python library, we cannot figure out how to make Great Expectations use that same session instead of the yaml-configured datasources. Would you be able to help us fix this? Regards, Susree


susreemohanty avatar Dec 14 '21 18:12 susreemohanty

Hi Chetan, this is the sample code we have in a Python file to create the Spark session:

```python
import sys
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.session.session import _activate_session
from kedro.framework.project import configure_project

sys.path.append('/abc/src')

current_dir = Path.cwd()  # this points to '/ace/src/abc' folder
project_path = '/abe'
project_path = current_dir.parents[1]  # point back to the root of the project

configure_project("project_abe")
session = KedroSession.create("project_abe", project_path, env="kk")
_activate_session(session)
context = session.load_context()
io = context.io
print(io.list())

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("v_test") \
    .getOrCreate()
```

Now we cannot figure out how to use this session in place of the datasource and data asset name when calling the checkpoint. Regards, Susree
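For reference, the yaml-based pattern Great Expectations documents for passing an in-memory Spark DataFrame is a RuntimeDataConnector; a rough sketch (the datasource and identifier names here are invented):

```yaml
my_runtime_spark:
  class_name: Datasource
  module_name: great_expectations.datasource
  execution_engine:
    class_name: SparkDFExecutionEngine
  data_connectors:
    default_runtime_data_connector_name:
      class_name: RuntimeDataConnector
      batch_identifiers:
        - run_id
```

A checkpoint or validator can then be given a RuntimeBatchRequest whose `runtime_parameters` contain `{"batch_data": spark_df}`, so the session created in the Python code above is the one Great Expectations uses.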


susreemohanty avatar Dec 14 '21 18:12 susreemohanty

Hey @susreemohanty ! Thanks for reaching out. We have eyes on this issue & will be investigating further in the near future.

austiezr avatar Aug 08 '22 17:08 austiezr

Hi Chetan, we were using multiple Spark sessions because we have different storage backends: one on AWS S3 and one on Azure Blob Storage. Hope this helps the investigation. Regards, Susree


susreemohanty avatar Aug 09 '22 04:08 susreemohanty

@susreemohanty Can you provide more information on your setup? A few outstanding questions to help me reproduce:

  • What version of great_expectations are you using? Am I correct in assuming you used the CLI for yaml generation?
  • What version of kedro are you using?
  • I only see one instance of SparkDFExecutionEngine in your provided yaml. Is there more yaml you can share?
  • What code or CLI command are you running that produces the error? I don't see great_expectations referenced in the Python sample code you provided.

I did try using two Spark datasources with great_expectations 0.17.0 using Fluent Datasources (https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/fluent/filesystem/connect_filesystem_source_data) and didn't receive an error, but that was connecting Spark to local files.

tyler-hoffman avatar Jun 23 '23 14:06 tyler-hoffman

I'm closing this because we haven't heard back in over 48 hours, but please feel free to reopen if you are still running into this and can provide a bit more information on your setup!

tyler-hoffman avatar Jun 28 '23 15:06 tyler-hoffman