Multiple Spark session creation error: Py4JJavaError when adding multiple Spark sessions with SparkDFExecutionEngine in great_expectations.yml
Describe the bug
When I am adding multiple Spark sessions with SparkDFExecutionEngine in the great_expectations.yml file, I am getting Py4JJavaError: An error occurred while calling o578.parquet, caused by java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.

To Reproduce
Configure more than one Spark datasource in great_expectations.yml, as in the yaml below, then run a checkpoint against one of them.

Expected behavior
All configured Spark datasources can be used without the SparkContext being stopped.
Environment (please complete the following information):
- Operating System: Linux
- Great Expectations Version: [e.g. 0.13.35]
```yaml
s3_spark_folderlevel:
  module_name: great_expectations.datasource
  execution_engine:
    module_name: great_expectations.execution_engine
    class_name: SparkDFExecutionEngine
    spark_config:
      spark.master: local[32]
      spark.port.maxRetries: 100
      spark.debug.maxToStringFields: 10000
      spark.driver.memory: 40g
      spark.executor.cores: 20
  data_connectors:
    configured_data_connector_name:
      class_name: ConfiguredAssetS3DataConnector
      module_name: great_expectations.datasource.data_connector
      bucket: ${s3_bucket_input}
      assets:
        alpha:
          default_regex:
            pattern: (.*).parquet
            group_names:
              - index
          prefix: ${s3_folder_output_prefix}
```
If there are two such sessions configured, it throws the stopped SparkContext error. Can anyone look into this issue?
Hey @susreemohanty, thanks for opening this issue!
Would you mind providing a bit more information regarding your use-case here? Is there a particular reason why you're instantiating two separate Spark sessions in this config? I believe utilizing separate ExecutionEngines might get you to your intended solution but I'd love to know more about your situation to understand the full picture.
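For reference, a minimal sketch of the separate-ExecutionEngine approach might look roughly like the following. This is untested against this exact setup, and the datasource names and buckets are placeholders rather than anything from this issue:

```python
# Sketch only: two datasources, each declaring its own SparkDFExecutionEngine,
# instead of creating two separate Spark sessions. All names are hypothetical.
from great_expectations.data_context import DataContext

context = DataContext()  # loads great_expectations.yml from the project root

for name, bucket in [("s3_spark_a", "bucket-a"), ("s3_spark_b", "bucket-b")]:
    context.add_datasource(
        name=name,
        class_name="Datasource",
        module_name="great_expectations.datasource",
        execution_engine={
            "class_name": "SparkDFExecutionEngine",
            "module_name": "great_expectations.execution_engine",
        },
        data_connectors={
            "configured_connector": {
                "class_name": "ConfiguredAssetS3DataConnector",
                "module_name": "great_expectations.datasource.data_connector",
                "bucket": bucket,
                "assets": {
                    "alpha": {
                        "default_regex": {
                            "pattern": r"(.*)\.parquet",
                            "group_names": ["index"],
                        }
                    }
                },
            }
        },
    )
```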
Hi Chetan, thanks for your quick response. Actually, we have defined all our datasources in great_expectations.yml, and our checkpoints refer to those datasources. However, if we create the Spark session in a Python file, we are not able to figure out how to use that same Spark session instead of the configured datasources. Will you be able to help us fix this? Regards, Susree
Hi Chetan, this is the sample code we have in a Python file to create the Spark session:

```python
import sys
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.session.session import _activate_session
from kedro.framework.project import configure_project

sys.path.append('/abc/src')
current_dir = Path.cwd()  # this points to the '/ace/src/abc' folder
project_path = '/abe'
project_path = current_dir.parents[1]  # point back to the root of the project
configure_project("project_abe")
session = KedroSession.create("project_abe", project_path, env="kk")
_activate_session(session)
context = session.load_context()
io = context.io
print(io.list())

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("v_test") \
    .getOrCreate()
```
Now we are not able to figure out how to use this session instead of the datasource and data asset names when calling the checkpoint. Regards, Susree
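One possible direction, offered as a hedged sketch rather than a confirmed answer from this thread: in the 0.13.x API, an in-memory Spark dataframe can be passed to a checkpoint at runtime through a RuntimeDataConnector and a RuntimeBatchRequest. The datasource, connector, checkpoint, suite, and S3 path below are all placeholders:

```python
# Sketch only: validate an in-memory Spark dataframe against a checkpoint.
# Assumes a datasource "my_spark_datasource" with a RuntimeDataConnector named
# "runtime_connector" (batch_identifiers: ["run_id"]) and a checkpoint
# "my_checkpoint" already exist. All of these names are hypothetical.
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.data_context import DataContext

context = DataContext()
df = spark.read.parquet("s3a://my-bucket/alpha/")  # read with your own session

batch_request = RuntimeBatchRequest(
    datasource_name="my_spark_datasource",
    data_connector_name="runtime_connector",
    data_asset_name="alpha",                  # any label you choose
    runtime_parameters={"batch_data": df},    # the in-memory dataframe
    batch_identifiers={"run_id": "manual"},   # keys must match the connector
)

result = context.run_checkpoint(
    checkpoint_name="my_checkpoint",
    validations=[{
        "batch_request": batch_request,
        "expectation_suite_name": "my_suite",
    }],
)
```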
Hey @susreemohanty ! Thanks for reaching out. We have eyes on this issue & will be investigating further in the near future.
Hi Chetan, we were using multiple Spark sessions because we have different storage backends: one on AWS S3 and one on Azure Blob Storage. Hope this helps with investigating the issue. Regards, Susree
@susreemohanty Can you provide more information on your setup? A few outstanding questions to help me reproduce:
- What version of `great_expectations` are you using? Am I correct in assuming you used the CLI for yaml generation?
- What version of kedro are you using?
- I only see one instance of `SparkDFExecutionEngine` in your provided yaml. Is there more yaml you can share?
- What code or CLI command are you running that gets you the error? I don't see `great_expectations` referenced in the Python sample code you provided.
I did try two Spark datasources with great_expectations 0.17.0 using Fluent Datasources (https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/fluent/filesystem/connect_filesystem_source_data) and didn't receive an error, but that was connecting Spark to local files.
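Roughly what I ran, as a sketch; the names, directories, and CSV options here are placeholders rather than anything from this issue:

```python
# Sketch of the local test: two Spark filesystem datasources in one context,
# using the 0.17 Fluent Datasources API. All names and paths are hypothetical.
import great_expectations as gx

context = gx.get_context()

ds_one = context.sources.add_spark_filesystem(name="spark_fs_one", base_directory="/data/one")
ds_two = context.sources.add_spark_filesystem(name="spark_fs_two", base_directory="/data/two")

asset_one = ds_one.add_csv_asset(name="asset_one", header=True, infer_schema=True)
asset_two = ds_two.add_csv_asset(name="asset_two", header=True, infer_schema=True)

# Reading a batch from each datasource did not raise the
# "Cannot call methods on a stopped SparkContext" error in my test.
context.add_or_update_expectation_suite(expectation_suite_name="tmp_suite")
validator = context.get_validator(
    batch_request=asset_one.build_batch_request(),
    expectation_suite_name="tmp_suite",
)
print(validator.head())
```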
I'm closing this because we haven't heard back in over 48 hours, but please feel free to reopen if you are still running into this and can provide a bit more information on your setup!