Spark Config not applied to spark session.
Describe the bug
I have followed this tutorial https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_datasources/how_to_configure_a_self_managed_spark_datasource.html and in particular my datasource looks like this (or similar):
datasources:
  spark_dataframe:
    data_asset_type:
      class_name: SparkDFDataset
      module_name: great_expectations.dataset
    batch_kwargs_generators:
      spark_sql_query:
        class_name: QueryBatchKwargsGenerator
        queries:
          ${query_name}: ${spark_sql_query}
    module_name: great_expectations.datasource
    class_name: SparkDFDatasource
    spark_config:
      spark.yarn.queue: 'my-yarn-queue'
When running great_expectations datasources list I get an error: the Spark session is started in the root queue, which I don't have access to at my company.
I have tried to track this down, and the Spark session that gets initialized does not contain any of my configuration:
spark = get_or_create_spark_application(
    spark_config=spark_config,
    force_reuse_spark_context=force_reuse_spark_context,
)
-> the spark_config is always empty. I have tried to track down the several places where the Spark session gets created, and none of them ever receives the spark_config.
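For reference, a minimal sketch of a possible workaround (not verified against every entry point): create the Spark session yourself, with the settings you need, before loading the Data Context, so that any later getOrCreate() picks up the already-configured session. This only helps when Great Expectations runs in the same Python process, not for the CLI.

```python
from pyspark.sql import SparkSession
import great_expectations as ge

# Build the session up front with the settings that should have come from
# spark_config. "my-yarn-queue" is the placeholder value from the config above.
spark = (
    SparkSession.builder
    .appName("ge_preconfigured_session")  # arbitrary name
    .config("spark.yarn.queue", "my-yarn-queue")
    .getOrCreate()
)

# Load the Data Context only after the session exists; a subsequent
# getOrCreate() inside Great Expectations should then reuse this session.
context = ge.DataContext()
```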
To Reproduce
Steps to reproduce the behavior:
- Go to the link above
- Follow the instructions up to the point where the spark_config is included in the datasource
- Run
great_expectations datasources list
- Go to the Spark UI to validate that your config has indeed been applied. For me it wasn't (see the sketch below for a programmatic check).
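Besides the Spark UI, a quick programmatic check from the same process (illustrative only):

```python
from pyspark.sql import SparkSession

# Grab whatever session is currently active and read the setting back.
spark = SparkSession.builder.getOrCreate()
queue = spark.sparkContext.getConf().get("spark.yarn.queue", "<not set>")
print(f"spark.yarn.queue = {queue}")
```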
Expected behavior
I would expect the Spark session that accesses the dataset to use the provided spark_config. The spark_config should be applied at all times (this includes datasources list, running the expectations, etc.).
Environment (please complete the following information):
- Operating System: Linux, Cloudera Workbench
- Great Expectations Version: 0.13.19
@Spooky-0 Thank you for reporting this! We will follow your steps to reproduce.
I seem to be having a similar issue with Great Expectations 0.13.23. In my case I'm trying to set spark.jars.packages to include Spark Avro.
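For context, this is the kind of setting I mean. As a sketch, pre-building the session with the package would look roughly like this (the Avro coordinates below are only an example; pick the artifact matching your Spark/Scala version):

```python
from pyspark.sql import SparkSession

# Sketch: pre-build the session with spark.jars.packages set, since the value
# from the datasource's spark_config does not reach the session.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.1.2")
    .getOrCreate()
)
```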
Hi @Spooky-0 and @dallinb - thanks so much for reporting this and apologies for delays.
I wanted to check - do you know if this is an issue with V3 as well? If it's only an issue with V2, we won't be able to prioritize this internally, but we would welcome a community contribution, and are happy to offer any assistance needed to get this over the line.
I was testing against V3. Downgraded to V2 to get around the problem.
Thanks @dallinb - would you please add more information about your setup/config here? The config posted by @Spooky-0 is for V2, so it would be good to see how you were trying to set this up for V3.
Is this issue still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?
This issue has been automatically marked as stale because it has not had recent activity.
It will be closed if no further activity occurs. Thank you for your contributions 🙇