great_expectations

Spark Config not applied to spark session.

Open • Spooky-0 opened this issue 3 years ago • 6 comments

Describe the bug

I followed this tutorial: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_datasources/how_to_configure_a_self_managed_spark_datasource.html. My datasource looks like this (or similar):

datasources:
  spark_dataframe:
    data_asset_type:
      class_name: SparkDFDataset
      module_name: great_expectations.dataset
    batch_kwargs_generators:
      spark_sql_query:
        class_name: QueryBatchKwargsGenerator
        queries:
          ${query_name}: ${spark_sql_query}
    module_name: great_expectations.datasource
    class_name: SparkDFDatasource
    spark_config:
      spark.yarn.queue: 'my-yarn-queue'

When I run great_expectations datasources list, I get an error saying that the Spark session is running in the root queue. At my company I don't have access to that queue, hence the error.

I have tried to track this down, and the spark session that gets initialized does not contain any configuration:

spark = get_or_create_spark_application(
    spark_config=spark_config,
    force_reuse_spark_context=force_reuse_spark_context,
)

Here the spark_config is always empty. I traced several places where the Spark session is created, and none of them ever receives the spark_config.
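For reference, the behavior expected here can be sketched in plain PySpark. This is a minimal illustration, not the Great Expectations implementation: `apply_spark_config` and `build_session` are hypothetical helper names, and the only library calls assumed are PySpark's `SparkSession.builder.config(key, value)` and `getOrCreate()`.

```python
from typing import Any, Dict


def apply_spark_config(builder: Any, spark_config: Dict[str, str]) -> Any:
    """Apply every key/value pair in spark_config to a SparkSession-style builder.

    Hypothetical helper for illustration; not part of Great Expectations.
    """
    for key, value in spark_config.items():
        # .config() returns the builder, so reassign to keep the chain going.
        builder = builder.config(key, value)
    return builder


def build_session(spark_config: Dict[str, str]) -> Any:
    """Create (or reuse) a session with the options applied. Requires pyspark."""
    # Imported lazily so apply_spark_config itself has no pyspark dependency.
    from pyspark.sql import SparkSession

    return apply_spark_config(SparkSession.builder, spark_config).getOrCreate()
```

With the datasource above, this would be called as `build_session({"spark.yarn.queue": "my-yarn-queue"})`. Note that `getOrCreate()` returns an already running session if one exists, in which case launch-time options such as the YARN queue may not take effect.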

To Reproduce

Steps to reproduce the behavior:

  1. Go to the link above
  2. Follow the instructions up until the point where the spark_config is included in the datasource
  3. Run great_expectations datasources list
  4. Go to spark_ui to validate that your config has indeed been applied. For me it wasn't.
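As an alternative to checking the Spark UI in step 4, the effective configuration can be compared against the expected one from Python. The following is a rough sketch under assumptions: `find_unapplied` and `active_spark_conf` are hypothetical helper names, and the check has to run in the same Python process as the session being inspected, because `getOrCreate()` would otherwise start a fresh session.

```python
from typing import Dict, Optional, Tuple


def find_unapplied(
    expected: Dict[str, str], actual: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {key: (expected_value, actual_value)} for each option that differs.

    Hypothetical helper for illustration; an absent option is reported as None.
    """
    return {
        key: (value, actual.get(key))
        for key, value in expected.items()
        if actual.get(key) != value
    }


def active_spark_conf() -> Dict[str, str]:
    """Read the configuration of the running session. Requires pyspark."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # SparkConf.getAll() returns a list of (key, value) pairs.
    return dict(spark.sparkContext.getConf().getAll())
```

For the datasource in this report, `find_unapplied({"spark.yarn.queue": "my-yarn-queue"}, active_spark_conf())` would return an empty dict if the queue setting had been applied.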

Expected behavior

I would expect the Spark session that accesses the dataset to use the spark_config provided. The spark_config should be applied at all times (including datasources list, running expectations, etc.).

Environment (please complete the following information):

  • Operating System: Linux, Cloudera Work Bench
  • Great Expectations Version: 0.13.19

Spooky-0, May 11 '21 10:05

@Spooky-0 Thank you for reporting this! We will follow your steps to reproduce.

eugmandel, May 17 '21 14:05

I seem to be having a similar issue with Great Expectations 0.13.23. In my case I'm trying to set spark.jars.packages to include Spark Avro.

dallinb, Jul 30 '21 22:07

Hi @Spooky-0 and @dallinb - thanks so much for reporting this and apologies for delays.

I wanted to check - do you know if this is an issue with V3 as well? If it's only an issue with V2, we won't be able to prioritize this internally, but we would welcome a community contribution, and are happy to offer any assistance needed to get this over the line.

talagluck, Oct 22 '21 17:10

I was testing against V3. Downgraded to V2 to get around the problem.

dallinb, Oct 22 '21 17:10

Thanks @dallinb - would you please add more information about your setup/config here? The config posted by @Spooky-0 is for V2, so it would be good to see how you were trying to set this up for V3.

talagluck, Oct 22 '21 19:10

Is this issue still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity.

It will be closed if no further activity occurs. Thank you for your contributions 🙇

github-actions[bot], Aug 05 '22 02:08