
fluent_datasources does not reflect runtime datasource addition

Open ramananayak opened this issue 4 months ago • 9 comments

Describe the bug I want to add a fluent datasource at runtime after a FileDataContext is already defined. context.fluent_datasources is a dictionary, but when I add a new fluent datasource to it, the addition is not reflected in the existing dictionary. Whereas the same assignment works on context.datasources.

To Reproduce

import great_expectations as gx
context = gx.data_context.FileDataContext(context_root_dir="my_context_dir")
connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"
runtime_datasource = gx.datasource.fluent.PostgresDatasource(name="ds_runtime", connection_string=connection_string, create_temp_table=True)

# Running the line below does not update the dictionary
context.fluent_datasources[runtime_datasource.name] = runtime_datasource

# Whereas running the command below updates it properly, and it also updates fluent_datasources
context.datasources[runtime_datasource.name] = runtime_datasource

Expected behavior context.fluent_datasources should include the added runtime_datasource in the dictionary.

Environment (please complete the following information):

  • Operating System: MacOS
  • Great Expectations Version: 0.18.12
  • Data Source: Redshift
  • Cloud environment: AWS


ramananayak avatar Apr 02 '24 15:04 ramananayak

@ramananayak Sorry for the confusion. This happens because the context.fluent_datasources property is just a dictionary comprehension over context.datasources with all non-fluent datasources filtered out, so the dictionary it returns is rebuilt on every access.

We could alter the return type annotation to an immutable Mapping[str, FluentDatasource] to help with this, but that wouldn't alter runtime behavior; you'd have to rely on a type checker or IDE to warn you that the mapping is immutable.

https://github.com/great-expectations/great_expectations/blob/abcf67159ad6562478318e659771237561b107b4/great_expectations/data_context/data_context/abstract_data_context.py#L4375-L4381
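The effect can be illustrated without GX at all. Below is a minimal stand-in (not the real GX classes) for a property that rebuilds its dict on each access, as the linked code does; item assignment then mutates a throwaway copy:

```python
# Minimal sketch: a property backed by a dict comprehension returns a
# NEW dict on every access, so writes into the returned dict are lost.
class DemoContext:
    def __init__(self):
        # backing store: datasource name -> kind
        self._datasources = {"pandas_ds": "block", "pg_ds": "fluent"}

    @property
    def fluent_datasources(self):
        # rebuilt on every access, filtering out non-fluent entries
        return {
            name: kind
            for name, kind in self._datasources.items()
            if kind == "fluent"
        }


ctx = DemoContext()
ctx.fluent_datasources["ds_runtime"] = "fluent"  # mutates a throwaway dict
print("ds_runtime" in ctx.fluent_datasources)    # False: the write was lost
```

Writing into the backing store (`context.datasources` in GX terms) works because that is the dict the comprehension reads from.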

The idiomatic way to add or update a datasource is to use one of the context.sources.add_or_update_<DATASOURCE_TYPE>() methods. These methods also bootstrap the datasource with the components needed for config substitution and for connecting certain datasources to backends such as S3, GCS, or Databricks.

import great_expectations as gx
context = gx.data_context.FileDataContext(context_root_dir="my_context_dir")
connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"

runtime_datasource = context.sources.add_or_update_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True
)

print(repr(runtime_datasource))

Kilo59 avatar Apr 11 '24 12:04 Kilo59

Thanks for the clarification @Kilo59. I tried context.sources.add_postgres(), but for a FileDataContext this ends up writing the connection string details I use as a variable in my code into the context file (great_expectations.yml). That defeats the purpose of it being runtime. Also, because of the write lock on the context file, multiple checks running against the same config will fail. I want this datasource to be used only at runtime, without affecting the great_expectations.yml file.

I did some investigation and saw that for FileDataContext() the context file is opened in w mode (https://github.com/great-expectations/great_expectations/blob/abcf67159ad6562478318e659771237561b107b4/great_expectations/data_context/data_context/file_data_context.py#L168). So is there any way to add configurations for true runtime use without changing the context file every time?

The same applies to data assets; I don't see any example showing how to create a runtime data asset. Currently I am testing with a fluent datasource, and all the methods just keep adding data assets to the context file, which leads to a growing config file. In 0.17.1, the snippet below would create a runtime data asset without any update to the context file; for reference:

validations:
  - batch_request:
      data_asset_name: runtime_asset
      runtime_parameters:
        query: "select column 1 from table"
    expectation_suite_name: appstat_suite

I don't really know how to achieve this in the latest version. Thanks for your help!

ramananayak avatar Apr 17 '24 14:04 ramananayak

@ramananayak I don't think this is exactly what you are looking for, but you can use an EphemeralDataContext, which doesn't persist anything.

import great_expectations as gx
context = gx.get_context(mode="ephemeral")
connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"

runtime_datasource = context.sources.add_or_update_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True
)

print(repr(runtime_datasource))

The code above ☝️ should work, but you won't have access to your file-backed checkpoints, expectation suites, etc. You would need to modify the code to pull those items in.

I will pass this along to our team working on the v1.0 release (and any other feedback you have).

Kilo59 avatar Apr 18 '24 13:04 Kilo59

There's a somewhat related issue where a user is creating an ephemeral context from a file context but is unable to load the fluent configs. For you, this shouldn't be a problem, though.

  • https://github.com/great-expectations/great_expectations/issues/9283

Here is an updated example that should allow your ephemeral context to pull in the project config from your file context.

import great_expectations as gx

# Create two different contexts using THE SAME config
file_ctx = gx.get_context(mode="file")
ephm_ctx = gx.get_context(mode="ephemeral", project_config=file_ctx.config)

connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"

runtime_datasource = ephm_ctx.sources.add_or_update_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True
)

print(repr(runtime_datasource))

Kilo59 avatar Apr 18 '24 13:04 Kilo59

Hi @Kilo59, thanks for sharing this. Yes, as you mentioned, I tried the ephemeral context and it looks like it will work.

Here is my version

import yaml
import great_expectations as gx
from great_expectations.data_context.types.base import DataContextConfig
from great_expectations.data_context import EphemeralDataContext

context_root_dir = "path to my initial great_expectations.yml file"
with open(context_root_dir + '/great_expectations.yml', 'r') as file:
    conf = yaml.safe_load(file)

context_config = DataContextConfig(**conf)
ephm_ctx  = EphemeralDataContext(project_config=context_config)

connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"
runtime_datasource = ephm_ctx.sources.add_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True,
)

print(repr(runtime_datasource))

This is working, although I have to specify the complete path for all the respective GX directories (like the plugins directory), but that's understood.
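One way to avoid hard-coding complete paths everywhere is to rewrite the relative directory entries of the loaded config against the context root before constructing the EphemeralDataContext. This is only a sketch; the `absolutize` helper is hypothetical, and the choice of keys (`plugins_directory`, nested `base_directory`) is my assumption about which great_expectations.yml entries hold relative paths:

```python
import os

def absolutize(conf, root):
    """Recursively rewrite relative 'base_directory' / 'plugins_directory'
    values in a loaded great_expectations.yml dict to absolute paths."""
    if isinstance(conf, dict):
        out = {}
        for key, value in conf.items():
            if (
                key in ("base_directory", "plugins_directory")
                and isinstance(value, str)
                and not os.path.isabs(value)
            ):
                out[key] = os.path.join(root, value)
            else:
                out[key] = absolutize(value, root)
        return out
    if isinstance(conf, list):
        return [absolutize(item, root) for item in conf]
    return conf

# Example: a fragment of a loaded config with relative paths
conf = {
    "plugins_directory": "plugins/",
    "stores": {
        "expectations_store": {
            "store_backend": {"base_directory": "expectations/"}
        }
    },
}
fixed = absolutize(conf, "/abs/gx_root")
print(fixed["plugins_directory"])  # /abs/gx_root/plugins/
```

The resulting dict could then be passed to DataContextConfig as in the snippet above; already-absolute paths are left untouched.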

But as I mentioned above, 0.17.11 supported RuntimeBatchRequest, where I could define the datasource, data asset, and runtime query as part of a checkpoint. I can see it is also in the 0.18.9 documentation, but I am not able to get it working and am struggling with this: https://docs.greatexpectations.io/docs/reference/api/core/batch/RuntimeBatchRequest_class. For example, the following works perfectly in 0.17.11:

validations:
  - batch_request:
      data_asset_name: runtime_asset
      runtime_parameters:
        query: "select column 1 from table"
    expectation_suite_name: appstat_suite

Is this supported in the latest GX, or do I have to create the data asset for the input query separately outside the checkpoint and then call the checkpoint as part of validation? Is there any way to add the datasource and query as part of the checkpoint?

This is a really helpful feature for us, because we keep all the respective queries as part of the checkpoint; they stay separate, and it is easy to identify each data asset together with its expectations.

thanks !

ramananayak avatar Apr 19 '24 13:04 ramananayak

Hi @Kilo59, do you have any information about how to set this type of config (the one in the previous comment) for a data asset in the latest GX version? In the new GX version, do we have to create a data asset first for every query and then add the required checkpoint? So there is no way to create a data asset at runtime?

If you have any pointers on this, that would really help.

thanks ! Ram

ramananayak avatar May 01 '24 15:05 ramananayak

@ramananayak any workflow from 0.17 should still work in 0.18.

I think the issue is that the new "Fluent Style" datasources (those created using the context.sources.add_<TYPE>() methods) do not support declaring queries as part of the batch request.

The documentation for the old "Block Style" datasources is no longer part of our latest docs; you'll have to refer to the 0.15 docs.

You can continue to use the old ("Block Style" Datasources) or you can create a QueryAsset.

runtime_datasource = ephm_ctx.sources.add_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True,
)

my_query_asset = runtime_datasource.add_query_asset(
    name="my_query",
    query="select column 1 from table",
)

batch_request = my_query_asset.build_batch_request()

# pass batch_request to your checkpoint
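For completeness, a checkpoint can then reference the asset by name instead of embedding runtime_parameters. This is a hedged sketch of what the equivalent 0.18-style validations entry might look like, with the datasource and asset names taken from the snippet above; treat the exact schema as an assumption to verify against your version:

```yaml
validations:
  - batch_request:
      datasource_name: ds_runtime
      data_asset_name: my_query
    expectation_suite_name: appstat_suite
```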

Does a QueryAsset with an ephemeral context meet your needs, or do you still want something different? We are actively working on 1.0, and this kind of feedback is invaluable.

Kilo59 avatar May 01 '24 15:05 Kilo59