fluent_datasources does not reflect runtime datasource addition
Describe the bug I want to add a fluent datasource at runtime after a FileDataContext is already defined. context.fluent_datasources is a dictionary, but when I add a new fluent datasource to it, the addition does not take effect, whereas it works on context.datasources.
To Reproduce
import great_expectations as gx
context = gx.data_context.FileDataContext(context_root_dir="my_context_dir")
connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"
runtime_datasource = gx.datasource.fluent.PostgresDatasource(name="ds_runtime", connection_string=connection_string, create_temp_table=True)
# Running the line below does not update the dictionary
context.fluent_datasources[runtime_datasource.name] = runtime_datasource
# whereas the line below updates properly, and also updates fluent_datasources
context.datasources[runtime_datasource.name] = runtime_datasource
Expected behavior context.fluent_datasources should show the added runtime_datasource inside the dictionary.
Environment (please complete the following information):
- Operating System: MacOS
- Great Expectations Version: 0.18.12
- Data Source: Redshift
- Cloud environment: AWS
@ramananayak
Sorry for the confusion: the context.fluent_datasources property is just a dictionary comprehension over context.datasources with all non-fluent datasources filtered out.
We could alter the return type annotation to an immutable Mapping[str, FluentDatasource] to help with this, but it wouldn't alter runtime behavior, and you'd have to rely on a type-checker or IDE to warn about it being immutable.
https://github.com/great-expectations/great_expectations/blob/abcf67159ad6562478318e659771237561b107b4/great_expectations/data_context/data_context/abstract_data_context.py#L4375-L4381
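For illustration, here is a minimal sketch of that behavior (the class and attribute names below are hypothetical, not the actual GX source):

class ContextSketch:
    """Toy stand-in for a data context; not the real implementation."""

    def __init__(self):
        self.datasources = {}  # the real backing store

    @property
    def fluent_datasources(self):
        # A fresh dict is built on every access, so anything assigned into
        # the returned dict never reaches the backing store.
        return {
            name: ds
            for name, ds in self.datasources.items()
            if getattr(ds, "is_fluent", False)
        }

ctx = ContextSketch()
ctx.fluent_datasources["ds_runtime"] = "some datasource"  # mutates a throwaway dict
print(ctx.fluent_datasources)  # {} -- the addition was silently lost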
The idiomatic way to add or update a datasource is to use one of the context.sources.add_or_update_<DATASOURCE_TYPE>() methods. These methods also bootstrap the datasource with the components needed to do config substitution and to connect certain datasources to things like S3/GCS/Databricks.
import great_expectations as gx
context = gx.data_context.FileDataContext(context_root_dir="my_context_dir")
connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"
runtime_datasource = context.sources.add_or_update_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True,
)
print(repr(runtime_datasource))
Thanks for the clarification @Kilo59.
I tried context.sources.add_postgres().
But for a FileDataContext this ends up updating the context file (great_expectations.yml) with the connection string details I am using as a variable in my code.
This does not serve the purpose of being runtime-only. Also, because of this write lock on the context file, multiple checks running on the same config will lead to failures. I want this datasource to be used just at runtime without affecting great_expectations.yml.
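For reference, a minimal illustration of the persistence side effect being described (the directory and connection string are placeholders):

import great_expectations as gx

context = gx.data_context.FileDataContext(context_root_dir="my_context_dir")
# On a FileDataContext, add_postgres() persists immediately:
# great_expectations.yml under my_context_dir is rewritten to include the
# new datasource block along with its connection string.
context.sources.add_postgres(
    name="ds_runtime",
    connection_string="postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>",
)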
I did some investigation and saw that for FileDataContext() the context file is opened in w
mode (https://github.com/great-expectations/great_expectations/blob/abcf67159ad6562478318e659771237561b107b4/great_expectations/data_context/data_context/file_data_context.py#L168).
So, is there any way to add configurations for true runtime use without changing the context file every time?
The same applies to data assets; I don't see any example showing how to create a runtime data asset. Currently I am testing with fluent datasources, and all the methods just keep adding data assets to the context file, which will lead to a growing config file. In 0.17.1, the config below would have created a runtime data asset without any update to the context file; for reference:
validations:
  - batch_request:
      data_asset_name: runtime_asset
      runtime_parameters:
        query: "select column 1 from table"
    expectation_suite_name: appstat_suite
I don't really know how I can achieve this in the latest version. Thanks for your help!
@ramananayak
I don't think this is exactly what you are looking for, but you can use an EphemeralDataContext that doesn't persist anything.
import great_expectations as gx
context = gx.get_context(mode="ephemeral")
connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"
runtime_datasource = context.sources.add_or_update_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True,
)
print(repr(runtime_datasource))
The code above ☝️ should work, but you won't have access to your file-backed checkpoints, expectations, etc. You would need to modify the code to pull in those items.
I will pass this along to our team working on the v1.0 release (and any other feedback you have).
There's a somewhat related issue where a user is creating an ephemeral context from a file context but is unable to load the fluent configs. For you, this shouldn't be a problem, though.
- https://github.com/great-expectations/great_expectations/issues/9283
Updated example that should allow your ephemeral context to pull in the project config from your file context.
import great_expectations as gx
# Create two different contexts using THE SAME config
file_ctx = gx.get_context(mode="file")
ephm_ctx = gx.get_context(mode="ephemeral", project_config=file_ctx.config)
connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"
runtime_datasource = ephm_ctx.sources.add_or_update_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True,
)
print(repr(runtime_datasource))
Hi @Kilo59, thanks for sharing this. Yes, as you mentioned, I tried the ephemeral context and it looks like it will work.
Here is my version:
import yaml
import great_expectations as gx
from great_expectations.data_context.types.base import DataContextConfig
from great_expectations.data_context import EphemeralDataContext
context_root_dir="path to my initial great_expectation.yml file "
with open(context_root_dir+'/great_expectations.yml', 'r') as file:
conf = yaml.safe_load(file)
context_config = DataContextConfig(**conf)
ephm_ctx = EphemeralDataContext(project_config=context_config)
connection_string = "postgresql+psycopg2://<user_name>:<password>@<host>:<port>/<database>"
runtime_datasource = ephm_ctx.sources.add_postgres(name="ds_runtime",
connection_string=connection_string,
create_temp_table=True)
print(repr(runtime_datasource))
This is working, although I have to specify the complete path for all the respective GX directories (like the plugins directory), but that's understood.
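As a hedged aside, one way to handle those directory paths is to rewrite them as absolute paths in the loaded config before constructing DataContextConfig (plugins_directory is a real config field, but the path below is a placeholder):

import os

# Hypothetical tweak to the snippet above: make relative directories in the
# loaded YAML absolute so the ephemeral context can resolve them.
conf["plugins_directory"] = os.path.join(context_root_dir, "plugins")
context_config = DataContextConfig(**conf)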
But as I mentioned above, 0.17.11 supported RuntimeBatchRequest, where I could define the datasource, data asset, and runtime query as part of a checkpoint. I can see it is also available in the 0.18.9 documentation, but I am not able to get it working and am struggling with this: https://docs.greatexpectations.io/docs/reference/api/core/batch/RuntimeBatchRequest_class. For example, the following works perfectly in 0.17.11:
validations:
  - batch_request:
      data_asset_name: runtime_asset
      runtime_parameters:
        query: "select column 1 from table"
    expectation_suite_name: appstat_suite
Is it supported in the latest GX, or do I have to create the data asset separately, outside of the checkpoint, for the input query, and then call the checkpoint as part of validation? Is there any way to add a datasource and query as part of a checkpoint?
This is a really helpful feature for us, as we keep all the respective queries as part of the checkpoint and they stay separate, making it easy to identify the data asset and expectations together.
Thanks!
Hi @Kilo59, do you have information about how we can set this type of config (the one in the previous comment) for a data asset in the latest GX version? In the new GX version, do we have to create a data asset first for every query and then add the required checkpoint? So there is no way for runtime data asset creation?
If you have any idea on this, some pointers would really help.
Thanks! Ram
@ramananayak any workflow from 0.17 should still work in 0.18.
I think the issue is that the new "Fluent Style" Datasources (which are datasources created using the context.sources.add_<TYPE>() methods) do not support declaring queries as part of the batch request.
The documentation for the old "Block Style" datasources is no longer part of our latest version; you'll have to refer to the 0.15 docs.
You can continue to use the old "Block Style" Datasources, or you can create a QueryAsset.
runtime_datasource = ephm_ctx.sources.add_postgres(
    name="ds_runtime",
    connection_string=connection_string,
    create_temp_table=True,
)
my_query_asset = runtime_datasource.add_query_asset(name="my_query", query="select column 1 from table")
batch_request = my_query_asset.build_batch_request()
# pass batch_request to your checkpoint
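To round out that last comment, a hedged sketch of wiring the batch request into a checkpoint (add_or_update_checkpoint is the 0.17/0.18 context API; the "appstat_suite" expectation suite is assumed to already exist in this context):

# Sketch: run the runtime batch through a checkpoint.
checkpoint = ephm_ctx.add_or_update_checkpoint(
    name="runtime_checkpoint",
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": "appstat_suite",
        }
    ],
)
result = checkpoint.run()
print(result.success)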
Does the QueryAsset with an ephemeral context meet your needs, or are you still wanting something different?
We are actively working on 1.0, and this kind of feedback is invaluable.
Hi @Kilo59, thanks for your response.
As you mentioned, if I am correct, block style datasource config is not supported in the latest version, and I assume support for the older 0.18 and 0.17 versions will end once the next two versions are released.
Now that I understand QueryAsset is the only way to go, I think I may have to write custom code to support runtime queries from config.
But I think runtime query config is a nice feature to have, because we have a lot of configs that users (say, analysts) set up, and all we do is wrap the config in an Airflow scheduler which runs these checks. This enables us to automate the whole flow through a config-driven framework. Now that everything is becoming a first-class object, automating the whole flow with multiple checks in a single input will add much more friction and only allow users to add a single check at a time.
As far as I know, a lot of people use this method to add multiple checks at once. Also, moving everything to a config file (in the case of a file data context) makes the config file very bulky, with a lot of unnecessary configs added to the context. Hope this makes sense. Thank you so much again!
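For illustration, a rough sketch of the kind of custom code described above, reusing runtime_datasource and ephm_ctx from the earlier examples; the config schema, key names, query, and suite here are hypothetical, and the suite is assumed to already exist:

import yaml

# Hypothetical analyst-maintained config; the schema and key names are invented.
checks_yaml = """
validations:
  - asset_name: runtime_asset
    query: "select column1 from my_table"
    expectation_suite_name: appstat_suite
"""

for check in yaml.safe_load(checks_yaml)["validations"]:
    # Build a QueryAsset at runtime from each config entry; nothing is
    # persisted because the context is ephemeral.
    asset = runtime_datasource.add_query_asset(
        name=check["asset_name"], query=check["query"]
    )
    checkpoint = ephm_ctx.add_or_update_checkpoint(
        name=f"{check['asset_name']}_checkpoint",
        validations=[
            {
                "batch_request": asset.build_batch_request(),
                "expectation_suite_name": check["expectation_suite_name"],
            }
        ],
    )
    print(checkpoint.run().success)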
@ramananayak -- in your case, are you expecting to be able to use the validation results that come from these runtime assets at any time other than the immediate validation? We designed runtime assets to mean that the data would be available/provided at runtime, but the asset configuration itself was durable. The intent of that approach was to ensure that saved validation results could be identified by the asset's (durable) name. It sounds to me like that may be the gap, in that you're not looking to have the configuration of the asset persist at all.
I'd love to jump on a call with you and @Kilo59 if you'd like to make sure I understand the case fully, since we've recently been looking at the question of how to support runtime cases more clearly.
Hello @ramananayak. With the launch of Great Expectations Core (GX 1.0), we are closing old issues posted regarding previous versions. Moving forward, we will focus our resources on supporting and improving GX Core (version 1.0 and beyond). If you find that an issue you previously reported still exists in GX Core, we encourage you to resubmit it against the new version. With more resources dedicated to community support, we aim to tackle new issues swiftly. For specific details on what is GX-supported vs community-supported, you can reference our integration and support policy.
To get started on your transition to GX Core, check out the GX Core quickstart (click “Full example code” tab to see a code example).
You can also join our upcoming community meeting on August 28th at 9am PT (noon ET / 4pm UTC) for a comprehensive rundown of everything GX Core, plus Q&A as time permits. Go to https://greatexpectations.io/meetup and click “follow calendar” to follow the GX community calendar.
Thank you for being part of the GX community and thank you for submitting this issue. We're excited about this new chapter and look forward to your feedback on GX Core. 🤗