Relational Databases (MySQL, PostgreSQL) Integration
- [x] relational databases (MySQL, PostgreSQL)
- [x] unit test
- [x] docs
- [ ] pyspark
- [ ] snowpark
- [ ] pandas
- [ ] duckdb
- [ ] polars
- [ ] other
Relational Databases (MySQL, PostgreSQL) Integration

Utilizing Polars with the ConnectorX engine to read from a database, tables in relational databases can now be checked and validated. ConnectorX, written in Rust, has native support for Apache Arrow, enabling it to transfer data directly into a Polars DataFrame without copying the data (zero-copy). More information about the ConnectorX engine can be found here.

This feature allows for the future integration of other DBMSs, such as Redshift (via the PostgreSQL protocol) and ClickHouse (via the MySQL protocol), into cuallee.
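As a minimal sketch of this read path (the helper function, URI, and table name are illustrative assumptions, not cuallee's actual API):

```python
def read_table(uri: str, table: str):
    """Read a whole table into a Polars DataFrame via ConnectorX.

    Requires the `polars` and `connectorx` packages and a reachable
    database; the URI format shown below is an illustrative example.
    """
    import polars as pl  # deferred import so the sketch stands alone

    # read_database_uri defaults to the ConnectorX engine, which moves
    # rows through Apache Arrow into the DataFrame without extra copies.
    return pl.read_database_uri(f"SELECT * FROM {table}", uri)


# Example usage (needs a running PostgreSQL instance):
# df = read_table("postgresql://user:password@localhost:5432/mydb", "public.orders")
```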
✨ Feature Enhancements:
- Added support for validating tables in `PostgreSQL` and `MySQL`, enabling more robust data integrity checks.
- Added the necessary dependencies to `pyproject.toml`.
🧪 Testing:
- Implemented unit tests for all `Compute` methods to ensure their reliability and correctness. (Unit tests for unimplemented methods are available but marked as `skip`.)
- Added `pytest` fixtures for automated unit tests for `PostgreSQL` and `MySQL`.
- Added test cases as `init-db-psql.sql` and `init-db-mysql.sql`.
- Added a unit test in `test_validate.py` that validates the type of the returned DataFrame, ensuring the result is a `polars_dataframe`.
- Added a unit test in `test_validate.py` that validates the error raised when a column is not found in the table.
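The fixture idea can be sketched with the standard library's `sqlite3` as a lightweight stand-in for the PostgreSQL/MySQL fixtures (the schema and helper below are illustrative, not the PR's actual test code):

```python
import sqlite3


def make_test_db() -> sqlite3.Connection:
    """Build an in-memory database seeded by a small init script,
    analogous to what init-db-psql.sql / init-db-mysql.sql provide."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
        INSERT INTO orders (id, amount) VALUES (1, 9.99), (2, 19.99);
        """
    )
    return conn


# Example validation a unit test would run: the id column has no NULLs.
conn = make_test_db()
(null_ids,) = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE id IS NULL"
).fetchone()
assert null_ids == 0
```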
📖 Documentation:
- Updated the `README.md`.
- Added a `How-To` for setting up the test environment.
⛑️ Known Issues:
- Most of `PostgreSQL`'s `Compute` methods are inherited from `DuckDB`'s `Compute` methods, so any changes to `DuckDB`'s `Compute` methods will affect `PostgreSQL`'s `Compute` methods.
- The `has_std` check could fail due to floating-point precision. The approach suggested below could resolve it, but it needs to introduce a precision-error parameter:

```python
# PostgreSQL
f"CAST(STDDEV_SAMP({rule.column}) AS FLOAT) - CAST({rule.value} AS FLOAT) < {precision_error}"

# MySQL
f"""IF(
    CAST(STDDEV_SAMP({rule.column}) AS FLOAT) - CAST({rule.value} AS FLOAT) < {precision_error},
    'True', 'False')
"""
```
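To illustrate why an exact equality on a standard deviation can fail, here is a small pure-Python sketch (values and tolerance are illustrative) of the tolerance-based comparison the proposed precision-error parameter would enable:

```python
import math
from statistics import stdev

# Illustrative column values.
values = [10.0, 10.1, 9.9, 10.2, 9.8]
expected_std = stdev(values)  # what a user would pass to has_std

# Recomputing the same quantity with a different sequence of operations
# can drift by a few units in the last place, so an exact equality check
# may be False even though the values agree for any practical purpose.
mean = sum(values) / len(values)
recomputed = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

# Tolerance-based comparison, mirroring the suggested precision_error parameter.
precision_error = 1e-9
assert abs(recomputed - expected_std) < precision_error
```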
🦺 Limitations:

PostgreSQL: not all checks are currently available. Unavailable checks:
- `is_daily`
- `has_entropy`
- `has_workflow`

MySQL: not all checks are currently available. Unavailable checks:
- `is_daily`
- `has_entropy`
- `has_workflow`
- `has_percentile`
- `has_correlation`
- `is_inside_interquartile_range`
✈️ Future Improvements:
- Improve the `Compute` methods to support complex queries: because the `Compute` methods are combined into a single unified query, it is currently impossible to build complex queries out of multiple `Compute` methods. This section needs to be improved to support complex queries.

```python
# In `psql_validation.py` and `mysql_validation.py`
unified_columns = ",\n\t".join([
    # This is the same as compute.`rule.method`(rule)
    operator.methodcaller(rule.method, rule)(Compute(check.table_name)) + f' AS "{rule.key}"'
    for rule in check.rules
])
unified_query = f"""
SELECT
\t{unified_columns}
FROM
\t{check.table_name}
"""
```
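The `operator.methodcaller` dispatch used in that snippet can be exercised in isolation; `DemoCompute` and `Rule` below are hypothetical stand-ins for cuallee's real classes:

```python
import operator
from dataclasses import dataclass


@dataclass
class Rule:
    method: str
    column: str
    key: str


class DemoCompute:
    """Hypothetical stand-in for the Compute class: each method returns
    the SQL fragment for one check against the given table."""

    def __init__(self, table_name: str):
        self.table_name = table_name

    def is_complete(self, rule: Rule) -> str:
        return f"SUM(CAST({rule.column} IS NOT NULL AS INTEGER))"


rules = [Rule(method="is_complete", column="id", key="1")]

# methodcaller looks up rule.method by name, just like the snippet above.
unified_columns = ",\n\t".join(
    operator.methodcaller(rule.method, rule)(DemoCompute("orders")) + f' AS "{rule.key}"'
    for rule in rules
)
print(unified_columns)  # SUM(CAST(id IS NOT NULL AS INTEGER)) AS "1"
```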
- Inheritance Enhancement: a significant portion of `PostgreSQL`'s `Compute` functionality is derived from `DuckDB`'s `Compute` methods, so any modification to the `Compute` methods in `DuckDB` will directly impact `PostgreSQL`. Consolidating these shared methods into a separate class would make them easier to manage.
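A sketch of the proposed consolidation (class names are hypothetical, not cuallee's actual hierarchy): shared SQL-generating methods move into one base class, and each backend subclass overrides only what differs:

```python
class SharedCompute:
    """Hypothetical base class holding methods common to both backends."""

    def __init__(self, table_name: str):
        self.table_name = table_name

    def is_complete(self, column: str) -> str:
        # Identical in DuckDB and PostgreSQL, so it is defined only once.
        return f"SUM(CAST({column} IS NOT NULL AS INTEGER))"


class DuckDBCompute(SharedCompute):
    pass  # inherits the shared methods unchanged


class PostgreSQLCompute(SharedCompute):
    # Backend-specific overrides would live here, making the dependency on
    # shared code explicit instead of PostgreSQL silently inheriting
    # whatever DuckDB happens to define.
    pass


assert DuckDBCompute("t").is_complete("id") == PostgreSQLCompute("t").is_complete("id")
```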
@dsaad68 thanks for this great addition. I will conduct a review and proceed with the integration.
If there are any issues with this suggestion, please let me know.
Hi @dsaad68, there are no issues with the submission. We are in an ongoing review for the Journal of Open Source Software, and they have indicated a lack of some documentation. Before adding more functionality, we would like to pass the JOSS review with the consolidated docs, at least for the core classes and modules. I think once we pass this hurdle, we will be adding the new functionality. Hope that explains the delays. Thank you in advance.
@canimus I totally understand, good luck. Let me know if I can help.
Hi @dsaad68, thanks for your patience with this PR. Now that the paper is out of the way and finally published, I would like to make sure your contribution is all in. This branch has been stalled for some time, and a few commits and merges have passed since; would you kindly review whether we can bring it up to the current state and resolve the conflicts highlighted above? Thanks in advance.
@canimus I will look at it this week.