Zach
Hello @ericabertone, I guess the "small" dataframe works because it's run on the driver and therefore is not serialized to the executors.
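As a generic illustration of the driver-vs-executor serialization point (a minimal sketch, unrelated to the specific objects in this issue): something that is not serializable can be used freely in driver-side code, but fails as soon as Spark has to ship it to executors.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper that does not extend Serializable
class Lookup {
  def resolve(id: Int): String = s"val_$id"
}

object SerializationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    val lookup = new Lookup
    val data = Seq(1, 2, 3)

    // Driver-only: the closure runs on the driver, `lookup` is never serialized
    val driverSide = data.map(lookup.resolve)
    println(driverSide.mkString(", "))

    // Distributed: the closure is shipped to executors and the map call fails with
    // org.apache.spark.SparkException: Task not serializable
    spark.sparkContext.parallelize(data).map(lookup.resolve).collect()
  }
}
```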
In the description we have the following open point: how do we override the connection configuration used for Actions when executed on a remote agent? I have the following suggestion:...
Hi @Geheiner, thanks for the updated ideas. Some thoughts from my side. 1) I propose to introduce new top level objects RemoteAgent (or just Agent?) which hold the configuration how...
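A rough sketch of what such a top-level object could carry, just to make the proposal concrete (all names and fields below are assumptions for discussion, not a final design):

```scala
// Sketch only: a possible shape for a new top-level Agent/RemoteAgent object.
case class ConnectionConfig(id: String, url: String, authMode: Option[String] = None)

case class RemoteAgent(
  id: String,                                 // referenced by actions that should run remotely
  endpoint: String,                           // how the driver reaches the agent
  connections: Map[String, ConnectionConfig]  // connection overrides applied when actions execute on this agent
)

// An action executed on a given agent would resolve its connections against
// agent.connections first and fall back to the global connection definitions.
```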
We should also support Azure Synapse:
- the JDBC driver above might work (see the sketch below)
- for dedicated Synapse SQL pools there is an optimized connector: https://github.com/MicrosoftDocs/azure-docs.de-de/blob/master/articles/synapse-analytics/spark/synapse-spark-sql-pool-import-export.md
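For the first bullet, a quick sketch of the plain-JDBC option against a Synapse SQL endpoint (host, database and credentials are placeholders; untested assumption that the standard SQL Server driver is sufficient):

```scala
import org.apache.spark.sql.SparkSession

object SynapseJdbcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("synapse-jdbc").getOrCreate()

    // Standard Spark JDBC read using the SQL Server driver against a Synapse endpoint
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://<workspace>.sql.azuresynapse.net:1433;database=<db>")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("dbtable", "dbo.my_table")
      .option("user", "<user>")
      .option("password", "<password>")
      .load()

    df.show()
  }
}
```

The connector linked in the second bullet avoids row-by-row JDBC transfer for dedicated SQL pools, so it would presumably perform better for large loads.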
Alternatively this could be implemented as a Transformer, but it would need the primary key of the output data object. This could be achieved by adding the current action to the...
It would be better to configure rank cols/expressions in order to control which duplicate records are discarded...
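As an illustration of the rank-column idea (the config wiring and field names are assumptions), a rank expression evaluated per primary key could decide which duplicate survives:

```scala
import org.apache.spark.sql.{DataFrame, functions => f}
import org.apache.spark.sql.expressions.Window

object DeduplicateByRank {
  // `primaryKey` and `rankExpr` are assumed to come from the action/data object config.
  def deduplicate(df: DataFrame, primaryKey: Seq[String], rankExpr: String): DataFrame = {
    val window = Window
      .partitionBy(primaryKey.map(f.col): _*)
      .orderBy(f.expr(rankExpr).desc)      // e.g. "captured_ts": keep the newest record per key
    df.withColumn("_rank", f.row_number().over(window))
      .filter(f.col("_rank") === 1)        // discard all lower-ranked duplicates
      .drop("_rank")
  }
}
```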
Implemented the allowSchemaEvolution property for JdbcTableDataObject and DeltaLakeTableDataObject. Should we implement it as well for HiveTableDataObject (only relevant with SaveMode.overwrite)?
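For reference, a sketch of what schema evolution looks like with plain Delta Lake write options (illustrative only; assumption that allowSchemaEvolution roughly maps to this, the SDLB wiring itself may differ):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object SchemaEvolutionSketch {
  def appendWithEvolution(df: DataFrame, path: String): Unit = {
    df.write
      .format("delta")
      .mode(SaveMode.Append)
      .option("mergeSchema", "true")   // new columns in df are added to the table schema
      .save(path)
  }

  // For SaveMode.Overwrite the analogous Delta switch is option("overwriteSchema", "true"),
  // which replaces the table schema instead of merging into it.
}
```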
"Define expressions to check thresholds" could be implemented using #377 for Spark. From a performance perspective this would be optimal.
This is probably also linked with #43.
To be discussed in our weekly... Missing points from my side:
- create a documentation site for data quality?
- check integration of the original Spark metrics.