OpenMetadata Add support for profiler/DQ for non SQA based sources

Is your feature request related to a problem? Please describe. Currently, data quality and the profiler are only available for databases that support SQA. We would like to extend this support to non-SQA data sources.

List of Connectors:

[ ] Kafka
[ ] Elasticsearch
[x] https://github.com/open-metadata/OpenMetadata/issues/10013
[ ] https://github.com/open-metadata/OpenMetadata/issues/8515
[x] https://github.com/open-metadata/OpenMetadata/issues/8514
[ ] https://github.com/open-metadata/OpenMetadata/issues/15315

Sep 14 '22 07:09 TeddyCr

@ayush-shah here is a use case from one of our users https://github.com/open-metadata/OpenMetadata/issues/7821

Sep 30 '22 12:09 TeddyCr

Few things to consider:

sampling % -> what approach do we want to take (e.g. sample from each file, picking only specific files, etc.)
- For example, the user inputs sample = 50%. Assume there are multiple files in the bucket.
partitioning -> /YYYY/MM/DD/HH/file.parquet, how do we want to handle it (allow users to set partition time range)
data size in memory (need to make sure we don't run out of memory)
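
To make the two sampling options above concrete, here is a rough sketch (not a proposed implementation). The function names `pick_files` and `sample_rows` are hypothetical, and it uses only the standard library. Strategy A profiles a random subset of files; strategy B streams every file and keeps each row with the requested probability, which also keeps memory bounded because rows are never all held at once.

```python
import math
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def pick_files(keys: List[str], fraction: float, seed: int = 42) -> List[str]:
    """Strategy A (hypothetical): profile only a random subset of the files."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not keys:
        return []
    # Round up so that a small bucket still yields at least one file.
    count = max(1, math.ceil(len(keys) * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(keys, count))


def sample_rows(rows: Iterable[T], fraction: float, seed: int = 42) -> Iterator[T]:
    """Strategy B (hypothetical): keep each row with probability `fraction`.

    Works on any iterable, so rows can be streamed file by file
    instead of being loaded into memory together.
    """
    rng = random.Random(seed)
    for row in rows:
        if rng.random() < fraction:
            yield row
```

With sample = 50% and ten files, `pick_files(keys, 0.5)` returns five file keys, while `sample_rows` returns roughly half of the rows across all files. Which one fits better likely depends on whether the files are homogeneous.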

Use case for first implementation

Users will have to specify a bucket (e.g. prod.datalake) and a prefix (e.g. raw/staging/drivers) [required]
Users can specify a partition format (e.g. YYYY/MM/DD/, YYYY/MM/DD/HH/, etc.). If no partition format is provided, assume files are at the root of the prefix.
If a partition format is specified, the user can also specify the partition range and unit (e.g. hours, days).
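
As a sketch of how the inputs above could translate into the prefixes to list, assuming the partition format tokens map directly to `strftime` directives. The function name `partition_prefixes`, its parameters, and the choice to count the range backwards from an `end` timestamp are all assumptions for illustration, not an agreed design.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed mapping from the format tokens in this issue to strftime directives.
_TOKENS = {"YYYY": "%Y", "MM": "%m", "DD": "%d", "HH": "%H"}
_UNITS = {"hours": timedelta(hours=1), "days": timedelta(days=1)}


def partition_prefixes(
    prefix: str,
    partition_format: Optional[str],
    end: datetime,
    partition_range: int = 1,
    unit: str = "days",
) -> List[str]:
    """Return the key prefixes to scan, oldest first (hypothetical helper)."""
    base = prefix.strip("/") + "/"
    if not partition_format:
        # No partition format: files are assumed to sit at the root of the prefix.
        return [base]
    fmt = partition_format.strip("/")
    for token, directive in _TOKENS.items():
        fmt = fmt.replace(token, directive)
    step = _UNITS[unit]
    prefixes: List[str] = []
    for i in range(partition_range - 1, -1, -1):
        path = base + (end - i * step).strftime(fmt) + "/"
        # Skip duplicates, e.g. an hourly range over a daily format.
        if path not in prefixes:
            prefixes.append(path)
    return prefixes
```

For example, `partition_prefixes("raw/staging/drivers", "YYYY/MM/DD/", datetime(2022, 10, 6), 3, "days")` gives the prefixes for 2022/10/04, 2022/10/05 and 2022/10/06 under raw/staging/drivers/.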

Most likely we will need to use pandas, with an implementation for each of the clouds.

Oct 06 '22 10:10 TeddyCr