Add support for profiler/DQ for non SQA based sources

Open TeddyCr opened this issue 3 years ago • 2 comments

Is your feature request related to a problem? Please describe. Currently, the data quality (DQ) and profiler workflows are only available for databases supported through SQA (SQLAlchemy). We would like to extend this support to non-SQA data sources.

List of Connectors:

  • [ ] Kafka
  • [ ] Elasticsearch
  • [x] https://github.com/open-metadata/OpenMetadata/issues/10013
  • [ ] https://github.com/open-metadata/OpenMetadata/issues/8515
  • [x] https://github.com/open-metadata/OpenMetadata/issues/8514
  • [ ] https://github.com/open-metadata/OpenMetadata/issues/15315

TeddyCr avatar Sep 14 '22 07:09 TeddyCr

@ayush-shah here is a use case from one of our users: https://github.com/open-metadata/OpenMetadata/issues/7821

TeddyCr avatar Sep 30 '22 12:09 TeddyCr

A few things to consider:

  • sampling % -> what approach do we want to take (e.g. sample from each file, picking only specific files, etc.)
    • For example, the user inputs sample = 50%. Assume there are multiple files in the bucket.
  • partitioning -> /YYYY/MM/DD/HH/file.parquet, how do we want to handle it (e.g. allow users to set a partition time range)
  • data size in memory (need to make sure we don't run out of memory)
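
To make the two sampling options above concrete, here is a rough sketch of "sample from each file" versus "pick only specific files". Everything in it is an assumption for discussion: the function names, the in-memory `dict` standing in for bucket contents, and the choice to round up. It is not how the profiler does it today.

```python
import math
import random
from typing import Dict, List, Sequence


def _check_percent(percent: float) -> None:
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")


def sample_rows_per_file(
    files: Dict[str, Sequence[dict]], percent: float, seed: int = 0
) -> List[dict]:
    """Option A (hypothetical): take `percent` of the rows from every file."""
    _check_percent(percent)
    rng = random.Random(seed)
    sampled: List[dict] = []
    for name in sorted(files):
        rows = list(files[name])
        # Round up so small files still contribute at least one row.
        k = math.ceil(len(rows) * percent / 100)
        sampled.extend(rng.sample(rows, k))
    return sampled


def sample_whole_files(
    names: Sequence[str], percent: float, seed: int = 0
) -> List[str]:
    """Option B (hypothetical): read only `percent` of the files, each in full."""
    _check_percent(percent)
    rng = random.Random(seed)
    k = math.ceil(len(names) * percent / 100)
    return sorted(rng.sample(list(names), k))
```

Option A touches every file but keeps the sample representative across files. Option B reads fewer objects, which is cheaper, but the result depends heavily on which files get picked. Neither option bounds memory on its own, so a cap on rows or bytes per file would still be needed.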

Use case for first implementation

  1. Users will have to specify bucket (e.g. prod.datalake) and prefix (e.g. raw/staging/drivers) [required]
  2. Users can specify a partition format (e.g. YYYY/MM/DD/, YYYY/MM/DD/HH/, etc.). If no partition format is provided, assume files are at the root of the prefix.
  3. If a partition format is specified, the user can also specify the partition range and unit (e.g. hours, days).
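
As a sketch of how steps 1 to 3 could turn into the list of key prefixes to scan: the function name, the parameter names, and the reading of "range" as "the last N units counting back from now" are all assumptions, not a settled design.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Assumed mapping from the partition tokens in this issue to strftime directives.
_TOKENS = {"YYYY": "%Y", "MM": "%m", "DD": "%d", "HH": "%H"}
_UNITS = {"hours": timedelta(hours=1), "days": timedelta(days=1)}


def partition_prefixes(
    prefix: str,
    partition_format: Optional[str] = None,
    partition_range: int = 1,
    unit: str = "days",
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the key prefixes to scan, newest partition first."""
    base = prefix.strip("/") + "/"
    if not partition_format:
        # No partition format: files are assumed to sit at the root of the prefix.
        return [base]
    if unit not in _UNITS:
        raise ValueError(f"unsupported unit: {unit!r}")

    fmt = partition_format.strip("/")
    for token, directive in _TOKENS.items():
        fmt = fmt.replace(token, directive)

    now = now or datetime.now(timezone.utc)
    prefixes: List[str] = []
    for i in range(partition_range):
        key = base + (now - i * _UNITS[unit]).strftime(fmt) + "/"
        # An hourly range over a daily format yields repeats; keep each prefix once.
        if key not in prefixes:
            prefixes.append(key)
    return prefixes
```

For example, with prefix `raw/staging/drivers`, format `YYYY/MM/DD/`, and a range of 2 days, this yields one prefix for today and one for yesterday. Each prefix would then be listed inside the bucket (e.g. `prod.datalake`) by the cloud-specific client.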

We will most likely need to use pandas, with a separate implementation for each cloud provider.

TeddyCr avatar Oct 06 '22 10:10 TeddyCr