fvaleye/metadata-guardian: Provide an easy way with Python to protect your da...

image::logo.png[Metadata Guardian logo] image:https://github.com/fvaleye/metadata-guardian/actions/workflows/python_build.yml/badge.svg[![python-build, link=https://github.com/fvaleye/metadata-guardian/actions/workflows/python_build.yml] image:https://github.com/fvaleye/metadata-guardian/actions/workflows/rust_build.yml/badge.svg[![rust-build, link=https://github.com/fvaleye/metadata-guardian/actions/workflows/rust_build.yml] image:https://img.shields.io/badge/docs-python-blue.svg?style=flat-square[Docs,link=https://fvaleye.github.io/metadata-guardian/python] image:https://img.shields.io/pypi/v/metadata_guardian.svg?style=flat-square)[Pypi, link=https://pypi.org/project/metadata-guardian/]

== 📌 Overview Metadata Guardian is a Python package that provides an easy way to protect your data sources by searching its metadata. By searching with data rules, it will detect what you are looking to protect. Using Rust, it makes blazing fast multi-regex matching.

Read more in this https://medium.com/@florian.valeye/metadata-guardian-protect-your-data-by-searching-its-metadata-fe479c24f1b1[article].

== 📦 Where to get it

# Install all the data sources
pip install 'metadata_guardian[all]'

# Install one or more data sources from the list
pip install 'metadata_guardian[snowflake,avro,aws,gcp,deltalake,kafka_schema_registry,mysql]'

== 📜 Data Rules The available data rules are here: https://github.com/fvaleye/metadata-guardian/blob/main/python/metadata_guardian/rules/pii_rules.yaml[PII] and https://github.com/fvaleye/metadata-guardian/blob/main/python/metadata_guardian/rules/inclusion_rules.yaml[INCLUSION]. But you could also your custom data rules to suit your needs.

== 📊 Data Sources

=== Local

Parquet
ORC
AVRO
AVRO Schema
Arrow

=== External

AWS: Athena and Glue
Deltalake
GCP: BigQuery
Snowflake
MySQL
Kafka Schema Registry

== 🔎 Usage

With available Data Rules:

from metadata_guardian import (
    AvailableCategory,
    ColumnScanner,
    DataRules,
)
from metadata_guardian.source import MySQLSource

source = MySQLSource(
        user="root",
        password="12345678",
        host="localhost",
    )

data_rules = DataRules.from_available_category(category=AvailableCategory.PII)
column_scanner = ColumnScanner(data_rules=data_rules)

with source:
    report = column_scanner.scan_external(
        source,
        database_name="sequelmovie",
        include_comment=True,
    )
    report.to_console()

With custom Data Rules:

from metadata_guardian import (
    AvailableCategory,
    ColumnScanner,
    DataRule,
    DataRules,
)
from metadata_guardian.source import MySQLSource

source = MySQLSource(
        user="root",
        password="12345678",
        host="localhost",
    )

category = "example"
data_rule = DataRule(
rule_name="example_rule_name",
regex_pattern="\b(test|example)\b",
documentation="example_test",
)
data_rules = [data_rule]
data_rules = DataRules.from_new_category(category=category, data_rules=data_rules)
column_scanner = ColumnScanner(
data_rules=data_rules, progression_bar_disabled=False
)

with source:
    report = column_scanner.scan_external(
    source,
    database_name="sequelmovie",
    include_comment=True,
    )
    report.to_console()

== 🛡️ Licence https://raw.githubusercontent.com/fvaleye/metadata-guardian/main/LICENSE.txt[Apache License 2.0]

== 📚 Documentation The documentation is hosted here: https://fvaleye.github.io/metadata-guardian/python/

metadata-guardian
metadata-guardian copied to clipboard

Metadata

← Metadata

Owner

Metadata

metadata-guardian metadata-guardian copied to clipboard

Metadata

← Metadata

Owner

Metadata

metadata-guardian
metadata-guardian copied to clipboard