
feat(ingestion): add option to disable Kafka schema auto-registration

ppiont opened this issue 1 month ago • 1 comment

Summary

Add a configuration option to disable automatic schema registration in the Python Kafka emitter, enabling production deployments with read-only Schema Registry access.

Problem Statement

Currently, DataHub's Python Kafka emitter always attempts to auto-register schemas with the Schema Registry. This causes issues in production environments where:

  • Schema Registry access is read-only for security/governance
  • Schema registration requires approval workflows
  • Users want to pre-register schemas externally

Users encounter this error:

confluent_kafka.schema_registry.error.SchemaRegistryError: User is denied operation Write on Subject: datahub-metadata-change-proposal.1-value (HTTP status code 403, SR code 40301)

Solution

This PR adds a new configuration option, disable_auto_schema_registration (sketched after this list), that:

  • Can be set via DATAHUB_KAFKA_DISABLE_AUTO_SCHEMA_REGISTRATION environment variable (recommended for Helm/Kubernetes)
  • Can be set in recipe configuration as a fallback
  • Passes auto.register.schemas=false to Confluent's AvroSerializer
  • Defaults to false (auto-registration enabled) for backward compatibility
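
A minimal sketch of what the config side could look like, assuming a pydantic-style model (the real KafkaProducerConnectionConfig has more fields, and the env var constant lives in env_vars.py):

import os

from pydantic import BaseModel, Field


def _disable_auto_registration_default() -> bool:
    # Used as the field default when the recipe does not set it explicitly.
    return os.environ.get(
        "DATAHUB_KAFKA_DISABLE_AUTO_SCHEMA_REGISTRATION", "false"
    ).lower() in ("true", "1")


class KafkaProducerConnectionConfig(BaseModel):
    bootstrap: str = "localhost:9092"
    schema_registry_url: str = "http://localhost:8081"
    # False by default, so existing deployments keep auto-registration.
    disable_auto_schema_registration: bool = Field(
        default_factory=_disable_auto_registration_default
    )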

Changes

Core Implementation

  • Environment Variable: Add DATAHUB_KAFKA_DISABLE_AUTO_SCHEMA_REGISTRATION in env_vars.py
  • Configuration: Add disable_auto_schema_registration field to KafkaProducerConnectionConfig
  • Kafka Emitter: Pass conf={'auto.register.schemas': False} to AvroSerializer when disabled (see the sketch after this list)
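
A rough sketch of the serializer wiring; auto.register.schemas is a documented Confluent serializer setting, while the helper shown here is a simplification, not the emitter's actual code:

from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer


def build_avro_serializer(
    registry_url: str, schema_str: str, disable_auto_registration: bool
) -> AvroSerializer:
    client = SchemaRegistryClient({"url": registry_url})
    conf = None
    if disable_auto_registration:
        # Look up the pre-registered schema ID instead of writing to the registry.
        conf = {"auto.register.schemas": False}
    return AvroSerializer(client, schema_str, conf=conf)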

Tests

  • Configuration default and explicit setting tests (test_kafka_emitter.py)
  • Environment variable parsing tests (test_kafka_config.py, a new file; see the sketch after this list)
  • AvroSerializer configuration tests (test_kafka_sink.py)
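
An env var parsing test could look like the following sketch (illustrative only, assuming the config class is importable from datahub.configuration.kafka; the PR's actual tests may differ):

import pytest

from datahub.configuration.kafka import KafkaProducerConnectionConfig


@pytest.mark.parametrize(
    "value,expected",
    [("true", True), ("True", True), ("false", False)],
)
def test_disable_auto_schema_registration_env_var(monkeypatch, value, expected):
    monkeypatch.setenv("DATAHUB_KAFKA_DISABLE_AUTO_SCHEMA_REGISTRATION", value)
    config = KafkaProducerConnectionConfig()
    assert config.disable_auto_schema_registration is expected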

Documentation

  • Updated sink_docs/datahub.md with config table entry and usage guide
  • Updated docs/deploy/environment-vars.md with new env var
  • Documented pre-registration requirements and schema locations

Usage

Option 1: Environment Variable (recommended for Helm/Kubernetes)

export DATAHUB_KAFKA_DISABLE_AUTO_SCHEMA_REGISTRATION=true

Option 2: Recipe Configuration

sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: "broker:9092"
      schema_registry_url: "http://schema-registry:8081"
      disable_auto_schema_registration: true
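
The same connection settings also apply when constructing the emitter programmatically; here is a sketch using the existing DatahubKafkaEmitter API, with the new field name taken from this PR:

from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig

emitter = DatahubKafkaEmitter(
    KafkaEmitterConfig.parse_obj(
        {
            "connection": {
                "bootstrap": "broker:9092",
                "schema_registry_url": "http://schema-registry:8081",
                "disable_auto_schema_registration": True,
            }
        }
    )
)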

Important: When auto-registration is disabled, the following subjects must be pre-registered (a scripted example follows the list):

  • Subject: MetadataChangeEvent_v4-value
  • Subject: MetadataChangeProposal_v1-value
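
Pre-registration can be scripted against a write-capable registry endpoint, for example with Confluent's client (a sketch; the .avsc paths are placeholders for the Avro schema files shipped with DataHub):

from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

client = SchemaRegistryClient({"url": "http://schema-registry:8081"})

for subject, schema_path in [
    ("MetadataChangeEvent_v4-value", "MetadataChangeEvent.avsc"),
    ("MetadataChangeProposal_v1-value", "MetadataChangeProposal.avsc"),
]:
    with open(schema_path) as f:
        # Register under the subject the emitter will later look up.
        client.register_schema(subject, Schema(f.read(), schema_type="AVRO"))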

Helm Integration Path

This PR prepares the Python client for Helm chart integration. A follow-up PR to acryldata/datahub-helm can expose this as:

global:
  kafka:
    schemaRegistry:
      disableAutoRegistration: false

Testing

  • [x] Unit tests added for configuration, environment variables, and serializer behavior
  • [x] All tests pass locally
  • [x] Backward compatible - default behavior unchanged
  • [x] Documentation updated with usage examples

Checklist

  • [x] The PR conforms to DataHub's Contributing Guideline
  • [x] PR Title follows the required format
  • [x] Tests for the changes have been added
  • [x] Docs related to the changes have been added/updated
  • [x] Usage guide added in sink documentation
  • [x] No breaking changes (backward compatible)

Related Issues

Resolves issues where users with read-only Schema Registry access cannot use the Kafka sink.

ppiont · Nov 24 '25 20:11

I would like to see this added ASAP. We have a restricted shared Aiven Kafka dev environment.

joaquinmenchaca-bit · Nov 26 '25 21:11