feat(ingestion): add option to disable Kafka schema auto-registration
## Summary
Add configuration to disable automatic schema registration in the Python Kafka emitter, enabling production deployments with read-only Schema Registry access.
## Problem Statement
Currently, DataHub's Python Kafka emitter always attempts to auto-register schemas with the Schema Registry. This causes issues in production environments where:
- Schema Registry access is read-only for security/governance
- Schema registration requires approval workflows
- Users want to pre-register schemas externally
Users encounter this error:

```
confluent_kafka.schema_registry.error.SchemaRegistryError: User is denied operation Write on Subject: datahub-metadata-change-proposal.1-value (HTTP status code 403, SR code 40301)
```
## Solution

This PR adds a new configuration option `disable_auto_schema_registration` that:

- Can be set via the `DATAHUB_KAFKA_DISABLE_AUTO_SCHEMA_REGISTRATION` environment variable (recommended for Helm/Kubernetes)
- Can be set in recipe configuration as a fallback
- Passes `auto.register.schemas=false` to Confluent's AvroSerializer
- Defaults to `false` (auto-registration enabled) for backward compatibility
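The resolution order above (env var first, recipe value as fallback, `false` as the default) can be sketched as follows. This is an illustrative stand-alone sketch, not the PR's actual code; the function name `resolve_disable_flag` and the accepted truthy strings are assumptions.

```python
import os

ENV_VAR = "DATAHUB_KAFKA_DISABLE_AUTO_SCHEMA_REGISTRATION"

def resolve_disable_flag(recipe_value: bool = False) -> bool:
    """Env var wins when set; otherwise fall back to the recipe value.

    Defaults to False, so auto-registration stays enabled and existing
    deployments are unaffected (backward compatible).
    """
    raw = os.environ.get(ENV_VAR)
    if raw is not None:
        return raw.strip().lower() in ("true", "1", "yes")
    return recipe_value

os.environ.pop(ENV_VAR, None)        # make the demo deterministic
print(resolve_disable_flag())        # False (default keeps auto-registration on)
os.environ[ENV_VAR] = "true"
print(resolve_disable_flag())        # True (env var overrides)
```

Having the environment variable take precedence keeps a Helm-managed deployment authoritative even when a recipe file says otherwise.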
## Changes

### Core Implementation

- **Environment Variable**: Add `DATAHUB_KAFKA_DISABLE_AUTO_SCHEMA_REGISTRATION` in `env_vars.py`
- **Configuration**: Add `disable_auto_schema_registration` field to `KafkaProducerConnectionConfig`
- **Kafka Emitter**: Pass `conf={'auto.register.schemas': False}` to `AvroSerializer` when disabled
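The emitter-side change boils down to building the extra `conf` dict handed to `confluent_kafka`'s `AvroSerializer`, which accepts `auto.register.schemas` (default `True`) in its configuration. A minimal sketch — the helper name `avro_serializer_conf` is hypothetical:

```python
# Sketch of the conf dict passed to confluent_kafka's AvroSerializer
# (AvroSerializer(schema_registry_client, schema_str, conf=...)).

def avro_serializer_conf(disable_auto_schema_registration: bool) -> dict:
    """Return the serializer conf; an empty dict keeps library defaults."""
    if disable_auto_schema_registration:
        # Serializer will only look schemas up, never write to the registry.
        return {"auto.register.schemas": False}
    return {}

print(avro_serializer_conf(True))   # {'auto.register.schemas': False}
print(avro_serializer_conf(False))  # {}
```

Returning an empty dict in the default case means existing behavior is untouched unless the flag is explicitly enabled.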
### Tests

- Configuration default and explicit setting tests (`test_kafka_emitter.py`)
- Environment variable parsing tests (`test_kafka_config.py`, new file)
- AvroSerializer configuration tests (`test_kafka_sink.py`)
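The configuration tests might look roughly like the pytest-style sketch below. A dataclass stands in for the real `KafkaProducerConnectionConfig` so the sketch runs standalone; field defaults mirror common local values and are illustrative.

```python
from dataclasses import dataclass

# Stand-in for KafkaProducerConnectionConfig; the real tests import the
# actual config class from the datahub package.
@dataclass
class KafkaConnectionConfig:
    bootstrap: str = "localhost:9092"
    schema_registry_url: str = "http://localhost:8081"
    disable_auto_schema_registration: bool = False  # default: auto-register

def test_default_is_enabled():
    # Backward compatibility: untouched configs keep auto-registration on.
    assert KafkaConnectionConfig().disable_auto_schema_registration is False

def test_explicit_disable():
    cfg = KafkaConnectionConfig(disable_auto_schema_registration=True)
    assert cfg.disable_auto_schema_registration is True

test_default_is_enabled()
test_explicit_disable()
print("ok")
```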
### Documentation

- Updated `sink_docs/datahub.md` with config table entry and usage guide
- Updated `docs/deploy/environment-vars.md` with new env var
- Documented pre-registration requirements and schema locations
## Usage

**Option 1: Environment Variable (recommended for Helm/Kubernetes)**

```shell
export DATAHUB_KAFKA_DISABLE_AUTO_SCHEMA_REGISTRATION=true
```
**Option 2: Recipe Configuration**

```yaml
sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: "broker:9092"
      schema_registry_url: "http://schema-registry:8081"
      disable_auto_schema_registration: true
```
**Important**: When disabled, schemas must be pre-registered:

- Subject: `MetadataChangeEvent_v4-value`
- Subject: `MetadataChangeProposal_v1-value`
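With auto-registration off, the subjects above can be pre-registered through the Schema Registry REST API (`POST /subjects/<subject>/versions` with a body of `{"schema": "<Avro schema as a JSON string>"}`). The sketch below only builds the request; the registry URL is illustrative and the schema is a placeholder, not DataHub's real Avro schema:

```python
import json

def registration_request(registry_url: str, subject: str, avro_schema: dict):
    """Build the (url, body) pair for pre-registering one subject."""
    url = f"{registry_url}/subjects/{subject}/versions"
    # The REST API expects the Avro schema double-encoded as a JSON string.
    body = json.dumps({"schema": json.dumps(avro_schema)})
    return url, body

url, body = registration_request(
    "http://schema-registry:8081",          # illustrative registry URL
    "MetadataChangeProposal_v1-value",
    {"type": "record", "name": "Example", "fields": []},  # placeholder schema
)
print(url)  # http://schema-registry:8081/subjects/MetadataChangeProposal_v1-value/versions
```

In practice the real `.avsc` files shipped with DataHub would be read from disk and POSTed (or registered with `confluent_kafka`'s `SchemaRegistryClient`) for both subjects before starting ingestion.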
## Helm Integration Path

This PR prepares the Python client for Helm chart integration. A follow-up PR to acryldata/datahub-helm can expose this as:

```yaml
global:
  kafka:
    schemaRegistry:
      disableAutoRegistration: false
```
## Testing
- [x] Unit tests added for configuration, environment variables, and serializer behavior
- [x] All tests pass locally
- [x] Backward compatible - default behavior unchanged
- [x] Documentation updated with usage examples
## Checklist
- [x] The PR conforms to DataHub's Contributing Guideline
- [x] PR Title follows the required format
- [x] Tests for the changes have been added
- [x] Docs related to the changes have been added/updated
- [x] Usage guide added in sink documentation
- [x] No breaking changes (backward compatible)
## Related Issues
Resolves issues where users with read-only Schema Registry access cannot use the Kafka sink.
We run a restricted, shared Aiven Kafka dev environment with read-only Schema Registry access, so we would appreciate this landing soon.