
feat: add option for hash algorithms

dpgaspar opened this pull request 4 months ago • 13 comments

User description

SUMMARY

Adds configurable hash algorithm support to enable FedRAMP compliance. This PR introduces a HASH_ALGORITHM configuration option that allows deployments to choose between MD5 (legacy) or SHA-256 (FedRAMP compliant) for non-cryptographic hash operations.

Background

Superset currently uses MD5 for cache key generation, thumbnail digests, and UUID namespace generation. While these are non-cryptographic uses, NIST FIPS 140-2 (required by FedRAMP) prohibits MD5 for any purpose.

What Changed

  1. Added HASH_ALGORITHM config in superset/config.py

    • Options: 'md5' (default, legacy) or 'sha256' (FedRAMP compliant)
    • Controls hash algorithm for cache keys, thumbnails, and UUID generation
  2. Refactored superset/utils/hashing.py

    • New generic functions: hash_from_str(), hash_from_dict()
    • Support for algorithm override via parameter
    • Dispatch table pattern for O(1) algorithm lookup (instead of if/elif chains)
    • Type-safe implementation with proper Callable annotations
    • Maintains backward compatibility with md5_sha_from_str() and md5_sha_from_dict() aliases
  3. Updated superset/key_value/utils.py

    • get_uuid_namespace() now supports configurable hashing
    • SHA-256 uses first 16 bytes for UUID compatibility
    • Dispatch table for UUID namespace generators for scalability
    • Added get_fallback_algorithms() helper for configurable fallback chains
    • Added optional app parameter for testing without Flask context
  4. Updated superset/extensions/metastore_cache.py

    • Passes Flask app to get_uuid_namespace() to avoid context issues during initialization

Lazy Migration Strategy for Key-Value Store

Critical Feature: Backward-compatible fallback for existing data

When switching from MD5 to SHA-256, existing entries (permalinks, app configs) remain accessible through a lazy migration approach:

How It Works

  1. Old entries keep their MD5-based UUIDs - No database updates required
  2. Lookup tries both algorithms - SHA-256 first (current), then MD5 (fallback)
  3. New entries use SHA-256 - Created with current algorithm
  4. Both coexist peacefully - Database contains mix of MD5 and SHA-256 entries

Implementation Details

  • get_shared_value(): Implements configurable fallback lookup logic for app configs (permalink salts)
  • CreateDashboardPermalinkCommand: Checks configured fallback algorithms to prevent duplicates
  • get_uuid_namespace_with_algorithm(): New helper for explicit algorithm selection
  • HASH_ALGORITHM_FALLBACKS config: Controls which algorithms to try after primary fails
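The UUID-namespace side of this can be sketched as below: an MD5 digest is exactly 16 bytes (the size of a UUID), while a SHA-256 digest is 32 bytes, so only the first 16 are kept. The names `_UUID_NAMESPACE_GENERATORS` and `get_uuid_namespace_with_algorithm` mirror the PR description; the bodies are illustrative assumptions:

```python
import hashlib
import uuid
from typing import Callable


def _md5_namespace(seed: str) -> uuid.UUID:
    # MD5 digest is exactly 16 bytes, a natural UUID fit.
    return uuid.UUID(bytes=hashlib.md5(seed.encode()).digest())


def _sha256_namespace(seed: str) -> uuid.UUID:
    # 32-byte digest truncated to the first 16 bytes for UUID compatibility.
    return uuid.UUID(bytes=hashlib.sha256(seed.encode()).digest()[:16])


_UUID_NAMESPACE_GENERATORS: dict[str, Callable[[str], uuid.UUID]] = {
    "md5": _md5_namespace,
    "sha256": _sha256_namespace,
}


def get_uuid_namespace_with_algorithm(seed: str, algorithm: str) -> uuid.UUID:
    return _UUID_NAMESPACE_GENERATORS[algorithm](seed)


# Deterministic: the same seed always yields the same namespace, which
# can then derive per-entry UUIDs (key name here is hypothetical).
ns = get_uuid_namespace_with_algorithm("some-seed", "sha256")
entry_uuid = uuid.uuid3(ns, "dashboard-permalink")
```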

Migration Impact

✅ No downtime - Works immediately after config change
✅ No data loss - Existing permalinks remain accessible
✅ No manual migration - Entries are never updated, only looked up differently
✅ Zero user impact - Old permalink URLs continue to work

Scalable Architecture for Future Algorithms

This PR implements a dispatch table pattern that makes adding new hash algorithms (e.g., SHA-512, SHA3) straightforward:

Dispatch Tables

  • _HASH_FUNCTIONS in hashing.py - Maps algorithm names to hash functions
  • _UUID_NAMESPACE_GENERATORS in key_value/utils.py - Maps algorithms to UUID generators
  • O(1) lookup performance (vs O(n) for if/elif chains)
  • No hardcoded algorithm string comparisons

Configurable Fallback Chain

# superset_config.py
HASH_ALGORITHM = "sha512"  # Future algorithm
HASH_ALGORITHM_FALLBACKS = ["sha256", "md5"]  # Try these if primary fails

Benefits:

  • Adding SHA-512 requires only updating type hints and dispatch tables
  • No changes to fallback logic (uses config)
  • Migration path: sha256 → sha512 with fallback to [sha256, md5]
  • Empty list disables fallback (strict mode)
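The `get_fallback_algorithms()` helper is named in the PR; a plausible sketch of its behavior, based solely on the semantics described above (primary first, then configured fallbacks, empty list meaning strict mode), would be:

```python
def get_fallback_algorithms(primary: str, fallbacks: list[str]) -> list[str]:
    """Return the ordered list of algorithms to try on lookup.

    The primary algorithm is always tried first; configured fallbacks
    follow. An empty fallback list yields strict mode (primary only).
    This body is an assumption, not the merged implementation.
    """
    return [primary, *[algo for algo in fallbacks if algo != primary]]


# Future migration path from the example above:
chain = get_fallback_algorithms("sha512", ["sha256", "md5"])
# -> ["sha512", "sha256", "md5"]

strict = get_fallback_algorithms("sha256", [])
# -> ["sha256"] (strict mode: no fallback lookups)
```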

Migration Paths

Path A: New FedRAMP-Compliant Deployment

# superset_config.py
HASH_ALGORITHM = 'sha256'
HASH_ALGORITHM_FALLBACKS = ["md5"]  # Default
  • Impact: None (clean start)
  • FedRAMP compliant from day one

Path B: Existing Deployment - Accept Cache Invalidation

# superset_config.py
HASH_ALGORITHM = 'sha256'
HASH_ALGORITHM_FALLBACKS = ["md5"]  # Enables permalink compatibility
  • Impact: All cached content invalidated (cache misses for 24-48 hours)
  • Cache re-warms naturally
  • Existing permalinks continue working via fallback
  • FedRAMP compliant after deployment

Path C: Legacy Deployment - Stay on MD5

# superset_config.py
HASH_ALGORITHM = 'md5'  # default
  • Impact: None (continues using MD5)
  • Not FedRAMP compliant

Path D: Future Migration to SHA-512

# superset_config.py
HASH_ALGORITHM = 'sha512'  # When available
HASH_ALGORITHM_FALLBACKS = ["sha256", "md5"]  # Gradual migration
  • Supports multi-generation fallback
  • No code changes needed (configuration only)

Performance Impact

  • SHA-256 is ~10-30% slower than MD5 for small inputs
  • Absolute overhead: <1ms per hash operation
  • Impact on request latency: <0.1%
  • Negligible for typical workloads
  • Dispatch table lookup: O(1) constant time
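The relative-cost claim is easy to spot-check with a micro-benchmark; exact numbers vary by platform and OpenSSL build, so treat the output as indicative only (the payload below is an invented stand-in for a typical cache key):

```python
import hashlib
import timeit

# A small input, roughly the size of a serialized cache-key payload.
payload = b"SELECT * FROM some_table WHERE id = 42" * 8

N = 100_000
md5_total = timeit.timeit(lambda: hashlib.md5(payload).hexdigest(), number=N)
sha_total = timeit.timeit(lambda: hashlib.sha256(payload).hexdigest(), number=N)

# Per-operation cost is well under a millisecond on any modern CPU.
print(f"md5:    {md5_total / N * 1e6:.2f} us/op")
print(f"sha256: {sha_total / N * 1e6:.2f} us/op")
```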

Breaking Changes

⚠️ Changing HASH_ALGORITHM invalidates all cached content (cache keys change). Permalinks remain valid (stored in database).

Testing

  • 29 unit tests pass in tests/unit_tests/key_value/
  • New tests for get_fallback_algorithms() with single/multiple/no fallbacks
  • Migration tests verify MD5 fallback and SHA-256 primary lookup
  • Type-safety verified with MyPy

ADDITIONAL INFORMATION

  • [x] Has associated issue: (FedRAMP compliance requirement)
  • [ ] Required feature flags: N/A
  • [ ] Changes UI: No
  • [ ] Includes DB Migration: No
  • [x] Introduces new feature or API: Adds HASH_ALGORITHM and HASH_ALGORITHM_FALLBACKS config options
  • [ ] Removes existing feature or API: No

CodeAnt-AI Description

Support configurable hash algorithm (MD5 or SHA-256) and graceful migration

What Changed

  • Adds a new configuration option HASH_ALGORITHM (choices: "md5" or "sha256") that controls all non-cryptographic hashing used by the app (cache keys, thumbnail digests, UUID namespaces, permalinks, screenshot keys, filter identifiers, and similar outputs).
  • Adds HASH_ALGORITHM_FALLBACKS and runtime fallback lookups so reads of existing permalink/shared entries try the current algorithm first and then configured legacy algorithms; when an entry is found via fallback it is best-effort migrated to the current algorithm so existing permalinks and shared values keep working.
  • Changing HASH_ALGORITHM rotates many externally visible identifiers (cache keys, thumbnails, permalinks, UUID-based keys). This invalidates existing cached content and thumbnails; cache will re-warm naturally (note: ~24–48 hours in typical deployments).
  • Test coverage added and updated to verify both MD5 and SHA-256 modes, deterministic UUID namespace generation, and fallback/migration behavior.

Impact

✅ FedRAMP-compliant hashing for non-cryptographic uses
✅ Fewer broken permalinks and shared values during MD5→SHA-256 migration
✅ Predictable cache and thumbnail invalidation after algorithm change


dpgaspar · Oct 13 '25 15:10


Codecov Report

❌ Patch coverage is 81.48148% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.99%. Comparing base (1127374) to head (21a707f).
⚠️ Report is 17 commits behind head on master.

Files with missing lines Patch % Lines
superset/key_value/shared_entries.py 71.42% 5 Missing and 1 partial :warning:
superset/utils/hashing.py 83.33% 1 Missing and 2 partials :warning:
superset/commands/dashboard/permalink/create.py 85.71% 1 Missing and 1 partial :warning:
superset/key_value/utils.py 91.30% 1 Missing and 1 partial :warning:
superset/db_engine_specs/base.py 50.00% 1 Missing :warning:
superset/db_engine_specs/bigquery.py 66.66% 1 Missing :warning:
superset/db_engine_specs/clickhouse.py 50.00% 1 Missing :warning:
superset/db_engine_specs/databend.py 50.00% 1 Missing :warning:
superset/db_engine_specs/dremio.py 50.00% 1 Missing :warning:
superset/db_engine_specs/drill.py 50.00% 1 Missing :warning:
... and 1 more
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #35621       +/-   ##
===========================================
+ Coverage        0   67.99%   +67.99%     
===========================================
  Files           0      635      +635     
  Lines           0    46949    +46949     
  Branches        0     5106     +5106     
===========================================
+ Hits            0    31923    +31923     
- Misses          0    13749    +13749     
- Partials        0     1277     +1277     
Flag Coverage Δ
hive 43.64% <34.25%> (?)
mysql 67.05% <81.48%> (?)
postgres 67.10% <81.48%> (?)
presto 47.29% <40.74%> (?)
python 67.96% <81.48%> (?)
sqlite 66.81% <81.48%> (?)
unit 100.00% <ø> (?)


codecov[bot] · Oct 13 '25 16:10

@dpgaspar How do we handle existing data? Hash values are used for content that can be regenerated such as cache, thumbnails, etc but also for content that can't be regenerated like meta store entries that use the key_value function such as permalinks, app configs, etc.

I think the correct way of solving this is to add the algorithm that was used to hash the value in the database. We could create a migration that adds the algorithm column set to MD5 for all existing values. Then, we create a function that parses the hashes based on the saved algorithm. This way, admins can change HASH_ALGORITHM multiple times without breaking anything.

michael-s-molina · Oct 14 '25 11:10

> @dpgaspar How do we handle existing data? Hash values are used for content that can be regenerated such as cache, thumbnails, etc but also for content that can't be regenerated like meta store entries that use the key_value function such as permalinks, app configs, etc.
>
> I think the correct way of solving this is to add the algorithm that was used to hash the value in the database. We could create a migration that adds the algorithm column set to MD5 for all existing values. Then, we create a function that parses the hashes based on the saved algorithm. This way, admins can change HASH_ALGORITHM multiple times without breaking anything.

I agree on @michael-s-molina 's proposal. For fresh installs this is ofc not a problem, but without a simple migration tool it will be difficult to benefit from these improvements on existing deployments.

villebro · Oct 15 '25 18:10

@villebro @michael-s-molina I've added fallback logic using a new config named HASH_ALGORITHM_FALLBACKS for lazy migration. We can't migrate existing keys in the key-value store, but we can guarantee that all new keys will use the newly configured algorithm and that existing keys (for permalinks) are still discoverable

dpgaspar · Nov 26 '25 09:11

> existing keys (for permalinks) are still discoverable

What about application configs? It would be important to spell out exactly what type of information can't be re-generated, to make sure fallbacks are applied to all of it. Also, wouldn't it be better to save the hash type in the database (as a separate column, or as a prefix for the hash like md5$<hashvalue>) instead of a try/error approach with fallbacks?

michael-s-molina · Nov 26 '25 13:11

> existing keys (for permalinks) are still discoverable

> What about application configs? It would be important to spell out exactly what type of information can't be re-generated, to make sure fallbacks are applied to all of it. Also, wouldn't it be better to save the hash type in the database (as a separate column, or as a prefix for the hash like md5$<hashvalue>) instead of a try/error approach with fallbacks?

Regarding application configs: Yes, they're covered. The get_shared_value() function in superset/key_value/shared_entries.py implements the same fallback mechanism for app configs (like the permalink salt). When found via fallback, it also auto-migrates the entry to the current algorithm's UUID.

Regarding storing the hash type (column or prefix): I considered this approach, but here's why I chose the fallback pattern instead:

  1. The hash isn't stored directly - The hash algorithm generates a UUID namespace, and that UUID is what's stored. To track the algorithm, we'd need a new column on key_value entries, requiring a DB migration + backfill.
  2. Fallback scope is very limited - It's only used in two places:
    • CreateDashboardPermalinkCommand: To check for duplicates before creating (prevents creating a new permalink if one exists with the old algorithm)
    • get_shared_value(): For app configs like the permalink salt

Notably, GET permalink by URL doesn't need fallback - the URL contains an encoded integer ID, so lookups are direct regardless of which algorithm created the entry.

  3. Trivial algorithm set - With only 2-3 algorithms (md5, sha256, maybe sha512 someday), the "try/error" cost is one extra UUID generation + DB lookup in the worst case - microseconds.
  4. Transitional by nature - Fallback is only for legacy entries. New entries use the current algorithm, so the fallback list naturally becomes less relevant over time.

Given the limited scope (duplicate detection + app configs only) and the migration complexity a column approach would require, the fallback pattern seemed like the right trade-off. But happy to discuss further!

dpgaspar · Dec 04 '25 11:12

> Given the limited scope (duplicate detection + app configs only) and the migration complexity a column approach would require, the fallback pattern seemed like the right trade-off.

Thank you for the additional context @dpgaspar. Your points seem reasonable.

michael-s-molina · Dec 04 '25 16:12


Brilliant work @dpgaspar , I like the fallback approach to resolve this stalemate 👍 One last consideration, but I'm fine either way.

Wow, thank you for the reviews as well! I've added a line to UPDATING.md since the default is now sha256

dpgaspar · Dec 09 '25 16:12