feat: add option for hash algorithms
User description
SUMMARY
Adds configurable hash algorithm support to enable FedRAMP compliance. This PR introduces a HASH_ALGORITHM configuration option that allows deployments to choose between MD5 (legacy) or SHA-256 (FedRAMP compliant) for non-cryptographic hash operations.
Background
Superset currently uses MD5 for cache key generation, thumbnail digests, and UUID namespace generation. While these are non-cryptographic uses, NIST FIPS 140-2 (required by FedRAMP) prohibits MD5 for any purpose.
What Changed
- Added `HASH_ALGORITHM` config in `superset/config.py`
  - Options: `'md5'` (default, legacy) or `'sha256'` (FedRAMP compliant)
  - Controls the hash algorithm for cache keys, thumbnails, and UUID generation
- Refactored `superset/utils/hashing.py`
  - New generic functions: `hash_from_str()`, `hash_from_dict()`
  - Support for algorithm override via parameter
  - Dispatch table pattern for O(1) algorithm lookup (instead of if/elif chains)
  - Type-safe implementation with proper `Callable` annotations
  - Maintains backward compatibility with `md5_sha_from_str()` and `md5_sha_from_dict()` aliases
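The refactored hashing module can be sketched roughly as follows. The function and table names (`hash_from_str`, `hash_from_dict`, `_HASH_FUNCTIONS`, and the `md5_sha_*` aliases) come from this PR description, but the bodies below are illustrative, not the actual Superset implementation:

```python
import hashlib
import json
from typing import Any, Callable

# Dispatch table: O(1) lookup instead of if/elif chains.
_HASH_FUNCTIONS: dict[str, Callable] = {
    "md5": hashlib.md5,
    "sha256": hashlib.sha256,
}


def hash_from_str(value: str, algorithm: str = "md5") -> str:
    """Hash a string with the configured (or overridden) algorithm."""
    try:
        hash_func = _HASH_FUNCTIONS[algorithm]
    except KeyError:
        raise ValueError(f"Unsupported hash algorithm: {algorithm}") from None
    return hash_func(value.encode("utf-8")).hexdigest()


def hash_from_dict(obj: dict[str, Any], algorithm: str = "md5") -> str:
    """Hash a dict by serializing it deterministically first."""
    return hash_from_str(json.dumps(obj, sort_keys=True), algorithm)


# Backward-compatible aliases mentioned in the PR:
def md5_sha_from_str(value: str) -> str:
    return hash_from_str(value, "md5")


def md5_sha_from_dict(obj: dict[str, Any]) -> str:
    return hash_from_dict(obj, "md5")
```

The dispatch table keeps algorithm selection data-driven, so callers never compare algorithm strings directly.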
- Updated `superset/key_value/utils.py`
  - `get_uuid_namespace()` now supports configurable hashing; SHA-256 uses the first 16 bytes of the digest for UUID compatibility
  - Dispatch table of UUID namespace generators for scalability
  - Added `get_fallback_algorithms()` helper for configurable fallback chains
  - Added optional `app` parameter for testing without a Flask context
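The UUID-compatibility point above (a UUID holds exactly 16 bytes, while a SHA-256 digest is 32) can be illustrated with a minimal sketch. The names mirror those in the PR description, but this is an assumption-laden illustration, not Superset's actual code:

```python
import hashlib
import uuid
from typing import Callable


def _md5_namespace(seed: str) -> uuid.UUID:
    # An MD5 digest is exactly 16 bytes, which is already UUID-sized.
    return uuid.UUID(hashlib.md5(seed.encode("utf-8")).hexdigest())


def _sha256_namespace(seed: str) -> uuid.UUID:
    # A SHA-256 digest is 32 bytes; a UUID needs exactly 16,
    # so use the first 16 bytes of the digest.
    return uuid.UUID(bytes=hashlib.sha256(seed.encode("utf-8")).digest()[:16])


# Dispatch table of UUID namespace generators, as described in the PR.
_UUID_NAMESPACE_GENERATORS: dict[str, Callable[[str], uuid.UUID]] = {
    "md5": _md5_namespace,
    "sha256": _sha256_namespace,
}


def get_uuid_namespace(seed: str, algorithm: str = "md5") -> uuid.UUID:
    return _UUID_NAMESPACE_GENERATORS[algorithm](seed)
```

Because both generators are deterministic, the same seed and algorithm always yield the same namespace UUID, which is what makes key-value lookups reproducible.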
-
-
Updated
superset/extensions/metastore_cache.py- Passes Flask app to
get_uuid_namespace()to avoid context issues during initialization
- Passes Flask app to
Lazy Migration Strategy for Key-Value Store
Critical Feature: Backward-compatible fallback for existing data
When switching from MD5 to SHA-256, existing entries (permalinks, app configs) remain accessible through a lazy migration approach:
How It Works
- Old entries keep their MD5-based UUIDs - No database updates required
- Lookup tries both algorithms - SHA-256 first (current), then MD5 (fallback)
- New entries use SHA-256 - Created with current algorithm
- Both coexist peacefully - Database contains mix of MD5 and SHA-256 entries
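The lookup order described above can be sketched as follows. `store`, `namespace_for`, and `lookup_with_fallback` are hypothetical stand-ins for the key-value metastore and the PR's real lookup paths (`get_shared_value()` and friends), which differ in detail:

```python
import hashlib
import uuid


def namespace_for(seed: str, algorithm: str) -> uuid.UUID:
    # Truncate the digest to the 16 bytes a UUID requires.
    digest = getattr(hashlib, algorithm)(seed.encode("utf-8")).digest()
    return uuid.UUID(bytes=digest[:16])


def lookup_with_fallback(store, seed, primary="sha256", fallbacks=("md5",)):
    """Try the current algorithm first, then each configured fallback."""
    for algorithm in (primary, *fallbacks):
        key = namespace_for(seed, algorithm)
        if key in store:
            return store[key]
    return None
```

An entry created under MD5 is still found after the switch to SHA-256, at the cost of one extra key computation and lookup; entries created under the current algorithm hit on the first try.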
Implementation Details
- `get_shared_value()`: implements configurable fallback lookup logic for app configs (permalink salts)
- `CreateDashboardPermalinkCommand`: checks configured fallback algorithms to prevent duplicates
- `get_uuid_namespace_with_algorithm()`: new helper for explicit algorithm selection
- `HASH_ALGORITHM_FALLBACKS` config: controls which algorithms to try after the primary fails
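A rough sketch of how the fallback config could drive the chain of algorithms to try. `get_fallback_algorithms` is named in the PR, but the bodies and the `algorithms_to_try` helper here are illustrative assumptions:

```python
# Hypothetical config values, mirroring the PR's described defaults:
CONFIG = {
    "HASH_ALGORITHM": "sha256",
    "HASH_ALGORITHM_FALLBACKS": ["md5"],
}


def get_fallback_algorithms(config: dict) -> list[str]:
    """Return the ordered algorithms to try after the primary misses.

    An empty list disables fallback entirely (strict mode).
    """
    return list(config.get("HASH_ALGORITHM_FALLBACKS", []))


def algorithms_to_try(config: dict) -> list[str]:
    # Primary algorithm first, then configured fallbacks, duplicates dropped.
    ordered: list[str] = []
    for algo in [config["HASH_ALGORITHM"], *get_fallback_algorithms(config)]:
        if algo not in ordered:
            ordered.append(algo)
    return ordered
```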
Migration Impact
- ✅ No downtime: works immediately after the config change
- ✅ No data loss: existing permalinks remain accessible
- ✅ No manual migration: entries are never updated, only looked up differently
- ✅ Zero user impact: old permalink URLs continue to work
Scalable Architecture for Future Algorithms
This PR implements a dispatch table pattern that makes adding new hash algorithms (e.g., SHA-512, SHA3) straightforward:
Dispatch Tables
- `_HASH_FUNCTIONS` in `hashing.py`: maps algorithm names to hash functions
- `_UUID_NAMESPACE_GENERATORS` in `key_value/utils.py`: maps algorithms to UUID generators
- O(1) lookup performance (vs O(n) for if/elif chains)
- No hardcoded algorithm string comparisons
Configurable Fallback Chain
```python
# superset_config.py
HASH_ALGORITHM = "sha512"  # Future algorithm
HASH_ALGORITHM_FALLBACKS = ["sha256", "md5"]  # Try these if primary fails
```
Benefits:
- Adding SHA-512 requires only updating type hints and dispatch tables
- No changes to fallback logic (uses config)
- Migration path: sha256 → sha512 with fallback to [sha256, md5]
- Empty list disables fallback (strict mode)
Migration Paths
Path A: New FedRAMP-Compliant Deployment
```python
# superset_config.py
HASH_ALGORITHM = "sha256"
HASH_ALGORITHM_FALLBACKS = ["md5"]  # Default
```
- Impact: None (clean start)
- FedRAMP compliant from day one
Path B: Existing Deployment - Accept Cache Invalidation
```python
# superset_config.py
HASH_ALGORITHM = "sha256"
HASH_ALGORITHM_FALLBACKS = ["md5"]  # Enables permalink compatibility
```
- Impact: All cached content invalidated (cache misses for 24-48 hours)
- Cache re-warms naturally
- Existing permalinks continue working via fallback
- FedRAMP compliant after deployment
Path C: Legacy Deployment - Stay on MD5
```python
# superset_config.py
HASH_ALGORITHM = "md5"  # default
```
- Impact: None (continues using MD5)
- Not FedRAMP compliant
Path D: Future Migration to SHA-512
```python
# superset_config.py
HASH_ALGORITHM = "sha512"  # When available
HASH_ALGORITHM_FALLBACKS = ["sha256", "md5"]  # Gradual migration
```
- Supports multi-generation fallback
- No code changes needed (configuration only)
Performance Impact
- SHA-256 is ~10-30% slower than MD5 for small inputs
- Absolute overhead: <1ms per hash operation
- Impact on request latency: <0.1%
- Negligible for typical workloads
- Dispatch table lookup: O(1) constant time
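The relative-speed claim can be checked locally with the stdlib `timeit`; the snippet below is a quick benchmark sketch, and the actual numbers vary by platform and OpenSSL build:

```python
import hashlib
import timeit

payload = b"dashboard-cache-key-" * 8  # a small, cache-key-sized input
N = 100_000

md5_time = timeit.timeit(lambda: hashlib.md5(payload).hexdigest(), number=N)
sha256_time = timeit.timeit(lambda: hashlib.sha256(payload).hexdigest(), number=N)

# Per-operation cost is on the order of microseconds for both algorithms,
# so the relative slowdown is negligible against request latency.
print(f"md5:    {md5_time / N * 1e6:.2f} us/op")
print(f"sha256: {sha256_time / N * 1e6:.2f} us/op")
```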
Breaking Changes
⚠️ Changing HASH_ALGORITHM invalidates all cached content (cache keys change). Permalinks remain valid (stored in database).
Testing
- 29 unit tests pass in `tests/unit_tests/key_value/`
- New tests for `get_fallback_algorithms()` with single/multiple/no fallbacks
- Migration tests verify MD5 fallback and SHA-256 primary lookup
- Type safety verified with MyPy
ADDITIONAL INFORMATION
- [x] Has associated issue: (FedRAMP compliance requirement)
- [ ] Required feature flags: N/A
- [ ] Changes UI: No
- [ ] Includes DB Migration: No
- [x] Introduces new feature or API: Adds `HASH_ALGORITHM` and `HASH_ALGORITHM_FALLBACKS` config options
- [ ] Removes existing feature or API: No
CodeAnt-AI Description
Support configurable hash algorithm (MD5 or SHA-256) and graceful migration
What Changed
- Adds a new configuration option HASH_ALGORITHM (choices: "md5" or "sha256") that controls all non-cryptographic hashing used by the app (cache keys, thumbnail digests, UUID namespaces, permalinks, screenshot keys, filter identifiers, and similar outputs).
- Adds HASH_ALGORITHM_FALLBACKS and runtime fallback lookups so reads of existing permalink/shared entries try the current algorithm first and then configured legacy algorithms; when an entry is found via fallback it is best-effort migrated to the current algorithm so existing permalinks and shared values keep working.
- Changing HASH_ALGORITHM rotates many externally visible identifiers (cache keys, thumbnails, permalinks, UUID-based keys). This invalidates existing cached content and thumbnails; cache will re-warm naturally (note: ~24–48 hours in typical deployments).
- Test coverage added and updated to verify both MD5 and SHA-256 modes, deterministic UUID namespace generation, and fallback/migration behavior.
Impact
✅ FedRAMP-compliant hashing for non-cryptographic uses
✅ Fewer broken permalinks and shared values during MD5→SHA-256 migration
✅ Predictable cache and thumbnail invalidation after algorithm change
Codecov Report
:x: Patch coverage is 81.48148% with 20 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 67.99%. Comparing base (1127374) to head (21a707f).
:warning: Report is 17 commits behind head on master.
@dpgaspar How do we handle existing data? Hash values are used for content that can be regenerated such as cache, thumbnails, etc but also for content that can't be regenerated like meta store entries that use the key_value function such as permalinks, app configs, etc.
I think the correct way of solving this is to add the algorithm that was used to hash the value in the database. We could create a migration that adds the algorithm column set to MD5 for all existing values. Then, we create a function that parses the hashes based on the saved algorithm. This way, admins can change HASH_ALGORITHM multiple times without breaking anything.
I agree on @michael-s-molina 's proposal. For fresh installs this is ofc not a problem, but without a simple migration tool it will be difficult to benefit from these improvements on existing deployments.
@villebro @michael-s-molina I've added fallback logic using a new config named HASH_ALGORITHM_FALLBACKS for lazy migration. We can't migrate existing keys in the key-value store, but we can guarantee that all new keys will use the newly configured algorithm and existing keys (for permalinks) are still discoverable
existing keys (for permalinks) are still discoverable
What about application configs? It would be important to raise exactly what type of information can't be re-generated to make sure fallbacks are applied to all. Also, wouldn't be better to save the hash type in the database (as a separate column or as a prefix for the hash like md5$<hashvalue> instead of a try/error approach with fallbacks?
Regarding application configs: Yes, they're covered. The `get_shared_value()` function in `superset/key_value/shared_entries.py` implements the same fallback mechanism for app configs (like the permalink salt). When an entry is found via fallback, it also auto-migrates the entry to the current algorithm's UUID.
Regarding storing the hash type (column or prefix): I considered this approach, but here's why I chose the fallback pattern instead:

1. The hash isn't stored directly. The hash algorithm generates a UUID namespace, and that UUID is what's stored. To track the algorithm, we'd need a new column on `key_value` entries, requiring a DB migration + backfill.
2. Fallback scope is very limited. It's only used in two places:
   - `CreateDashboardPermalinkCommand`: to check for duplicates before creating (prevents creating a new permalink if one exists with the old algorithm)
   - `get_shared_value()`: for app configs like the permalink salt

   Notably, GET permalink by URL doesn't need fallback: the URL contains an encoded integer ID, so lookups are direct regardless of which algorithm created the entry.
3. Trivial algorithm set. With only 2-3 algorithms (md5, sha256, maybe sha512 someday), the try/error cost is one extra UUID generation + DB lookup in the worst case, i.e. microseconds.
4. Transitional by nature. Fallback is only for legacy entries. New entries use the current algorithm, so the fallback list naturally becomes less relevant over time.
Given the limited scope (duplicate detection + app configs only) and the migration complexity a column approach would require, the fallback pattern seemed like the right trade-off. But happy to discuss further!
Given the limited scope (duplicate detection + app configs only) and the migration complexity a column approach would require, the fallback pattern seemed like the right trade-off.
Thank you for the additional context @dpgaspar. Your points seem reasonable.
Brilliant work @dpgaspar , I like the fallback approach to resolve this stalemate 👍 One last consideration, but I'm fine either way.
Wow, thank you for the reviews as well! I've added a line to UPDATING.md since the default is now sha256