streamalert
[bug] Changing type(s) in a log schema will break historical search against data using old schema
Background
If you have historical search enabled and `file_format` is set to `parquet`, bad news: changing the type(s) in a log schema breaks historical search. We get a `HIVE_PARTITION_SCHEMA_MISMATCH` error when we try to search historical data across all partitions in the table whose schema we changed.
For example, if we change the following `timestamp` to `string`, the partitions of the `carbonblack_alert_watchlist_hit_feedsearch_bin` table will be broken.
https://github.com/airbnb/streamalert/blob/19458d7547a6098d5d8ae7e21c6f88bc525e9726/conf/schemas/carbonblack.json#L51
If we don't change the schema ever, happy life! Unfortunately, this is not the reality 😢
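To make the failure mode concrete, here is a minimal sketch that compares two versions of a log schema and reports the fields whose type changed. Any field it reports is a candidate for the mismatch described above, since old partitions were written with the old type. The helper name `changed_types` and the sample field names and types are hypothetical, for illustration only; they are not taken from the StreamAlert code base.

```python
def changed_types(old_schema, new_schema, prefix=""):
    """Return {field_path: (old_type, new_type)} for fields whose type differs.

    Schemas are plain dicts mapping field name -> type name, with nested
    dicts for nested fields (an assumption made for this sketch).
    """
    changes = {}
    for field, old_type in old_schema.items():
        if field not in new_schema:
            continue  # removed fields are out of scope for this sketch
        new_type = new_schema[field]
        path = f"{prefix}{field}"
        if isinstance(old_type, dict) and isinstance(new_type, dict):
            # Recurse into nested schemas and record dotted paths
            changes.update(changed_types(old_type, new_type, prefix=f"{path}."))
        elif old_type != new_type:
            changes[path] = (old_type, new_type)
    return changes


# Hypothetical before/after versions of a schema
old = {"timestamp": "float", "hostname": "string", "docs": {"pid": "integer"}}
new = {"timestamp": "string", "hostname": "string", "docs": {"pid": "integer"}}

print(changed_types(old, new))  # {'timestamp': ('float', 'string')}
```

A check like this could run before a schema change is deployed, so we know which tables need attention.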
Desired Change
A couple of things we can improve:
- Standardize everything on `string`. String has a larger memory footprint, but is the most permissive to future changes.
- Have a script that can fix this quickly. The script should drop the target table(s), rebuild them using the new schemas, and recreate the partitions. It may also need to fix the underlying data (which might be hard).
- Or other solutions we haven't thought about.
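The first two ideas can be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed implementation: `stringify_schema` and `rebuild_statements` are hypothetical helper names, the schema is assumed to be a dict of field name to type name, and the statements assume a Hive-compatible query engine where `MSCK REPAIR TABLE` can rediscover partitions. The `CREATE` statement is passed in by the caller because generating it from a schema is out of scope here, and fixing the underlying data is not covered at all.

```python
def stringify_schema(schema):
    """Return a copy of the schema with every scalar type set to "string"."""
    out = {}
    for field, field_type in schema.items():
        if isinstance(field_type, dict):
            # Nested schema: convert its leaves recursively
            out[field] = stringify_schema(field_type)
        elif isinstance(field_type, str):
            out[field] = "string"
        else:
            # Anything else (e.g. a list type) is left untouched in this sketch
            out[field] = field_type
    return out


def rebuild_statements(table, create_ddl):
    """Return the statements, in order, to drop and rebuild one table.

    create_ddl is the CREATE statement for the new schema, supplied by the
    caller (hypothetical; not generated here).
    """
    return [
        f"DROP TABLE IF EXISTS {table}",
        create_ddl,
        f"MSCK REPAIR TABLE {table}",  # assumes partitions can be rediscovered
    ]


schema = {"timestamp": "float", "count": "integer", "docs": {"pid": "integer"}}
print(stringify_schema(schema))
```

Note that converting the schema only helps new data. Existing parquet files still carry the old types, which is why the script may also need to rewrite the underlying data.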