Runtime: Re-ingest local files if they are updated
We have gotten several reports about file sources not being updated when the local file is changed. This is especially problematic for `mapping.csv` files used for access management in security policies.
We need to be careful to only do this for small local files, to avoid accidentally scanning large/remote files (even for computing a hash). Maybe it should be an optional parameter for the `file` connector?
I don't have a clear idea about how we might solve this, but here are some thoughts (probably too hacky):
- We could support a `filestore_to_selfexecutor` for DuckDB and ClickHouse for ingesting from the `file`/`local_file` connector
- If `invalidate_on_change` is set to `true` in the `InputProps`, we could compute a hash of the ingested files and store it in the `ModelResult`
- In the call to `ModelManager.Exists`, we could re-compute the hash and return `false` if it changed
We can also consider the following alternate solutions if the use case is only for `mapping` files kind of situations:
- Create a non-materialised model on top of the file so that every query that reads from the view reads directly from the file. For files with a few hundred rows, the difference is less than 5ms.
- Add a source/model config `invalidate_always` which always returns `false` from `ModelManager.Exists` so that the model is always updated on reconcile. Since these mappings are expected to be small, it should be okay to always refresh them.
Option 1 has the limitation that dependencies will not be refreshed; option 2 has the limitation that dependencies will always be refreshed.
If we want to solve this for all cases, then I think the solution listed above is the only way. Two considerations though:
- For ClickHouse, the local file is actually stored on the ClickHouse server, so we can't compute its hash. Also, since ClickHouse sources are always run as models, we only have the queries and not the actual path of the file.
- I think we also need to store the path of the file in `ModelManager` so that the hash can be recomputed.
> We can also consider the following alternate solutions if the use case is only for `mapping` files kind of situations:
- Unfortunately they're often referenced in security policies and dimension/measure expressions, which get hit a ton (e.g. when listing dashboards, we check the security policy of each dashboard in the project), so I'd be too worried about the number of disk reads and CSV parses here.
- This might be okay, though it should be used cautiously so it doesn't invalidate large downstream models too frequently. However, there's an issue that applies both here and to my suggestion above: if a model file isn't edited, there is no call to `Reconcile` at all. `Reconcile` is only called on controller restart or when the resource spec changes.
> If we want to solve this for all cases, then I think the solution listed above is the only way. Two considerations though:
About 1. here – for ClickHouse, the `file`/`local_file` connector should still load the local path on the runtime and upload/write it into a ClickHouse table, since the goal of the local file connector is small data files that are checked into Git.
Sure, I think this can be handled separately. I will raise an issue.