mimir
Extend ANALYZE to detect potential data errors
As of right now, ANALYZE only detects sources of uncertainty injected by Mimir. It would be helpful if Mimir had some facility to do syntactic analysis on the dataset being shown to identify potential data errors (e.g., errors that might be fixed by adding a new lens). Consider the following cases:
- [ ] String columns in the dataset that actually contain stronger types. Solution: Create a Type Inference Lens.
- [ ] Sequence columns with regularly spaced rows, but which have holes (e.g., 2016-02-01, 2016-02-02, 2016-02-03, 2016-02-05). Solution: Create a MissingKeyLens.
- [ ] Columns where the majority of records fit a user type, but some don't. Solution: Create a Type Inference Lens.
- [ ] Potential functional dependencies that have violations
- [ ] Duplicated rows
- [ ] Histogram violations: The majority of records fall within a pair of bounds (lower/upper, set of values), while others don't.
- [ ] Coherence violations: Identify columns where fields are inconsistent/incoherent and detect cases where one or more columns could be shifted over to improve coherence.
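
To make a few of these checks concrete, here is a rough Python sketch of four of them (type outliers, sequence holes, duplicated rows, functional dependency violations). Everything in it is hypothetical: the function names, the 50% majority threshold, and the "most common gap is the intended step" heuristic are assumptions for illustration, not part of Mimir or of any existing lens.

```python
from collections import Counter, defaultdict
from datetime import date


def infer_type(value):
    """Guess the strongest type a string parses as (hypothetical helper)."""
    for name, parse in (("int", int), ("float", float), ("date", date.fromisoformat)):
        try:
            parse(value)
            return name
        except ValueError:
            continue
    return "string"


def type_outliers(column, threshold=0.5):
    """Return (majority_type, indexes of rows that do not fit it).

    Returns (None, []) when no non-string type covers more than
    `threshold` of the rows. The threshold is an assumed default.
    """
    if not column:
        return None, []
    types = [infer_type(v) for v in column]
    majority, count = Counter(types).most_common(1)[0]
    if majority == "string" or count / len(types) <= threshold:
        return None, []
    return majority, [i for i, t in enumerate(types) if t != majority]


def missing_keys(dates):
    """Find holes in a column of ISO dates.

    Assumes the most common gap between consecutive keys is the intended step.
    """
    keys = sorted(date.fromisoformat(d) for d in set(dates))
    if len(keys) < 3:
        return []
    gaps = [b - a for a, b in zip(keys, keys[1:])]
    step = Counter(gaps).most_common(1)[0][0]
    holes = []
    for a, b in zip(keys, keys[1:]):
        cur = a + step
        while cur < b:
            holes.append(cur.isoformat())
            cur += step
    return holes


def duplicate_rows(rows):
    """Return each row that appears more than once, with its count."""
    return {row: n for row, n in Counter(map(tuple, rows)).items() if n > 1}


def fd_violations(rows, lhs, rhs):
    """Return lhs values that map to more than one rhs value (lhs -> rhs violated)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {k: v for k, v in seen.items() if len(v) > 1}
```

On the date example above, `missing_keys(["2016-02-01", "2016-02-02", "2016-02-03", "2016-02-05"])` reports `2016-02-04` as the hole. Each check is a full pass over a column or table, which is where the cost concern below comes from.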
Challenges that come to mind:
- Computing all of these stats is going to get very computationally intensive.
  - Idea 1: Compute them in the background --- This might be hard to fit into the Command-Line UI, since there's no way to get output to the user asynchronously. It could still work for something like Vizier.
  - Idea 2: Compute stats when a table is first loaded or a view is first created and propagate them with annotations (requires annotations, and isn't going to get 100% coverage... but still a potentially reasonable strategy). This computation should still be performed in the background.
  - Idea 3: Make this functionality part of the "I don't trust it" lens.
- Tracking feedback is going to require having a place to come back to --- If the user acknowledges that a particular sequence with holes is acceptable, where does that feedback get registered?
  - Another case where Idea 2/3 helps... since this gives a concrete point where we can register feedback.
  - On the other hand, if the user has us "fix" the issue by adding a lens... where does the lens fix get applied?
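
One possible shape for Idea 2 plus feedback tracking, again as a hedged Python sketch: checks run on a background thread when a table is loaded, their results are stored as per-table annotations, and a user acknowledgement is recorded on the annotation itself so it survives recomputation. `Finding`, `TableAnnotations`, and `analyze_in_background` are made-up names for illustration; none of this reflects how Mimir stores annotations today.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Finding:
    check: str                  # e.g. "holes", "duplicates" (hypothetical check names)
    detail: str                 # human-readable description of what was found
    acknowledged: bool = False  # set once the user says "this is fine"


@dataclass
class TableAnnotations:
    # Findings keyed by (column, check), attached to one table or view.
    findings: dict = field(default_factory=dict)

    def record(self, column, finding):
        # Keep earlier user feedback if the same finding is recomputed.
        old = self.findings.get((column, finding.check))
        if old is not None:
            finding.acknowledged = old.acknowledged
        self.findings[(column, finding.check)] = finding

    def acknowledge(self, column, check):
        self.findings[(column, check)].acknowledged = True

    def open_findings(self):
        return [(key, f) for key, f in self.findings.items() if not f.acknowledged]


def analyze_in_background(executor, annotations, column, values, check_name, check):
    """Run one check off the main thread and attach its result when it finishes."""
    def job():
        detail = check(values)
        if detail:
            annotations.record(column, Finding(check_name, str(detail)))
    return executor.submit(job)


if __name__ == "__main__":
    annotations = TableAnnotations()
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The lambda stands in for a real check such as a hole detector.
        analyze_in_background(pool, annotations, "day", [1, 2, 4], "holes",
                              lambda values: [3]).result()
    print(annotations.open_findings())
    annotations.acknowledge("day", "holes")
    print(annotations.open_findings())
```

This only covers where acknowledgements could live. It does not answer where a lens added as a fix would be applied.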
Generalization of #128. Maybe we implement this as a simple command rather than as a special lens.