mimir Extend ANALYZE to detect potential data errors

Open okennedy opened this issue 7 years ago • 1 comments

As of right now, ANALYZE only detects sources of uncertainty injected by Mimir. It would be helpful if Mimir could also run a syntactic analysis over the dataset being shown and flag potential data errors (e.g., ones that might be fixed by adding a new lens). Consider the following cases:

[ ] String columns in the dataset actually contain stronger types. Solution: Create a Type Inference Lens
[ ] Sequence columns with regularly spaced rows, but which have holes (e.g., 2016-02-01, 2016-02-02, 2016-02-03, 2016-02-05). Solution: Create a MissingKeyLens
[ ] Columns where the majority of records fit in with a user type, but some don't. Solution: Create a Type Inference Lens
[ ] Potential functional dependencies that have violations
[ ] Duplicated rows
[ ] Histogram violations: The majority of records fall within a pair of bounds (lower/upper, set of values), while others don't.
[ ] Coherence violations: Identify columns where fields are inconsistent/incoherent and detect cases where one or more columns could be shifted over to improve coherence.
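
To make a few of the cases above concrete, here is a rough Python sketch of three of the checks: majority-type violations, holes in a date sequence, and functional dependency violations. None of this is Mimir code. The function names, the 0.8 majority threshold, and the "most common gap is the intended step" heuristic are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import date


def infer_type(value):
    """Classify a raw string as 'int', 'float', 'date' (ISO format), or 'string'."""
    for name, parse in (("int", int), ("float", float), ("date", date.fromisoformat)):
        try:
            parse(value)
            return name
        except ValueError:
            pass
    return "string"


def type_violations(column, threshold=0.8):
    """Return (majority_type, offending_values) when one non-string type covers
    at least `threshold` of the column; otherwise return (None, []).
    The threshold is an arbitrary illustrative default."""
    types = [infer_type(v) for v in column]
    if not types:
        return None, []
    majority, count = Counter(types).most_common(1)[0]
    if majority == "string" or count / len(types) < threshold:
        return None, []
    return majority, [v for v, t in zip(column, types) if t != majority]


def sequence_holes(days):
    """Return the dates missing from a sequence, assuming (heuristically)
    that the most common gap between neighbours is the intended step."""
    ordered = sorted(set(days))
    if len(ordered) < 3:
        return []
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    step = Counter(gaps).most_common(1)[0][0]
    missing = []
    for a, b in zip(ordered, ordered[1:]):
        cursor = a + step
        while cursor < b:
            missing.append(cursor)
            cursor += step
    return missing


def fd_violations(rows, lhs, rhs):
    """Return each lhs value that maps to more than one rhs value,
    i.e. the violations of the candidate dependency lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}
```

For the example sequence in the list above, `sequence_holes` would report 2016-02-04 as the missing key. Each check needs a full pass over the column or table, which is where the cost concern below comes from.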

Challenges that come to mind:

Computing all of these stats is going to get very computationally intensive.
- Idea 1: Compute them in the background --- This might be hard to ram into the Command-Line UI, since there's no way to get output to the user asynchronously. It might still work for something like Vizier.
- Idea 2: Compute stats when a table is first loaded or a view is first created and propagate them with annotations (requires annotations, and isn't going to get 100% coverage... but still a potentially reasonable strategy). This computation should still be performed in the background.
- Idea 3: Make this functionality part of the "I don't trust it" lens.
Tracking feedback is going to require having a place to come back to --- If the user acknowledges that a particular sequence with holes is acceptable, where does that feedback get registered?
- Another case where Idea 2/3 helps... since this gives a concrete point where we can register feedback.
On the other hand, if the user has us "fix" the issue by adding a lens... where does the lens fix get applied?
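
One possible shape for Idea 2 combined with the feedback question, sketched in Python: run the checks in a background thread when a table is first loaded, and keep acknowledgements keyed by table, check, and finding so there is a concrete place to come back to. `AnalysisRegistry`, `on_load`, `acknowledge`, and `report` are hypothetical names, not part of Mimir, and the sketch assumes findings are hashable values such as tuples.

```python
from concurrent.futures import ThreadPoolExecutor


def duplicate_rows(rows):
    """Return each row (as a tuple) that appears more than once, in first-seen order."""
    seen, dups = set(), []
    for row in rows:
        if row in seen and row not in dups:
            dups.append(row)
        seen.add(row)
    return dups


class AnalysisRegistry:
    """Hypothetical store for per-table findings and user acknowledgements."""

    def __init__(self, checks):
        self._checks = checks                       # check name -> function(rows) -> findings
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = {}                          # table name -> Future of results
        self._acknowledged = set()                  # (table, check name, finding)

    def on_load(self, table, rows):
        """Start every check in the background when a table is first loaded."""
        self._pending[table] = self._pool.submit(self._run, rows)

    def _run(self, rows):
        return {name: check(rows) for name, check in self._checks.items()}

    def acknowledge(self, table, check, finding):
        """Record that the user considers this finding acceptable."""
        self._acknowledged.add((table, check, finding))

    def report(self, table):
        """Wait for the checks to finish, then return findings not yet acknowledged."""
        results = self._pending[table].result()
        return {
            name: [f for f in findings if (table, name, f) not in self._acknowledged]
            for name, findings in results.items()
        }
```

A caller would invoke `on_load` at table creation and `report` whenever ANALYZE runs. This sketch does not address where a lens "fix" would be applied, and it keeps everything in memory, so acknowledgements would not survive a restart.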

Jun 29 '17 16:06 okennedy

Generalization of #128. Maybe we implement this as a simple command rather than as a special lens.

Jul 31 '19 18:07 okennedy
