
`ValidationUDF`: Validate that all data of a dataset is seen exactly once

Open • uellue opened this issue 2 years ago • 1 comment

Currently, the `ValidationUDF` only checks whether the data it receives matches the reference. In the future, it could also keep track of which parts of the data it has seen, to make sure that none are skipped or processed twice.

uellue, Jul 18 '22 11:07

Sadly, this is not as easy as I first thought: adding a check in `get_results` doesn't work in a straightforward way, because we don't know whether we are holding the final result or only a partial one. We can't easily check a "should eventually be equal to" constraint that way, so something like this fails:

diff --git a/tests/utils.py b/tests/utils.py
index 9dd04dd2..848b141f 100644
--- a/tests/utils.py
+++ b/tests/utils.py
@@ -167,15 +167,23 @@ class ValidationUDF(UDF):
 
     def get_result_buffers(self):
         return {
-            # Just a buffer to "feel" the av shape
-            'nav_shape': self.buffer(kind="nav", dtype="float32"),
+            'seen': self.buffer(kind="nav", dtype=np.int64),
         }
 
     def process_tile(self, tile):
+        self.results.seen[:] += 1
         assert self.params.validation_function(
             self.meta.slice.get(self.params.reference), tile
         )
 
+    def get_results(self):
+        if self.meta.roi is None:
+            expected = self.meta.dataset_shape.size
+        else:
+            expected = np.sum(self.meta.roi)
+        assert np.sum(self.results.seen) == expected
+        return {}
+

Might need checks at the call site of the UDF, or maybe a small addition to the UDF API?
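
To illustrate the call-site idea, here is a minimal sketch in plain Python. It is not LiberTEM code: `count_frames`, `coverage_ok`, and the two-partition layout are hypothetical stand-ins for a per-frame `seen` counter that is filled partition by partition. It only shows why a coverage check can fail on a partial result and pass once the caller knows everything has been merged.

```python
# Plain-Python stand-in for a per-frame "seen" counter; not the LiberTEM API.
# All names here are hypothetical and exist only for illustration.

def count_frames(seen, frame_indices):
    """Record that each frame index of one partition was processed."""
    for idx in frame_indices:
        seen[idx] += 1


def coverage_ok(seen, roi=None):
    """Return True if every selected frame was seen exactly once
    and every unselected frame was never seen."""
    if roi is None:
        roi = [True] * len(seen)
    return all(
        count == (1 if selected else 0)
        for count, selected in zip(seen, roi)
    )


num_frames = 8
partitions = [range(0, 4), range(4, 8)]  # assumed partitioning

seen = [0] * num_frames
partial_checks = []
for part in partitions:
    count_frames(seen, part)
    # Checking at this point resembles asserting on a result that
    # may still be partial, which is the problem described above.
    partial_checks.append(coverage_ok(seen))

# Only the caller knows that all partitions have been merged.
final_check = coverage_ok(seen)
```

The check compares each frame's count instead of the sum of all counts, because a skipped frame and a frame processed twice would cancel out in a sum.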

sk1p, Jul 18 '22 12:07