csv-validator icon indicating copy to clipboard operation
csv-validator copied to clipboard

Cross check values in other rows

Open DavidUnderdown opened this issue 12 years ago • 5 comments

Image acquisition metadata file consistency checks say we should be able to check that if image_split is("yes") we should be able to cross-check that the image_split_other_uuid refers to another file_uuid within the same csv file, and that in the row for that other image image_split_other_uuid refers back to file_uuid in the same row we are actually validating

DavidUnderdown avatar Apr 18 '13 10:04 DavidUnderdown

Can you paste an example please?

adamretter avatar Apr 18 '13 13:04 adamretter

So you'd have something like (with some white space added for readability

file_uuid, image_split, image_split_other_uuid 0f2d29e1-36a0-4449-8682-3c57e5c5e965, yes, 6f18b9cf-7d75-4bba-a698-ea08b2e0a422 6f18b9cf-7d75-4bba-a698-ea08b2e0a422, yes, 0f2d29e1-36a0-4449-8682-3c57e5c5e965

DavidUnderdown avatar Apr 18 '13 14:04 DavidUnderdown

At present all constraints are applied on a per-row basis, apart from 'unique' which can operate over multiple rows.

Such a change is unfortunately not trivial. If we wish to stay with a streaming processing approach of the CSV (which I think we do), then you effectively need to be able to establish a rule which is evaluated against each row in the CSV. This will need to cause some data to be cached (just like 'unique') as processing proceeds through the CSV and if the rule is not matched by the end of the CSV then an error needs to be raised.

Perhaps a syntax something like -

image_split_other_uuid: ~(is(~file_uuid) and ~image_split_other_uuid/is($file_uuid))

Basically at the moment we have "$" which indicates a named column in the same (local) row. The idea is to introduce '~' which means a named column in a different (remote) row. There is an additional subtlety, which is that in this case you want both the file_uuid and image_split_other_uuid columns to be in the same remote row, to express this I have used a wrapping prefix "~(" and terminating ')'.

Another more flexible/complex option perhaps would be to say that '~' when used in a single column validaton rule always applies to the same remote row consistently, and if you wanted to say in any remote row then you could use a different symbol like '>'. For example -

image_split_other_uuid: is(~file_uuid) and ~image_split_other_uuid/is($file_uuid)

would check that the remote file_uuid and image_split_other_uuid used in this rule to match this local row, came from the same remote row. Whereas:

image_split_other_uuid: is(>file_uuid) and >image_split_other_uuid/is($file_uuid)

would check that the local image_split_other_uuid value matched in any remote row in the column file_uuid and and remote columns image_split_other_uuid matched in the local row's file_uuid column. Although it would be easier to rewrite the above as:

file_uuid: is(>image_split_other_uuid) image_split_other_uuid: is(>file_uuid)

adamretter avatar May 10 '13 16:05 adamretter

Another use case, making sure that you don't have all rows within a defined subset being of the same "type" - will expand on this point later.

DavidUnderdown avatar Nov 06 '15 11:11 DavidUnderdown

@DavidUnderdown Okay looking forward to hearing more...

adamretter avatar Nov 06 '15 12:11 adamretter