Nicolas Williams
Using this script, we can identify the number of matches for each `data.links.self` in the sample of 2024-04-30: [sample.2024-04-30.matches.log](https://github.com/openownership/register/files/15435044/sample.2024-04-30.matches.log) This results in 9130 lines, which is what we expect from...
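The counting step can be sketched as below. This is not the script referred to above, only a minimal illustration of one way to get a match count per key. The helper name `count_matches` and both file arguments are hypothetical, and it assumes compact JSON in which each `data.links.self` value appears as a quoted string.

```sh
# Hypothetical helper: print "<count> <key>" for every key in a key file.
# $1: file with one data.links.self value per line (assumed layout)
# $2: file with one compact JSON record per line (assumed layout)
count_matches() {
  keys_file=$1
  records_file=$2
  while IFS= read -r key; do
    # Skip blank lines in the key file.
    [ -n "$key" ] || continue
    # -F treats the key as a fixed string, -c counts matching lines.
    # The surrounding quotes stop one key matching as a prefix of another.
    # grep exits non-zero on zero matches, hence the "|| true".
    n=$(grep -cF -- "\"$key\"" "$records_file" || true)
    printf '%s %s\n' "$n" "$key"
  done < "$keys_file"
}
```

Called as `count_matches keys.txt records.jsonl` (both names are placeholders), it prints one line per key, so the output has as many lines as the sample has keys.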
Looking into `sample.2024-04-30.matches.1.log`, an issue with the process becomes clear: some of the matches are from prior to the bulk Ingester PSC. Rather than searching the whole of 2024, it...
What we would like is 2 files: one containing only our sample from 2024-04-30, and one containing its match from 2024-05. Data for 2024-04-30 can be found by modifying the previous...
We'd like to compare the 2 files line-by-line. But there are a lot of differences. Using Vimdiff, we can get a general sense of the types of differences which occur:...
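To cut the noise before opening Vimdiff, one option is to keep only the line pairs that actually differ. The sketch below assumes both files have their records in the same order, one per line. The helper name `keep_differences` and its four arguments are hypothetical.

```sh
# Hypothetical helper: walk two aligned files in lockstep and keep only the
# line pairs that differ.
# $1, $2: input files (same record order assumed)
# $3, $4: output files for the differing lines of $1 and $2 respectively
keep_differences() {
  in_a=$1
  in_b=$2
  out_a=$3
  out_b=$4
  # Start with empty output files.
  : > "$out_a"
  : > "$out_b"
  # Read one line from each input per iteration via file descriptors 3 and 4.
  while IFS= read -r line_a <&3 && IFS= read -r line_b <&4; do
    if [ "$line_a" != "$line_b" ]; then
      printf '%s\n' "$line_a" >> "$out_a"
      printf '%s\n' "$line_b" >> "$out_b"
    fi
  done 3< "$in_a" 4< "$in_b"
}
```

The loop stops at the end of the shorter file, so a mismatch in line counts should be checked separately (for example with `wc -l`).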
Using Vimdiff:  This is becoming easier to compare. We can spot that in some cases, `company_number` is `null` in...
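Fields that are expected to differ can be removed from both files before comparing them again. A minimal sketch with `jq` follows. The helper name `strip_field` is hypothetical, and the paths `.data.etag` and `.company_number` are assumptions about where those fields sit in each record.

```sh
# Hypothetical helper: delete one field from every record on stdin.
# $1: a jq path expression, e.g. '.data.etag' (assumed path)
# Reads one JSON record per line and writes compact JSON (-c), one per line.
strip_field() {
  jq -c "del($1)"
}
```

For example, `strip_field '.data.etag' < in.log > out.log` (placeholder file names) drops the etag from every record and leaves the rest of each record unchanged.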
At this point, it's likely easier to compare fields in expanded, rather than compact, JSON form:

```sh
jq < sample-880.2024-04-30.no-etag.differences.no-cn.differences.log > sample-402.2024-04-30.log
jq < sample-880.2024-05.no-etag.differences.no-cn.differences.log > sample-402.2024-05.log
diff sample-402.2024-{04-30,05}.log > sample-402.log
```

...
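To get a rough overview of a diff like this, the changed lines can be tallied by JSON key. The sketch below assumes `diff`'s default ("normal") output format, where added lines start with `>` and removed lines with `<`. The helper name `tally_diff_fields` is hypothetical.

```sh
# Hypothetical helper: count added (>) and removed (<) lines per JSON key in
# normal-format diff output.
# $1: file containing the diff output
tally_diff_fields() {
  # Keep only lines of the form '> "key": ...' or '< "key": ...', reduce them
  # to '<direction> <key>', then count identical pairs, most frequent first.
  # Lines without a key (e.g. bare array elements) are not counted.
  sed -n 's/^\([<>]\) *"\([^"]*\)":.*/\1 \2/p' "$1" | sort | uniq -c | sort -rn
}
```

Changes to array elements carry no key on the changed line, so they need to be inspected separately.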
The remaining differences can be broadly grouped into categories. Those are:

### ceased on additions

```diff
9a10
> "ceased_on": "2024-05-01",
```

```diff
4589a4512
> "ceased_on": "2023-10-13",
```

### natures of...
### Questions

There's some complexity here: in the case of amendments, it's not clear whether the change came from an update via the stream, or...
### 1. Is ceased_on present within the stream at any point? Or is it only present in bulk data?

```sh
ag -c '"ceased_on":"2024-05-25"'
```

```
month=05/day=25/psc-prod-2-2024-05-25-05-57-03-ae8f6ea6-4225-49f2-b682-016f3b255e8e:1
month=05/day=25/psc-prod-2-2024-05-25-06-25-03-3ca73436-b28e-4ba7-9fca-57b12d1c6f41:1
month=05/day=25/psc-prod-2-2024-05-25-09-03-02-a2348fb1-ea96-4085-b633-f0fd72ceaf19:1
month=05/day=25/psc-prod-2-2024-05-25-08-57-03-cf5f3b96-53b2-433c-8b84-5df6ac0596df:1
month=05/day=25/psc-prod-2-2024-05-25-09-13-03-958ee50e-0358-4fd3-b9c3-c777f1659848:2
...
```
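Per-file counts in the `file:count` form printed by `ag -c` can be summed to get a single total. A minimal sketch, with a hypothetical helper name `sum_counts`:

```sh
# Hypothetical helper: sum the trailing ":<count>" of each "file:count" line
# read from stdin and print the total.
sum_counts() {
  # -F: splits on colons; $NF is the last field, i.e. the count.
  # "total + 0" prints 0 instead of an empty line when there is no input.
  awk -F: '{ total += $NF } END { print total + 0 }'
}
```

It can be fed directly from the search, for example `ag -c '"ceased_on":"2024-05-25"' | sum_counts`.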
### 2. For the records found to differ only by etag, were those matches definitely in the bulk data, rather than the stream?

```sh
comm -12