Nicolas Williams
### Reproduction of broken HTTP handling

To reproduce the broken HTTP handling and debug the stream data, it is possible to apply these patches:

Ingester PSC: ```diff diff --git a/lib/register_ingester_psc/streams/clients/psc_stream.rb...
At first, I thought this was because of incorrect handling of the PSC stream keepalives. PSC Streaming API documentation states:

> Keeping a connection alive
>
> HTTP connections will...

All the fixes have been merged and deployed. So far, so good: no gaps. I intend to check in on it tomorrow, by which time it definitely should have started...
Looks good. Events are still consecutive after running for around 16 hours, with no apparent gaps. Taking a sample of 500 events: [270-cut.txt](https://github.com/user-attachments/files/16028309/270-cut.txt) `11862989 - 11862490 + 1 = 500`,...
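
The consecutiveness check above can be sketched in Ruby as follows. This is only an illustration: the method names `consecutive?` and `missing_event_numbers` are made up here, and the sample range is the one from the calculation above rather than values read from the attached file.

```ruby
# Sketch only: both method names are hypothetical helpers, not part of the ingester.

# A sample has no gaps when its distinct count equals last - first + 1.
def consecutive?(event_numbers)
  return true if event_numbers.empty?

  distinct = event_numbers.uniq
  distinct.size == distinct.max - distinct.min + 1
end

# Lists the numbers that would have to be present for the sample to be gap-free.
def missing_event_numbers(event_numbers)
  return [] if event_numbers.empty?

  seen = event_numbers.uniq.sort
  (seen.first..seen.last).to_a - seen
end

# The range used in the calculation above: 11862989 - 11862490 + 1 = 500.
sample = (11_862_490..11_862_989).to_a
puts sample.size                            # => 500
puts consecutive?(sample)                   # => true
puts missing_event_numbers(sample).inspect  # => []
```

If a gap did appear, `missing_event_numbers` would name the exact events to look for in the stream logs.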
There are a few things to consider here:

1. What were the differences between the `psc.2024-05-03T10:53:43+00:00.jsonl.gz` and `psc.2024-04-08T07:06:44+00:00.jsonl.gz` bulk data exports? The latter was without bulk Ingester PSC being run, but...
### (1), (2)

Trying to analyse these files directly could be problematic, since they're 3.5G compressed files. However, there is no need to consider using Athena or similar here, since...
Here, we can already spot a potential issue: `psc.2024-05-05T10:53:40+00:00.jsonl.gz` resulted in 105932 extra statements which `psc.2024-05-03T10:53:43+00:00.jsonl.gz` didn't have. That in turn had only 24584 statements, but that's less of a...
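
For illustration, a comparison of this kind (statements present in one export but not in another) could be sketched in Ruby as below. This is not necessarily how the figures above were produced. The `statementID` key and both method names are assumptions, and each export is assumed to be gzipped JSON Lines with one statement per line.

```ruby
require "json"
require "set"
require "zlib"

# Collects one identifier per line from a gzipped JSON Lines file. The file is
# read as a stream, so only the identifiers are kept in memory.
# ASSUMPTION: `key` ("statementID") is a guess at the identifier field.
def statement_ids(path, key: "statementID")
  ids = Set.new
  Zlib::GzipReader.open(path) do |gz|
    gz.each_line do |line|
      next if line.strip.empty?

      ids << JSON.parse(line).fetch(key)
    end
  end
  ids
end

# Identifiers that appear in the newer export but not in the older one.
def extra_statements(newer_path, older_path, key: "statementID")
  statement_ids(newer_path, key: key) - statement_ids(older_path, key: key)
end
```

Calling `extra_statements("psc.2024-05-05T10:53:40+00:00.jsonl.gz", "psc.2024-05-03T10:53:43+00:00.jsonl.gz").size` would then give the number of extra statements, provided the identifier key matches the real exports.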
### (3)

In order to examine this, we need to look at a sample of data written to the `oo-register-v2` S3 bucket `raw_data` directory. Since streaming Ingester PSC is running...
Examining a large number of files through a filter, `data.links.self` can be one of these values:

```
corporate-entity-person-with-significant-control
individual-person-with-significant-control
legal-person-person-with-significant-control
persons-with-significant-control-statement
super-secure-person-with-significant-control
```

For each of these possible values, examining...
In order to examine the number of matches for each `data.links.self`, we write a small script. This accepts a source directory to analyse and a destination directory to perform the...
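
As a rough illustration of the counting step only, a tally per field value could look like the Ruby sketch below. It is not the script referred to above: `tally_by` is a hypothetical name, the destination directory step is left out, and the sample is assumed to be plain `.jsonl` files with one JSON object per line.

```ruby
require "json"

# Counts records per value found at `key_path` across every *.jsonl file under
# `source_dir`. Records without a value at that path are counted as "(missing)".
# ASSUMPTION: the file extension and line-per-record layout are guesses.
def tally_by(source_dir, key_path)
  counts = Hash.new(0)
  Dir.glob(File.join(source_dir, "**", "*.jsonl")).sort.each do |path|
    File.foreach(path) do |line|
      next if line.strip.empty?

      value = JSON.parse(line).dig(*key_path)
      counts[value.nil? ? "(missing)" : value] += 1
    end
  end
  counts
end
```

For the field discussed here, the call would be `tally_by(source_dir, %w[data links self])`, which returns a hash mapping each observed value to its number of matches.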