File hashes are not consistent between Reach runs
Just realized that some documents we scrape have a different file hash even though they represent the same document.
If our intention is not to redownload the same document, this is one problem, but even if we do not care about this, during analysis there needs to be a field which is unique per document in order for the conclusions to be accurate. The solution could be to document that document_uri is the unique key that people should use if they want to analyse the document, or it could be to alter our implementation of the file hash accordingly, or it could be something else.
Also, there are scraped documents with the same file hash. Checking in the RDS data just now - there are 205,234 scraped policy docs, but only 143,142 unique file hashes.
An example of this is the policy document http://www.fao.org/3/I9553EN/i9553en.pdf, which has the document ID 818aff942d9813d338fe31828ee9452a and also 14748f5b61ec161bc226354edbeee7f1.
I think this issue has reared its head again.
The gold annotated data that I am using to evaluate Reach has 81 annotated documents from a scrape done by Reach on 2019-10-08. Not all of these document ids exist in more recent scrapes, however:
Date of evaluation | Gold documents found
---|---
Dec/Jan | 10/81
2020-01-13 | 7/81
2020-01-16 | 6/81
2020-01-28 | 4/81
The problem seems to be getting worse with time. Is there any reason why taking an md5 hash of the entire document could yield different results on subsequent scrapes? @SamDepardieu @jdu https://github.com/wellcometrust/reach/blob/d66379b2b40ea6d01e53c4667f98db2cf2ca2896/reach/scraper/wsf_scraping/file_system.py#L11
It's possible it could change, but I'm not sure the cases where that would happen apply here. For instance, if there was a process on their end which opened the file and saved it, even without changing the content of the PDF, there's a trailer in the file which contains a CreationDate/ModDate value; if those changed it might completely change the MD5 hash output.
Are we sure that the hashes changed, and not that the documents just aren't there at all?
If you have the original file from your gold data you could calc a hash of it, pull the same file from S3 and calc the hash on it and if they differ you could compare the file in a diff viewer to see if there are any differences.
> It's possible it could change, but I'm not sure the cases where that would happen apply here. For instance, if there was a process on their end which opened the file and saved it, even without changing the content of the PDF, there's a trailer in the file which contains a CreationDate/ModDate value; if those changed it might completely change the MD5 hash output.
>
> Are we sure that the hashes changed, and not that the documents just aren't there at all?
Nope, not 100% sure that they are not there at all - that is the other possibility.
Also, if you opened the file in Preview and then calculated the MD5 hash in order to set up the gold data set, your MD5 hash calculated locally might differ because some data in the PDF might have changed from the open/close. I'll do a quick test in a moment.
Just to clarify, I never calculate md5 hashes locally; they are always taken from Reach. So the question is whether in previous runs of the scraper the same file is being given a different hash over subsequent scrapes... or whether those files simply aren't being found anymore.
Already tested (was curious to see for myself in any case) and at least with Preview in macOS it doesn't change the file hash, even if I open and save the file. I think that's because it's only hashing the first 65536 bytes of the file, and the trailer is generally at the end of the file.
If there's nothing altering the files then the md5 should be deterministic and shouldn't change at all between scrapes unless the file has changed on their end.
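For anyone following along, a minimal sketch of that partial-hash behaviour (assuming the scraper hashes only the first 65536 bytes, as the linked `file_system.py` suggests; the function name here is hypothetical):

```python
import hashlib

BLOCK_SIZE = 65536  # only the first 64 KiB of the file contributes to the hash


def partial_md5(path, block_size=BLOCK_SIZE):
    """Hash only the first block of the file.

    A trailer edit (CreationDate/ModDate) at the very end of a large PDF
    would not change this hash, but any byte change within the first
    block would change it completely.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        md5.update(f.read(block_size))
    return md5.hexdigest()
```

This is deterministic for unchanged files, which is why a hash that drifts between scrapes points at either the source file changing or the file going missing.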
Yes... that is what I thought...
Going to dig into this a bit further, I think I have some historical files knocking around. Think I'm also going to add a datestamp to the files output by the new evaluator tasks.
The fact you're getting some as well is weird. Because if it was something in the pipeline that was changing the hash I would think it would be all or nothing. So either they aren't being scraped / are missing from the data (a very real possibility) or they've been changed at the source. If even a single byte changes in the first block of bytes in the PDF it can drastically change the hash.
Another option here is to use the DocumentID embedded in the PDF if it's available, and then fall back to an MD5 hash if the DocumentID doesn't exist.
That DocumentID shouldn't change.
$ pdfinfo -meta stored_v.pdf
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c015 91.163280, 2018/06/22-11:31:03 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:stRef="http://ns.adobe.com/xap/1.0/sType/ResourceRef#"
xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<xmp:CreateDate>2018-09-04T16:57:41+02:00</xmp:CreateDate>
<xmp:MetadataDate>2018-11-23T16:25:13+01:00</xmp:MetadataDate>
<xmp:ModifyDate>2018-11-23T16:25:13+01:00</xmp:ModifyDate>
<xmp:CreatorTool>Adobe InDesign CC 13.1 (Macintosh)</xmp:CreatorTool>
<xmpMM:InstanceID>uuid:96ee9e09-c03e-be42-a57b-49ffadd7d79c</xmpMM:InstanceID>
<xmpMM:OriginalDocumentID>xmp.did:F77F117407206811822A97A08D940DAA</xmpMM:OriginalDocumentID>
<!-- HERE -->
<xmpMM:DocumentID>xmp.id:6b8f1861-33fa-4840-8122-e964de1452e0</xmpMM:DocumentID>
<!-- HERE -->
<xmpMM:RenditionClass>proof:pdf</xmpMM:RenditionClass>
<xmpMM:DerivedFrom rdf:parseType="Resource">
<stRef:instanceID>xmp.iid:80a4aae9-2c43-41e9-a87b-ecf759b5ca1c</stRef:instanceID>
<stRef:documentID>xmp.did:845fb7c5-0600-4ff5-9fa1-d370d05b7ed5</stRef:documentID>
<stRef:originalDocumentID>xmp.did:F77F117407206811822A97A08D940DAA</stRef:originalDocumentID>
<stRef:renditionClass>default</stRef:renditionClass>
</xmpMM:DerivedFrom>
<xmpMM:History>
<rdf:Seq>
<rdf:li rdf:parseType="Resource">
<stEvt:action>converted</stEvt:action>
<stEvt:parameters>from application/x-indesign to application/pdf</stEvt:parameters>
<stEvt:softwareAgent>Adobe InDesign CC 13.1 (Macintosh)</stEvt:softwareAgent>
<stEvt:changed>/</stEvt:changed>
<stEvt:when>2018-09-04T16:57:41+02:00</stEvt:when>
</rdf:li>
</rdf:Seq>
</xmpMM:History>
<dc:format>application/pdf</dc:format>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">The State of Food Security and Nutrition in the World 2018</rdf:li>
</rdf:Alt>
</dc:title>
<dc:creator>
<rdf:Bag/>
</dc:creator>
<pdf:Producer>Adobe PDF Library 15.0</pdf:Producer>
<pdf:Trapped>False</pdf:Trapped>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
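A hedged sketch of pulling that id out of a PDF's embedded XMP packet (`extract_xmp_document_id` is a hypothetical helper; a real implementation would use a proper PDF/XMP parser rather than a regex, and would also handle the attribute form some producers emit):

```python
import re


def extract_xmp_document_id(pdf_bytes):
    """Return the xmpMM:DocumentID from an embedded XMP packet, or None.

    Regex-based sketch only: scans the raw bytes for the element form
    <xmpMM:DocumentID>...</xmpMM:DocumentID> shown in the pdfinfo dump
    above. Returns None when no DocumentID is present, in which case the
    caller would fall back to an MD5 hash.
    """
    match = re.search(rb"<xmpMM:DocumentID>([^<]+)</xmpMM:DocumentID>", pdf_bytes)
    if match:
        return match.group(1).decode("utf-8", errors="replace")
    return None
```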
Ahh good to know that's an option. I'm just digging into the most recent scrape from staging to do a comparison of document_ids from a couple of months ago. I just want to get a sense of how bad the problem is first...
Would be good to know how many pdfs have this embedded document id, and how unique it is. If we are still falling back on a document hash for the remainder, then we will still need to fix any underlying issue.
This problem is a bit annoying. I don't understand why data would be missing from subsequent scrapes; that is a bit problematic. Whatever we decide, it would be good to have some guarantee from engineering that these ids will be there and they will not change.
So... I've looked back at a scrape from 2019-10-09 and compared it to the scrape that was done on staging overnight (2020-01-29). Here are the scores on the doors:
Metric | 20191009 | 20200129
---|---|---
Unique doc ids | 7033 | 157223
Unique to this scrape | 2969 | 153159
This gives an overlap of just 4064 document ids, which means 42% of document ids captured in that first scrape (2969 of 7033) no longer appear in the latest one.
Note that it is unclear whether the 2019-10-09 scrape was a complete one - it just happened to be the data that was on staging when I pulled it, but I don't think it matters very much.
I'll have a look at the scraped pdfs and see if the DocumentID present in pdfs is a reliable alternative, but I think it is going to have to be very reliable to make it worthwhile; otherwise, falling back onto an md5 hash of the document for even a small percentage of the scrape could be problematic given the above results.
I'll have a look at the DAG tomorrow and see what I can see in there. I suspect there might be a lot of duplicates, but it's also possible we might not have been getting all the docs before; might be worth checking how many of the docs in the last run are duplicates.
> This problem is a bit annoying. I don't understand why data would be missing from subsequent scrapes; that is a bit problematic. Whatever we decide, it would be good to have some guarantee from engineering that these ids will be there and they will not change.
@nsorros We wiped data after all the key changes and scraper changes, as the indexes needed to be recreated (and some data was orphaned) and the data repopulated using the new schemas. So the first run we did recently was a fresh run from mostly a blank slate. It's never going to be 100% perfect getting data from the targets, as there's too much there that's out of our control from an engineering perspective; what we have to aim for is eventual consistency, where the data's completeness increases steadily over subsequent runs.
I should have also said that I extracted these document ids from parsed-pdf json, not from ES.
Some analytics on two recent Reach runs on staging, dated 2020-01-29 and 2020-01-31, from the parsed-pdfs jsons. The headline figure is that 57989 unique document ids present in the earlier Reach run do not appear in the later run. In the later run 18872 document ids were new, which is about 16%. That's actually not as bad as I had expected given the results from the Evaluator task, but still a significant amount.
Total doc ids:
$ cat 20200129_combined.jsonl | jq '.file_hash' | sort | uniq | wc -l
157223
$ cat 20200131_combined.jsonl | jq '.file_hash' | sort | uniq | wc -l
118106
Total overlap between the Reach runs:
$ comm -1 -2 \
<(cat 20200131_combined.jsonl | jq '.file_hash' | sort | uniq) \
<(cat 20200129_combined.jsonl | jq '.file_hash' | sort | uniq) | wc -l
99234
Files unique to 20200129 (`comm -1 -3` keeps lines that appear only in the second file):
$ comm -1 -3 \
<(cat 20200131_combined.jsonl | jq '.file_hash' | sort | uniq) \
<(cat 20200129_combined.jsonl | jq '.file_hash' | sort | uniq) | wc -l
57989
Files unique to 20200131 (`comm -2 -3` keeps lines that appear only in the first file):
$ comm -2 -3 \
<(cat 20200131_combined.jsonl | jq '.file_hash' | sort | uniq) \
<(cat 20200129_combined.jsonl | jq '.file_hash' | sort | uniq) | wc -l
18872
Looking at the new pdf_metadata elements in the 2020-01-31 run, the following elements appear:
$ cat 20200131_combined.jsonl | jq '.pdf_metadata | [paths | join(".")][]' | sort | uniq -c
75964 "author"
108479 "creator"
82309 "title"
but the total number of unique titles is:
$ cat 20200131_combined.jsonl | jq '.pdf_metadata.title' | sort | uniq | wc -l
26090
which is about 22% of the total. These are of extremely varied quality, and would not serve to produce a unique identifier. For example, a good portion of the titles are:
9 "Written evidence - Willis Towers Watson"
3 "Written evidence - Wilson Bio-Chemical"
3 "Written evidence - Wiltshire Council"
3 "Written evidence - Wimborne War on Waste"
3 "Written evidence - Winchester Friends of the Earth"
3 "Written evidence - Windwatch NI"
15 "Written evidence - Wine and Spirit Trade Association"
3 "Written evidence - Wine Institute"
3 "Written evidence - Wine & Spirit Trade Association"
6 "Written evidence - WinVisible (women with visible & invisible disabilities)"
3 "Written evidence - Wipe The Slate Clean"
3 "Written evidence - Wirral Foodbank"
3 "Written evidence - Wirral Older People's Parliament"
3 "Written evidence - Witchford and Area Schools Partnership"
3 "Written evidence - Witness Confident"
3 "Written evidence - W L Consulting Ltd"
3 "Written evidence - Wm Morrison Supermarkets plc"
3 "Written evidence - Woking Borough Council Overview & Scrutiny Committee"
3 "Written evidence - Womankind Worldwide"
3 "Written evidence - Women And Children First (Uk)"
3 "Written evidence - Women for Independence Midlothian"
6 "Written evidence - Women for Refugee Women"
3 "Written evidence - Women For Women International Uk"
3 "Written evidence - Women in Manufacturing and Engineering"
Looking at the source_metadata.did element, there are 27777 unique doc ids:
$ cat 20200131_combined.jsonl | jq '.source_metadata.did' | sort | uniq | wc -l
27777
A good proportion of these are replicated a large number of times:
$ cat 20200131_combined.jsonl | jq '.source_metadata.did' | sort | uniq -c | sort -k 1 -h | tail -n 10
33 "98f2358e-486b-2b47-bbfc-0032f25e4f90"
36 "6B3DAA810D206811822ADF4D98799A13"
36 "E4311E9E0E206811822ADDFBCA027983"
36 "f2f18911-a877-9e4a-88c5-016178588b6e"
36 "FFCB0B8328206811822AA0B70B0C5E7C"
42 "7dd5faff-9622-4bd4-b4e3-7b5b9de50ccb"
51 "923dbd41-10a3-4835-b0d9-aa3259774f18"
66 "9e716710-7d79-e641-b11c-198ce2405117"
90 "a8ded3b0-069b-7b49-9d09-b9de6a2b3f1e"
15883 dids appear just once in the dataset; all the others appear at least twice:
$ cat 20200131_combined.jsonl | jq '.source_metadata.did' | sort | uniq -c | awk '{ if ($1 == 1) print $1 }' | wc -l
15883
Next thing to look at is the relationship between file_hash and did, but I probably need to open python for that!
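That Python step could be as simple as grouping hashes per id. A sketch (the helper name is hypothetical; it assumes the `file_hash` and `source_metadata.did` fields seen in the jq commands above):

```python
import json
from collections import defaultdict


def hashes_per_did(jsonl_path):
    """Map each source_metadata.did to the set of file_hash values it
    appears with, so many-to-many relationships between the two ids
    become visible.
    """
    mapping = defaultdict(set)
    with open(jsonl_path) as f:
        for line in f:
            doc = json.loads(line)
            did = (doc.get("source_metadata") or {}).get("did")
            if did:
                mapping[did].add(doc.get("file_hash"))
    return mapping
```

Any `did` mapped to more than one hash would be a concrete example of the same document getting different file hashes.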
Nice to see the analysis! Just a comment on my idea to use the url as the unique identifier - it would need cleaning and also has uniqueness problems! (Just reminded of this after revisiting my uber-reach comparison work.) E.g. these link to the same doc:
- http://apps.who.int/iris/bitstream/handle/10665/255035/seajphv6n1.pdf?sequence=1&isAllowed=y
- http://apps.who.int/iris/bitstream/10665/255035/1/seajphv6n1.pdf
@lizgzil We should definitely steer clear of using the URL as an identifier for the document. Primarily because the URL that the document lives at could change in future scrapes, which would result in duplication, and a given document could exist at multiple URLs, like your example above, with different query params causing a different hash to be calculated from the URL. In addition to this, a document could be referenced on more than one site: if a policy is published to WHO and gov.co.uk puts up the same document on their site (this might be contrived, but it could be possible), the identifier needs to be able to recognise that those two documents are the same.
As well, I think one of the sites has weird POST URLs for some documents, which don't have a unique URL between a group of documents, as the document to retrieve is part of a post body within the request itself rather than encoded in the URL.
From a functional viewpoint the following would potentially yield a higher number of unique document ids:
Use the DocumentID if available, along with the title of the document, as an id. The title of the document is less likely to change than the location/url, being data within the PDF itself.
With the DocumentID, the only problem I can think of is if the document is generated from scratch again and the current version replaced with the new one. Because it was regenerated from scratch, the DocumentID will have changed in that instance. However, we can mitigate this by storing information about the PDF and source page itself (base source URL, title, source...): if we find a new document at a URL where we previously found a different document, we can evaluate whether it's the same document but with a different hash. I don't think there's going to be a single property we can use that will be unique consistently, but a couple used in conjunction will be, if we have a process to mitigate collisions like the one outlined above.
Based on some other discussions, this might be a good scenario in which to utilise SQLite instead of a JSON store, as we can insert the scrape results into the SQLite database, query the data on multiple properties, and shunt the SQLite file to the next stage in the process, versus being stuck with a key/value approach because of JSON.
Nice ideas @jdu. It should be quite easy to calculate how many pieces of information we need to get a unique ID, I'll have a look at this now while I run a test dag.
Of course, this still doesn't explain why we are not getting a consistent file hash across Reach runs :woman_shrugging:
@jdu thanks for the info, good to know. It's tricky in my comparison work since I need a unique identifier to link the Uber policy documents with the Reach ones. Perhaps it will involve an additional step of scraping the url texts to see if 2 policy documents are the same:
Uber url -> scrape uber url -> hash text from uber scrape <=?=> hash text from reach scrape <- scrape reach url <- Reach url
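The two hash steps in the middle of that chain could be sketched like this (hypothetical helper; it normalises whitespace and case first, on the assumption that trivial extraction differences between the two scrapes shouldn't break the match):

```python
import hashlib
import re


def text_fingerprint(text):
    """Hash a normalised version of extracted text, so two scrapes of
    the same document compare equal despite whitespace or case
    differences in the extraction.
    """
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()
```

Two documents would then be linked when `text_fingerprint(uber_text) == text_fingerprint(reach_text)`.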
Just to add to this (at least for Reach's internal IDs): what we could possibly do is gather the following information about a PDF.
{
"did": "<document_id>",
"title": "<doc_title>",
"ref_page": "<page that held the link to download>",
"url": "<url the file was downloaded from>",
"scraped_date": "<unix timestamp when this doc was scraped>",
"aliases": [
["<a document id that was found later that matched>", "<scraped_date>"]
]
}
So if we move the data store into sqlite, when we come across a PDF, we can do a query:
SELECT * FROM scraped_pdfs WHERE title = <title> AND url = <url>;
Using the details from the PDF currently being evaluated, if that returns a match in the SQLite db, then instead of updating the record we append the file we just found to the aliases property of the existing entry.
That way, instead of completely orphaning the document by removing the original entry, we keep some referential integrity, so that in later stages we can see that a single document actually has multiple potential ids; we can do this with the original file hash as well. The document then has persistent state across scrapes, which tells us how that document's identification has changed over time while keeping the original id intact.
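A rough sketch of that lookup-or-alias flow, assuming an SQLite layout with a separate `pdf_aliases` table rather than a JSON `aliases` column (the table and function names here are hypothetical, not Reach's actual schema):

```python
import sqlite3


def record_document(conn, did, title, url, scraped_date):
    """Insert a scraped PDF, or record its id as an alias of an existing
    entry that matches on title and url. Returns the canonical did.
    """
    row = conn.execute(
        "SELECT did FROM scraped_pdfs WHERE title = ? AND url = ?",
        (title, url),
    ).fetchone()
    if row is None:
        # First time we've seen this title/url pair: create the entry.
        conn.execute(
            "INSERT INTO scraped_pdfs (did, title, url, scraped_date) "
            "VALUES (?, ?, ?, ?)",
            (did, title, url, scraped_date),
        )
        return did
    canonical = row[0]
    if did != canonical:
        # Same document, new id: keep the original entry intact and
        # store the new id as an alias, preserving referential integrity.
        conn.execute(
            "INSERT INTO pdf_aliases (canonical_did, alias_did, scraped_date) "
            "VALUES (?, ?, ?)",
            (canonical, did, scraped_date),
        )
    return canonical
```

A normalised alias table (instead of a JSON array) also makes the reverse lookup (alias -> canonical id) a plain indexed query.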
Nice idea @jdu. Since we can't rely on the file_hash would we assign a UUID to each document internally for use in Reach?
@ivyleavedtoadflax We can generate a UUID4 for it, but I'm more inclined to use the initial DocumentID we get for a given document as its identifier. That way anything after that can continue to use that ID for the document, including your evaluations, and we have the aliases stored as secondary identifiers for the document.