
Pipeline breaking when PDF file is replaced by an HTML page

Open LVerneyEC opened this issue 1 month ago • 0 comments

Hi,

We hit this exception this morning during the daily run on our set of declarations: https://github.com/OpenTermsArchive/engine/blob/041ca35bfc4aec9b64e088d14b822d2e18257ef0/src/archivist/recorder/repositories/git/dataMapper.js#L56-L58

It seems this error is uncaught and crashes the whole pipeline with no recovery options. I get the following log:

2025-11-28T06:05:18+00:00 error Zalando — Data Catalogue for Vetted Researchers Error: Only one file should have been recorded in 693a560f39b6de4006a6219c3e97c8778dbe6bbb, but all these files were recorded: Zalando/Data Catalogue for Vetted Researchers.html, Zalando/Data Catalogue for Vetted Researchers.pdf

And then a traceback:

 at Module.toDomain (file:///home/pptruser/open-terms-archive/engine/src/archivist/recorder/repositories/git/dataMapper.js:57:11)
...
 at async Archivist.trackTermsChanges (file:///home/pptruser/open-terms-archive/engine/src/archivist/index.js:184:22)

The snapshot commit mentioned in the error is the current HEAD of our snapshot Git repository: https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-snapshots/-/tree/693a560f39b6de4006a6219c3e97c8778dbe6bbb

As you can see in the "Zalando" folder, the "Data catalogue..." file is duplicated, once as (empty) HTML and once as PDF.

The relevant declaration is: https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/vlop-vlose-declarations/-/blob/main/declarations/Zalando.yml?ref_type=heads#L14-15

My understanding of the situation is that:

  • The Zalando declaration points to a PDF file, which was fetched correctly over the last days/weeks.
  • At some point, something caused the server to return an empty HTML response instead (a temporary issue on the web server, antibot protection, whatever). The engine then recorded the HTML file alongside the PDF file.
  • The snapshot directory now contains both an HTML and a PDF file for the same terms, which crashes the pipeline.
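
To illustrate the sequence above, here is a minimal sketch of how a recorder could detect that a file with the same name but a different extension is already tracked before it writes a new one. This is not the engine's actual code: `splitPath` and `findStaleSiblings` are hypothetical helpers, and the second tracked path is made up for the example.

```javascript
// Hypothetical helper, not part of the engine: split "dir/name.ext" into its parts.
function splitPath(filePath) {
  const slash = filePath.lastIndexOf('/');
  const dir = slash === -1 ? '' : filePath.slice(0, slash);
  const base = filePath.slice(slash + 1);
  const dot = base.lastIndexOf('.');

  // A leading dot (e.g. ".gitignore") is not treated as an extension separator.
  if (dot <= 0) return { dir, name: base, ext: '' };

  return { dir, name: base.slice(0, dot), ext: base.slice(dot + 1) };
}

// Hypothetical helper: return the tracked paths that share the directory and
// base name of `newPath` but have a different extension.
function findStaleSiblings(existingPaths, newPath) {
  const target = splitPath(newPath);

  return existingPaths.filter(candidate => {
    const { dir, name, ext } = splitPath(candidate);

    return dir === target.dir && name === target.name && ext !== target.ext;
  });
}

const tracked = [
  'Zalando/Data Catalogue for Vetted Researchers.pdf',
  'Zalando/Some Other Terms.html', // made-up path, for illustration only
];

// The server answers with HTML instead of the usual PDF:
const stale = findStaleSiblings(tracked, 'Zalando/Data Catalogue for Vetted Researchers.html');

console.log(stale); // the PDF path is reported as a conflicting sibling
```

With such a check, the recorder could replace the existing file, or refuse the new one, instead of leaving both in the snapshot directory.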

I can probably work around it by manually removing the faulty HTML file, but this issue will likely happen again on future runs.
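
Independently of the root cause, a single faulty entry should probably not stop the whole run. Below is a minimal sketch of that kind of isolation, assuming the pipeline can be expressed as a loop over entries. `processAll` and `processEntry` are hypothetical names, not engine APIs.

```javascript
// Hypothetical sketch: `processEntry` stands in for whatever per-terms work
// the pipeline performs; it is expected to throw (or reject) on failure.
async function processAll(entries, processEntry) {
  const failures = [];

  for (const entry of entries) {
    try {
      await processEntry(entry);
    } catch (error) {
      // Record the failure and carry on with the remaining entries.
      failures.push({ entry, message: error.message });
    }
  }

  return failures;
}

// Usage: one entry fails, the other is still processed.
const processed = [];

const run = processAll(['faulty terms', 'healthy terms'], async entry => {
  if (entry === 'faulty terms') {
    throw new Error('Only one file should have been recorded');
  }
  processed.push(entry);
});

run.then(failures => {
  console.log('processed:', processed);
  console.log('failures:', failures);
});
```

The failures can then be reported at the end of the run, so the problem stays visible without blocking the other declarations.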

LVerneyEC · Nov 28 '25 12:11