mhtml-parser

Handle unique content with duplicated filenames

Open · javamonn opened this issue 4 years ago · 1 comment

It looks like different parts sometimes share the same Content-Location even though their content differs. Their Content-IDs within the .mhtml also differ, which makes me think these may be subframes on the page. I've included an .mhtml here (zipped as GitHub disallows the .mhtml filetype) exported from Chrome of https://lithub.com/meaning-in-the-margins-on-the-literary-value-of-annotation/ that shows this problem. If you search the archive for Content-Location: https://lithub.com/meaning-in-the-margins-on-the-literary-value-of-annotation/, you'll see that there are two different html documents at that location.

This creates a problem particularly if the files are written to disk or uploaded to a static server keyed on their filenames, as the last file written will overwrite any earlier file with the same name. The demo server (npm run serve) handles this alright as it will directly serve the first parsed data item from memory, but if these files were instead written to disk, the subframe document would end up replacing what was supposed to be the index document.
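To make the collision concrete, here is a minimal sketch that groups parts by Content-Location and reports any location claimed by more than one part. The part shape (`location`, `id`, `body`), the helper name `findLocationCollisions`, and the example.com URLs are assumptions for illustration only; they are not mhtml-parser's actual data structure or API.

```javascript
// Hypothetical part shape: { location, id, body }. This is an assumption
// for illustration, not what mhtml-parser actually returns.
function findLocationCollisions(parts) {
  const byLocation = new Map();
  for (const part of parts) {
    const group = byLocation.get(part.location) ?? [];
    group.push(part);
    byLocation.set(part.location, group);
  }
  // Keep only locations that more than one part claims.
  return [...byLocation.entries()].filter(([, group]) => group.length > 1);
}

// Made-up sample data: two documents share one Content-Location.
const sampleParts = [
  { location: 'https://example.com/page/', id: '<frame-1@mhtml.blink>', body: '<html>index</html>' },
  { location: 'https://example.com/page/', id: '<frame-2@mhtml.blink>', body: '<html>subframe</html>' },
  { location: 'https://example.com/style.css', id: null, body: 'body{}' },
];

const collisions = findLocationCollisions(sampleParts);
for (const [location, group] of collisions) {
  console.log(`${location} is used by ${group.length} parts`);
}
```

Any location reported here would map to a single filename on disk, so whichever part is written last wins.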

One solution I can think of would be to incorporate the Content-ID into the rewritten filenames and internal URLs of the parsed data.
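A rough sketch of that idea, assuming a hypothetical helper `uniqueFilename` and made-up sample values (none of this is mhtml-parser's existing behaviour): the Content-ID, when present, is sanitized and prefixed to a filename derived from the Content-Location, so two parts at the same location no longer map to the same file.

```javascript
// Hypothetical helper, not part of mhtml-parser: derive an on-disk name
// from a part's Content-Location and, when present, its Content-ID.
function uniqueFilename(location, contentId) {
  // Collapse every run of filename-unsafe characters into one underscore.
  const sanitize = (value) => value.replace(/[^a-zA-Z0-9._-]+/g, '_');
  const base = sanitize(location).replace(/^_+|_+$/g, '');
  if (!contentId) return base;
  // Content-ID values are wrapped in angle brackets; strip them first.
  const idPart = sanitize(contentId.replace(/^<|>$/g, ''));
  return `${idPart}__${base}`;
}

// Made-up sample values: same location, different Content-IDs.
const indexName = uniqueFilename('https://example.com/page/', '<frame-1@mhtml.blink>');
const frameName = uniqueFilename('https://example.com/page/', '<frame-2@mhtml.blink>');
console.log(indexName);
console.log(frameName);
```

This only separates parts that carry distinct Content-IDs. Parts without a Content-ID that share a location would still collide, and internal URLs would need to be rewritten with the same mapping for links to keep resolving.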

javamonn, Apr 14 '21 14:04

Oh, that's a good point. A PR to involve the Content-ID is welcome.

benjamingr, May 25 '21 08:05