web-auto-extractor
Make LD parser more resilient
https://github.com/indix/web-auto-extractor/blob/2d15ce45c8a8ef8387f5c8035817fe45a607081c/src/parsers/jsonld-parser.js#L8-L22
The current JSON-LD parser assumes a perfect world scenario.
- Some websites (e.g. www.empireonline.com/) have JSON that includes raw new-lines inside string values, i.e. invalid JSON.
- Some websites (e.g. Variety) have JSON that is surrounded in CDATA comments, e.g. https://gist.github.com/gajus/4a2653b4a5235ccebedc44467a2896f2. Furthermore, the JSON is followed by a trailing ; (see the hypothetical example below).
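For illustration, the payloads in question look roughly like this (a hypothetical example, not copied from either of the sites above):

// Hypothetical script contents: a literal new-line inside a string value,
// a CDATA wrapper around the JSON, and a trailing ";".
const exampleScriptBody = `
//<![CDATA[
{"@context": "http://schema.org", "@type": "Movie", "name": "Some
Movie"}
//]]>;
`;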
Here is how I've implemented an LD+JSON parser in my local project:
import { JSDOM } from 'jsdom';
import createDebug from 'debug';

// The debug namespace is illustrative.
const debug = createDebug('jsonld-parser');

export default (html: string): $ReadOnlyArray<Object> => {
  const dom = new JSDOM(html);
  const nodes = Object.values(dom.window.document.querySelectorAll('script[type="application/ld+json"]'));

  return nodes.map((node) => {
    if (!node || typeof node.innerHTML !== 'string') {
      throw new TypeError('Unexpected content.');
    }

    let body = node.innerHTML;

    debug('body', body);

    // Some websites (e.g. Empire) have JSON that includes new-lines, i.e. invalid JSON.
    body = body.replace(/\n/g, '');

    // Some websites (e.g. Variety) have JSON that is surrounded in CDATA comments
    // (and followed by a trailing ";"), e.g.
    // https://gist.github.com/gajus/4a2653b4a5235ccebedc44467a2896f2
    body = body.slice(body.indexOf('{'), body.lastIndexOf('}') + 1);

    return JSON.parse(body);
  });
};
Thus far it works with all the sites I have been testing.
Another thing worth mentioning is that a lot of the sites include HTML entity encoded data in the LD+JSON feed.
I simply use import { AllHtmlEntities } from 'html-entities'; to decode all fields, just in case.
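Roughly along these lines (a minimal sketch; decodeEntitiesDeep is just an illustrative local helper, and the AllHtmlEntities usage is the html-entities v1 API):

import { AllHtmlEntities } from 'html-entities';

const entities = new AllHtmlEntities();

// Illustrative helper: walks the parsed JSON-LD and decodes every string value.
const decodeEntitiesDeep = (value) => {
  if (typeof value === 'string') {
    return entities.decode(value);
  }

  if (Array.isArray(value)) {
    return value.map(decodeEntitiesDeep);
  }

  if (value !== null && typeof value === 'object') {
    return Object.keys(value).reduce((result, key) => {
      result[key] = decodeEntitiesDeep(value[key]);

      return result;
    }, {});
  }

  return value;
};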
Thanks @gajus for reporting. Will fix the corner cases.