Make LD parser more resilient

https://github.com/indix/web-auto-extractor/blob/2d15ce45c8a8ef8387f5c8035817fe45a607081c/src/parsers/jsonld-parser.js#L8-L22

The current JSON-LD parser assumes a perfect-world scenario.

  • Some websites (e.g. www.empireonline.com/) have JSON that contains raw new-line characters inside string values, i.e. invalid JSON.
  • Some websites (e.g. Variety) have JSON that is wrapped in CDATA comments, e.g. https://gist.github.com/gajus/4a2653b4a5235ccebedc44467a2896f2. Furthermore, a stray ; follows the JSON (see the sketch below).
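
To illustrate the second case, here is a hypothetical script body of the kind described above (the real payload is in the linked gist); JSON.parse rejects it as-is because of the CDATA markers and the trailing semicolon:

// Hypothetical fixture, not copied from the gist: the JSON-LD document is
// wrapped in CDATA markers and followed by a stray semicolon.
const rawScriptBody = `
//<![CDATA[
{"@context": "http://schema.org", "@type": "NewsArticle", "headline": "Example"}
//]]>;
`;

// JSON.parse(rawScriptBody) throws a SyntaxError; slicing from the first '{'
// to the last '}' (as in the parser below) recovers the document.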

Here is how I've implemented an LD+JSON parser in my local project (named parseJsonLd here for reference):

import { JSDOM } from 'jsdom';
import createDebug from 'debug';

const debug = createDebug('ld-json-parser'); // namespace is arbitrary

const parseJsonLd = (html: string): $ReadOnlyArray<Object> => {
  const dom = new JSDOM(html);

  const nodes = Object.values(dom.window.document.querySelectorAll('script[type="application/ld+json"]'));

  return nodes.map((node) => {
    if (!node || typeof node.innerHTML !== 'string') {
      throw new TypeError('Unexpected content.');
    }

    let body = node.innerHTML;

    debug('body', body);

    // Some websites (e.g. Empire) have JSON that includes new-lines, i.e. invalid JSON.
    body = body.replace(/\n/g, '');

    // Some websites (e.g. Variety) have JSON that is surrounded in CDATA comments, e.g.
    // https://gist.github.com/gajus/4a2653b4a5235ccebedc44467a2896f2
    // Slicing from the first '{' to the last '}' also drops the trailing ';'.
    body = body.slice(body.indexOf('{'), body.lastIndexOf('}') + 1);

    return JSON.parse(body);
  });
};
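
For reference, a minimal usage sketch (parseJsonLd is the name I gave the const above; page.html is just a placeholder):

import { readFileSync } from 'fs';

// Read a saved page and log the type and name of every JSON-LD document found.
const html = readFileSync('page.html', 'utf8');

for (const data of parseJsonLd(html)) {
  console.log(data['@type'], data.name);
}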

So far it has worked with all the sites I have tested.

gajus · Sep 16 '17 19:09

Another thing worth mentioning is that many sites include HTML-entity-encoded data in their LD+JSON feed.

I simply use import { AllHtmlEntities } from 'html-entities'; to decode all fields, just in case (see the sketch below).
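
A minimal sketch of that decoding step, assuming the v1 API of html-entities and a recursive walk over the parsed document (decodeFields is just a name I made up here):

import { AllHtmlEntities } from 'html-entities';

const entities = new AllHtmlEntities();

// Recursively decode every string value in the parsed JSON-LD document.
const decodeFields = (value) => {
  if (typeof value === 'string') {
    return entities.decode(value);
  }

  if (Array.isArray(value)) {
    return value.map(decodeFields);
  }

  if (value !== null && typeof value === 'object') {
    const result = {};

    for (const [key, child] of Object.entries(value)) {
      result[key] = decodeFields(child);
    }

    return result;
  }

  return value;
};

// e.g. in the parser above: return decodeFields(JSON.parse(body));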

gajus · Sep 16 '17 20:09

Thanks @gajus for reporting. Will fix the corner cases.

Vasanth-Indix · Sep 21 '17 04:09