Handle badly formatted JSON-LD data.

Open shiquanwang opened this issue 7 years ago • 3 comments

Some web pages contain badly formatted JSON-LD data. Here is an example.

The JSON-LD in this page is:


{
  "@context": "http://schema.org",
        "@type": "Product",
                "name": "Black 'Clint' FT0511 cat eye sunglasses",
                "image": "https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001",
		"brand": {
                  "@type": "Thing",
                  "name": "Tom Ford"
                },
                "offers": {
                	"@type": "Offer",
                	"priceCurrency": "GBP",
                	"price": "285.00",
                	"itemCondition": "http://schema.org/NewCondition",
                	"availability": "http://schema.org/InStock"
                }
    }
}

In the JSON-LD above, the last } is extra, and neither extruct nor json.loads handles it properly.

In Python 3.5 and later, json.loads raises a JSONDecodeError with detailed error information, here: Extra data: line 19 column 1 (char 624)

In [7]: try:
   ...:     data = json.loads(json_ld_string)
   ...: except json.JSONDecodeError as err:
   ...:     print(err)
   ...:     print(err.msg)
   ...:     print(err.pos)
   ...:
Extra data: line 19 column 1 (char 624)
Extra data
624

The err.msg and err.pos attributes give hints on how to fix the JSON-LD data. For this example, we can remove the character at position 624 and parse the string again, which correctly gives:

{'@context': 'http://schema.org',
 '@type': 'Product',
 'brand': {'@type': 'Thing', 'name': 'Tom Ford'},
 'image': 'https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001',
 'name': "Black 'Clint' FT0511 cat eye sunglasses",
 'offers': {'@type': 'Offer',
            'availability': 'http://schema.org/InStock',
            'itemCondition': 'http://schema.org/NewCondition',
            'price': '285.00',
            'priceCurrency': 'GBP'}}
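A minimal sketch of that retry, assuming the only defect is stray trailing characters. The helper name loads_dropping_extra and the max_fixes limit are made up for illustration and are not part of extruct:

```python
import json


def loads_dropping_extra(text, max_fixes=3):
    """Parse JSON, deleting the offending character on "Extra data" errors.

    Hypothetical helper: it only targets stray trailing characters such as
    an extra closing brace, and re-raises any other kind of decode error.
    """
    fixes = 0
    while True:
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            # Give up on other error types or after too many repairs.
            if err.msg != "Extra data" or fixes >= max_fixes:
                raise
            # err.pos is the index of the first unexpected character.
            text = text[:err.pos] + text[err.pos + 1:]
            fixes += 1


broken = '{"@type": "Product", "offers": {"price": "285.00"}}\n}'
print(loads_dropping_extra(broken))
```

Anything other than "Extra data" is re-raised unchanged, so this does not hide unrelated errors.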

There are many possible format errors. Some can be fixed easily, while others might be harder or even impossible to fix.

I propose 3 ways to improve the situation:

  • extruct tries various ways to fix the JSON-LD data case by case; this requires Python >= 3.5 to get the detailed error information
  • extruct lets the user pass in a function to parse the JSON data, so users can handle their own error types
  • extruct outputs the extracted JSON-LD string instead of parsed data, leaving parsing and error handling to the user

I personally recommend the latter 2 ways.
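To illustrate the latter two options, here is a rough sketch. It does not use extruct's actual API: parse_json_ld_blocks and lenient_loads are hypothetical names, and the raw strings stand in for whatever the extractor would hand back unparsed.

```python
import json
from typing import Any, Callable, Iterable, List


def parse_json_ld_blocks(
    raw_blocks: Iterable[str],
    loads: Callable[[str], Any] = json.loads,
) -> List[Any]:
    """Hypothetical helper: parse raw JSON-LD strings with a caller-supplied function."""
    return [loads(raw) for raw in raw_blocks]


def lenient_loads(raw: str) -> Any:
    """Example user policy: return None for a bad block instead of raising."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


# Raw strings as they might come out of the page (the second has an extra brace).
blocks = ['{"@type": "Product"}', '{"@type": "Offer"}}']
print(parse_json_ld_blocks(blocks, loads=lenient_loads))
# -> [{'@type': 'Product'}, None]
```

With the parser passed as an argument, each caller decides whether to repair, skip, or fail on bad data.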

Thanks.

shiquanwang • Aug 20 '18 08:08

I guess this provides more motivation for https://github.com/scrapinghub/extruct/pull/69/, though I'd prefer the JSON decoding function to be an argument, not a global option.

Providing something which handles more cases by default makes sense to me, though we may start just with having a good example in README.

kmike • Aug 22 '18 19:08

Maybe other libraries like demjson or yajl can handle it (see http://deron.meranda.us/python/demjson/demjson-2.2.4/docs/demjson.html#-decode - it seems there is an option to return data after the error).
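For comparison, the standard library can do something similar for trailing garbage. This is not demjson or yajl, just a sketch using json.JSONDecoder.raw_decode, which returns the first complete JSON value plus the index where it ended. The function name decode_first_value is made up for illustration.

```python
import json


def decode_first_value(text):
    """Return (value, leftover) for the first JSON value found in text.

    Hypothetical helper: it only copes with junk after a valid document,
    not with errors inside the document itself.
    """
    decoder = json.JSONDecoder()
    # raw_decode does not skip leading whitespace, so strip it first.
    stripped = text.lstrip()
    value, end = decoder.raw_decode(stripped)
    return value, stripped[end:]


value, leftover = decode_first_value('\n{"@type": "Product"}\n}\n')
print(value)           # the parsed object
print(repr(leftover))  # whatever followed it, here the stray brace
```

The leftover text is returned rather than discarded so the caller can log or inspect what was ignored.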

kmike • Aug 22 '18 19:08

Updated JSON-LD can autocorrect badly formatted JSON.

gaurav19063 • May 30 '20 03:05