extruct
JSONDecodeError: Extra data: line 21 column 1 (char 572) for URL https://lubelska.co.uk/
It seems the problem is that the JSON-LD document is wrapped in CDATA comment lines:
// <![CDATA[
{
"@context": "http:\/\/schema.org\/",
"name": "Lubelska",
"@type": "Organization",
"logo": "https://lubelska.co.uk/wp/wp-content/uploads/2019/05/Lubelska-1.jpg",
"url": "https://lubelska.co.uk/",
"sameAs": [
"https://twitter.com/EdwardHowey",
"https://www.facebook.com/Lubelska-309144763268698/",
"https://www.pinterest.co.uk/lubelskaltd/",
"https://www.instagram.com/lubelska1/"
],
"contactPoint": [{
"@type": "ContactPoint",
"telephone": "+44 20 3911 5526",
"email": "[email protected]",
"contactType": "sales"
}]
}
// ]]>
and after the substitution in jsonLd._extractItems():
# sometimes JSON-decoding errors are due to leading HTML or JavaScript comments
data = json.loads(
HTML_OR_JS_COMMENTLINE.sub('', script), strict=False)
it becomes:
{
"@context": "http:\/\/schema.org\/",
"name": "Lubelska",
"@type": "Organization",
"logo": "https://lubelska.co.uk/wp/wp-content/uploads/2019/05/Lubelska-1.jpg",
"url": "https://lubelska.co.uk/",
"sameAs": [
"https://twitter.com/EdwardHowey",
"https://www.facebook.com/Lubelska-309144763268698/",
"https://www.pinterest.co.uk/lubelskaltd/",
"https://www.instagram.com/lubelska1/"
],
"contactPoint": [{
"@type": "ContactPoint",
"telephone": "+44 20 3911 5526",
"email": "[email protected]",
"contactType": "sales"
}]
}
// ]]>
and naturally the trailing line, which was not removed:
// ]]>
causes the error.
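For anyone who wants to reproduce this without hitting the site, here is a minimal sketch. It assumes the existing pattern is anchored only at the start of the whole string (LEADING_ONLY below is my guess at its shape, not a copy of the library source), and EVERY_LINE is a hypothetical variant using re.MULTILINE. It is one possible direction, not necessarily the fix that ends up being merged.

```python
import json
import re

# Assumption: the current pattern only matches a comment at the very start
# of the script text, so a trailing "// ]]>" line is left in place.
LEADING_ONLY = re.compile(r'^\s*(//.*|<!--.*-->)')

# Hypothetical variant: with re.MULTILINE, ^ matches at the start of every
# line, so both the leading and the trailing comment lines are stripped.
EVERY_LINE = re.compile(r'^\s*(//.*|<!--.*-->)', re.MULTILINE)

# Shortened version of the JSON-LD block from the page.
script = '''// <![CDATA[
{
  "@type": "Organization",
  "name": "Lubelska",
  "url": "https://lubelska.co.uk/"
}
// ]]>
'''


def try_load(pattern, text):
    """Strip comment lines with `pattern`, then decode; return (data, error)."""
    try:
        return json.loads(pattern.sub('', text), strict=False), None
    except json.JSONDecodeError as exc:
        return None, exc


# Only the leading comment is removed, so decoding fails on the trailing one.
_, leading_err = try_load(LEADING_ONLY, script)
print("leading only:", leading_err)

# Both comment lines are removed, so decoding succeeds.
fixed, fixed_err = try_load(EVERY_LINE, script)
print("every line:", fixed)
```

The URLs inside the JSON are not affected by the multiline variant, because the pattern only matches // when nothing but whitespace precedes it on the line.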
I'm having the same problem with this URL: https://www.eatwell101.com/shrimp-and-broccoli-foil-packs-recipe
After running HTML_OR_JS_COMMENTLINE, the value of script is:
'\n{
"@context":"https:\\/\\/schema.org\\/",
"@type":"Recipe",
"mainEntityOfPage":{
"@type":"WebPage",
"@id":"https:\\/\\/www.eatwell101.com\\/shrimp-and-broccoli-foil-packs-recipe"},
"name":"Baked Shrimp and Broccoli Foil Packs with Garlic Lemon Butter Sauce",
"url":"https:\\/\\/www.eatwell101.com\\/shrimp-and-broccoli-foil-packs-recipe",
"headline":"Baked Shrimp and Broccoli Foil Packs with Garlic Lemon Butter Sauce",
"Description":"This baked shrimp foil pack meal is ready in under 30 minutes - The easiest way to cook shrimp in your oven!",
"author":{
"@type":"Person",
"name":"Christina Cherrier"},
"image":"https:\\/\\/www.eatwell101.com\\/wp-content\\/uploads\\/2019\\/04\\/shrimp-and-broccoli-recipe-2.jpg",
"datePublished":"2020-01-10 07:47:21",
"dateModified":"2020-06-20 17:47:39",
"Publisher":"Eatwell101",
"ingredients":"",
"prepTime":"PT10M",
"cookTime":"PT15M",
"recipeYield":"2 servings"}
// ]]>\n'
So it is the same problem: the trailing // ]]>\n' was not removed.
Just opened a PR with a fix here: https://github.com/scrapinghub/extruct/pull/144