Cannot import recipes from the wayback machine
Failing website:
https://web.archive.org/web/20210921133924/https://www.finecooking.com/recipe/carrot-fingerling-potato-and-pea-ragout
Checking if valid metadata are present:
Yes, I checked the page source and found the metadata.
The page has two application/ld+json script tags; the second one contains the recipe information:
{
"@context": "https:\/\/web.archive.org\/web\/20210921133924\/http:\/\/schema.org\/",
"@type": "Recipe",
"name": "Carrot, Fingerling Potato, and Pea Rago\u00fbt",
"author": [
{
"@type": "Person",
"name": "Susie Middleton",
"url": "https:\/\/web.archive.org\/web\/20210921133924\/https:\/\/www.finecooking.com\/author\/susie-middleton"
}
],
"datePublished": "2012-03-01",
"image": "https:\/\/web.archive.org\/web\/20210921133924\/https:\/\/s3.amazonaws.com\/finecooking.s3.tauntonclud.com\/app\/uploads\/2017\/04\/18125930\/051116057-01-spring-vegetable-ragout-thumb1x1.jpg",
"description": "Hearty caramelized carrots and potatoes are the base for this delicious side dish, while peas, baby spinach, lemon, and tarragon add a bright, fresh twist. Serve with roast chicken or…",
"recipeYield": " 4 to 6 servings",
"recipeIngredient": [
"1 medium lemon",
"1 tsp. balsamic vinegar",
"1 tsp. maple syrup",
"3-1\/2 Tbs. unsalted butter, chilled",
"2 Tbs. extra-virgin olive oil; more as needed",
"1-1\/2lb. large carrots, cut into 2-inch-long, 1\/2-inch-thick sticks",
"Kosher salt",
"12 oz. small fingerling potatoes, cut in half lengthwise (if longer than 2 inches, cut in half crosswise)",
"1 cup lower-salt chicken broth or water",
"1-1\/2 tsp. minced garlic",
"3 oz. (about 3\/4 cup) fresh peas, blanched, or frozen peas, thawed",
"2 oz. stemmed baby spinach leaves",
"2 tsp. chopped fresh tarragon"
],
"recipeInstructions": [
"Finely grate the lemon to yield 1 tsp. zest and juice it to yield 1-1\/2 tsp. juice. In a small bowl, combine the zest, juice, vinegar, maple syrup, and 1 Tbs. water.",
"In a 5- to 6-quart Dutch oven (or other deep, wide pan), heat 1 Tbs. of the butter and the olive oil over low heat. Add the carrots and 3\/4 tsp. salt. Cover and cook, stirring frequently but gently, until the carrots are nicely browned and just tender, about 20 minutes. With a slotted spoon, transfer the carrots to a large plate.",
"Add 1 Tbs. butter to the remaining fat in the pan. (If there\u2019s no fat in the pan, add 1 Tbs. olive oil too.) When the butter has melted, arrange the fingerlings cut side down in a single layer in the pan and season with 3\/4 tsp. salt. Cover partially and cook, undisturbed, until the potatoes are deep golden-brown on the bottom, 5 to 7 minutes. Add the chicken broth or water and bring to a boil; reduce to a simmer and cover partially. Cook until the potatoes are tender and the liquid has reduced to 2 to 3 Tbs., 12 to 14 minutes.",
"Add the garlic to the potatoes and cook, stirring very gently, until fragrant, about 30 seconds. Add the reserved carrots and the peas, spinach, and lemon juice mixture. Stir gently until the spinach is wilted, 1 to 2 minutes. Remove the pan from the heat and stir in the remaining 1-1\/2 Tbs. butter until just melted. Stir in the tarragon. Transfer the vegetables to a platter and serve."
],
"recipeCategory": "Side dishes",
"recipeCuisine": "French",
"nutrition": {
"@type": "NutritionInformation",
"servingSize": " 4 to 6",
"calories": "210 kcal",
"fatContent": "110 kcal",
"saturatedFatContent": "5 g",
"transFatContent": "12 g",
"carbohydrateContent": "25 g",
"fiberContent": "5 g",
"proteinContent": "4 g",
"cholesterolContent": "20 mg",
"sodiumContent": "410 mg",
"unsaturatedFatContent": "6 g"
},
"aggregateRating": {
"@type": "AggregateRating",
"ratingValue": "5",
"ratingCount": "4"
},
"isPartOf": {
"@type": "PublicationIssue",
"name": "Issue 116",
"url": "https:\/\/web.archive.org\/web\/20210921133924\/https:\/\/www.finecooking.com\/issue\/2012\/03\/issue-116",
"isPartOf": {
"@type": "Periodical",
"name": "Fine Cooking Magazine",
"publisher": {
"@type": "Organization",
"name": "Fine Cooking",
"url": "https:\/\/web.archive.org\/web\/20210921133924\/https:\/\/www.finecooking.com",
"sameAs": [
"https:\/\/web.archive.org\/web\/20210921133924\/https:\/\/twitter.com\/finecooking",
"https:\/\/web.archive.org\/web\/20210921133924\/https:\/\/www.facebook.com\/FineCooking",
"https:\/\/web.archive.org\/web\/20210921133924\/https:\/\/www.instagram.com\/finecookingmag\/",
"https:\/\/web.archive.org\/web\/20210921133924\/https:\/\/www.pinterest.com\/finecooking\/"
],
"logo": {
"@type": "ImageObject",
"url": "https:\/\/web.archive.org\/web\/20210921133924\/https:\/\/www.finecooking.com\/app\/plugins\/finecooking\/assets\/img\/fc-logo-black.png"
}
}
},
"issueNumber": "116",
"image": "https:\/\/web.archive.org\/web\/20210921133924\/https:\/\/s3.amazonaws.com\/finecooking.s3.tauntonclud.com\/app\/uploads\/2017\/04\/18212453\/issue_116.jpg"
}
}
Cookbook version: 0.11.2
Problem description (if applicable):
Cannot import recipes from archive.org.
This is because archive.org rewrites the HTML and prepends an https://web.archive.org/... URL to every URL in the page.
If you look at the JSON-LD metadata above, you will notice that "@context": "http:\/\/schema.org\/" has been replaced with "@context": "https:\/\/web.archive.org\/web\/20210921133924\/http:\/\/schema.org\/", which causes this function to return false: https://github.com/nextcloud/cookbook/blob/644f96881bbdf5b38e86b4544e09d430a3feb454/lib/Service/JsonService.php#L74-L76
The alternative I would suggest is doing:
public function isSchemaContext(string $context): bool {
    // Accept plain schema.org as well as the Wayback Machine-prefixed form.
    return preg_match('@^https?://schema\.org/?$@', $context) === 1
        || preg_match('@^https?://web\.archive\.org/web/\d+/https?://schema\.org/?$@', $context) === 1;
}
This accounts for the URL being prefixed with https://web.archive.org/...
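To make the intended behaviour concrete, here is a minimal standalone sketch of that check, run against a few sample contexts. It is not the actual JsonService method: the free function and the sample strings are only for illustration, and it assumes the context has already been JSON-decoded (so the `\/` escapes are gone).

```php
<?php
declare(strict_types=1);

// Standalone sketch of the proposed check (illustrative, not the real class method).
function isSchemaContext(string $context): bool {
    // Plain schema.org context, with or without a trailing slash.
    $plain = '@^https?://schema\.org/?$@';
    // The same context behind a Wayback Machine prefix with a numeric timestamp.
    $archived = '@^https?://web\.archive\.org/web/\d+/https?://schema\.org/?$@';

    return preg_match($plain, $context) === 1
        || preg_match($archived, $context) === 1;
}

// Sample contexts; the last one is a made-up URL that should be rejected.
$samples = [
    'http://schema.org/',
    'https://schema.org',
    'https://web.archive.org/web/20210921133924/http://schema.org/',
    'https://example.com/schema.org/',
];

foreach ($samples as $sample) {
    echo $sample, ' => ', var_export(isSchemaContext($sample), true), PHP_EOL;
}
```

The first three samples are accepted and the last one is rejected, since both patterns are anchored at the start and the end of the string.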
I have checked
and neither of them modifies the @context property in the JSON-LD; however, there may be other archival sites that do modify it. So perhaps the regex could instead be changed to @https?://schema\.org/?$@, so that it no longer needs to match from the beginning of the string.
I'd even be keen to just change it to return stristr($context, 'schema.org') !== false; (stristr returns a string or false, hence the explicit comparison), so that as long as schema.org appears anywhere in the context, that is enough to proceed.
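The trade-off between these options can be sketched by running each one over the same inputs. The helper names and the last two sample URLs below are hypothetical and exist only for this comparison; the anchored pattern is assumed to represent a strict check that matches the whole string.

```php
<?php
declare(strict_types=1);

// Hypothetical helpers, one per strategy discussed above.
function matchesAnchored(string $context): bool {
    // Strict: the whole string must be the schema.org URL.
    return preg_match('@^https?://schema\.org/?$@', $context) === 1;
}

function matchesSuffix(string $context): bool {
    // Looser: the string only has to end with the schema.org URL.
    return preg_match('@https?://schema\.org/?$@', $context) === 1;
}

function containsSchemaOrg(string $context): bool {
    // Loosest: schema.org may appear anywhere, case-insensitively.
    return stristr($context, 'schema.org') !== false;
}

$samples = [
    'http://schema.org/',
    'https://web.archive.org/web/20210921133924/http://schema.org/',
    'https://example.com/?ref=schema.org',
    'https://notschema.org/',
];

foreach ($samples as $sample) {
    printf(
        "%s\n  anchored=%s suffix=%s contains=%s\n",
        $sample,
        var_export(matchesAnchored($sample), true),
        var_export(matchesSuffix($sample), true),
        var_export(containsSchemaOrg($sample), true)
    );
}
```

The suffix pattern accepts the archived context while still rejecting the two made-up URLs, whereas the substring check accepts all four. Whether that extra leniency matters depends on how much the importer needs to trust the context value.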