
opengraph.Fetch returns nothing for a few domains

Open Pancham97 opened this issue 2 years ago • 6 comments

I have been using this package to fetch opengraph info about websites and articles, but for a few websites, e.g. FastCompany, the Fetch() method returns nothing. After some research, I found that some websites block bots from scraping their content. However, when I try Raycast preview, or even macOS preview, it successfully fetches the metadata with the image and title. How can I achieve that? Here's how my code looks:

package api

import (
	"net/http"
	"mypackage/read-it-later/structs"
	"mypackage/read-it-later/utils"

	"github.com/gin-gonic/gin"
	"github.com/oklog/ulid/v2"
	"github.com/otiai10/opengraph"
)

func StoreEntity(c *gin.Context) {
	var requestBody structs.RequestURL
	if err := c.BindJSON(&requestBody); err != nil {
		return
	}

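	// Note: c.Header sets a header on the response that gin sends back to the
	// caller; it does not change the request that opengraph.Fetch makes.
	c.Header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36")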

	ogp, err := opengraph.Fetch(requestBody.URL)

	if err != nil {
		c.JSON(http.StatusInternalServerError, err)
		return // stop here so the 200 response below is not written as well
	}

	c.JSON(http.StatusOK, gin.H{
		"id":          ulid.Make(),
		"description": ogp.Description,
		"favicon":     ogp.Favicon,
		"image":       utils.FetchImageURL(ogp.Image, ogp.Favicon, ogp.URL.Host, ogp.URL.Scheme),
		"siteName":    ogp.SiteName,
		"title":       ogp.Title,
		"type":        ogp.Type,
		"URL":         ogp.URL,
		"all_info":    ogp,
	})
}

Pancham97 avatar Aug 26 '23 16:08 Pancham97

Thank you, @Pancham97

  1. Please share the actual URLs you are referring to
  2. What is your rationale for concluding that the websites are blocking bots?

otiai10 avatar Aug 28 '23 08:08 otiai10

Hey @otiai10, sorry, I forgot to add them. Here are a couple that I couldn't get working:

  1. https://www.fastcompany.com/90945102/ai-chatbots-health-medicine-chatgpt-webmd-self-diagnosis-misinformation
  2. https://www.nplusonemag.com/issue-25/on-the-fringe/uncanny-valley/

What is your rationale for concluding that the websites are blocking bots?

I am not sure. Maybe they don't want unnecessary website scraping or something. Plus, a few websites serve content via JavaScript, and that could be an issue too? 🤷

Pancham97 avatar Aug 28 '23 08:08 Pancham97

This works:

package main

import (
	"compress/gzip"
	"encoding/json"
	"log"
	"net/http"
	"os"

	"github.com/otiai10/opengraph"
)

func main() {

	target := "https://www.fastcompany.com/90945102/ai-chatbots-health-medicine-chatgpt-webmd-self-diagnosis-misinformation"

	// 1) Necessary headers
	headers := map[string]string{
		"Accept":          "text/html",
		"Accept-Encoding": "gzip",
		"User-Agent":      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
	}

	req, _ := http.NewRequest("GET", target, nil)
	for k, v := range headers {
		req.Header.Set(k, v)
	}

	// 2) Necessary cookies (set by geo.captcha-delivery.com)
	req.AddCookie(&http.Cookie{
		Name:  "datadome", // See your browser's cookie with this name
		Value: "2ZnfSBOvZs1C2ZURicdpZAkZ-86xXY_RyRG-D6E8CjiNpgopXq7byBj5KmkCtLmcjRGjeGpzkBmP0JvFmKwUxazBMrGTkpY8-K9mJdGxD8WYobZ5QmI76Uqdhgf6Wvdi",
	})

	res, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Println(1001, err)
		return
	}
	defer res.Body.Close()

	if res.StatusCode != 200 {
		log.Println(1005, "Status code is not 200")
		log.Println("Status:", res.StatusCode)
		log.Println("Content-Type:", res.Header.Get("Content-Type"))
		log.Println("Content-Encoding:", res.Header.Get("Content-Encoding"))
		return
	}

	reader, err := gzip.NewReader(res.Body)
	if err != nil {
		log.Println(1002, err)
		return // reader is nil here, so using it below would panic
	}
	defer reader.Close()

	// Use "Parse" for the io.Reader
	ogp := opengraph.New(target)
	if err := ogp.Parse(reader); err != nil {
		log.Println(1004, err)
		return
	}

	// Then let's check it out!
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	enc.Encode(ogp)
}

otiai10 avatar Aug 28 '23 09:08 otiai10

There are various reasons why the opengraph package might fail to fetch information.

(not exhaustive)

  1. Just the way they are:
     a. Just missing OGP 😝
     b. Client-side rendering
     c. etc...
  2. Content control for security and reliability reasons:
     a. User-Agent
     b. Human auth (e.g., captcha)
     c. etc...

Then, in your case with fastcompany.com, 2-a and 2-b of the list above matter.

otiai10 avatar Aug 28 '23 09:08 otiai10

Hey @otiai10, thanks, but for some reason, I can't seem to get it working. I am a bit new to Go so I might be missing something obvious, but I am getting error 1005. I have replaced the value of the datadome cookie with my browser cookie, and yet it does not work. Please help.

2023/08/29 23:24:11 1005 Status code is not 200
2023/08/29 23:24:11 Status: 403
2023/08/29 23:24:11 Content-Type: text/html;charset=utf-8
2023/08/29 23:24:11 Content-Encoding:

Pancham97 avatar Aug 29 '23 17:08 Pancham97

  1. Check the status text of the response
  2. Check the body of the 403 response, if there is one
  3. Tweak the headers further
  4. Tweak the cookies further

otiai10 avatar Aug 30 '23 08:08 otiai10