opengraph
opengraph.Fetch returns nothing for a few domains
I have been using this package to fetch Open Graph info about websites and articles, but for a few websites, e.g. FastCompany, the Fetch() method returns nothing. After some research, I found that some websites block bots from scraping their content. However, when I try the Raycast preview, or even the macOS preview, it successfully fetches the metadata with the image and title. How can I achieve that? Here's what my code looks like:
package api
import (
"net/http"
"mypackage/read-it-later/structs"
"mypackage/read-it-later/utils"
"github.com/gin-gonic/gin"
"github.com/oklog/ulid/v2"
"github.com/otiai10/opengraph"
)
func StoreEntity(c *gin.Context) {
var requestBody structs.RequestURL
if err := c.BindJSON(&requestBody); err != nil {
return
}
c.Header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36")
ogp, err := opengraph.Fetch(requestBody.URL)
if err != nil {
// Return here so the success response below is not also written.
c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
return
}
c.JSON(http.StatusOK, gin.H{
"id": ulid.Make(),
"description": ogp.Description,
"favicon": ogp.Favicon,
"image": utils.FetchImageURL(ogp.Image, ogp.Favicon, ogp.URL.Host, ogp.URL.Scheme),
"siteName": ogp.SiteName,
"title": ogp.Title,
"type": ogp.Type,
"URL": ogp.URL,
"all_info": ogp,
})
}
Thank you, @Pancham97
- Please share the actual URLs you are referring to.
- What is your rationale for concluding that these websites are blocking bots?
Hey @otiai10, sorry, I forgot to add them. Here are a couple that I couldn't get working:
- https://www.fastcompany.com/90945102/ai-chatbots-health-medicine-chatgpt-webmd-self-diagnosis-misinformation
- https://www.nplusonemag.com/issue-25/on-the-fringe/uncanny-valley/
What is your rationale for concluding that these websites are blocking bots?
I am not sure. Maybe they don't want unnecessary website scraping or something. Plus, a few websites serve content via JavaScript, and that could be an issue too? 🤷
This works:
package main
import (
"compress/gzip"
"encoding/json"
"log"
"net/http"
"os"
"github.com/otiai10/opengraph"
)
func main() {
target := "https://www.fastcompany.com/90945102/ai-chatbots-health-medicine-chatgpt-webmd-self-diagnosis-misinformation"
// 1) Necessary headers
headers := map[string]string{
"Accept": "text/html",
"Accept-Encoding": "gzip",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
}
req, err := http.NewRequest("GET", target, nil)
if err != nil {
log.Println(1000, err)
return
}
for k, v := range headers {
req.Header.Set(k, v)
}
// 2) Necessary cookies (set by geo.captcha-delivery.com)
req.AddCookie(&http.Cookie{
Name: "datadome", // See your browser's cookie with this name
Value: "2ZnfSBOvZs1C2ZURicdpZAkZ-86xXY_RyRG-D6E8CjiNpgopXq7byBj5KmkCtLmcjRGjeGpzkBmP0JvFmKwUxazBMrGTkpY8-K9mJdGxD8WYobZ5QmI76Uqdhgf6Wvdi",
})
res, err := http.DefaultClient.Do(req)
if err != nil {
log.Println(1001, err)
return
}
defer res.Body.Close()
if res.StatusCode != 200 {
log.Println(1005, "Status code is not 200")
log.Println("Status:", res.StatusCode)
log.Println("Content-Type:", res.Header.Get("Content-Type"))
log.Println("Content-Encoding:", res.Header.Get("Content-Encoding"))
return
}
reader, err := gzip.NewReader(res.Body)
if err != nil {
log.Println(1002, err)
return // reader is nil here; using it below would panic
}
defer reader.Close()
// Use "Parse" for the io.Reader
ogp := opengraph.New(target)
if err := ogp.Parse(reader); err != nil {
log.Println(1004, err)
return
}
// Then let's check it out!
enc := json.NewEncoder(os.Stdout)
enc.SetIndent("", " ")
enc.Encode(ogp)
}
There are various reasons why the opengraph package might fail to fetch information (this list is not exhaustive):
1. Just the way the site is:
   a. OGP tags are simply missing 😝
   b. Client-side rendering
   c. etc.
2. Content control for security and reliability reasons:
   a. User-Agent checks
   b. Human verification (e.g., CAPTCHA)
   c. etc.
In your case with fastcompany.com, 2-a and 2-b from the list above are what matter.
Hey @otiai10, thanks, but for some reason I can't get it working. I am a bit new to Go, so I might be missing something obvious, but I am getting error 1005. I have replaced the value of the datadome cookie with the one from my browser, and it still does not work. Please help.
2023/08/29 23:24:11 1005 Status code is not 200
2023/08/29 23:24:11 Status: 403
2023/08/29 23:24:11 Content-Type: text/html;charset=utf-8
2023/08/29 23:24:11 Content-Encoding:
- Check the status text of the response
- Check the body of the 403 response, if there is one
- Tweak headers more
- Tweak cookies more