OG scraper for list of URLs

Open ViorelMocanu opened this issue 1 year ago • 0 comments

To add the content from all 1600+ links in the document, we need an Open Graph and metadata scraper that can:

  1. parse through all 1600+ remote URL resources
  2. fetch the meta information (minimum: title, description, og:image) for each URL
  3. export a Markdown file for each link inside ./src/content/resources/, following the schema in ./src/content/config.ts; name the .md file after the meta title, sanitized to safe characters (alphanumeric plus _ or -), and keep file names unique so that resources with identical titles do not overwrite each other
  4. download the OG image, give it the same base name as the Markdown file, and place it in the same folder
  5. (optional) use the OpenAI ChatGPT API to improve the description and draft longer descriptive content
  6. (optional) use the Midjourney API to generate a unique OG image based on the original (if one exists) or on the first fold of the remote URL
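
Steps 2 and 3 could be sketched roughly as below. This is only an illustration, not a proposed implementation: the `PageMeta` shape, the helper names (`extractMeta`, `safeFileName`, `uniqueName`), the lowercasing of file names and the `-2`, `-3` suffix scheme are all assumptions, and the regex-based parsing is deliberately naive (one of the packages listed under "Potential APIs and packages" would likely be more robust).

```typescript
// Hypothetical metadata record. The real schema lives in ./src/content/config.ts,
// which is not shown here, so these field names are assumptions.
interface PageMeta {
  title?: string;
  description?: string;
  image?: string;
}

// Decode the few HTML entities that commonly show up in meta content.
function decodeEntities(s: string): string {
  return s
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&");
}

// Read one quoted attribute value from the source text of a single tag.
function attr(tag: string, name: string): string | undefined {
  const m = new RegExp(`\\s${name}\\s*=\\s*(?:"([^"]*)"|'([^']*)')`, "i").exec(tag);
  return m ? decodeEntities(m[1] ?? m[2] ?? "") : undefined;
}

// Naive extraction of title, description and og:image from an HTML string.
// Falls back to <title> and <meta name="description"> when OG tags are missing.
function extractMeta(html: string): PageMeta {
  const found = new Map<string, string>();
  const metaTag = /<meta\s[^>]*>/gi;
  let m: RegExpExecArray | null;
  while ((m = metaTag.exec(html)) !== null) {
    const key = (attr(m[0], "property") ?? attr(m[0], "name"))?.toLowerCase();
    const value = attr(m[0], "content");
    if (key && value && !found.has(key)) found.set(key, value);
  }
  const titleTag = /<title[^>]*>([^<]*)<\/title>/i.exec(html);
  return {
    title: found.get("og:title") ?? (titleTag ? decodeEntities(titleTag[1].trim()) : undefined),
    description: found.get("og:description") ?? found.get("description"),
    image: found.get("og:image"),
  };
}

// Reduce a title to alphanumerics plus _ and -. Lowercasing is an extra
// assumption, meant to avoid clashes on case-insensitive file systems.
function safeFileName(title: string): string {
  const slug = title
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "") // drop combining accents left by NFKD
    .toLowerCase()
    .replace(/[^a-z0-9_-]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return slug || "untitled";
}

// Return a name not yet in `taken`, appending -2, -3, ... on collisions,
// and record it so later identical titles get the next suffix.
function uniqueName(base: string, taken: Set<string>): string {
  let name = base;
  for (let n = 2; taken.has(name); n++) name = `${base}-${n}`;
  taken.add(name);
  return name;
}
```

With a shared `taken` set, two resources both titled "Hello, World!" would end up as `hello-world.md` and `hello-world-2.md`, and the downloaded OG image from step 4 could reuse the same base name with its own extension.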

Potential APIs and packages

  • [x] https://www.npmjs.com/package/meta-fetcher (13.6kb)
  • [ ] https://www.npmjs.com/package/fetch-opengraph (14kb)
  • [ ] https://www.npmjs.com/package/url-metadata (22.7kb)
  • [ ] https://www.npmjs.com/package/open-graph-scraper (88.3kb)
  • [ ] https://www.npmjs.com/package/isomorphic-unfetch (3.38kb) vs https://www.npmjs.com/package/axios (1770kb); see https://www.zenrows.com/blog/axios-web-scraping
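
Whichever package or HTTP client ends up being used, firing 1600+ requests at once is probably not a good idea. The sketch below shows one way to cap concurrency while keeping the actual fetching pluggable. `scrapeAll`, `Fetcher`, `Outcome` and the default limit of 8 are hypothetical names and values chosen for illustration; `fetchOne` stands in for whatever wrapper is written around the chosen package.

```typescript
// Any function that turns a URL into a result; this is where a wrapper
// around the chosen package or HTTP client would be plugged in.
type Fetcher<T> = (url: string) => Promise<T>;

// Per-URL outcome, so one failing link does not abort the whole run.
interface Outcome<T> {
  url: string;
  value?: T;
  error?: string;
}

// Run fetchOne over all URLs with at most `limit` requests in flight.
// Results are returned in the same order as the input list.
async function scrapeAll<T>(
  urls: string[],
  fetchOne: Fetcher<T>,
  limit = 8,
): Promise<Outcome<T>[]> {
  const results: Outcome<T>[] = new Array(urls.length);
  let next = 0; // index of the next URL that no worker has claimed yet

  async function worker(): Promise<void> {
    while (next < urls.length) {
      const i = next++;
      const url = urls[i];
      try {
        results[i] = { url, value: await fetchOne(url) };
      } catch (err) {
        results[i] = { url, error: err instanceof Error ? err.message : String(err) };
      }
    }
  }

  // Start a fixed number of workers that pull from the shared index.
  const count = Math.max(1, Math.min(limit, urls.length));
  await Promise.all(Array.from({ length: count }, () => worker()));
  return results;
}
```

Collecting errors per URL rather than throwing means dead or blocked links can be reviewed after the run instead of stopping the export halfway.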

ViorelMocanu, Oct 22 '23 17:10