llm-scraper
Fix Readable.js on certain pages (like HN)
Looking into this, it's HN's CSP that is blocking scripts loaded from Skypack. From the console:
Refused to load the script 'https://cdn.skypack.dev/@mozilla/readability' because it violates the following Content Security Policy directive: "script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/". Note that 'script-src-elem' was not explicitly set, so 'script-src' is used as a fallback.
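To make the console message concrete, here is a rough sketch of the allow-list check the browser is applying. It is deliberately simplified: it only handles `'self'` and absolute `https://` sources with prefix path matching, and ignores nonces, hashes, wildcards, and `'strict-dynamic'`. The function names `parseDirective` and `isScriptAllowed` are made up for illustration and are not part of any library.

```typescript
// Simplified sketch of CSP script-source matching; NOT a full CSP implementation.
// parseDirective and isScriptAllowed are hypothetical names used only for illustration.

// Return the source list of a directive, or undefined if the directive is absent.
function parseDirective(policy: string, name: string): string[] | undefined {
  for (const part of policy.split(';')) {
    const tokens = part.trim().split(/\s+/).filter(Boolean)
    if (tokens.length > 0 && tokens[0].toLowerCase() === name) {
      return tokens.slice(1)
    }
  }
  return undefined
}

function isScriptAllowed(policy: string, scriptURL: string, pageOrigin: string): boolean {
  // script-src-elem falls back to script-src, which falls back to default-src.
  const sources =
    parseDirective(policy, 'script-src-elem') ??
    parseDirective(policy, 'script-src') ??
    parseDirective(policy, 'default-src')
  if (sources === undefined) return true // no applicable directive restricts scripts

  const url = new URL(scriptURL)
  return sources.some((source) => {
    if (source === "'self'") return url.origin === pageOrigin
    // Keywords such as 'unsafe-inline' never allow an external script URL.
    if (source.startsWith("'")) return false
    // This sketch only understands absolute http(s) sources.
    if (!/^https?:\/\//.test(source)) return false
    const allowed = new URL(source)
    return url.origin === allowed.origin && url.pathname.startsWith(allowed.pathname)
  })
}

// The policy quoted in the console message above.
const hnPolicy =
  "script-src 'self' 'unsafe-inline' https://www.google.com/recaptcha/ " +
  'https://www.gstatic.com/recaptcha/ https://cdnjs.cloudflare.com/'

console.log(
  isScriptAllowed(hnPolicy, 'https://cdn.skypack.dev/@mozilla/readability', 'https://news.ycombinator.com')
)
```

cdn.skypack.dev is not in the list, so the check fails for the Readability script, while anything served from cdnjs.cloudflare.com would pass.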
So creating the page with { bypassCSP: true } resolves the issue (see https://playwright.dev/docs/api/class-browser#browser-new-page-option-bypass-csp), e.g.:
// Open new page (with CSP enforcement bypassed)
const page = await browser.newPage({ bypassCSP: true })
await page.goto('https://news.ycombinator.com')

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
})

// Run the scraper
const { data } = await scraper.run(page, schema, {
  format: 'text',
})

// Show the result from LLM
console.log(data.top)
Example output I currently get:
[
  {
    title: "Crowdstrike Update: Windows Bluescreen and Boot Loops",
    points: 2126,
    by: "BLKNSLVR",
    commentsURL: "https://reddit.com",
  }, {
    title: "FCC votes unanimously to dramatically limit prison telecom charges",
    points: 293,
    by: "Avshalom",
    commentsURL: "https://worthrises.org",
  }, {
    title: "Foliate: Read e-books in style, navigate with ease",
    points: 330,
    by: "ingve",
    commentsURL: "https://johnfactotum.github.io",
  }, {
    title: "Want to spot a deepfake? Look for the stars in their eyes",
    points: 65,
    by: "jonbaer",
    commentsURL: "https://ras.ac.uk",
  }, {
    title: "Startups building balloons to hoist tourists 100k feet into the stratosphere",
    points: 15,
    by: "amichail",
    commentsURL: "https://cnbc.com",
  }
]
So I suppose it'd be sufficient to document this behavior as a known constraint on some sites, when using text mode with Readable.js?
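If it ends up documented as a constraint, calling code could encode it with a small wrapper along these lines. This is only a sketch: `newPageForFormat` and the `BrowserLike` interface are hypothetical and not part of llm-scraper or Playwright. It assumes nothing about the browser beyond a `newPage` method that accepts a `bypassCSP` option, as in the Playwright docs linked above.

```typescript
// Hypothetical helper: open a page with CSP bypassed only when text mode is requested.
// BrowserLike is a minimal stand-in for the part of a Playwright Browser used here.
interface PageOptions {
  bypassCSP?: boolean
}

interface BrowserLike<P> {
  newPage(options?: PageOptions): Promise<P>
}

function newPageForFormat<P>(browser: BrowserLike<P>, format: string): Promise<P> {
  // Text mode injects a script from an external CDN, which a strict CSP can block.
  const options: PageOptions = format === 'text' ? { bypassCSP: true } : {}
  return browser.newPage(options)
}
```

Usage would then be `const page = await newPageForFormat(browser, 'text')`, so other formats keep the site's CSP enforced.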
We should get rid of Readable.js in favour of html2text