[Proposal]: use case for SG AI

Open DiTo97 opened this issue 1 year ago • 1 comment

Problem statement

A lot of manual work and tuning goes into every single publisher that is currently maintained, and each one still requires constant monitoring in case anything changes in the supported news outlets or web sources.

Solution

Replace the manual, labour-intensive scraping code with SG AI, whose you-only-scrape-once (YOSO) concept serves exactly this purpose: you write the scraping pipeline once and leverage powerful LLMs (open-source or closed-source) to extract articles in the desired format, regardless of the web source and even when its HTML changes over time.

Write a single smart scraper graph tailored to crawling news and articles, producing the desired relational format common to all available publishers and outlets.
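To make the "write once, one common format" idea concrete, here is a minimal sketch. It does not use the SG AI API: `Article`, `PROMPT`, `scrape_article` and `stub_extract` are all hypothetical names, and `extract` stands in for whatever LLM-backed call would actually be plugged in. The field set is only an assumption about what the relational format could look like.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Article:
    """Hypothetical common record, identical for every publisher."""
    url: str
    title: str
    body: str
    authors: Tuple[str, ...] = ()
    published: Optional[str] = None  # ISO 8601 string, if one was found


# A single prompt shared by all sources: the "scrape once" part.
PROMPT = (
    "Extract the news article on this page as JSON with the keys "
    "title, body, authors (list of strings), published (ISO 8601 or null)."
)

# Hypothetical extractor signature: (prompt, html) -> JSON text.
Extractor = Callable[[str, str], str]


def scrape_article(url: str, html: str, extract: Extractor) -> Article:
    """Run the shared prompt on one page and validate the result."""
    raw = json.loads(extract(PROMPT, html))
    if not isinstance(raw, dict):
        raise ValueError(f"expected a JSON object for {url}")
    title = str(raw.get("title") or "").strip()
    body = str(raw.get("body") or "").strip()
    if not title or not body:
        raise ValueError(f"incomplete extraction for {url}")
    # Drop empty author entries and normalise whitespace.
    authors = tuple(
        str(a).strip() for a in (raw.get("authors") or []) if str(a).strip()
    )
    return Article(url, title, body, authors, raw.get("published") or None)


def stub_extract(prompt: str, html: str) -> str:
    """Stand-in for an LLM call, returning a canned answer."""
    return json.dumps(
        {"title": " Hello ", "body": "World", "authors": ["A. Writer", ""]}
    )
```

Calling `scrape_article("https://example.com/a", "<html></html>", stub_extract)` returns an `Article` with the whitespace trimmed and the empty author dropped. Since no publisher-specific code is involved, supporting a new outlet would only mean feeding it a new URL. Whether the extraction quality holds up is a separate question.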

Draft

Open Questions

No response

DiTo97 · May 18 '24 20:05

Hey @DiTo97 thanks for the proposal :)

Fundus uses hand-written heuristics to optimize for accuracy and recall. Our library aims to yield artifact-free extractions for every supported publisher. I will give SG AI a shot and see how it scores on our benchmark.

MaxDall · May 20 '24 19:05