Possible to try to extract the main article from a page?

Open • vzeazy opened this issue 2 years ago • 1 comment

Given the examples, I struggle to understand how this could be used to extract the main article from a page. In this case, the sample would be the article content itself. It would be great if it could use several samples from other websites and then develop a generalized pattern for additional pages. I'm guessing this may be out of scope for this project.

vzeazy • Mar 26 '23 23:03

The following worked for me:

from autoscraper import AutoScraper

# Sample values copied verbatim from the training page (typos included);
# they have to match text that actually appears in the HTML.
wanted_dict = {
    "title": ["Possible to to try to extract main article from a page?"],
    "meta": ["vzeazy"],
    "content": ['Given the examples, struggle to understand how this may be utilized to extract the main article from a page. In this case, the sample would be the article content itself. Would be great if it could use several samples from other websites and then develop a generalized pattern for additional pages. Guessing this my be out of scope for this project.']
}

scraper = AutoScraper()

# Learn the extraction rules from a saved copy of the training page.
with open('sample/train.html', 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()
result = scraper.build(html=source_code, wanted_dict=wanted_dict)
scraper.save('github')  # persist the learned rules to a file

# Apply the learned rules to the test page.
with open('sample/test.html', 'r', encoding='utf-8') as html_file:
    source_code = html_file.read()
result = scraper.get_result_exact(html=source_code)
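
As for the "several samples from other websites" part of the question, here is a minimal sketch of one way it might be approached: calling build repeatedly with update=True, so that rules learned from each sample page are added to the existing ones instead of replacing them. The helper name train_on_samples, the demo function, the file names, and the sample strings are all hypothetical. Whether the accumulated rules carry over to sites that were never seen is not guaranteed, since the rules are tied to page structure.

```python
from typing import Dict, Iterable, List, Tuple

# One training sample: (html source, wanted_dict for that page).
Sample = Tuple[str, Dict[str, List[str]]]


def train_on_samples(scraper, samples: Iterable[Sample]) -> int:
    """Feed several (html, wanted_dict) pairs to a single scraper.

    `scraper` is assumed to be an autoscraper.AutoScraper instance; the only
    method used here is build(html=..., wanted_dict=..., update=True).
    Returns the number of samples processed.
    """
    count = 0
    for html, wanted_dict in samples:
        # update=True keeps previously learned rules and adds the new ones.
        scraper.build(html=html, wanted_dict=wanted_dict, update=True)
        count += 1
    return count


def demo() -> None:
    """Usage sketch; paths and sample strings below are hypothetical."""
    from autoscraper import AutoScraper  # third-party package

    pages = [
        ("sample/site_a.html", {"content": ["Text copied from site A's article"]}),
        ("sample/site_b.html", {"content": ["Text copied from site B's article"]}),
    ]
    samples = []
    for path, wanted in pages:
        with open(path, "r", encoding="utf-8") as html_file:
            samples.append((html_file.read(), wanted))

    scraper = AutoScraper()
    train_on_samples(scraper, samples)
    scraper.save("articles")  # hypothetical file name for the combined rules
```

Calling demo() with real pages and real sample text would produce one rule file covering every layout that was trained on. Pages from a layout that was not among the samples would most likely need their own sample first.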

entrptaher • May 01 '23 06:05