learnhtml

Web content extraction using machine learning

LearnHtml

HTML web content extraction library that relies mostly on DOM features, along with some textual features. It achieves a tag-level F1 score of 0.96 on the Dragnet dataset.
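To make the terms in the description concrete, here is a minimal sketch of what per-tag DOM and textual features and a tag-level F1 score could look like. This is an illustration only and not the library's actual API: the feature set (`depth`, `n_children`, `text_len`) and the names `TagFeatureParser`, `extract_tag_features`, and `tag_level_f1` are hypothetical, and the sketch uses Python's standard `html.parser` rather than whatever parser the project itself relies on. It also assumes that "tag-level" means each tag is scored as one binary content/non-content decision.

```python
from html.parser import HTMLParser

# Void elements never receive a closing tag, so they are not kept on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagFeatureParser(HTMLParser):
    """Collects a few illustrative per-tag features (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []    # indices into self.records of the currently open tags
        self.records = []  # one feature dict per tag, in document order

    def handle_starttag(self, tag, attrs):
        if self.stack:
            # The innermost open tag gains one direct child.
            self.records[self.stack[-1]]["n_children"] += 1
        self.records.append(
            {"tag": tag, "depth": len(self.stack), "n_children": 0, "text_len": 0}
        )
        if tag not in VOID_TAGS:
            self.stack.append(len(self.records) - 1)

    def handle_endtag(self, tag):
        # Close the nearest matching open tag; ignore unmatched closing tags.
        for pos in range(len(self.stack) - 1, -1, -1):
            if self.records[self.stack[pos]]["tag"] == tag:
                del self.stack[pos:]
                break

    def handle_data(self, data):
        # Textual feature: stripped text length, credited to the innermost open tag.
        if self.stack:
            self.records[self.stack[-1]]["text_len"] += len(data.strip())


def extract_tag_features(html):
    """Return a list of feature dicts, one per tag in the document."""
    parser = TagFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.records


def tag_level_f1(y_true, y_pred):
    """F1 over per-tag binary labels (1 = content tag, 0 = everything else)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a setup like this, each feature dict would become one row of the training data for a classifier, and the classifier's per-tag predictions would be compared against the labeled tags to obtain the score.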

Requirements

First, install the dependencies. For the binary dependencies (the command below uses apt-get):

sudo apt-get install recode libxml2-dev libxslt1-dev unzip

Python dependencies:

pip install -r requirements.txt

Build the project and install it locally in editable mode:

pip install -e .

Running the scripts

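The data preparation script takes the directory to download the data into and the number of workers to use:

./learnhtml/cli/prepare_data.sh <<WHERE_TO_DOWNLOAD_DATA>> <<NUMBER_OF_WORKERS>>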

About

Topics: deep-learning, html, content-extraction

Stars: 32
Forks: 9
Owner: nikitautiu
