
Web content extraction using machine learning

LearnHtml

HTML web content extraction library that relies mostly on DOM features, along with some textual features. It achieves a tag-level F1-score of 0.96 on the Dragnet dataset.
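To make the tag-level metric concrete, here is a minimal sketch of how such a score could be computed. It assumes that "tag-level" means every HTML tag is treated as one binary sample (1 = content, 0 = boilerplate); that reading, the `tag_level_f1` helper, and the example labels are illustrative assumptions and not part of this library's API.

```python
def tag_level_f1(y_true, y_pred):
    """F1 of the content class over per-tag binary labels (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # content tags found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # boilerplate kept by mistake
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # content tags missed
    if tp == 0:
        # No true positives: precision and recall are 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for six tags of a single page.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
score = tag_level_f1(y_true, y_pred)
print(f"tag-level F1: {score:.2f}")
```

Under this reading, one missed content tag and one boilerplate tag wrongly kept each lower the score, so the metric penalizes both over- and under-extraction.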

Requirements

First, install the dependencies. For the binary dependencies:

sudo apt-get install recode libxml2-dev libxslt1-dev unzip

Python dependencies:

pip install -r requirements.txt

Build the project and install it locally:

pip install -e .

Running the scripts

./learnhtml/cli/prepare_data.sh <<WHERE_TO_DOWNLOAD_DATA>> <<NUMBER_OF_WORKERS>>

Copyright (C) 2018 Nichita Uțiu