learnhtml
Web content extraction using machine learning
LearnHtml
An HTML web content extraction library that relies mostly on DOM features, along with some textual features. It achieves a tag-level F1 score of 0.96 on the Dragnet dataset.
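To make the metric concrete, here is a minimal sketch of how an F1 score over per-tag labels can be computed. It assumes "tag-level" means each HTML tag receives a binary label (1 for content, 0 for boilerplate); the function name `f1_score` and the sample labels are hypothetical and are not part of the learnhtml API.

```python
def f1_score(true_labels, predicted_labels):
    """Binary F1 over per-tag labels (1 = content, 0 = boilerplate).

    Illustrative helper only; not a function provided by learnhtml.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # content tags found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # boilerplate marked as content
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # content tags missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for the six tags of a single page
true_labels = [1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 1, 0]
score = f1_score(true_labels, predicted_labels)
print(round(score, 3))
```

In this toy case precision and recall are both 2/3, so the F1 score is also 2/3.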
Requirements
First, install the dependencies. For the binary dependencies:
sudo apt-get install recode libxml2-dev libxslt1-dev unzip
Python dependencies:
pip install -r requirements.txt
Build the project and install it locally:
pip install -e .
Running the scripts
./learnhtml/cli/prepare_data.sh <<WHERE_TO_DOWNLOAD_DATA>> <<NUMBER_OF_WORKERS>>
Copyright (C) 2018 Nichita Uțiu