Similarius
Similarius is a Python library for comparing web pages and evaluating how similar they are.
It can be used as a stand-alone tool or to feed other systems.
Requirements
- Python 3.8+
- Requests
- Scikit-learn
- Beautifulsoup4
- nltk
Installation
Source install
Similarius can be installed with Poetry. If Poetry is not installed yet, you can install it with:
$ curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
Then install Similarius and run it from the Poetry shell:
$ poetry install
$ poetry shell
$ similarius -h
pip installation
$ pip3 install similarius
Usage
$ similarius --help
usage: similarius.py [-h] [-o ORIGINAL] [-w WEBSITE [WEBSITE ...]]
optional arguments:
  -h, --help            show this help message and exit
  -o ORIGINAL, --original ORIGINAL
                        Website to compare
  -w WEBSITE [WEBSITE ...], --website WEBSITE [WEBSITE ...]
                        Website to compare
Usage example
$ similarius -o circl.lu -w europa.eu circl.eu circl.lu
Used as a library
import argparse
import sys

from similarius import get_website, extract_text_ressource, sk_similarity, ressource_difference, ratio

parser = argparse.ArgumentParser()
parser.add_argument("-w", "--website", nargs="+", required=True, help="Websites to compare against the original")
parser.add_argument("-o", "--original", required=True, help="Original website used as the reference")
args = parser.parse_args()

# Fetch the original page; a falsy result is treated as unreachable
original = get_website(args.original)
if not original:
    print("[-] The original website is unreachable...")
    sys.exit(1)

# Extract the text and the resources of the original page
original_text, original_ressource = extract_text_ressource(original.text)

for website in args.website:
    print(f"\n********** {args.original} <-> {website} **********")

    # Fetch the page to compare and skip it if it is unreachable
    compare = get_website(website)
    if not compare:
        print(f"[-] {website} is unreachable...")
        continue
    compare_text, compare_ressource = extract_text_ressource(compare.text)

    # Compute the text similarity, the resource difference, and the ratio derived from both
    sim = str(sk_similarity(compare_text, original_text))
    print(f"\nSimilarity: {sim}")
    ressource_diff = ressource_difference(original_ressource, compare_ressource)
    print(f"Ressource Difference: {ressource_diff}")
    ratio_compare = ratio(ressource_diff, sim)
    print(f"Ratio: {ratio_compare}")
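For readers who want a feel for what a text similarity score of this kind measures, here is a minimal sketch using only the standard library. It is an illustration, not Similarius's implementation: the name sk_similarity suggests scikit-learn is involved, but the exact method is not documented here. The sketch assumes plain word counts compared with cosine similarity, and the helper names tokenize and cosine_similarity are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Return the cosine similarity (0.0 to 1.0) of two texts' word-count vectors.

    Hypothetical helper for illustration; it is not part of Similarius.
    """
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Dot product over the words that appear in both texts.
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # A text without any tokens cannot be compared, so report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    page_a = "Welcome to the example login page"
    page_b = "Welcome to the example login portal"
    print(f"Similarity: {cosine_similarity(page_a, page_b):.2f}")
```

Two pages with identical wording score close to 1.0, and pages with no words in common score 0.0. A production comparison would typically also weight rare words more heavily (for example with TF-IDF) and strip HTML markup before comparing.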
Acknowledgment

The project has been co-funded by CEF-TC-2020-2 - 2020-EU-IA-0260 - JTAN - Joint Threat Analysis Network.