Telegram Data Clustering contest solution by Mindful Squirrel

TGNews

Links

  • Description in English: https://medium.com/@phoenixilya/news-aggregator-in-2-weeks-5b38783b95e3
  • Description in Russian: https://habr.com/ru/post/487324/

Demo

Install

Prerequisites: CMake, Boost

$ sudo apt-get install cmake libboost-all-dev build-essential libjsoncpp-dev uuid-dev protobuf-compiler libprotobuf-dev

For macOS:

$ brew install boost jsoncpp ossp-uuid protobuf

If you received the code as a zip archive, skip ahead to building the binary.

To download code and models:

$ git clone https://github.com/IlyaGusev/tgcontest
$ cd tgcontest
$ git submodule update --init --recursive
$ bash download_models.sh
$ wget https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.5.0%2Bcpu.zip
$ unzip libtorch-cxx11-abi-shared-with-deps-1.5.0+cpu.zip

For macOS, use https://download.pytorch.org/libtorch/cpu/libtorch-macos-1.5.0.zip instead.

To build the binary (from the "tgcontest" directory):

$ mkdir build && cd build && Torch_DIR="../libtorch" cmake -DCMAKE_BUILD_TYPE=Release .. && make -j4

To download datasets:

$ bash download_data.sh

Run on a sample:

$ ./build/tgnews top data --ndocs 10000

Training

  • Russian FastText vectors training: VectorsRu.ipynb (Colab notebook)
  • Russian FastText category classifier training: CatTrainRu.ipynb (Colab notebook)
  • Russian text embedder with triplet loss training (v3): Colab notebook
  • English FastText vectors training: VectorsEn.ipynb (Colab notebook)
  • English FastText category classifier training: CatTrainEn.ipynb (Colab notebook)
  • English text embedder with triplet loss training (v3): Colab notebook
  • PageRank rating calculation: PageRankRating.ipynb (Colab notebook)
  • Russian ELMo-based sentence embedder training (not used): Colab notebook
  • XLM-RoBERTa pseudo-labeling for categorization: Colab notebook
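
For readers unfamiliar with the triplet loss named in the text embedder entries above, here is a minimal sketch of its textbook form in plain Python. It is not taken from the notebooks: the Euclidean distance, the margin value of 1.0, the function names, and the toy vectors are all assumptions made for illustration, and the actual training code may use a different distance, margin, or triplet selection strategy.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Textbook triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin).

    The margin of 1.0 is a hypothetical default, not a value from the notebooks.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-d "embeddings" (hypothetical data, for illustration only).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

# The positive is already closer than the negative by more than the margin,
# so the hinge clips the loss to zero.
easy = triplet_loss(anchor, positive, negative)

# Swapping the roles puts the "positive" far away, so the loss is positive.
hard = triplet_loss(anchor, negative, positive)

print(easy, hard)
```

The loss is zero once the anchor sits closer to the positive than to the negative by at least the margin, so only triplets that violate that ordering contribute to training.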

Models

Data

Markup

Misc

  • Flamegraph: https://ilyagusev.github.io/tgcontest/flamegraph.svg

Other contestants

  • Round 2
    • II place
      • Daring Frog: https://github.com/a-l-e-x-k/data_clustering_contest, article: https://medium.com/@alexkuznetsov/2nd-place-solution-for-telegram-data-clustering-contest-f28d55b98d30
      • Swift Skunk: https://github.com/sorrge/tg_news_cluster
    • III place
      • Mindful Kitten: https://danlark.org/2020/07/31/news-aggregator-from-scratch-in-2-weeks/
    • IV place
      • Bossy Gnu: https://github.com/maxoodf/tgnews
    • Other
      • Large Crab: https://github.com/ilya-ustinov/tgcontest
  • Round 1
    • III place
      • Kooky Dragon: https://github.com/nick-baliesnyi/tgnews
    • IV place
      • Sharp Sloth: https://github.com/thehemen/telegram-data-clustering
    • Other
      • Desert Python: https://github.com/crazyleg/telegram_data_clustering_2019
      • Funky Peacock: https://github.com/Stepka/telegram_clustering_contest
      • Unknown animal: https://github.com/roman-rybalko/telegram-data-clustering-contest
      • Unknown animal: https://github.com/MarcoBuster/data-clustering-contest
      • Unknown animal: https://github.com/sudevschiz/tgnews
      • Unknown animal: https://github.com/77ph/tgnews
      • Unknown animal: https://github.com/akash-joshi/telegram-cluster
      • Unknown animal: https://github.com/dremovd/telegram-clustering

Contacts