nyan
How to add a new category?
What is the scenario for adding a new category?
From what I understood, we need a new dataset in .jsonl format with text and labels. Could you share the datasets this was trained on, especially for not_news? From reading about the Telegram contest, I see that for Russian content they mostly used the lenta.ru archive. But what about Ukrainian?
Here you go: https://github.com/NyanNyanovich/nyan/releases/download/can_annot/cat_markup.tar.gz
I used Lenta and GPT-4 annotations. Here is the script to query GPT-4: https://github.com/NyanNyanovich/nyan/blob/master/scripts/annotate_categories.py
And the training script: https://github.com/NyanNyanovich/nyan/blob/master/scripts/train_clf.py
@NyanNyanovich Thanks, I had already found train_clf.py and tried to train it with a single category, but then the classifier failed in send.sh, probably because "not_news" was missing.
I took a dataset from a Ukrainian news website that tags its articles, kept only the entries related to corruption (about 700), and merged them with categories_train.jsonl.
After training I started getting much worse results: many war/politics articles now trigger corruption and end up as "unknown". I found that the median text length in the added dataset is 1000+ characters, while in yours it is about 450.
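For anyone who wants to reproduce the length comparison, here is a minimal sketch of how the median text length per label could be computed. It assumes each .jsonl record has a "text" string and a "labels" list; those field names, and the helper names load_jsonl and median_length_by_label, are my assumptions and not necessarily what train_clf.py expects.

```python
import json
import statistics
from collections import defaultdict


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def median_length_by_label(records, text_key="text", labels_key="labels"):
    """Return {label: median character length of texts carrying that label}.

    The key names are assumptions about the dataset layout.
    """
    lengths = defaultdict(list)
    for record in records:
        for label in record[labels_key]:
            lengths[label].append(len(record[text_key]))
    return {label: statistics.median(values) for label, values in lengths.items()}


# Tiny in-memory sample standing in for the merged dataset.
sample = [
    {"text": "a" * 10, "labels": ["war"]},
    {"text": "a" * 20, "labels": ["war", "corruption"]},
    {"text": "a" * 1000, "labels": ["corruption"]},
]
medians = median_length_by_label(sample)
```

On a real file this would be called as median_length_by_label(load_jsonl("categories_train.jsonl")). A large gap between the new category and the existing ones suggests the classifier may be picking up on length rather than topic.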
So I have a few questions about preparing a dataset for the new category:
- Does a smaller article size improve accuracy?
- Will multiple labels for the new category (like ["corruption", "war"] or ["corruption", "politics"]) increase accuracy?
- What was your strategy (or was it random?) in news selection for your training dataset:
  Labels sorted by count:
  - politics: 1200 occurrences
  - war: 1062 occurrences
  - economy: 760 occurrences
  - incident: 699 occurrences
  - not_news: 451 occurrences
  - entertainment: 426 occurrences
  - tech: 418 occurrences
  - sports: 324 occurrences
  - science: 138 occurrences
  - other: 37 occurrences
- What other hints would you suggest?
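The label counts listed above can be recomputed with a short sketch like the one below, which is handy for checking how a newly added category compares in size to the existing ones. It assumes records are already loaded as dicts with a "labels" list; the field name and the count_labels helper are hypothetical, not part of the repository.

```python
from collections import Counter


def count_labels(records, labels_key="labels"):
    """Return (label, count) pairs sorted by descending count.

    A record with several labels contributes to each of them.
    """
    counts = Counter()
    for record in records:
        counts.update(record[labels_key])
    return counts.most_common()


# Tiny in-memory sample; a real run would load the .jsonl file instead.
sample = [
    {"text": "...", "labels": ["politics"]},
    {"text": "...", "labels": ["politics", "corruption"]},
    {"text": "...", "labels": ["war"]},
    {"text": "...", "labels": ["politics", "war"]},
]
label_counts = count_labels(sample)
```

Note that multi-label records are counted once per label, so the totals can exceed the number of records.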