nyan: How to add a new category?

What is the process for adding a new category?

Jan 11 '24 23:01 b0tm1nd

From what I understand, we need a new dataset in .jsonl format with text and labels. Could you share the datasets this was trained on, especially for not_news? From reading about the Telegram contest, I see that for Russian content they mostly used the lenta.ru archive. But what about Ukrainian?

Jan 12 '24 23:01 b0tm1nd

Here you go: https://github.com/NyanNyanovich/nyan/releases/download/can_annot/cat_markup.tar.gz

I used Lenta and GPT-4 annotations. Here is the script to query GPT-4: https://github.com/NyanNyanovich/nyan/blob/master/scripts/annotate_categories.py

And the training script: https://github.com/NyanNyanovich/nyan/blob/master/scripts/train_clf.py

Jan 13 '24 23:01 NyanNyanovich

@NyanNyanovich Thanks. I had already found train_clf.py and tried to train it with a single category, but then the classifier failed when running send.sh, probably because "not_news" was missing.

I took a dataset from a Ukrainian news website that tags its articles, kept only the ones related to corruption, and got about 700 entries, which I merged into categories_train.jsonl.
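The merge step described above could look roughly like this. It is a minimal sketch, assuming each training record is a JSON object with a `text` string and a `labels` list (the thread only says "text and labels", so the exact field names are an assumption). `make_records`, `merge_records`, and `write_jsonl` are hypothetical helpers, not part of nyan.

```python
import json


def make_records(texts, label):
    """Wrap raw article texts as records with a single label, skipping empty texts."""
    return [{"text": t.strip(), "labels": [label]} for t in texts if t.strip()]


def merge_records(base, extra):
    """Append extra records to base, dropping texts that are already present."""
    seen = {record["text"] for record in base}
    merged = list(base)
    for record in extra:
        if record["text"] not in seen:
            seen.add(record["text"])
            merged.append(record)
    return merged


def write_jsonl(records, path):
    """Write one JSON object per line, keeping Cyrillic text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Toy data standing in for the real training set and the scraped articles.
base = [{"text": "x", "labels": ["war"]}]
extra = make_records(["x", " y ", ""], "corruption")
merged = merge_records(base, extra)
```

Deduplicating by text keeps an article that is already in the base set from appearing twice with conflicting labels.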

After training, I started getting much worse results: many war/politics items now trigger corruption and end up as "unknown". I found that in the added dataset the median text size is 1000+ characters, whereas in yours it is about 450.
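A length mismatch like this can be checked per label before training. The following is a minimal sketch, assuming records with a `text` string and a `labels` field that is either a list or a single string (field names are an assumption; `load_jsonl` and `median_length_by_label` are hypothetical helpers, not part of nyan).

```python
import json
from collections import defaultdict
from statistics import median


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def median_length_by_label(records, text_key="text", labels_key="labels"):
    """Return the median text length in characters for each label."""
    lengths = defaultdict(list)
    for record in records:
        labels = record[labels_key]
        if isinstance(labels, str):  # tolerate a single label stored as a string
            labels = [labels]
        for label in labels:
            lengths[label].append(len(record[text_key]))
    return {label: median(values) for label, values in lengths.items()}


# Toy records; with real data, pass load_jsonl("categories_train.jsonl") instead.
sample = [
    {"text": "a" * 400, "labels": ["war"]},
    {"text": "b" * 500, "labels": ["war", "politics"]},
    {"text": "c" * 1200, "labels": ["corruption"]},
]
result = median_length_by_label(sample)
```

If one label's median is far from the others, the classifier may be picking up on length or style rather than topic, which would be consistent with the regression described above.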

So I have a few questions about building a dataset for the new category:

Does a smaller article size improve accuracy?
Will multiple labels for the new category (like ["corruption", "war"] or ["corruption", "politics"]) increase accuracy?
What was your strategy (or was it random?) for selecting news for your training dataset? Here is the label distribution I see:

Labels sorted by count:
politics: 1200 occurrences
war: 1062 occurrences
economy: 760 occurrences
incident: 699 occurrences
not_news: 451 occurrences
entertainment: 426 occurrences
tech: 418 occurrences
sports: 324 occurrences
science: 138 occurrences
other: 37 occurrences
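A tally like the one above can be produced with a few lines of Python. This is a sketch only, assuming each record has a `labels` field holding a list or a single string (an assumption about the file layout); `count_labels` and `format_counts` are hypothetical helper names.

```python
from collections import Counter


def count_labels(records, labels_key="labels"):
    """Count how many records carry each label."""
    counts = Counter()
    for record in records:
        labels = record[labels_key]
        if isinstance(labels, str):  # tolerate a single label stored as a string
            labels = [labels]
        counts.update(labels)
    return counts


def format_counts(counts):
    """Render counts in descending order, one 'label: N occurrences' line each."""
    return [f"{label}: {n} occurrences" for label, n in counts.most_common()]


# Toy records standing in for the parsed training file.
sample = [
    {"labels": ["politics"]},
    {"labels": ["politics", "war"]},
    {"labels": ["politics"]},
    {"labels": "war"},
    {"labels": ["economy"]},
]
lines = format_counts(count_labels(sample))
```

Running the same count after adding a new category shows at a glance whether it is over- or under-represented relative to the existing labels.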

What other tips would you suggest?

Jan 15 '24 01:01 b0tm1nd
