doccano-transformer
fastText format for text classification
Example:
__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What’s the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
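In the example above, every line starts with the labels, each prefixed with `__label__`, followed by the text. A minimal sketch of producing such a line is shown here. `to_fasttext_line` is a hypothetical helper, not part of doccano-transformer, and replacing spaces inside a label with hyphens is an assumption based on labels such as `food-safety` in the example.

```python
def to_fasttext_line(text, labels, prefix="__label__"):
    """Build one fastText training line: prefixed labels first, then the text."""
    # fastText reads one example per line, so collapse whitespace runs
    # (including newlines) in the text into single spaces.
    clean_text = " ".join(text.split())
    # Labels are whitespace-delimited tokens, so spaces inside a label
    # are replaced with hyphens (assumption, mirroring the example above).
    tokens = [prefix + "-".join(label.split()) for label in labels]
    return " ".join(tokens + [clean_text])


print(to_fasttext_line("Fan bake vs bake", ["baking", "oven", "convection"]))
```

Called with an empty label list, the helper returns only the cleaned text.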
Hi! I've started working on this one. I am also going to use the label metadata to get the label names. Would that be all right?
I am also going to use the label metadata to get the label names. Would that be all right?
I agree with you. Where and how would the label metadata be passed in?
A couple of ideas; here's what comes to mind:
Personally, as a user, I would prefer to call a class method on each Dataset directly, so instead of using
dataset = read_jsonl(filepath='example.jsonl', dataset=NERDataset, encoding='utf-8')
I would suggest calling it directly:
dataset = NERDataset.from_jsonl(filepath='example.jsonl', encoding='utf-8')
and when it comes to TextClassificationDataset (working name), I would just add another optional argument (via **kwargs) ...
dataset = TextClassificationDataset.from_jsonl(annotations_filepath='example.jsonl', labels_filepath='project_1_labels.jsonl', encoding='utf-8')
...optional because, without the label metadata file path, annotations could still be converted by appending the label id (with an informational warning), like this: __label__1, although I am not sure this is a valid fastText label (I have to check that).
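The fallback described here could look roughly like the following sketch. It assumes each annotated example is a dict with a `text` field and an `annotations` list whose items carry a numeric `label` id, and that the label metadata has already been loaded into an id-to-name dict; these shapes and the `to_fasttext` function are assumptions for illustration, not the library's confirmed API. As far as I know, fastText treats any token that starts with the label prefix as a label, so `__label__1` should be accepted, but that is worth verifying.

```python
import warnings


def to_fasttext(examples, label_names=None, prefix="__label__"):
    """Yield fastText lines, falling back to label ids when no name is known."""
    label_names = label_names or {}
    for example in examples:
        tokens = []
        for annotation in example["annotations"]:
            label_id = annotation["label"]
            name = label_names.get(label_id)
            if name is None:
                # No metadata for this id: warn and use the id itself.
                warnings.warn(f"No name for label id {label_id}; using the id instead.")
                name = str(label_id)
            tokens.append(prefix + name)
        yield " ".join(tokens + [example["text"]])


examples = [{"text": "Fan bake vs bake", "annotations": [{"label": 1}, {"label": 2}]}]
print(list(to_fasttext(examples, {1: "baking", 2: "oven"})))
```

Omitting the second argument produces `__label__1 __label__2 Fan bake vs bake` and emits one warning per unnamed label.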
If you decide to stay with the current implementation, the labels path could either be passed as **kwargs to the read_jsonl function and forwarded to the Dataset constructor, or passed directly to the TextClassificationDataset.to_fasttext method (this requires reading the label metadata every time you want to perform a conversion, so I am not a fan of that solution).
Let me know what you think