fastText format for text classification

Open Hironsan opened this issue 4 years ago • 3 comments

Example:

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What’s the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
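
As a rough illustration of the line format above, here is a minimal sketch that builds one such line from a text and its label names. The function name `to_fasttext_line` and the choice to replace spaces inside label names with hyphens are assumptions for illustration, not part of doccano-transformer. `__label__` is fastText's default label prefix.

```python
def to_fasttext_line(text, labels, prefix="__label__"):
    """Build one fastText training line: prefixed labels first, then the text."""
    # fastText reads one example per line, so any whitespace run
    # (including newlines) in the text is collapsed to a single space.
    clean_text = " ".join(text.split())
    # Labels are whitespace-delimited tokens, so spaces inside a label
    # name are replaced with hyphens (an assumption of this sketch).
    tags = [prefix + "-".join(label.split()) for label in labels]
    return " ".join(tags + [clean_text])


print(to_fasttext_line("Fan bake vs bake", ["baking", "oven", "convection"]))
# prints: __label__baking __label__oven __label__convection Fan bake vs bake
```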

Hironsan, May 15 '20 09:05

Hi! I've started working on this one. I am also going to use the label metadata to get the label names. Would that be all right?

prokotg, May 16 '20 12:05

> I am going to also use label metadata in order to get label names. Would that be all right?

I agree with you. Where and how would the label metadata be passed in?

Hironsan, May 16 '20 13:05

I have a couple of ideas, but here's what comes to mind:

Personally, as a user, I would prefer to call a class method on each Dataset directly, so instead of using

dataset = read_jsonl(filepath='example.jsonl', dataset=NERDataset, encoding='utf-8')

I would suggest using this directly:

dataset = NERDataset.from_jsonl(filepath='example.jsonl', encoding='utf-8')

and when it comes to TextClassificationDataset (working name), I would just add another optional argument (via **kwargs) ...

dataset = TextClassificationDataset.from_jsonl(annotations_filepath='example.jsonl', labels_filepath='project_1_labels.jsonl', encoding='utf-8')

...optional because, without the label metadata filepath, annotations could still be converted by appending the label id (with a warning for information), like this: __label__1. I am not sure this is a valid fastText label, though (I have to check that).
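
The idea above could be sketched roughly as follows. This is only an illustration, not doccano-transformer's actual API: the class body, the `_read_jsonl` helper, and the record shapes (`{"text": ..., "annotations": [{"label": <id>}]}` for annotations and `{"id": ..., "text": ...}` for labels) are all assumptions.

```python
import json
import warnings


def _read_jsonl(filepath, encoding):
    # Hypothetical helper: parse one JSON object per non-empty line.
    with open(filepath, encoding=encoding) as f:
        return [json.loads(line) for line in f if line.strip()]


class TextClassificationDataset:
    """Working name from the discussion; the body is a sketch only."""

    def __init__(self, examples, id_to_name=None):
        self.examples = examples
        self.id_to_name = id_to_name or {}

    @classmethod
    def from_jsonl(cls, annotations_filepath, labels_filepath=None,
                   encoding="utf-8"):
        examples = _read_jsonl(annotations_filepath, encoding)
        id_to_name = None
        if labels_filepath is None:
            # Without label metadata, conversion falls back to label ids.
            warnings.warn("No label metadata given; using label ids.")
        else:
            # Assumed label record shape: {"id": 1, "text": "sauce"}
            rows = _read_jsonl(labels_filepath, encoding)
            id_to_name = {row["id"]: row["text"] for row in rows}
        return cls(examples, id_to_name)

    def to_fasttext(self):
        lines = []
        for example in self.examples:
            # Assumed annotation shape: {"annotations": [{"label": 1}]}
            ids = [a["label"] for a in example["annotations"]]
            # Use the label name when known, otherwise the raw id.
            tags = ["__label__{}".format(self.id_to_name.get(i, i))
                    for i in ids]
            text = " ".join(example["text"].split())
            lines.append(" ".join(tags + [text]))
        return lines
```

With this shape the label metadata is read once, when the dataset is constructed, and every later call to `to_fasttext` reuses the in-memory mapping.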

If you decide to stay with the current implementation, the labels path could either be passed as **kwargs to the read_jsonl function and forwarded to the Dataset constructor, or be passed directly to the TextClassificationDataset.to_fasttext method. The latter requires reading the label metadata every time you want to perform a conversion, so I am not a fan of that solution.

Let me know what you think.

prokotg, May 16 '20 16:05