bulk
Request: Add a 'label' column to the saved subset data
Along with the option to have more than just the text column saved, what if we could add a label column to the saved subset, to make it easier to manage the data and files? A little input box on the form...
This would complicate the app a bit, because we'd need to keep track of the state of all the different labels that have been attached so far.
The pattern that I've been using is to just save many .csv files when I'm bulk annotating to combine them later via the bulk util concat command.
I should admit that my use-case here is to use bulk as little as possible and to only use it as a tool in preparation for Prodigy. Bulk is a tool that gives a great demo, but the labels themselves should be treated as weak labels.
It's just a chore for me to have to manage it all by filename. It means more processing in pandas before Prodigy.
What about non-exclusive labels? A text could have more than one label attached, or at least, this is kind of my use-case in text most of the time.
What about saving the filename, without the suffix, into the saved file as well? That's something a config file might be able to handle.
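To make the filename idea concrete, here is a rough sketch of what it could look like as a post-processing step outside the app. It is not how bulk works today: the function names `add_label_from_filename` and `concat_with_labels`, and the `label` column name, are made up for illustration, and it assumes each saved subset is a plain .csv with a header row.

```python
import csv
from pathlib import Path


def add_label_from_filename(csv_path, label_column="label"):
    """Read one saved subset and tag every row with the file's stem.

    Hypothetical helper: "card.csv" yields rows with label "card".
    """
    path = Path(csv_path)
    label = path.stem  # filename without the suffix
    with path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row[label_column] = label
    return rows


def concat_with_labels(csv_paths, out_path, label_column="label"):
    """Concatenate several subsets into one file with a label column."""
    rows = []
    for p in csv_paths:
        rows.extend(add_label_from_filename(p, label_column))

    # Collect the union of column names in first-seen order, so files
    # with slightly different columns can still be written together.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Note that a text saved in two files shows up as two rows here, one per label, so this does not by itself solve the non-exclusive label question.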
I'm still a little up in the air about this feature because a single text/image, in my use-cases, typically allows for more than one label to be attached.
@arnicas I might want to pick up this ticket, but the more I think about it ... the more I like the "cli" approach. Mainly because it seems super normal that a piece of text could have more than one label.
Something like:
python -m bulk util combine card.jsonl website.jsonl another-label.jsonl --out-file merged.jsonl
What would be the main concern of doing it this way? It keeps the UI minimal, which is a huge plus for me.
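For discussion, here is a minimal sketch of what such a combine step might do under the hood. Everything in it is an assumption rather than existing bulk behaviour: the function names, the `text` and `labels` keys, and the idea that the label is the filename without its suffix. Each text appears once in the output and collects every label it was saved under, which covers the non-exclusive case.

```python
import json
from pathlib import Path


def combine_labelled_files(paths, text_key="text", labels_key="labels"):
    """Merge .jsonl subsets into one record per text with a list of labels.

    Hypothetical sketch: the label for each file is its stem,
    e.g. "card.jsonl" -> "card".
    """
    merged = {}  # text -> record; dicts keep insertion order
    for path in map(Path, paths):
        label = path.stem
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                text = json.loads(line)[text_key]
                entry = merged.setdefault(
                    text, {text_key: text, labels_key: []}
                )
                # A text saved twice under the same label is counted once.
                if label not in entry[labels_key]:
                    entry[labels_key].append(label)
    return list(merged.values())


def write_jsonl(records, out_path):
    """Write one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

One trade-off of this shape: only the text and its labels are kept, so any extra columns in the input files would need to be carried over explicitly.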