ndjson for training data persistence

Open tomdavidson opened this issue 2 years ago • 0 comments

Is your feature request related to a problem? Please describe. Marked records are stored as individual Parquet files. Parquet is an "immutable" binary format that is difficult to view and edit without special tools.

Exporting and importing labels takes a lot of extra steps: https://docs.zingg.ai/zingg/stepbystep/createtrainingdata/exportlabeleddata

I recently had to remove some records marked as unsure when I learned that they should have been matched. With about 250 marks, it was quite a pain to go through and find the "offending files".

Describe the solution you'd like The labels are small data and do not need a columnar binary format. A single plaintext file such as NDJSON, holding all the records, is self-describing, appendable, universal, and accessible. This probably applies to other files Zingg persists too.
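To illustrate the idea, here is a rough Python sketch of what appending and correcting labels in one NDJSON file could look like. It is not Zingg code: the file layout, the field names `pair_id` and `label`, and the helper names are all hypothetical and chosen only for this example.

```python
import json
from pathlib import Path


def append_label(path, record):
    """Append one labeled record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_labels(path):
    """Yield every non-empty line of the file as a dict."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def relabel(path, pair_ids, new_label):
    """Change the label of the given pairs and return how many changed.

    'pair_id' and 'label' are hypothetical field names, not Zingg's schema.
    """
    records = list(read_labels(path))
    changed = 0
    for rec in records:
        if rec["pair_id"] in pair_ids and rec["label"] != new_label:
            rec["label"] = new_label
            changed += 1
    # Write to a temporary file first, then swap it into place.
    tmp = Path(str(path) + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    tmp.replace(path)
    return changed
```

With something like this, fixing my "unsure" marks would have been one call such as `relabel("labels.ndjson", {12, 40}, "match")`, or simply a text editor and a search.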

Describe alternatives you've considered CSV is problematic because its minimal spec has neither types nor lists. Another alternative could be a database for Zingg training data, labels, stop words, synonyms, models, and a future API for CLIs and web apps, but I think the JSON file would deliver immediate value with a lot less effort.
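As a small illustration of the CSV concern, the sketch below round-trips one made-up record through Python's standard `csv` and `json` modules. The record and its field names are invented for the example and are not taken from Zingg.

```python
import csv
import io
import json

# A made-up record with an integer, a float, and a list.
record = {"id": 7, "score": 0.92, "aliases": ["Tom", "Thomas"]}

# CSV round trip: every value comes back as a plain string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buf.seek(0)
csv_back = next(csv.DictReader(buf))

# JSON round trip: numbers and lists keep their types.
json_back = json.loads(json.dumps(record))

print(csv_back)
print(json_back)
```

After the CSV round trip the list is only a string that needs its own parsing convention, while the JSON line comes back equal to the original record.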

tomdavidson, Apr 13 '22 04:04