Proposal: A generic structure for JSON formats in VW
Motivation
Currently there are two JSON-related formats: JSON and DSJSON. They are similar yet different. The split was likely motivated by the need to supply both top-level metadata and structured label information for examples.
DSJSON has always focused on contextual bandit formats, and as those have been extended (CCB and Slates), the format has not scaled well. I suspect part of the motivation was that joining an example with a label only required modifying the root level of the JSON structure, but with the addition of schematized binary formats the impact of this is lessened.
JSON and DSJSON are not well documented or tested and cannot be used for every label type. JSON is actually quite a good structure for expressing VW examples because namespaces can be expressed naturally, whereas in CSV they cannot.
Proposal
Deprecate the two existing formats and unify them into one simple format that fulfills the requirements of both but is easier to understand, write and parse.
This proposal primarily addresses several things:
- Consistent handling of multiline examples
- Consistent expression of labels
- Self describing data file
- Arbitrary metadata to decorate the datafile (one of the core reasons DSJSON was created)
Specifics
- Add flag --ujson which must be supplied with --json which interprets the input as the new format
- Deprecate --json supplied on its own
- Deprecate --dsjson
- Create tool to convert from DSJSON to new json
- Create a JSON schema for the new json format
Format Specification
The new json format is based on several ideas:
- There is no special cased structure like _multi and _slots in DSJSON
- All _ prefixed fields are special fields and have an associated meaning
- If metadata must be stored in the json, any field named like _meta_X will be available from the parser in a metadata map in the key X
  - This metadata map will be structured similarly to JSON. The map will have string keys and Value fields, where Value is a tagged union of one of Object, String, Array, Float. The metadata map itself is of type Object.
- There is a special field _examples which is an array of objects and must be supplied at the root of the JSON structure. Each object in the array may contain a _label_TYPE field. This is where the label based extensibility comes into play: each label type defines how to handle this _label field. This also supports the desired feature of data files being self describing, as TYPE must be one of the label types.
- If there is only one example in this line (single ex), then the _examples field can be omitted and the fields placed directly in the root of the object. If this is done then _examples cannot be supplied as well.
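The rules above can be sketched as a small normalization step. This is only an illustration under stated assumptions, not the proposed parser: the function names normalize and label_type are hypothetical, the sketch assumes an example carries at most one _label_TYPE field, and it also accepts a single _meta object (as used by the CB and CCB examples in this proposal) alongside _meta_X fields.

```python
import json

META_PREFIX = "_meta_"
LABEL_PREFIX = "_label_"


def normalize(line: str):
    """Return (metadata, examples) for one line of the proposed format."""
    root = json.loads(line)
    if not isinstance(root, dict):
        raise ValueError("the root of each line must be a JSON object")

    metadata, rest = {}, {}
    for key, value in root.items():
        if key.startswith(META_PREFIX):
            # _meta_X is exposed in the metadata map under the key X.
            metadata[key[len(META_PREFIX):]] = value
        elif key == "_meta" and isinstance(value, dict):
            # Assumption: a whole _meta object is merged into the map.
            metadata.update(value)
        else:
            rest[key] = value

    if "_examples" in rest:
        examples = rest.pop("_examples")
        if not isinstance(examples, list):
            raise ValueError("_examples must be an array of objects")
        if rest:
            # The flattened shorthand and _examples are mutually exclusive.
            raise ValueError(f"unexpected root fields next to _examples: {sorted(rest)}")
    else:
        # Flattened shorthand: the root object is the single example.
        examples = [rest]
    return metadata, examples


def label_type(example: dict):
    """Return TYPE from the example's _label_TYPE field, or None if unlabeled."""
    types = [k[len(LABEL_PREFIX):] for k in example if k.startswith(LABEL_PREFIX)]
    if len(types) > 1:
        raise ValueError(f"more than one label field: {types}")
    return types[0] if types else None
```

Treating the flattened form as a one-element _examples array keeps every later stage working on a single shape.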
Examples
Simple
VW Text
0 'First|HouseInfo price:.23 sqft:.25 age:.05 2006
Converted to new JSON format
{
"_examples": [
{
"_label_simple": 0,
"_tag": "First",
"HouseInfo": {
"price": 0.23,
"sqft": 0.25,
"age": 0.05,
"2006": 1.0
}
}
]
}
Flattened shorthand for single examples
{
"_label_simple": 0,
"_tag": "First",
"HouseInfo": {
"price": 0.23,
"sqft": 0.25,
"age": 0.05,
"2006": 1.0
}
}
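As a rough illustration of how the VW text line above maps onto the flattened shorthand, the following sketch converts a simple-label line with one named namespace. The function name vw_text_to_new_json is hypothetical, and the sketch deliberately ignores importance weights, multiple namespaces and the default namespace; features without an explicit value are given 1.0, as in the example.

```python
def vw_text_to_new_json(line: str) -> dict:
    """Convert 'label 'tag|Namespace f:v ...' into a flattened example."""
    head, bar, body = line.partition("|")
    if not bar:
        raise ValueError("expected a '|' separating the label from the features")

    example = {}
    for token in head.split():
        if token.startswith("'"):
            example["_tag"] = token[1:]
        elif "_label_simple" not in example:
            example["_label_simple"] = float(token)
        # Further numeric tokens (such as an importance weight) are ignored here.

    tokens = body.split()
    if not tokens:
        raise ValueError("expected a namespace name after '|'")
    namespace, features = tokens[0], {}
    for token in tokens[1:]:
        name, colon, value = token.partition(":")
        # A bare feature name has an implicit value of 1.0.
        features[name] = float(value) if colon else 1.0
    example[namespace] = features
    return example


converted = vw_text_to_new_json("0 'First|HouseInfo price:.23 sqft:.25 age:.05 2006")
```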
CB
Existing DSJSON format
{
"_label_cost": 0,
"_label_probability": 0.333333343,
"_label_Action": 2,
"_labelIndex": 1,
"o": [
{
"EventId": "f5547244f88543d1bcad0e3416fbd592",
"v": {
"reward": 0.0,
"value4": 0.0,
"value1": 0.0,
"value2": 1400.0,
"value3": 0.0
}
}
],
"Timestamp": "2018-11-22T02:31:39.1440000Z",
"Version": "1",
"EventId": "f5547244f88543d1bcad0e3416fbd592",
"a": [
2,
1,
3
],
"c": {
"shared_feature": 1.5,
"_multi": [
{
"name3": {
"name4": 4.65312243,
"name2": 1.0
}
},
{
"name3": {
"name": 4.65312243,
"name2": 1.0
}
},
{
"name3": {
"name5": 4.65312243
}
}
]
},
"p": [
0.333333343,
0.333333343,
0.333333343
],
"VWState": {
"m": "N/A"
}
}
Converted to new JSON format
{
"_meta": {
"Timestamp": "2018-11-22T02:31:39.1440000Z",
"Version": "1",
"EventId": "f5547244f88543d1bcad0e3416fbd592",
"VWState": {
"m": "N/A"
},
"o": [
{
"EventId": "f5547244f88543d1bcad0e3416fbd592",
"v": {
"reward": 0.0,
"value4": 0.0,
"value1": 0.0,
"value2": 1400.0,
"value3": 0.0
}
}
],
"a": [
2,
1,
3
],
"p": [
0.333333343,
0.333333343,
0.333333343
]
},
"_examples": [
{
"_label_cb": {
"type": "shared"
},
"shared_feature": 1.5
},
{
"_label_cb": {
"type": "action"
},
"name3": {
"name4": 4.65312243,
"name2": 1.0
}
},
{
"_label_cb": {
"type": "action",
"cost": 0,
"probability": 0.333333343
},
"name3": {
"name": 4.65312243,
"name2": 1.0
}
},
{
"_label_cb": {
"type": "action"
},
"name3": {
"name5": 4.65312243
}
}
]
}
- Is it possible or desired to uplevel the action and probability lists out of the metadata and into the label?
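The CB conversion shown above can be sketched as follows. This is an assumption-laden illustration rather than the planned tool: the function name dsjson_cb_to_new_json is hypothetical, _labelIndex is read as a zero-based index into _multi (which matches the example), and every root field other than c and the _label* fields is moved into _meta unchanged.

```python
def dsjson_cb_to_new_json(event: dict) -> dict:
    """Convert one parsed DSJSON CB event into the proposed structure."""
    context = event["c"]
    shared = {k: v for k, v in context.items() if k != "_multi"}
    # Everything that is neither context nor label becomes metadata.
    meta = {
        k: v for k, v in event.items()
        if k != "c" and not k.startswith("_label")
    }

    examples = [{"_label_cb": {"type": "shared"}, **shared}]
    for index, action in enumerate(context["_multi"]):
        label = {"type": "action"}
        if index == event.get("_labelIndex"):
            # Only the chosen action carries cost and probability.
            label["cost"] = event["_label_cost"]
            label["probability"] = event["_label_probability"]
        examples.append({"_label_cb": label, **action})
    return {"_meta": meta, "_examples": examples}
```

Events without a label simply produce action examples with only the type field set.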
CCB
Existing DSJSON format
{
"Version": "1",
"c": {
"TShared": {
"a=1": 1,
"b=0": 1,
"c=1": 1
},
"_multi": [
{
"TAction": {
"value=0.000000": 1
}
},
{
"TAction": {
"value=1.000000": 1
}
},
{
"TAction": {
"value=2.000000": 1
}
},
{
"TAction": {
"value=3.000000": 1
}
}
],
"_slots": [
{
"slot_feature": 1
},
{
"slot_feature": 1
}
]
},
"_outcomes": [
{
"_id": "ac32c0fc-f895-429d-9063-01c996432f791249622271",
"_label_cost": 0,
"_a": [
0,
1,
2,
3
],
"_p": [
0.25,
0.25,
0.25,
0.25
],
"_o": [
0
]
},
{
"_id": "b64a5e7d-6e76-4d66-98fe-dc214e675ff81249622271",
"_label_cost": 0,
"_a": [
1,
2,
3
],
"_p": [
0.333333,
0.333333,
0.333333
],
"_o": [
0
]
}
],
"VWState": {
"m": "N/A"
}
}
Converted to new JSON format
{
"_meta": {
"Version": "1",
"ids": [
"ac32c0fc-f895-429d-9063-01c996432f791249622271",
"b64a5e7d-6e76-4d66-98fe-dc214e675ff81249622271"
],
"o": [
0,
0
],
"VWState": {
"m": "N/A"
}
},
"_examples": [
{
"_label_ccb": {
"type": "shared"
},
"TShared": {
"a=1": 1,
"b=0": 1,
"c=1": 1
}
},
{
"_label_ccb": {
"type": "action"
},
"TAction": {
"value=0.000000": 1
}
},
{
"_label_ccb": {
"type": "action"
},
"TAction": {
"value=1.000000": 1
}
},
{
"_label_ccb": {
"type": "action"
},
"TAction": {
"value=2.000000": 1
}
},
{
"_label_ccb": {
"type": "action"
},
"TAction": {
"value=3.000000": 1
}
},
{
"_label_ccb": {
"type": "slot",
"_label_cost": 0,
"_a": [
0,
1,
2,
3
],
"_p": [
0.25,
0.25,
0.25,
0.25
]
},
"slot_feature": 1
},
{
"_label_ccb": {
"type": "slot",
"_label_cost": 0,
"_a": [
1,
2,
3
],
"_p": [
0.333333,
0.333333,
0.333333
]
},
"slot_feature": 1
}
]
}
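A comparable sketch for the CCB conversion above, again hypothetical in its naming (dsjson_ccb_to_new_json) and in its choices: each slot is paired with the outcome at the same position, the outcome's _label_cost, _a and _p are copied into the slot label when present, and the _id and _o values are collected into the ids and o metadata lists as in the example.

```python
def dsjson_ccb_to_new_json(event: dict) -> dict:
    """Convert one parsed DSJSON CCB event into the proposed structure."""
    context = event["c"]
    outcomes = event.get("_outcomes", [])
    shared = {k: v for k, v in context.items() if k not in ("_multi", "_slots")}

    meta = {k: v for k, v in event.items() if k not in ("c", "_outcomes")}
    meta["ids"] = [outcome["_id"] for outcome in outcomes]
    # Flatten the per-outcome _o lists into a single list.
    meta["o"] = [value for outcome in outcomes for value in outcome.get("_o", [])]

    examples = [{"_label_ccb": {"type": "shared"}, **shared}]
    for action in context["_multi"]:
        examples.append({"_label_ccb": {"type": "action"}, **action})
    for index, slot in enumerate(context["_slots"]):
        # Slots beyond the recorded outcomes stay unlabeled.
        outcome = outcomes[index] if index < len(outcomes) else {}
        label = {"type": "slot"}
        for key in ("_label_cost", "_a", "_p"):
            if key in outcome:
                label[key] = outcome[key]
        examples.append({"_label_ccb": label, **slot})
    return {"_meta": meta, "_examples": examples}
```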
Migration
Migration is an important aspect of this proposal since large amounts of data exist in DSJSON format. A tool will be created that can convert from DSJSON format to the new JSON format. Given well-formed DSJSON input, this conversion should be a trivial process.
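One possible shape for such a tool is a line-by-line driver that applies a per-event conversion function and reports lines that are not well formed. The sketch below is illustrative only: convert_stream and its parameters are hypothetical names, and the conversion function passed in the usage lines is a stand-in rather than a real DSJSON converter.

```python
import io
import json
import sys
from typing import Callable, TextIO


def convert_stream(src: TextIO, dst: TextIO, convert: Callable[[dict], dict]) -> int:
    """Convert one JSON object per line; return the number of skipped lines."""
    failures = 0
    for number, line in enumerate(src, start=1):
        if not line.strip():
            continue
        try:
            converted = convert(json.loads(line))
        except (ValueError, KeyError, TypeError) as error:
            # Malformed JSON or a missing field: report the line and move on.
            failures += 1
            print(f"line {number}: skipped ({error!r})", file=sys.stderr)
            continue
        dst.write(json.dumps(converted) + "\n")
    return failures


# Usage with in-memory streams; a real tool would open files instead.
source = io.StringIO('{"a": 1}\nnot json\n')
target = io.StringIO()
failed = convert_stream(source, target, lambda event: {"_examples": [event]})
```

Skipping and counting bad lines, rather than aborting, is a design choice that suits large existing datasets where a few malformed events should not block the rest.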
Open questions
- Some label formats seem to lean towards a single "label" for the entire multi-ex, but each specific example needs meta information for the type of label (shared, action or slot, for example). This depends a lot on how the "predict-info" work lands.
According to the description, shouldn't the _label fields for cb be "_label_cb", "_label_shared", "_label_cb_action" or "_label_action"? When is "_label" {"type": "X"} used vs "_label_TYPE"?
Yes, thanks for picking up on that. I've updated _label -> _label_cb and _label -> _label_ccb for the relevant examples.
When is "_label" {"type": "X"} used vs "_label_TYPE"
{"type": "X"} is just part of the data structure. _label_TYPE is the label enum: CB, CCB, Simple, etc.