Proposal: A generic structure for JSON formats in VW

Open jackgerrits opened this issue 5 years ago • 2 comments

A generic structure for JSON formats in VW

Motivation

Currently there are two JSON-related formats: JSON and DSJSON. They are similar but not interchangeable. The split was most likely motivated by the need to supply both top-level metadata and structured label information for examples.

DSJSON has always focused on contextual bandit formats, and as those have been extended (CCB and Slates), the format has not scaled well. I suspect part of the motivation was that joining an example with a label only required modifying the root level of the JSON structure, but with the addition of schematized binary formats this matters less.

JSON and DSJSON are not well documented or tested, and cannot be used for every label type. JSON is actually quite a good structure for expressing VW examples because namespaces can be expressed naturally, whereas in CSV they cannot.

Proposal

Deprecate the two existing formats and replace them with one simple, unified format that fulfills the requirements of both but is easier to understand, write and parse.

This proposal primarily addresses several things:

  • Consistent handling of multiline examples
  • Consistent expression of labels
  • Self describing data file
  • Arbitrary metadata to decorate the datafile (one of the core reasons DSJSON was created)

Specifics

  • Add a flag --ujson, which must be supplied together with --json and causes the input to be interpreted as the new format
  • Deprecate --json supplied on its own
  • Deprecate --dsjson
  • Create a tool to convert from DSJSON to the new JSON format
  • Create a JSON schema for the new JSON format

Format Specification

The new json format is based on several ideas:

  • There is no special cased structure like _multi and _slots in DSJSON
  • All _ prefixed fields are special fields and have an associated meaning
  • If metadata must be stored in the JSON, any field named like _meta_X will be available from the parser in a metadata map under the key X
    • This metadata map will be structured similarly to JSON. It has string keys and Value fields, where Value is a tagged union of Object, String, Array, or Float. The metadata map itself is of type Object.
  • There is a special field _examples, which is an array of objects and must be supplied at the root of the JSON structure. Each object in the array may contain a _label_TYPE field. This is where label-based extensibility comes in: each label type defines how to handle its _label_TYPE field. It also keeps data files self-describing, since TYPE must be one of the label types.
  • If a line contains only one example (single ex), the _examples field can be omitted and the example's fields placed directly in the root of the object. In that case _examples must not also be supplied.
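
To make the rules above concrete, here is a rough Python sketch of how a reader might normalize one line of the proposed format. It is not VW's parser. The function names, the set of label type names, the decision to reject unknown _ prefixed fields, and the choice to accept a single _meta object as well as _meta_X fields are all assumptions made for illustration.

```python
import json

# Assumed label type names; the proposal only says TYPE must be a VW label type.
KNOWN_LABEL_TYPES = {"simple", "cb", "ccb", "slates"}


def split_example(example: dict) -> dict:
    """Separate the special fields of one example from its features."""
    label_type, label, tag, features = None, None, None, {}
    for key, value in example.items():
        if key.startswith("_label_"):
            if label_type is not None:
                raise ValueError("more than one _label_TYPE field in an example")
            label_type = key[len("_label_"):]
            if label_type not in KNOWN_LABEL_TYPES:
                raise ValueError(f"unknown label type: {label_type}")
            label = value
        elif key == "_tag":
            tag = value
        elif key.startswith("_"):
            # Assumption: any other _ prefixed field is an error.
            raise ValueError(f"unknown special field: {key}")
        else:
            features[key] = value
    return {"label_type": label_type, "label": label, "tag": tag, "features": features}


def normalize(line: str) -> dict:
    """Parse one line and return {"meta": ..., "examples": [...]}."""
    root = json.loads(line)
    if not isinstance(root, dict):
        raise ValueError("the root of each line must be a JSON object")

    # _meta_X fields land in the metadata map under the key X.
    meta = {k[len("_meta_"):]: v for k, v in root.items() if k.startswith("_meta_")}
    # Assumption: a single "_meta" object is merged into the same map.
    if isinstance(root.get("_meta"), dict):
        meta.update(root["_meta"])

    rest = {k: v for k, v in root.items() if not k.startswith("_meta")}
    if "_examples" in rest:
        # Multi-example form: example fields may not also sit at the root.
        extra = [k for k in rest if k != "_examples"]
        if extra:
            raise ValueError(f"fields not allowed alongside _examples: {extra}")
        raw_examples = rest["_examples"]
    else:
        # Flattened shorthand: the root object is the single example.
        raw_examples = [rest]

    return {"meta": meta, "examples": [split_example(e) for e in raw_examples]}
```

With this sketch, both the _examples form and the flattened shorthand come out in the same shape, so downstream code only has to handle a list of examples.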

Examples

Simple

VW Text
0 'First|HouseInfo price:.23 sqft:.25 age:.05 2006
Converted to new JSON format
{
    "_examples": [
        {
            "_label_simple": 0,
            "_tag": "First",
            "HouseInfo": {
                "price": 0.23,
                "sqft": 0.25,
                "age": 0.05,
                "2006": 1.0
            }
        }
    ]
}
Flattened shorthand for single examples
{
    "_label_simple": 0,
    "_tag": "First",
    "HouseInfo": {
        "price": 0.23,
        "sqft": 0.25,
        "age": 0.05,
        "2006": 1.0
    }
}
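
As an illustration of the mapping from VW text to the flattened shorthand, here is a small Python sketch. The function name vw_text_to_json is hypothetical, and the sketch covers only simple labels with an optional quoted tag. It ignores importance weights and namespace scaling, and placing features of an unnamed namespace directly on the example object is an assumption, not something the proposal specifies.

```python
def vw_text_to_json(line: str) -> dict:
    """Convert one simple-label VW text line to the flattened shorthand."""
    head, *namespaces = line.split("|")
    example = {}

    # The head holds the label and an optional tag introduced by a single quote.
    for token in head.split():
        if token.startswith("'"):
            example["_tag"] = token[1:]
        elif "_label_simple" not in example:
            example["_label_simple"] = float(token)

    for segment in namespaces:
        tokens = segment.split()
        if not tokens:
            continue
        if segment[0].isspace():
            # No name attached to the bar: assumed to go on the example itself.
            features, feature_tokens = example, tokens
        else:
            features = example.setdefault(tokens[0], {})
            feature_tokens = tokens[1:]
        for token in feature_tokens:
            name, _, value = token.partition(":")
            # A feature without an explicit value has value 1.
            features[name] = float(value) if value else 1.0

    return example


converted = vw_text_to_json("0 'First|HouseInfo price:.23 sqft:.25 age:.05 2006")
```

Here converted holds the label under _label_simple, the tag under _tag, and one object per namespace, keyed by the namespace name taken from the text line.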

CB

Existing DSJSON format
{
    "_label_cost": 0,
    "_label_probability": 0.333333343,
    "_label_Action": 2,
    "_labelIndex": 1,
    "o": [
        {
            "EventId": "f5547244f88543d1bcad0e3416fbd592",
            "v": {
                "reward": 0.0,
                "value4": 0.0,
                "value1": 0.0,
                "value2": 1400.0,
                "value3": 0.0
            }
        }
    ],
    "Timestamp": "2018-11-22T02:31:39.1440000Z",
    "Version": "1",
    "EventId": "f5547244f88543d1bcad0e3416fbd592",
    "a": [
        2,
        1,
        3
    ],
    "c": {
        "shared_feature": 1.5,
        "_multi": [
            {
                "name3": {
                    "name4": 4.65312243,
                    "name2": 1.0
                }
            },
            {
                "name3": {
                    "name": 4.65312243,
                    "name2": 1.0
                }
            },
            {
                "name3": {
                    "name5": 4.65312243
                }
            }
        ]
    },
    "p": [
        0.333333343,
        0.333333343,
        0.333333343
    ],
    "VWState": {
        "m": "N/A"
    }
}
Converted to new JSON format
{
    "_meta": {
        "Timestamp": "2018-11-22T02:31:39.1440000Z",
        "Version": "1",
        "EventId": "f5547244f88543d1bcad0e3416fbd592",
        "VWState": {
            "m": "N/A"
        },
        "o": [
            {
                "EventId": "f5547244f88543d1bcad0e3416fbd592",
                "v": {
                    "reward": 0.0,
                    "value4": 0.0,
                    "value1": 0.0,
                    "value2": 1400.0,
                    "value3": 0.0
                }
            }
        ],
        "a": [
            2,
            1,
            3
        ],
        "p": [
            0.333333343,
            0.333333343,
            0.333333343
        ]
    },
    "_examples": [
        {
            "_label_cb": {
                "type": "shared"
            },
            "shared_feature": 1.5
        },
        {
            "_label_cb": {
                "type": "action"
            },
            "name3": {
                "name4": 4.65312243,
                "name2": 1.0
            }
        },
        {
            "_label_cb": {
                "type": "action",
                "cost": 0,
                "probability": 0.333333343
            },
            "name3": {
                "name": 4.65312243,
                "name2": 1.0
            }
        },
        {
            "_label_cb": {
                "type": "action"
            },
            "name3": {
                "name5": 4.65312243
            }
        }
    ]
}
  • Is it possible or desired to uplevel the action and probability lists out of the metadata and into the label?
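
The CB conversion above can be sketched in Python as follows. This is an illustration rather than the planned conversion tool. The function name is hypothetical, and the sketch assumes that _labelIndex is a zero-based index into _multi and that every root field other than c and the _label* fields is metadata, which is what the example above suggests.

```python
def dsjson_cb_to_new(event: dict) -> dict:
    """Convert one DSJSON CB event (already parsed) to the proposed format."""
    context = dict(event["c"])
    actions = context.pop("_multi")

    # Everything outside the context and the root-level label fields is metadata.
    meta = {k: v for k, v in event.items() if k != "c" and not k.startswith("_label")}

    # The shared example carries whatever is left of the context.
    examples = [{"_label_cb": {"type": "shared"}, **context}]

    labeled_index = event.get("_labelIndex")
    for index, action_features in enumerate(actions):
        label = {"type": "action"}
        if index == labeled_index:
            # Only the chosen action carries the observed cost and probability.
            label["cost"] = event["_label_cost"]
            label["probability"] = event["_label_probability"]
        examples.append({"_label_cb": label, **action_features})

    return {"_meta": meta, "_examples": examples}
```

Events without a label (no _labelIndex) produce action examples that carry only their type.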

CCB

Existing DSJSON format
{
    "Version": "1",
    "c": {
        "TShared": {
            "a=1": 1,
            "b=0": 1,
            "c=1": 1
        },
        "_multi": [
            {
                "TAction": {
                    "value=0.000000": 1
                }
            },
            {
                "TAction": {
                    "value=1.000000": 1
                }
            },
            {
                "TAction": {
                    "value=2.000000": 1
                }
            },
            {
                "TAction": {
                    "value=3.000000": 1
                }
            }
        ],
        "_slots": [
            {
                "slot_feature": 1
            },
            {
                "slot_feature": 1
            }
        ]
    },
    "_outcomes": [
        {
            "_id": "ac32c0fc-f895-429d-9063-01c996432f791249622271",
            "_label_cost": 0,
            "_a": [
                0,
                1,
                2,
                3
            ],
            "_p": [
                0.25,
                0.25,
                0.25,
                0.25
            ],
            "_o": [
                0
            ]
        },
        {
            "_id": "b64a5e7d-6e76-4d66-98fe-dc214e675ff81249622271",
            "_label_cost": 0,
            "_a": [
                1,
                2,
                3
            ],
            "_p": [
                0.333333,
                0.333333,
                0.333333
            ],
            "_o": [
                0
            ]
        }
    ],
    "VWState": {
        "m": "N/A"
    }
}
Converted to new JSON format
{
    "_meta": {
        "Version": "1",
        "ids": [
            "ac32c0fc-f895-429d-9063-01c996432f791249622271",
            "b64a5e7d-6e76-4d66-98fe-dc214e675ff81249622271"
        ],
        "o": [
            0,
            0
        ],
        "VWState": {
            "m": "N/A"
        }
    },
    "_examples": [
        {
            "_label_ccb": {
                "type": "shared"
            },
            "TShared": {
                "a=1": 1,
                "b=0": 1,
                "c=1": 1
            }
        },
        {
            "_label_ccb": {
                "type": "action"
            },
            "TAction": {
                "value=0.000000": 1
            }
        },
        {
            "_label_ccb": {
                "type": "action"
            },
            "TAction": {
                "value=1.000000": 1
            }
        },
        {
            "_label_ccb": {
                "type": "action"
            },
            "TAction": {
                "value=2.000000": 1
            }
        },
        {
            "_label_ccb": {
                "type": "action"
            },
            "TAction": {
                "value=3.000000": 1
            }
        },
        {
            "_label_ccb": {
                "type": "slot",
                "_label_cost": 0,
                "_a": [
                    0,
                    1,
                    2,
                    3
                ],
                "_p": [
                    0.25,
                    0.25,
                    0.25,
                    0.25
                ]
            },
            "slot_feature": 1
        },
        {
            "_label_ccb": {
                "type": "slot",
                "_label_cost": 0,
                "_a": [
                    1,
                    2,
                    3
                ],
                "_p": [
                    0.333333,
                    0.333333,
                    0.333333
                ]
            },
            "slot_feature": 1
        }
    ]
}
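
A comparable Python sketch for the CCB conversion follows. Again the function name is hypothetical and this is not the planned tool. It assumes that slots and outcomes pair up by position, that each outcome's _id and _o values are collected into flat ids and o lists in the metadata, and that every remaining outcome field is carried into the slot label unchanged.

```python
def dsjson_ccb_to_new(event: dict) -> dict:
    """Convert one DSJSON CCB event (already parsed) to the proposed format."""
    context = dict(event["c"])
    actions = context.pop("_multi")
    slots = context.pop("_slots")
    outcomes = event["_outcomes"]
    if len(slots) != len(outcomes):
        raise ValueError("expected exactly one outcome per slot")

    # Root fields other than the context and outcomes are kept as metadata.
    meta = {k: v for k, v in event.items() if k not in ("c", "_outcomes")}
    meta["ids"] = [outcome["_id"] for outcome in outcomes]
    # Assumption: per-outcome _o lists are concatenated into one flat list.
    meta["o"] = [value for outcome in outcomes for value in outcome.get("_o", [])]

    examples = [{"_label_ccb": {"type": "shared"}, **context}]
    for action_features in actions:
        examples.append({"_label_ccb": {"type": "action"}, **action_features})
    for slot_features, outcome in zip(slots, outcomes):
        label = {"type": "slot"}
        # Carry the remaining outcome fields (_label_cost, _a, _p) into the label.
        label.update({k: v for k, v in outcome.items() if k not in ("_id", "_o")})
        examples.append({"_label_ccb": label, **slot_features})

    return {"_meta": meta, "_examples": examples}
```

The output always lists the shared example first, then the actions, then the slots, matching the order used in the converted example above.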

Migration

Migration is an important aspect of this proposal since large amounts of data exist in DSJSON format. A tool will be created that can convert from DSJSON to the new JSON format. Given well-formed DSJSON input, this conversion should be a trivial process.
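
One possible shape for the driver of such a tool is sketched below in Python. All names are hypothetical. The detection heuristics are assumptions based only on the two DSJSON examples in this issue (Slates data is not handled), and the per-format converters are supplied by the caller rather than defined here.

```python
import json
from typing import Callable, Dict, Iterable, Iterator


def detect_dsjson_kind(event: dict) -> str:
    """Guess the DSJSON flavour of a parsed event from the fields it contains."""
    context = event.get("c", {})
    if "_slots" in context and "_outcomes" in event:
        return "ccb"
    if "_multi" in context:
        return "cb"
    raise ValueError("unrecognized DSJSON event")


def convert_lines(
    lines: Iterable[str],
    converters: Dict[str, Callable[[dict], dict]],
) -> Iterator[str]:
    """Convert DSJSON lines one at a time, reporting the line number on failure."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            event = json.loads(line)
            kind = detect_dsjson_kind(event)
        except ValueError as err:  # json.JSONDecodeError is a ValueError
            raise ValueError(f"line {number}: {err}") from err
        yield json.dumps(converters[kind](event))
```

Because convert_lines is a generator, a large log file can be streamed through it without loading the whole file into memory.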

Open questions

  1. Some label formats seem to lean towards a single "label" for the entire multi-ex, but each individual example still needs meta information for its type of label (shared, action or slot, for example). This depends a lot on how the "predict-info" work lands.

jackgerrits, Jun 29 '20 17:06

According to the description, shouldn't the _label fields for cb be: "_label_cb", "_label_shared", "_label_cb_action" OR "_label_action"? When is "_label" {"type": "X"} used vs "_label_TYPE"?

olgavrou, Sep 11 '20 10:09

Yes, thanks for picking up on that. I've updated _label -> _label_cb and _label -> _label_ccb for the relevant examples.

When is "_label" {"type": "X"} used vs "_label_TYPE"

{"type": "X"} is just part of the label's data structure. The TYPE in _label_TYPE names the label type itself (CB, CCB, Simple, etc.).

jackgerrits, Sep 11 '20 12:09