
Support for multidimensional arrays in Croissant

Open pierrot0 opened this issue 1 year ago • 4 comments

There is currently no way to express that a field should be a multidimensional array, for example a 4x4 matrix.

An example of a dataset with such a need: MatrixCity (https://github.com/city-super/MatrixCity), where the data includes a rotation matrix field (distributed as JSON in this example):

        {
            "frame_index": 0,
            "rot_mat": [
                [
                    -0.009902680292725563,
                    0.0010966990375891328,
                    -0.0008568363264203072,
                    -590.0
                ],
                [
                    -0.0013917317846789956,
                    -0.0078034186735749245,
                    0.006096699275076389,
                    590.0
                ],
                [
                    -8.448758914703092e-10,
                    0.0061566149815917015,
                    0.007880106568336487,
                    200.0
                ],
                [
                    0.0,
                    0.0,
                    0.0,
                    1.0
                ]
            ],
            "euler": [
                0.6632251739501953,
                8.44875884808971e-08,
                -3.0019662380218506
            ]
        },

One possibility might be to use JSON schema to represent such an array:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "array",
  "items": {
    "type": "array",
    "items": {"type": "number"},
    "minItems": 4,
    "maxItems": 4
  },
  "minItems": 4,
  "maxItems": 4
}
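To make the constraints in that schema concrete, here is a minimal sketch of the same check written by hand in Python. It is not a JSON Schema validator; the helper names `is_number` and `is_4x4_matrix` are hypothetical and only mirror the `type`, `minItems`, and `maxItems` rules shown above.

```python
from numbers import Real


def is_number(value):
    # JSON Schema's "number" excludes booleans, but bool is an int subclass in Python.
    return isinstance(value, Real) and not isinstance(value, bool)


def is_4x4_matrix(value):
    """Hand-written equivalent of the schema above: exactly 4 rows of exactly 4 numbers."""
    if not isinstance(value, list) or len(value) != 4:
        return False
    return all(
        isinstance(row, list) and len(row) == 4 and all(is_number(x) for x in row)
        for row in value
    )


# The rot_mat value from the MatrixCity excerpt above.
rot_mat = [
    [-0.009902680292725563, 0.0010966990375891328, -0.0008568363264203072, -590.0],
    [-0.0013917317846789956, -0.0078034186735749245, 0.006096699275076389, 590.0],
    [-8.448758914703092e-10, 0.0061566149815917015, 0.007880106568336487, 200.0],
    [0.0, 0.0, 0.0, 1.0],
]

print(is_4x4_matrix(rot_mat))
```

Even this single fixed-shape case needs a dedicated check, which hints at the implementation cost of supporting arbitrary schemas.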

The benefit here is that JSON schema is quite complete, so it would be possible to express complex cases, including arrays of different types (useful in multimodal prompts for example).

The downside is that the range of possible schemas is quite large, and there is a risk that some datasets would end up with a single field defined in Croissant whose type is a complex object described in JSON schema. That would also significantly increase the implementation complexity.

A possible alternative might be to define our own Array dataType in the croissant namespace, similarly to cr:BoundingBox. For example, something like:

{
  "@type": "cr:Field",
  "@id": "recordsetName/rotation_matrix",
  "description": "The rotation matrix.",
  "dataType": "cr:Array",
  "dataTypeParams": {
    "dimensions": [4, 4],
    "dataType": "sc:Float"
  },
  "source": {
    "fileSet": { ... },
    "extract": {
      "jsonPath": "..."
    }
  }
}

What do you folks think?

pierrot0 avatar May 14 '24 12:05 pierrot0

Could it also be implemented as a transform (e.g., by having a new reshape attribute)?

marcenacp avatar May 14 '24 13:05 marcenacp

If we do implement this as a transform, what datatype would you use? repeated Number?

One would still need to look at the transform to understand the kind of data to expect, no? Also in the above example, the data is already provided as a 4x4 matrix, which is what we want, so it would seem odd to me to apply a reshape on this.

pierrot0 avatar May 14 '24 13:05 pierrot0

I see. Indeed, in that case the shape would be implicit, which is not great. I was thinking of a NumPy-like approach where even scalars would be arrays:

>>> import numpy as np
>>> np.array(1).dtype, np.array(1).shape
(dtype('int64'), ())

So we wouldn't need cr:Array at all:

{
  "@type": "cr:Field",
  "@id": "recordsetName/rotation_matrix",
  "description": "The rotation matrix.",
  "dataType": "cr:Float",
  "shape": [4, 4]
}
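For JSON-style data, the NumPy analogy can be sketched without NumPy itself. The function below is a hypothetical helper (not part of Croissant or any library) that infers the shape of a nested list, treating a scalar as an array of shape `()`, as in the NumPy session above.

```python
def infer_shape(value):
    """Return the shape of a nested list as a tuple; a scalar has shape ()."""
    if not isinstance(value, list):
        return ()
    if not value:
        return (0,)
    inner = [infer_shape(item) for item in value]
    # Every element must have the same shape, otherwise the list is ragged.
    if any(shape != inner[0] for shape in inner):
        raise ValueError("ragged nested list has no single shape")
    return (len(value),) + inner[0]


print(infer_shape(1))                                 # scalar
print(infer_shape([[0.0] * 4 for _ in range(4)]))     # 4x4 matrix
```

Under this view, a field's `dataType` describes the element and `shape` describes the nesting, so a separate array type is not needed.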

marcenacp avatar May 14 '24 14:05 marcenacp

I like your above example.

pierrot0 avatar May 15 '24 05:05 pierrot0

After a few offline conversations, the consensus seems to be to do the following:

  • deprecate repeated (boolean, spec, example) in favor of isArray (boolean).
  • introduce arrayShape (list of ints), defaulting to [-1] (a simple list), to indicate the shape of the array, where -1 indicates dimensions of unknown/unspecified size.

If anyone is against that, please speak up, or I'll just send PRs.
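As an illustration of the proposed semantics, here is a minimal sketch of checking a nested list against an arrayShape where -1 matches any size along that dimension. The function name `matches_shape` is hypothetical, and the exact matching rules are an assumption based on the description above, not the behavior of any Croissant implementation.

```python
def matches_shape(value, array_shape):
    """Check a nested list against a shape; -1 accepts any length on that axis."""
    if not array_shape:
        # No dimensions left: the value must be a scalar.
        return not isinstance(value, list)
    if not isinstance(value, list):
        return False
    expected, rest = array_shape[0], array_shape[1:]
    if expected != -1 and len(value) != expected:
        return False
    return all(matches_shape(item, rest) for item in value)


print(matches_shape([1, 2, 3], [-1]))               # simple list, the default
print(matches_shape([[1, 2], [3, 4]], [-1, 2]))     # any number of rows, 2 columns
print(matches_shape([[1, 2], [3]], [-1, 2]))        # second row has the wrong length
```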

pierrot0 avatar Nov 22 '24 11:11 pierrot0

I came across #697 and my related comment at https://github.com/mlcommons/croissant/issues/737#issuecomment-2430009285. I would be interested in testing this feature with the examples given there (NetCDF or H5 files).

omshinde avatar Jan 22 '25 17:01 omshinde

PR to HF: https://github.com/huggingface/dataset-viewer/pull/3141

ccl-core avatar Feb 26 '25 15:02 ccl-core

Update: the final implementation uses:

{
  "@type": "cr:Field",
  "@id": "recordset/fieldname",
  "isArray": true,
  "arrayShape": "1,2,1"
  ...
}
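Since arrayShape is a string here rather than a list, a consumer has to parse it. A minimal sketch, assuming the value is always a comma-separated list of integers (the helper name `parse_array_shape` is hypothetical and not taken from any Croissant library):

```python
def parse_array_shape(raw):
    """Turn a string such as "1,2,1" into a list of ints such as [1, 2, 1]."""
    # int() raises ValueError on anything that is not an integer.
    return [int(part.strip()) for part in raw.split(",")]


print(parse_array_shape("1,2,1"))
```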

ccl-core avatar Feb 26 '25 15:02 ccl-core