Deprecate Nested RecordSets in favor of repeated subField

Open benjelloun opened this issue 1 year ago • 1 comments

The Croissant spec allows nesting RecordSets inside RecordSets by using a field with dataType="cr:RecordSet":

https://docs.mlcommons.org/croissant/docs/croissant-spec.html#nested-records

This mechanism has not been used much, is not supported in the mlcroissant library, and adds unneeded complexity.

Instead, we propose using the existing subField mechanism, and specifying repeated=true to represent multiple records.

Here is an example based on the one in the above documentation:

{
  "@type": "cr:RecordSet",
  "@id": "movies_with_ratings",
  "key": { "@id": "movies_with_ratings/movie_id" },
  "field": [
    {
      "@type": "cr:Field",
      "@id": "movies_with_ratings/movie_id",
      "source": { "@id": "movies/movie_id" },
      "references": { "@id": "ratings/movie_id" }
    },
    {
      "@type": "cr:Field",
      "@id": "movies_with_ratings/movie_title",
      "source": { "@id": "movies/title" }
    },
    {
      "@type": "cr:Field",
      "@id": "movies_with_ratings/ratings",
      "repeated": true,
      "subField": [
        {
          "@type": "cr:Field",
          "@id": "movies_with_ratings/ratings/user_id",
          "source": { "@id": "ratings/user_id" }
        },
        {
          "@type": "cr:Field",
          "@id": "movies_with_ratings/ratings/rating",
          "source": { "@id": "ratings/rating" }
        },
        {
          "@type": "cr:Field",
          "@id": "movies_with_ratings/ratings/timestamp",
          "source": { "@id": "ratings/timestamp" }
        }
      ]
    }
  ]
}

Note that using a repeated field with subFields also lets us drop the cumbersome "parentField" property of the previous syntax. Instead, the join with the underlying ratings table is specified by the "references" property on the "movie_id" field.
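
To make the intended result concrete, here is a minimal Python sketch of the records such a definition might yield. It is not mlcroissant code: the sample rows are made up, and `nest_ratings` is a hypothetical helper that simply joins the two tables on movie_id and groups the ratings under each movie.

```python
from collections import defaultdict

# Made-up sample rows standing in for the "movies" and "ratings" RecordSets.
movies = [
    {"movie_id": 1, "title": "Toy Story"},
    {"movie_id": 2, "title": "Heat"},
]
ratings = [
    {"movie_id": 1, "user_id": 10, "rating": 4.0, "timestamp": 964982703},
    {"movie_id": 1, "user_id": 11, "rating": 5.0, "timestamp": 964982931},
    {"movie_id": 2, "user_id": 10, "rating": 3.5, "timestamp": 964983815},
]


def nest_ratings(movies, ratings):
    """Hypothetical helper: join on movie_id, one output record per movie."""
    by_movie = defaultdict(list)
    for r in ratings:
        # Keep only the sub-fields declared under movies_with_ratings/ratings.
        by_movie[r["movie_id"]].append(
            {k: r[k] for k in ("user_id", "rating", "timestamp")}
        )
    return [
        {
            "movie_id": m["movie_id"],
            "movie_title": m["title"],
            # The repeated field: a list of sub-records, possibly empty.
            "ratings": by_movie.get(m["movie_id"], []),
        }
        for m in movies
    ]


records = nest_ratings(movies, ratings)
```

Each element of `records` corresponds to one movie, and its `ratings` entry plays the role of the repeated field with three subFields.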

benjelloun avatar Sep 27 '24 10:09 benjelloun

This is a common representation for trees generally. Instead of actually nesting the data structure, maintain a flat data structure of all nodes, and have each node point to its immediate children. e.g.

  tree = {
    root: [1,2],
    1: [3,4],
    2: [5,6],
    3: [7]
  }

This mechanism is used by, for example, GraphQL schemas. GraphQL uses it because it makes it possible to define possibly infinite trees, where a subtype of a type can be the type itself. IMHO, the GraphQL type system is pretty intelligent, and we could learn a lot from the setup there.
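
As a rough illustration of why the flat form is easy to work with, the tree from the example can be walked with a few lines of Python. The function name `descendants` is made up for this sketch; it assumes that nodes without children simply have no entry in the mapping.

```python
# The flat tree from the example: each node maps to its immediate children.
tree = {
    "root": [1, 2],
    1: [3, 4],
    2: [5, 6],
    3: [7],
}


def descendants(tree, node):
    """Return every node below `node`, depth-first, in child order."""
    found = []
    for child in tree.get(node, []):  # leaves have no entry of their own
        found.append(child)
        found.extend(descendants(tree, child))
    return found


print(descendants(tree, "root"))  # [1, 3, 7, 4, 2, 5, 6]
```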

To this end, what might make sense is the ability to define a compound type, right in the .json file. For example, perhaps a movie can have a "sequel" field, which in turn is a movie itself, and which might have sequels, and so on.
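
A small Python sketch of what such a self-referential type could look like when kept flat. Everything here is hypothetical (the `Movie` class, the `sequel_id` field, the sample titles); it only shows that a type can refer to itself by id without any actual nesting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Movie:
    movie_id: int
    title: str
    # Refers to another Movie by id instead of embedding it; None means no sequel.
    sequel_id: Optional[int] = None


# Flat storage: every movie lives at the top level, keyed by its id.
movies = {
    1: Movie(1, "Part One", sequel_id=2),
    2: Movie(2, "Part Two", sequel_id=3),
    3: Movie(3, "Part Three"),
}


def sequel_chain(movies, movie_id):
    """Follow sequel_id links and collect the titles of all later sequels."""
    chain = []
    current = movies[movie_id].sequel_id
    while current is not None:
        chain.append(movies[current].title)
        current = movies[current].sequel_id
    return chain


print(sequel_chain(movies, 1))  # ['Part Two', 'Part Three']
```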

csbrown avatar Dec 19 '24 15:12 csbrown

From looking at the mlcroissant code, nested recordsets don't seem to be supported by the library.

ccl-core avatar May 21 '25 09:05 ccl-core