Maximum recursion depth exceeded when loading Croissant JSON
Hi Guys,
We are using ml-croissant to describe our quite elaborate and connected biomedical datasets. One of our tables captures the disease index of our platform. The index has disease identifiers (`{"key":{"@id":"disease/id"}}`) as its primary key, but it also has children/descendants/ancestors fields to capture the disease hierarchy. Since the contents of these fields are also disease identifiers, we declare each of them as a reference:
```json
{
  "@type": "cr:Field",
  "@id": "disease/ancestors",
  "name": "ancestors",
  "description": "List of all ancestral disease terms",
  "dataType": "sc:Text",
  "references": {
    "field": {
      "@id": "disease/id"
    }
  },
  "repeated": true,
  "source": {
    "fileSet": {
      "@id": "disease-fileset"
    },
    "extract": {
      "column": "ancestors"
    }
  }
}
```
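For what it's worth, a self-reference like this can be spotted directly in the JSON before handing it to mlcroissant. Below is a minimal sketch — the key layout (`recordSet`, `field`, `references`) is assumed from the snippet above, and `self_referencing_fields` is a hypothetical helper, not part of mlcroissant:

```python
def self_referencing_fields(metadata: dict) -> list[tuple[str, str]]:
    """Return (field, referenced_field) pairs where a field's `references`
    points at another field inside the same record set."""
    hits = []
    for record_set in metadata.get("recordSet", []):
        local_ids = {f["@id"] for f in record_set.get("field", [])}
        for field in record_set.get("field", []):
            ref = field.get("references", {}).get("field", {}).get("@id")
            if ref in local_ids:
                hits.append((field["@id"], ref))
    return hits

# Stripped-down version of the record set from the issue.
metadata = {
    "recordSet": [
        {
            "@id": "disease",
            "field": [
                {"@id": "disease/id"},
                {
                    "@id": "disease/ancestors",
                    "references": {"field": {"@id": "disease/id"}},
                },
            ],
        }
    ]
}
print(self_referencing_fields(metadata))  # [('disease/ancestors', 'disease/id')]
```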
When we compile the metadata of this record set, we can save the Croissant JSON file without any problem; however, loading it fails.
Loading the Croissant file with

```python
import mlcroissant as mlc

ds = mlc.Dataset("disease_only.json")
```

gives:

```
RecursionError: maximum recursion depth exceeded while calling a Python object
```
It raises a few questions:
- [ ] I assume it tries to build the entire dependency tree of all record sets, and as such, it keeps coming back to the same record set in this case. Besides checking whether the referenced field exists, is anything else happening? If not, there is no need to get lost in this infinite recursion.
- [ ] How is it possible that creating the JSON file succeeds without any problem? It fails if a non-existent field is referenced, but there is no infinite recursion.
- [ ] Besides not capturing foreign keys pointing to the primary key of the file set, is there anything you can advise to avoid this issue? We are trying to provide the best possible description of our datasets for our users, so we would rather not drop this unless necessary.
We really need to capture all connections between the fields of all file sets, so it would be great if this issue could be solved.
- Croissant version: 1.0.17
- Python version: Python 3.10.8
- Test data: disease_only.json
Hi! Is there any update on this? I'm really interested in using Croissant for this context
I believe the RecursionError is raised here in _add_operations_for_file_object(...).
The underlying issue is that a cycle is created in the __post_init__ of Metadata when the library tries to backfill the graph.
Looking at the from_nodes_to_graph(...) logic, it always adds an edge from field.reference to the record_set.
In our case, if the field is disease/ancestors, there is an edge from disease/id (the reference) -> disease (the RecordSet).
However, for every field in a RecordSet there is also an edge from the RecordSet to the field, in this case disease (RecordSet) -> disease/id, which closes the loop.
Interestingly, the cycle is not caught by the topological sort. My guess is that, since the sort is a generator, it yields the source nodes before ever reaching the cycle.
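To illustrate the point (this is a minimal Kahn's-algorithm sketch with hypothetical node names, not mlcroissant's actual sort): a generator-based topological sort happily yields the acyclic part of the graph and then simply stops when only cycle nodes remain, so no error surfaces unless the caller counts the yielded nodes.

```python
from collections import defaultdict

def lazy_topo_sort(edges):
    """Yield nodes in topological order, one at a time (Kahn's algorithm).

    If the graph contains a cycle, the generator stops early instead of
    raising: the cycle nodes never reach in-degree zero, so they are
    silently never yielded.
    """
    indegree = defaultdict(int)
    succ = defaultdict(list)
    nodes = set()
    for a, b in edges:
        nodes.update((a, b))
        succ[a].append(b)
        indegree[b] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    while ready:
        node = ready.pop()
        yield node
        for nxt in succ[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

# Edges mirroring the issue: the reference edge plus the
# RecordSet -> field edge form a cycle between the two nodes.
edges = [
    ("Metadata", "disease"),     # acyclic part of the graph
    ("disease/id", "disease"),   # field.reference -> RecordSet (buggy edge)
    ("disease", "disease/id"),   # RecordSet -> its field
]
order = list(lazy_topo_sort(edges))
print(order)  # ['Metadata'] -- the cycle nodes are never yielded
```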
In this operation graph an edge from A -> B means B depends on A.
As a result, the proper fix should be adding an edge field.reference -> field instead of field.reference -> RecordSet.
Based on the definition of the edges, this should also work for cross-record-set (or file-set) references.
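Under the stated edge semantics (an edge A -> B means B depends on A), the fix can be sanity-checked with a tiny cycle detector. This is only a sketch, not mlcroissant code; the node names come from our dataset:

```python
def has_cycle(edges):
    """Three-color DFS cycle detection over a directed edge list."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, []).append(b)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY  # node is on the current DFS path
        for nxt in succ.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GRAY or (c == WHITE and visit(nxt)):
                return True  # back edge found -> cycle
        color[node] = BLACK  # node fully explored
        return False

    nodes = {n for edge in edges for n in edge}
    return any(color.get(n, WHITE) == WHITE and visit(n) for n in nodes)

# Current behavior: field.reference -> RecordSet, plus RecordSet -> each field.
buggy = [("disease", "disease/id"), ("disease/id", "disease")]
# Proposed fix: field.reference -> field keeps the dependency without the loop.
fixed = [
    ("disease", "disease/id"),
    ("disease", "disease/ancestors"),
    ("disease/id", "disease/ancestors"),
]
print(has_cycle(buggy))  # True
print(has_cycle(fixed))  # False
```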
```
Metadata
|
+-- FileSet
|
+-- RecordSet_A
|   |
|   +-- Field_A
|       |
|       |-- {depends on} <-- RecordSet_A
|       |-- {depends on} <-- FileSet (via cr:source)
|       L-- {depends on} <-- Field_B (via cr:references)
|
+-- RecordSet_B
    |
    +-- Field_B
        |
        |-- {depends on} <-- RecordSet_B
        L-- {depends on} <-- FileSet (via cr:source)
```
I will open a PR with the code change.
https://github.com/mlcommons/croissant/pull/949 should have fixed this issue! @DSuveges @Tobi1kenobi can you check on your end?