croissant
[NeurIPS] Join fails with reference if read from fileObject
Hi,
This concerns your proposed solution to #651 and the implementation of simple_join. The solution only works when the reference points to a field whose data is added manually. When I load the data from a fileObject, it returns NaN.
Below is an example that fails (it returns NaN for the sequence field of examples, despite a matching uid):
{
"recordSet": [
{
"@type": "cr:RecordSet",
"@id": "sequences",
"name": "sequences",
"field": [
{
"@type": "cr:Field",
"@id": "sequences/uid",
"name": "uid",
"dataType": "sc:Text",
"source": {
"fileObject": {
"@id": "sequences.csv"
},
"extract": {
"column": "uniprot_id"
}
}
},
{
"@type": "cr:Field",
"@id": "sequences/sequence",
"name": "sequence",
"dataType": "sc:Text",
"source": {
"fileObject": {
"@id": "sequences.csv"
},
"extract": {
"column": "sequence"
}
}
}
]
},
{
"@type": "cr:RecordSet",
"@id": "examples",
"name": "examples",
"field": [
{
"@type": "cr:Field",
"@id": "examples/uid",
"name": "uid",
"dataType": "sc:Text",
"references": {
"field": {
"@id": "sequences/uid"
}
},
"source": {
"fileObject": {
"@id": "annotations.csv"
},
"extract": {
"column": "uniprot_id"
}
}
},
{
"@type": "cr:Field",
"@id": "examples/type",
"name": "type",
"dataType": "sc:Text",
"source": {
"fileObject": {
"@id": "annotations.csv"
},
"extract": {
"column": "type"
}
}
},
{
"@type": "cr:Field",
"@id": "examples/annotation",
"name": "annotation",
"dataType": "sc:Text",
"source": {
"fileObject": {
"@id": "annotations.csv"
},
"extract": {
"column": "annotation"
}
}
},
{
"@type": "cr:Field",
"@id": "examples/sequence",
"name": "sequence",
"dataType": "sc:Text",
"source": {
"field": {
"@id": "sequences/sequence"
}
}
}
]
}
]
}
But the following works (here the data is added manually, as in the provided example):
{
"recordSet": [
{
"@type": "cr:RecordSet",
"@id": "sequences",
"name": "sequences",
"field": [
{
"@type": "cr:Field",
"@id": "sequences/uid",
"name": "uid",
"dataType": "sc:Text"
},
{
"@type": "cr:Field",
"@id": "sequences/sequence",
"name": "sequence",
"dataType": "sc:Text"
}
],
"data": [
{
"uid": "XYZ",
"sequence": "MLCTHGHGHLMKNMNV"
}
]
},
{
"@type": "cr:RecordSet",
"@id": "examples",
"name": "examples",
"field": [
{
"@type": "cr:Field",
"@id": "examples/uid",
"name": "uid",
"dataType": "sc:Text",
"references": {
"field": {
"@id": "sequences/uid"
}
},
"source": {
"fileObject": {
"@id": "annotations.csv"
},
"extract": {
"column": "uniprot_id"
}
}
},
{
"@type": "cr:Field",
"@id": "examples/type",
"name": "type",
"dataType": "sc:Text",
"source": {
"fileObject": {
"@id": "annotations.csv"
},
"extract": {
"column": "type"
}
}
},
{
"@type": "cr:Field",
"@id": "examples/annotation",
"name": "annotation",
"dataType": "sc:Text",
"source": {
"fileObject": {
"@id": "annotations.csv"
},
"extract": {
"column": "annotation"
}
}
},
{
"@type": "cr:Field",
"@id": "examples/sequence",
"name": "sequence",
"dataType": "sc:Text",
"source": {
"field": {
"@id": "sequences/sequence"
}
}
}
]
}
]
}
I faced the same issue. I am unsure whether I found the right solution, but for recordSets, either removing the name property or picking a name that differs from the @id value solved the issue for me. Here is an example of my files: https://github.com/msorkhpar/wiki-entity-summarization/tree/main/croissant
Thanks for the suggestion. I removed all the "name"s from recordSets and made all other name pointers unique, but it unfortunately doesn't seem to work for me.
@EMCarrami, I experienced the same problem you describe here. After some debugging, I narrowed it down to how the field is parsed from a data frame.
EXPECTED_DATA_TYPES maps sc:Text to bytes, so a column read from a CSV holds bytes values, whereas the FileSet filename is a string. The two never compare equal, so the join fails.
I am not sure what the recommended way to specify a text field is, or why this mapping is set, but @marcenacp might have some context on this. In an initial version, text mapped to str, but that was changed. The docs also suggest using the dataType sc:Text for a CSV column.
I am attaching an example JSON-LD file that reproduces the join in this situation; perhaps this needs to be addressed? cookbook-dataset-metadata.json
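The bytes-versus-string mismatch described above can be reproduced in plain pandas, independent of Croissant. The sketch below is only an illustration under assumptions: the frame names and the "domain" value are made up, the to_text helper is hypothetical, and none of this is mlcroissant code. It shows a left join returning NaN when one key column holds bytes and the other holds str, and the same join matching once both sides use one type.

```python
import pandas as pd

# Hypothetical data: keys cast to bytes on one side, plain strings on the other.
examples = pd.DataFrame({
    "uid": pd.Series([b"XYZ"], dtype=object),
    "type": pd.Series(["domain"], dtype=object),
})
sequences = pd.DataFrame({
    "uid": pd.Series(["XYZ"], dtype=object),
    "sequence": pd.Series(["MLCTHGHGHLMKNMNV"], dtype=object),
})

# In Python, b"XYZ" != "XYZ", so the left join finds no matching key
# and fills the sequence column with NaN.
mismatched = examples.merge(sequences, on="uid", how="left")


def to_text(value):
    """Decode bytes to str; leave any other value unchanged (hypothetical helper)."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


# Normalize the key column to str on the bytes side, then join again.
normalized = examples.assign(
    uid=pd.Series(
        [to_text(value) for value in examples["uid"]],
        dtype=object,
        index=examples.index,
    )
)
matched = normalized.merge(sequences, on="uid", how="left")
```

Here mismatched has NaN in its sequence column, while matched carries the expected sequence. Decoding to str is just one choice; encoding the string side to bytes would make the keys agree equally well.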
I've also run into this while trying to join WAV files by name to rows in a JSONL file. The WAV names in the pandas DataFrame are strings, while the ids from the JSONL are bytes, which produces an empty join.
I partially debugged this:
- Both record sets, when iterated over, correctly produce ids of type bytes.
- In the Join operation, the left DataFrame appears to be read directly from the JSONL file rather than being the transformed DataFrame of the record set. Join looks up the original column name and then uses apply_transformations, but it does not use _cast_value, which the Field operation applies after transformation, so the key types mismatch.
I can't figure out any workaround.
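To make the asymmetry described in the list above concrete, here is a small pandas sketch. It assumes, as the debugging notes suggest but do not confirm, that one side of the join has been cast to bytes while the other still holds raw strings. The helpers key_types and cast_text_to_bytes are hypothetical stand-ins written for this illustration, not mlcroissant functions, and the file names are made up.

```python
import pandas as pd


def key_types(frame, column):
    """Return the sorted names of the Python types found in a column."""
    return sorted({type(value).__name__ for value in frame[column]})


def cast_text_to_bytes(value):
    """Hypothetical stand-in for a text-to-bytes cast; not mlcroissant code."""
    return value.encode("utf-8") if isinstance(value, str) else value


def cast_column(frame, column):
    """Return a copy of the frame with the cast applied to one column."""
    cast = [cast_text_to_bytes(value) for value in frame[column]]
    return frame.assign(
        **{column: pd.Series(cast, dtype=object, index=frame.index)}
    )


# Made-up keys: the left frame is "raw" (strings), the right one is cast.
raw_left = pd.DataFrame({"uid": pd.Series(["a.wav", "b.wav"], dtype=object)})
right = cast_column(
    pd.DataFrame({"uid": pd.Series(["a.wav", "b.wav"], dtype=object)}), "uid"
)

# Inspecting the key types exposes the mismatch before any join is attempted.
left_types = key_types(raw_left, "uid")
right_types = key_types(right, "uid")

# Joining raw strings against cast bytes matches nothing.
empty = raw_left.merge(right, on="uid", how="inner")

# Applying the same cast to both sides restores the matches.
joined = cast_column(raw_left, "uid").merge(right, on="uid", how="inner")
```

In this sketch left_types is ['str'] and right_types is ['bytes'], the first join is empty, and the second returns both rows. A check like key_types on both join inputs may help confirm whether the same mismatch is what a given dataset is hitting.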