grano
Duplicate relations after importing aliases
This is -- to some extent -- the code that causes it:
https://github.com/granoproject/grano/blob/master/grano/logic/entities.py#L136
The question is how that code decides when to delete duplicate links, because it may need to consider more than just source and target. The only fully logical solution I can see is to load all entities first, then de-dupe them, and only then load relations. But that would be a major refactor.
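To make the "entities first, then de-dupe, then relations" ordering concrete, here is a rough sketch on plain dicts. It does not use grano's actual models or API: the field names (`id`, `name`, `aliases`, `schema`, `source`, `target`) and both function names are assumptions for illustration only.

```python
def dedupe_entities(entities):
    """Map every entity id to a canonical id, matching on name or alias.

    Hypothetical helper: a single pass that does not handle an entity
    whose names link two previously separate canonical entities.
    """
    canonical = {}  # name or alias -> canonical entity id
    remap = {}      # entity id -> canonical entity id
    for entity in entities:
        names = [entity["name"], *entity.get("aliases", [])]
        # Reuse an existing canonical id if any of the names is already known.
        target = next((canonical[n] for n in names if n in canonical), entity["id"])
        remap[entity["id"]] = target
        for n in names:
            canonical.setdefault(n, target)
    return remap


def load_relations(relations, remap):
    """Rewrite endpoints to canonical ids and skip relations already seen."""
    seen = set()
    loaded = []
    for rel in relations:
        # Assumed key: schema + endpoints. Extend it with whatever other
        # fields should distinguish two links between the same entities.
        key = (rel["schema"], remap[rel["source"]], remap[rel["target"]])
        if key in seen:
            continue
        seen.add(key)
        loaded.append({"schema": key[0], "source": key[1], "target": key[2]})
    return loaded
```

Because relations are only loaded after the alias merge, two links that point at different aliases of the same entity collapse into one instead of surviving as duplicates.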
Would it not be possible to merge relations based on the uniqueness constraints in the schemata?
Hm, but the uniqueness constraints aren't actually in the schema; they're in the loaders. That may be a problem anyway: if the schema knew about de-duping, we could just POST whole objects without checking for them first, which would halve the number of HTTP requests needed to load a dataset.
I was thinking of something along the lines of a grano command that takes the schema file as an argument and de-dupes the relations.
What are good reasons for keeping grano ignorant of uniqueness constraints? Simpler code?
Well, there could be different uniqueness constraints for different data sources, but that actually seems more like a bug now that I think of it.
Perhaps for now I can add a relation de-duping command to granoloader. It should be able to merge relations efficiently enough by paging through relations ordered by their unique fields.
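A minimal sketch of that merge step, assuming the relations arrive already sorted by their unique key (for example, page by page from an ordered query). This is not granoloader code: the dict layout, `unique_key`, and `merge_sorted_relations` are hypothetical names used only to illustrate the idea.

```python
from itertools import groupby


def unique_key(relation, unique_fields):
    """Build the key that decides whether two relations are duplicates."""
    props = relation.get("properties", {})
    return (
        relation["schema"],
        relation["source"],
        relation["target"],
        tuple(props.get(field) for field in unique_fields),
    )


def merge_sorted_relations(relations, unique_fields):
    """Yield (merged_relation, duplicate_ids) for each run of equal keys.

    `relations` must be ordered by the unique key, so duplicates are
    adjacent and only one group is held in memory at a time.
    """
    for _, group in groupby(relations, key=lambda r: unique_key(r, unique_fields)):
        group = list(group)
        keeper = dict(group[0])
        keeper["properties"] = dict(group[0].get("properties", {}))
        duplicate_ids = []
        for dup in group[1:]:
            # Keep the first relation's values; only fill in missing properties.
            for name, value in dup.get("properties", {}).items():
                keeper["properties"].setdefault(name, value)
            duplicate_ids.append(dup["id"])
        yield keeper, duplicate_ids
```

The caller would then save each merged relation and delete the ids in `duplicate_ids`. Since the input can be a generator that fetches one page at a time, the whole relation set never has to be loaded at once.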