
Duplicate relations after importing aliases

Open Rizziepit opened this issue 9 years ago • 6 comments

Rizziepit avatar Nov 04 '14 09:11 Rizziepit

This is -- to some extent -- the code that causes it:

https://github.com/granoproject/grano/blob/master/grano/logic/entities.py#L136

The question is how that code should decide when to delete duplicate links, because it may need to consider more than just source and target. The only fully logical solution I can see is to load all entities first, then de-dupe, and then load relations. But that would be a major refactor.

pudo avatar Nov 20 '14 10:11 pudo

Would it not be possible to merge relations based on the uniqueness constraints in the schemata?

Rizziepit avatar Nov 20 '14 10:11 Rizziepit

Hm, but the uniqueness constraints aren't actually in the schema; they're in the loaders. Which may be a problem anyway: if the schema knew about de-dupe, we could just POST whole objects without checking for them first, which would halve the number of HTTP requests we need to make to load a dataset.

pudo avatar Nov 20 '14 10:11 pudo
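
To illustrate the point about POSTing whole objects: if uniqueness constraints were declared per schema, the store could find or create a relation in a single call, so the loader would not need a separate lookup request first. The following is only a sketch under that assumption. `RelationStore`, `upsert`, and the `unique_fields_by_schema` mapping are hypothetical names for illustration and are not part of grano's API.

```python
class RelationStore:
    """Toy in-memory store that knows each schema's unique fields.

    Hypothetical illustration only; this is not grano's API.
    """

    def __init__(self, unique_fields_by_schema):
        # e.g. {"owns": ("source", "target")}
        self.unique_fields_by_schema = unique_fields_by_schema
        self.by_key = {}

    def upsert(self, relation):
        """Create the relation, or merge it into an existing match.

        Returns (stored_relation, created).
        """
        schema = relation["schema"]
        fields = self.unique_fields_by_schema[schema]
        key = (schema,) + tuple(relation[field] for field in fields)
        existing = self.by_key.get(key)
        if existing is None:
            self.by_key[key] = dict(relation)
            return self.by_key[key], True
        # Same unique key: merge properties instead of creating a duplicate.
        existing.update(relation)
        return existing, False
```

With this shape, the loader sends one request per relation; whether a match already exists is decided on the server side from the schema's unique fields.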

I was thinking of something along the lines of a grano command that takes the schema file as an argument and de-dupes the relations.

What are good reasons for keeping grano ignorant of uniqueness constraints? Simpler code?

Rizziepit avatar Nov 20 '14 10:11 Rizziepit

Well, there could be different uniqueness constraints for different data sources, but that actually seems more like a bug now that I think of it.

pudo avatar Nov 20 '14 10:11 pudo

Perhaps for now I can add a relation de-duping command to granoloader. It should be able to merge relations efficiently enough by paging through relations ordered by their unique fields.

Rizziepit avatar Nov 20 '14 11:11 Rizziepit
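
A rough sketch of the paging idea from the last comment, assuming relations are plain dicts with an `id` and that the unique fields come from the schema file. Because the relations are ordered by their unique fields, duplicates end up adjacent, so only the last kept relation has to be remembered, even across page boundaries. The helper names (`unique_key`, `paged`, `find_duplicates`) and the sample data are hypothetical; a real command would page through the grano API or database rather than an in-memory list.

```python
from itertools import islice


def unique_key(relation, unique_fields):
    """Build the comparison key from the fields assumed to be unique."""
    return tuple(relation[field] for field in unique_fields)


def paged(ordered_relations, page_size):
    """Yield successive pages; stands in for paged API or DB queries."""
    iterator = iter(ordered_relations)
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            return
        yield page


def find_duplicates(relations, unique_fields, page_size=2):
    """Return (kept_id, duplicate_id) pairs for relations sharing a key."""
    # Stand-in for an ORDER BY on the unique fields; sorted() is stable,
    # so the earliest relation with a given key is the one that is kept.
    ordered = sorted(relations, key=lambda r: unique_key(r, unique_fields))
    pairs = []
    kept, kept_key = None, None
    for page in paged(ordered, page_size):
        for relation in page:
            key = unique_key(relation, unique_fields)
            if kept is not None and key == kept_key:
                pairs.append((kept["id"], relation["id"]))
            else:
                kept, kept_key = relation, key
    return pairs


RELATIONS = [
    {"id": 1, "source": "a", "target": "b", "schema": "owns"},
    {"id": 2, "source": "a", "target": "b", "schema": "employs"},
    {"id": 3, "source": "a", "target": "b", "schema": "owns"},
    {"id": 4, "source": "c", "target": "d", "schema": "owns"},
    {"id": 5, "source": "a", "target": "b", "schema": "owns"},
]

# Keying on schema as well as source and target keeps relation 2 separate.
print(find_duplicates(RELATIONS, ("source", "target", "schema")))
```

Keying on `("source", "target")` alone would also report relation 2 as a duplicate of relation 1, which is the over-merging concern raised at the start of the thread.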