gramps icon indicating copy to clipboard operation
gramps copied to clipboard

Better support for GEDCOM _UID

Open prculley opened this issue 5 years ago • 19 comments

As a result of Nick's comments in #1002, I closed that PR and #1000 to create this more comprehensive PR.

  • This adds a uid_list to both Person and Family objects. A single uid is attached to the object at creation (which required me to replace it on import to avoid an unnecessary extra uid).
  • As a result, it requires a database upgrade, which code is included. If no uids were found in attributes, a single uid is added to Persons and Families.
  • XML import/export and schema have been updated to deal with the new uid lists.
  • The XML import move any _UID attributes found in Persons or Families to the new uid list.
  • GEDCOM import/export also now stores _UID tags in the uid list and exports from there.
  • The FindDups tool now uses the uid list to match up people with very high confidence.
  • The FindDups tool also has a new "Very High" threshold level (the "Very High" was already in the gramps.po). This higher level will skip the more time consuming second pass of the tool if persons matching via uid are found.
  • carefully updated the import/export tests.

Comments: When attempting to merge in a GEDCOM that contained _UID for the people, with a tree that already had some of the same people, I noted that our 'Find Possible Duplicate People' was ignoring the _UID entries. The tree also happened to have a set of actual duplicates which I could not get to show up.

On investigation, I saw that the tool did not display all possible duplicates for a single person, only the one it thought had the highest score. And it did not look at _UID attributes at all.

This PR contains several commits;

  • a change to GEDCOM import to remove the find_family_from_handle method; this was required because it was messing up the GEDCOm import by adding an additional uid.
  • the basic UID support
  • a pylint on finddups
  • the finddups patch to allow all duplicate pairs to show up
  • the finddups patch to allow matching uid on persons to give the highest score (10).

Note: this PR is based on top of the #1010 database upgrade PR, as that is necessary to implement the upgrade. When looking at code, look at the individual commits.

prculley avatar Jan 22 '20 17:01 prculley

Codecov Report

Merging #1005 into master will increase coverage by 0.03%. The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1005      +/-   ##
==========================================
+ Coverage   41.36%   41.39%   +0.03%     
==========================================
  Files        1059     1060       +1     
  Lines      142956   143119     +163     
==========================================
+ Hits        59132    59248     +116     
- Misses      83824    83871      +47
Impacted Files Coverage Δ
gramps/plugins/importer/importxml.py 65.11% <ø> (+0.16%) :arrow_up:
gramps/gen/lib/test/serialize_test.py 98.71% <ø> (+0.05%) :arrow_up:
gramps/plugins/lib/libgrampsxml.py 100% <ø> (ø) :arrow_up:
gramps/gen/db/generic.py 86.65% <ø> (ø) :arrow_up:
gramps/gui/editors/test/editreference_test.py 87.23% <ø> (ø) :arrow_up:
gramps/gen/utils/test/file_test.py 94.11% <ø> (ø) :arrow_up:
gramps/plugins/tool/finddupes.py 0% <ø> (ø) :arrow_up:
gramps/plugins/db/dbapi/dbapi.py 89.58% <ø> (ø) :arrow_up:
gramps/plugins/importer/test/importvcard_test.py 98.78% <ø> (+0.01%) :arrow_up:
gramps/gen/db/exceptions.py 48.61% <ø> (ø) :arrow_up:
... and 39 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update ce6f435...c67caf6. Read the comment docs.

codecov-io avatar Jan 23 '20 15:01 codecov-io

We should probably create a Uid class to encapsulate validation and checksum generation. I have added the start of such class which is as sub-class of UUID.

A Uid should serialize to the standard hyphenated format which is instantly recognisable as a uuid. We should also use this in the Gramps XML.

The class also validates the uuid and provides a convenience method for producing a Gedcom string.

The database upgrade is a problem. We don't want to duplicate code. I think we should seriously consider dropping support for BSDDB in the next major release, but we obviously need to discuss this on the list first.

Nick-Hall avatar Jan 23 '20 18:01 Nick-Hall

I think we need to first decide exactly what to do with 'illegal' uids found in a GEDCOM file. Tamura says to trash them. I infer that you might agree with this from your UID class proposal.

The GEDCOM L group says to export the first 36 characters of the UID unchanged (U7) http://wiki-de.genealogy.net/GEDCOM/UID-Tag.

As Tamura and GEDCOM L points out we have apps that export without checksum, with bad checksum, with '-' and sometimes '{}' characters (and no checksum) and even some that have non-hex characters in the string. If we take the hard line on this, we lose the ability to use these ?uids.

Since we could keep them as strings, and compare them that way, I don't really see a reason why not. I admit my first attempt is not quite in accordance with GEDCOM L in that if cleaned up chars look like a hex number, I only store the first 32 and regenerate the checksum later. But that is fixable if we want to follow GEDCOM L recommendation.

prculley avatar Jan 24 '20 15:01 prculley

@Nick-Hall I redid this based on your comments, but continue to store strings (in the Uid class) because I think that gives us better compatibility when working with GEDCOMs from other vendors, as per my comments above.

prculley avatar Feb 03 '20 22:02 prculley

@prculley I still think that we need to validate the uids. We shouldn't store just any old string. Otherwise there is no point of a checksum. Tamura Jones has the right idea here. We should also format the uid with hyphens the Gramps XML so that it is instantly recognisable as a uuid.

I also see a potential problem when two researchers who are collaborating decide to split a person. Both have the same person with the same uid. Then they both split the person, but one keeps the original uid on person A and creates a new uid for person B, the other does the opposite. Now we potentially have two people with the same uid!

Let me think about this further.

Nick-Hall avatar Feb 03 '20 23:02 Nick-Hall

Well if you want to consider that... Some thoughts... Moving/mis-identifying a person is an inherent issue with UIDs. What constitutes a change?

  • Changing the name to the right one for the position in the tree?
  • Changing the birth/death dates etc.?
  • moving the person to a different position in the tree?
  • What constitutes a move; different parents?, different spouse?, different children? Detected how; when a connection is removed?

This doesn't matter if you did not import the UID or export with a UID. An argument for not adding it until export.

We can leave it up to the user by putting the UIDs in the GUI somewhere they can be deleted (in Attributes, where it started out?), or have some sort of "Not the right person" button to nuke the UID.

Or we could help the user by popping a question if a person is changed in some way. Something like "Do your changes mean that this was not the correct person originally? If, so, the Unique ID should be changed... blah, blah"

prculley avatar Feb 04 '20 15:02 prculley

All this has previously been discussed. I have just been refreshing my memory.

Ideally the Gramps database handle should be a uuid. However, changing this would not be easy and it has been low on my list of priorities. Failing this, generating a uuid on object creation is the correct approach.

A uuid should be a 128 bit integer, not a string. We should use the existing python uuid.UUID class to validate and store values. Extra code should only be required for the PAF checksum.

Common representations that include hyphens, braces, upper and lower case characters should be accepted. Representations that cannot be converted to a 128 bit integer or fail the checksum should be rejected with an appropriate warning message. Tamura Jones advises: "Applications should reject all _UID values with invalid checksums, and all _UID values containing non-hexadecimal characters."

In Gramps XML, a uuid should be in hyphenated lowercase format.

Our uuid field should be compatible with database UUID data types such as in PostgreSQL and MongoDB.

As far as splitting people was concerned, I don't think we came up with a solution. One idea was to generate a new primary uuid for both people. All existing uuids should be retained and never deleted.

Nick-Hall avatar Feb 04 '20 16:02 Nick-Hall

@Nick-Hall I have updated the PR to your specifications, storing in the db the Uid objects which are a subclass of the UUID object, rather than simple strings.

The XML export now has the uid looking like the '-' separated groups.

GEDCOM import now checks the _UID line checksums on those records that have checksums (we still accept lines that look like the '-' separated UUID (which don't have checksums) or those with just hex chars but no checksum characters. Records that cannot be converted get rejected with out "not recognized" message.

The db upgrade code behaves like the GEDCOM import for _UID attributes; except if it cannot be converted, it stays an _UID attribute.

GEDCOM export uses the FamilySearch format.

There were a few issues along the way, like figuring out how to get the JSON serialize/un-serialize to work, and how to fix up the 'empty' object fixup in Check&Repair. But I think everything works now.

prculley avatar Feb 07 '20 22:02 prculley

This needs a rebase, but from what I can see it looks good. However, it would benefit from a UidBase class to avoid code duplication in the Person and Family classes.

Nick-Hall avatar Mar 07 '20 17:03 Nick-Hall

@Nick-Hall Changes as requested...

prculley avatar Mar 07 '20 22:03 prculley

@prculley The Uid base class should be in a separate file. It would also be nice to have a few unit test, but not essential. Otherwise I think we can merge this.

Nick-Hall avatar Apr 27 '20 11:04 Nick-Hall

I've created the uidbase file. I added a merge test that checks merging the uids, both different and same values. The standard import/export tests also test uid creation and checking. The coverage check indicates all but one of the lines in the uid.py (97%) and uidbase.py(100%) have been exercised. I don't have a check for failing checksum provided in the init code as a source in the standard files, but I tested that separately.

prculley avatar Apr 27 '20 15:04 prculley

Codecov Report

Merging #1005 into master will decrease coverage by 0.54%. The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1005      +/-   ##
==========================================
- Coverage   41.93%   41.38%   -0.55%     
==========================================
  Files        1066     1061       -5     
  Lines      145100   143261    -1839     
==========================================
- Hits        60848    59289    -1559     
+ Misses      84252    83972     -280     
Impacted Files Coverage Δ
gramps/cli/grampscli.py 61.74% <ø> (+8.90%) :arrow_up:
gramps/gen/config.py 91.55% <ø> (-0.34%) :arrow_down:
gramps/gen/datehandler/_date_de.py 70.08% <ø> (-2.38%) :arrow_down:
gramps/gen/db/dummydb.py 31.64% <ø> (-0.04%) :arrow_down:
gramps/gen/db/exceptions.py 48.61% <ø> (+4.36%) :arrow_up:
gramps/gen/db/generic.py 86.65% <ø> (-2.23%) :arrow_down:
gramps/gen/db/upgrade.py 0.00% <ø> (ø)
gramps/gen/filters/rules/citation/__init__.py 100.00% <ø> (ø)
gramps/gen/filters/rules/citation/_hasattribute.py 100.00% <ø> (ø)
gramps/gen/filters/rules/person/__init__.py 100.00% <ø> (ø)
... and 260 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update fd380e3...d2dbf22. Read the comment docs.

codecov-commenter avatar Sep 15 '20 21:09 codecov-commenter

This is great! I did not dig through previous discussions, but was any consideration given to adding UUID to the other primary objects?

The use case I have in mind is where all the objects in a Gedcom have a UUID. A Gedcom style import could then potentially merge, replace or discard all duplicate records during the import to provide something more akin to an incremental tree sync.

cdhorn avatar Oct 18 '20 14:10 cdhorn

Now that Gedcom 7.0 has been released, we should probably use this rather than the Tamura Jones specification.

The UID tag is present on all objects except for places, citations and tags. For places they use the EXID tag to specify an identifier in an external authority (e.g. GOV). A UID is not necessary for tags.

In the Gedcom 7.0 the UID is a text field, but a size is not specified. New identifiers should be created using uuid.uuid4(). The use of checksums is discouraged, so I don't see any benefit in using them.

Nick-Hall avatar Jul 21 '21 11:07 Nick-Hall

I wonder if we should consider making UIDs, REFN and other forms of EXID as well as place Geonames ID or GOVID a more generic ID structure or facility. Now that we are moving to SQL style dbs, we might want to just make these new table or tables, with pointer back to handle, rather than loading all this stuff into the primary objects.

This would make for easy lookup by ID.

Food for thought.

prculley avatar Jul 21 '21 16:07 prculley

Aren't our handles already globally unique even if they don't meet the spec for uuids?

jralls avatar Jul 21 '21 17:07 jralls

Effectively unique, neither ours nor UUID4 are guaranteed as I understand things.

Keeping them in a separate table probably makes sense if they are not referenced in routine tasks very often. If so it would be useful to be able to identify the source system or systems they originated from if they came from imported data.

cdhorn avatar Jul 28 '21 11:07 cdhorn

GEDCOM 7 calls for following EVERY external ID with a URL.

Perhaps we could be a bit smarter and use URLs in shared Citations (or Sources or Repositories) augmented with a 'Internet' object record (which is currently only available in Person, Places, Repository. So all FamilySearch Person could be the same Repository and might even have the ID's JSON structure definition cached in a special Note type.

https://gedcom.io/specifications/gedcom7-rc.pdf
EXID (External Identifier)
Page 63 of 92

and https://gedcom.io/specifications/FamilySearchGEDCOMv7.pdf Page 94 of 111

TYPE (Type) g7:EXID-TYPE The authority issuing the EXID (p.74) , represented as a URI. It is recommended that this be a URL. If the authority maintains stable URLs for each identifier it issues, it is recommended that the TYPE (p.93) payload be selected such that appending the EXID payload to it yields that URL. However, this is not required and a different URI for the set of issued identifiers may be used instead. Registered URIs are listed in exid-types.json, where fields include: “label”: a short string suitable for display in a user interface. “type”: The URI representing the authority issuing the EXID . “description”: A description of the meaning of the EXID . “contact”: A contact email address for the person or organization registering the URI. “change-controller”: The name or contact information for the person or organization authorized to update the registration. “fragment”: If present, indicates a short string that can be used as a label for a fragment identifier appended to the URI. If absent, indicates that fragment identifiers are not used with the URI. 0 @I1@ INDI 1 ORDN 2 TYPE Bishop • • • The FamilySearch GEDCOM Specification 7.0.12 Page 94 of 111 “reference”: A URL with more information about the meaning of the EXID . Such information should explain the uniqueness and expected durability of the identifier. Additional type URIs can be registered by filing a GitHub pull request.

emyoulation avatar Jun 03 '23 00:06 emyoulation