ga4gh-server

Remove data from IDs

Open david4096 opened this issue 8 years ago • 3 comments

Given that we are not using randomly generated identifiers, we need to take care not to leak private data. For example, a read group might be named with a patient's name. Even if a client stores only IDs, it may unintentionally be hosting private health data. This occurred to me while discussing the "count everything" grant, since a common method of performing a count is measuring the number of unique identifiers.

ZGF0YXNldDE6YnJ1Y2V3YXluZWlzYmF0bWFu
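To make the leak concrete, the example ID above can be decoded with a few lines of Python. This is a minimal sketch that assumes the ID is URL-safe base64 with its trailing `=` padding stripped; the helper name `decode_id` is hypothetical and is not taken from the server code.

```python
import base64


def decode_id(encoded: str) -> str:
    """Decode a base64 ID, restoring any stripped '=' padding first.

    Assumption: IDs are URL-safe base64 of a UTF-8 string.
    """
    padded = encoded + "=" * (-len(encoded) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


example_id = "ZGF0YXNldDE6YnJ1Y2V3YXluZWlzYmF0bWFu"
decoded = decode_id(example_id)
print(decoded)  # dataset1:brucewayneisbatman
```

Anyone holding only the ID can recover the dataset name and the read group name, so the encoding obscures the strings but does not protect them.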

david4096 avatar Mar 17 '16 20:03 david4096

This is a valid point. Despite appearing to "only" encode a file name, there are all kinds of unintended ways data can leak. With dbGaP protected data, much of the metadata is public, but the file names, which are created by the data submitter, are considered protected, because there have been cases where someone put protected data in a file name.

diekhans avatar Mar 17 '16 20:03 diekhans

Good point. We should definitely bear this in mind.

jeromekelleher avatar Mar 21 '16 16:03 jeromekelleher

The requirement of being able to randomly access a variant by ID imposes a lot of assumptions about how IDs are structured. To close this issue, we will probably have to address it at the schemas level.

Closing this can be done by using an identifier scheme that does not encode strings. It will still have to be a compound ID if we don't address it at the schemas level. However, this ID can be a concatenation of integers and doesn't need to encode strings, as the current ID scheme does.
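One way the integer-based compound ID described above might look is sketched below. This is an illustration under stated assumptions, not the server's implementation: the `IntegerIdRegistry` class, the `:` separator, and the function names are all hypothetical, and the name-to-integer mapping is assumed to live only on the server.

```python
import base64


class IntegerIdRegistry:
    """Hypothetical server-side table mapping names to opaque integer keys."""

    def __init__(self):
        self._keys = {}

    def key_for(self, name: str) -> int:
        # Assign the next integer the first time a name is seen.
        return self._keys.setdefault(name, len(self._keys) + 1)


def make_compound_id(*int_keys: int) -> str:
    """Join integer keys with ':' and base64-encode; no strings are embedded."""
    raw = ":".join(str(k) for k in int_keys).encode("ascii")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def parse_compound_id(compound_id: str) -> tuple:
    """Recover the integer keys from a compound ID."""
    padded = compound_id + "=" * (-len(compound_id) % 4)
    raw = base64.urlsafe_b64decode(padded).decode("ascii")
    return tuple(int(part) for part in raw.split(":"))


datasets = IntegerIdRegistry()
read_groups = IntegerIdRegistry()
compound_id = make_compound_id(
    datasets.key_for("dataset1"),
    read_groups.key_for("brucewayneisbatman"),
)
keys = parse_compound_id(compound_id)
```

Decoding such an ID yields only the integers, so a client that stores IDs never holds the submitter-chosen names. The ID stays compound, which preserves random access by ID without a schema change.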

david4096 avatar Mar 09 '17 03:03 david4096