charlatan
Add remaining elements of protected health information
Many of these are included already, but the full list is here:
https://medschool.duke.edu/research/clinical-and-translational-research/duke-office-clinical-research/irb-and-institutional-14
- would be nice to add:
- random street names
- random zip code
- random city name
- random email, perhaps related to name
- random county name
- random SSN
- Name
- Address (all geographic subdivisions smaller than state, including street address, city, county, and zip code)
- All elements (except years) of dates related to an individual (including birthdate, admission date, discharge date, date of death, and exact age if over 89)
- Telephone numbers
- Fax number
- Email address
- Social Security Number
- Medical record number
- Health plan beneficiary number
- Account number
- Certificate or licence number
- Any vehicle or other device serial number
- Web URL
- Internet Protocol (IP) Address
- Finger or voice print
- Photographic image - Photographic images are not limited to images of the face.
- Any other characteristic that could uniquely identify the individual
thanks @higgi13425
Is the idea that people managing data under HIPAA will replace real data with fake data?
Exactly. To deidentify a clinical dataset:
- zipcode replaced with deid_zipcode
- name replaced with deid_name
- street with deid_street
- dob with deid_dob
- etc.

Ideally, the date of birth (dob) would be the index date, and could be assigned a random date in the year 1900. Then all other dates in the dataset could be adjusted relative to deid_dob, to preserve the sequence of events and relative time, while keeping the data deidentified. This would be really helpful for folks like me with HIPAA issues with PHI-containing datasets.

Even cooler - a function to
1) add a deid_x version of each PHI variable in the dataset, then
2) split the dataset into two - one with PHI plus a unique key (stored securely), and a second with the unique key plus the deid_x versions of the PHI data (plus all the other data).

Then you could share the second dataframe (on GitHub, etc.), but if you really needed to, you could merge to re-identify. Thanks for considering it.
Peter
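The two-table idea could look roughly like the following in base R. This is only a sketch: `deid_split`, `deid_key`, and the `fake` list of generator functions are hypothetical names invented for illustration, and the patient rows are made up. Generators from charlatan could presumably be plugged in as the `fake` functions, but that is not shown here.

```r
# Hypothetical helper (not part of charlatan): split a data frame into a
# PHI table and a shareable table that are linked by a random key.
# `phi_cols` names the PHI columns; `fake` is a named list of functions,
# one per PHI column, each taking n and returning n fake values.
deid_split <- function(df, phi_cols, fake) {
  stopifnot(all(phi_cols %in% names(df)), all(phi_cols %in% names(fake)))
  n <- nrow(df)
  # Shuffled key, so the key itself does not reveal the original row order
  key <- sample.int(n)
  # Start from the non-PHI columns, then add a deid_x column per PHI column
  deid <- df[, setdiff(names(df), phi_cols), drop = FALSE]
  for (col in phi_cols) {
    deid[[paste0("deid_", col)]] <- fake[[col]](n)
  }
  list(
    phi = data.frame(deid_key = key, df[, phi_cols, drop = FALSE]),
    shareable = data.frame(deid_key = key, deid)
  )
}

# Made-up example data
patients <- data.frame(
  name = c("Ann Lee", "Bo Chan", "Cy Diaz"),
  zipcode = c("27701", "48104", "98101"),
  hemoglobin = c(13.2, 11.8, 14.1),
  stringsAsFactors = FALSE
)

# Placeholder generators; these are assumptions, not charlatan calls
fake <- list(
  name = function(n) paste("Patient", sample.int(1000, n)),
  zipcode = function(n) sprintf("%05d", sample.int(99999, n))
)

parts <- deid_split(patients, c("name", "zipcode"), fake)

# Only someone holding parts$phi can merge back to re-identify
reidentified <- merge(parts$phi, parts$shareable, by = "deid_key")
```

`parts$shareable` holds the key, the deid_x columns, and the non-PHI data; `parts$phi` is the table that would be stored securely.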
thanks @higgi13425
done already
- [x] Telephone numbers - done, see `PhoneNumberProvider` / `ch_phone_number`
- [x] Fax number (done I assume, or are there different fax number formats?)
- [x] street names, done, see `street_name` in `AddressProvider`
- [x] zip code, done, see `postcode` in `AddressProvider`
- [x] city name, done, see `city` in `AddressProvider`
not done, questions
- [ ] birthdate is just a date, see `DateTimeProvider$new()$date("%Y-%m-%d")`. We don't have a way to pick a date within a certain range of years, can look into that
- [ ] county name - are you interested in US counties only?
For the below, I assume there's no standard format to this? is it just a string of letters and numbers? If so, we don't need specialized functions for each one
- [ ] Medical record number
- [ ] Health plan beneficiary number
- [ ] Account number
- [ ] Certificate or licence number
- [ ] Any vehicle or other device serial number
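If these identifiers really are just strings of letters and digits, one generic generator might cover all of them. A minimal base R sketch, assuming no checksum or fixed layout is required; `random_id` is a hypothetical name and not an existing charlatan function:

```r
# Hypothetical helper (not part of charlatan): n random identifiers,
# each `len` characters long, drawn with replacement from `chars`.
random_id <- function(n, len = 10, chars = c(LETTERS, 0:9)) {
  vapply(
    seq_len(n),
    function(i) paste(sample(chars, len, replace = TRUE), collapse = ""),
    character(1)
  )
}

mrn <- random_id(5, len = 8)                    # letters and digits
account <- random_id(5, len = 12, chars = 0:9)  # digits only
```

Nothing here guarantees uniqueness, so duplicates would need to be checked for if the values are meant to act as keys.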
not done, can do
- [x] email address - can do that, see `InternetProvider$new()$email()`
- [ ] SSN - can do that
- [x] Web URL, can do that, see `InternetProvider$new()$url()`
- [x] Internet Protocol (IP) Address, can do that, see `InternetProvider$new()$ipv4()`
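For the SSN item, a base R sketch of what a generator might do. US SSNs are written AAA-GG-SSSS, and area numbers 000, 666, and 900-999, group 00, and serial 0000 are never issued, so the sketch avoids them. `random_ssn` is a hypothetical name, not an existing charlatan function, and a generated value could still coincide with a real person's number.

```r
# Hypothetical helper (not part of charlatan): n strings in AAA-GG-SSSS form
random_ssn <- function(n) {
  area <- sample(setdiff(1:899, 666), n, replace = TRUE)  # skips 000, 666, 900-999
  group <- sample(1:99, n, replace = TRUE)                 # skips 00
  serial <- sample(1:9999, n, replace = TRUE)              # skips 0000
  sprintf("%03d-%02d-%04d", area, group, serial)
}

ssn <- random_ssn(5)
```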
your function idea is interesting. i'll open a new issue for that so this issue can focus on the data types
- birthdate - the idea was to randomly select a day/month, and place the date of birth in a year that clearly is not the real date of birth, so that there is no confusion later between the true dob and deid_dob. 1900 is a reasonable year, in that there are no people born in 1900 still alive.
- county name - for my purposes, US counties only. I could imagine that if this becomes popular, the equivalent in other countries would be worthwhile.
- I agree, most of the numbers can already be done. fax number ~ phone number
This sounds promising! Peter
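The index-date idea (random deid_dob in 1900, all other dates shifted by the same amount) can be sketched in base R as below. `shift_dates` and the column names are hypothetical, and the two rows of data are made up for illustration.

```r
# Hypothetical helper (not part of charlatan): replace the date of birth
# with a random day in 1900 and shift the other date columns by the same
# per-row offset, so intervals between events are preserved.
shift_dates <- function(df, dob_col, date_cols) {
  n <- nrow(df)
  # 1900 was not a leap year, so day offsets 0..364 cover Jan 1 to Dec 31
  deid_dob <- as.Date("1900-01-01") + sample(0:364, n, replace = TRUE)
  # Whole days to add to each row's dates (negative for real births after 1900)
  offset <- as.numeric(deid_dob - df[[dob_col]])
  # Keep non-date columns, drop the real dates, add the shifted versions
  out <- df[, setdiff(names(df), c(dob_col, date_cols)), drop = FALSE]
  out$deid_dob <- deid_dob
  for (col in date_cols) {
    out[[paste0("deid_", col)]] <- df[[col]] + offset
  }
  out
}

# Made-up example data
visits <- data.frame(
  id = 1:2,
  dob = as.Date(c("1950-03-14", "1987-11-02")),
  admission = as.Date(c("2019-05-01", "2020-01-15")),
  discharge = as.Date(c("2019-05-04", "2020-01-20"))
)

shifted <- shift_dates(visits, "dob", c("admission", "discharge"))
```

Because intervals are preserved, age at each event is preserved too, which matters for the "exact age if over 89" element in the PHI list above.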
- DOB: okay, i see now what you mean. can do it like
z <- DateTimeProvider$new()
z$date_time_between("1900-01-01", "1900-12-31")
- counties: thanks, my feeling is to only do US counties for now