Final low and high values of the partitions

Open prajwal1210 opened this issue 5 years ago • 15 comments

So, I notice that in the code for the Mondrian, we only update the parent low and high values along a dimension when it is chosen as an allowed dimension. A few concerns regarding that:

  1. The dimension choice depends on the low and high values, so won't we be using stale, incorrect values to make that choice?
  2. Once a dimension cannot be split anymore, we no longer update its low and high values. However, a split on some other allowable dimension may change the range of this dimension as well.

prajwal1210 avatar Nov 19 '19 09:11 prajwal1210

Hi @prajwal1210

Sorry for the late reply. :)

Answers to your concerns:

  1. The basic idea of generalization is to replace real values with ranges, so that the published values are still correct (the real value lies inside the range), only less precise. This technique is not perfect. It doesn't work for all cases.
  2. Correct. Splitting on one dimension may change the range of another dimension, but that won't hurt data anonymization.
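
To make point 2 concrete, here is a minimal toy sketch (not the repository's code; `ranges` and `split_on_median` are hypothetical helpers written for this illustration). It splits a small partition on one dimension and shows that the tight range of the other dimension shrinks in each child. A child that kept the parent's range for the unsplit dimension, as the issue describes, would still publish a range containing every real value, only a wider one than necessary.

```python
def ranges(records):
    """Return the (low, high) range of every dimension over the given records."""
    dims = len(records[0])
    return [(min(r[d] for r in records), max(r[d] for r in records))
            for d in range(dims)]


def split_on_median(records, dim):
    """Split records into two halves around the median of one dimension."""
    ordered = sorted(records, key=lambda r: r[dim])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


# Toy data: (age, hours_per_week). The two attributes are correlated.
records = [(20, 10), (25, 15), (30, 20), (50, 40), (55, 45), (60, 50)]

parent_ranges = ranges(records)                # [(20, 60), (10, 50)]
left, right = split_on_median(records, dim=0)  # split on age only

left_ranges = ranges(left)    # hours range shrinks to (10, 20)
right_ranges = ranges(right)  # hours range shrinks to (40, 50)

print("parent:", parent_ranges)
print("left:  ", left_ranges)
print("right: ", right_ranges)
```

The parent's hours range (10, 50) remains a valid generalization for both children, which matches the answer above: stale ranges cost precision, not correctness.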

Have a nice day! Qiyuan

qiyuangong avatar May 16 '20 13:05 qiyuangong

Hello, I just wanted to ask what data exactly gets anonymized. I am running the code following the instructions, and I can't quite understand what goes into anonymized.data. I am sorry if this sounds like a "stupid" question, but I am new to this.

Thank you!

3ndri avatar Dec 24 '20 19:12 3ndri

Hi @3ndri. There are no stupid questions, only stupid answers.

In short, identifiers (such as phone numbers) should be removed, while QIDs (quasi-identifiers, such as age and gender) are anonymized by k-anonymity algorithms (e.g., Mondrian). All other attributes, including sensitive values, remain untouched.
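
The three attribute roles can be sketched as follows. This is only an illustration under assumed names: `anonymize_record`, the `phone` column, and the generalized values are hypothetical and are not taken from the repository or from adult.data.

```python
def anonymize_record(record, identifiers, generalized_qids):
    """Drop identifiers, replace QIDs with generalized values, keep the rest."""
    out = {}
    for name, value in record.items():
        if name in identifiers:
            continue  # direct identifiers are removed entirely
        if name in generalized_qids:
            out[name] = generalized_qids[name]  # QID replaced by its group's value
        else:
            out[name] = value  # sensitive and other attributes stay untouched
    return out


# Hypothetical record; the generalized values would come from a
# k-anonymity algorithm such as Mondrian, they are hard-coded here.
record = {"phone": "555-0100", "age": 37, "sex": "Male", "income": "<=50K"}
result = anonymize_record(
    record,
    identifiers={"phone"},
    generalized_qids={"age": "30-40", "sex": "*"},
)
print(result)
```

In this toy record only `income` passes through unchanged, because it plays the role of the sensitive attribute.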

Hope this information can help you. :)

qiyuangong avatar Dec 26 '20 10:12 qiyuangong

But which column is the phone number in adult.data?

3ndri avatar Dec 26 '20 20:12 3ndri

Also, the output is the same whether I run it with k=10 or k=20. (Screenshot from 2020-12-26 21-30-07)

3ndri avatar Dec 26 '20 20:12 3ndri

There is no phone number column in adult.data. IDs (phone numbers, personal IDs, or others) were already removed before the dataset was made available.

qiyuangong avatar Dec 27 '20 10:12 qiyuangong

No. The outputs for k=10 and k=20 differ in NCP (Normalized Certainty Penalty), which measures information loss (a higher NCP means more loss). Please read README.md and check the output directory.
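
As a rough illustration of why NCP grows with k, here is a minimal sketch for a single numeric attribute. It uses the usual definition (range width of a group divided by the width of the whole domain) and a record-weighted average expressed as a percentage. The function names and the toy ages are assumptions made for this example; how the repository aggregates NCP across attributes is not shown in this thread.

```python
def numeric_ncp(group, domain_low, domain_high):
    """NCP of one numeric attribute for one group: range width / domain width."""
    return (max(group) - min(group)) / (domain_high - domain_low)


def average_ncp(groups, domain_low, domain_high):
    """Record-weighted average NCP over all groups, as a percentage."""
    total = sum(len(g) for g in groups)
    weighted = sum(len(g) * numeric_ncp(g, domain_low, domain_high)
                   for g in groups)
    return 100.0 * weighted / total


# Toy ages, already sorted; the domain is taken from the data itself.
ages = [20, 22, 25, 27, 40, 43, 46, 49]
domain_low, domain_high = min(ages), max(ages)

small_groups = [ages[0:2], ages[2:4], ages[4:6], ages[6:8]]  # groups of 2
large_groups = [ages[0:4], ages[4:8]]                        # groups of 4

ncp_small = average_ncp(small_groups, domain_low, domain_high)
ncp_large = average_ncp(large_groups, domain_low, domain_high)

print(f"groups of 2: NCP = {ncp_small:.2f}%")
print(f"groups of 4: NCP = {ncp_large:.2f}%")
```

Larger groups cover wider ranges, so the second partitioning loses more information even though both would print similar-looking generalized records.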

qiyuangong avatar Dec 27 '20 10:12 qiyuangong

But what does the output over K=10 mean? The one which reads:

[[], ['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov', 'Local-gov', 'Self-emp-inc', 'Without-pay'], [], ['Never-married', 'Married-civ-spouse', 'Divorced', 'Married-spouse-absent', 'Separated', 'Married-AF-spouse', 'Widowed'], ['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners', 'Prof-specialty', 'Other-service', 'Sales', 'Transport-moving', 'Farming-fishing', 'Machine-op-inspct', 'Tech-support', 'Craft-repair', 'Protective-serv', 'Armed-Forces', 'Priv-house-serv'], ['White', 'Black', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other'], ['Male', 'Female'], ['United-States', 'Cuba', 'Jamaica', 'India', 'Mexico', 'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany', 'Iran', 'Philippines', 'Poland', 'Columbia', 'Cambodia', 'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal', 'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala', 'Italy', 'China', 'South', 'Japan', 'Yugoslavia', 'Peru', 'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago', 'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary', 'Holand-Netherlands']]

3ndri avatar Dec 27 '20 10:12 3ndri

Oh I get it now, those are the quasi-identifiers

3ndri avatar Dec 27 '20 13:12 3ndri

I have a question: which dataset does this program use?

Arigato97 avatar May 12 '21 06:05 Arigato97

Could you help me annotate the program? As a novice, I don't understand it.

Arigato97 avatar May 12 '21 06:05 Arigato97

Hi @Arigato97

This program uses the Adult dataset (https://github.com/qiyuangong/Mondrian/blob/master/data/adult.data) by default, and can be switched to the INFORMS dataset (https://github.com/qiyuangong/Mondrian/blob/master/data/conditions.csv and https://github.com/qiyuangong/Mondrian/blob/master/data/demographics.csv).

qiyuangong avatar May 12 '21 07:05 qiyuangong

Could you add a few more comments to the program? It is a little difficult for me. Please help.

Arigato97 avatar May 12 '21 07:05 Arigato97

I can't follow some parts of the program and am not sure what they do. Could you add more comments? Thanks.

Arigato97 avatar May 12 '21 07:05 Arigato97

Sorry, no more comments or features will be added.

qiyuangong avatar May 13 '21 13:05 qiyuangong