Mondrian Final low and high values of the partitions

I noticed that in the Mondrian code, we only update the parent's low and high values along a dimension when that dimension is chosen as an allowed dimension. I have two concerns about that:

1. The choice of dimension depends on the low and high values, so won't we be using stale, incorrect values to make that choice?
2. Once a dimension can no longer be split, we stop updating its low and high values. However, a split on some other allowable dimension may change the range of this dimension as well.

Nov 19 '19 09:11 prajwal1210

Hi @prajwal1210

Sorry for the late reply. :)

Answers to your concerns:

1. The basic idea of generalization is to replace exact values with ranges, so that the published result stays truthful (the range still contains the real value) rather than becoming wrong. The technique is not perfect and does not work in every case.
2. Correct. Splitting on one dimension may change the range of another dimension, but that does not harm the anonymization.
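To illustrate the second point, here is a minimal sketch (not the repository's code; the records, the `tight_bounds` and `covers` helpers, and the split point are all hypothetical). It shows that a range inherited from the parent is wider than necessary after a split on another dimension, yet still contains every record, so it remains a valid generalization.

```python
def tight_bounds(records, dim):
    """Smallest (low, high) interval covering every record on one dimension."""
    values = [r[dim] for r in records]
    return min(values), max(values)


def covers(records, dim, low, high):
    """True if the published range [low, high] contains every record's value."""
    return all(low <= r[dim] <= high for r in records)


# Hypothetical partition with two numeric QIDs: (age, hours_per_week).
parent = [(25, 40), (31, 20), (47, 45), (52, 60)]
parent_hours = tight_bounds(parent, 1)

# Split on age (dimension 0); hours_per_week (dimension 1) is not split.
left = [r for r in parent if r[0] <= 31]
right = [r for r in parent if r[0] > 31]

# The children's true hours ranges are narrower than the parent's ...
left_hours = tight_bounds(left, 1)
right_hours = tight_bounds(right, 1)

# ... but the inherited (stale) parent range still covers every record,
# so publishing it is less precise, not incorrect.
stale_ok = covers(left, 1, *parent_hours) and covers(right, 1, *parent_hours)
```

The cost of keeping the stale range is extra information loss, not a weaker anonymity guarantee, since the group sizes are unaffected.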

Have a nice day! Qiyuan

May 16 '20 13:05 qiyuangong

Hello, I just wanted to ask what data exactly gets anonymized. I am running the code following the instructions and I can't quite understand what goes into anonymized.data. I am sorry if this sounds like a "stupid" question, but I am new to this.

Thank you!

Dec 24 '20 19:12 3ndri

Hi @3ndri. There are no stupid questions, only stupid answers.

In short, identifiers (such as a phone number) should be removed. QIDs (quasi-identifiers, such as age, gender, etc.) are anonymized by k-anonymity algorithms (e.g., Mondrian or others). All other attributes, including sensitive values, remain untouched.
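The three attribute roles can be sketched as follows. This is an illustrative assumption, not the repository's code: the record, the column names, and the `generalize_age` and `anonymize_record` helpers are hypothetical, and the fixed-width age bins stand in for the data-driven ranges that Mondrian would actually compute.

```python
def generalize_age(age, width=10):
    """Replace an exact age with a fixed-width range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymize_record(record, identifiers, generalizers):
    """Drop identifiers, generalize QIDs, copy everything else unchanged."""
    out = {}
    for name, value in record.items():
        if name in identifiers:
            continue  # identifiers are removed outright
        if name in generalizers:
            out[name] = generalizers[name](value)  # QIDs are generalized
        else:
            out[name] = value  # sensitive and other values stay untouched
    return out


# Hypothetical record; "phone" is the identifier, "income" the sensitive value.
record = {"phone": "555-0100", "age": 34, "sex": "Male", "income": "<=50K"}
result = anonymize_record(
    record,
    identifiers={"phone"},
    generalizers={"age": generalize_age, "sex": lambda _: "*"},
)
# result == {"age": "30-39", "sex": "*", "income": "<=50K"}
```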

Hope this information can help you. :)

Dec 26 '20 10:12 qiyuangong

But which column is the phone number in adult.data?

Dec 26 '20 20:12 3ndri

Also, the output is the same whether I run it with k=10 or k=20. [Screenshot from 2020-12-26 21-30-07]

Dec 26 '20 20:12 3ndri

But which column is the phone number in adult.data?

IDs (phone number, personal ID, or others) were already removed before the dataset was made available.

Dec 27 '20 10:12 qiyuangong

Also, the output is the same whether I run it with k=10 or k=20.

No. They differ in NCP (Normalized Certainty Penalty), which measures information loss (a higher NCP means more loss). Please read README.md and check the output directory.
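As a rough illustration of why a larger k tends to give a higher NCP, here is a minimal sketch for a single numeric attribute. It uses one common formulation (range width of the group divided by the width of the whole domain, averaged over records); the repository's exact computation may differ, and the function names and the age groupings below are hypothetical.

```python
def partition_ncp(partition, global_low, global_high):
    """NCP of one numeric attribute in one group: range width / domain width."""
    return (max(partition) - min(partition)) / (global_high - global_low)


def dataset_ncp(partitions):
    """Record-weighted average NCP over all groups, as a percentage."""
    all_values = [v for p in partitions for v in p]
    low, high = min(all_values), max(all_values)
    total = sum(len(p) * partition_ncp(p, low, high) for p in partitions)
    return 100.0 * total / len(all_values)


# Hypothetical groupings of the same eight ages:
# smaller groups (k=2) versus larger groups (k=4).
groups_k2 = [[20, 22], [25, 30], [38, 40], [52, 60]]
groups_k4 = [[20, 22, 25, 30], [38, 40, 52, 60]]

ncp_k2 = dataset_ncp(groups_k2)
ncp_k4 = dataset_ncp(groups_k4)

# Larger groups span wider ranges, so more information is lost.
print(f"k=2: {ncp_k2:.2f}%  k=4: {ncp_k4:.2f}%")
```

In this toy example the k=4 grouping loses more information than the k=2 grouping, even though both outputs have the same shape.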

Dec 27 '20 10:12 qiyuangong

But what does the output over K=10 mean? The one which reads:

[[], ['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov', 'Local-gov', 'Self-emp-inc', 'Without-pay'], [], ['Never-married', 'Married-civ-spouse', 'Divorced', 'Married-spouse-absent', 'Separated', 'Married-AF-spouse', 'Widowed'], ['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners', 'Prof-specialty', 'Other-service', 'Sales', 'Transport-moving', 'Farming-fishing', 'Machine-op-inspct', 'Tech-support', 'Craft-repair', 'Protective-serv', 'Armed-Forces', 'Priv-house-serv'], ['White', 'Black', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other'], ['Male', 'Female'], ['United-States', 'Cuba', 'Jamaica', 'India', 'Mexico', 'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany', 'Iran', 'Philippines', 'Poland', 'Columbia', 'Cambodia', 'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal', 'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala', 'Italy', 'China', 'South', 'Japan', 'Yugoslavia', 'Peru', 'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago', 'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary', 'Holand-Netherlands']]

Dec 27 '20 10:12 3ndri

Oh I get it now, those are the quasi-identifiers

Dec 27 '20 13:12 3ndri

I have a question: which dataset does this program use?

May 12 '21 06:05 Arigato97

Could you please help me annotate the program? As a novice, I don't understand it.

May 12 '21 06:05 Arigato97

Hi @Arigato97

This program uses the adult dataset (https://github.com/qiyuangong/Mondrian/blob/master/data/adult.data) by default, and can be switched to the INFORMS dataset (https://github.com/qiyuangong/Mondrian/blob/master/data/conditions.csv and https://github.com/qiyuangong/Mondrian/blob/master/data/demographics.csv).

May 12 '21 07:05 qiyuangong

Could you add a few more comments to the program? It is a little difficult for me. Please help.

May 12 '21 07:05 Arigato97

I can't follow some parts of the program and am not sure what they do. Could you add more comments? Thanks.

May 12 '21 07:05 Arigato97

Sorry, no more comments or features will be added.

May 13 '21 13:05 qiyuangong
