
Generate default strategy file to manage changing database schemas and not fail due to schema changes

armorKing11 opened this issue 2 years ago · 4 comments

Is your feature request related to a problem? Please describe. Most systems have a database schema that changes as features are added to the application. This means the strategy files must be manually updated each time columns with sensitive data are added, changed, or removed. If a manual change to the strategy files is not made, the data masking tool can break because of the schema change. I tested this by adding an unknown table to a strategy file and it failed:

ERROR 1146 (42S02) at line 1: Table 'services.userscool' doesn't exist

where userscool is a table that does not exist in the services db

Describe the solution you'd like Temporary solution:

  1. Possibly not throw an exception if it cannot find a table/column. This would let the tool anonymize whatever valid tables and columns do exist, and then give a report at the end listing the tables/columns that could not be anonymized because they were not found (a rough sketch of this follows below).
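A minimal sketch of that skip-and-report idea, purely for illustration (this is not pynonymizer's actual code; the mysql-connector error handling and the table_statements mapping are assumptions):

# Illustrative only: anonymize what exists, collect what doesn't, report at the end.
import mysql.connector
from mysql.connector import errorcode

def anonymize_tables(conn, table_statements):
    """table_statements maps table name -> anonymization UPDATE statement."""
    skipped = []
    cursor = conn.cursor()
    for table, statement in table_statements.items():
        try:
            cursor.execute(statement)
            conn.commit()
        except mysql.connector.Error as err:
            # ER_NO_SUCH_TABLE (1146) / ER_BAD_FIELD_ERROR (1054): record and continue
            if err.errno in (errorcode.ER_NO_SUCH_TABLE, errorcode.ER_BAD_FIELD_ERROR):
                skipped.append((table, str(err)))
            else:
                raise
    cursor.close()
    # Surface these at the end instead of bailing out half-way through the run.
    return skipped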

Permanent solution:

a. Add the ability for pynonymizer to generate a default strategy file for the database so that any new schema changes will be captured in that default strategy file. It could also check an exclusion file that lists tables/columns to be ignored.

b. Compare the default strategy file with the existing strategy file and reconcile the difference between the 2, treating the default strategy file as the source of truth.



armorKing11 · Jan 26 '22

I feel I have to start this reply off by saying that anonymization is a fundamentally manual process and must always have some human oversight, especially where schema is concerned. Schema change is just as likely to silently leak personal information over time (e.g. if the tool "forgets" to anonymize something) as it is to fatally break the anonymization process. Like any application, your anonymization scripts are dependent on your database's version, which is normally managed via successive migrations.

With that out of the way, thank you for your suggestions, I find them really insightful. I'll give you my thoughts below:

Possibly not throw an exception if it cannot find a table/column

I think this is a good idea and I'd like to pursue it as a feature! I've added #96 to look at this. Since pynonymizer isn't an atomic operation it makes sense to me that it shouldn't just bail out half-way through the anonymization operation.

Add the ability for pynonymizer to generate a default strategy file for the database so that any new schema changes will be captured in that default strategy file. It could also check an exclusion file that lists tables/columns to be ignored.

I think that having introspection tooling for assisting setup is a good idea! It would also help with documentation, as it could show a good starting point for a strategyfile that isn't just a generic example, e.g. a new CLI tool: pynonymizer-inspect > my_db.yml
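Purely as an illustration of that idea (pynonymizer-inspect does not exist, and the output shape below is an assumption rather than pynonymizer's documented strategyfile schema), such a helper might just walk information_schema and print a skeleton for a human to review:

# Hypothetical introspection helper: emit a strategyfile skeleton from the live schema.
import sys
import mysql.connector
import yaml

def build_skeleton(conn, schema):
    cursor = conn.cursor()
    cursor.execute(
        "SELECT table_name, column_name FROM information_schema.columns "
        "WHERE table_schema = %s ORDER BY table_name, ordinal_position",
        (schema,),
    )
    tables = {}
    for table, column in cursor.fetchall():
        # Default every column to a placeholder; a human decides what actually
        # needs anonymizing and which update strategy to use.
        tables.setdefault(table, {"columns": {}})["columns"][column] = "TODO"
    cursor.close()
    return {"tables": tables}

if __name__ == "__main__":
    conn = mysql.connector.connect(
        host="localhost", user="root", password="", database="services"
    )
    yaml.safe_dump(build_skeleton(conn, "services"), sys.stdout, sort_keys=False)

The TODO placeholders are deliberate: the skeleton only enumerates what exists, it doesn't decide what is sensitive.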

The biggest caveat for me is that this would not be an exhaustive tool. If we implement this I want to make it very clear to users that it is not exhaustive and cannot be relied upon to detect the addition of personal data columns or tables in your database. Only a human with intimate domain knowledge can exhaustively check the database for an anonymization process. For this reason I have to veto your 2nd idea, "Compare default strategy file with existing strategy file and reconcile the difference between the 2", because it implies a lifecycle for pynonymizer that goes beyond a DSL for helping a human write anonymization routines.

And of course, it's just config, so if that is something people are comfortable with and interested in running with, the strategyfile YAML schema could simply be generated by other tooling or other projects in future.
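For example (hand-written illustration only; the exact keys and fake types should be checked against the pynonymizer documentation), external tooling could emit a strategyfile of roughly this shape for a human to review and commit:

tables:
  users:
    columns:
      email: unique_email
      first_name: first_name
      last_name: last_name
      last_login_ip: ipv4_public
  audit_log: truncate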

rwnx · Jan 27 '22

Thanks for your response @rwnx. If the pynonymizer module can support introspection tooling to create a default strategy file, that would make it easier to use with, say, a database that has a large number of tables, since it provides an initial strategy file to work with. It could also pick up any schema changes without having to find out only when you run the script.

Thank you for your quick response!!

armorKing11 · Jan 27 '22

No worries - as I said, I think that generation/initial strategyfile tooling is a good idea, but I'm not sure I see the use-case for "pick up any schema changes without having to find out when you run the script", for the reasons outlined above.

rwnx · Jan 27 '22

Hi @rwnx, a common use case is simply that database schemas don't remain static; they change frequently as features are added. In the current world of automated infrastructure pipelines, a production DB server may need its databases anonymized on a schedule (possibly hourly or daily), since GDPR/Data Protection compliance is a constant process. Even if Issue #96 were implemented, there is still a gap: if a run no longer fails on a schema change, that implies there may be tables/columns with sensitive data that were not anonymized and will become available in the test/dev environment, i.e., sensitive data has leaked into the test/dev DB environment.

I would say, as you mentioned above, that whoever uses data masking tools should have domain knowledge of the DB schemas. But we are human (which is why we move towards extensive automation through CI/CD systems, to help us even when we are asleep or not alert enough to keep track of everything, unlike a system that does not have those constraints), and human error can cause production data leakage simply because someone forgets to update one of their many strategy files.

So if a data masking tool is smart enough to at least be aware of a schema change and alert the automated CI/CD system that runs it, causing it to stop, instead of throwing an exception and failing (preventing anonymization) or, worse, causing prod data leakage, that would go a long way to help human operators.
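For what it's worth, that kind of guard can be built as custom tooling outside the anonymizer today: a pre-check step in the CI/CD pipeline that diffs the strategy file against information_schema and stops the run on any drift. A minimal sketch, assuming a MySQL backend and a strategyfile with the tables/columns layout shown earlier (not a pynonymizer feature):

# Hypothetical CI/CD pre-check: fail the pipeline if the strategy file and the
# live schema have drifted apart, so a human reviews the change before any
# anonymized dump is produced.
import sys
import mysql.connector
import yaml

def live_columns(conn, schema):
    cursor = conn.cursor()
    cursor.execute(
        "SELECT table_name, column_name FROM information_schema.columns "
        "WHERE table_schema = %s",
        (schema,),
    )
    cols = {(table, column) for table, column in cursor.fetchall()}
    cursor.close()
    return cols

def strategy_columns(path):
    with open(path) as f:
        strategy = yaml.safe_load(f)
    cols = set()
    for table, spec in strategy.get("tables", {}).items():
        if isinstance(spec, dict):  # skip whole-table strategies like truncate
            for column in spec.get("columns", {}):
                cols.add((table, column))
    return cols

if __name__ == "__main__":
    conn = mysql.connector.connect(
        host="localhost", user="root", password="", database="services"
    )
    db_cols = live_columns(conn, "services")
    strat_cols = strategy_columns("my_db.yml")
    missing_in_db = strat_cols - db_cols  # strategy references dropped tables/columns
    unreviewed = db_cols - strat_cols     # new columns nobody has reviewed yet
    if missing_in_db or unreviewed:
        print("Schema drift detected.")
        print("In strategy but not in DB:", sorted(missing_in_db))
        print("In DB but not in strategy:", sorted(unreviewed))
        sys.exit(1)  # stop the pipeline so a human can review the schema change

In practice the "in DB but not in strategy" side would need the exclusion list mentioned earlier in this thread, so that deliberately untouched columns don't trip the check on every run.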

My analysis of open source and commercial data masking tools indicates that most tools don't have this ability, since they assume that maintenance of whatever config the tool operates on lies with the human operators. But, in general, production DB schemas are a source of truth (unless someone has introduced a bug into them); if commercial and open source tools could use that as a reference and alert on a change in the state of the production DB schema with respect to what the anonymization tool is expecting, that could prevent data leakage.

NOTE: This is not a critique of your work, more my disappointment that most data masking tools (a few have it) don't have this capability, which means other tooling has to be custom-built to fill this gap.

Thanks for making this tool!!

armorKing11 · Jan 28 '22