Access to model definitions and training/validation data?
Would it be possible to get access to model definitions and training/validation data for the models used in SAP/credential-digger?
I'm interested in seeing how these models were trained, and in possibly contributing to their future development.
Currently it seems that only trained models are available for download.
Hi @nlykkei,
Thank you for your interest in the project. Let me start with a clarification regarding the training/validation data. So far we have trained two types of models: one based on real data that we keep internal (for privacy reasons), and a second, open-source one that is trained on synthetically generated data. If you are interested, we can give you more details on how this data is generated or how to train a model on your own data (some details are already available in our publication here).
If you are interested in contributing to the project, or if you want to deploy it in your professional environment, let's have a call with the team and discuss this in detail. You can reach me directly at my e-mail address, which you will find in the publication ;).
Best regards
Slim
Hi @SlimTrabelsi
Thanks for your reply,
If you are interested we can give you more details on how this data is generated or how to train your own data (already some details are available in our publication here).
I'd be very grateful if you could provide more details beyond those already in the publication.
Personally, I've been working on a similar problem, but it has been very difficult to progress from a strict set of regular expressions (a blacklist) to using ML for cases that are hard to express as regular expressions without introducing too many false positives (e.g. social security numbers: \d{8}[-: ]?\d{4}).
In my experience, sensitive data could only be identified reliably when a sufficient amount of surrounding context was available (e.g. think of a URL, https://user:[email protected]/foo/bar).
My ML experience is limited to introductory university courses and DeepLearning.AI certifications. Would you say that my skill level is inadequate for developing this kind of system?
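To illustrate the point about context, here is a minimal sketch in Python. It applies the social security number pattern quoted above and then checks a small window of text around each match for hint words. The `CONTEXT_HINTS` list, the `find_candidates` helper, and the window size are assumptions made for this example only; they are not taken from credential-digger, where a trained model rather than a keyword list would make this decision.

```python
import re

# Pattern quoted above: 8 digits, an optional separator, then 4 digits.
ID_PATTERN = re.compile(r"\d{8}[-: ]?\d{4}")

# Hypothetical hint words; a real system would learn such cues from data.
CONTEXT_HINTS = ("ssn", "social security", "personal number")


def find_candidates(text, window=30):
    """Return (matched_text, has_context) for every pattern hit in text."""
    results = []
    for match in ID_PATTERN.finditer(text):
        # Look at up to `window` characters on each side of the match.
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        neighbourhood = text[start:end].lower()
        has_context = any(hint in neighbourhood for hint in CONTEXT_HINTS)
        results.append((match.group(), has_context))
    return results


samples = [
    "ssn: 19850312-1234",       # hint word nearby
    "order_id = 198503121234",  # same digit shape, unrelated meaning
]
for line in samples:
    print(line, "->", find_candidates(line))
```

Both sample lines match the regular expression, but only the first has a hint word in its neighbourhood. The pattern alone cannot separate the two, which is the false-positive problem described above.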
Best regards Nicolas