[Question] Data Labeler vs. Presidio
I see many parallels between your Data Labeler and the Presidio project from Microsoft.
- Could you talk a little about the similarities and differences?
- Is the overall goal and focus the same?
- Have you run tests to compare accuracy and speed?
- Do you use any common NLP libraries under the hood?
I'm looking into which of the two is right for me and couldn't find a good comparison. I was hoping to get a little bit of insight from you as a starting point.
Thanks
Sorry for the delay in response. Here are my thoughts on your questions.
First, I can't speak to Presidio, so I'll explain more of the intent from the Data Profiler's perspective.
The Data Profiler's intent is to provide an easy-to-use library for data insight. One component of that is the identification of sensitive data, which is handled by the Data Labeler. While the Data Labeler provides an out-of-the-box solution for structured and unstructured identification, it is designed as an abstraction so that others can build their own components (preprocessor, postprocessor, and model) and share them for reuse or for integration into the profiling capabilities of the library.
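To make the component idea concrete, here is a minimal sketch of a labeling pipeline with swappable preprocessor, model, and postprocessor. Every name in it (`LabelingPipeline`, `toy_model`, and so on) is hypothetical and chosen for illustration; it is not the Data Profiler API, only the general shape of the abstraction described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical sketch: none of these names come from the Data Profiler library.


@dataclass
class LabelingPipeline:
    """Chains three swappable components: preprocess -> model -> postprocess."""

    preprocess: Callable[[str], List[str]]
    model: Callable[[List[str]], List[str]]
    postprocess: Callable[[List[str], List[str]], List[Tuple[str, str]]]

    def predict(self, text: str) -> List[Tuple[str, str]]:
        tokens = self.preprocess(text)
        labels = self.model(tokens)
        return self.postprocess(tokens, labels)


def whitespace_preprocess(text: str) -> List[str]:
    # Split raw text into whitespace-delimited tokens.
    return text.split()


def toy_model(tokens: List[str]) -> List[str]:
    # Stand-in for a trained model: all-digit tokens are labeled INTEGER.
    return ["INTEGER" if token.isdigit() else "UNKNOWN" for token in tokens]


def keep_known(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    # Drop tokens the model could not label.
    return [(t, l) for t, l in zip(tokens, labels) if l != "UNKNOWN"]


pipeline = LabelingPipeline(whitespace_preprocess, toy_model, keep_known)
result = pipeline.predict("order 12345 shipped")
```

Because each stage is just a callable, any one of them can be replaced (for example, a different model) without touching the other two.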
Hence, if the current set of labels meets your needs, no added work is required. However, the data labelers also allow you to bring your own datasets to train the models (via transfer learning or from reinitialized weights) on the labels that meet your needs, or to develop your own labeling pipeline.
The current default model is a TensorFlow character-level CNN (Char CNN) that can be configured to apply word aggregation in the postprocessing. The preprocessing was optimized by pushing the character encoding into the model layers. As mentioned above, the final layer can be adapted to the labels you need, provided you have a dataset to train the model after the architecture change. Integration into the profiling can be done by setting profiling options.
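As a rough illustration of what word aggregation in postprocessing can look like, the sketch below takes one label per character and rewrites each whitespace-delimited word so that all of its characters carry the word's most common label. The majority-vote rule and the function name `aggregate_by_word` are assumptions made for this example; the library's actual postprocessor may aggregate differently.

```python
from collections import Counter
from typing import List


def aggregate_by_word(text: str, char_labels: List[str]) -> List[str]:
    """Give every character in a word that word's majority label.

    Assumption for illustration: words are whitespace-delimited and the
    aggregation rule is a simple majority vote over character labels.
    """
    if len(text) != len(char_labels):
        raise ValueError("text and char_labels must have the same length")

    aggregated = list(char_labels)
    start = None  # index where the current word began, if inside a word
    # A trailing space acts as a sentinel so the last word is flushed.
    for i, ch in enumerate(text + " "):
        if ch.isspace():
            if start is not None:
                majority = Counter(char_labels[start:i]).most_common(1)[0][0]
                aggregated[start:i] = [majority] * (i - start)
                start = None
        elif start is None:
            start = i
    return aggregated


text = "call 5551234"
# Character-level predictions where the model missed the last two digits.
char_labels = ["UNKNOWN"] * 5 + ["PHONE"] * 5 + ["UNKNOWN"] * 2
word_labels = aggregate_by_word(text, char_labels)
```

Here the two trailing digits that were predicted as `UNKNOWN` are pulled to `PHONE` because most characters in that word were labeled `PHONE`; whitespace characters keep their original labels.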
If you have further questions, please don't hesitate to reach out again or even contribute to the library as well.
Closing due to inactivity.