metacrafter
Metadata and data identification tool and Python library. Identifies PII, common identifiers, and language-specific identifiers. Rules are fully customizable and flexible.
Without a cache, the tool reloads its rules on every run. This makes it harder to process thousands of datasets from the command line.
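One possible way to avoid reparsing on every run is an on-disk cache that is reused while the rule source is unchanged. The sketch below is only an illustration of that idea, not metacrafter's implementation: `load_rules_cached`, `fake_parse`, and the file names `rules.txt` and `rules.cache` are all hypothetical, and the "rules" here are just lines of text.

```python
import os
import pickle
import tempfile


def load_rules_cached(source_path, cache_path, parse):
    """Return parsed rules, reusing a pickle cache if it is not older than the source."""
    source_mtime = os.path.getmtime(source_path)
    if os.path.exists(cache_path) and os.path.getmtime(cache_path) >= source_mtime:
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    rules = parse(source_path)  # slow path: parse the rule source from scratch
    with open(cache_path, "wb") as fh:
        pickle.dump(rules, fh)
    return rules


# Demo with a stand-in parser (hypothetical) that records how often it is called.
calls = []


def fake_parse(path):
    calls.append(path)
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "rules.txt")
    cache = os.path.join(tmp, "rules.cache")
    with open(src, "w") as fh:
        fh.write("rule_a\nrule_b\n")
    first = load_rules_cached(src, cache, fake_parse)   # parses and writes the cache
    second = load_rules_cached(src, cache, fake_parse)  # served from the cache
```

Comparing modification times keeps the cache valid only while the source is untouched; a content hash would be a stricter alternative.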
Some database fields are just incremental unique identifiers generated by the database engine. They can't be linked to any external identifier databases and are used only locally by databases....
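A simple heuristic for spotting such fields is to check whether the sampled values are integers that increase by exactly one. This is a minimal sketch under that assumption; `looks_autoincrement` and its `min_count` parameter are hypothetical names, not part of metacrafter, and real data with gaps from deleted rows would need a looser check.

```python
def looks_autoincrement(values, min_count=3):
    """Heuristic: True when values parse as integers that increase by exactly 1."""
    try:
        numbers = [int(v) for v in values]
    except (TypeError, ValueError):
        return False  # at least one value is not an integer
    if len(numbers) < min_count:
        return False  # too few samples to draw a conclusion
    return all(b - a == 1 for a, b in zip(numbers, numbers[1:]))


print(looks_autoincrement(["1", "2", "3", "4"]))
print(looks_autoincrement(["10", "12", "13"]))
```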
Sometimes exported CSV files include whitespace before or after values for clearer formatting and to fit fixed-width data fields. Whitespace should be removed automatically using the `strip()` function....
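The stripping step can be sketched with the standard `csv` module. This is an illustration only: `strip_row` is a hypothetical helper and the sample data is made up for the example.

```python
import csv
import io


def strip_row(row):
    """Return a copy of a CSV row dict with surrounding whitespace removed."""
    return {
        (key.strip() if isinstance(key, str) else key):
            (value.strip() if isinstance(value, str) else value)
        for key, value in row.items()
    }


# Made-up sample with padding around both headers and values.
sample = "id , name \n 1 ,  Alice  \n 2 , Bob\n"
rows = [strip_row(r) for r in csv.DictReader(io.StringIO(sample))]
print(rows)
```

Headers are stripped as well as values, since padded exports usually pad both.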
Flat table datasets (CSV files, database tables) and sometimes objects with nested objects often include elements that could be grouped. For example, the CSV file [Zaara_D.csv](https://github.com/apicrafter/metacrafter/files/9274615/Zaara_D.csv) includes the following fields: title, text,...
Nested documents in JSON, JSON Lines, XML, etc. are detected as str objects instead of dict or array objects. Example: the nested objects `Scores` and `Geocode` are detected as strings.
- [ ] implement detection...
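For JSON input, the distinction can be made by inspecting the parsed Python value rather than its string form. The following is a minimal sketch, not metacrafter's detection code: `value_type` is a hypothetical helper, and the contents given to `Scores` and `Geocode` are invented for the example.

```python
import json


def value_type(value):
    """Map a parsed JSON value to a coarse type label."""
    if isinstance(value, dict):
        return "dict"
    if isinstance(value, list):
        return "array"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if value is None:
        return "null"
    return "str"


# Invented record; only the field names Scores and Geocode come from the issue.
line = '{"Name": "Example", "Scores": {"math": 90}, "Geocode": [55.75, 37.61]}'
record = json.loads(line)
types = {key: value_type(val) for key, val in record.items()}
print(types)
```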
Named entity recognition (NER) technology helps identify named objects inside texts.

**Strong**
- allows identifying objects inside text blobs
- could allow support for more named entities (identifiers)

**Weakness**...
Thank you for open-sourcing this handy tool! I was trying to install the package from pip and source, but neither works out-of-the-box. From my end (Ubuntu with Python 3.10), running...
Thank you for open-sourcing this package! I was wondering if the following behavior is expected when running `metacrafter scan-file --format short world+City.csv`: > Processing file /data/bird_sql/train_csv/world+City.csv > > 2024-07-03...