SkillNER
ENH - Add support for custom skills
Fixes #46
Please see custom_skills_updates.md for a comprehensive walkthrough, questions, and future directions.
Feel free to search for "?" in the documentation to jump to actionable items/questions.
Thanks for the wait and looking forward to hearing your thoughts 😸
Thanks for your interest @yonglin-wang. I highly appreciate your suggestions.
First, let me clarify a few important points:
Suppose that we approved the EPL Skill database and merged it into the codebase. Later on, another contributor suggests another database, and we merge it too, and so on. We would end up with a huge codebase (JSON files, classes, and their docs) that is chaotic and difficult to maintain. We had better keep our codebase as small as possible.
Hence, the EMSI skill database will be the only database supported by default in skillner, as it contains many skills that were gathered and approved by experts, and we have also enriched some of them manually.
Using a custom skill database is something user-specific. So what I suggest is to define a user-friendly pipeline that creates the skill bundles so that they can be used with SkillExtractor, which is more or less what you have suggested.
With that being said, instead of implementing classes, abstract classes, or abstract methods (let's keep things as simple as possible), I suggest adopting a sequential point of view:
flowchart LR;
o(x) -->| list, path, url| A(Load data) --> B(create dict with id, name)
B --> C{other processing...}
C -->D(save bundles)
Let us start by implementing these pieces and then combine them to form a complete pipeline.
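To make the sequential idea concrete, here is a rough sketch of how the pieces in the flowchart could look as plain functions. Everything in it is an assumption for illustration only: the function names, the `CUSTOM0000`-style ids, the `skill_name` key, and the one-skill-per-line input format are hypothetical and do not reflect the bundle format that SkillExtractor actually expects.

```python
import json
from pathlib import Path
from urllib.request import urlopen


def load_data(source):
    """Return a list of skill names from a list, a local path, or a URL."""
    if isinstance(source, (list, tuple)):
        return list(source)
    if str(source).startswith(("http://", "https://")):
        with urlopen(source) as response:
            text = response.read().decode("utf-8")
    else:
        text = Path(source).read_text(encoding="utf-8")
    # Assumed input format: one skill name per line, blank lines ignored.
    return [line.strip() for line in text.splitlines() if line.strip()]


def create_skill_dict(names):
    """Give each unique skill name a generated id (hypothetical id scheme)."""
    unique = list(dict.fromkeys(names))  # drop duplicates, keep first-seen order
    return {f"CUSTOM{i:04d}": {"skill_name": name} for i, name in enumerate(unique)}


def save_bundle(skills, path):
    """Write the skill dictionary to a JSON file."""
    Path(path).write_text(json.dumps(skills, indent=2), encoding="utf-8")


def build_bundle(source, out_path):
    """Run the steps in order: load, build dict, (other processing), save."""
    skills = create_skill_dict(load_data(source))
    # The "other processing..." step from the flowchart would go here.
    save_bundle(skills, out_path)
    return skills
```

Each step stays a small function that can be tested on its own, and `build_bundle` is just the composition of the steps, so no classes or abstract methods are needed.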
Note on work organization:
- Make sure to push frequently so that we can follow along with you. I was a bit overwhelmed by 11 commits, 14 file changes, ... Pushing frequently will help us keep up with you and intervene whenever necessary.
- Don't reinvent the wheel. For instance, we have already implemented a cleaner class, so you can use it instead of creating your own. This saves us from retesting things over and over again.
- Try to keep the code as readable as possible; I struggled a bit to understand what you are trying to do. Also, make sure to follow a consistent docstring convention (in our case, numpydoc) instead of putting free-form sentences as function descriptions. It helps us understand your class/function without having to read its implementation.
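As an illustration of the docstring convention mentioned in the last point, a numpydoc-style docstring could look roughly like this. The function itself is a hypothetical helper made up for the example, not something that exists in the codebase.

```python
def normalize_skill_name(name):
    """
    Normalize a raw skill name.

    Parameters
    ----------
    name : str
        Raw skill name, possibly with extra whitespace or mixed case.

    Returns
    -------
    str
        Lowercased name with runs of whitespace collapsed to one space.

    Examples
    --------
    >>> normalize_skill_name("  Machine   Learning ")
    'machine learning'
    """
    # split() with no arguments drops leading, trailing, and repeated whitespace
    return " ".join(name.lower().split())
```

The `Parameters` and `Returns` sections tell a reader what goes in and what comes out without opening the implementation.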