doccano-client
API for easy and seamless import/export of datasets from Doccano's database via Python scripts, to allow Human-in-the-loop and Active Learning capabilities
Feature description
Hello, I have been looking for a NER labelling tool that allows me to iterate quickly after a labelling session: training a model, running inference with it, and later reviewing those inferences, for an Active Learning problem I am dealing with. I have found that Doccano is an excellent tool, but as far as I understand, it lacks quick import/export functionality, because I have to deal with files: upload them, tag/revise them, download the newly annotated dataset, and so on.
In my opinion, it would be much easier to perform this task if I could directly query the database Doccano uses under the hood to keep all these texts and tags, and play around with it.
Is there any feature already implemented that could help me achieve this task?
If it is not implemented: is something related on the roadmap for this tool?
Thank you
Could you describe your environment? Thank you!
What I've been doing for this use case is adding outgoing webhooks to the backend server to notify my model training server when new examples are ready. Specifically, here:
https://github.com/doccano/doccano/blob/master/backend/api/views/example_state.py
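Roughly, the hook looks like the sketch below. This is a simplified illustration, not code from Doccano itself: the webhook URL, payload, and the exact place it is called from are specific to my setup.

import requests

# Hypothetical endpoint on my own training server
TRAINING_SERVER_WEBHOOK = "http://my-training-server:8080/hooks/examples-ready"

def notify_training_server(project_id: int, example_id: int) -> None:
    """Fire-and-forget notification that a newly confirmed example is available."""
    try:
        requests.post(
            TRAINING_SERVER_WEBHOOK,
            json={"project_id": project_id, "example_id": example_id},
            timeout=2,
        )
    except requests.RequestException:
        # Don't let webhook failures break the annotation workflow
        pass

# Called from the example-state view (e.g. after an annotator confirms an example),
# so the training server can pull the fresh labels and retrain.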
A lot of the database is exposed through the API as well. If you just want to download the dataset through the API, it's not difficult: it involves only three API requests, one to start the export job, another to check the job status, and a third to download the result. Here's some very rough code you could adapt:
import time

# Kick off an export job for the project (approved examples only, JSONL format)
result = doccano_client.post(f'{base_url}/v1/projects/{project_id}/download', json={'exportApproved': True, 'format': 'JSONL'})
task_id = result['task_id']

# Poll until the export task is finished
while True:
    result = doccano_client.get(f'{base_url}/v1/tasks/status/{task_id}')
    if result['ready']:
        break
    time.sleep(1)

# Download the exported file
result = doccano_client.get_file(f'{base_url}/v1/projects/{project_id}/download?taskId={task_id}')
Thank you @Hironsan ! The new features look great. Can't wait to try them!
Just to close this properly, I'd like to mention (for future reference) that the problem I mentioned in the issue:
it would be much easier to perform this task if I could directly query the database Doccano uses under the hood to keep all these texts and tags, and play around with it.
can now be easily done using the appropriate API methods, such as create_span and delete_span for NER, update_category for text classification, etc.
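For anyone landing here later, a rough sketch of that workflow with the new client. Method signatures, the label argument, and the credentials are approximate placeholders; check the client documentation for the exact parameters.

from doccano_client import DoccanoClient

# Connect and authenticate (credentials are placeholders)
client = DoccanoClient("http://localhost:8000")
client.login(username="admin", password="password")

project_id = 1
example_id = 42

# Push a model prediction as a pre-annotation for NER
# (label may be a label name or id depending on the client version)
span = client.create_span(project_id, example_id, start_offset=0, end_offset=5, label="PERSON")

# Remove an annotation that the model/reviewer rejected
client.delete_span(project_id, example_id, span.id)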