doccano-client

API for easy and seamless import/export of datasets in Doccano's database from a Python script, to allow Human-in-the-loop and Active Learning capabilities

Open ivsanro1 opened this issue 4 years ago • 2 comments

Feature description

Hello, I have been looking for an NER labelling tool that lets me iterate quickly on an Active Learning problem I am dealing with: after a labelling session, I train a model, run inference with it, and later review those predictions. I have found that Doccano is an excellent tool, but as far as I understand, it lacks quick import/export functionality: I have to deal with files, upload them, tag or revise them, download the newly annotated dataset, and so on.

In my opinion, it would be much easier to perform this task if I could directly query the database Doccano uses under the hood to keep all this text and tag info, and play around with it.

Is there any feature already implemented that could help me achieve this task?

If it is not implemented, is something related on the roadmap for this tool?

Thank you

ivsanro1 avatar Jul 28 '21 10:07 ivsanro1

Could you describe your environment? Thank you!

github-actions[bot] avatar Jul 28 '21 10:07 github-actions[bot]

What I've been doing for this use case is adding outgoing webhooks to the backend server to notify my model training server when new examples are ready. Specifically, I add them here:

https://github.com/doccano/doccano/blob/master/backend/api/views/example_state.py
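For illustration, an outgoing webhook of this kind could look roughly like the sketch below. Nothing in it comes from Doccano itself: the function names, the payload fields, and the idea of a single target URL are all assumptions, and where exactly to call it inside the view is left open.

```python
import json
import urllib.request


def build_notification(project_id, example_id, confirmed):
    """Build a hypothetical JSON payload describing one example's new state."""
    return {
        "project_id": project_id,
        "example_id": example_id,
        "confirmed": confirmed,
    }


def notify_training_server(url, payload, timeout=5.0):
    """POST the payload as JSON to the training server and return the HTTP status."""
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

In practice the call would probably be wrapped in a try/except so that a training server that is down does not break the labelling request.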

A lot of the database is exposed through the API as well. If you just want to download the dataset through the API, it's not difficult. It takes only 3 API requests: one to start the job, another to check the job status, and another to download the result. Here's some very rough code you could adapt:

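import time

# Assumes doccano_client, base_url and project_id are already defined.
# 1. Start the export job.
result = doccano_client.post(f'{base_url}/v1/projects/{project_id}/download', json={'exportApproved': True, 'format': 'JSONL'})
task_id = result['task_id']
# 2. Poll the task status until the job is ready.
while True:
    result = doccano_client.get(f'{base_url}/v1/tasks/status/{task_id}')
    if result['ready']:
        break
    time.sleep(1)
# 3. Download the exported file.
result = doccano_client.get_file(f'{base_url}/v1/projects/{project_id}/download?taskId={task_id}')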
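The polling loop above waits forever if the job never finishes. One possible variation is a small helper with a timeout, sketched below. The helper name `wait_for_task` and its parameters are made up for this example; the only thing taken from the snippet above is that the status response carries a `ready` key.

```python
import time


def wait_for_task(get_status, task_id, poll_interval=1.0, timeout=300.0, sleep=time.sleep):
    """Call get_status(task_id) until the result reports ready, or raise on timeout.

    get_status is any callable returning a dict with a 'ready' key.
    sleep is injectable so the helper can be exercised without real waiting.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(task_id)
        if status.get("ready"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} not ready after {timeout} seconds")
        sleep(poll_interval)
```

With the client from the snippet above, it could be called as `wait_for_task(lambda t: doccano_client.get(f'{base_url}/v1/tasks/status/{t}'), task_id)`.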

daleevans avatar Aug 11 '21 22:08 daleevans

Thank you @Hironsan! The new features look great. Can't wait to try them!

Just to close this properly, I'd like to mention (for future reference) that the problem I mentioned in the issue:

it would be much easier to perform this task if I could directly query the database Doccano uses under the hood to keep all this text and tag info, and play around with it.

can now be done easily with the appropriate API methods, such as create_span and delete_span for NER, update_category for text classification, etc.
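As a rough sketch of how this might look for NER, the helper below writes model predictions back as spans. The helper `push_predictions` is hypothetical, and the keyword arguments passed to `create_span` (`project_id`, `example_id`, `start_offset`, `end_offset`, `label`) are my assumption about the client's signature, so check them against the version of doccano-client you have installed.

```python
def push_predictions(client, project_id, example_id, predictions):
    """Create one span per (start, end, label) prediction and return the results.

    client is expected to expose create_span; the keyword names used here
    are an assumption about its signature, not confirmed in this thread.
    """
    created = []
    for start, end, label in predictions:
        span = client.create_span(
            project_id=project_id,
            example_id=example_id,
            start_offset=start,
            end_offset=end,
            label=label,
        )
        created.append(span)
    return created
```

Passing the client in as an argument keeps the helper easy to try against a stub before pointing it at a real Doccano instance.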

ivsanro1 avatar Oct 07 '22 08:10 ivsanro1