How to download data?

Open LighthouseInTheSea opened this issue 3 years ago • 5 comments

When I call the following code:

doc_download = doccano_client.get_doc_download(2, 'json')
print(doc_download.text)

it prints:

doccano - doccano
Loading...

How do I get the downloaded data?
  • Operating System: windows
  • Python Version: 3.10.2
  • Package Version: doccano(1.5.5) doccano-client(1.0.3)

LighthouseInTheSea avatar Feb 24 '22 08:02 LighthouseInTheSea

return self.get_file(
    "v1/projects/{project_id}/docs/download".format(project_id=project_id),
    params={"q": file_format, "onlyApproved": str(only_approved).lower()},
    headers=headers,
)

http://xxx.xx.xxx.xx:xxxx/v1/projects/13/download

Is the client requesting the wrong address?

LighthouseInTheSea avatar Mar 11 '22 03:03 LighthouseInTheSea

For those struggling with the same problem: I copied some code from another issue and added the zip file creation. It's all a bit obscure, so I'm reposting it here.

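import time

# doccano_client and doccano_client_url must already be defined by the caller.
def export_project(project_id, save_path):
    # Start the export; the response carries the ID of a background task.
    result = doccano_client.post(f'{doccano_client_url}v1/projects/{project_id}/download', json={'exportApproved': False, 'format': 'JSONL'})
    task_id = result['task_id']
    # Poll once per second until the export task reports that it is ready.
    while True:
        result = doccano_client.get(f'{doccano_client_url}v1/tasks/status/{task_id}')
        if result['ready']:
            break
        time.sleep(1)
    # Fetch the finished export and stream it to save_path in 8 KiB chunks.
    result = doccano_client.get_file(f'{doccano_client_url}v1/projects/{project_id}/download?taskId={task_id}')
    with open(save_path, 'wb') as f:
        for chunk in result.iter_content(chunk_size=8192):
            f.write(chunk)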
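
The polling loop above never gives up if the task does not become ready. A minimal sketch of the same loop with a timeout follows. The helper name `wait_for_task` and its parameters are hypothetical and not part of doccano-client. It assumes only that the status call returns a dict with a `ready` key, as in the snippet above.

```python
import time
from typing import Callable


def wait_for_task(
    get_status: Callable[[str], dict],
    task_id: str,
    timeout: float = 60.0,
    interval: float = 1.0,
) -> dict:
    """Poll get_status(task_id) until it reports ready; raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(task_id)
        if status.get("ready"):
            return status
        # Give up once the deadline has passed instead of looping forever.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} not ready after {timeout} seconds")
        time.sleep(interval)
```

Inside export_project this could replace the while loop, for example by passing `lambda t: doccano_client.get(f'{doccano_client_url}v1/tasks/status/{t}')` as `get_status`.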

PedroMTQ avatar Apr 11 '22 09:04 PedroMTQ

@PedroMTQ This is great, thanks.

wpnbos avatar Apr 13 '22 10:04 wpnbos

I slightly modified the end of the function to avoid writing to disk. It now returns the results as a list of parsed JSON objects, one per line of the exported file. I also use the baseurl attribute of the doccano_client object instead of the doccano_client_url variable.

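import json
from io import BytesIO
from zipfile import ZipFile

...
result = doccano_client.get_file(f'{doccano_client.baseurl}v1/projects/{project_id}/download?taskId={task_id}')
# Open the downloaded zip archive in memory instead of saving it.
file_like_object = BytesIO(result.content)
zipfile_obj = ZipFile(file_like_object)
# Read the first file in the archive and parse each line as JSON.
data = zipfile_obj.open(zipfile_obj.namelist()[0]).read().splitlines()
data = [json.loads(line) for line in data]
return data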
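
To try the in-memory part without a doccano server, here is a small self-contained sketch of the same idea. The function name `read_jsonl_from_zip` is hypothetical. Like the snippet above, it assumes the first file in the archive is the JSONL export.

```python
import json
from io import BytesIO
from zipfile import ZipFile


def read_jsonl_from_zip(zip_bytes: bytes) -> list:
    """Parse the first file in a zip archive as JSONL (one JSON value per line)."""
    with ZipFile(BytesIO(zip_bytes)) as archive:
        first_name = archive.namelist()[0]
        with archive.open(first_name) as member:
            raw_lines = member.read().splitlines()
    # Skip blank lines so a trailing newline does not break json.loads.
    return [json.loads(line) for line in raw_lines if line.strip()]
```

With the code above, this would be called as `read_jsonl_from_zip(result.content)`.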

Cheers!

david-engelmann avatar Apr 28 '22 14:04 david-engelmann

Is there any way to download the documents and include metadata?

peter-mccabe avatar Aug 18 '22 15:08 peter-mccabe

Fixed in https://github.com/doccano/doccano-client/pull/59

Hironsan avatar Sep 20 '22 08:09 Hironsan