
Not able to ingest non english utf-8 chars such as Japanese etc

Open fr-judson opened this issue 3 years ago • 4 comments

Describe the bug Data containing non-English UTF-8 characters, such as Japanese (for example "Sample Data - 商品ブランドコード"), cannot be ingested into the dataset entity's aspects, such as datasetProperties and schemaMetadata (in the column description).

To Reproduce Steps to reproduce the behavior:

  1. Ingest data containing non-English UTF-8 characters, such as Japanese, into the Dataset entity on the following aspects: datasetProperties, schemaMetadata

Expected behavior The metadata should be ingested into DataHub.

Observed behavior The metadata is not ingested into DataHub.

Additional context

  • Used the DataHub Java emitter.
  • Data encoding: UTF-8
  • See the Slack conversation: https://datahubspace.slack.com/archives/CUMUWQU66/p1657100203454079

Person to contact: Chris Margach (in datahub slack)

fr-judson avatar Jul 11 '22 08:07 fr-judson

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Aug 28 '22 02:08 github-actions[bot]

Additional context here: non-English UTF-8 characters seem to work on the Python side and on the frontend. I only tested the description field though, so @fr-judson let me know if there are other fields that were causing problems for you.


As such, it looks like this error is specific to the Java emitter.

hsheth2 avatar Nov 15 '22 03:11 hsheth2

Fix suggestion:

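In RestEmitter.java, StringEntity is constructed without a charset, so it encodes the body as ISO-8859-1 by default. The JSON payload must be sent as UTF-8, so any character outside ISO-8859-1 is lost.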

httpPost.setEntity(new StringEntity(payloadJson));

must be rewritten as:

httpPost.setEntity(new StringEntity(payloadJson, org.apache.http.entity.ContentType.APPLICATION_JSON));

and

httpPost.setEntity(new StringEntity(objectMapper.writeValueAsString(payload)));

must be rewritten as:

httpPost.setEntity(new StringEntity(objectMapper.writeValueAsString(payload), org.apache.http.entity.ContentType.APPLICATION_JSON));
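
To illustrate why the charset matters, here is a minimal sketch that uses only the Java standard library. It does not call the DataHub emitter or Apache HttpClient; the class name CharsetRoundTrip and the method roundTrip are hypothetical and exist only for this demonstration. It encodes the sample description from this issue to bytes and decodes it again, once with ISO-8859-1 and once with UTF-8, which approximates what happens to the request body under each charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of DataHub or Apache HttpClient.
public class CharsetRoundTrip {

    // Encode the text to bytes with the given charset, then decode it back.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        byte[] wire = text.getBytes(charset);
        return new String(wire, charset);
    }

    public static void main(String[] args) {
        String description = "Sample Data - 商品ブランドコード";

        String viaLatin1 = roundTrip(description, StandardCharsets.ISO_8859_1);
        String viaUtf8 = roundTrip(description, StandardCharsets.UTF_8);

        // Compare each decoded result with the original text.
        System.out.println("ISO-8859-1 preserved text: " + description.equals(viaLatin1));
        System.out.println("UTF-8 preserved text:      " + description.equals(viaUtf8));
    }
}
```

The ASCII prefix survives either way, which may be why English-only metadata never showed the problem; only the Japanese characters are replaced when the body is encoded as ISO-8859-1.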

humpfhumpf avatar Nov 22 '22 16:11 humpfhumpf

@hsheth2 Unfortunately, @fr-judson has left the organization. As far as I can see from our chat logs, we only tried "descriptions for both datasetproperties and dataset column descriptions"

@humpfhumpf Thank you very much for the suggestion! We'll try it out.

fr-chrismargach avatar Nov 23 '22 00:11 fr-chrismargach