datahub
Unable to ingest non-English UTF-8 characters such as Japanese
Describe the bug Data containing non-English UTF-8 characters such as Japanese (for example "Sample Data - 商品ブランドコード") cannot be ingested into the dataset entity's aspects, such as datasetProperties and schemaMetadata (in the column description).
To Reproduce Steps to reproduce the behavior:
- Ingest data containing non-English UTF-8 characters such as Japanese into the Dataset entity, on the datasetProperties and schemaMetadata aspects.
Expected behavior The metadata should be ingested into DataHub.
Observed behavior The metadata is not ingested into DataHub.
Additional context
- Used the DataHub Java emitter.
- Data encoding: UTF-8
- See the Slack conversation: https://datahubspace.slack.com/archives/CUMUWQU66/p1657100203454079
Person to contact: Chris Margach (in datahub slack)
Additional context here: non-ASCII UTF-8 characters seem to work on the Python side and on the frontend. I only tested the description field though, so @fr-judson, let me know if there are other fields that were causing problems for you.
As such, it looks like this error is specific to the Java emitter.
Fix suggestion:
In RestEmitter.java, StringEntity encodes the payload as ISO-8859-1 by default when no content type or charset is given, but the JSON body has to be sent as UTF-8.
httpPost.setEntity(new StringEntity(payloadJson));
must be rewritten into:
httpPost.setEntity(new StringEntity(payloadJson, org.apache.http.entity.ContentType.APPLICATION_JSON));
and
httpPost.setEntity(new StringEntity(objectMapper.writeValueAsString(payload)));
must be rewritten into:
httpPost.setEntity(new StringEntity(objectMapper.writeValueAsString(payload), org.apache.http.entity.ContentType.APPLICATION_JSON));
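To illustrate why the charset matters, here is a minimal sketch that uses only the JDK, not the emitter or Apache HttpClient. The class EncodingDemo and the method sendAndReceive are hypothetical names made up for this example. It assumes the client turns the string into bytes with the given charset (roughly what StringEntity does internally) and that the server decodes the body as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of DataHub or Apache HttpClient.
class EncodingDemo {
    // Encode the text with the client's charset, then decode the bytes as UTF-8,
    // mimicking a server that assumes a UTF-8 JSON body.
    static String sendAndReceive(String text, Charset clientCharset) {
        byte[] wire = text.getBytes(clientCharset);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u5546\u54c1 are the first two Japanese characters of the example in this issue,
        // written as escapes so the file compiles regardless of source encoding.
        String payload = "{\"description\":\"Sample Data - \u5546\u54c1\"}";

        // ISO-8859-1 cannot represent the Japanese characters, so each one is
        // replaced with '?' before the bytes ever leave the client.
        System.out.println(sendAndReceive(payload, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent them, so the payload survives unchanged.
        System.out.println(sendAndReceive(payload, StandardCharsets.UTF_8));
    }
}
```

Passing ContentType.APPLICATION_JSON to StringEntity, as suggested above, selects UTF-8 for the bytes and also sets the Content-Type header, which is why it should address the problem on the emitter side.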
@hsheth2 Unfortunately, @fr-judson has left the organization. As far as I can see from our chat logs, we only tried "descriptions for both datasetproperties and dataset column descriptions"
@humpfhumpf Thank you very much for the suggestion! We'll try it out.