elasticsearch icon indicating copy to clipboard operation
elasticsearch copied to clipboard

ignore_malformed to support ignoring JSON objects ingested into fields of the wrong type

Open samcday opened this issue 10 years ago • 73 comments

Indexing a document with an object type on a field that has already been mapped as a string type causes MapperParsingException, even if index.mapping.ignore_malformed has been enabled.

Reproducible test case

On Elasticsearch 1.6.0:

$ curl -XPUT localhost:9200/broken -d'{"settings":{"index.mapping.ignore_malformed": true}}'
{"acknowledged":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test":"a string"}'
{"_index":"broken","_type":"type","_id":"AU6wNDGa_qDGqxty2Dvw","_version":1,"created":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test":{"nested":"a string"}}'
{"error":"MapperParsingException[failed to parse [test]]; nested: ElasticsearchIllegalArgumentException[unknown property [nested]]; ","status":400}

$ curl localhost:9200/broken/_mapping
{"broken":{"mappings":{"type":{"properties":{"test":{"type":"string"}}}}}}

Expected behaviour

Indexing a document with an object field where Elasticsearch expected a string field to be will not fail the whole document when index.mapping.ignore_malformed is enabled. Instead, it will ignore the invalid object field.

samcday avatar Jul 21 '15 10:07 samcday

+1

clintongormley avatar Jul 24 '15 09:07 clintongormley

While working on this issue, I found out that it fails on other types too, but for another reason: For example, for integer:

$ curl -XPUT localhost:9200/broken -d'{"settings":{"index.mapping.ignore_malformed": true}}'
{"acknowledged":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test2": 10}'
{"_index":"broken","_type":"type","_id":"AU6wNDGa_qDGqxty2Dvw","_version":1,"created":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test2":{"nested": 20}}'
[elasticsearch] [2015-09-26 02:20:23,380][DEBUG][action.index             ] [Tyrant] [broken][1], node[7WAPN-92TAeuFYbRLVqf8g], [P], v[2], s[STARTED], a[id=WlYpBZ6vTXS-4WMvAypeTA]: Failed to execute [index {[broken][type][AVAIGFNQZ9WMajLk5l0S], source[{"test2":{"nested":1}}]}]
[elasticsearch] MapperParsingException[failed to parse]; nested: IllegalArgumentException[Malformed content, found extra data after parsing: END_OBJECT];
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentParser.innerParseDocument(DocumentParser.java:157)
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:77)
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:319)
[elasticsearch]     at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:475)
[elasticsearch]     at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:466)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction.prepareIndexOperationOnPrimary(TransportReplicationAction.java:1053)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction.executeIndexRequestOnPrimary(TransportReplicationAction.java:1061)
[elasticsearch]     at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:170)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.performOnPrimary(TransportReplicationAction.java:580)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1.doRun(TransportReplicationAction.java:453)
[elasticsearch]     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
[elasticsearch]     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[elasticsearch]     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[elasticsearch]     at java.lang.Thread.run(Thread.java:745)
[elasticsearch] Caused by: java.lang.IllegalArgumentException: Malformed content, found extra data after parsing: END_OBJECT
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentParser.innerParseDocument(DocumentParser.java:142)
[elasticsearch]     ... 13 more

Thats happening because, unlike in the string case, we are handling the ignoreMalformed for numeric types but, when we throw the exception here we didn't parse the field object until XContentParser.Token.END_OBJECT and that comes to bite us later, here.

So, I think two things must be done: (1) Use the ignoreMalformed settings in StringFieldMapper, which is not happening (hence the original reported issue) (2) Parse until the end of the current object before throwing IllegalArgumentException("unknown property [" + currentFieldName + "]"); in the Mapper classes. To prevent the exception I reported from happening. Or maybe just ignore this exception, in innerParseDocument, when ignoreMalformed is set?

Does this make sense, @clintongormley? I'll happily send a PR for this.

andrestc avatar Sep 26 '15 05:09 andrestc

ah - i just realised that the original post refers to a string field, which doesn't support ignore_malformed...

@andrestc i agree with your second point, but i'm unsure about the first...

@rjernst what do you think?

clintongormley avatar Sep 27 '15 10:09 clintongormley

Sorry for the delayed response, I lost this one in email.

@clintongormley I think it is probably worth making the behavior consistent, and it does seem to me finding an object where a specific piece of data is expected constitutes "malformed" data.

@andrestc A PR would be great.

rjernst avatar Nov 09 '15 01:11 rjernst

I want to upvote this issue! I have fields in my JSON that are objects, but when they are empty, they contain an empty string, i.e. "" (this is the result of an XML2JSON parser). Now when I add a document where this is the case, I get a

 MapperParsingException[object mapping for [xxx] tried to parse field [xxx] as object, but found a concrete value]

This is not at all what I would expect from the documentation https://www.elastic.co/guide/en/elasticsearch/reference/2.0/ignore-malformed.html; please improve the documentation or fix the behavior (preferred!).

@clintongormley "i just realised that the original post refers to a string field, which doesn't support ignore_malformed..." Why should string fields not support ignore_malformed?

abulhol avatar Dec 21 '15 14:12 abulhol

+1

I think there could be done much more e.g. set the field to a default value and add an annotation to the document - so users can see what went wrong. In my case all documents from Apache Logs having "-" in the size field (Integer) got ignored. I could tell you 100 stories, why Elasticsearch don't take documents from real data sources ... (just to mention one more https://github.com/elastic/elasticsearch/issues/3714)

I think this problem could be handled much better:

  1. if a type error appears, try to convert the value (optional server/index setting). Often a JSON has number without quotes (correct), some put numbers as string in quotes. In this case the string could be converted to integer.
  2. if the type does not fit, take a default value for this type (0,null) - or ignore the field as you do today, but very bad if it is a larger object ...
  3. add a comment field like "_es_error_report: MapperParsingException: ...." In that way users can see that there was something wrong, today - data just disappears, when it fails to be indexed or the field is ignored. And the sysadmin might see error message in some logs ... but users wonder that data in elasticsearch is not complete and might have no access to elasticsearch logs. In my case I missed all Apache messages with status code 500 and size "-" instead of 0 - which is really bad - and depends on the log parser ...

A good example is Logsene, it adds Error-Annotations to failed documents together with the String version of the original source document (@sematext can catch Elasticsearch errors during the indexing process). So at least Logsene users can see failed index operations and orginal document in their UI or in Kibana. Thanks to this feature I'm able to report this issue to you.

It would be nice when such improvements would be available out of box for all Elasticsearch users.

megastef avatar Jan 27 '16 17:01 megastef

any news here?

abulhol avatar Feb 01 '16 20:02 abulhol

I wish to upvote the issue too. My understanding of the ignore_malformed purpose is to not lose events, even when you might lose some of its content. In the current situation I'm in, a issue similar to what has been described here is occurring, and although it's identified and multiple mid-term approaches are looked into - Issue in our case relates to multiple sources sending similar event, so options like splitting the events in separate mappings, or even cleaning up the events before reaching elasticsearch could be done - I would have liked a short term approach similar to ignore_malformed functionality to be in place to help sort term.

balooka avatar Mar 22 '16 10:03 balooka

Same problem with dates.

When adding an object with a field of type "date", in my DB whenever it is empty it's represented as "" (empty string) causing this error:

[DEBUG][action.admin.indices.mapping.put] [x] failed to put mappings on indices [[all]], type [seedMember]
java.lang.IllegalArgumentException: mapper [nms_recipient.birthDate] of different type, current_type [string], merged_type [date]

BeccaMG avatar May 03 '16 12:05 BeccaMG

Same problem with me. I'm using the ELK stack in which people may use the same properties but with different types. I don't want those properties to be searchable but I don't want to loose the entity event neither. I though ignore_malformed would do that but apparently is not working for all cases.

satazor avatar May 06 '16 14:05 satazor

We are having issues with this same feature. We have documents that sometimes decide to have objects inside something that was intedended to have strings. We would like to not lose the whole document just because one of the nodes of data are malformed.

This is the behaviour I expected to get from setting ignore_malformed on the properties, and I would applaude such a feature.

jarlelin avatar Jun 20 '16 12:06 jarlelin

Hay, I have the same problem. Is there any solution (even if it is a bit hacky) out there?

DaTebe avatar Aug 26 '16 15:08 DaTebe

Facing this in elasticsearch 2.3.1 . Before this bug is fixed we should atleast have a list of bad fields inside mapper_parsing_exception error so that the app can choose to remove them . Currently there is no standard field in the error through which these keys can be retrieved -

"error":{"type":"mapper_parsing_exception","reason":"object mapping for [A.B.C.D] tried to parse field [D] as object, but found a concrete value"}}

The app would have to parse the reason string and extract A.B.C.D which will fail if the error doc format changes . Additionally mapper_parsing_exception error itself must be using different formats for different parsing error scenarios all of which need to be handled by the app

goodfella1408 avatar Sep 09 '16 07:09 goodfella1408

I used a workaround for this matter following the recommendations from Elasticsearch forums and official documentation.

Declaring the mapping of the objects you want to index (if you know it), choosing ignore_malfored in dates and numbers, should do the trick. Those tricky ones that could have string or nested content could be simply declared as object.

BeccaMG avatar Sep 14 '16 04:09 BeccaMG

for usage as a real log stash I would say something like https://github.com/elastic/elasticsearch/issues/12366#issuecomment-175748358 is a must have! I can get accustomed to losing indexed fields but losing log entries is a no-go for ELK from my perspective

derEremit avatar Sep 15 '16 16:09 derEremit

Bumping, this issue is preventing a number of my messages to successfully be processed as a field object is returned as an empty string on rare cases.

micpotts avatar Oct 31 '16 21:10 micpotts

Bump, this is proving to be an extremely tedious (non) feature to work around.

patrick-oyst avatar Jan 19 '17 14:01 patrick-oyst

I've found a way around this but it comes at a cost. It could be worth it for those like me who are in a situation where intervening directly on your data flow (like checking and fixing the log line yourself before sending it to ES) is something you'd like to avoid in the short term. Set the enabled setting of your field to false. This will make the field non searchable though. This isn't too big of an issue in my context because the reason this field is so unpredictable is the reason I need ignore_malformed to begin with, so it's not a particularly useful field to search on anyways, though you still have access to the data when you search for that document using another field. Incidentally, this solves both situations : writing an object to a non-object field and vice versa.

Hope this helps. It certainly saved me a lot of trouble...

patrick-oyst avatar Jan 19 '17 15:01 patrick-oyst

Thats a good trick. Ill try that out.

On 19 Jan 2017 16:01, "patrick-oyst" [email protected] wrote:

I've found a way around this but it comes at a cost. It could be worth it for those like me who are in a situation where intervening directly on your data flow (like checking and fixing the log line yourself before sending it to ES) is something you'd like to avoid in the short term. Set your object's enabled setting to false. This will make the fields non searchable though. This isn't too big of an issue in my context because the reason this field is so unpredictable is the reason I need ignore_malformed to begin with, so it's not a particularly useful field to search on anyways, though you still have access to the data when you search for that document using another field.

Hope this helps. It certainly saved me a lot of trouble...

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/elastic/elasticsearch/issues/12366#issuecomment-273798499, or mute the thread https://github.com/notifications/unsubscribe-auth/AGC4v4w0ZIXlOGN40nAgl_8fpy0dj2CUks5rT3rhgaJpZM4Fcpph .

jarlelin avatar Jan 19 '17 15:01 jarlelin

+1

robinjha avatar Jan 25 '17 19:01 robinjha

+1

senseysensor avatar Mar 15 '17 07:03 senseysensor

+1

marcovdkuur avatar Mar 28 '17 08:03 marcovdkuur

+1

jonesn avatar Apr 04 '17 05:04 jonesn

Also an issue on ES 5.2.1. Very frustrating when dealing with some unexpected input that may possibly be malformed.

rholloway avatar Apr 05 '17 13:04 rholloway

👍 Would definitely be great to enable the ignore_malformed property for object. I've had many cases of mapping errors due to the fact that someone tried to index a string where a nested object should be and vice versa.

hartfordfive avatar Apr 06 '17 15:04 hartfordfive

👍

sorinescu avatar Apr 11 '17 13:04 sorinescu

👍

aivis avatar May 14 '17 07:05 aivis

+1

EamonHetherton avatar May 31 '17 04:05 EamonHetherton

👍

fkoclas avatar Jun 01 '17 15:06 fkoclas

👍

flashfm avatar Jun 14 '17 16:06 flashfm