ignore_malformed to support ignoring JSON objects ingested into fields of the wrong type
Indexing a document with an object value into a field that has already been mapped as a string type causes a MapperParsingException, even if index.mapping.ignore_malformed is enabled.
Reproducible test case
On Elasticsearch 1.6.0:
$ curl -XPUT localhost:9200/broken -d'{"settings":{"index.mapping.ignore_malformed": true}}'
{"acknowledged":true}
$ curl -XPOST localhost:9200/broken/type -d '{"test":"a string"}'
{"_index":"broken","_type":"type","_id":"AU6wNDGa_qDGqxty2Dvw","_version":1,"created":true}
$ curl -XPOST localhost:9200/broken/type -d '{"test":{"nested":"a string"}}'
{"error":"MapperParsingException[failed to parse [test]]; nested: ElasticsearchIllegalArgumentException[unknown property [nested]]; ","status":400}
$ curl localhost:9200/broken/_mapping
{"broken":{"mappings":{"type":{"properties":{"test":{"type":"string"}}}}}}
Expected behaviour
Indexing a document that contains an object where Elasticsearch expects a string should not fail the whole document when index.mapping.ignore_malformed is enabled. Instead, the invalid object field should be ignored.
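For comparison, this is what ignore_malformed already does for numeric fields: only the malformed value is dropped and the document still indexes (a sketch against 1.6.x; the index and field names are illustrative):
$ curl -XPUT localhost:9200/works -d '{"mappings":{"type":{"properties":{"count":{"type":"integer","ignore_malformed":true}}}}}'
$ curl -XPOST localhost:9200/works/type -d '{"count":"not a number"}'
The second request succeeds, and [count] is simply left unindexed for that document.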
+1
While working on this issue, I found out that it fails on other types too, but for another reason: For example, for integer:
$ curl -XPUT localhost:9200/broken -d'{"settings":{"index.mapping.ignore_malformed": true}}'
{"acknowledged":true}
$ curl -XPOST localhost:9200/broken/type -d '{"test2": 10}'
{"_index":"broken","_type":"type","_id":"AU6wNDGa_qDGqxty2Dvw","_version":1,"created":true}
$ curl -XPOST localhost:9200/broken/type -d '{"test2":{"nested": 20}}'
[elasticsearch] [2015-09-26 02:20:23,380][DEBUG][action.index ] [Tyrant] [broken][1], node[7WAPN-92TAeuFYbRLVqf8g], [P], v[2], s[STARTED], a[id=WlYpBZ6vTXS-4WMvAypeTA]: Failed to execute [index {[broken][type][AVAIGFNQZ9WMajLk5l0S], source[{"test2":{"nested":1}}]}]
[elasticsearch] MapperParsingException[failed to parse]; nested: IllegalArgumentException[Malformed content, found extra data after parsing: END_OBJECT];
[elasticsearch] at org.elasticsearch.index.mapper.DocumentParser.innerParseDocument(DocumentParser.java:157)
[elasticsearch] at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:77)
[elasticsearch] at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:319)
[elasticsearch] at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:475)
[elasticsearch] at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:466)
[elasticsearch] at org.elasticsearch.action.support.replication.TransportReplicationAction.prepareIndexOperationOnPrimary(TransportReplicationAction.java:1053)
[elasticsearch] at org.elasticsearch.action.support.replication.TransportReplicationAction.executeIndexRequestOnPrimary(TransportReplicationAction.java:1061)
[elasticsearch] at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:170)
[elasticsearch] at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.performOnPrimary(TransportReplicationAction.java:580)
[elasticsearch] at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1.doRun(TransportReplicationAction.java:453)
[elasticsearch] at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
[elasticsearch] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[elasticsearch] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[elasticsearch] at java.lang.Thread.run(Thread.java:745)
[elasticsearch] Caused by: java.lang.IllegalArgumentException: Malformed content, found extra data after parsing: END_OBJECT
[elasticsearch] at org.elasticsearch.index.mapper.DocumentParser.innerParseDocument(DocumentParser.java:142)
[elasticsearch] ... 13 more
That's happening because, unlike in the string case, we do handle ignoreMalformed for numeric types; but when we throw the exception here, we haven't consumed the field's object up to XContentParser.Token.END_OBJECT, and that comes back to bite us later, here.
So, I think two things must be done:
(1) Honor the ignoreMalformed setting in StringFieldMapper, which is not happening today (hence the originally reported issue)
(2) Parse until the end of the current object before throwing IllegalArgumentException("unknown property [" + currentFieldName + "]"); in the Mapper classes, to prevent the exception I reported above. Or maybe just ignore this exception in innerParseDocument when ignoreMalformed is set?
Does this make sense, @clintongormley? I'll happily send a PR for this.
Ah - I just realised that the original post refers to a string field, which doesn't support ignore_malformed...
@andrestc I agree with your second point, but I'm unsure about the first...
@rjernst what do you think?
Sorry for the delayed response, I lost this one in email.
@clintongormley I think it is probably worth making the behavior consistent, and it does seem to me finding an object where a specific piece of data is expected constitutes "malformed" data.
@andrestc A PR would be great.
I want to upvote this issue! I have fields in my JSON that are objects, but when they are empty, they contain an empty string, i.e. "" (this is the result of an XML2JSON parser). Now when I add a document where this is the case, I get a
MapperParsingException[object mapping for [xxx] tried to parse field [xxx] as object, but found a concrete value]
This is not at all what I would expect from the documentation https://www.elastic.co/guide/en/elasticsearch/reference/2.0/ignore-malformed.html; please improve the documentation or fix the behavior (preferred!).
@clintongormley "i just realised that the original post refers to a string field, which doesn't support ignore_malformed..." Why should string fields not support ignore_malformed?
+1
I think much more could be done here, e.g. set the field to a default value and add an annotation to the document, so users can see what went wrong. In my case, all documents from Apache logs with "-" in the size field (integer) were rejected. I could tell you 100 stories about why Elasticsearch doesn't accept documents from real data sources ... (just to mention one more: https://github.com/elastic/elasticsearch/issues/3714)
I think this problem could be handled much better:
- if a type error appears, try to convert the value (as an optional server/index setting). JSON often has numbers without quotes (correct), but some sources put numbers as strings in quotes; in that case the string could be converted to an integer (see the sketch after this list).
- if the type does not fit, take a default value for this type (0, null) - or ignore the field as today, which is however very bad if it is a larger object ...
- add a comment field like "_es_error_report: MapperParsingException: ...", so users can see that something went wrong. Today the data just disappears when a document fails to index or a field is ignored; the sysadmin might see the error message in some logs, but users who have no access to the Elasticsearch logs just wonder why the data in Elasticsearch is incomplete. In my case I missed all Apache messages with status code 500 and size "-" instead of 0 - which is really bad - and it depends on the log parser ...
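For the quoted-numbers part of the first suggestion, the coerce mapping parameter (enabled by default for numeric fields) already converts strings like "10" to integers, and combined with ignore_malformed a value like "-" costs only the one field rather than the whole document. A sketch with illustrative names:
$ curl -XPUT localhost:9200/logs -d '{"mappings":{"event":{"properties":{"size":{"type":"integer","coerce":true,"ignore_malformed":true}}}}}'
$ curl -XPOST localhost:9200/logs/event -d '{"size":"10"}'
$ curl -XPOST localhost:9200/logs/event -d '{"size":"-"}'
Both documents index: the first has size coerced to 10, the second loses only [size]. The error annotation suggested above has no built-in equivalent, though.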
A good example is Logsene: it adds error annotations to failed documents together with a string version of the original source document (@sematext can catch Elasticsearch errors during the indexing process). So at least Logsene users can see failed index operations and the original document in their UI or in Kibana. Thanks to this feature I'm able to report this issue to you.
It would be nice if such improvements were available out of the box for all Elasticsearch users.
any news here?
I wish to upvote the issue too. My understanding of ignore_malformed's purpose is to not lose events, even if you might lose some of their content. I'm currently facing an issue similar to the one described here. It has been identified, and multiple mid-term approaches are being looked into - in our case the issue comes from multiple sources sending similar events, so options like splitting the events into separate mappings, or cleaning up the events before they reach Elasticsearch, could work - but I would have liked something like the ignore_malformed functionality to be in place as a short-term fix.
Same problem with dates.
When adding an object with a field of type "date": in my DB, whenever that field is empty it's represented as "" (an empty string), causing this error:
[DEBUG][action.admin.indices.mapping.put] [x] failed to put mappings on indices [[all]], type [seedMember]
java.lang.IllegalArgumentException: mapper [nms_recipient.birthDate] of different type, current_type [string], merged_type [date]
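Declaring the field explicitly up front with ignore_malformed (instead of letting dynamic mapping guess the type) should at least keep the empty strings from failing the document, at the cost of silently dropping those values. A sketch using the field from the error above:
$ curl -XPUT localhost:9200/all -d '{"mappings":{"seedMember":{"properties":{"nms_recipient":{"properties":{"birthDate":{"type":"date","ignore_malformed":true}}}}}}}'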
Same problem for me. I'm using the ELK stack, where people may use the same properties but with different types. I don't want those properties to be searchable, but I don't want to lose the entire event either. I thought ignore_malformed would do that, but apparently it doesn't work in all cases.
We are having issues with this same feature. We have documents that sometimes decide to have objects inside something that was intended to hold strings. We would like to not lose the whole document just because one of its fields is malformed.
This is the behaviour I expected to get from setting ignore_malformed on the properties, and I would applaud such a feature.
Hey, I have the same problem. Is there any solution (even if it is a bit hacky) out there?
Facing this in Elasticsearch 2.3.1. Until this bug is fixed, we should at least have a list of the bad fields inside the mapper_parsing_exception error so that the app can choose to remove them. Currently there is no standard field in the error through which these keys can be retrieved:
"error":{"type":"mapper_parsing_exception","reason":"object mapping for [A.B.C.D] tried to parse field [D] as object, but found a concrete value"}}
The app would have to parse the reason string and extract A.B.C.D, which will break if the error format ever changes. Additionally, mapper_parsing_exception itself seems to use different formats for different parsing error scenarios, all of which the app would need to handle.
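Until a structured field exists, the app is stuck scraping the reason string, e.g. something like this deliberately fragile, hypothetical sketch (jq assumed to be available):
$ echo '{"error":{"type":"mapper_parsing_exception","reason":"object mapping for [A.B.C.D] tried to parse field [D] as object, but found a concrete value"}}' | jq -r '.error.reason' | sed -n 's/object mapping for \[\(.*\)\] tried to parse field.*/\1/p'
A.B.C.D
Any change to the message wording silently breaks the extraction, which is exactly the problem.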
I used a workaround for this, following the recommendations from the Elasticsearch forums and the official documentation.
Declaring the mapping of the objects you want to index (if you know it), with ignore_malformed on dates and numbers, should do the trick. The tricky fields that could hold either string or nested content can simply be declared as object.
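A minimal sketch of that kind of defensive mapping (index and field names are illustrative), with ignore_malformed on the date and number fields and the unpredictable field declared as a plain object:
$ curl -XPUT localhost:9200/myindex -d '{"mappings":{"mytype":{"properties":{"timestamp":{"type":"date","ignore_malformed":true},"bytes":{"type":"integer","ignore_malformed":true},"payload":{"type":"object"}}}}}'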
For usage as a real log stash, I would say something like https://github.com/elastic/elasticsearch/issues/12366#issuecomment-175748358 is a must-have! I can get accustomed to losing indexed fields, but losing whole log entries is a no-go for ELK from my perspective.
Bumping; this issue is preventing a number of my messages from being processed successfully, because a field that should hold an object arrives as an empty string in rare cases.
Bump, this is proving to be an extremely tedious (non) feature to work around.
I've found a way around this, but it comes at a cost. It could be worth it for those like me who, in the short term, want to avoid intervening directly in the data flow (like checking and fixing the log line yourself before sending it to ES). Set the enabled setting of your field to false. This makes the field non-searchable, though. That isn't too big an issue in my context, because the reason this field is so unpredictable is the same reason I need ignore_malformed to begin with, so it's not a particularly useful field to search on anyway; you still have access to the data when you find the document via another field. Incidentally, this solves both situations: writing an object to a non-object field and vice versa.
Hope this helps. It certainly saved me a lot of trouble...
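In mapping terms, the workaround described above looks something like this (a sketch; unpredictable_field is a placeholder). The field stays in _source but is never parsed or indexed, so, per the comment above, both shapes are accepted:
$ curl -XPUT localhost:9200/myindex -d '{"mappings":{"mytype":{"properties":{"unpredictable_field":{"type":"object","enabled":false}}}}}'
$ curl -XPOST localhost:9200/myindex/mytype -d '{"unpredictable_field":{"a":1}}'
$ curl -XPOST localhost:9200/myindex/mytype -d '{"unpredictable_field":"just a string"}'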
That's a good trick. I'll try that out.
+1
+1
+1
+1
Also an issue on ES 5.2.1. Very frustrating when dealing with some unexpected input that may possibly be malformed.
👍
It would definitely be great to enable the ignore_malformed property for object fields. I've had many cases of mapping errors because someone tried to index a string where a nested object should be, and vice versa.
👍
👍
+1
👍
👍