FOSElasticaBundle

Attachment ingest issue

Open aarsla opened this issue 7 years ago • 4 comments

Hello, I am using ES 5.2.2 with ingest attachment plugin and I am trying to search through doc/pdf files.

Sending a file to index/documents/test?pipeline=attachment creates

"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
  "content_type": "application/rtf",
  "language": "ro",
  "content": "Lorem ipsum dolor sit amet",
  "content_length": 28
},

so I can search through attachment.content field, however

app/console fos:elastica:populate --no-reset

with this mapping

data:
    type: attachment
    path: full
    fields:
        name: { store: yes }
        title: { store: yes }
        date: { store: yes }
        content: { term_vector: with_positions_offsets, store: yes }

only creates

"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

without desired attachment/content fields.

Any hints on what I am doing wrong? Thanks

aarsla Mar 22 '17 22:03

I'm having a similar issue, but I'm not sure if ElasticaBundle supports ingest-attachment vs mapper-attachments.

I just get the error: No handler for type [attachment] declared on field [content]

SirWaddles Apr 06 '17 00:04

I have never used attachments. Can you guys make a PR to resolve the issue?

XWB Apr 24 '17 08:04

I'm migrating to a newer Elasticsearch and facing similar problems. From my own research, I think the new Ingest Attachment plugin works a bit differently. You first define a "pipeline", where you configure the attachment plugin to take the original document, read a base64-encoded file from one field and put a "parsed" representation of it (an object containing the content, MIME type, etc.) into another field. The parsed object looks like this:

{
  "content_type": "application/rtf",
  "language": "ro",
  "content": "Lorem ipsum dolor sit amet",
  "content_length": 28
}

So clearly the base64-encoded file and the "parsed" result must be separate fields, because they have different types. Another complication is that in order to actually use a pipeline, you must specify it as a query parameter (i.e. ?pipeline=my-custom-pipeline-that-parses-files). I don't think there's a nice way of doing it in FOSElasticaBundle, right?
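To make the two steps concrete, here is a rough sketch in Python (standard library only) of the two requests involved: one that defines a pipeline with the attachment processor, and one that indexes a document through it. The pipeline name `attachment`, the `index/documents/test` path and the `data` field come from the original post; the helper names and the `http://localhost:9200` base URL are my own assumptions for illustration. Nothing is sent here, the functions only build the method, URL and JSON body.

```python
import base64
import json
from urllib.parse import quote, urlencode

# Assumed base URL of a local Elasticsearch node (illustration only).
ES_URL = "http://localhost:9200"


def pipeline_request(name, source_field="data"):
    """Build the PUT request that defines an ingest pipeline running the
    attachment processor on `source_field` (a base64-encoded file)."""
    body = {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": source_field}}],
    }
    return "PUT", f"{ES_URL}/_ingest/pipeline/{quote(name)}", json.dumps(body)


def index_request(index, doc_type, doc_id, raw_file, pipeline):
    """Build the PUT request that indexes one document through `pipeline`.
    The pipeline is selected with the `pipeline` query parameter."""
    body = {"data": base64.b64encode(raw_file).decode("ascii")}
    query = urlencode({"pipeline": pipeline})
    url = f"{ES_URL}/{index}/{doc_type}/{quote(str(doc_id))}?{query}"
    return "PUT", url, json.dumps(body)
```

For example, `index_request("index", "documents", "test", rtf_bytes, "attachment")` targets `index/documents/test?pipeline=attachment`, and with the small RTF file `{\rtf1\ansi ... Lorem ipsum dolor sit amet ... \par }` the `data` value comes out as the same base64 string shown in the first post. With no `target_field` configured, the processor writes its result to the `attachment` field, which matches the output above.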

Another gotcha is that pipelines are not supported with the update API. If new files are added to an existing entity, they won't even be processed by the pipeline.
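One possible workaround (an assumption on my part, not something the bundle does) is to skip partial updates for documents with files and re-send the whole document through the index API instead, so the pipeline runs again. A minimal sketch in Python, with a hypothetical helper name and an assumed local base URL; it only builds the request and does not send it:

```python
import base64
import json
from urllib.parse import urlencode

# Assumed base URL of a local Elasticsearch node (illustration only).
ES_URL = "http://localhost:9200"


def full_reindex_request(index, doc_type, doc_id, current_source, new_file, pipeline):
    """Build an index (not update) request that replaces the whole document,
    so the ingest pipeline processes the new file."""
    source = dict(current_source)    # copy; the caller's dict is left untouched
    source.pop("attachment", None)   # drop the stale parsed result
    source["data"] = base64.b64encode(new_file).decode("ascii")
    query = urlencode({"pipeline": pipeline})
    url = f"{ES_URL}/{index}/{doc_type}/{doc_id}?{query}"
    return "PUT", url, json.dumps(source)
```

The trade-off is that the caller needs the full current document, since the index API overwrites it rather than merging fields.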

So yeah... Not sure how to even approach this. I think files are just a special case and if this bundle ever supports this use case, it should instead support pipelines in general. For now I think I'm just gonna stick with the deprecated attachment mapper plugin.

kgilden Nov 23 '17 03:11

Any news? Elastica.io now fully supports pipelines and the ingest attachment plugin.

progmancod Jul 04 '19 20:07