elasticsearch-river-web

How to index text inside <div> tags

Open srinivasv2 opened this issue 11 years ago • 5 comments

Hi,

Can anyone help me index the text between particular <div> tags, something like:

<div data-canvas-width="125.304" data-font-name="g_font_580_0" data-angle="0" style="font-size: 24px; font-family: sans-serif; left: 64px; top: 172px; transform: rotate(0deg) scale(1.00243, 1); transform-origin: 0% 0% 0px;" dir="ltr">Automotive </div>

This is to index some content in pdf files as per my requirement.

Thanks in advance,
Srinivas

srinivasv2 • Mar 25 '14 19:03

> This is to index some content in pdf files as per my requirement.

Is the div tag inside the PDF file?

marevol • Mar 25 '14 21:03

Yes, this div tag is in the PDF file. I need to index all PDF data of this kind for my requirement.

srinivasv2 • Mar 26 '14 09:03

Hmm, extracting content with a CSS query is supported for HTML only, so that is difficult to do.

marevol • Mar 26 '14 14:03
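
To illustrate the HTML-only limitation mentioned above, here is a minimal Python sketch of attribute-based `<div>` extraction using only the standard library. It is not river-web's own extraction code; the names `DivTextExtractor` and `extract_div_text` and the choice of `data-font-name` as the matching attribute are assumptions made for illustration. It works when the input is HTML markup and finds nothing when handed raw PDF bytes decoded as text, which is why the PDF would first have to be converted to HTML or parsed by a PDF-aware tool.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text of every <div> that carries a given attribute."""

    def __init__(self, required_attr):
        super().__init__()
        self.required_attr = required_attr
        self.depth = 0      # nesting depth inside a matching <div>
        self.buffer = []    # text pieces of the <div> being read
        self.texts = []     # finished text of each matching <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # Nested <div> inside a match: track it so the end tags pair up.
            self.depth += 1
        elif self.required_attr in dict(attrs):
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_div_text(html, required_attr="data-font-name"):
    """Return the text of each <div> that has `required_attr` set."""
    parser = DivTextExtractor(required_attr)
    parser.feed(html)
    parser.close()
    return parser.texts


sample = (
    '<div data-font-name="g_font_580_0" dir="ltr">Automotive </div>'
    "<div>not selected</div>"
)
print(extract_div_text(sample))
```

Markup like the `<div>` in the question typically comes from a browser-side PDF viewer that renders a text layer, so it exists only after rendering and is not part of the PDF file the crawler downloads.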

Okay, thanks for your response. My actual goal is to extract some data from PDF files to display as the title and description on the search page, just as we do for normal HTML pages. At the moment I get an empty field when I try to index "title" in the crawl pattern.

The search result should look like this:

[PDF] Automotive Tote Labeling ... Printers & Media Application Brief Automotive Manufacturing Labeling Industry Need Public Safety and 24/7 production ...

Please let me know of any alternative way to index and fetch particular data from PDF files, which we are able to do in our current search application. As of now I can index only the URL and body fields for PDFs in ES, and almost all of the body content is in binary format.

Thanks,
Srinivas

srinivasv2 • Mar 26 '14 18:03

An attachment type might work. Please see "Use attachment type".

marevol • Mar 26 '14 21:03
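
As a rough sketch of what the attachment-type suggestion involves, the Python below builds the two JSON bodies typically used with the Elasticsearch mapper-attachments plugin: a mapping that declares a field of type `attachment`, and a document whose field value is the base64-encoded file. It assumes that plugin is installed; the type name `doc`, the field name `file`, the example URL, and both helper functions are hypothetical, and the river-web configuration for this is not shown here. With such a mapping the plugin extracts text and metadata from the binary, which may provide a usable title and body for PDFs.

```python
import base64
import json


def build_attachment_mapping(type_name="doc", field="file"):
    """Mapping body declaring `field` as an attachment (names are assumed)."""
    return {type_name: {"properties": {field: {"type": "attachment"}}}}


def build_attachment_document(pdf_bytes, url, field="file"):
    """Document body carrying the PDF as a base64 string in `field`."""
    return {
        "url": url,
        field: base64.b64encode(pdf_bytes).decode("ascii"),
    }


# Placeholder bytes stand in for a real PDF read from disk or the crawler.
fake_pdf = b"%PDF-1.4 placeholder"
mapping = build_attachment_mapping()
document = build_attachment_document(fake_pdf, "http://example.com/brief.pdf")

print(json.dumps(mapping, indent=2))
print(json.dumps(document, indent=2))
```

These bodies would be sent with the usual put-mapping and index requests. Whether the extracted title is populated depends on the metadata stored in each PDF.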