
Ability to split documents per page so one elasticsearch entry per page

Open jawiz opened this issue 5 years ago • 6 comments

When indexing large documents you may hit limits not only on the indexing part, but also when doing searches.

Splitting documents into one entry per page slices large documents into bite-size chunks and helps the performance of both indexing and searching those documents.

jawiz avatar Jun 20 '19 07:06 jawiz

Sadly Tika does not offer this AFAIK.

dadoonet avatar Jun 20 '19 08:06 dadoonet

Actually @tballison wrote recently on discuss:

It is simpler than that. Just use the ToXMLContentHandler to get an XML String, and then run a SAXParser (or JSoup in case we're not getting our tags right :D) against that xml, and parse the content per page. No need to send anything back to Tika. I can demo it for you pretty easily...

So that should be doable. I need to play around with it a bit. :)
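To make the quoted idea concrete, here is a minimal sketch in Python of the "run a SAX parser over the XML string and collect content per page" step. It assumes the XHTML string has already been produced by Tika and that each page is wrapped in a `<div class="page">` element (which is what Tika emits for PDFs, but may not hold for other formats). The names `PageHandler` and `split_pages` are made up for illustration and are not part of Tika or FSCrawler.

```python
import xml.sax


class PageHandler(xml.sax.ContentHandler):
    """Collects the text found inside each <div class="page"> element."""

    def __init__(self):
        super().__init__()
        self.pages = []    # text of every completed page
        self._depth = 0    # <div> nesting depth inside the current page, 0 = outside
        self._chunks = []  # text fragments of the page being read

    def startElement(self, name, attrs):
        if name != "div":
            return
        if self._depth > 0:
            # A nested <div> inside a page: track it so we find the right close tag.
            self._depth += 1
        elif attrs.get("class") == "page":
            # Assumption: a page starts at <div class="page">.
            self._depth = 1
            self._chunks = []

    def characters(self, content):
        if self._depth > 0:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.pages.append("".join(self._chunks).strip())


def split_pages(xhtml):
    """Return a list with the text of each page in a well-formed XHTML string."""
    handler = PageHandler()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return handler.pages
```

This only works when the input is well-formed XML; if the tags are not balanced, a lenient HTML parser (the JSoup role in the quote) would be the safer choice.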

dadoonet avatar Jul 03 '19 12:07 dadoonet

Yeah, it's totally doable. I wrote a small Python program that covers my needs. It only works on .pdf and not .docx. Basically, Tika parses the document to HTML, and each page is a div with class page.

I wrote it using the BeautifulSoup 4 HTML parser:

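from bs4 import BeautifulSoup

def pageSplit(rawContent):
    # rawContent["content"] is assumed to hold the HTML that Tika produced
    content = rawContent["content"]
    soup = BeautifulSoup(content, "html.parser")
    # Each PDF page is a <div class="page">; keep only its text
    pages = []
    for page in soup.find_all('div', {"class": "page"}):
        pages.append(page.text)
    return pages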
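Once the pages are split, one way to get "one Elasticsearch entry per page" is to turn the list into bulk actions. The following is only a sketch: the function name `page_actions`, the field names `path`, `page`, and `content`, and the `_id` scheme are assumptions chosen for illustration, not something FSCrawler does. The action layout (`_index`, `_id`, `_source`) is the one accepted by `elasticsearch.helpers.bulk` in the Python client.

```python
def page_actions(pages, index_name, source_path):
    """Yield one bulk-index action per page of a document.

    pages       -- list of page texts, in page order
    index_name  -- target index (hypothetical name supplied by the caller)
    source_path -- path of the original file, used to build a stable id
    """
    for number, text in enumerate(pages, start=1):
        yield {
            "_index": index_name,
            # Assumed id scheme: re-indexing the same file overwrites its pages.
            "_id": f"{source_path}#page={number}",
            "_source": {
                "path": source_path,
                "page": number,
                "content": text,  # plain text, so match queries can find words in it
            },
        }


# Possible usage with the official client (not executed here):
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch()
#   helpers.bulk(es, page_actions(pages, "my-index", "report.pdf"))
```

Storing the page number alongside the text makes it possible to point a search hit back to the exact page of the original file.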

jawiz avatar Jul 04 '19 13:07 jawiz

Do you have an estimated date of availability, so I can decide whether to interface with ES myself or wait for the functionality? Thanks.

mchari avatar Oct 24 '19 15:10 mchari

Absolutely no idea. I believe it won't happen before 2020 unless someone wants to add it to the project.

dadoonet avatar Oct 24 '19 15:10 dadoonet

Hi David, I am able to add pages encoded in base64 using the instructions in https://kb.objectrocket.com/elasticsearch/how-to-index-a-pdf-file-as-an-elasticsearch-index-267. I used es.index() to add the encoded pages into ES. I did not specify any document id. I confirmed that there is a document in my ES index. But when I try to query for content such as

qres = es.search(index="prestotest", body={"query": {"bool": {"must": [{"match": {"content": "Informed consent"}}]}}})
print(qres['hits']['total'])

I don't see any hits. Any idea how I could make it work?
Thanks

mchari avatar Oct 28 '19 18:10 mchari