wagtail_textract icon indicating copy to clipboard operation
wagtail_textract copied to clipboard

How to make wagtail_textract work with Media Storage Backend

Open danieltomasku opened this issue 6 years ago • 5 comments

Hi there,

I am trying to use wagtail_textract for my project. I tried previously using just textract but am interested in some of the helper utilities of wagtail_textract. I am wondering how wagtail_textract will work in production with Docker and a Media Storage backend, such as Azure.

The line here is referencing a file path: text = textract.process(document.file.path).strip()

but when in production using a Media Storage backend, it seems like this will fail because it does not have a proper file system. Has this been tested or does anybody know how I might be able to get this to work? Any help would be much appreciated! Let me know if you need any more info about my project setup.

danieltomasku avatar May 09 '18 17:05 danieltomasku

Hi @danieltomasku ,

Sorry, there is nothing yet for other storage backends. Since we only use the default backend for now. What do you think is needed to make wagtail_textract play well with other storage backends?

PR's are always welcome ;).

allcaps avatar May 16 '18 15:05 allcaps

Hi @allcaps ,

Thanks for your response. I ended up using tempfile to make document indexing work with my Azure Media Storage backend. The solution I made accounts for both local development and production environments. It uses the normal file.path when in your local dev environment, but then in production uses tempfile to read the file and then close and delete. Below is the snippet I used:

import tempfile

def index_documents(self):
        """Loops through all the documents in a ResourcePage and returns a string with all the extracted text."""
        alltext = ''
        for block in self.body:
            if block.block_type == 'resource_document':
                if block.value['document'].file_extension.lower() in ['pdf', 'pptx', 'html', 'htm', 'xls', 'xlsx', 'doc', 'docx', 'rtf', '.txt']:
                    try:
                        path = block.value['document'].file.path
                        text = self.extract_text(path)
                    except NotImplementedError:
                        logger.info('Downloading for search index %s' % block.value['document'].file.url)
                        remote_file_url = block.value['document'].file.url
                        f = tempfile.NamedTemporaryFile('w+b', suffix='.%s' % block.value['document'].file_extension)
                        urlretrieve(remote_file_url, filename=f.name)
                        path = f.name
                        text = self.extract_text(path)
                        f.close()
                    alltext += text.decode("utf-8")
        return alltext

The relevant section is in the except statement. In my case, I am looping through a StreamField where users add documents to the page and indexing the individual documents. This could be extended to the way that wagtail_textract indexes documents through the transcribe_document handler.

Thoughts on the approach?

danieltomasku avatar May 16 '18 16:05 danieltomasku

Hi @danieltomasku ,

Thanks for bringing this use case to attention.

That approach looks pretty good to me.

Is there some way we can make this easier? Maybe the transcribe_document() method could have some pluggable helper method?

khink avatar May 24 '18 09:05 khink

Seems like Wagtail does the 'is-this-a-local-file' check in the same way: https://github.com/wagtail/wagtail/blob/7034cd131774b8971ff3c7424999a28164480f29/wagtail/documents/views/serve.py#L35

allcaps avatar May 28 '18 07:05 allcaps

Hi @danieltomasku, I'd be happy to accept a PR if that helps.

khink avatar Jun 05 '18 08:06 khink