How to make wagtail_textract work with Media Storage Backend
Hi there,
I am trying to use wagtail_textract for my project. I previously tried using just textract, but I am interested in some of the helper utilities of wagtail_textract. I am wondering how wagtail_textract will work in production with Docker and a media storage backend, such as Azure.
The line here references a file path:
text = textract.process(document.file.path).strip()
but in production with a media storage backend this seems likely to fail, because the remote backend does not expose a local file system path. Has this been tested, or does anybody know how I might get this to work? Any help would be much appreciated! Let me know if you need any more info about my project setup.
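For reference, this is roughly the failure mode I expect (illustration only; document stands for a Wagtail Document instance, and the remote-storage behaviour is what django-storages backends such as Azure exhibit):

import textract

try:
    # Local FileSystemStorage: .path is a real path on disk,
    # which is what textract.process() expects.
    path = document.file.path
    text = textract.process(path).strip()
except NotImplementedError:
    # Remote backends such as Azure raise NotImplementedError for .path,
    # because there is no local filesystem path; only .url and .open()
    # are available, so textract cannot be pointed at the file directly.
    pass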
Hi @danieltomasku,
Sorry, there is nothing for other storage backends yet, since we only use the default backend for now. What do you think is needed to make wagtail_textract play well with other storage backends?
PRs are always welcome ;).
Hi @allcaps,
Thanks for your response. I ended up using tempfile to make document indexing work with my Azure media storage backend. The solution accounts for both local development and production environments: it uses the normal file.path in my local dev environment, but in production it uses tempfile to download the file, read it, and then close and delete the temporary copy. Below is the snippet I used:
import logging
import tempfile
from urllib.request import urlretrieve

logger = logging.getLogger(__name__)

ALLOWED_EXTENSIONS = ['pdf', 'pptx', 'html', 'htm', 'xls', 'xlsx', 'doc', 'docx', 'rtf', 'txt']


def index_documents(self):
    """Loops through all the documents in a ResourcePage and returns a string with all the extracted text."""
    alltext = ''
    for block in self.body:
        if block.block_type == 'resource_document':
            document = block.value['document']
            if document.file_extension.lower() in ALLOWED_EXTENSIONS:
                try:
                    # Local storage: the document has a real filesystem path.
                    path = document.file.path
                    text = self.extract_text(path)
                except NotImplementedError:
                    # Remote storage (e.g. Azure): download to a temporary file first.
                    logger.info('Downloading for search index %s', document.file.url)
                    remote_file_url = document.file.url
                    f = tempfile.NamedTemporaryFile('w+b', suffix='.%s' % document.file_extension)
                    urlretrieve(remote_file_url, filename=f.name)
                    text = self.extract_text(f.name)
                    f.close()  # closing the NamedTemporaryFile also removes it
                alltext += text.decode("utf-8")
    return alltext
The relevant section is in the except statement. In my case, I am looping through a StreamField where users add documents to the page, and indexing the individual documents. This could be extended to the way that wagtail_textract indexes documents through the transcribe_document handler.
Thoughts on the approach?
Hi @danieltomasku,
Thanks for bringing this use case to attention.
That approach looks pretty good to me.
Is there some way we can make this easier? Maybe the transcribe_document() method could have some pluggable helper method?
Seems like Wagtail does the 'is-this-a-local-file' check in the same way: https://github.com/wagtail/wagtail/blob/7034cd131774b8971ff3c7424999a28164480f29/wagtail/documents/views/serve.py#L35
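To make that concrete, here is a rough sketch of what such a pluggable helper might look like (local_document_path and the simplified transcribe_document below are hypothetical names, not current wagtail_textract code):

import os
import tempfile
from contextlib import contextmanager

import textract


@contextmanager
def local_document_path(document):
    """Yield a local filesystem path for a Wagtail document.

    Uses file.path for local storage; remote storages raise
    NotImplementedError, in which case the file is copied to a
    temporary file that is removed afterwards.
    """
    try:
        path = document.file.path
    except NotImplementedError:
        path = None

    if path is not None:
        # Local storage: hand back the real path, nothing to clean up.
        yield path
        return

    # Remote storage: stream the file into a named temporary file.
    suffix = '.%s' % document.file_extension
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        document.file.open('rb')
        tmp.write(document.file.read())
        document.file.close()
    try:
        yield tmp.name
    finally:
        os.remove(tmp.name)


def transcribe_document(document):
    """Sketch: the handler would call the helper instead of file.path directly."""
    with local_document_path(document) as path:
        text = textract.process(path).strip()
    # ... store `text` on the document as wagtail_textract already does.
    return text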
Hi @danieltomasku, I'd be happy to accept a PR if that helps.