David Bashford

Results 96 comments of David Bashford

Totally reasonable request. Would require a good deal of refactoring to allow extractors to return something other than a string. And it wouldn't be hard for the `pptx` extractor to...

Because this may require a good deal of refactoring, going to wait to take a look at this until after I've cleared the decks on the other workable issues....

Did you use `preserveLineBreaks`?

Also, yes, there is support for groups of characters. If something is missing, raise another issue.

Can you give me an example document?

Have you tried adding the Arabic support that pdftotext provides? I wasn't able to get it working locally. Going to take a look soon at the PR that introduces use...

For what it's worth, I have confirmed that Arabic works fine in general (can extract from `.docx` and included a test to confirm); the characters are just not coming out...

In order to get `2.0` out soon, moving this to `2.1`.

If I run textract on the `.doc` files in that attachment, I get a document filled with single spaced out words. No instances of double spaces. Single spaced words are...