catma
Upload problem with epub files
I was surprised to see that I could upload epub files to CATMA – I didn't know that! However, the text seems to be incomplete – it is cut off somewhere in the middle. What could be the problem here?
For all documents except XML, we use the Apache Tika parser to extract the text. It claims to support a lot of formats (https://tika.apache.org/1.24.1/formats.html), but I suspect it has a problem with epub. And it obviously cannot handle DRM-protected documents. I would need to take a closer look to see what the problem is, though. When I tested it, some documents were handled fine while others were not.
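If you want to narrow down where the text gets cut off, here is a rough diagnostic sketch in Python. It is not how CATMA or Tika processes the file. It only relies on the fact that an epub is a ZIP archive whose META-INF/container.xml points to a package (OPF) file listing the chapters in reading order. The function name `epub_spine_report` and the helper class are made up for this example. Note that a META-INF/encryption.xml entry only hints at DRM; some publishers use it for font obfuscation alone.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import unquote

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


class _TextCollector(HTMLParser):
    """Collects character data, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def epub_spine_report(path):
    """Hypothetical helper: return (has_encryption_xml, chapters).

    chapters is a list of (archive_member, text_length) in reading order;
    text_length is None when the spine refers to a file missing from the archive.
    """
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
        has_encryption = "META-INF/encryption.xml" in names

        # container.xml names the package (OPF) document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")
        opf = ET.fromstring(zf.read(opf_path))

        # The manifest maps ids to files; the spine gives the reading order.
        manifest = {
            item.get("id"): item.get("href")
            for item in opf.findall("opf:manifest/opf:item", OPF_NS)
        }
        base = posixpath.dirname(opf_path)

        chapters = []
        for itemref in opf.findall("opf:spine/opf:itemref", OPF_NS):
            href = manifest.get(itemref.get("idref"))
            if href is None:
                continue
            member = posixpath.normpath(posixpath.join(base, unquote(href)))
            if member not in names:
                chapters.append((member, None))
                continue
            collector = _TextCollector()
            collector.feed(zf.read(member).decode("utf-8", errors="replace"))
            collector.close()
            chapters.append((member, len("".join(collector.parts).strip())))
    return has_encryption, chapters
```

Calling `epub_spine_report("book.epub")` (with your own file path) gives you a per-chapter character count. Comparing that with the text that shows up in CATMA should tell you roughly at which chapter the extraction stops, which would be helpful information for tracking down the problem.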