Invalid output: <...> (Success) when specifying output directory

Open willwade opened this issue 2 years ago • 6 comments

Device Information (please complete the following information):

  • OS: Ubuntu 20.04.2 LTS
  • Deployment: Linux Binary
  • SIST2 Version: 2.11.5
  • Elasticsearch Version (if relevant) : 7.14.0

Command with arguments

~~./sist2-x64-linux-debug scan --ocr eng /mnt/best/BEST_ACE/DOCS/OLD/ -o ~/.docs_old_idx~~
~~./sist2-x64-linux-debug: error while loading shared libraries: libasan.so.4: cannot open shared object file: No such file or directory~~

~~If I run this in the regular binary I get~~

./sist2 scan --ocr eng /mnt/best/BEST_ACE/DOCS/OLD/ -o ~/.docs_old_idx
Invalid output: '/home/willwade/.docs_old_idx/' (Success).

Am I doing something wrong?

Describe the bug

Failing. See above.

Steps To Reproduce

  1. mkdir /mnt/dir
  2. smbmount a directory to the dir
  3. install tesseract language file
  4. run the scan index with --ocr eng

Expected behavior

Should OCR the files.

Actual Behavior

Crashing?!

willwade avatar Dec 15 '21 15:12 willwade

Interestingly, if I run it without the -o option, e.g.

./sist2 scan --ocr eng /mnt/best/BEST_ACE/DOCS/OLD

I get

[7F4E3EC42A40] [2021-12-15 16:26:01] [FATAL cli.c] Could not find tesseract language file!

but I've definitely installed it, e.g.

sudo apt install tesseract-ocr-eng
[sudo] password for willwade:
Reading package lists... Done
Building dependency tree
Reading state information... Done
tesseract-ocr-eng is already the newest version (1:4.00~git30-7274cfa-1).
0 upgraded, 0 newly installed, 0 to remove and 86 not upgraded.

willwade avatar Dec 15 '21 16:12 willwade

To use the debug binary you need to install the libasan4 package (or libasan5, I don't remember exactly). Ideally you want to use the release binary for better performance, and only use the debug one to help me troubleshoot crashes.

As for the Invalid output: <...> (Success) thing, it's a little bit weird. Does sist2 have write permission to that directory? Or does that folder already exist?

As for the tesseract language file, it might be because Ubuntu changed the folder where the language files are saved. Can you try to locate them on your machine? What is the output of find /usr/share/ -name "*.traineddata" ?

simon987 avatar Dec 15 '21 16:12 simon987

Re: tesseract:

/usr/share/tesseract-ocr/4.00/tessdata/osd.traineddata
/usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata

Re: write access. Do you mean does it have write access to '/home/willwade/.docs_old_idx/'? Yeah - sist2 created it. It won't be able to write to the mounted dir though - I think I've set that as read-only. Does it need that?

Done the sudo apt install libasan4 - no more debug output other than Invalid output: ..

willwade avatar Dec 15 '21 21:12 willwade

For now you can just copy eng.traineddata into your working directory (the same directory as the sist2 binary) and it should work.
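
The workaround amounts to making the file findable in a directory that gets checked. A minimal sketch of that kind of lookup in C is below. It is not sist2's actual code: `find_traineddata` and `CANDIDATE_TESSDATA_DIRS` are hypothetical names, the first two directories come from this thread, and the last two are assumptions about where other distributions might put the files.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Look for "<lang>.traineddata" in each directory of `dirs` (NULL-terminated).
 * On success, writes the full path into `out` and returns 0.
 * Returns -1 (and an empty `out`) if no candidate is readable.
 */
int find_traineddata(const char *lang, const char *const *dirs,
                     char *out, size_t out_len) {
    for (size_t i = 0; dirs[i] != NULL; i++) {
        int n = snprintf(out, out_len, "%s/%s.traineddata", dirs[i], lang);
        if (n < 0 || (size_t) n >= out_len) {
            continue; /* path did not fit in the buffer: skip this candidate */
        }
        if (access(out, R_OK) == 0) {
            return 0; /* found a readable file */
        }
    }
    if (out_len > 0) {
        out[0] = '\0';
    }
    return -1;
}

/* Hypothetical search order for this sketch. */
const char *const CANDIDATE_TESSDATA_DIRS[] = {
    ".",                                      /* working directory (workaround above) */
    "/usr/share/tesseract-ocr/4.00/tessdata", /* location reported in this thread */
    "/usr/share/tesseract-ocr/tessdata",      /* assumption */
    "/usr/share/tessdata",                    /* assumption */
    NULL,
};
```

Calling `find_traineddata("eng", CANDIDATE_TESSDATA_DIRS, buf, sizeof buf)` with a `char buf[512]` would return 0 and fill `buf` on the machine described here, since eng.traineddata exists under /usr/share/tesseract-ocr/4.00/tessdata.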

I don't have access to my workstation for a few days. I'm trying to understand why the -o option doesn't work here. Can you try to specify the full path and not use a . prefix for the folder? This shouldn't be an issue, but it's worth trying.

For example ... -o /home/willwade/docs_old_idx/

simon987 avatar Dec 18 '21 13:12 simon987

OK - so popping the tesseract training file into the working directory has worked.

What's weird though is that, even though sist2 made the directory on one attempt, if I rerun a more successful scan I get that Invalid output: 'index.sist2/' (Success). error. So I think that is saying "dir exists".

I would totally have expected it to have just written over the top of it - maybe that's just user error though.

So anyway - got it working now!

willwade avatar Dec 18 '21 15:12 willwade

OK, I'm glad it works now.

No, it should never overwrite the output directory, but it should say "Output exists" or something like that. I'll update the message or fix the error checking; it should not treat Success as an error code.
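
For context, a "(Success)" suffix is what you would see if a message is built from strerror(errno) while errno is still 0: on glibc, strerror(0) returns "Success". The sketch below shows one way to report an existing output directory separately from a real failure. It is a minimal illustration, not the actual sist2 fix: `check_output_path` and the `out_status` codes are hypothetical names.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical result codes for this sketch (not taken from sist2). */
enum out_status { OUT_OK = 0, OUT_EXISTS = 1, OUT_ERROR = 2 };

/*
 * Decide whether `path` can be used as a fresh output directory.
 * Writes a human-readable reason into `msg` when the path is rejected,
 * and an empty string when the path is usable.
 */
enum out_status check_output_path(const char *path, char *msg, size_t msg_len) {
    struct stat st;

    if (stat(path, &st) == 0) {
        /* stat() succeeded, so errno says nothing useful: do not print it. */
        snprintf(msg, msg_len, "Output exists: '%s'", path);
        return OUT_EXISTS;
    }
    if (errno == ENOENT) {
        /* Nothing at that path yet: it is safe to create it. */
        if (msg_len > 0) {
            msg[0] = '\0';
        }
        return OUT_OK;
    }
    /* A real failure (e.g. EACCES): here strerror(errno) is informative. */
    snprintf(msg, msg_len, "Invalid output: '%s' (%s)", path, strerror(errno));
    return OUT_ERROR;
}
```

With this shape, rerunning a scan against an existing index directory would yield an "Output exists" message, and strerror() would only be consulted when a system call has actually failed.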

simon987 avatar Dec 18 '21 15:12 simon987