This plug-in lets Vim read text documents such as PDF, Microsoft Office files (Word (doc(x)), Excel (xls(x)) or PowerPoint (ppt(x))), OpenDocument (odt), EPUB .... The text extraction depends on external tools, but most use cases are covered by installing

LibreOffice and a common text browser (such as lynx), and
pdftotext.
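
To see which of these tools are still missing on a Linux system, a small check along the following lines may help. It is only a sketch: missing_tools is a hypothetical helper, and the command names are assumptions (LibreOffice is usually started as soffice, but some distributions name it differently).

```shell
#!/bin/sh
# Hypothetical helper: print each given command name that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed command names for LibreOffice, the text browser and the PDF extractor.
missing_tools soffice lynx pdftotext
```

If the last line prints nothing, all three commands were found.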

Extractors

Whenever one is available, the plug-in uses an appropriate external converter such as unrtf, pandoc, docx2txt.pl, odt2txt, xlscat, xlsx2csv.py or pptx2md ..., and otherwise falls back to:

Either LibreOffice, an office suite that (together with a common text browser such as lynx) can handle all the formats listed above except PDF. (On Microsoft Windows, ensure after installation that the folder containing the executable, by default %ProgramFiles%\LibreOffice\program, is added to the %PATH% environment variable.)
Or Tika, a content extractor that can handle all the formats listed above and many more. To use it:
1. Download the latest runnable tika-app-...jar from Tika to ~/bin/tika.jar (on Linux) or %USERPROFILE%\bin\tika.jar (on Microsoft Windows).
2. Create
  - on Linux, a shell script ~/bin/tika that reads
```
    #!/bin/sh
    exec java -Dfile.encoding=UTF-8 -jar "$HOME/bin/tika.jar" "$@" 2>/dev/null
```
  and mark it executable (by chmod a+x ~/bin/tika).
  - on Microsoft Windows, a batch script %USERPROFILE%\bin\tika.bat that reads
```
    @echo off
    java -Dfile.encoding=UTF-8 -jar "%USERPROFILE%\bin\tika.jar" %*
```
3. Add the folder of the newly created tika executable to your environment variable $PATH (on Linux) or %PATH% (on Microsoft Windows):
  - on Linux, if you use bash or zsh, by adding the following line to ~/.profile (bash) or ~/.zshenv (zsh)
```
    PATH=$PATH:~/bin
```
  - on Microsoft Windows, a convenient program to update %PATH% is Rapidee.
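
After step 3 on Linux, it can be worth confirming that the folder really ended up on the search path of a freshly started shell. The following is a minimal sketch; path_contains is a hypothetical helper, not part of the plug-in.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the directory given as $1 is an entry of $PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if path_contains "$HOME/bin"; then
    echo "$HOME/bin is on PATH"
else
    echo "$HOME/bin is not on PATH yet; try a new login shell"
fi
```

Once the folder is found, running tika --text on a document should print its plain text, assuming Java is installed and the wrapper script from step 2 is in place.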

OCR

To extract (English) text from image files in common formats, the plug-in uses tesseract, which is available on Microsoft Windows and Linux, whenever its executable is found. On Microsoft Windows, ensure after installation that the folder containing its executable, by default %ProgramFiles%\Tesseract-OCR, is added to the %PATH% environment variable.

Pass additional command-line options to tesseract via g:office_tesseract (empty by default), for example

  let g:office_tesseract = '-l eng+ita'

to properly extract Italian words (as well as English ones).
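
Before setting the variable, the same options can be tried on the command line. The sketch below wraps the standard call tesseract IMAGE stdout OPTIONS; ocr_to_stdout and the TESSERACT override variable are hypothetical names introduced for illustration, and how the plug-in itself invokes tesseract is not shown here.

```shell
#!/bin/sh
# Hypothetical helper: OCR the image given as $1 to standard output,
# passing any further arguments on to tesseract as options.
# TESSERACT is an assumed override for the executable name.
ocr_to_stdout() {
    image=$1
    shift
    "${TESSERACT:-tesseract}" "$image" stdout "$@"
}

# Usage (scan.png is a placeholder file name):
#   ocr_to_stdout scan.png -l eng+ita
```

The command tesseract --list-langs shows which language data files are installed; ita has to be among them for the example above to work.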

Other (media) file formats

To go even further and read many other file formats in Vim, media files among them, add this Vimscript snippet from lesspipe.sh to your vimrc.

Pandoc

To convert a file to markdown, add the following command to your vimrc and run :PandocToMarkdown inside the buffer of the opened file:

  command! -range=% PandocToMarkdown exe '<line1>,<line2>!pandoc --wrap=preserve --from='..PandocFiletype(&l:filetype)..' --to markdown %:S'
  function! PandocFiletype(filetype) abort
    if a:filetype ==# 'tex'
      return 'latex'
    elseif a:filetype ==# 'pandoc'
      return 'markdown'
    elseif a:filetype ==# 'text' || empty(a:filetype)
      return expand('%:e')
    else
      return a:filetype
    endif
  endfunction
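
To try the same conversion outside Vim, the filetype mapping can be approximated in a shell function. This is a rough sketch only: pandoc_format is a hypothetical helper, and unlike expand('%:e') it treats the name of a dotfile as an extension.

```shell
#!/bin/sh
# Hypothetical helper that roughly mirrors PandocFiletype:
# $1 is the Vim filetype (may be empty), $2 is the file name.
pandoc_format() {
    case "$1" in
        tex)     echo latex ;;
        pandoc)  echo markdown ;;
        text|'') base=${2##*/}                    # strip any directory part
                 case "$base" in
                     *.*) echo "${base##*.}" ;;   # use the file extension
                     *)   echo "" ;;              # no extension available
                 esac ;;
        *)       echo "$1" ;;
    esac
}

# Usage (requires pandoc; paper.tex is a placeholder file name):
#   pandoc --wrap=preserve --from="$(pandoc_format tex paper.tex)" --to markdown paper.tex
```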

Related

This answer shows how this plug-in works in principle and refers to vim-util as an alternative implementation for some word document formats; it uses the textutil command on macOS and also allows writing back the text edited in Vim.
