vim-office
read common binary files, such as PDFs and those of Microsoft Office or LibreOffice, in Vim
This plug-in lets Vim read text documents of types such as PDF, Microsoft Office (Word (doc(x)), Excel (xls(x)) or PowerPoint (ppt(x))), OpenDocument (odt), EPUB, ...
The text extraction depends on external tools, but most use cases are covered by an installation of LibreOffice and a common text browser (such as lynx), and pdftotext.
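To see which of these tools are already installed, a small shell check can help. This is only a sketch: the function name missing_tools is made up for illustration, and the executable names (soffice for LibreOffice, lynx, pdftotext) are assumptions about a typical installation that may differ on your system.

```shell
#!/bin/sh
# Print every command name given as an argument that is not found on $PATH.
# (missing_tools is a hypothetical helper, not part of the plug-in.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names; adjust them to your system.
missing_tools soffice lynx pdftotext
```

If nothing is printed, all three commands were found.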
Extractors
It uses, whenever available, appropriate external converters such as unrtf, pandoc, docx2txt.pl, odt2txt, xlscat, xlsx2csv.py or pptx2md ..., but will fall back to:
- Either LibreOffice, an office suite that (together with a common text browser such as lynx) can handle all the formats listed above except PDFs. (On Microsoft Windows, ensure after its installation that the path of the folder containing the executable, by default %ProgramFiles%\LibreOffice\program, is added to the %PATH% environment variable.)
- Or Tika, a content extractor that can handle all the formats listed above and many more. To use it:
  - Download the latest runnable tika-app-...jar from Tika to ~/bin/tika.jar (on Linux) or %USERPROFILE%\bin\tika.jar (on Microsoft Windows).
  - Create
    - on Linux, a shell script ~/bin/tika that reads

          #!/bin/sh
          exec java -Dfile.encoding=UTF-8 -jar "$HOME/bin/tika.jar" "$@" 2>/dev/null

      and mark it executable (by chmod a+x ~/bin/tika).
    - on Microsoft Windows, a batch script %USERPROFILE%\bin\tika.bat that reads

          @echo off
          java -Dfile.encoding=UTF-8 -jar "%USERPROFILE%\bin\tika.jar" %*

  - Add the folder of the newly created tika executable to your environment variable $PATH (on Linux) or %PATH% (on Microsoft Windows):
    - on Linux, if you use bash or zsh, by adding to ~/.profile or ~/.zshenv the line

          PATH=$PATH:~/bin

    - on Microsoft Windows, a convenient program to update %PATH% is Rapidee.
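On Linux, the wrapper-creation and $PATH steps above can be sketched as two shell functions. The names write_tika_wrapper and dir_on_path are hypothetical helpers, not part of the plug-in, and the sketch assumes that tika.jar has already been downloaded to ~/bin/tika.jar by hand.

```shell
#!/bin/sh
# Write the tika wrapper script from the steps above into the given folder
# (default: ~/bin) and mark it executable. The jar itself is NOT downloaded
# here; the wrapper expects it at ~/bin/tika.jar.
write_tika_wrapper() {
    dir=${1:-"$HOME/bin"}
    mkdir -p "$dir" || return 1
    cat > "$dir/tika" <<'EOF'
#!/bin/sh
exec java -Dfile.encoding=UTF-8 -jar "$HOME/bin/tika.jar" "$@" 2>/dev/null
EOF
    chmod a+x "$dir/tika"
}

# Succeed if the given folder (default: ~/bin) is an entry of $PATH.
dir_on_path() {
    dir=${1:-"$HOME/bin"}
    case ":$PATH:" in
        *":$dir:"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A possible use is `write_tika_wrapper && dir_on_path || echo 'add ~/bin to PATH'`, which creates the wrapper and reminds you of the last step if ~/bin is not yet on $PATH.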
OCR
To extract (English) text from image files in common formats, it uses tesseract, which is available on Microsoft Windows and Linux, whenever its executable is found.
On Microsoft Windows, ensure after its installation that the path of the folder containing its executable, by default %ProgramFiles%\Tesseract-OCR, is added to the %PATH% environment variable.
Pass additional command-line options to tesseract via g:office_tesseract (left empty by default), for example
let g:office_tesseract = '-l eng+ita'
to properly extract Italian words (as well as English ones).
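To try such options outside Vim first, tesseract can be run directly with stdout as its output base. The helper name ocr_to_stdout is made up for illustration, and how the plug-in itself assembles its command line is not shown here, so treat the correspondence as an assumption.

```shell
#!/bin/sh
# Run tesseract on an image and print the recognized text on standard output.
# Any further arguments (for example: -l eng+ita) are passed on unchanged.
# (ocr_to_stdout is a hypothetical helper, not part of the plug-in.)
ocr_to_stdout() {
    image=$1
    shift
    tesseract "$image" stdout "$@"
}
```

For example, `ocr_to_stdout scan.png -l eng+ita` runs `tesseract scan.png stdout -l eng+ita`, where scan.png stands for any image file of yours.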
Other (media) file formats
To go even further, for example to read media files (among many other file formats) in Vim, add this Vimscript snippet from lesspipe.sh to your vimrc.
Pandoc
To convert a file to markdown, add the following command to your vimrc and run :PandocToMarkdown inside the buffer of the opened file:
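" Replace the given range (by default the whole buffer) with pandoc's
" markdown rendering of the current file.
command! -range=% PandocToMarkdown exe '<line1>,<line2>!pandoc --wrap=preserve --from='..PandocFiletype(&l:filetype)..' --to markdown %:S'

" Map the Vim filetype to the name of a pandoc input format.
function! PandocFiletype(filetype) abort
  if a:filetype ==# 'tex'
    return 'latex'
  elseif a:filetype ==# 'pandoc'
    return 'markdown'
  elseif a:filetype ==# 'text' || empty(a:filetype)
    " No specific filetype: fall back to the file extension.
    return expand('%:e')
  else
    return a:filetype
  endif
endfunction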
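To check outside Vim what pandoc makes of a file, the same mapping can be mirrored in the shell. This is a rough sketch: pandoc_from and to_markdown are hypothetical names, and the extension fallback only approximates Vim's expand('%:e').

```shell
#!/bin/sh
# Map a Vim filetype (and file name) to a pandoc input format, mirroring
# PandocFiletype() above. (Hypothetical helper, not part of the plug-in.)
pandoc_from() {
    filetype=$1
    file=$2
    case "$filetype" in
        tex) echo latex ;;
        pandoc) echo markdown ;;
        text|'')
            # Fall back to the file extension, roughly like expand('%:e').
            case "$file" in
                *.*) echo "${file##*.}" ;;
                *) echo '' ;;
            esac ;;
        *) echo "$filetype" ;;
    esac
}

# Convert a file to markdown on standard output.
to_markdown() {
    pandoc --wrap=preserve --from="$(pandoc_from "$1" "$2")" --to markdown "$2"
}
```

For example, `to_markdown tex paper.tex` runs `pandoc --wrap=preserve --from=latex --to markdown paper.tex`, where paper.tex stands for any LaTeX file of yours.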
Related
This answer shows how this plug-in works in principle, and refers to vim-util as an alternative implementation for some word document formats that uses the textutil command on macOS and also allows writing back the text edited in Vim.