qubes-app-linux-pdf-converter icon indicating copy to clipboard operation
qubes-app-linux-pdf-converter copied to clipboard

Add support for more file types + archlinux packaging

Open neowutran opened this issue 4 years ago • 29 comments

Main "features" of this pull request

  • Add more file type: Support PDF protected by password, and support office files (all files understood by libreoffice, with password protected file support)
  • Add archlinux packaging

Minor improvements

  • merge the man page with the readme to avoid text duplication (that does not get updated at the same frequency..)
  • Switch from ImageMagick to GraphicsMagick (it is a fork of ImageMagick, it is theoretically faster, with less features and less bug prone)

Things that need improvement / TODO:

  • ~~How to quickly tests the software while developing it ? I created a new directory called "dev_tools", it contain some safe tests file and a script to automatically convert them.~~

  • ~~Fix debian and fedora packaging~~

  • ~~In the client script I added a code that read the pdf file to get its size wc -c "$INPUT_FILE" | cut -d ' ' -f 1 I assumed that counting the number of bytes of a file is a operation simple enough to be bug free. This line is part of the changes I did to support password protected file from the GUI (The server script tell the client script "Hey! It need password!" Or "Hey! All fine, it doesn't need password!" depending on that, the GUI prompt a password form. So I needed avoid closing the file descriptor exec >&- to be able to keep communicating between client and server after the file have been transferred. Didn't found a way to avoid this bytes counting on the file.~~

Why this pull request

I found this project https://dangerzone.rocks https://github.com/firstlookmedia/dangerzone https://github.com/firstlookmedia/dangerzone-converter from Micahflee who implemented the idea for non-Qube based system and improved it to support more file type. So trying to implement it back for Qubes, and improving it a bit further to support password protected file and allow some more file type (still only the files supported by libreoffice, but I am way more permissive with mime type check)

Tests I did

The things in "dev_tools" Tested it on archlinux, with gui and with cli. Didn't tested it on debian/fedora yet

I see there is a pull request to re-implement the script from bash to python. So this pull request will require modifications once the bash -> python is done

neowutran avatar Apr 22 '20 18:04 neowutran

* How to quickly tests the software while developing it ? I created a new directory called "dev_tools", it contain some safe tests file and a script to automatically convert them.

qrexec really provides pipe-like connection between two processes, running in different VMs. You can emulate it locally with socat EXEC:client EXEC:server.

* This line is part of the changes I did to support password protected file from the GUI (The server script tell the client script "Hey! It need password!" Or "Hey! All fine, it doesn't need password!" depending on that, the GUI prompt a password form.  So I needed avoid closing the file descriptor `exec >&-` to be able to keep communicating between client and server after the file have been transferred.

I don't like this added interactive part to the protocol. Why not simply display a password prompt from within DispVM (server part)?

marmarek avatar Apr 23 '20 21:04 marmarek

Ran into another issue: Cannot convert password protected files on Debian. The "unoconv" package available on debian is way too old (~2017). Even the version in SID is way too old.

So multiples options:

  • Try to push an update for the official debian package (no idea how the official debian packaging community work) ?
  • Remove unoconv dependency and embed our own version of unoconv (don't really like the idea of adding code to "solve" a problem solved multiples years ago (https://github.com/unoconv/unoconv/pull/358) )
  • INSERT_BETTER_IDEA

Just download the latest unoconv version from github https://github.com/unoconv/unoconv/blob/master/unoconv and running it on debian buster: work as expected

(Also, played a bit with libreoffice (in fact found nearly all that code on the internet but didn't saved the original url) macro to remove password)

REM  *****  BASIC  *****

Sub Main
   dim properties(0) as new com.sun.star.beans.PropertyValue
   url = convertToURL("/home/user/qubes-app-linux-pdf-converter/tools/files/password_protected.odt")
   properties(0).Name = "Password"
   properties(0).Value = "toor"
   properties(1).Name = "Hidden"
   properties(1).Value = True
   doc = StarDesktop.loadComponentFromUrl(url, "_blank", 0, properties())
   dim properties2(0) as new com.sun.star.beans.PropertyValue
   doc.storeAsURL("file:////home/user/qubes-app-linux-pdf-converter/tools/files/password_protected.not.odt", properties2())
End Sub

neowutran avatar Apr 26 '20 12:04 neowutran

I see there is a pull request to re-implement the script from bash to python.

That's me! :D

I didn't know about Dangerzone when I started but you've certainly put me on to them now! Looks like the major differences are OCR, compressed PDF size, and multiple file types.

Right now, I'm just trying to stabilize my PR before adding any more new features. But, if you'd like, I certainly wouldn't mind some extra help on implementing those things afterwards. It seems you already have the multiple file types down, and porting them to Python wouldn't be too hard (he said hopefully).

ibokuri avatar Apr 26 '20 17:04 ibokuri

Hi, Don't worry, still have some issues to fix (aka: mostly Debian issues). For the "compressed PDF size", from the few tests I did, the result is either bigger or of the same size (used the same command line as Dangerzone, but maybe more tests should be interesting) (And for the OCR part, I have a mixed personal opinion: https://github.com/QubesOS/qubes-app-linux-pdf-converter/pull/9/files#diff-f8d345018d9d86cc6bd2b8c68545b447R48 )

neowutran avatar May 01 '20 08:05 neowutran

And for the OCR part, I have a mixed personal opinion [...]

From the links you pointed out, it looks like OCR (well, Tesseract anyways) doesn't work on raw RGB bitmaps but more finalized image formats like PNG. Idk where Dangerzone does its OCR but we create PNGs client-side, meaning OCR would occur on the client as well, which is assumed to be safe.

The main things we need to worry about are what the server sends over: page count, image dimensions, and RGB bitmaps. Valid but incorrect submissions of these would be a pain to deal with.

ibokuri avatar May 01 '20 15:05 ibokuri

Some updates about debian: Created a new issue: https://bugs.launchpad.net/ubuntu/+source/unoconv/+bug/1876448 Didn't sent a pull request here for the moment https://salsa.debian.org/debian/unoconv ( I have trouble to understand debian packaging and why it need so many differents files )

Tried the following workaround (seems working but didn't tested it on all templates yet):

	libreoffice --accept='socket,host=localhost,port=2202;urp;' --norestore --nologo --nodefault >/dev/null 2>/dev/null &
	listener_notready=1
	
	# Wait until libreoffice server is started
	while [ $listener_notready -ne 0 ];
	do
		sleep 1
		netstat -anop 2> /dev/null | grep '127.0.0.1:2202' | grep LISTEN >/dev/null 2>/dev/null
		listener_notready=$?
	done

	# Remove password from file using libreoffice API
	# ...
		python3 -c '
import os
import uno
from com.sun.star.beans import PropertyValue
import sys

src="file://'"$INPUT_FILE"'"
dst="file://'"$INPUT_FILE.nopassword"'"
password="'"$PASSWORD"'"

localContext = uno.getComponentContext()
resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)
ctx = resolver.resolve("uno:socket,host=localhost,port=2202;urp;StarOffice.ComponentContext")
smgr = ctx.ServiceManager
desktop = smgr.createInstanceWithContext("com.sun.star.frame.Desktop", ctx)

hidden_property = PropertyValue()
hidden_property.Name = "Hidden"
hidden_property.Value = True

password_property = PropertyValue()
password_property.Name = "Password"
password_property.Value = password

document = desktop.loadComponentFromURL(src, "_blank", 0, (password_property, hidden_property,))
document.storeAsURL(dst, ())' >&2
  # ...
	libreoffice --convert-to pdf "$INPUT_FILE.nopassword" --outdir /tmp/ >&2
	mv "$INPUT_FILE"".pdf" "$INPUT_FILE"

It remove the unoconv dependency but it add around 40 lines of code.

Update

The workaround seems to be working on all templates I tested (archlinux, debian buster, and fedora 32)

TODO:

  • Try to do a pull request for unoconv debian package
  • Wait a bit to see if something is moving for debian unoconv package. But very unlikely that anything land on stable quickly.
  • ~~Verify that I fixed the old-stable debian packaging that I broke earlier~~
  • ~~In archlinux package I added tests (try to convert some files before installing the package). Good idea / Bad idea ? If good idea, check how to add the archlinux "check" to debian and fedora packaging~~ It also exist for debian, it is recommanded by archlinux and recommanded by debian. Didn't checked fedora yet.
  • Properly integrate thoses tests to https://github.com/QubesOS/qubes-app-linux-pdf-converter/blob/master/qubespdfconverter/tests.py (Better to way until the rewrite to python is over)

neowutran avatar May 02 '20 11:05 neowutran

Pretty much all I wanted to add is done. Putting this pull request on hold until the port from bash to python is done and accepted. And then need to integrate those changes to the python port, but should be too hard

neowutran avatar May 09 '20 08:05 neowutran

Update: since the bash -> python rewrite is over, I started to rewrite this pull request in python. Currently here https://github.com/neowutran/qubes-app-linux-pdf-converter/tree/QubesOS-master. Once I reach a point where it is working fine, i will merge it back to master. (PS: I am learning python at the same time)

neowutran avatar Jul 04 '20 08:07 neowutran

@bl0nd could you do a first review of the python code ? :) Should be working, generated the packages for fc32, buster & archlinux. Tested on archlinux

But probably lot of rooms for improvements Thanks ! :)

neowutran avatar Jul 11 '20 16:07 neowutran

Thanks you very much for the review ! :) I will take some time to fix that, probably next weekend.

* As it stands, I'd rather not support password-protected PDFs. The main issues with this implementation are that passwords are specified as command-line arguments, you may only specify 1 password even though multiple files may have different passwords, and `_decrypt()` is ironically incredibly cryptic relative to the rest of the codebase.

About the "_decrypt()", i sadly agree. I don't know how to improve the code of this part since it is mostly libreoffice API. I think instead I will add comments to be more verbose about what it does exactly

* It seems to me that `--gui` is just for password prompts? If so, I think it's a bit unnecessary. If password-protected PDFs _were_ supported, GUI prompts are basically the only option considering that multiple files can be processed.

Yes, "--gui" is just for password prompts.

The original intend was to have 2 way of using this tools, the "standard" way, and as a batch process (So without any gui or blocking prompt). However, after reading your comments, it seems that it is more complex than the interest it have. I think I will drop "--gui", "--password" and my "batch process" idea, could be implemented later if someone explicitly state the need.

However I will keep the password-protected file support: I see more and more phishing campaign using password-protected office file containing malicious macro / commands, so I [and probably others] have a need for that feature.

Plus, GUI prompts shouldn't be created from the server (for security reasons), but from requests made to Dom0 which always provides a GUI.

Can you explain more about that ? Don't really see the issue with the [potentially infected] server showing up password prompt ? [Except the: "malicious document exploiting a vulnerability in one of the tools used by server, to display a password prompt for the user. User that will mindlessly fill the field with the same password he use for every non-malicious office document. Giving the attacker a way to decrypt the others documents he stole before", but i don't see what I can do for this case (except a documentation on how to configure the disposableVM to disable network access and every non PdfConvert qubes RPC) ]

* Having to use raw sockets for multiple file type support... Is this really the only the way? sweat_smile

Yeah, not particularly elegant... The problem I am trying to solve is that when I start the libreoffice process, I need to way a bit until it is ready to take requests. I will check the documentation to see if there is something like "notify-me when ready" event, available on the command line.

neowutran avatar Jul 12 '20 07:07 neowutran

* Plus, GUI prompts shouldn't be created from the server (for security reasons)

As @neowutran already said - having password prompt on the server side should be fine. This way, all the things that interacts with potentially malicious PDF file are isolated in that VM. Plus, it will allow to avoid protocol change (server will prompt for the password itself, no need to receive it from the client). That will be at the cost of having password prompt always interactive (no way to script it), but in my opinion it isn't big issue.

There is something wrong with commits here, a lot of conflicts. Most likely because of commits for the old version still present here. I think the easier way to fix it is to forcefully rebase the whole thing:

# we'll discard all the changes from the current branch, make a backup
# also make sure to commit everything that is useful first, otherwise it will be discarded too
git branch backup-pre-rebase HEAD
# assuming "origin" points at https://github.com/QubesOS/qubes-app-linux-pdf-converter
git fetch --tags origin
# verify signature
git verify-tag $(git describe origin/master)
# this discards every local modification and commit
git reset --hard origin/master
# then pickup just changed files from the backup (example list, I haven't checked what files you've modified)
git checkout backup-pre-rebase qubespdfconverter/client.py qubespdfconverter/client.py
# review changes, then commit and push
git diff --cached
git commit -a
git push --force

Generally I would recommend creating separate branch for pull requests (and various changes in general), instead of using master in your repo, but that's unrelated topic.

PS I haven't looked at the changes yet, only at comments here.

marmarek avatar Jul 12 '20 10:07 marmarek

However I will keep the password-protected file support [...]

Oh, I want password-protected file support too, I just meant not with this specific implementation.

Don't really see the issue with the [potentially infected] server showing up password prompt ?

I guess I just thought untrusted VMs shouldn't draw prompts or something. Marek said it was fine though so forget about that part.

ibokuri avatar Jul 12 '20 15:07 ibokuri

By the way, would you mind running some performance tests? We've been using this as a benchmark, and you can find my results here.

ibokuri avatar Jul 12 '20 15:07 ibokuri

@bl0nd

time qvm-convert-pdf 253665-sdm-vol-1.pdf

OG server, new client (batch: 50, bulk): 5 min

: On my computer: 3 min 33

new server (as d2880ff), new client (as d2880ff): 2min10

neowutran avatar Jul 14 '20 21:07 neowutran

Things that doesn't work yet:

  • CI build [issue between pylint & UNO libreoffice API]

neowutran avatar Jul 17 '20 10:07 neowutran

I think it is ready for another review.

Things that changed since last time:

  • Things we discussed previously should be fixed
  • Modification to fix to GUI progress bar (zenity)

(tested on archlinux, fc32 and buster)

CI build still doesn't work: Python still doesn't find the libreoffice UNO lib, no idea why.

neowutran avatar Jul 21 '20 11:07 neowutran

@bl0nd if you want to review it once more :)

neowutran avatar Jul 29 '20 07:07 neowutran

Been a bit busy with some other stuff lately, sorry about that. I'll try to review it as soon as I can.

ibokuri avatar Jul 29 '20 15:07 ibokuri

Thanks for the review @bl0nd ! (Normally) Applied nearly all your recommendations (I will comment on the one I willingly not applied) Don't hesitate if you have more to say, and thanks again :)

neowutran avatar Aug 01 '20 07:08 neowutran

Nice work! Btw, could you resolve the conversations you've dealt with? It'd make it easier to see what still needs to be worked on.

ibokuri avatar Aug 01 '20 15:08 ibokuri

Done, 2 unresolved conversations. For 1 I will send a new commit later this week end

One a side note, travis CI is still not working, python doesn't find the libreoffice dependency to resolve "uno". For the moment no idea why, but probably better to keep that for the end

From my understanding, it doesn't work because: Python is run in a "virtualenv" so it can't access system libraries (so can't access the libreoffice python library). There is a "system_site_packages" option to allow "virtualenv" to access system libraries, but it work only for python version that have been installed on the OS. CI OS is old and doesn't support a recent enough version of python for the "pdf-converter" so we need to use a downloaded version defined in the .travis file, so we can't use "system_site_packages". As a workaround, some other project like unoconv download libreoffice and use the python binary embedded inside (see: https://github.com/unoconv/unoconv/blob/master/ci/linux.bash ). But we also need to install some pip dependencies.

Update: So I decided to ignore the error message instead of starting to do something very complex for nearly nothing ( 0302da8 )

neowutran avatar Aug 01 '20 16:08 neowutran

@marmarek I think it is ready for a review

neowutran avatar Aug 05 '20 19:08 neowutran

Something I can do / change to help for the review ?

neowutran avatar Oct 08 '20 18:10 neowutran

I think first you can try to solve the Makefile conflict (rebase) then @marmarek?

fepitre avatar Oct 22 '20 19:10 fepitre

Seems like the new version of libreoffice (>= 7.0) on archlinux introduced a bug/regression in the "storeAsURL" API for some file type (.docx). So WIP again to find a workaround / open bug ticket

neowutran avatar Oct 30 '20 10:10 neowutran

I was unable to find an explanation of why the "decrypt" function stop working with libreoffice >= 7, so filled a bug ticket https://bugs.documentfoundation.org/show_bug.cgi?id=137926

neowutran avatar Nov 02 '20 08:11 neowutran

Anyway it is ready for review, could instruct archlinux user to select "libreoffice-still" (instead of "libreoffice-fresh") until libreoffice team fix the issue. Or need to dig the libreoffice git to find what was changed in XStorable / storeasurl for the handling of com::sun::star::document::MediaDescriptor parameters

neowutran avatar Nov 08 '20 08:11 neowutran

Some idea of workaround for the libreoffice bug (I didn't found the motivation nor to [ learn c++ and learn libreoffice codebase ] nor to [ recompile and test libreoffice to find the faulty commit between 6.4 and 7.0] ): Instead of using uno, just use the built in macro system.

  • Programmatically create a macro (aka create a file in ~/.config/libreoffice/4/user/basic/Standard/RANDOM_MODULE_NAME.xba) Something like that
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE script:module PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "module.dtd">
<script:module xmlns:script="http://openoffice.org/2000/script" script:name="Module1" script:language="StarBasic">REM  *****  BASIC  *****

Sub Main
   dim properties(1) as new com.sun.star.beans.PropertyValue
   url = convertToURL(&quot;/home/user/qubes-app-linux-pdf-converter/tests/files_success/doc.doc&quot;)
   properties(0).Name = &quot;Password&quot;
   properties(0).Value = &quot;toor&quot;
   properties(1).Name = &quot;Hidden&quot;
   properties(1).Value = True
   doc = StarDesktop.loadComponentFromUrl(url, &quot;_blank&quot;, 0, properties())
   dim properties2(0) as new com.sun.star.beans.PropertyValue
   doc.storeAsURL(&quot;file:////home/user/qubes-app-linux-pdf-converter/tests/files_success/tata.doc&quot;, properties2())
   StarDesktop.Terminate()
End Sub
</script:module>

Then call libreoffice

soffice --nologo --norestore --nodefault --headless 'macro:///Standard.Module1.Main' 

neowutran avatar Mar 09 '21 23:03 neowutran

Found the commit that broke the behavior that I was relying on. Sent a suggestion/proposition of patch to re-add the behavior (everything is in the libreoffice bug ticket). Tested on my side, with the patch everything seems to be working correctly ( all my tests-cases work ).

Also, since I now understand this change of behavior, I wrote a workaround for this issue.

neowutran avatar Mar 18 '21 20:03 neowutran