msoffcrypto-tool

Capability to return a BytesIO/filelike even if it isn't encrypted?

Open CDWimmer opened this issue 1 year ago • 3 comments

Hello,

I understand that this is potentially out of scope for the project, but considering the existence of OfficeFile.is_encrypted() I feel this would tie its usage up nicely.

I'll explain a use case via example: I am using this to load a set of usually-encrypted Excel files into pandas. This works great, except that a handful of these Excel files happen not to be password protected. I don't actually care whether or not they have a password, I just want to put them all into dataframes.

Right now, the argument I pass to pandas.read_excel() is either a non-protected Excel file's Path, or a BytesIO object retrieved using this library.

This is fine but it has resulted in this messy function:

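import io
from pathlib import Path
from typing import Optional, Union

import msoffcrypto
import pandas as pd


def decrypt_office_file(file: Path, password: Optional[str] = None) -> Union[io.BytesIO, Path]: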
    decrypted_file = io.BytesIO()
    with open(file, 'rb') as f:
        office_file = msoffcrypto.OfficeFile(f)
        if office_file.is_encrypted():
            office_file.load_key(password=password)
            office_file.decrypt(decrypted_file)
        else:
            decrypted_file = file
    return decrypted_file


excel_file = decrypt_office_file("my_file.xlsx")
df = pd.read_excel(excel_file, ...)

And then I just have to hope everything downstream is cool with taking either a BytesIO or a str/Path, which is okay for pandas but I imagine is less okay for other libraries/use cases.

I'm not sure how it would be best to insert the functionality, but something like OfficeFile.to_bytes() (I'm sure there are better ideas for function names available) would be great, then we can have consistent return types.

I also find it really odd that .decrypt() takes the object you want to inject the file into as an argument, rather than returning a BytesIO object? It makes following the code flow feel awkward to me, but that's an issue for another day!

CDWimmer, Jan 11 '24 13:01

Thanks for your suggestion! I think adding a utility function (as a context manager) similar to your decrypt_office_file() to the library is a good idea and will do it. In the case of your code example, I might replace decrypted_file = file with decrypted_file = BytesIO(f) so that the function always returns BytesIO.
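A minimal sketch of what such a context-manager helper might look like. The name open_plain is hypothetical and not part of msoffcrypto-tool; only OfficeFile, is_encrypted(), load_key() and decrypt() are taken from the code earlier in this thread. As far as I can tell, io.BytesIO expects bytes rather than a file handle, so the unencrypted branch here reads the file's contents instead of calling BytesIO(f) directly.

```python
import io
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, Optional, Union


@contextmanager
def open_plain(path: Union[str, Path], password: Optional[str] = None) -> Iterator[io.BytesIO]:
    """Hypothetical helper: yield the document as BytesIO, decrypting only if needed."""
    # Third-party dependency, imported on first use so the sketch loads without it.
    import msoffcrypto

    buffer = io.BytesIO()
    with open(path, "rb") as f:
        office_file = msoffcrypto.OfficeFile(f)
        if office_file.is_encrypted():
            office_file.load_key(password=password)
            office_file.decrypt(buffer)
        else:
            f.seek(0)  # OfficeFile may have moved the read position
            buffer.write(f.read())
    buffer.seek(0)  # hand the caller a buffer positioned at the start
    try:
        yield buffer
    finally:
        buffer.close()  # release the in-memory copy when the with-block ends
```

Usage would then be the same for both kinds of file, e.g. with open_plain("my_file.xlsx", password=...) as buf: followed by pd.read_excel(buf, ...).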

I also find it really odd that .decrypt() takes the object you want to inject the file into as an argument, rather than returning a BytesIO object? It makes following the code flow feel awkward to me, but that's an issue for another day!

I understand. The problem is that always creating a BytesIO object can consume unnecessary memory, especially if the document is large in file size.
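To illustrate the point, here is a rough sketch of passing a file on disk as the output target instead of a BytesIO. The helper name decrypt_to_path is hypothetical, and whether this actually lowers peak memory depends on how the library buffers internally, which I have not checked.

```python
import shutil
from pathlib import Path
from typing import Optional, Union


def decrypt_to_path(src: Union[str, Path], dst: Union[str, Path],
                    password: Optional[str] = None) -> Path:
    """Hypothetical helper: write a plain copy of src to dst and return dst."""
    # Third-party dependency, imported on first use so the sketch loads without it.
    import msoffcrypto

    with open(src, "rb") as fin, open(dst, "wb") as fout:
        office_file = msoffcrypto.OfficeFile(fin)
        if office_file.is_encrypted():
            office_file.load_key(password=password)
            office_file.decrypt(fout)  # output goes to the open file, not a BytesIO
        else:
            fin.seek(0)  # OfficeFile may have moved the read position
            shutil.copyfileobj(fin, fout)  # copy in chunks
    return Path(dst)
```

The caller gets a Path in both cases, so downstream code sees one consistent type without holding the whole document in a BytesIO of its own.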

nolze, Jan 11 '24 16:01

Thanks for your suggestion! I think adding a utility function (as a context manager) similar to your decrypt_office_file() to the library is a good idea and will do it. In the case of your code example, I might replace decrypted_file = file with decrypted_file = BytesIO(f) so that the function always returns BytesIO.

Brilliant, I just threw that together to get what I needed, I'm sure you'll make it better 😄

I understand. The problem is that always creating a BytesIO object can consume unnecessary memory, especially if the document is large in file size.

~ snip ~ I wrote a load of nonsense, I'm sure you know what you're doing! Plus, this way it can support alternatives to just BytesIO, which is nice. I guess you could do with open(...) as f and pass f as ofile!

CDWimmer, Jan 11 '24 17:01

There will probably be a lot of people like me who forgot to f.seek(0) 😂.

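import io

import msoffcrypto


# Wrapper name is illustrative; the original snippet was a bare fragment.
def load_office_file(file_path, password=None):
    decrypted_io = io.BytesIO()
    with open(file_path, 'rb') as f:
        file = msoffcrypto.OfficeFile(f)
        if file.is_encrypted():
            file.load_key(password=password)  # Use password
            file.decrypt(decrypted_io)
        else:
            f.seek(0)  # don't forget this
            decrypted_io.write(f.read())
    decrypted_io.seek(0)  # rewind so callers read from the start
    return decrypted_io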
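
For anyone wondering why the f.seek(0) matters: anything that inspects the file first (as OfficeFile presumably does to detect the format) leaves the read position somewhere past the start, so a following f.read() only returns what is left. A small stdlib-only illustration, where the 4-byte header peek is just a stand-in for whatever the library really reads:

```python
import io

payload = b"PK\x03\x04" + b"rest of the workbook bytes"
f = io.BytesIO(payload)

magic = f.read(4)     # stand-in for a format check that peeks at the header
truncated = f.read()  # continues from offset 4, so the header is missing

f.seek(0)             # rewind to the beginning
complete = f.read()   # now the whole payload comes back

print(truncated == payload)  # False
print(complete == payload)   # True
```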

954-Ivory, Jan 28 '24 21:01