TheHive4py icon indicating copy to clipboard operation
TheHive4py copied to clipboard

Add the possibility to download files and keep them in a variable

Open TytoCapensis opened this issue 2 months ago • 12 comments

Related to issue #322

As today, all functions related to download of files take a file path as an argument, and download themselves the file on the filesystem (by using the functions in the session.py module)

This can cause issues because:

  • Downloading a file can be cumbersome in some contexts, sometimes it is preferable to keep it in memory (i.e. in a variable)
  • We do not have control about how the file is opened and written exactly

This pull request:

  • Allows to pass False as a "filepath" to all file download functions, to tell that we do not want the file to be written on the filesystem, but instead returned by the function so we can store it as a variable
  • Assume False as a default for all file download functions (currently there is no default, the user has to provide a file path regardless).

TytoCapensis avatar Apr 26 '24 16:04 TytoCapensis

I just pushed a replacement commit that went with the method suggested in the issue #322 instead (using a io.BytesIO object passed in the argument). I still have to modify some files however, to change the type hinting definitions.

Note that I had to use a # type: ignore in session.py, because mypy got confused and was reporting that io.BytesIO was not a correct type for open() function (yet it is in another if block). I will change the code if someone comes with a better idea on how to manage that.

TytoCapensis avatar Apr 29 '24 11:04 TytoCapensis

Thank you for your changes @TytoCapensis , I'll find time to review them. But by just looking at your comment I might argue with the type ignore as I'm afraid mypy is right in this case, BytesIO is already an open buffer ready for write and read so you shouldn't call open on it, just write.

Can you double check?

Kamforka avatar Apr 29 '24 15:04 Kamforka

This should be right, because there are two cases: either the argument is a io.BytesIO buffer and it directly writes to it, or it is a path and an open() is performed on it:

if type(download_path) is BytesIO:
    for chunk in response.iter_content(chunk_size=4096):
        download_path.write(chunk)
    return download_path
else:
    with open(download_path, "wb") as download_fp:  # type: ignore
        for chunk in response.iter_content(chunk_size=4096):
            download_fp.write(chunk)

But I am guessing mypy does not realize that, and raises a warning because it sees open() used with download_path which can be io.BytesIO. (One could argue that a better design would be to use two different arguments, but this would require more refactoring)

TytoCapensis avatar Apr 29 '24 17:04 TytoCapensis

That's because you're using:

if type(download_path) is BytesIO

Which is not recognized by mypy properly, a more conventional method (which will also get rid of the type hinting warning) is the below:

if isinstance(download_path, BytesIO):
   ...
else:
   ...

Can you check with this?

Kamforka avatar May 01 '24 14:05 Kamforka

Also it might make sense to create a module e.g. in types/_common.py and collect the common type definitions there, so we can get rid of the repetition of this path/buffer type hint:

types/_common.py:

from io import BytesIO
from os import PathLike

PathOrBuffer = str | PathLike | BytesIO

And then just reuse this across the endpoints and the session modules. What do you think?

Kamforka avatar May 01 '24 15:05 Kamforka

I've reworked the integration test setup. Can you please rebase on the latest main branch to check how it works out?

Kamforka avatar May 02 '24 13:05 Kamforka

By the way, now that I'm taking a second look at this request, what are we really trying to solve here? Can you describe me a usecase where an in memory buffer is really necessary instead of a temporary file?

Kamforka avatar May 02 '24 13:05 Kamforka

Also it might make sense to create a module e.g. in types/_common.py and collect the common type definitions there, so we can get rid of the repetition of this path/buffer type hint

You mean creating a _common.py module and having from _common import * into each type module? So we would be able to do something like this:

from thehive4py.types.case import PathOrBuffer

We could also import thehive4py.types._common or its classes/variables in each relevant file (which looks cleaner, but also requires to be "duplicated" in relevant files). I can take care of both possibilities.

By the way, now that I'm taking a second look at this request, what are we really trying to solve here? Can you describe me a usecase where an in memory buffer is really necessary instead of a temporary file?

I would not say that using a memory buffer is really mandatory (except if we are in a specific environment, like a read-only filesystem): it would be more about optimization and adaptation. Most API libraries include similar possibilities.
One of my use cases involve collecting a file from TheHive and immediately uploading it elsewhere. In this case, it does not make a lot of sense to create a file to immediately remove it after: that is an useless usage of system resources and operations, not to mention the safeguards that have to come with it (ex. preventing two files with the same name to exist at the same moment in the download directory)

I will address the other points (rebase, type checking...) soon.

TytoCapensis avatar May 02 '24 15:05 TytoCapensis

Changes made!

I also had to create BufferOrNone for the return types of related functions. Feel free to suggest something that you would find better.

TytoCapensis avatar May 03 '24 09:05 TytoCapensis

One of my use cases involve collecting a file from TheHive and immediately uploading it elsewhere. In this case, it does not make a lot of sense to create a file to immediately remove it after: that is an useless usage of system resources and operations, not to mention the safeguards that have to come with it (ex. preventing two files with the same name to exist at the same moment in the download directory)

I disagree with the above, this is the perfect usecase for the tempfile module actually. A simple example would look like the below:

import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    tmp_path = os.path.join(tmpdir, "my-attachment.path")
    hive.alert.download_attachment(alert_id="~1234", download_path=tmp_path)

   # then one can upload based on a filepath
    other_service.upload_attachment(tmp_path)

    # or can open it for read and pass it as a buffer
    with open(tmp_path, "rb") as tmp_buffer:
        other_service.upload_attachment(tmp_buffer)

With this you don't have to care about cleanup or name conflicts at all. Might not be as elegant as reading and writing to a buffer at the same time, but on the other hand much easier to understand in my opinion.

Kamforka avatar May 03 '24 10:05 Kamforka

That is an interesting solution – I was not aware of the tempfile module. It could indeed solve most (not all, through) challenges related to temporary file handling.

However, I still think the library should offer a possibility for people to keep the file in a variable, instead of forcibly download it to the filesystem. Most API libraries I know in the Python ecosystem only give the raw bytes, and prefer to leave to the user the filesaving part (and some libs only return an URL that has to be requested). In the case of TheHive, the potentially large files we could encounter justifies the need for the filesaving to be handled by the download function (to not fill up memory too much), but not to the price of removing the other possibility in my opinion.

What I was originally thinking about was to be able to not supply a file path, to get the file as raw bytes:

f = thehive4py.case.download_attachment(case_id="myid", attachment_id="my_id")
print(type(f))
# <class 'bytes'>

And by specifying a download path, we get the file on the filesystem:

thehive4py.case.download_attachment(case_id="myid", attachment_id="my_id", attachment_path="/tmp/myfile.txt")

However, None already has a "meaning" for _process_response, so it was complicated to handle it without some refactoring; however, I can do it if you think it is a good idea.


That being said, using the tempfile module could solve most issues related to file handling, so I would say it is an acceptable solution as well. However, I think an annotation should be included in the documentation of thehive4py, to recommend people to use tempfile if they need the files downloaded to be temporary.

TytoCapensis avatar May 03 '24 14:05 TytoCapensis

This is what I mean when I say you don't need to return anything:

from io import BytesIO

from thehive4py.client import TheHiveApi

hive = TheHiveApi(
    url="http://localhost:9000", username="[email protected]", password="secret"
)


my_case = hive.case.create({"title": "...", "description": "..."})


attachment_path = "my-attachment.txt"
attachment_content = b"abcdef"
with open("my-attachment.txt", "wb") as attachment_fp:
    attachment_fp.write(attachment_content)

case_attachments = hive.case.add_attachment(
    case_id=my_case["_id"], attachment_paths=[attachment_path]
)


# this is the buffer that you pass to the call and then reuse it
download_buffer = BytesIO()

hive.case.download_attachment(
    case_id=my_case["_id"],
    attachment_id=case_attachments[0]["_id"],
    attachment_path=download_buffer,
)
hive.case.delete(case_id=my_case["_id"])


assert download_buffer.getvalue() == attachment_content

As you can see, you just create the case, upload the attachment, and then initialize your buffer and pass it to the call. Then after the lib wrote the content to your buffer you are free to use it for whatever you want.

I understand your point, but I'm still not convinced that we need to complicate the lib internals with something that would be used in 20% of the usecases, while 80% of the people will eventually save the downloads to disk or use temp files.

And you know the saying in software: no is temporary but yes is forever :)

Kamforka avatar May 03 '24 15:05 Kamforka