caracara Support streamed download to a path

There is currently no way in either falconpy or caracara to stream a file to disk. As a result, a 4GB response ends up entirely loaded in memory at some point.

We're using ugly hacks to pass stream=True to a raw requests.request and call it a day, so that we can stream a large file to disk chunk by chunk.

from requests import request
from tqdm import tqdm

def rtr_session_download_to_path(self, session_id, sha256, destination, known_size=None):
    '''
    Downloads an extracted file straight into a file (7z -pinfected) using
    chunks, so that we don't have a 4GB single http request in memory at
    some point. Or several in parallel.
    '''
    # First, prepare a HTTP request by stealing the self.falcon config for URL & token
    url = f'{self.falcon.base_url}/real-time-response/entities/extracted-file-contents/v1'
    params = {
        'session_id': session_id,
        'sha256': sha256,
    }
    self.logger.debug(f'Getting file sha256={sha256}, session_id={session_id} into {destination}')
    total_written_bytes = 0
    with request(
        'get', url,
        # Here we assume the token is fresh enough, which is usually the case since we just listed the file properties.
        headers=self.falcon.headers(),
        verify=self.falcon.ssl_verify,
        stream=True,
        params=params,
    ) as r:
        # Fail early on an HTTP error instead of writing the error body to disk
        r.raise_for_status()
        if not destination.parent.exists():
            self.logger.info(f'Creating folder {destination.parent}')
            destination.parent.mkdir(parents = True, exist_ok = True)
        with destination.open('wb') as f, tqdm(
            desc=str(destination),
            total=known_size,
            unit='iB',
            unit_scale=True,
            unit_divisor=1024,
        ) as bar:
            self.logger.debug('Actual download iteration start')
            for chunk in r.iter_content(chunk_size=10*1024):
                written_bytes = f.write(chunk)
                bar.update(written_bytes)
                total_written_bytes += written_bytes

    return destination, total_written_bytes

Could this be done natively by caracara? I'm no asyncio expert, but I think there's some HTTP + file magic to be done here.

Thanks!

May 03 '23 08:05 59e5aaf4

This is a really neat trick, and actually close to something I implemented internally for this type of use case. I agree that we should do this.

In Caracara, we just call the endpoint directly in FalconPy -- https://github.com/CrowdStrike/caracara/blob/3cf2e49440b6ea2d05dc2a9a65b3e8052de144b8/caracara/modules/rtr/get_file.py#LL72C66-L76C66

@jshcodes is this something that we can build into FalconPy? Otherwise, I could short-circuit FalconPy here and perform this request manually, but obviously we would like API operations to be handled natively by FalconPy where possible to take advantage of the abstraction layer there. It might be that you'll need to return a request object back directly, or at least enable stream=True so that we do not have to download the whole blob at once.
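
A minimal sketch of what the chunked write could look like once a streamed response is available, regardless of which library issues the request. `write_chunks` is a hypothetical helper name, not an existing FalconPy or Caracara function. It accepts any iterable of bytes, for example the result of `iter_content` on a `requests` response opened with stream=True.

```python
from pathlib import Path
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], destination: Path) -> int:
    """Write an iterable of byte chunks to destination; return bytes written."""
    # Create the parent folder if it does not exist yet
    destination.parent.mkdir(parents=True, exist_ok=True)
    total = 0
    with destination.open('wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                total += f.write(chunk)
    return total
```

Only one chunk is held in memory at a time, so the peak memory use is bounded by the chunk size rather than by the size of the file.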

May 03 '23 12:05 ChristopherHammond13

Also, ahem, https://eu-1.ideas.crowdstrike.com/ideas/IDEA-I-10248: there's no support for the Range HTTP header, which prevents partial downloads (on either the normal API or the WebUI API). Another relevant missing header is Content-Length. The closest thing we can get is the size field of an RTR file object, but that describes the size of the file inside the 7z archive, not the 7z archive itself.

If you know anyone that might be able to take a look at this, feel free to share the concern :D I am really surprised we only get 1 try to download 4GB files, and then have to start over from offset=0 if the VPN gods decide to banish a socket into the void.
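
For illustration only: if the endpoint honoured Range (per the idea linked above, it currently does not), a resume could ask for just the bytes after what is already on disk. `resume_headers` is a hypothetical helper name used for this sketch.

```python
from pathlib import Path


def resume_headers(partial: Path) -> dict:
    """Build a Range header that continues after the bytes already on disk."""
    offset = partial.stat().st_size if partial.exists() else 0
    # An empty dict means "start from the beginning"
    return {'Range': f'bytes={offset}-'} if offset else {}
```

A server that supports ranges answers with 206 Partial Content, in which case the file would be opened in 'ab' mode and appended to. A plain 200 means the header was ignored and the download has to be rewritten from offset 0.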

May 03 '23 12:05 59e5aaf4

@59e5aaf4 This is great feedback. I have posed the question over to the RTR team to see what they say. I suspect it could require a few teams to get involved, but I'll keep tabs on it internally. For now, I think getting the chunking support provided first party would be an important first step. I am speaking with @jshcodes on the side to figure out which code should own this functionality, and will update this ticket once we figure out the best place for this to live.

May 03 '23 14:05 ChristopherHammond13

Providing support for the stream keyword in requests is a neat idea and should not be difficult to implement.

Content-Length will require a bit more effort, but it makes sense and we should support it if we can.
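
On the client side, consuming Content-Length once it exists could look roughly like this. It is a sketch only, and `expected_size` is a hypothetical helper name; the fallback to None matches how the snippet at the top of this issue already passes known_size to tqdm.

```python
from typing import Mapping, Optional


def expected_size(headers: Mapping[str, str]) -> Optional[int]:
    """Return Content-Length as an int, or None when absent or malformed."""
    value = headers.get('Content-Length')
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None
```

The headers mapping on a `requests` response is case-insensitive, so the lookup works however the server spells the header. With a plain dict the key has to match exactly.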

FalconPy v1.2.16 development will start soon, and this enhancement has been added to the punch list. 👊

May 03 '23 15:05 jshcodes

Awesome, thanks @jshcodes! Once 1.2.16 is launched, we can figure out the best way to expose this to Caracara users in a way that makes the most sense. Ultimately this SDK is about doing as much for users as possible, so I'm open to taking something more raw from FalconPy, or perhaps we can get FalconPy to perform the download and move the chunking there?

May 03 '23 15:05 ChristopherHammond13
