aiohttp stub records the full url in the recorded data regardless of parameter filtering

Open vEpiphyte opened this issue 5 years ago • 2 comments

When the filter_query_parameters argument is used, the aiohttp stub still records the full URL given by the response object for later reconstruction of the yarl response. This is unexpected and can lead to secrets being leaked. Based on a review of the code, I think the Tornado stub may have the same issue (but I haven't verified that). As a result, we have to manually scrub filtered values from cassettes to keep secrets from being committed to repositories.

Here is an example test case and the recorded cassettes. The query parameter sometoken is intended to be dropped from storage (please ignore the fact that the test target, httpbin, includes the query parameters in the response body).

import os
import json
import asyncio
import unittest

import vcr
import aiohttp
import requests

URL = 'https://httpbin.org/get?arg1=val1&sometoken=sekrit'
ASSETDIR = './'

FILTER_QUERY_PAREMETERS = ['sometoken']

class FilterQueryParamTest(unittest.TestCase):
    # Helper code, replicated here
    def _getVcrArgs(self):
        # Copy so the update() in getVcr does not mutate a shared vcr_kwargs dict.
        kwargs = dict(getattr(self, 'vcr_kwargs', {}))
        return kwargs

    def getVcr(self, **kwargs):
        # simplification for the vcr-unittest library.
        fn = '{0}.{1}.yaml'.format(self.__class__.__name__,
                                   self._testMethodName)
        fp = os.path.join(ASSETDIR, fn)
        _kwargs = self._getVcrArgs()
        _kwargs.update(kwargs)
        myvcr = vcr.VCR(**_kwargs)
        cm = myvcr.use_cassette(fp)
        return cm

    def test_requests(self):
        with self.getVcr(filter_query_parameters=FILTER_QUERY_PAREMETERS):
            resp = requests.get(URL)
            d = resp.json()
        self.assertIsInstance(d, dict)

    def test_aiohttp(self):
        async def coro():
            with self.getVcr(filter_query_parameters=FILTER_QUERY_PAREMETERS):
                async with aiohttp.ClientSession() as session:
                    async with session.get(URL) as response:
                        text = await response.text()
            return text

        result = asyncio.run(coro())
        d = json.loads(result)
        self.assertIsInstance(d, dict)

This is the cassette recorded through requests; the filtered value is not present in any vcrpy construct:

interactions:
- request:
    body: null
    headers:
      Accept:
      - '*/*'
      Accept-Encoding:
      - gzip, deflate
      Connection:
      - keep-alive
      User-Agent:
      - python-requests/2.23.0
    method: GET
    uri: https://httpbin.org/get?arg1=val1
  response:
    body:
      string: "{\n  \"args\": {\n    \"arg1\": \"val1\", \n    \"sometoken\": \"sekrit\"\n
        \ }, \n  \"headers\": {\n    \"Accept\": \"*/*\", \n    \"Accept-Encoding\":
        \"gzip, deflate\", \n    \"Host\": \"httpbin.org\", \n    \"User-Agent\":
        \"python-requests/2.23.0\", \n    \"X-Amzn-Trace-Id\": \"Root=1-5e849e8a-908aad86267b92fe372fa1da\"\n
        \ }, \n  \"origin\": \"8.8.8.8\", \n  \"url\": \"https://httpbin.org/get?arg1=val1&sometoken=sekrit\"\n}\n"
    headers:
      Access-Control-Allow-Credentials:
      - 'true'
      Access-Control-Allow-Origin:
      - '*'
      Connection:
      - keep-alive
      Content-Length:
      - '385'
      Content-Type:
      - application/json
      Date:
      - Wed, 01 Apr 2020 14:00:42 GMT
      Server:
      - gunicorn/19.9.0
    status:
      code: 200
      message: OK
version: 1

This is the cassette recorded through aiohttp; the unfiltered token is included in the url key of the response:

interactions:
- request:
    body: null
    headers: {}
    method: GET
    uri: https://httpbin.org/get?arg1=val1
  response:
    body:
      string: "{\n  \"args\": {\n    \"arg1\": \"val1\", \n    \"sometoken\": \"sekrit\"\n
        \ }, \n  \"headers\": {\n    \"Accept\": \"*/*\", \n    \"Accept-Encoding\":
        \"gzip, deflate\", \n    \"Host\": \"httpbin.org\", \n    \"User-Agent\":
        \"Python/3.7 aiohttp/3.6.0\", \n    \"X-Amzn-Trace-Id\": \"Root=1-5e849e89-2fdc0c40f507cd205231bb40\"\n
        \ }, \n  \"origin\": \"8.8.8.8\", \n  \"url\": \"https://httpbin.org/get?arg1=val1&sometoken=sekrit\"\n}\n"
    headers:
      Access-Control-Allow-Credentials: 'true'
      Access-Control-Allow-Origin: '*'
      Connection: keep-alive
      Content-Length: '387'
      Content-Type: application/json
      Date: Wed, 01 Apr 2020 14:00:41 GMT
      Server: gunicorn/19.9.0
    status:
      code: 200
      message: OK
    url: https://httpbin.org/get?arg1=val1&sometoken=sekrit
version: 1

Edit:

This behavior was tested against the current master of the vcrpy project, and has also been observed in the latest 4.x release.

vEpiphyte avatar Apr 01 '20 14:04 vEpiphyte

Just wanted to note here that this can be remedied in the meantime with something like this:

def filter_response(response):
    response["url"] = ''  # hide the URL
    return response

Then make your own VCR instance and pass filter_response to the before_record_response kwarg. This doesn't seem to impact the later response playback in any negative way.
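A minimal sketch of that wiring, in case it helps. The make_vcr helper name is hypothetical, and the check for an existing "url" key is an added assumption so that cassettes recorded through requests (which have no such key) are left untouched:

```python
def filter_response(response):
    # Blank the recorded URL, but only when the stub stored one.
    # (Assumption: cassettes without a "url" key should stay unchanged.)
    if "url" in response:
        response["url"] = ""  # hide the URL
    return response


def make_vcr(**kwargs):
    # Hypothetical helper. vcr is imported lazily so the hook above can be
    # exercised on plain dicts without vcrpy installed.
    import vcr

    return vcr.VCR(before_record_response=filter_response, **kwargs)
```

With this, something like make_vcr(filter_query_parameters=['sometoken']).use_cassette(path) would filter the request URI as before and blank the response URL on top of that.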

DevilXD avatar Apr 29 '20 09:04 DevilXD

Here's a more elaborate implementation I've been using that actually filters the query parameters in response["url"]: https://github.com/scop/pytekukko/blob/c43533c8c8ff46f5fc1114e0d19a9cde58c89ae9/tests/test_pytekukko.py#L36-L50 (It is nowhere near perfect: it expects replacements, does not do removal, and does not preserve more than one instance of a param, but it works for my use case.)

I haven't tried just emptying the whole URL myself; nice if that works. I'm leaving this one here for the record in case someone finds a problem with the emptying approach.
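For illustration only, here is a rough standard-library sketch of the removal variant. It is not the linked implementation: the make_url_filter name is hypothetical, and it assumes the recorded response["url"] is a plain string, as it appears in the cassette above. It drops the named parameters and keeps repeated occurrences of the remaining ones:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def make_url_filter(filtered_params):
    # Build a before_record_response hook that removes the given query
    # parameters from the recorded response URL (hypothetical helper).
    names = set(filtered_params)

    def before_record_response(response):
        url = response.get("url")
        if not url:
            # Nothing recorded (e.g. a requests cassette); leave it alone.
            return response
        parts = urlsplit(str(url))
        # Keep every (key, value) pair whose key is not filtered,
        # including blank values and repeated keys.
        kept = [
            (key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in names
        ]
        response["url"] = urlunsplit(parts._replace(query=urlencode(kept)))
        return response

    return before_record_response
```

The result of make_url_filter(['sometoken']) could then be passed as the before_record_response kwarg when creating the VCR instance. Note that urlencode re-encodes the remaining values, so the stored query string may differ cosmetically from the original.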

scop avatar Sep 01 '22 05:09 scop