crawlee-python icon indicating copy to clipboard operation
crawlee-python copied to clipboard

Unable to execute POST request with JSON payload

Open Mantisus opened this issue 1 year ago • 2 comments

Example

async def main() -> None:
    crawler = HttpCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        response = context.http_response.read().decode('utf-8')
        context.log.info(f'Response: {response}')  # To see the response in the logs.

    # Prepare a POST request to the form endpoint.
    request = Request.from_url(
        url='https://httpbin.org/post',
        method='POST',
        headers = {"content-type": "application/json"},
        data={
            'custname': 'John Doe',
            'custtel': '1234567890',
            'custemail': '[email protected]',
            'size': 'large',
            'topping': ['bacon', 'cheese', 'mushroom'],
            'delivery': '13:00',
            'comments': 'Please ring the doorbell upon arrival.',
        },
    )
    await crawler.run([request])

Current response format

{
  "args": {}, 
  "data": "custname=John+Doe&custtel=1234567890&custemail=johndoe%40example.com&size=large&topping=bacon&topping=cheese&topping=mushroom&delivery=13%3A00&comments=Please+ring+the+doorbell+upon+arrival.", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en-US,en;q=0.9", 
    "Content-Length": "190", 
    "Content-Type": "application/json", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36", 
    "X-Amzn-Trace-Id": "Root=1-66fc90cf-11644d4f5483e1096211b721"
  }, 
  "json": null, 
  "origin": "91.240.96.149", 
  "url": "https://httpbin.org/post"
}

Expected response format

{
  "args": {}, 
  "data": "{\"custname\": \"John Doe\", \"custtel\": \"1234567890\", \"custemail\": \"[email protected]\", \"size\": \"large\", \"topping\": [\"bacon\", \"cheese\", \"mushroom\"], \"delivery\": \"13:00\", \"comments\": \"Please ring the doorbell upon arrival.\"}", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Content-Length": "221", 
    "Content-Type": "application/json", 
    "Host": "httpbin.org", 
    "User-Agent": "python-httpx/0.27.2", 
    "X-Amzn-Trace-Id": "Root=1-66fc91c2-6db9989347fef25b150615e2"
  }, 
  "json": {
    "comments": "Please ring the doorbell upon arrival.", 
    "custemail": "[email protected]", 
    "custname": "John Doe", 
    "custtel": "1234567890", 
    "delivery": "13:00", 
    "size": "large", 
    "topping": [
      "bacon", 
      "cheese", 
      "mushroom"
    ]
  }, 
  "origin": "91.240.96.149", 
  "url": "https://httpbin.org/post"
}

Both HTTPX and curl_impersonate allow creating a “POST” request with JSON payload, in two ways

data={
    'custname': 'John Doe',
    'custtel': '1234567890',
    'custemail': '[email protected]',
    'size': 'large',
    'topping': ['bacon', 'cheese', 'mushroom'],
    'delivery': '13:00',
    'comments': 'Please ring the doorbell upon arrival.',
    }

response = httpx.post(url, json=data)

response = httpx.post(url, data=json.dumps(data))

But we can't reproduce this behavior in Crawlee, because the json parameter is not passed when the request is created and the data parameter cannot be a string

Mantisus avatar Oct 02 '24 00:10 Mantisus

Hi @Mantisus,

I'm interested in working on this issue regarding the JSON payload handling in POST requests. I have a few questions to ensure I understand the requirements correctly:

  1. It seems the current behavior differs from HTTPX in how it handles the json parameter and string data. Are we aiming to make Crawlee's behavior match HTTPX exactly, or are there specific adjustments needed for the Crawlee context?

  2. For the implementation, should we:

    • Add support for both methods shown in the example (json=data and data=json.dumps(data))?
    • Focus on one specific approach?
    • Or is there a different preferred solution?
  3. Are there any specific test cases you'd like to see implemented alongside the fix?

I have some experience with Python HTTP clients and would be happy to contribute to this issue. Let me know if you need any clarification or have additional requirements before I start working on it.


References:

belloibrahv avatar Oct 02 '24 03:10 belloibrahv

Hi, thanks for reporting this. We already know about this, it is related to the #542.

vdusek avatar Oct 02 '24 09:10 vdusek