crawl4ai Cannot get response headers

trafficstars

When I do this

result = await crawler.arun(url=url, bypass_cache=False)

result.response_headers is always {}

It appears here: 62 async def arun( .... 131: crawl_result.response_headers = async_response.response_headers if async_response else {}

It appears that async_response is always None. Is there a way to get the headers? Am I missing a flag? I have never been able to get the headers and this is baffling to me.

Oct 31 '24 14:10 ejkitchen

@ejkitchen Could you please share the URL you're trying? It works on my side, as seen in this image, and shouldn't require any flag by default; maybe it's dirt. Sharing the URL will help me check it better.

Nov 03 '24 05:11 unclecode

Thanks for the response! I decided to go back to basics as I could not get this to work. We had about 400k URLs and I sampled about 15 at random (it's a public website) and nothing, no errors in logs etc. Everything else was coming through but not the headers. Using the basic python libs, they come through without issue. I no longer have the code since we moved away from the lib but if you give me your snippet here, I can try again and if I get an issue, will let you know ASAP and what links.

Nov 12 '24 20:11 ejkitchen

Sure, for examples:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(headless=True) as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/apple",
            bypass_cache=True,
        )
        
        print(result.response_headers)

if __name__ == "__main__":
    asyncio.run(main())

This generate the below log for me:

[LOG] 🚀 Crawling done for https://en.wikipedia.org/wiki/apple, success: True, time taken: 0.54 seconds
[LOG] 🚀 Content extracted for https://en.wikipedia.org/wiki/apple, success: True, time taken: 4.12 seconds
[LOG] 🔥 Extracting semantic blocks for https://en.wikipedia.org/wiki/apple, Strategy: AsyncWebCrawler
[LOG] 🚀 Extraction done for https://en.wikipedia.org/wiki/apple, time taken: 4.17 seconds.
{'accept-ch': '', 'accept-ranges': 'bytes', 'age': '45917', 'cache-control': 'private, s-maxage=0, max-age=0, must-revalidate, no-transform', 'content-encoding': 'gzip', 'content-language': 'en', 'content-length': '104493', 'content-type': 'text/html; charset=UTF-8', 'date': 'Tue, 12 Nov 2024 19:47:35 GMT', 'last-modified': 'Tue, 12 Nov 2024 18:03:41 GMT', 'nel': '{ "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}', 'report-to': '{ "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'server': 'mw-web.codfw.main-5847db6f8b-nhdnp', 'server-timing': 'cache;desc="hit-front", host;desc="cp5022"', 'strict-transport-security': 'max-age=106384710; includeSubDomains; preload', 'vary': 'Accept-Encoding,Cookie,Authorization', 'x-cache': 'cp5022 miss, cp5022 hit/23', 'x-cache-status': 'hit-front', 'x-client-ip': '2001:f40:910:1983:d93d:ade6:fe1:f9ab', 'x-content-type-options': 'nosniff'}

Nov 13 '24 08:11 unclecode

@ejkitchen I've figured out the issue. You were right - it occurs when bypass_cache isn't set to true. I noticed that in this case, the code reads the cached version from the database. However, the cached version doesn't save the response header, which is why it remains empty. Remember that if you don't, crawling from local cache occurs by default the second time. Setting bypass cache to true ensures a refresh crawl. I'm not sure if this is necessary in your case. Caching is useful when you know the content is the same and want to process it quickly. If not, you can bypass the cache by setting it to true. Anyway I'll update this feature to save the data in our database and return it.

Nov 13 '24 09:11 unclecode

crawl4ai crawl4ai copied to clipboard

Cannot get response headers

crawl4ai
crawl4ai copied to clipboard