scrapy-pagestorage
ValueTooLarge: Value exceeds max encoded size of 1048576 bytes
The issue arises when there are null characters (\x00) in response.body. In len(response.body) each null character counts as length 1, but when the item is encoded with json.dumps(item, default=six.text_type), every null character is escaped to \u0000 and takes up 6 characters.
So, for example, an empty page containing 300 KB of null characters produces an encoded object of roughly 1.8 MB, which exceeds the limit and raises the error:
import json
import six

a = u'\x00'
len(a)  # 1
b = json.dumps({"value": a}, default=six.text_type)
len(b)  # 19 -- the single null character is escaped to the six characters \u0000
a = u'\x00\x00'
len(a)  # 2
b = json.dumps({"value": a}, default=six.text_type)
len(b)  # 25
# ...etc: each null character is escaped to \u0000, so every additional null
# adds 6 characters to the encoded object (5 more than its raw length of 1)
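At larger scale the same arithmetic reproduces the error. A minimal sketch (the 1048576-byte limit is taken from the error message above; the lengths assume json.dumps with its default ensure_ascii=True):

import json

body = u'\x00' * 300 * 1024            # "empty" page: 300 KB of null characters
encoded = json.dumps({"value": body})
len(body)     # 307200
len(encoded)  # 1843213 -> well over the 1048576-byte (1 MiB) max encoded size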
[However, in Modified UTF-8 the null character is encoded as two bytes: 0xC0, 0x80.](https://en.wikipedia.org/wiki/Null_character)
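For comparison, standard UTF-8 encodes the null character as a single zero byte; Python has no built-in codec for Modified UTF-8, so its two-byte form is noted in a comment:

u'\x00'.encode('utf-8')  # b'\x00' -- standard UTF-8: a single zero byte
# Modified UTF-8 (used by Java/JNI) would instead produce b'\xc0\x80'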
@DharmeshPandav I am facing the same issue in one of my live spiders. As per the discussion above, will PAGE_STORAGE_TRIM_HTML = True in settings.py do the trick?
From the fix above by @ruairif, it should, yes @shivanshuzyte.
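For reference, a minimal settings.py sketch (assuming a plain Python boolean, since Scrapy settings files are ordinary Python modules; PAGE_STORAGE_TRIM_HTML is the setting named in the question above):

# settings.py
PAGE_STORAGE_TRIM_HTML = True  # trim stored pages, per the discussion in this thread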