
ValueTooLarge: Value exceeds max encoded size of 1048576 bytes

Open DharmeshPandav opened this issue 7 years ago • 2 comments

The issue arises when there are null characters (\x00) present in response.body.

In len(response.body) every null character counts as length 1, but when the item is encoded with json.dumps(item, default=six.text_type), every null character becomes the 6-character escape sequence \u0000.

So, for example, an otherwise empty page containing 300 KB of null characters produces an encoded object of roughly 1.8 MB, which exceeds the 1048576-byte (1 MB) limit and raises this error.

import json
import six

a = u'\x00'
len(a)  # 1
b = json.dumps({"value": a}, default=six.text_type)
len(b)  # 19

a = u'\x00\x00'
len(a)  # 2
b = json.dumps({"value": a}, default=six.text_type)
len(b)  # 25

# ...etc: each null character (1 byte raw) is encoded as the 6-character
# escape \u0000, i.e. 5 extra characters per null in the encoded object


[However, in Modified UTF-8 the null character is encoded as two bytes: 0xC0, 0x80.](https://en.wikipedia.org/wiki/Null_character)
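As a workaround, the inflation can be avoided by dropping the null characters from the body before serialization. This is only a sketch, not part of scrapy-pagestorage itself, and `strip_nulls` is a hypothetical helper:

```python
import json


def strip_nulls(text):
    # Hypothetical helper: remove null characters before JSON encoding,
    # since each u'\x00' expands to the 6-character escape '\u0000'.
    return text.replace(u'\x00', u'')


body = u'<html></html>' + u'\x00' * 300          # simulated padded response body
encoded_raw = json.dumps({"value": body})        # nulls escaped, ~6x larger
encoded_clean = json.dumps({"value": strip_nulls(body)})
print(len(encoded_raw), len(encoded_clean))
```

The cleaned payload stays close to the visible HTML size, while the raw one grows by 6 characters per null and can easily cross the 1 MB limit.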

DharmeshPandav avatar May 19 '17 11:05 DharmeshPandav

@DharmeshPandav I am facing the same issue in one of my live spiders. As per the discussion above, will setting PAGE_STORAGE_TRIM_HTML = True in settings.py do the trick?

shivanshuzyte avatar Aug 15 '22 12:08 shivanshuzyte

Based on the fix above by @ruairif, it should, yes @shivanshuzyte.
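For reference, a minimal settings.py sketch of the setting discussed above (assuming, per this thread, that the trimming step also removes the null-character runs that inflate the encoded payload):

```python
# settings.py -- enable HTML trimming in scrapy-pagestorage
# (assumption from this thread: trimming strips the padding that
# otherwise pushes the encoded page past the 1048576-byte limit)
PAGE_STORAGE_TRIM_HTML = True
```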

DharmeshPandav avatar Aug 16 '22 15:08 DharmeshPandav