
ValueTooLarge: Value exceeds max encoded size of 1048576 bytes

Open DharmeshPandav opened this issue 7 years ago • 2 comments

The issue arises when null characters (`\x00`) are present in `response.body`.

In `len(response.body)`, each null character counts as length 1, but when the item is encoded with `json.dumps(item, default=six.text_type)`, every null character becomes the six-character escape sequence `\u0000`.

So, for example, an empty page containing 300 KB of null characters produces an encoded object of about 1.8 MB, and the error above is raised.
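To make the blow-up concrete, here is a small illustration (the 1048576-byte limit is the one quoted in the error message above):

```python
import json

# 300,000 NUL characters, i.e. roughly 300 KB of raw body
body = u'\x00' * 300000
encoded = json.dumps({"value": body})

len(body)     # 300000
len(encoded)  # 1800013 -- each NUL became \u0000, well over the 1048576-byte limit
```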

```python
import json
import six

a = u'\x00'
len(a)  # 1
b = json.dumps({"value": a}, default=six.text_type)
len(b)  # 19

a = u'\x00\x00'
len(a)  # 2
b = json.dumps({"value": a}, default=six.text_type)
len(b)  # 25

# ...etc: every additional null character adds 6 characters
# (the escape sequence \u0000) to the encoded object
```
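A possible mitigation (not part of scrapy-pagestorage, just a sketch) is to strip the NUL bytes from the body before the item is serialized, since they carry no information in an HTML page:

```python
# Hypothetical mitigation: drop NUL bytes before serialization.
# 'body' here simulates a response body padded with null characters.
body = b'<html></html>' + b'\x00' * 300000
cleaned = body.replace(b'\x00', b'')  # strip every NUL byte

len(cleaned)  # 13 -- only the real markup remains
```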


However, in [Modified UTF-8 the null character is encoded as two bytes: `0xC0, 0x80`](https://en.wikipedia.org/wiki/Null_character).

DharmeshPandav avatar May 19 '17 11:05 DharmeshPandav