
JsonItemExporter puts lone comma in the output if encoder fails

Open tlinhart opened this issue 7 years ago • 4 comments

If JsonItemExporter is unable to encode the item, it still writes a delimiter (comma) to the output file. Here is a sample spider:

# -*- coding: utf-8 -*-
import datetime
import scrapy

class DummySpider(scrapy.Spider):
    name = 'dummy'
    start_urls = ['http://example.org/']

    def parse(self, response):
        yield {'date': datetime.date(2018, 1, 1)}
        yield {'date': datetime.date(1234, 1, 1)}
        yield {'date': datetime.date(2019, 1, 1)}

Encoding the second item fails:

2018-01-25 09:05:57 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.item_scraped of <scrapy.extensions.feedexport.FeedExporter object at 0x7fcbfbd81250>>
Traceback (most recent call last):
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 224, in item_scraped
    slot.exporter.export_item(item)
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/scrapy/exporters.py", line 130, in export_item
    data = self.encoder.encode(itemdict)
  File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/scrapy/utils/serialize.py", line 22, in default
    return o.strftime(self.DATE_FORMAT)
ValueError: year=1234 is before 1900; the datetime strftime() methods require year >= 1900

The output looks like this:

[
{"date": "2018-01-01"},
,
{"date": "2019-01-01"}
]

This is not a valid JSON file: both json.load() and jq fail to parse it.
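A quick way to confirm this is to feed the output above to the standard-library parser. The helper name `is_valid_json` is made up for this illustration; only `json.loads` is a real API here.

```python
import json

# The exporter output from the report, copied verbatim.
broken_output = """[
{"date": "2018-01-01"},
,
{"date": "2019-01-01"}
]"""


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise (hypothetical helper)."""
    try:
        json.loads(text)
    except ValueError:
        # json.JSONDecodeError subclasses ValueError, so this also works on Python 2.
        return False
    return True


print(is_valid_json(broken_output))
```

The lone comma on the third line is what the parser rejects; removing that line makes the document parse.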

I think the problem is in the export_item method of the JsonItemExporter class, which writes the comma before encoding the item. The correct approach would be to encode the item first (together with any other operation that can fail) and only then write the delimiter and the encoded data, so that a failed item leaves nothing in the output.
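A minimal sketch of that ordering, assuming a simplified writer. `EncodeFirstJsonWriter` and `export_all` are hypothetical names, not Scrapy's classes, and this is not the change made in the linked pull request. It uses plain `json.dumps` and an unencodable `object()` to stand in for the failing date.

```python
import io
import json


class EncodeFirstJsonWriter:
    """Hypothetical stand-in for a JSON list exporter (not Scrapy's class)."""

    def __init__(self, stream):
        self.stream = stream
        self.first_item = True

    def start_exporting(self):
        self.stream.write("[\n")

    def export_item(self, item):
        # Encode before touching the stream: if this raises, nothing is
        # written and first_item keeps its current value.
        data = json.dumps(item)
        if self.first_item:
            self.first_item = False
        else:
            self.stream.write(",\n")
        self.stream.write(data)

    def finish_exporting(self):
        self.stream.write("\n]")


def export_all(items):
    """Export items, skipping those that cannot be encoded."""
    stream = io.StringIO()
    writer = EncodeFirstJsonWriter(stream)
    writer.start_exporting()
    for item in items:
        try:
            writer.export_item(item)
        except TypeError:
            # Assumption: a failed item is dropped and exporting continues,
            # as in the report, where the error is logged and the crawl goes on.
            pass
    writer.finish_exporting()
    return stream.getvalue()


# object() is not JSON serializable, so the second item fails to encode.
result = export_all([{"date": "2018-01-01"}, {"date": object()}, {"date": "2019-01-01"}])
print(result)
```

Because the delimiter is written only after encoding succeeds, the skipped item leaves no stray comma, and the result still parses with `json.loads`.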

tlinhart avatar Jan 25 '18 08:01 tlinhart

Thanks @tlinhart - will take a look at making a fix to that right away. :)

ghost avatar Feb 06 '18 14:02 ghost

Hi, I am looking for a beginner issue to start working with. Any chance I can pick this up and start working on it?

gekco avatar Mar 02 '18 15:03 gekco

hey @gekco! this is almost fixed in https://github.com/scrapy/scrapy/pull/3111, the remaining issue is extra tests which are executed unintentionally.

kmike avatar Mar 10 '18 18:03 kmike

Marking as a good first issue as finishing https://github.com/scrapy/scrapy/pull/3111 may be easy based on the feedback provided there.

Gallaecio avatar Nov 07 '22 08:11 Gallaecio