scrapy
JsonItemExporter puts lone comma in the output if encoder fails
If JsonItemExporter is unable to encode the item, it still writes a delimiter (comma) to the output file. Here is a sample spider:
# -*- coding: utf-8 -*-
import datetime
import scrapy
class DummySpider(scrapy.Spider):
    name = 'dummy'
    start_urls = ['http://example.org/']
    def parse(self, response):
        yield {'date': datetime.date(2018, 1, 1)}
        yield {'date': datetime.date(1234, 1, 1)}
        yield {'date': datetime.date(2019, 1, 1)}
Encoding the second item fails:
2018-01-25 09:05:57 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.item_scraped of <scrapy.extensions.feedexport.FeedExporter object at 0x7fcbfbd81250>>
Traceback (most recent call last):
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 224, in item_scraped
    slot.exporter.export_item(item)
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/scrapy/exporters.py", line 130, in export_item
    data = self.encoder.encode(itemdict)
  File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
  File "/home/pasmen/SW/Python/data-sci-env/local/lib/python2.7/site-packages/scrapy/utils/serialize.py", line 22, in default
    return o.strftime(self.DATE_FORMAT)
ValueError: year=1234 is before 1900; the datetime strftime() methods require year >= 1900
The output looks like this:
[
{"date": "2018-01-01"},
,
{"date": "2019-01-01"}
]
This is not a valid JSON file: both json.load() and jq fail to parse it.
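To make the failure easy to reproduce, here is a minimal check using only the standard library. The string `broken_feed` copies the output shown above, and `is_valid_json` is a hypothetical helper name used for illustration only.

```python
import json

# The feed contents reported above, reproduced as a string.
broken_feed = '[\n{"date": "2018-01-01"},\n,\n{"date": "2019-01-01"}\n]'

# What the feed would look like without the stray delimiter.
expected_feed = '[\n{"date": "2018-01-01"},\n{"date": "2019-01-01"}\n]'


def is_valid_json(text):
    """Return True if `text` parses as JSON (hypothetical helper)."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(is_valid_json(broken_feed))
print(is_valid_json(expected_feed))
```

The lone comma leaves the parser expecting a value where none exists, so the whole file is rejected, not just the failed item.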
I think the problem is in the export_item method of the JsonItemExporter class, which writes the comma before encoding the item. The correct approach would be to encode the item first (together with any other operations that can fail) and only then write the delimiter and the encoded data, so that a failed item leaves nothing in the output.
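A rough sketch of that "encode first, write afterwards" ordering is shown below. It is not Scrapy's actual code and not necessarily the fix adopted upstream: `SafeJsonListWriter` and `export_all` are hypothetical names, and the class only mimics the general shape of an exporter using the standard `json` module.

```python
import io
import json


class SafeJsonListWriter:
    """Hypothetical stand-in for a JSON list exporter (not Scrapy's class)."""

    def __init__(self, file, **encoder_kwargs):
        self.file = file
        self.encoder = json.JSONEncoder(**encoder_kwargs)
        self.first_item = True

    def start_exporting(self):
        self.file.write("[\n")

    def export_item(self, item):
        # Encode before touching the file: if this raises, nothing
        # (in particular no delimiter) has been written for this item.
        data = self.encoder.encode(item)
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(",\n")
        self.file.write(data)

    def finish_exporting(self):
        self.file.write("\n]")


def export_all(items):
    """Export items to a string, skipping those that cannot be encoded."""
    buf = io.StringIO()
    writer = SafeJsonListWriter(buf)
    writer.start_exporting()
    for item in items:
        try:
            writer.export_item(item)
        except (TypeError, ValueError):
            pass  # a real exporter would log the error instead
    writer.finish_exporting()
    return buf.getvalue()


# The middle item holds an object the default encoder cannot serialize.
items = [{"date": "2018-01-01"}, {"date": object()}, {"date": "2019-01-01"}]
output = export_all(items)
print(json.loads(output))
```

Because the `first_item` flag is only cleared after a successful encode, the output stays well-formed even when the very first item is the one that fails.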
Thanks @tlinhart - will take a look at making a fix to that right away. :)
Hi, I am looking for a beginner issue to start working with. Any chance I can pick this up and start working on it?
hey @gekco! this is almost fixed in https://github.com/scrapy/scrapy/pull/3111, the remaining issue is extra tests which are executed unintentionally.
Marking as a good first issue as finishing https://github.com/scrapy/scrapy/pull/3111 may be easy based on the feedback provided there.