scrapy-streaming

[WIP / Discuss] Scrapy Streaming docs

Open aron-bordin opened this issue 8 years ago • 1 comments

Moved from https://github.com/scrapy/scrapy/pull/1991

PR Overview

This is initial work on the Scrapy Streaming docs. You can read it here: http://gsoc2016.readthedocs.io

I'd like to open a discussion about the communication protocol. It's quite similar to the original protocol in my proposal, with some modifications. My idea is to open up the API design process, so I can get feedback and modify the API before implementing it.

In my proposal, I suggested starting the implementation of this API on June 13, so it would be helpful to settle on a definitive API before that date.

Suggestions for new messages and new behavior are also welcome :smile:

Implementation

Adding some comments about the implementation:

Originally, I suggested implementing the communication channel between Scrapy and the external spider using Twisted's ProcessProtocol and, as pointed out in the docs, each message ends with a line break \n. I started an initial POC to get an idea of how this should work.

However, this implementation could run into buffering problems, because the messages sent by transport.write and received by outReceived can be buffered by the system, so a single outReceived call may deliver a partial message or several messages at once.

Looking at @Preetwinder's POC, he uses https://github.com/Preetwinder/ScrapyStreaming/blob/master/linereceiverprocess.py#L53 to wrap the process and avoid these buffering issues.

Now I'm analyzing the best way to approach these possible problems with stdin/stdout buffering.

Since all messages must end with a line break, both implementations (streaming core and external spiders) could "buffer" the received data and process it only after receiving the line break (the end of the message). A different implementation could also make this easier: using the LineReceiver in the streaming core could help when receiving data. But I'm still not sure about the best way to write to the process stdin; unfortunately, stdbuf is not available on all platforms.
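For the writing side, one option that does not depend on stdbuf is to flush explicitly after every message. The snippet below is only a rough sketch of that idea for a spider written in Python. `send_message` is a hypothetical helper, not part of Scrapy or Twisted, and the message content is a placeholder rather than a real protocol message.

```python
import io


def send_message(stream, message):
    """Write one newline-terminated message and flush it right away.

    Hypothetical helper: `stream` is any text stream (for example
    sys.stdout in an external spider), `message` is a single-line string.
    """
    if "\n" in message:
        # The line break marks the end of a message, so it cannot
        # appear inside the message itself.
        raise ValueError("a message must not contain a line break")
    stream.write(message + "\n")
    # Push the data out of the stream's own buffer instead of waiting
    # for the buffer to fill up.
    stream.flush()


# Usage with an in-memory stream standing in for the pipe.
out = io.StringIO()
send_message(out, "placeholder message")
print(repr(out.getvalue()))
```

This only deals with the buffer of the writing process itself. The receiving side still needs to handle data arriving in arbitrary chunks.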

As part of the communication protocol, the line break is defined as the end of the message. If the external spider developer relies on this and only parses the received data once the line break arrives, this should be enough.

Do you have any comments about the implementation and these possible issues?
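To illustrate the receiving side, here is a minimal sketch of a buffer that collects incoming chunks and only hands out complete messages. `MessageBuffer` is a hypothetical name used for illustration, not an existing Scrapy or Twisted class, and the chunk contents are placeholders.

```python
class MessageBuffer:
    """Collect raw chunks and return only complete, newline-terminated messages."""

    def __init__(self, delimiter=b"\n"):
        self._delimiter = delimiter
        self._pending = b""  # bytes received after the last line break

    def feed(self, chunk):
        """Add a chunk and return the list of messages it completed."""
        self._pending += chunk
        # Everything before the last delimiter is complete; the final
        # piece is an unfinished message (or empty) and stays buffered.
        *complete, self._pending = self._pending.split(self._delimiter)
        return complete


# Chunks do not have to line up with message boundaries.
buf = MessageBuffer()
print(buf.feed(b"first mes"))         # nothing complete yet
print(buf.feed(b"sage\nsecond me"))   # completes the first message
print(buf.feed(b"ssage\n"))           # completes the second message
```

The same approach works on either side of the channel, because it relies only on the rule that a line break ends a message.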

aron-bordin, Jun 01 '16 04:06