scrapy-streaming
[WIP / Discuss] Scrapy Streaming docs
moving from https://github.com/scrapy/scrapy/pull/1991
PR Overview
This is initial work on the Scrapy Streaming docs. You can read them here: http://gsoc2016.readthedocs.io
I'd like to open a discussion about the communication protocol. It's quite similar to the original protocol in my proposal, with some modifications. My idea is to open up the API design process so I can get feedback and revise the API before implementing it.
In my proposal, I suggested starting the implementation of this API on June 13, so it would be helpful to settle on a definitive API before that date.
Also, suggestions for new messages and new behavior are welcome :smile:
Implementation
Adding some comments about the implementation:
Originally, I suggested implementing the communication channel between Scrapy and the external spider using Twisted's ProcessProtocol and, as pointed out in the docs, each message ends with a line break (\n).
I started an initial POC to get an idea of how this should work.
However, this implementation could run into buffering problems, because the messages sent by transport.write and received by outReceived can be buffered by the system, so a single outReceived call is not guaranteed to contain exactly one complete message.
Checking @Preetwinder's POC, he uses https://github.com/Preetwinder/ScrapyStreaming/blob/master/linereceiverprocess.py#L53 to wrap the process and avoid these buffering issues.
Now I'm analyzing the best way to approach these possible problems with stdin/stdout buffering.
Since every message must end with a line break, both implementations (streaming core and external spiders) could "buffer" the received data and process it only after receiving the line break (the end of the message).
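To make the idea concrete, here is a minimal sketch of that kind of buffering in plain Python. It is not part of the proposed API: the class name LineBuffer and its feed method are hypothetical, and the message payloads are treated as opaque bytes. It only shows that chunks can arrive split or merged and still be reassembled into whole messages.

```python
class LineBuffer:
    """Accumulate raw chunks and return complete newline-terminated messages.

    Hypothetical helper for illustration only; not part of scrapy-streaming.
    """

    def __init__(self, delimiter=b"\n"):
        self._delimiter = delimiter
        self._pending = b""  # bytes received after the last line break

    def feed(self, chunk):
        """Add a chunk and return the list of messages it completes."""
        self._pending += chunk
        # Everything before the last delimiter is complete; the tail
        # (possibly empty) stays buffered until its line break arrives.
        *messages, self._pending = self._pending.split(self._delimiter)
        return messages


buffer = LineBuffer()
first = buffer.feed(b"first mes")                 # no line break yet
second = buffer.feed(b"sage\nsecond message\nth")  # completes two messages
third = buffer.feed(b"ird message\n")              # completes the last one
```

The same approach would work on either side of the pipe, since it depends only on the line-break rule and not on how the operating system chooses to chunk the data.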
Also, a different implementation could make it easier. Using the LineReceiver in the streaming core could help while receiving data. But I'm still not sure about the best way to write to the process stdin; unfortunately, stdbuf is not available on all platforms.
As part of the communication protocol, the line break is defined as the end of the message. If the external spider developer relies on this and only parses the received data once the line break arrives, this should be enough.
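From the external spider's side, this could look roughly like the sketch below. It assumes a Python spider that reads messages from stdin and replies on stdout; the function name run and the handle callback are hypothetical, and the message contents are left as opaque strings. Flushing after every reply is one way to keep the spider's own output from sitting in a buffer, without depending on stdbuf.

```python
import sys


def run(handle, stdin=sys.stdin, stdout=sys.stdout):
    """Read one message per line and write one reply per line.

    `handle` is a hypothetical callback: it takes the message text
    (without the trailing line break) and returns a reply or None.
    """
    for line in stdin:
        message = line.rstrip("\n")
        if not message:
            continue  # ignore empty lines
        reply = handle(message)
        if reply is not None:
            stdout.write(reply + "\n")
            stdout.flush()  # push the reply out immediately


if __name__ == "__main__":
    # Echo every message back, only to show the loop working.
    run(lambda message: message)
```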
Do you have any comments about the implementations and these possible issues?