Notes
I want to open this issue to track any notes, either from meetings, thoughts, or conversations. Many of these will likely translate into actionable issues, but I find it helpful to have one issue on GitHub to consolidate and preserve cross-issue thoughts.
First, an update on the current status of this work. Previously, we had been working on this in a PR to JupyterLab (https://github.com/jupyterlab/jupyterlab/pull/6871) and discussing it on an issue in that repo (https://github.com/jupyterlab/jupyterlab/issues/5382).
The goal with creating this new repository is to give us some breathing room to really spec out, design, and implement the independent components needed for this work, without having to consider how they will break JupyterLab's data models at the same time. Also, it will hopefully let us get more eyes on this work, by splitting it out from the main repo and focusing on making it useful and documented as standalone tools.
The basic split is into at least three layers, which are roughly:
- Base CRDT library. This is currently in the datastore package in lumino (ex-phosphor). This is mostly done and possibly just needs a few bug fixes and algorithm tweaks, but overall the API is stable. Basically, it provides a strongly typed distributed tabular datastore.
- Real-time data store built on that work. This would include a patch relay server, or a mechanism using Dat or IPFS, and possibly some helper tools to provide richer semantics on top of the datastore, for things like foreign keys, as well as possibly integrations with tools like React and RxJS. Basically, add some fluff on top of the underlying data structures to make them more attractive and easily usable. On top of this is where we would build a TODO MVC example, and it would be roughly comparable to an open-source Firebase-like client.
- Jupyter specific data store. This builds on top of the real time data store to provide a Jupyter specific data model. It provides a server side "relay" component that connects directly to the existing Jupyter server and a client side component that talks to the server side component. The RTC data model is the backbone that keeps these components on the same page.
I opened this issue because I wanted to record a few thoughts on the data model.
How to implement the "Console"?
The current data model assumes you are doing all your execution through a notebook. So you tell the server to execute some cell and it goes and does it and updates that model:
https://github.com/jupyterlab/rtc/blob/0a6ef543146a12287a8ff6af5eff2119e2786051/index.ts#L112-L121
However, what about execution that isn't tied to a notebook? For example, in JupyterLab we are able to open consoles and execute code in them like you are able to in a REPL. How would this work?
So I think we need another table, called executions, which is basically a log of every execution, with its text, kernel, outputs, and possibly some other metadata like timing. Then, each "cell" can have a foreign key to the executions table. When a notebook is first loaded we actually don't know what kernel executed each cell, so we should make sure the kernel field in executions can be null.
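To make this concrete, here is a minimal sketch of what the executions table (and the foreign key from cells) might look like with the Lumino datastore Fields API. The schema ids and field names here are just illustrative, not a settled design:

```typescript
import { ReadonlyJSONValue } from '@lumino/coreutils';
import { Fields } from '@lumino/datastore';

// Hypothetical "executions" schema: a log of every execution.
const EXECUTIONS_SCHEMA = {
  id: 'executions',
  fields: {
    // The source text that was actually sent to the kernel.
    code: Fields.Text(),
    // Kernel that ran this execution; the empty default stands in for
    // "unknown", e.g. cells loaded from disk whose kernel we never saw.
    kernelId: Fields.String(),
    // Outputs produced by this execution, as serialized JSON.
    outputs: Fields.List<ReadonlyJSONValue>(),
    // Timing and other metadata.
    metadata: Fields.Map<ReadonlyJSONValue>()
  }
};

// A cell record would then point back at an execution via a foreign key.
const CELLS_SCHEMA = {
  id: 'cells',
  fields: {
    text: Fields.Text(),
    // Record id in the executions table; empty until the cell is run.
    executionId: Fields.String()
  }
};
```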
Although at first this might seem a bit odd (and we do have to worry about GC, with some ability to clear old executions at some point if we need to save memory), it actually brings some real semantic benefits. We would be able to know, for each cell, what input text produced its output (assuming we executed it and didn't load it from disk). It would also give us a place to store things like execution timing in a structured way.
What to do about comms and metadata?
This is more of an open question than a solution at the moment, but I was realizing that both comms and metadata are tricky because their semantics are not defined in core. They do have semantics, they're just variable.
For example, let's assume the execution time metadata isn't included in core (although it probably will be, since we added it to the notebook spec finally, I believe?). So in core you have this metadata field which is Fields.Map<ReadonlyJSONValue>(). But you wanna add some semantics on top of that, like the execution metadata field really has keys a, b, and c, and you wanna maybe have your own model for that. So you could create another table called cell_execution_metadata which has a pointer to a cell ID and three other fields a, b, and c. Now you have solved the semantic issue somewhat: by telling people to use that other table, you can collaborate on each execution key separately. However, you then have two tables in the database that need to be synced in some way, so that when the notebook is saved to disk it has the correct metadata.
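As a rough illustration of that idea (the table name and the a/b/c field names are just the placeholders from the paragraph above), the extra table might look something like:

```typescript
import { Fields } from '@lumino/datastore';

// Hypothetical table giving structure to one metadata sub-field.
const CELL_EXECUTION_METADATA_SCHEMA = {
  id: 'cell_execution_metadata',
  fields: {
    // Foreign key back to the cell this metadata belongs to.
    cellId: Fields.String(),
    // The structured keys that would otherwise live in the raw metadata map.
    a: Fields.String(),
    b: Fields.String(),
    c: Fields.String()
  }
};
// Something (the supernode, presumably) would still have to keep this table
// and the cell's raw metadata map in sync when the notebook is saved to disk.
```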
Some of the same issues come up when dealing with comms. They are also unstructured and used by different extensions to implement different semantics. Originally, I had assumed that comms would not go through the proxy server and that you would just connect directly over websockets to access them. However, I think we could wrap them in the datastore layer by basically having a table of comm messages (will detail in an upcoming PR). In which case there would be a general layer in the data store for "comm" messages, but particular use cases would likely need to add further semantics on top of this.
For example, I have an extension that uses comms for querying the active kernel for some data. The basic semantics there are that the client opens a comm channel with the query information and the server should send a message back with the results. We could represent this with two tables: a toProcess that has just an input field, and a results with an input and output field. If we assume each response can be cached, then the client will check if the input exists in results and, if not, adds it to toProcess and waits for it to show up in results. The server process then checks toProcess and pops each of them off. I guess we would need another inProgress table that has input and comm_id. So it would go ahead and create a comm channel for that, associate the comm_id with the input, then wait for the result.
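A minimal sketch of the client side of that flow, assuming a Lumino datastore holding those two tables (all of the schema and field names here are hypothetical):

```typescript
import { Datastore, Fields } from '@lumino/datastore';

// Hypothetical schemas for the comm-backed query example.
const TO_PROCESS_SCHEMA = {
  id: 'toProcess',
  fields: { input: Fields.String() }
};
const RESULTS_SCHEMA = {
  id: 'results',
  fields: { input: Fields.String(), output: Fields.String() }
};

// Client side: check the cache, otherwise enqueue the query.
function requestQuery(datastore: Datastore, input: string): void {
  const results = datastore.get(RESULTS_SCHEMA);
  const toProcess = datastore.get(TO_PROCESS_SCHEMA);
  // Use the input itself as the record id so duplicate requests collapse.
  if (results.get(input)) {
    return; // response already cached
  }
  datastore.beginTransaction();
  toProcess.update({ [input]: { input } });
  datastore.endTransaction();
  // The server process pops rows off toProcess, opens a comm for each, and
  // eventually writes a row into results; the client watches
  // datastore.changed and reads the matching row when it appears.
}
```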
So you can see that in both of these cases (the metadata and comms) we will probably create other tables (and possibly other backend services) to deal with the particular semantics of sub-fields. But they can use the built-in fields to trigger those actions. The trick is that you will have a duplication of data in the database, where some fields are derived from others, and we need to do work to make sure they stay up to date. Also, this example introduces a concept of queue tables, which brings us to our next point...
How to deal with actions?
Currently, I had been thinking that anything that required some particular action would happen via a REST API call:
https://github.com/jupyterlab/rtc/blob/0a6ef543146a12287a8ff6af5eff2119e2786051/main.py
For example, executing a cell, or loading a notebook from disk. However, as the comm example above shows, we could instead implement actions as "queue" tables: basically, create a table for each action we wanna take, have the clients add rows to it, and have the server pop rows off to execute the actions (see the sketch after the list below). Not sure why it took this long to think of this, but I think it has a number of advantages over having a separate REST API.
- Ability to guarantee local ordering of events. Here is an example. Let's say you want to execute cell 1 of notebook xxx. So you send a call to the server to do that. Then you change some values in that cell. You would expect that the local ordering of those actions (executing first, then changing contents) gets mapped correctly, so what gets executed is the original contents of the cell, not the contents after editing. However, if executing is a separate REST API call, then these could be mixed up if the event loops are running separately. Basically, it's unclear if there is a guaranteed ordering between the CRDT update and the REST call executing on the server. However, if we moved it all to CRDT updates, and we guarantee that local ordering is preserved for CRDT message propagation (which I know was something @ian-r-rose and I had a conversation about a while back), and the server has only one event loop to process CRDT updates, then we can verify that it will use the earlier version of the notebook, not the later one.
- Actions dispatched over websockets. I had been struggling for a while to understand how we could keep some of the semantics of REST APIs and still let them be sent over websockets for efficiency's sake. One idea would be to use something like JSON-RPC over websockets (https://github.com/TypeFox/vscode-ws-jsonrpc) or SOAP or some other RPC mechanism. However, this approach lets us avoid that question by using our existing data transfer protocol instead of relying on a new one.
- Centralize all communication over one protocol. This also means all the communication between server and client would go through sending patches. This means if we did use something like DAT or IPFS to synchronize patches (instead of a central patch server), this would allow us to simply connect directly with that and not have to expose any other ports from the Jupyter server.
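To make the queue-table idea above concrete, here is a rough sketch of what an "execute cell" action table and the client-side enqueue could look like (the schema, field names, and statuses are made up for illustration, not a settled design):

```typescript
import { UUID } from '@lumino/coreutils';
import { Datastore, Fields } from '@lumino/datastore';

// Hypothetical queue table for the "execute a cell" action.
const EXECUTE_REQUESTS_SCHEMA = {
  id: 'executeRequests',
  fields: {
    notebookId: Fields.String(),
    cellId: Fields.String(),
    // 'pending' | 'running' | 'done'; the server flips this as it works.
    status: Fields.String()
  }
};

// Client side: requesting an execution is just another CRDT transaction, so
// it is ordered with every other local change (e.g. edits to the cell text).
function requestExecution(store: Datastore, notebookId: string, cellId: string): void {
  store.beginTransaction();
  store.get(EXECUTE_REQUESTS_SCHEMA).update({
    [UUID.uuid4()]: { notebookId, cellId, status: 'pending' }
  });
  store.endTransaction();
}

// Server side (sketch): on each datastore change, pick up pending rows, send
// the execute request to the kernel, write the outputs back into the shared
// tables, and mark the request as done.
```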
There is some other amazing work going on in the JS community around real time text editing! https://twitter.com/jaredforsyth/status/1232532781936173056 Ideally, it would be nice for whatever we build to have support for other backends besides Phosphor's CRDT implementation, if possible. I know it might not be, considering we need to support a number of data types besides just text.
I'm trying to read up on the status of this. Can you give some context on how the deprecation of PhosphorJS is affecting this project?
It has been forked to lumino and is being maintained by the JupyterLab team, and the datastore package is the CRDT implementation.
https://github.com/automerge/automerge/pull/253 in particular is an interesting discussion in the development of automerge, I thought.
Thank you, very interesting project. And nice to see it is under active development.
Are you suggesting merging the automerge implementation with the datastore package?
Is the "JSON-like data structure" of automerge sufficient for JupyterLab?
@EWouters
Are you suggesting merging the automerge implementation with the datastore package?
Yes, one question is whether we use the datastore package or use automerge. I will work on making some tests and proofs of concept to try out automerge as well.
Is the "JSON-like data structure" of automerge sufficient for JupyterLab?
There is some previous conversation about that (https://github.com/jupyterlab/jupyterlab/issues/5382#issuecomment-523829867).
Hello! Lead developer of Automerge here. Let me know if I can help in any way :)
A few updates from this week, on my end:
- I added the grant we submitted to CZI to fund this work (the results are not public yet): https://github.com/jupyterlab/rtc/pull/12
- Scheduled a first meeting for a week from Monday: https://github.com/jupyterlab/rtc/pull/12
- Moved over a Todo app example using the existing lumino datastore. I also switched it to use socket.io to back the patch relay server, instead of the previous server, which was Tornado in Python. That server had more functionality, but this one has fewer lines of code and is easier to prototype with and debug, for me. https://github.com/jupyterlab/rtc/pull/9
- Updated diagram with new package names: https://github.com/jupyterlab/rtc/pull/11
I am currently working on reworking the rtc-node package to remove any lumino-isms and make it more compatible with rxjs and react. Then I will update the todo app to use that, and then start working on the jupyter supernode and jupyter node.
For that, I have been looking at the packages nteract provides to interact with a Jupyter server. It seems like rx-jupyter might be a nice package to use! I couldn't tell off hand if it also handles the websocket connections, but I think it does, maybe it's just not typed?
Basically I need to do things like this sendExecuteRequestEpic. I am not sure where the channels object that we get here is created. I need to trace that down!
It sounds like Microsoft's new "Fluid framework" might be built on CRDTs?
“Steve Lucco looked at it from the fundamental level of ‘What if we built every experience on top of a data structure that was inherently distributed?’” says Dan Costenaro, a principal product manager on Microsoft’s Fluid Framework. “That’s how Fluid Framework was born initially, just a super powerful concept.” [...] “Fluid for developers is a web-based framework that you can use to instantly make your apps collaborative,” explains Spataro. “It provides data structures that perform low-latency synchronization. Those data structures connect between themselves with a relay service, and that relay service is designed to connect endpoints.”
It says it's open source on a bunch of news articles but I cannot find it. If it was general enough, it sounds like a solid base to build on for Jupyter! cc @steveluc
I can announce now that we received our grant from CZI to help fund this work for the next year or so! I added a copy of our proposal to this repo, if you would like to read it.
Notes from our first meeting; I opened a few issues to track the next steps. The next meeting is in two weeks:
June 1st
- Attendees
- Saul Shanabrook, Brian Granger, Vidar, Tim George, Ralf Gommers, Athan Reines, Tony Fast, Yousef Mehrdad, Ani Krishnan, Eric Charles, Zach Sailer, Scott Draves, Damian Avila, Blake Regalia, Christopher Arthur Hansen Brooks, Chris Holdgraf,
- Agenda
- Introductions (who you are, why you are here)
- Christopher Arthur Hansen
- My current open position, which I hope would be able to contribute to projects like this: https://careers.umich.edu/job_detail/185039/full_stack_research_programmer
- Feel free to email me with questions on it: [email protected]
- Additional links:
- Here are some links to our work:
- https://ipitweb.files.wordpress.com/2019/06/wang_ipit-1.pdf
- The paper from that presentation:
- https://dl.acm.org/doi/10.1145/3359141
- This summer's pub:
- https://www.youtube.com/watch?v=ocQRfmKEutU
- https://dl.acm.org/doi/abs/10.1145/3313831.3376740
- Ivan Gomes at NASA Jet Propulsion Lab
- RTC is next enabling technology
- Defining what we mean by collaboration.
- Real-time shared editing
- Google-docs like editing experience in Notebooks
- Data model that can handle simultaneous distributed editing
- Secondary aspects
- A real-time datastore also has the potential to provide a universal undo/redo system for all entities in JLab.
- It can also enable notebook outputs to be collected from a running kernel while the UI is closed.
- Today's meeting is about this experience
- Real-time sharing across two servers
- Syncing contents, notebooks, etc across two Jupyter Servers running on different machines.
- Shared server vs. shared Kernel
- Kernels cannot be shared between two real-time syncing servers.
- Sharing Jupyter services/resources
- Projects, Groups, Teams, etc. accessing the same Notebook directories, kernels, etc.
- Users have an identity that's visible in the UI.
- NOT JUPYTERHUB (at least not currently). JupyterHub is a hub for spawning single-user Jupyter servers.
- Sharing repeatable notebooks
- Notebooks shared between people should just run.
- Notebooks come with their environments embedded and easily installable.
- Feedback on what collaboration makes sense
- Their mental model is like a single server.
- The state of having multiple kernels on one notebook is mentally more confusing
- Single kernel sounds much less confusing, if the kernel is shared too. The data is part of the collaboration.
- On the other hand, it's easier to clobber
- We also need other communication channels, if sync
- Another layer needs to be thought through
- POSIX semantics for per cell execution
- Allow private space in notebook
- Research on these https://ipitweb.files.wordpress.com/2019/06/wang_ipit-1.pdf
- https://dl.acm.org/doi/10.1145/3359141
- Important to build system where we can explore these different ideas
- How can it help me work with myself? When I am on a bunch of different Jupyter servers at once?
- Current work around real-time collaboration in Jupyter
- CZI funding for real time data model
- one year timeframe
- Won't achieve full collaborative experience, it's about building collaborative data models for the jupyter server
- Plan
- Base: CRDT implementation, lumino and/or other JS implementations
- Middle: Friendly real time datastore using this with React integration
- Higher level/Jupyter: Support for editing all data in Jupyter server
- Jupyter clients: JupyterLab integration, spyder integration, nteract integration (examples)
- Benchmarks and algorithms [Brian]
- https://github.com/jupyterlab/lumino/pull/78
- https://github.com/jupyterlab/lumino/blob/8726968af142e44ca61ba28ab5f9f6911f34ee3f/datastore-benchmarks/results.md
- Read the benchmarks with a pinch of salt as the results fluctuate from run to run and are not necessarily reproducible
- Main takeaway is that lumino is competitive with other CRDT implementations.
- lumino outperforms automerge by orders of magnitude across the board ;-)
- Yjs and lumino each have higher performance in certain areas, but Yjs generally has small patch/id/document sizes.
- Unique ids require careful consideration due to memory allocation concerns
- Complex, lots of caveats, etc.
- LSEQ/KSEQ is the type of CRDT implementation we are using in Lumino
- Lumino needs to provide more ergonomic APIs (as compared to Yjs and automerge), as the Lumino APIs are currently lower level and more verbose.
- Desire for a CRDT algorithmic working group?
- Where can lumino improve? what are the edge cases? are there ideas from Yjs that Lumino can borrow?
- May want to consider an abstraction API to allow abstracting away the underlying implementation; however, this may be difficult due to the leaky-ness of the underlying implementations.
- Next steps
- Will post link to recordings in the minutes (https://github.com/jupyterlab/rtc/issues/21)
- Will post link to CZI deliverables in the minutes
- (https://github.com/jupyterlab/rtc/blob/master/funding/CZI-2020-proposal.md)
- Saul will write up user stories in the jupyterlab/rtc repo (https://github.com/jupyterlab/rtc/issues/22)
- Discussion Topics
- Background: proof-of-concept RTC editing in JupyterLab; however, this POC was too intimately tied to JupyterLab. Desire to make this independent of JupyterLab in order to allow for faster experimentation and enable RTC in Jupyter using other frontends.
- Moving Jupyter Server communication to the server instead of the client
- Allows having a client (supernode) always running on the server in order to allow, e.g., kernels to continue executing without losing results.
- Some tension between supernode and Jupyter server and where responsibilities should reside.
- Where does authentication and identity fit in this picture?
- There is a draft identity API; however, this is a WIP.
- Desire to have permissions for different user types.
- This grant does not explicitly address authentication.
- However, important to consider authentication from the start to avoid having to discard the work which comes out of this grant.
- Need to allow for Jupyter frontends to specify the data models of their choice.
- Desire to allow RTC extensions without breaking JupyterLab core.
- Need to treat JupyterLab as an important test case in order to prove concept and need to work in close collaboration in order to refine concept; however, tension in ensuring that RTC work is not greatly influenced by JLab concerns. Intent to provide general Jupyter infrastructure which is not wedded too strongly to any particular Jupyter frontend.
Sorry I couldn't attend today. I'd love to join the next one if I can.
I had a very productive meeting with @MSeal yesterday. It was great to hear from someone else who has also been thinking about this problem for a while, albeit coming from a slightly different space, nteract instead of JupyterLab. If we can find an approach that bridges those projects and people, then I will feel more confident that we are thinking at the right level.
Protocols
His biggest suggestion was to focus on documenting the protocols, so that this can be adopted by different organizations with custom business logic that might need to rewrite different parts of the stack. This was why Jupyter itself was successful: not due just to the initial implementation, but to the fact that anyone could write their own implementation of different parts of the stack.
We need to do more work here in articulating the different layers of this proposal and what their commitments are. There is currently a bunch of work in the CRDT space (recent automerge work), so we have to play an interesting dance here of leaving ourselves flexible enough in that space to try out different algorithms and implementations, while still moving forward on a consistent interface so we can start building on top of them. Each of these algorithms represents diffs/transactions differently, so it might be a while before the actual format over the network stabilizes.
Conflicts in Data Model
We also went through some interesting scenarios about different types of conflicts, like:
What if two people edit the same character at the same time?
I said under the current CRDT approach this would be handled in a best effort manner. There would be no conflict event raised that a client could respond to, it instead would take place according to the rules of the data structures, at some point with some arbitrary choice depending on the random ID each client was assigned. He said it will be important to document this sort of thing, so people know what the expectations are on the protocol.
What if two people execute a cell at the same time?
This would be handled by having a queue of cell executions that is handled by the "supernode." My thinking now is that this would be another model in the datastore. So each client would add an entry saying "please execute cell ID xxx of notebook ID xxx" and the "supernode" would be responsible for popping those off and actually executing the cell. So if it gets to the model and sees two rows with the same cell and notebook, it would know to just execute that once. Or if it sees it's already executing, then it could just keep executing.
So my current understanding is that there are two different levels of conflicts:
- Content editing conflicts This is example 1 above. These would be resolved locally and independently on all clients. The nature of the CRDT algorithms guarantees that if two clients have access to the same transactions, regardless of the order, their final state will be the same. So these will be eventually consistent. The clients will not get information on conflicts; the decision here is to bake the conflict resolution mechanism into the data structures themselves, so that end users (Jupyter in this case) don't have a say in how they get handled.
- Stateful action conflicts This is example 2 above. Any action that requires interacting with another system (like the Jupyter server), should be handled in a queue and resolved by the supernode. The idea here is to re-use our existing shared data model to keep these queues synchronized. This is not essential, but seems easier than keeping two separate shared data mechanisms. Since we already have one, from implementing above, let's just re-use that to keep queues synchronized.
"Models" vs "Tables"
He also suggested that the terms "models" and "schemas" might be more familiar terms than "tables." Sounds good to me.
So what if we say something like:
The datastore is composed of many models, each with a schema which describes what fields exist and the type of the fields. Each model contains a number of records which have an entry for each field in the schema.
This terminology seems consistent with redux-orm which sits in a similar space and has a fair amount of adoption.
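To map that vocabulary onto the Lumino datastore API as it exists today, here is a tiny sketch (the cells schema is purely hypothetical; only the terminology mapping matters):

```typescript
import { Datastore, Fields } from '@lumino/datastore';

// The schema describes what fields exist and their types.
const CELLS_SCHEMA = {
  id: 'cells',
  fields: { text: Fields.Text(), executionCount: Fields.Number() }
};

// A "model" is then the table of records conforming to that schema...
const datastore = Datastore.create({ id: 1, schemas: [CELLS_SCHEMA] });
const cells = datastore.get(CELLS_SCHEMA);

// ...and each record has an entry for every field in the schema.
datastore.beginTransaction();
cells.update({
  'cell-1': { text: { index: 0, remove: 0, text: 'print(1)' } }
});
datastore.endTransaction();
```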
Storing all Jupyter data in real time data model
I also brought up my proposal to store all (or almost all) of the Jupyter data in the real time data model. A very preliminary draft of some of these tables can be found here. I am working on a POC to demonstrate this approach more fully.
He had said he was originally considering just storing maybe the notebook state in the data model, but was interested by this approach. He said it would require a large change to clients to accommodate this. I agreed, saying that clients could always still talk to the original underlying Jupyter Server as they wanted, and could just tap into the real time data model as they needed. However, long term the goal would be to move more of the logic on interacting with the Jupyter server to this one data model, so more of the clients can re-use this code.
For example, take kernel state. Currently, you can ask the server to give you the state of all the kernels, and you also might know when a kernel changes state based on the messages you get back from it. Currently, each client has to do its own calculation of what it thinks all the kernel states are. In this new model, only the supernode would be responsible for this, and it would take care of updating the Kernel model with the proper information.
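As a rough illustration (the schema and field names are hypothetical), that shared Kernel model might look something like the following, with the supernode being the only writer:

```typescript
import { Fields } from '@lumino/datastore';

// Hypothetical shared "kernels" model. Clients only read it; the supernode,
// as the only process talking to the Jupyter server, keeps it current from
// the kernel REST API and the status messages it receives.
const KERNELS_SCHEMA = {
  id: 'kernels',
  fields: {
    name: Fields.String(),           // kernel spec name, e.g. "python3"
    executionState: Fields.String(), // "idle" | "busy" | "starting" | ...
    lastActivity: Fields.String()    // ISO timestamp of the last status change
  }
};
```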
I would like to try to iterate here a bit more and see if I can get something up and running as an example, so it's easier to make sure we are on the same page with how this could work.
This would be handled by having a queue of cell executions that is handled by the "supernode."
Note that this answer relies on node being present on the server. Have we abandoned all hope of delivering a solution that does not require node being present on the server?
I also brought up my proposal to store all (or almost all) of the Jupyter data in the real time data model.
While certainly interesting, I have some reservations about introducing a monolithic RTC session for the ~full application state. I'm still thinking about the use case where you want to invite a group of people to follow along with your notebook edits in read-only mode (e.g. when teaching a class/tutorial), without them being able to access any other information about your session or system.
Note that this answer relies on node being present on the server. Have we abandoned all hope of delivering a solution that does not require node being present on the server?
I have for prototyping purposes. Although of course the option still remains to re-implement the CRDT algorithms and the server API in Python, if that becomes essential.
I'm still thinking about the use case where you want to invite a group of people to follow along with your notebook edits in read-only mode
Yeah, I agree this is a compelling use case. However, I am intentionally ignoring permissions in my first go at this, because it adds a whole other set of complexities. When we do add them, I think we will have to add them at a more granular level than something read-only for everything. From my conversations with @Zsailer, it seems like we are moving towards fine-grained, event-based permissions on the server, so we will have to thread that all through at some point. For things like "You are not allowed to execute this notebook, but are allowed to execute this other one", etc.
However, I am intentionally ignoring permissions in my first go at this
Sure, but my point was that a shared CRDT session for all application state might be impossible to refactor to support permissions later on. I'm trying to avoid us making irreversible decisions (or at least being aware of decisions that will require a major refactor to undo).
Thanks Vidar for bringing this up, I opened an issue to discuss it more: https://github.com/jupyterlab/rtc/issues/28
Note that this answer relies on node being present on the server. Have we abandoned all hope of delivering a solution that does not require node being present on the server?
I have for prototyping purposes. Although of course the option still remains to re-implement the CRDT algorithms and the server API in Python, if that becomes essential.
I would hope the intent would be to make an API protocol that was language independent, even if an initial implementation is in Node. I'd prefer to have a Python implementation available for that protocol independent of if there was a node or go or java version around for multiple reasons. A) You want to be able to make a jupyter server extension that handles the new API protocol pattern without having an extra server in the hop. And B) all the other jupyter backend work is in Python, which makes it easier to maintain as it's more consistent within the Jupyter scope.
I would hope the intent would be to make an API protocol that was language independent, even if an initial implementation is in Node.
Sure, but there is also the question of what code would need to be ported to Python for a super node to be able to run Python-only. These are the ones I can think of at the moment:
- The CRDT algorithm.
- The code mapping CRDT <--> notebook document model
- The code mapping Jupyter notebook messages --> notebook document model
The last two points might be possible to replace with code that maps Jupyter notebook messages --> CRDT, but it all depends on which tasks the super node needs to be able to perform.
2020.06.15 Community Meeting
- Attendees
- Saul Shanabrook
- Vidar T. Fauske
- Athan Reines
- Brian Granger
- Daniel Althviz
- Joseph Wang
- Cryptocurrency
- Merge worlds of science and finance
- Trying to get funding from Hong Kong government
- trying to get grant that is open to needs from community
- Zihan Wang
- created plugin for Jupyter Notebook and published paper on RTC
- https://dl.acm.org/doi/10.1145/3359141
- Now interested in access control
- Scott Draves
- FRL (née CTRL-labs)
- Agenda
- Supernode example: https://github.com/jupyterlab/rtc/pull/34
- Would love critique!
- Brian: Is this a replacement for the Jupyter Server?
- Saul: No, it sits next to it and calls out to it.
- Brian: We have many fields in Lumino, scalars, maps, lists, text.
- Do we need non primitive types for other things besides the notebook?
- Saul: No I don't think so! Just for folders maybe?
- Nick:
- Can you write this as a custom kernel?
- Can we not use node?
- Maintaining a separate way to communicate over websockets is wasteful
- Saul:
- I see this as lower level than the kernel mechanism. You wouldn't have a way to talk to kernels directly in a world where this was used as the client.
- Performance and testing of lumino data store.
- Adding tests for lumino datastore id generation [Brian]
- Work on performance benchmarks.
- Use these as a baseline for further work.
- Performance characteristics depend heavily on metadata IDs
- To try fiddling with this, we need a test suite.
- Trying to get performance and tests there as a baseline
- Also, performance depends on the time to apply transactions from a remote host
- This happens when replaying all transactions from the host
- y.js has a nice thing here of a binary packing algorithm
- Use something other than verbose json data structures for transactions
- Vidar:
- There are ways to provide checkpointing so we can collapse history
- Saul: What about automerge performance updates?
- Brian: Ours is flexible enough down the road to swap out id generation and metadata generation, so we can incorporate performance changes
- Hasn't run against improved automerge yet
- Refactoring model factory logic in JupyterLab
- Right now, model factories are hardwired in JLab to the modeldb.
- To enable us to build RTC as separate extensions, we should enable model factories to provide any model type.
- Pushing to get this into 3.0.
- ModelDB is pretty different from the datastore models; pretty confusing to have both
- goal is to make modeldb optional
- Collaborative sessions in lab:
- What level of the UI do we break sessions into? (full UI state, per-document)
- What lifetime do we expect sessions to have? (one collaborative session, i.e. ~hours; full history through time, i.e. ~years)
- These questions are directly tied to UX.
- Motivation: Permissions/sharing, performance
- If we turn on RTC all the time we might have performance issues
- Brian:
- Lifecycle I have been thinking of is if no one has the document open, we reset the state
- More worried about long histories than many records
- When to synchronize depends on how long people are able to work offline and come back together and how long undo/redo state is kept
- Vidar:
- could have one session that is UX
- One rtc session per document
- Lets you split up the state more
- Not sure of technical consequences
- Work plan
- How do we break down the RTC work into separate work streams, so that we can partition the work among those interested?
- All work tracking should be on GitHub issues, should we also use GH projects?
Just uploaded the meeting from today to YouTube: https://youtu.be/fwvH53GXxM8
I had two calls this week about RTC, one with @bollwyvl and another with @williamstein, which were both very useful! Thank you to both of you.
I will try to summarize the notes here from our discussions. Please comment with additions/corrections for anything I missed.
Calls
Nick
He had worked on two previous collaborative Jupyter notebook experiences.
The first was using ShareJs/DerbyJS to provide a shared data model. It also used Node JS on the server, and used primus for communication. The types of use cases this was designed to support were:
- "Follow the leader" where one person is editing and the rest are following along. Useful in an education or conference setting. Only one person editing.
- Multi screen/second screen, like a remote control for a notebook on your phone while you are presenting
The second was using dat-js to build a prototype using the dat file system as the backing store. Every cell was a folder, every output was a file. This worked for the follow-along use case. You could hand off a dat ID to another user to instantly have them see what you were doing.
We also talked about how the compute here is maybe not as tied to a particular Jupyter Server but instead could be answered by the browser (like with jyve) or by connecting to your own server to start executing if a session is shared with you.
Also more of the fork/join sort of model, like in an education setting where a teacher might have a notebook, all students would then go and edit one cell, and the teacher could see all their different versions.
William
William has built a working real time system around Jupyter Notebook in CoCalc. Some features of that system:
- Debounces changes every 3 seconds and saves a checkpoint. Can click and drag to move through the history of the notebook to different checkpoints.
- Rewrote entire Jupyter Server (instead of calling out to it, like our current approach here), so there is one integrated node server.
- All code written in JS and isomorphic.
- The architecture is very similar to the one being explored here, in that there is an underlying piece that is a data model sync, which keeps a list of changes in sync between all clients and the server. These are managed in memory in the server, and then eventually serialized out to a database for longer term storage. There are also some cleanup tasks which combine edits on notebooks that haven't been edited in a while to reduce the number of changes in the DB. The model idea is very similar, where diffs are synced between clients, the server talks to the kernels directly, and none of the clients talk to the kernels; they just communicate changes. One difference is the implementation of the text field (and possibly other fields like lists/maps). Instead of a CRDT approach, he chose to send around diffs. The reasoning was that although CRDTs have better semantics when two people are editing close together, they increase the complexity, and it isn't that often that two people would, say, edit the same line of the same notebook at the same time.
In terms of future collaboration, he had been thinking about extracting out the shared data model piece for a while, because @rgbkrk had been talking to him about it before. He said the code is in pretty good shape. And it was a lot of work to get all the syncing parts together, and could be reused here.
I brought up that my main hope with this grant would be to have a shared library for the logic that turns the Jupyter Server calls into a data model. For example, all the logic, which is repeated in every client, around sending messages on different channels for execution, and then collecting the right execution results and matching them with the cell that executed them, and inserting the outputs. Through this work, we would move that logic all to the server, instead of in the client where it is today in JupyterLab.
And since that logic is rather complicated, especially on top of the data syncing part, the hope was that we could offload some of this complexity out of the core JupyterLab codebase, and share it between different frontends. This would also open the door to make new Jupyter frontends, like a mobile application, in a quite straightforward manner, because it would really just be about rendering and dispatching actions.
He understood the motivation here, but said that part of the cocalc codebase wouldn't be as easy to separate. We thought about the other Jupyter clients currently in development:
- nteract
- colab
- VS Code
- Kaggle (used to have RTC but was taken out)
- Spyder
I said I would try to have conversations with the colab and VS code teams, and continue the conversation with the nteract team, to understand their interest levels on possibly standardizing on a shared server side data model.
In terms of implementation details, he also said that the database basically has to be split up by notebook for cocalc, so that you only pull in changes for the notebook you are interested in.
One thing he emphasized was that it was very important to get the response time from executing a cell to seeing an output down. So he cautioned against putting something heavyweight, like a traditional RDBMS, in this path.
@saulshanabrook thanks for reaching out! I'm of course also interested in getting the full-fat, collaborating-with-my-team-on-big-data use case, but the use cases we talked about would potentially be quick wins, even without a large infrastructure setup, either being local, or entirely driven by voila-level complexity.
As you mention the isomorphic/performance angle: it would be very interesting if the "guts" of the rtc engine could be deployed as WebAssembly, for example with assemblyscript. For the single-user case, who I am still desperate to protect from having to run nodejs, the wasmer runtime has bindings for many popular languages... and of course would work in the browser. I can also warm up the conda-forge PR, which was previously blocked due to needing rust nightly (now available). This would delegate wiring it up to whatever network connection is needed for the use case, given the server environment it lived in (e.g. wasm-running-in-a-python-server, here's a zmq socket for a kernel, a socketio for talking to my client, and... something else to send/receive patches, get to it!)
June 29th
My internet cut out a bit during this, and I was involved heavily in the conversation so didn't take great notes during it! Sorry about that. I will upload a recording of at least part of the meeting, I also forgot to hit record for the first bit... 🤦♀️
Introductions
- Matthew Seal: Works on Jupyter backend libraries and nteract. At a startup called Notable. Before that was at Netflix. Finds the collaborative experience intriguing. Been talking to Kyle about it for a while.
Announcements
Updates
- Joseph:
- Wiki? Just would like people to populate wiki
- Hong kong government funding businesses to go online
- Getting conflicting messages on whether this would be useful
- Money intended for small businesses in hong kong. But they can buy services outside of hong kong. 100 million USD grant
- Getting grant on visualization and technologies
- If I get to the point where I can figure out what to fill out, then it becomes feasible.
- Saul:
- First, a meta thought on my process here so far. I feel that I am currently in the exploratory phase here. We have some tentative plans, which I am working on formalizing in the SPEC document. But there are also lots of stakeholders who I am trying to connect with more and understand how we could collaborate. So I see the current implementation work as really being in service to furthering the intellectual and social job of coming up with a plan and figuring out exactly the needs we want to address. Of course, at the end of this process the hope is to have some code that is in use in JupyterLab. But I personally at this phase feel like I need to tread lightly, mixing the time of putting together code to move the concepts along with constant questioning and opening up.
- initial spec PR
- I had some very helpful meetings last week with William Stein and Nick Bollweg.
- One main takeaway I had was re-focusing the core goals of the grant:
- Being able to share, between frontends, some of the complexity of turning the Jupyter Server communications into a consistent data model will help us all reduce internal complexity and provide a larger base for new Jupyter clients/tools to build on.
- Also, I am working on adapting to a new framing that centers the specification of protocols, over any particular implementations, like our CRDT model.
- Seeing William Stein's CoCalc implementation, which sent around diffs instead of using CRDTs for text fields, made me wonder if there is space to start with a simpler model like that, and either build in support for CRDTs as an opt-in between a certain client and server, or as another protocol entirely.
- Another way to view this problem is that it's actually trying to do two things at once:
- a) Provide a way for multiple users to collaborate on a text field together
- b) provide a data model that the server can keep up to date and push changes to the client
- The thinking at the time of writing the proposal was that we could kill these two birds with one stone: the Lumino CRDT implementation and tooling built on it. Now I am wondering whether we can figure out how to address these goals in some possibly more separate way.
- Brian
- I don't understand how we are organizing this effort.
- Let's start with UX design (user flows, user stories, etc.) and work backwards to the technical design.
- There are lots of people who want to participate at different levels, need to build a roadmap and work plan on GitHub (projects, issues).
- Assorted notes Saul took down during discussion:
- Maybe one way to split up the focus of this work is into the user-facing collaborative documents and the server-facing communication and execution of cells.
- Brian noted that the properties of these are different so it makes sense to use different technology for each.
- Brian: The server facing tools don't really need undo/redo or history in the same way, because that doesn't make sense for execution; it's not reversible
- We also talked about some potential UIs to prototype this work:
- Tony brought up a chat app where you can execute a single cell and drag it into a chat for others to view the code and the result
- Brian suggested a focus on personas and user stories
- Saul and Carol will work together to improve the contributing story on the repo and make it clear how people can help out.
Discussions
- Possible to create a specification that is implementable by multiple backends?
- Based on Konrad Hinsen's model: https://github.com/jupyter/enhancement-proposals/pull/4 can we separate the executions from the notebook format? Have the executions be a more primitive table and then the notebooks build on them?
- Then we could start with executions (and maybe kernels as well) and not support notebooks to begin with?
- Microservice architecture where different services to deal with each different schema?
July 13th
- Updates
- Saul
- Created project board
- Carol:
- Moving documentation to one place
- Moving towards adding simple Sphinx docs
- If folks have things they wanna see in docs, open issues!
- Eric:
- Issue 48, picture of packages and how they interact together
- Migrate this PR to basic Read the Docs content
- Been trying to understand content and how CRDT algorithm is working
- to confirm understanding of architecture and algorithm, will create basic chat example, to make sure what we have is agnostic to JupyterLab
- If it works for chat, it should work for jlab, nteract etc.
- Brian
- Going to open a few issues to help us organize the UX side of things.
- User stories
- How Might We Questions.
- Empathy Mapping.
- User flows.
- We may want to setup some separate meetings to do some collaborative design sessions.
- Discussion
- Darian:
- How will this be added to JupyterLab?
- Saul: Iterate out of core and
- Brian:
- Steve put out a link to a recent automerge talk on the hard parts of CRDTs
- https://github.com/jupyterlab/rtc/issues/15#issuecomment-657103225
- https://martin.kleppmann.com/2020/07/06/crdt-hard-parts-hydra.html
- CRDTs are class of algorithms, many implementations, edge cases not well documented.
- Basically still CS research level work, the talk is a good way to get into it!
- Matthew: Is yjs peer to peer?
- Brian: It requires patch forwarding. All you need to do is make sure all patches get to all parties eventually
- Matthew: Who is coordinator?
- Brian: Up to you! They provide implementations with a couple technologies
- Matthew: Might be good to provide interface at the top level, and then allow different interfaces
- Brian: The interface vs implementation for CRDTs is quite challenging
- There are a couple places to think about this question
- The patch/transaction format; the cost of serializing is a critical thing.
- So the messages you send over the wire are not very generic
- Interfaces are very highly tuned to that CRDT algorithm. Binary data only tuned to that.
- Structural form of transactions, yjs does binary packing that is very efficient and fast.
- You could implement yjs for a different format for its own data.
- Hard to implement a generic protocol that works across implementations
- metadata, ids, etc are all critical and specific
- The layer you can start to think about interface/implementation is level of data model that is passed to view components
- At the network protocol layer, it will be really challenging to standardize
- Matthew: That's really helpful to know. One thing to consider as well: in this case we are doing CRDT implementations, we don't have large deltas. Do you know why performance would be impacted by having JSON as the message buffer?
- Brian: CRDT sequences: imagine you have two array elements, 0 and 1. CRDT allows you to insert things between them, by using IDs that are basically arbitrary precision floats. They don't use floating point numbers, they use a binary encoding scheme that also includes an identifier for the client as well as a time clock. But conceptually, like floats. So at a minimum, the more you edit a document, the more precision you need to insert elements between other elements.
- The way they grow is highly dependent on CRDT algorithm
- some use binary tree, others make performance tradeoffs
- At runtime need to do lots of comparisons... So use strings for them, pack binary ids into strings
- That's one part where there is growth. Each individual patch might be small, but to assemble the current state of the document, you need to replay many thousands of patches
- Matthew: I would imagine there is a resync or checkpoint...
- Brian: The hard part won't be the core algorithm; yjs, automerge, lumino are all in the ballpark. I think the hard part will be those checkpointing things, as well as understanding how checkpoints read/write from the filesystem.
- Matthew: I'll listen to this.
- Brian: One question here is to think about offline work. When do we checkpoint? If we wanna enable true offline work, you can't use whether there is a network connection to manage sync points. More than willing to go deeper here if there is interest
- Saul: This is a helpful conversation. TODO: Question on start with diff and upgrade
- Brian: Think about sequences... Need insert/delete. We could have a JSON map to have this... If you just use this, you have conflicts
- Saul: Yeah you might get conflicts, but sometimes that's OK? Unless you are both editing at the same time.
- Brian: For notebooks cells it might be ok, for text maybe not.
- Tony: Hyperdrive might be interesting to look at
- https://pfrazee.hashbase.io/blog/hyperswarm
- https://github.com/hypercore-protocol/hyperdrive
- Brian: my use case is informed by how much pain people have with conflicts. I am concerned that you get conflicts. It seems like a core user story is that users don't have to worry about conflicts and merging.
- Tony design ideas:
- On all my JupyterHubs I have been having a Jitsi chat.
- These conflicts become resolved when you have chat; it lets you avoid conflict.
- If jitsi chat could be supported by RTC chat.
- I have never been able to work on notebooks so quickly as when we all chat on jitsi
- Binder: https://github.com/deathbeds/_fam
- nbgitpuller, to stay up to date on what folks are doing.
- It's not just video chat, but plus other things
- But what could this look like in a larger suite of tools
- Brian:
- One thing that is interesting, is that deep technical questions bubble up to users in a particular way. I want us to start with user stories and understand those.
- Eric:
- I have read a bit along the issues... And identity is something I care about. If you wanna collaborate, you need to know who you are talking to. This will come back in the coming weeks/months. I am curious if this feeling will be shown in user stories; the first step is to identify yourself. So you must know your peer and who you collaborate with.
- If ppl are happy with anonymous, it's good enough; if they need some security, you must identify yourself.
- Today I am not sure we cover those aspects
- Darian:
Amazing note taking btw. Thanks for being diligent on capturing these.
