draft: File ID Manager base implementation
Overview
Draft implementation of the File ID service proposed here.
The File ID service maintains a bi-directional mapping between file IDs and file paths. It permits lookup in both directions and preserves a file's ID across filesystem operations performed through the ContentsManager; a minimal sketch of the core idea follows the feature list below.
- Adds a `FileIdManager` class with associated unit tests
- Integrates `FileIdManager` methods into the `ContentsManager` and `AsyncContentsManager` filesystem methods
- Appends an `id` property containing the file ID to the Contents API GET/POST/PUT/PATCH `/api/contents/{path}` operations
- Adds a benchmark under `jupyter_server/benchmarks/fileidmanager_benchmark.py`
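To make this concrete, here is a minimal, hypothetical sketch of such a SQLite-backed bi-directional mapping. It is not the actual `FileIdManager` code in this PR; the class name, schema, and methods are illustrative assumptions, and IDs here are plain SQLite rowids (which is exactly what the uniqueness discussion further down turns on).

```python
import sqlite3

# Illustrative sketch only -- NOT the FileIdManager implementation in this PR.
# IDs are SQLite rowids, i.e. integers unique only within this one database.
class MiniFileIdManager:
    def __init__(self, db_path=":memory:"):
        self.con = sqlite3.connect(db_path)
        self.con.execute(
            "CREATE TABLE IF NOT EXISTS Files ("
            "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
            "  path TEXT NOT NULL UNIQUE"
            ")"
        )

    def index(self, path):
        """Return the ID for `path`, creating a record if none exists."""
        row = self.con.execute("SELECT id FROM Files WHERE path = ?", (path,)).fetchone()
        if row:
            return row[0]
        cursor = self.con.execute("INSERT INTO Files (path) VALUES (?)", (path,))
        self.con.commit()
        return cursor.lastrowid

    def get_path(self, file_id):
        """Reverse lookup: ID -> path, or None if unindexed."""
        row = self.con.execute("SELECT path FROM Files WHERE id = ?", (file_id,)).fetchone()
        return row[0] if row else None

    def move(self, old_path, new_path):
        """Preserve the ID across a rename done through the ContentsManager."""
        self.con.execute("UPDATE Files SET path = ? WHERE path = ?", (new_path, old_path))
        self.con.commit()
```

The `move()` method captures the key property: the record's ID survives a rename, so consumers holding an ID (comments, RTC sessions, notebook jobs) keep pointing at the same file.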
Testing
To test with JupyterLab stable, developers can run the following in a separate conda environment:
```bash
git clone git@github.com:jupyter-server/jupyter_server.git
cd jupyter_server/
git remote add dlqqq git@github.com:dlqqq/jupyter_server.git
git pull dlqqq file-id-service
pip install jupyterlab
pip install -e ".[dev,test]"
```
Benchmarks
To run the File ID Manager benchmark:

```bash
python jupyter_server/benchmarks/fileidmanager_benchmark.py
```

Benchmarks were run on an m5.12xlarge AWS EC2 instance.
```
% python jupyter_server/benchmarks/fileidmanager_benchmark.py

Index benchmark (separate transactions)
      100 files | 0.2932 s
    1,000 files | 2.9968 s

Index benchmark (single transaction, atomic INSERTs)
      100 files | 0.0032 s
    1,000 files | 0.0072 s
   10,000 files | 0.0362 s
  100,000 files | 0.3430 s
1,000,000 files | 3.5570 s

Index benchmark (single transaction, batched INSERTs)
      100 files | 0.2897 s
    1,000 files | 0.2663 s
   10,000 files | 0.2613 s
  100,000 files | 0.2678 s
1,000,000 files | 2.7359 s

Recursive move benchmark
      100 files | 0.0065 s
    1,000 files | 0.0093 s
   10,000 files | 0.0370 s
  100,000 files | 0.3499 s
1,000,000 files | 3.6639 s

Recursive copy benchmark
      100 files | 0.0106 s
    1,000 files | 0.0121 s
   10,000 files | 0.0233 s
  100,000 files | 0.1632 s
1,000,000 files | 1.5505 s

Recursive delete benchmark
      100 files | 0.0033 s
    1,000 files | 0.0047 s
   10,000 files | 0.0150 s
  100,000 files | 0.1359 s
1,000,000 files | 1.4314 s
```
I think we could borrow quite a bit of the SQLite connection logic from SessionManager here to avoid duplicating code. We'll need to abstract some of these pieces out and share them across classes.
For example, the definition of the connection, cursor, etc. could be abstracted out of SessionManager and used by both of these managers: https://github.com/jupyter-server/jupyter_server/blob/e59610b6c7270e2987979b079b87cc9ef9d6ad2d/jupyter_server/services/sessions/sessionmanager.py#L200-L228
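One possible shape for that abstraction, as a sketch under assumptions: a small mixin (hypothetical name `SQLiteConnectionMixin`) holding a lazily-created connection and cursor, loosely modeled on the `connection`/`cursor` properties SessionManager defines at the permalink above.

```python
import sqlite3

# Hypothetical sketch -- names are assumptions, not existing jupyter_server API.
class SQLiteConnectionMixin:
    database_filepath = ":memory:"  # subclasses/traits would configure this

    @property
    def connection(self):
        # Lazily create one connection per manager instance.
        if not hasattr(self, "_connection"):
            self._connection = sqlite3.connect(self.database_filepath)
            self._connection.row_factory = sqlite3.Row
        return self._connection

    @property
    def cursor(self):
        # Lazily create one cursor bound to that connection.
        if not hasattr(self, "_cursor"):
            self._cursor = self.connection.cursor()
        return self._cursor

    def close(self):
        # Release the connection; the next access re-creates it.
        if hasattr(self, "_connection"):
            self._connection.close()
            del self._connection
```

Both SessionManager and FileIdManager could then inherit the mixin instead of each re-implementing connection handling.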
Thank you for working on this, @dlqqq! I think a File/Document ID will be really useful for a lot of use cases! I'm excited to see this land.
My number one concern about the current implementation is that the file ID isn't strictly unique—i.e. it's just an index in the database. In order to make this service more broadly useful, I think these IDs must be unique.
Let me give a simple example...
This service could be useful for sharing notebooks between two running Jupyter Servers (not necessarily RTC, just basic copying between two servers)—an enhancement many people have asked us for.
In this example, each server has its own file ID database that maps file IDs to paths on its individual filesystem. If we "share" this file from one server to another, I would hope we could use this service to form an (admittedly weak) link between the two copies of the document. In Jupyter Server today, we can't create this link, because name/path wasn't a viable solution given the lack of uniqueness. In this implementation, we also can't create this link, because we can't guarantee uniqueness of these IDs between the two servers. They will likely have conflicting/unavailable indexes in their databases, since they have completely different filesystem structures.
> Thank you for working on this, @dlqqq!

Yes, thank you!
> Let me give a simple example...

Sorry for jumping in here without reading the complete history... have the requirements for distributed systems/servers like JupyterHub been taken into account (uniqueness, portability across servers, ...)?
@Zsailer Hey Zach! Thank you for taking the time to review my progress so far. Right now, I'm working on implementing handling for out-of-band filesystem operations, along with a design doc that dives deeper into how the implementation actually works and into some of its shortcomings.
I've addressed a few of your questions below and will address the others once I'm done with my current work. Keep in mind, though, that the logic for handling out-of-band operations gets a bit more tricky, so it'll likely need a re-review.
> I think we could borrow quite a bit of the SQLite connection logic from SessionManager here to avoid duplicating code. We'll need to abstract some of these pieces out and share them across classes.

Well, the SQL connection/statement logic is fairly simple, and I'm not sure it's worth abstracting. The SQL statements are all fairly self-contained and only take two, maybe three lines per statement. Over-abstracting and splitting the source across too many files hinders readability. But I'll take a closer look at this concern when I'm done with my current work.
> My number one concern about the current implementation is that the file ID isn't strictly unique—i.e. it's just an index in the database. In order to make this service more broadly useful, I think these IDs must be unique. ... This service could be useful for sharing notebooks between two running Jupyter Servers (not necessarily RTC, just basic copying between two servers)—an enhancement many people have asked us for.
I'm lost as to why a file ID needs to be, say, "globally unique" to support this use-case. This is just a file copy. If the receiving server needs data (e.g. comments) attached to that copy, the sending server sends that data in tandem. The receiving server is then free to assign a different ID to that new file.
IMO, there's no reason for these two servers to share a file ID for a given file on two separate filesystems because... well, they're not the same file. A copy of a file is a new file. This is how it's currently working in local filesystems. Circling back to commenting, if I add a comment in one file, I should not be adding a comment to another file just because it's a copy of the original and shares the same ID. Shared IDs across files leads to shared state across files, which gets complex really fast.
I'm mainly giving pushback here because using a UUID for the primary key is a performance detriment. I'm not giving a firm "no" here.
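For anyone who wants to check that claim locally, here is a rough micro-benchmark sketch comparing integer rowid primary keys against UUID TEXT primary keys in SQLite. It is not part of this PR's benchmark suite, the table/column names are invented for illustration, and the numbers will vary by machine; it is only meant to make the trade-off measurable.

```python
import sqlite3
import time
import uuid

def time_inserts(schema, make_row, n=100_000):
    """Time n INSERTs in a single transaction, as in the benchmarks above."""
    con = sqlite3.connect(":memory:")
    con.execute(schema)
    start = time.perf_counter()
    with con:  # one transaction, committed on exit
        con.executemany(
            "INSERT INTO Files VALUES (?, ?)",
            (make_row(i) for i in range(n)),
        )
    return time.perf_counter() - start

# Integer primary key: the rowid itself, cheap to generate and index.
int_pk = time_inserts(
    "CREATE TABLE Files (id INTEGER PRIMARY KEY, path TEXT)",
    lambda i: (i, f"file{i}.ipynb"),
)

# UUID primary key: 36-char TEXT keys arriving in random order, which
# costs extra storage and B-tree rebalancing on insert.
uuid_pk = time_inserts(
    "CREATE TABLE Files (id TEXT PRIMARY KEY, path TEXT)",
    lambda i: (str(uuid.uuid4()), f"file{i}.ipynb"),
)

print(f"integer PK: {int_pk:.4f} s | UUID PK: {uuid_pk:.4f} s")
```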
@echarles Great questions! Yes, these concerns will be taken into much further study, but only after we handle operations on local filesystems. Could you elaborate more on what you mean by "distributed [Jupyter] servers"? AFAIK JupyterHub is still a single server, so it seems more accurate to refer to it as "remote" rather than "distributed".
Regarding portability and uniqueness, could you elaborate more on these as well? @kevin-bates brought these points up too. Do you mean that file IDs should be globally unique as mentioned earlier? Again, I'm still a bit lost on the use case for this; to my knowledge Jupyter server only runs on a single machine, and a client can't connect to multiple Jupyter servers.
Rough sketch of how to handle remote Jupyter servers/filesystems: the idea is that the ContentsManager implementation for remote servers/filesystems should expose methods like dir_exists(), file_exists(), stat(), etc., which the FileIdManager invokes via self.parent.X(). For local filesystems, self.parent refers to the default implementation, FileContentsManager.
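A hypothetical sketch of that delegation (class and method names are assumptions, not code from this PR): the FileIdManager never touches the filesystem directly, so pointing self.parent at a remote ContentsManager changes the backing store transparently.

```python
class FileIdManagerSketch:
    """Illustrative only -- shows the self.parent delegation pattern."""

    def __init__(self, parent):
        # `parent` is the ContentsManager; for local filesystems this is
        # the default FileContentsManager.
        self.parent = parent

    def _validate(self, path):
        # Existence checks go through the ContentsManager API rather than
        # os.path, so remote backends (S3, etc.) work without changes here.
        if not (self.parent.file_exists(path) or self.parent.dir_exists(path)):
            raise FileNotFoundError(path)
```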
(Disclaimer: I have not taken the time needed to read and digest the dense and very useful discussion around the File ID service, but I think it is not too early to lay down requirements.)
> Great questions! Yes, these concerns will be taken into much further study, but only after we handle operations on local filesystems.
Well, if the design and the technical implementation turn out to be fairly different when we look at this later, that may be annoying (e.g. if the database generated in the many/distributed/remote-server case is different from the local database). To what extent, I'm not sure; one could argue we don't really care, since the local database would not be migrated to any such distributed system anyway. But playing devil's advocate, I could imagine my local server has a custom server extension that sends those IDs to a corporate backend, and that one day my notebooks would have to be served from JupyterHub. The reverse is also true: a JupyterHub instance could allow downloading a notebook that could then be run on my local laptop.
In short, I would prefer to cover the JupyterHub case in the design.
> Could you elaborate more on what you mean by "distributed [Jupyter] servers"? AFAIK JupyterHub is still a single server, so it seems more accurate to refer to it as "remote" rather than "distributed".

I am thinking of a case where the notebook files are loaded from a shared storage system, e.g. NFS. In that case, the user goes to the server 1 instance today and to the server 2 instance tomorrow.
Another case is RTC, where a server hosted by JupyterHub will be consumed by multiple users.
So true, "distributed" is a bad name. IMHO a server is always "remote"; "ephemeral" may be better, but still not very good...
Hey @kevin-bates! Thanks for the review. You make some excellent points. However, I was literally just about to push my latest changes on top of this PR. I'll do my best to address your comments soon.
To the rest of the team, I wanted to segment design discussion to a separate tracking issue, since this project will likely have to span multiple PRs due to its size. See here: https://github.com/jupyter-server/jupyter_server/issues/940
I'm pushing my latest changes (which track out-of-band filesystem operations) now.
EDIT: oh geez, the commit history is weirdly interwoven with review comments. Yikes. LMK if you all want me to rebase and edit the commit dates.
Codecov Report
Merging #921 (86fcaf3) into main (1ec1aee) will increase coverage by 1.21%. The diff coverage is 100.00%.
:exclamation: Current head 86fcaf3 differs from pull request most recent head fb8cc40. Consider uploading reports for the commit fb8cc40 to get more accurate results.
```
@@            Coverage Diff             @@
##             main     #921      +/-   ##
==========================================
+ Coverage   71.44%   72.66%   +1.21%
==========================================
  Files          65       66       +1
  Lines        7705     8084     +379
  Branches     1289     1339      +50
==========================================
+ Hits         5505     5874     +369
+ Misses       1805     1804       -1
- Partials      395      406      +11
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| jupyter_server/pytest_plugin.py | 88.88% <100.00%> (+0.70%) | :arrow_up: |
| jupyter_server/serverapp.py | 65.88% <100.00%> (+0.60%) | :arrow_up: |
| jupyter_server/services/contents/fileidmanager.py | 100.00% <100.00%> (ø) | |
| jupyter_server/services/contents/filemanager.py | 72.25% <100.00%> (+0.05%) | :arrow_up: |
| jupyter_server/services/contents/handlers.py | 86.55% <100.00%> (ø) | |
| jupyter_server/services/contents/manager.py | 83.29% <100.00%> (+0.56%) | :arrow_up: |
| jupyter_server/auth/identity.py | 82.93% <0.00%> (-7.27%) | :arrow_down: |
| jupyter_server/auth/security.py | 75.67% <0.00%> (+0.33%) | :arrow_up: |
| ... and 5 more | | |
@kevin-bates Thank you for your suggestions! I addressed as many as I could, but I believe it's best if I start working on separating my work into a standalone server extension. The great thing about this is that we can split up each of your concerns into a separate PR to address them.
Hey team! This is now migrated into a separate server extension: https://github.com/jupyter-server/jupyter_server_fileid
Thank you all for your review comments! I've left open review questions as issues on that repo.
Great discussion, a few comments:
- Right now, real-time collaboration only works when multiple users access a single Jupyter Server. The case where multiple users access multiple servers isn't handled yet, and a lot of other things would have to change to get that working. As such, I don't think the File ID service needs to solve this case yet. However, we should look ahead to when it will have to deal with that case.
- The scope of a given File ID service is a ContentsManager. Thus, if a user has multiple ContentsManagers serving different file systems, they would presumably have one File ID service for each of them. For example, if a user has a local file system and an S3 bucket exposed through two ContentsManagers, they would have two File ID services.
- Related to this case, I can imagine situations where a user wants to move or copy a file between different ContentsManagers (copying a file to S3, for example). In that case, it may help to have UUIDs that are globally unique.
> - The scope of a given File ID service is a ContentsManager. Thus, if a user has multiple ContentsManagers serving different file systems, they would presumably have one File ID service for each of them. For example, if a user has a local file system and an S3 bucket exposed through two ContentsManagers, they would have two File ID services.

(This is why it would have been nice if file IDs were a function of, and emitted from, the ContentsManager.) Since the File ID service is now a consumer of ContentsManager events, how will a given File ID service instance determine that the received event relates to a LargeFileContentsManager, an S3ContentsManager, or a FooContentsManager?

> - Related to this case, I can imagine situations where a user wants to move or copy a file between different ContentsManagers (copying a file to S3, for example). In that case, it may help to have UUIDs that are globally unique.

Yes, but more importantly: once objects referencing file IDs find their way to other systems (via sharing or whatever), they will be incorrectly associated with the wrong files until UUIDs are used as the primary ID.
Kevin, I think you are asking the right questions about how the File ID service will work across different file systems. At the same time, there are a lot of other issues with how JupyterLab and Jupyter Server work with different and multiple file systems. We plan on tackling those questions after JupyterLab 4 and Jupyter Server 2 are out, and that work will include all of the questions you are asking. For now, we want to make sure that the basic use cases of commenting, notebook jobs, and RTC work well when files/directories are moved/renamed in JLab 3 and 4.
I see, thank you for your response. Just to be clear, I'm not as worried about the filesystem issues as I am about avoiding data integrity emergencies - per my second point above - which we can continue discussing in the linked issue.