File tree diff
This is on top of https://github.com/readthedocs/readthedocs.org/pull/11643.
Closes https://github.com/readthedocs/readthedocs.org/issues/11319
Ref https://github.com/readthedocs/readthedocs.org/pull/11507
This is still missing tests and docstrings, but it would be great to have a review to confirm we are good with this direction.
However, I'd definitely change where we are parsing and calculating the hashes. I don't like that we are doing this inside the Celery task for search indexing. I'd like this to live outside of search and generate the manifest.json file inside the build process, while we still have the files on disk. Besides, this gives users the ability to generate their own manifest.json using whatever technique they want. This is something we discussed and agreed on, so I wouldn't change it unless there are specific good reasons to do so.
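To make the idea concrete, here is a minimal sketch of what generating the manifest during the build could look like, assuming we simply walk the HTML output on disk and hash every file. The function names, paths, and manifest layout are illustrative assumptions, not the actual implementation.

```python
# Hypothetical sketch: walk the built HTML output while it is still on disk
# and record a hash per file. Names and paths here are assumptions.
import hashlib
import json
from pathlib import Path


def build_manifest(output_dir: str) -> dict:
    """Hash every file under the build output directory."""
    root = Path(output_dir)
    files = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            files[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"files": files}


if __name__ == "__main__":
    output_dir = "_readthedocs/html"  # assumed output location
    manifest = build_manifest(output_dir)
    Path(output_dir, "manifest.json").write_text(json.dumps(manifest, indent=2))
```

A user who wants different rules (for example, ignoring certain files or hashing only the main content) could produce the same manifest.json with their own script as part of their build.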
Doing this in the build process itself has some downsides:
- Build time may increase for users
- To regenerate the manifest, a build will be needed. New users wanting to use this feature will need to re-build their versions (we can just re-trigger a task instead with this approach).
- We already walk the file tree in the celery task, there is no need for additional code or test setup.
We can still have the "create your own manifest" feature; this implementation doesn't exclude it.
Build time may increase for users
This is fine by me. I think it's good to consider the generation of the manifest as part of the build time. That will give users (and us) a more realistic measure of the build process. Projects with heavy CPU consumption should pay us more money for bigger resources, instead of us hiding this behind a Celery task and paying the cost ourselves, without even understanding which customers consume that time on those tasks (we are not tracking this for these Celery tasks).
I think in most cases it won't have much impact, but I'm happy to see some data from performance tests here if that helps.
To regenerate the manifest, a build will be needed. New users wanting to use this feature will need to re-build their versions (we can just re-trigger a task instead with this approach).
I'm not seeing an issue here, since generating the manifest for old versions isn't very useful for now. We are only considering pull request and latest versions, which are being built all the time.
Even if we are able to trigger a task to generate the manifest for any version, it will only be possible using our own algorithm, since we don't know where and how the user might be generating this file. So, exposing such a feature could be confusing, or could even overwrite a manifest that was already generated with a custom process.
I'm fine with focusing on new versions for now and keeping the implementation simple.
We already walk the file tree in the celery task, there is no need for additional code or test setup.
Async things are always more complex to handle, maintain, debug, etc. Besides, it could even end up being a worse UX for users, since the build could have finished while the generation of the manifest has not, making the PR preview not show the file tree diff (because it's not ready yet). We would need to implement polling in the client and add more UI/UX around it to handle these cases, complicating things on both sides: backend and frontend.
On the other hand, walking a tree of files on disk is a pretty fast operation. I'm not seeing a lot of benefit in doing it only once compared with the complexity added by running it inside a Celery task, entangled with the search code.
I'd start as simply as possible, using a sync approach that generates the build manifest inside the build process, as we previously discussed together.
It sounds like maybe we need to discuss when this code runs a bit more in a call this week? I don't think I was thinking too deeply about it in the design process, tbh. In general this approach mostly makes sense to me, but the UX around the delay in the indexing is a reasonable concern.
I do think slower builds are a bad UX, and we can count this time as part of the build process in some other way, but making builds slower isn't a good thing to design for if we can avoid it.
Overall this architecture makes a ton of sense to me, since we're already doing all the parsing of the primary content in the search code. Perhaps a better question is why don't we also migrate the search indexing to be run during the build? The same logic presumably applies, and that would allow us to keep the abstractions married together.
I assume we need tasks to run this logic in an on-demand fashion (eg. search reindexing, or build manifest creation outside of a build), but it seems like it wouldn't be hard to support both: by default running while the files are on the build server, but also being able to run against S3 if needed?
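A rough sketch of that "support both" idea, assuming a single hashing step fed by either a local walk on the build server or a walk over the already-uploaded files through Django's storage API (`listdir`/`open`). The helper names and structure are hypothetical, not the real readthedocs.org code.

```python
# Sketch of one hashing step fed from two possible file sources.
# The storage calls (listdir/open) follow Django's Storage API; everything
# else (names, structure) is an assumption.
import hashlib
from pathlib import Path
from typing import Iterator, Tuple


def iter_local_files(output_dir: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (relative path, content) from files on the build server."""
    root = Path(output_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            yield path.relative_to(root).as_posix(), path.read_bytes()


def iter_storage_files(storage, path: str = "") -> Iterator[Tuple[str, bytes]]:
    """Yield (relative path, content) from files already uploaded to storage."""
    directories, files = storage.listdir(path)
    for name in files:
        full = f"{path}/{name}" if path else name
        with storage.open(full) as fd:
            yield full, fd.read()
    for directory in directories:
        subdir = f"{path}/{directory}" if path else directory
        yield from iter_storage_files(storage, subdir)


def hash_files(files: Iterator[Tuple[str, bytes]]) -> dict:
    """Build the manifest entries from either source."""
    return {file_path: hashlib.sha256(content).hexdigest() for file_path, content in files}
```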
Perhaps a better question is why don't we also migrate the search indexing to be run during the build? The same logic presumably applies, and that would allow us to keep the abstractions married together.
I don't know what the original reasons were for separating this process from the build itself. All my opinions about the hash generation apply here as well. However, I think the search indexing makes use of the DB (eg. ImportedFile), and that was probably the reason to run it in a Celery task on the web queue.
Removing async and Celery queue ops complexity is a win to me. If that's not possible, I would at least try not to add more work to those queues if we can avoid it, and I think we can in the file tree diff case.
I assume we need tasks to run this logic in an on-demand fashion (eg. search reindexing, or build manifest creation outside of a build),
We won't be able to run on-demand manifest creation if we want to allow users to create their own manifest using their own rules, as we discussed. This is because the commands will be executed inside the build process itself.
We won't be able to run on-demand manifest creation if we want to allow users to create their own manifest using their own rules
We can still have that feature; the user-generated manifest will take precedence over our process.
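As a minimal sketch of that precedence rule, assuming the user's manifest would simply appear in the build output (the helper name and file location are hypothetical):

```python
# Sketch of the precedence rule: a manifest.json produced by the user's own
# build commands wins; otherwise we generate one with our default hashing.
import json
from pathlib import Path
from typing import Callable


def get_manifest(output_dir: str, generate_default: Callable[[str], dict]) -> dict:
    user_manifest = Path(output_dir) / "manifest.json"
    if user_manifest.exists():
        # The user-generated manifest takes precedence over our process.
        return json.loads(user_manifest.read_text())
    # Fall back to our own generation (e.g. the hashing sketch above).
    return generate_default(output_dir)
```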
Removing async and Celery queue ops complexity is a win to me. If that's not possible, I would at least try not to add more work to those queues if we can avoid it, and I think we can in the file tree diff case.
We will be moving that complexity into the build itself; we can't escape from that. Executing things outside the build also keeps us moving in the direction of isolating the builds from our application.
I'd be :+1: on moving forward with this approach so we can start testing it, since it's already done, and then have a larger discussion about how to rearchitect this code. I think the ideas around the UX and manifest creation are gonna take some time to get right, so I'm fine doing the indexing after the build while we work on it behind a feature flag.
There's definitely a larger philosophical discussion here, and I don't want to block our initial testing on that discussion if we can avoid it, and changing where this code runs is easy to do after we have an initial version out and tested.
I agree with the points @humitos shared above. I think we should aim to surface as much of our build process to users as possible, especially any part where we are giving or plan to give control to the user.
It does feel like a bit of a miss to have the structure of build.jobs and not utilize it. Using build.jobs as a place to run and override both search indexing and hash manipulation would fit very well with the design of the rest of the build process. This also feels like a stronger delineation between the user's build and our application, as it is explicit and not hidden magic.
I think concerns over build time are premature, though certainly not invalid. We can guess that large projects might struggle with this, but we don't have any timing figures yet. Part of our plan should be to gather those figures. Are we talking 1-10s for most projects? That seems perfectly fine, and I'm happy to avoid the complexity added by more async tasks.
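One cheap way to start gathering those figures would be to time the manifest step itself and log it per project. A sketch (the logger usage and the fields recorded are assumptions):

```python
# Sketch: measure how long the manifest generation takes so we have real
# per-project figures instead of guesses. Names are illustrative.
import logging
import time

log = logging.getLogger(__name__)


def timed_build_manifest(project_slug: str, output_dir: str, build_manifest) -> dict:
    start = time.perf_counter()
    manifest = build_manifest(output_dir)
    elapsed = time.perf_counter() - start
    log.info(
        "Generated file tree manifest. project=%s files=%d seconds=%.2f",
        project_slug,
        len(manifest.get("files", {})),
        elapsed,
    )
    return manifest
```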
While we can find some way to give users control of hashing in the current model, I think the overall UX of this pattern is moving away from what we've been working towards with build.jobs.
I'd be 👍 on moving forward with this approach so we can start testing it, since it's already done, and then have a larger discussion about how to rearchitect this code.
Having said all that, I'm also :+1: on this, but with the caveat that it works as long as we maintain a neutral stance on this implementation and aren't investing more in this async pattern while we are testing.
From the call, sounds like we're :+1: on fixing this up for merge and starting to test the indexing and API internally.