Version file tree diff: design doc
ref https://github.com/readthedocs/readthedocs.org/issues/11319
Manually linking to the preview since we don't have this feature yet :D https://dev--11507.org.readthedocs.build/en/11507/design/file-tree-diff.html
:books: Documentation previews :books:
- User's documentation (`docs`): https://docs--11507.org.readthedocs.build/en/11507/
- Developer's documentation (`dev`): https://dev--11507.org.readthedocs.build/en/11507/
I just checked the latest commit with the updates and I feel there are still some things I'd like to discuss and to be reflected in this document:
- Discard the idea of using S3 to save the data and only use the database.
- Mention potential sorting of files changed based on `PageViews`.
- Discard the sorting based on lines changed for the v1 implementation.
- Use a `VersionDiff` database model that points to two different `Build` objects.
- Keep the endpoint un-authed and leave the authed version for v2.
- Change the endpoint URL to use query parameters.
- Always generate the `VersionDiff` on successful builds instead of in a background task pattern the first time the API endpoint is hit ("lock on request").
- Limit the API endpoint to only work over external versions for v1.
- Always return full objects in the API response instead of only IDs/slugs (following the APIv3 pattern).
- Add it under a feature flag so we can roll it out slowly and start testing the new feature on our own projects first. That may help us estimate the impact on S3 queries.
> Discard the idea of using S3 to save the data and only use the database.

I'm +1 on that, but wanted to hear others' opinions.
> Mention potential sorting of files changed based on `PageViews`.

I'm not really sold on this idea; the order will feel random and unexpected relative to the changes made in the PR.
> Discard the sorting based on lines changed for the v1 implementation.
I'm okay leaving that for later
> Use a `VersionDiff` database model that points to two different `Build` objects.
It should also point to the versions, since builds aren't deleted when a version is deleted. Or we can manually delete diff objects once a version is deleted.
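To make the shape concrete, here is a minimal Django sketch of that idea; the model, field, and relation names are assumptions for illustration, not the final design:

```python
from django.db import models


class VersionDiff(models.Model):
    # The two builds whose output is being compared.
    build_a = models.ForeignKey(
        "builds.Build", on_delete=models.CASCADE, related_name="+"
    )
    build_b = models.ForeignKey(
        "builds.Build", on_delete=models.CASCADE, related_name="+"
    )
    # Point at the versions too, so deleting a version cascades and
    # cleans up its diffs even if the builds outlive it.
    version_a = models.ForeignKey(
        "builds.Version", on_delete=models.CASCADE, related_name="+"
    )
    version_b = models.ForeignKey(
        "builds.Version", on_delete=models.CASCADE, related_name="+"
    )
```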
> Keep the endpoint un-authed and leave the authed version for v2.

That will be a breaking change; we should decide if we want to have it under auth or not.
> Always generate the `VersionDiff` on successful builds instead of in a background task pattern the first time the API endpoint is hit ("lock on request").
This assumes we will only run a diff between two fixed versions -- the PR preview and `latest`? Or `stable`? I still think it's useful to allow diffing between any versions on demand.
From Eric's comments, I think we don't want to expose this as an API yet anyway, but only through the addons endpoint. That again requires us to choose which version to compare against; I guess we will choose the one that the default branch points to.
> > Keep the endpoint un-authed and leave the authed version for v2.
>
> That will be a breaking change; we should decide if we want to have it under auth or not.
I'm saying that we can implement the authed version only for Read the Docs for Business to access private projects. For now, we can start implementing this feature on Read the Docs for Community to get some experience/feedback and grow from there for the v2.
> That again requires us to choose which version to compare against; I guess we will choose the one that the default branch points to.
We should use the same setting we are already using for DocDiff, which is currently `LATEST`. In the future, we should allow configuring the version to compare against from the addons configuration.
This is assuming we will only run a diff between two sets of versions, PR preview and latest? or stable? I still think it's useful to allow diffing between any versions on demand.
There are good points for each of the different approaches. I think this decision is important since it will dictate how we and other people will use this feature. @ericholscher @agjohnson what are your thoughts on each of these different approaches?
It doesn't really seem like we need to make a decision about diffing different versions now? We just ship with the initial implementation that diffs the PR build against the latest version? We can always add some way to trigger diffing of arbitrary versions in the future, but it sounds like we don't need to do that in the initial implementation ("v1")?
@ericholscher what's your position on 1) implementing the diffing as a background task and requiring hitting the API twice to get the data, or 2) always performing the diffing at build time so the API endpoint is hit only once and always returns the same response?
> implementing the diffing as a background task and requiring hitting the API twice to get the data
For clarification, the first call will trigger the task and wait for the result.
It seems pretty straightforward to support both situations: where we have the files on disk, and where we need to query S3 for a set of them. I think to start we do it during build time, while we have the files on disk for PR builds?
If we're worried about performance and cost of S3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.
I like this idea
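A rough sketch of the manifest idea, assuming whole-file MD5 hashing over the built HTML directory (the paths and function names here are illustrative, not a settled format):

```python
import hashlib
from pathlib import Path


def build_manifest(html_dir: str) -> dict[str, str]:
    """Map each HTML file's relative path to an MD5 of its raw bytes."""
    root = Path(html_dir)
    return {
        str(path.relative_to(root)): hashlib.md5(path.read_bytes()).hexdigest()
        for path in root.rglob("*.html")
    }


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two manifests without fetching any file contents again."""
    return {
        "added": sorted(set(new) - set(old)),
        "deleted": sorted(set(old) - set(new)),
        "modified": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }
```

Comparing two builds would then only touch the two manifest files, not the files themselves.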
> > If we're worried about performance and cost of S3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.
>
> I like this idea
We already built this, and then ripped it out :)
https://github.com/readthedocs/readthedocs.org/pull/8111
> We already built this, and then ripped it out :)
Heh yeah, though we ripped it out because it was expensive and we also weren't using it. I don't feel we'd need a per-file model for this either way; a single manifest file seems reasonable.
I am starting to feel that rclone is the wrong solution to build on top of. Rclone only does comparison at the file metadata level, which is only the bare minimum we need. It still requires us to do a secondary comparison to avoid giving the user a long list of files that haven't actually changed.
This may or may not require something like a hash manifest file, but definitely requires we do something on top of rclone.
Some prior art to look at here is sitediff:
- https://github.com/evolvingweb/sitediff
A few different options it has for avoiding false positive diffs are:
- Selector targeting https://github.com/evolvingweb/sitediff?tab=readme-ov-file#selector
- Rule based excludes https://github.com/evolvingweb/sitediff?tab=readme-ov-file#sanitization
- And others like DOM transforms etc
So, really quite advanced, and obviously way more than we'd build. But it shows the difficulty of comparing two sets of HTML and some of the solutions necessary. Selector targeting seems like the easiest to accomplish.
I think it would help to see actual output of rclone between two versions of our docs, maybe another project too. We've only discussed many of the plans here theoretically so far, it would help to figure out exactly what we're going to be working with.
I gave sitediff a quick test; it's a neat tool. I didn't want to crawl our prod docs, so I just did the dev docs. I'd expect no changes between these two versions.
Unconfigured, the output is what I expected from rclone -- quite noisy.
Diffs show a lot of noise beyond content changes in all these files -- anywhere `latest`, `stable`, or a commit hash is compared. This would be comparable to using rclone to detect changes in files.
Using a configuration with `selector: "div.document[role='main']"`, the results are far cleaner.
The noisy changes from version names and commit hashes are largely avoided, but the diffs here are mostly from intersphinx altering the link titles. Unfortunately, these are also invisible changes to the user, so the UX of telling users these files changed and then linking them to a doc page where nothing looks different is something we will need to work around.
The second run is better, but I was still expecting no changes at all, so it's still not ideal.
@agjohnson the research around sitediff you've done is great! It does almost exactly what we want to build here. There are really good ideas and thoughts to consider there.
I did a quick test myself as well and it worked great, finding differences in only one file after defining the selector 👍🏼
> The noisy changes from version names and commit hashes are largely avoided, but the diffs here are mostly from intersphinx altering the link titles.
They mention similar issues in their README file and solve them with a sanitization config.
That said, I think we should keep discussing the path we want to follow here, since there are a few possible implementations on the table already, and it seems there is more research we can do before taking the next step.
Yeah, and I wouldn't build exactly what sitediff built but testing with it does highlight what we can expect from similar approaches we're discussing. We'll hit these hurdles too, and might find ourselves needing similar features -- not that I think we should build something this complex.
Agreed, we should continue discussing this more. I don't have a great feeling for where we go from here, but it feels like we'll need to give users more control than file-based comparison allows.
I suppose to echo what we're doing with output paths, we could expect the build to output a manifest. We could make this a job step that the user can override and apply advanced transform rules if they want.
A left-field option could be using DocDiff to do the comparisons and crawling each site. This brings back the async issue, though, and could be slow. However, it would be much better UX in that the list of files would align with what DocDiff shows the user visually.
I thought more about this. I started thinking about building an integration with sitediff and using it behind the scenes to get the list of changed files. I took a look, and it seems it's an application in itself: it can't be used "as a library" or as an "external command" to grab all the metadata/data it generates and uses to show those nice pages. Besides, it's written in Ruby, so it would require setting up another environment just for this, plus finding a way for it to interact with our Python code... I discarded this idea after this quick research.
Then, I went back to the origins of all this discussion: "we need a way to map a source file into a URL" 💡
If we are thinking of giving users the ability to configure the "file tree diff" feature to detect changes between two rendered HTML outputs by defining the main selector, potentially allowing them to define other selectors to ignore some chunks, and... blah blah -- I would like to propose something a lot easier for everybody: why not just ask them "how to map a source file into a URL" 🪄 ?
Knowing that, we can use the right tool designed to compare files that we all know: Git. With that information, we can then run the mapping the user provided us [^1] over each of the changed files to get the URL. Git will tell us:
- what are the files that changed
- how many lines have changed on each file (useful for sorting)
- ... a lot more useful data -- it's Git
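A hypothetical sketch of that flow; the mapping regexes and the function are illustrative only, not a proposed API:

```python
import re
import subprocess

# Hypothetical pre-defined mappings per doctool; real patterns would
# vary with each project's layout.
MAPPINGS = {
    "sphinx": (r"^docs/(?P<path>.+)\.rst$", r"\g<path>.html"),
    "mkdocs": (r"^docs/(?P<path>.+)\.md$", r"\g<path>/"),
}


def changed_urls(base: str, head: str, tool: str = "sphinx") -> list[str]:
    """List doc URLs whose source files changed between two commits."""
    pattern, replacement = MAPPINGS[tool]
    # `git diff --numstat` could additionally give per-file line counts,
    # useful for the sorting mentioned above.
    changed = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{head}"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return [re.sub(pattern, replacement, f) for f in changed if re.match(pattern, f)]
```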
There are two cons that I'm seeing here:
- we need to ask the user for a mapping [^2]
- we won't be able to find changes outside source files (e.g. template changes) [^3]
In my opinion, this looks like a good idea to discuss and explore more compared to the alternatives we have been discussing. It may not be perfect, but looks a lot easier to understand, to implement, and to explain to users.
I'd love to know your feedback about this. Where do you see potential issues? What do you like? What do you dislike? How do you compare it with the alternatives?
[^1]: we could have pre-defined regexes for documentation tools we already know (MkDocs, Sphinx, etc.) or let the user create a custom one. We could add more pre-defined values for new tools as we grow.
[^2]: not a problem if we have a list of pre-defined ones, though
[^3]: even with something like sitediff, a change like this would communicate changes everywhere
> I discarded this idea after this quick research.
Agreed. And sorry if I wasn't clear above, but I was just testing sitediff; I also did not want to build on top of this tool. It just already does much of what we described building, so it's worth comparing against at least.
> why not just ask them "how to map a source file into a URL" 🪄 ?
I also had this same thought path. Though I was also considering building on top of the list of files PRs report as changed, I realized that this is not a good option, as the project might not build the PR base branch, or the version might not match.
Relying on Git to detect changes is a closer possibility. I would still vote to keep this as a backup option for now: it requires us to maintain more custom Git operations, and the user has to give us a mapping.
I could see us requiring a mapping file in the build, or we might even use a `<meta name="source-file" content="docs/install/index.rst" />` tag in the built HTML.
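For instance, recovering the source file from a built page could look something like this (hypothetical, assuming the theme injects the meta tag above and that BeautifulSoup is available):

```python
from bs4 import BeautifulSoup


def source_file_for(html: str) -> str | None:
    """Return the source file a built page declares, if any."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"name": "source-file"})
    return tag.get("content") if tag else None
```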
However, after thinking through source file mapping, and requiring the user tell us how to map changes, I came all the way back to just asking the user to tell us what changed:
> I suppose to echo what we're doing with output paths, we could expect the build to output a manifest. We could make this a job step that the user can override and apply advanced transform rules if they want.
That is, projects would have to output a hash manifest of the HTML files in their build:
- To start, our build job would just hash whole files. This is just an easy way to move forward for now and gives us noisy output similar to rclone.
- Our default job would eventually support Sphinx/MkDocs output generically, hashing just the contents of parsed HTML (`div.document[role="main"]`, etc.) -- we have some good defaults that cover the majority of projects.
- Projects can override this job and hash content however they please, performing transform steps on content, etc. There just needs to be a `_readthedocs/html/checksum.md5` (or similar) at the end of the build.
The plan above for describing differences is still just pulling the two checksum files and noting what doesn't match. We have the page URLs in the checksum file, can note page deletions/additions, and when our defaults don't work the user has full control.
It does seem like we should involve the user at some point, at least for some portion of projects.
We discussed this in the call today. Takeaways:
- We need to diff HTML, because there are a number of edge cases where content can change based on outside files (eg. include files, code API rendering, etc.)
- Doing file diffs is too noisy because of metadata changes (eg. dates, commit hashes, etc.)
- We should be doing content diffs, where content is defined by a content CSS selector. This should be user configurable in v2, but for v1, we can use our existing content guessing logic from search and other places.
- We should store the content hash on the `ImportedFile` on build, as well as support the build outputting the hashes into a file (`$READTHEDOCS_OUTPUT/readthedocs-hashes.json`).
- When we want to compare two versions, we can compare content hashes, and don't need to re-read the full file contents from S3.
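A loose sketch of a build step writing that file; the JSON schema here is an assumption, not a settled format:

```python
import json
import os


def write_hashes_file(hashes: dict[str, str]) -> None:
    """Write page-path -> content-hash pairs where the build collects output.

    `hashes` maps paths like "install/index.html" to their content hashes.
    """
    output_dir = os.environ["READTHEDOCS_OUTPUT"]  # set during RTD builds
    with open(os.path.join(output_dir, "readthedocs-hashes.json"), "w") as f:
        json.dump(hashes, f, indent=2, sort_keys=True)
```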
For a first pass, I feel we should do whatever is quickest so that we can work on the UX here too. This might even be file-based hashing, just to start experimenting. Relying on our search code to break a DOM down to a text-only representation feels like a second or third pass, maybe?
Thinking more about text representation vs. DOM comparison, DOM comparison seems like where we should ultimately end up. Users should be alerted to changes like image URLs, HTML classes, and other non-text DOM changes, not just text changes. Maybe a text-only mode becomes a configurable option later, though.
> This might even be file-based hashing, just to start experimenting. Relying on our search code to break a DOM down to a text-only representation feels like a second or third pass, maybe?
I think we can just start with extracting the main content of the file (we already do this for search) and just hash it as is.
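As a minimal sketch of that, assuming BeautifulSoup and a hard-coded selector (the real implementation would reuse the existing search extraction code):

```python
import hashlib

from bs4 import BeautifulSoup

# Assumed default; user-configurable selectors would come in v2.
MAIN_CONTENT_SELECTOR = "div.document[role='main']"


def content_hash(html: str) -> str | None:
    """Hash only the main content, so page chrome (version menus,
    commit hashes in footers, etc.) doesn't produce noisy diffs."""
    soup = BeautifulSoup(html, "html.parser")
    main = soup.select_one(MAIN_CONTENT_SELECTOR)
    if main is None:
        return None
    # Hash the raw inner HTML as-is, without rendering it to text.
    return hashlib.md5(main.encode_contents()).hexdigest()
```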
I have updated the design doc to include things that we discussed, I left the old ideas in case they are useful, but I'm fine removing them.
> I think we can just start with extracting the main content of the file (we already do this for search) and just hash it as is.
Yeah, I think we're talking about the same thing. I'm saying we avoid rendering the DOM to text and instead hash the inside contents of an element inside the HTML document. I'm not familiar with what we are technically doing with search indexing.
Also, with the new implementation discussed for v1: what would the API response look like? Do we need to adjust anything there?
The initial implementation won't have a public API; we will only expose this feature through the addons. But I don't think the response proposed before needs major changes.
I'd like to move forward on this. It feels like we're mostly on the same page, so I think we can resolve most of the remaining questions on an actual implementation at this point :+1:
Going to merge this since we are working on the implementation now.