
Generate Sphinx .fjson files for the generated HTML files to enable Read The Docs search.

Open adiroiban opened this issue 3 years ago • 11 comments

Read The Docs server side search works by indexing the .fjson files generated by the Sphinx json builder.

The indexing code on the RTD side is here https://github.com/readthedocs/readthedocs.org/blob/655bfd53f6fd3a6ea4c989354fd0f89a753d4369/readthedocs/search/parsers.py#L357

It looks for these JSON members:

  • current_page_name
  • title
  • body

body is just HTML data.

From body it extracts sections and domain_data.

I think that it would be interesting to generate these json files for each HTML file generated by pydoctor.

That way, when pydoctor runs on Read the Docs, the HTML pages it generates are indexed by RTD search.

adiroiban avatar Feb 04 '21 21:02 adiroiban

There is this Twisted ticket which asks for a global search - https://twistedmatrix.com/trac/ticket/1546

adiroiban avatar Feb 05 '21 01:02 adiroiban

@adiroiban

Looks easy

Are you sure the only requirement is to generate the files, and that nothing else is needed to tell Read the Docs to look for these files, in the api folder for instance?

I could add fjson file generation on demand with an option like --generate-readthedocs-index-files
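As a rough sketch of what such an opt-in flag could look like, assuming a plain argparse parser: the parser name `pydoctor-sketch` and the help text are made up for illustration, and this is not pydoctor's actual option handling.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser exposing the proposed opt-in flag."""
    parser = argparse.ArgumentParser(prog="pydoctor-sketch")
    parser.add_argument(
        # Flag name taken from the proposal above; everything else is assumed.
        "--generate-readthedocs-index-files",
        action="store_true",
        default=False,
        help="Also write a .fjson file next to each generated HTML page.",
    )
    return parser


# argparse turns the dashes into underscores for the attribute name.
args = build_parser().parse_args(["--generate-readthedocs-index-files"])
print(args.generate_readthedocs_index_files)
```

Keeping the flag off by default would leave existing builds unchanged unless a project asks for the extra files.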

tristanlatr avatar May 12 '21 14:05 tristanlatr

I am not sure that this is enough.

From a quick read of the Sphinx code, it looks like it searches for an 'fjson' file for each HTML page created by Sphinx.

There is a separate search method for MkDocs files, so the fjson files will only work as long as the RTD build is set up for Sphinx.

Since the search code is external to Sphinx, I expect that it will just do a search for any .html file found inside the sphinx build folder.

I guess that I/we can ask via https://gitter.im/readthedocs/community

adiroiban avatar Mar 05 '22 00:03 adiroiban

Did you ask, @adiroiban?

tristanlatr avatar Mar 24 '22 23:03 tristanlatr

Hi, developer from RTD here. Reading https://pydoctor.readthedocs.io/en/latest/sphinx-integration.html#building-pydoctor-together-with-sphinx-html-build, it looks like the extension just puts the generated HTML files in Sphinx's build directory?

For Sphinx we install a custom extension that hooks into the build process to generate the fjson files (historically we used to do a sphinx-build -b json call to generate those, but doing it in the html build step itself is faster).

https://github.com/readthedocs/readthedocs-sphinx-ext/blob/fdee1b636c73377dcb4cd4e0e54b5f3dcaded8bc/readthedocs_ext/readthedocs.py#L201-L201

So, you could do something similar in your extension and generate fjson files for the HTML files you generate. But note that this is an internal operation, so it isn't guaranteed that it won't break in the future.

Another approach would be to generate the search index directly from the HTML files (as we do for MkDocs) for Sphinx projects that make use of pydoctor:

https://github.com/readthedocs/readthedocs.org/blob/332bda682082c3f54bb285f873f72837f651ae06/readthedocs/search/parsers.py#L510-L512

But that's a decision that we would need to take if we want to support that.

In any case, I'd recommend checking https://dev.readthedocs.io/en/latest/search-integration.html to see if your HTML files are structured as our indexing code expects.
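To give a feel for what indexing straight from HTML involves, here is a toy sketch that splits a page into (heading, text) sections at h1-h6 tags, using only the standard library. It is not the RTD parser: the class name `SectionCollector` and the splitting rule are assumptions for illustration, and the real expectations are the ones described in the search integration page linked above.

```python
from html.parser import HTMLParser


class SectionCollector(HTMLParser):
    """Collect (heading, text) pairs, starting a new section at each h1-h6."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.sections = []          # finished (heading, text) tuples
        self._heading_parts = []    # text seen inside the current heading
        self._text_parts = []       # text seen after the current heading
        self._in_heading = False
        self._started = False       # True once the first heading was seen

    def _flush(self) -> None:
        # Store the section collected so far, with whitespace normalised.
        if self._started:
            heading = " ".join("".join(self._heading_parts).split())
            text = " ".join("".join(self._text_parts).split())
            self.sections.append((heading, text))
        self._heading_parts, self._text_parts = [], []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._flush()
            self._started = True
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self._heading_parts.append(data)
        elif self._started:
            self._text_parts.append(data)

    def close(self):
        super().close()   # lets the base class hand over any buffered text
        self._flush()
        self._started = False


collector = SectionCollector()
collector.feed("<h1>Title</h1><p>Hello <b>world</b></p><h2>Sub</h2><p>More</p>")
collector.close()
print(collector.sections)
```

Text that appears before the first heading is ignored in this sketch; a real indexer would have to decide what to do with it.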

stsewd avatar Mar 28 '22 19:03 stsewd

Thanks @stsewd

For pydoctor, we can generate the fjson file internally as part of the normal build. That is no problem and there is no need to call a separate -b json. I agree that this is faster.

The question was whether these fjson files need to be registered somehow, or whether it is enough to make sure they are present on the filesystem, and from there RTD will work its magic.

Thanks again and thanks for such a great project as RTD :)


@tristanlatr I think that we should just look at generating the fjson files and hope for the best :)

adiroiban avatar Mar 28 '22 20:03 adiroiban

Hello,

Thanks for your answer @stsewd, I'm glad that officially supporting pydoctor is a potential option for you. I'm willing to make the HTML pages comply with your standard; it requires only minor changes.

I think it could be hard (but not impossible) to integrate fjson generation on the RTD side. But first the pages must comply with the requirement, and I suppose the integration won't happen tomorrow.

So we could provide an option for the shorter (maybe infinite) term to generate these files.

I have a follow-up question @stsewd, if I may: what are the possible values and meanings of the JSON dict entries? Here's my guess, but correct me if I'm wrong. All values are strings.

  • 'body' -> HTML string (or should it be only the raw text of the body?)
  • 'title' -> string
  • 'sourcename' -> string, the file name
  • 'current_page_name' -> string; not sure what this means. How is it different from the title?
  • 'toc' -> HTML string; should it be a ul?
  • 'page_source_suffix' -> .md or .rst I guess; for us that would be .html or .py, I don't know...

Maybe you have a link to an fjson file that is published online? I've tried https://pydoctor.readthedocs.io/en/latest/quickstart.fjson, while the real URL is https://pydoctor.readthedocs.io/en/latest/quickstart.html, but it does not work :/ I must have missed something.

Thanks for your insights,

tristanlatr avatar Mar 29 '22 04:03 tristanlatr

This is an example of the fjson file from the index page from https://docs.readthedocs.io/en/stable/

index.zip

But for search you only need:

  • title: string
  • body: string of the html of the main content only (it doesn't start from <html>, but from <section>/<div> or a similar tag)
  • current_page_name: string, this is a Sphinx concept, basically the file name without the extension (index.html -> index), but I don't think we are using it for search anymore.
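The three members above could be written out for one page roughly as follows. This is a minimal sketch under stated assumptions: the helper names `build_fjson_payload` and `write_fjson` are hypothetical, the caller is assumed to already have the title and the main-content HTML at hand, and current_page_name is taken to be the bare file name without its extension.

```python
import json
from pathlib import Path


def build_fjson_payload(html_path: Path, title: str, body_html: str) -> dict:
    """Return the three members listed above as needed for search."""
    return {
        # Assumption: file name without the extension (index.html -> index).
        "current_page_name": html_path.stem,
        "title": title,
        # Main content only: starts at a <section>/<div>, not at <html>.
        "body": body_html,
    }


def write_fjson(html_path: Path, title: str, body_html: str) -> Path:
    """Write the payload next to the HTML file, swapping .html for .fjson."""
    target = html_path.with_suffix(".fjson")
    payload = build_fjson_payload(html_path, title, body_html)
    target.write_text(json.dumps(payload), encoding="utf-8")
    return target
```

Because only the last suffix is replaced, a dotted page name such as pydoctor.model.html would map to pydoctor.model.fjson.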

You can also generate almost the same files if you run sphinx-build -b json . _build/json

stsewd avatar Mar 29 '22 15:03 stsewd

Thanks for your answer @stsewd. If I understand correctly, these files are not served by your web servers, right? Is this why we can't simply get them over HTTP?

tristanlatr avatar Mar 29 '22 15:03 tristanlatr

> Thanks for your answer @stsewd. If I understand correctly, these files are not served by your web servers, right? Is this why we can't simply get them over HTTP?

That's right, we don't serve those; they are stored in another bucket in S3.

stsewd avatar Mar 29 '22 15:03 stsewd

Perfect,

I'll work on that when I have a moment. I'm busier than usual these days.

Thanks a lot @stsewd, I'll keep you posted when pydoctor HTML pages are complying with the RTD requirements.

tristanlatr avatar Mar 30 '22 01:03 tristanlatr