New Docusaurus plugin: `docusaurus-plugin-llms-txt`
Have you read the Contributing Guidelines on issues?
- [X] I have read the Contributing Guidelines on issues.
Description
Hello!
I'd like to propose a new Docusaurus plugin that generates an llms.txt file.
As background, an llms.txt file co-locates the information an LLM needs so that it can process a site's pages quickly and effectively. More information can be found on the llms.txt proposal website.
The goal of this feature is to create a plugin that will generate this file for your docs.
Has this been requested on Canny?
No
Motivation
Having an llms.txt is a business need for our company, but I'm certain it would be helpful for others. I have already created a proof-of-concept plugin that writes the file, but I wanted to bring this to the community as a whole to see how the design could be improved and/or match others' use cases.
Additionally, many docs sites that I use and admire, such as dub.co and Cloudflare have already implemented llms.txt files. The former via Mintlify, the latter via a script.
Adding a new, optional plugin would give Docusaurus users the ability to opt-in to this new standard and allow their documentation to be accessed easily by LLM agents.
API design
The design is very similar to the sitemap plugin.
Configuration
- ignorePatterns: ability to ignore specific paths from being included in the llms.txt.
- filename: ability to change the name and output path of the llms.txt file.
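To make the two options concrete, here is a minimal sketch of how path filtering could behave. The `LlmsTxtOptions` shape, `globToRegExp`, and `filterPaths` are hypothetical names for illustration, not the API of the proof-of-concept or of the sitemap plugin, and the tiny glob matcher stands in for a real glob library.

```typescript
// Hypothetical option shape mirroring the two options described above.
interface LlmsTxtOptions {
  ignorePatterns: string[]; // glob-like patterns for paths to exclude
  filename: string; // output file name, relative to the build directory
}

// Convert a simple glob to a RegExp: "*" matches within one path segment,
// "**" matches across segments. A deliberately small stand-in for a glob library.
function globToRegExp(glob: string): RegExp {
  const source = glob
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except "*"
    .replace(/\*\*/g, "\u0000") // protect "**" before handling "*"
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, ".*");
  return new RegExp(`^${source}$`);
}

// Keep only the paths that match none of the ignore patterns.
function filterPaths(paths: string[], options: LlmsTxtOptions): string[] {
  const matchers = options.ignorePatterns.map(globToRegExp);
  return paths.filter((p) => !matchers.some((re) => re.test(p)));
}

const exampleOptions: LlmsTxtOptions = {
  ignorePatterns: ["/tags/**", "/search"],
  filename: "llms.txt",
};

const keptPaths = filterPaths(
  ["/docs/intro", "/tags/api", "/search", "/blog/hello"],
  exampleOptions,
);
```

With these example inputs, `keptPaths` contains only `/docs/intro` and `/blog/hello`.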
I also considered configuration modifying the output of the file, but the llms.txt proposal is fairly set on the file structure so I declined to add that as a configuration option.
Have you tried building it?
Yes! I made a simple plugin that works for our use case. I would like to move it into this repo in order to make it available to other folks and to get feedback on the design. I feel it could be made much simpler and cleaner.
Self-service
- [X] I'd be willing to contribute this feature to Docusaurus myself.
We usually add to the Docusaurus core repo what we use on our own website. In this case it's not really a need we have, and adding this to core would put the burden on us to maintain over the long term. Many people want to add their plugin to core but we can't do that otherwise we'd end up with too many packages to maintain. I'd prefer to keep a lean core and keep this plugin in the community.
Also, I'm really not sure what this plugin adds to the regular sitemap.xml file.
I've checked your POC (https://github.com/prisma/docs/pull/6645), and as far as I understand it basically creates a list of links in Markdown format, but using a .txt extension instead of Markdown (why not use llms.md?)
Result: https://jharrell-llms-plugin.docs-51g.pages.dev/llms.txt
If we look at the Cloudflare example, it's also a list of links in Markdown format. https://developers.cloudflare.com/llms.txt
I'm not sure how the h2 headings and occasional labels bring a lot of value to a regular sitemap.
Also, your site recommends linking to html.md files, which you don't do here, so the LLM would have to read an HTML page anyway. And we use MDX, which means that this format is already less "LLM-friendly" than regular Markdown files, which cannot contain React components that must be evaluated.
This site is linking to Markdown files, for example: https://docs.fastht.ml/llms.txt
This field is not my expertise but I don't find this proposal super convincing, it looks more like an early draft not covering edge cases than a spec. I don't think it should be added to Docusaurus core until the proposal gets more mainstream adoption and gets better documented. I'd also prefer if there was first a community implementation that satisfies early adopters. We can consider adding this to our repo later, but not as a first step. Until then I'm happy to support you developing a community plugin, so let me know if you are unable to achieve your needs.
I also wonder why it has to be a Docusaurus plugin. You could build a generic CLI tool that handles a full generic static deployment, looking for sitemap.xml or HTML files and creating LLM-friendly content.
Thank you @slorber for the context. I don't disagree with your assessment, so I'll be closing this proposal.
For further information about why this over a sitemap etc there's this section of the proposed standard: https://llmstxt.org/#existing-standards
Thank you for the feedback, though. I'll continue to use my simplified solution.
@jharrell could you share your "simplified solution", this does sound useful :)
@DenhamPreen sure. We're still iterating but here's the latest: https://github.com/prisma/docs/blob/22208d52e4168028dbbe8b020b10682e6b526e50/docusaurus.config.ts#L95
It generates an llms-full.txt and llms.txt.
Happy to chat further about it if I can help, my bsky account is in my GitHub profile.
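For readers unfamiliar with the two files, the difference can be sketched roughly as follows: llms.txt is an index of Markdown links, while llms-full.txt combines the page contents into one document. This is not the Prisma implementation; the `Page` type, both function names, and the `## Docs` section heading are assumptions made for illustration.

```typescript
// Hypothetical page record; a real plugin would fill this from route metadata.
interface Page {
  title: string;
  url: string;
  description?: string;
  markdown: string; // page body as plain Markdown
}

// llms.txt: an H1 title, a blockquote summary, then a Markdown list of links.
function renderLlmsTxt(siteTitle: string, summary: string, pages: Page[]): string {
  const links = pages.map((p) =>
    p.description
      ? `- [${p.title}](${p.url}): ${p.description}`
      : `- [${p.title}](${p.url})`,
  );
  return [`# ${siteTitle}`, "", `> ${summary}`, "", "## Docs", "", ...links, ""].join("\n");
}

// llms-full.txt: the page bodies themselves, concatenated under one title.
function renderLlmsFullTxt(siteTitle: string, pages: Page[]): string {
  const bodies = pages.map((p) => `## ${p.title}\n\n${p.markdown.trim()}\n`);
  return [`# ${siteTitle}`, "", ...bodies].join("\n");
}

const samplePages: Page[] = [
  { title: "Intro", url: "/docs/intro.md", description: "Getting started", markdown: "Install it.\n" },
  { title: "API", url: "/docs/api.md", markdown: "Call it.\n" },
];

const llmsTxt = renderLlmsTxt("My Site", "Docs for My Site.", samplePages);
const llmsFullTxt = renderLlmsFullTxt("My Site", samplePages);
```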
Considering the number of issues linking back to this one, I think we should consider implementing such a plugin in Docusaurus core, even though I'm still not super convinced by the usefulness of this file.
See some ideas on how this can be implemented here: https://github.com/facebook/docusaurus/discussions/11191#discussioncomment-13244061
Note: many have copied the Prisma implementation, which works, but IMHO it's not ideal. We only generate llms.txt in prod builds, so it's preferable to read the dirs and md source files there instead of inside loadContent() (which is more useful in dev to support hot reload).
Plugin design?
In my opinion, the problems we have when generating such llms file are:
- which routes/source files should be listed
- how to group routes together under Markdown headings of various levels
- how do we order the headings
- how do we order the links within each group
We already exposed some metadata in the postBuild({routes, routesBuildMetadata}) attributes.
    export default function pluginLlms(
      context: LoadContext,
      options: PluginOptions,
    ): Plugin<void> | null {
      return {
        name: 'docusaurus-plugin-llms',
        async postBuild({routes, routesBuildMetadata}) {
          const finalRoutes = flattenRoutes(routes);
          finalRoutes.forEach((route) => {
            if (route.metadata?.sourceFilePath) {
              console.log(
                `Route ${route.path} was created from markdown file ${route.metadata?.sourceFilePath}`,
              );
            }
          });
        },
      };
    }
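As a rough illustration of what the `flattenRoutes` call above is doing, the sketch below flattens a nested route tree down to its leaf routes and collects the source file for each. `RouteLike`, `flattenLeafRoutes`, and `sourceFilesByPath` are simplified stand-ins written for this example, not the Docusaurus types or the actual helper.

```typescript
// Simplified stand-in for a Docusaurus route config (assumption, not the real type).
interface RouteLike {
  path: string;
  routes?: RouteLike[];
  metadata?: { sourceFilePath?: string };
}

// Recursively keep only leaf routes, i.e. routes without child routes.
function flattenLeafRoutes(routes: RouteLike[]): RouteLike[] {
  const leaves: RouteLike[] = [];
  for (const route of routes) {
    if (route.routes && route.routes.length > 0) {
      leaves.push(...flattenLeafRoutes(route.routes));
    } else {
      leaves.push(route);
    }
  }
  return leaves;
}

// Map each leaf route path to the Markdown file it was created from, if known.
function sourceFilesByPath(routes: RouteLike[]): Record<string, string> {
  const result: Record<string, string> = {};
  for (const route of flattenLeafRoutes(routes)) {
    const source = route.metadata?.sourceFilePath;
    if (source) {
      result[route.path] = source;
    }
  }
  return result;
}

const sampleRoutes: RouteLike[] = [
  {
    path: "/docs",
    routes: [
      { path: "/docs/intro", metadata: { sourceFilePath: "docs/intro.md" } },
      { path: "/docs/api", routes: [{ path: "/docs/api/auth" }] },
    ],
  },
  { path: "/blog", metadata: { sourceFilePath: "blog/index.md" } },
];

const leafPaths = flattenLeafRoutes(sampleRoutes).map((r) => r.path);
const sources = sourceFilesByPath(sampleRoutes);
```

Routes without a `sourceFilePath` (here `/docs/api/auth`) are simply skipped, which is one way to decide which routes get listed.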
We could easily enrich these with extra metadata that could be useful to generate the llms files (title, breadcrumb, other metadata).
Now, I'm not sure what kind of API the plugin could expose to define how Markdown files are grouped, and how we define an explicit order (does it even matter to LLMs?).
I guess we could provide a callback so that you can define the Markdown breadcrumb for each file, but that wouldn't allow you to order things:
    ['llms-plugin', {
      getMarkdownBreadcrumb: ({title, sourceFilePath}) => {
        if (isBlog(sourceFilePath)) {
          return ["Blog"];
        }
        if (isiOSDocs(sourceFilePath)) {
          return ["Docs", "iOS"];
        }
        if (isAndroidDocs(sourceFilePath)) {
          return ["Docs", "Android"];
        }
        return null;
      }
    }]
Does it make sense?
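To show how such a callback could drive the output, here is a minimal sketch that groups pages into a tree by breadcrumb and renders nested Markdown headings. Everything except the `getMarkdownBreadcrumb` idea is assumed for illustration (`Doc`, `TreeNode`, `buildTree`, `renderNode`, and the path checks). It also demonstrates the ordering limitation: groups and links appear in first-seen order only.

```typescript
interface Doc {
  title: string;
  url: string;
  sourceFilePath: string;
}

type GetBreadcrumb = (doc: Doc) => string[] | null;

interface TreeNode {
  name: string;
  docs: Doc[];
  children: TreeNode[];
}

// Place each doc under the node addressed by its breadcrumb; null excludes it.
function buildTree(docs: Doc[], getBreadcrumb: GetBreadcrumb): TreeNode {
  const root: TreeNode = { name: "", docs: [], children: [] };
  for (const doc of docs) {
    const breadcrumb = getBreadcrumb(doc);
    if (breadcrumb === null) continue;
    let node = root;
    for (const name of breadcrumb) {
      let child = node.children.find((c) => c.name === name);
      if (!child) {
        child = { name, docs: [], children: [] };
        node.children.push(child);
      }
      node = child;
    }
    node.docs.push(doc);
  }
  return root;
}

// Render headings starting at H2 (H1 is left for the site title).
function renderNode(node: TreeNode, depth: number): string[] {
  const lines: string[] = [];
  if (depth > 0) lines.push(`${"#".repeat(depth + 1)} ${node.name}`, "");
  for (const doc of node.docs) lines.push(`- [${doc.title}](${doc.url})`);
  if (node.docs.length > 0) lines.push("");
  for (const child of node.children) lines.push(...renderNode(child, depth + 1));
  return lines;
}

// Example callback using hypothetical path conventions.
const getMarkdownBreadcrumb: GetBreadcrumb = ({ sourceFilePath }) => {
  if (sourceFilePath.startsWith("blog/")) return ["Blog"];
  if (sourceFilePath.startsWith("docs/ios/")) return ["Docs", "iOS"];
  if (sourceFilePath.startsWith("docs/android/")) return ["Docs", "Android"];
  return null;
};

const sampleDocs: Doc[] = [
  { title: "iOS setup", url: "/docs/ios/setup", sourceFilePath: "docs/ios/setup.md" },
  { title: "Hello", url: "/blog/hello", sourceFilePath: "blog/hello.md" },
  { title: "Android setup", url: "/docs/android/setup", sourceFilePath: "docs/android/setup.md" },
  { title: "Landing", url: "/", sourceFilePath: "src/pages/index.mdx" },
];

const grouped = renderNode(buildTree(sampleDocs, getMarkdownBreadcrumb), 0).join("\n");
```

Here "Docs" is rendered before "Blog" only because an iOS page happened to come first in the input, which is why an explicit ordering mechanism may still be needed.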
It could be useful if the community provided concrete examples. Given a specific sample site, what kind of LLMS file do you expect it to output, and why?
The more diverse examples we have, the easier it becomes to design an API that suits all needs.
I also created a plugin: https://github.com/rachfop/docusaurus-plugin-llms
I also started working on our own version of the plugin: https://github.com/signalwire/docs/pull/290
I tried to mimic the behavior 1:1 with stripe where it also generates a .md version of every doc in the same route. So most routes you can append just .md and get the raw version of the doc.
I also opted to do a postBuild action and converted the rendered HTML back to markdown. My decision to do this was because I didn't want to deal with components and partials in the initial content.
I also added some built-in rehype plugins to handle known edge cases (like lists in tables) and to update links to match the URL options set in the config (relative path, full URL, markdown generated doc).
I tried to make it as flexible as possible to work with any current Docusaurus website via the contentSelectors property.
Example preview can be seen here: https://deploy-preview-290--signalwire-docs.netlify.app/llms.txt
Because it's not pushed to our main site yet (and I haven't implemented URL validation logic yet), don't use the full URLs, just the relative portion (e.g. don't use https://example/com/ai.md, use /ai.md).
I think an official LLMs.txt plugin would be great. Here's how I believe it should work:
Proposed Build Process
1. llms.txt Generation
- Scan all content directories (/docs, /blog, etc.) for .md files
- Generate a hierarchical tree structure based on file paths and frontmatter metadata (see Vercel llms.txt)
- Use the frontmatter description field or a new llms_description metadata attribute for page descriptions
- The llms.txt files should be located at the root of the content. So /docs/llms.txt, /blog/llms.txt
2. Raw Markdown File Generation
The llms.txt standard recommends providing an .md for every page, which the main llms.txt can link to.
- All .md pages need a cleaned markdown file with frontmatter and jsx stripped at {page-url}.md
- Example: /docs/integrations/react → /docs/integrations/react.md
- Maintain internal links but convert them to reference other .md files (if that is even possible)
3. llms-full.txt Generation
- With all .md files from 2. combined create a llms-full.txt file also at the root of the content. So /docs/llms.txt, /blog/llms.txt
I would be happy to give it a shot.
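A minimal sketch of step 2 of the proposal above, assuming plain string processing is good enough for a first pass. The function names and the `/index.md` mapping for the site root are assumptions, and JSX stripping is intentionally not handled here since it generally needs a real MDX parser.

```typescript
// "/docs/integrations/react" -> "/docs/integrations/react.md"
// Mapping "/" to "/index.md" is an assumption, not part of the proposal.
function toMarkdownUrl(route: string): string {
  const trimmed = route.length > 1 ? route.replace(/\/+$/, "") : route;
  return trimmed === "/" ? "/index.md" : `${trimmed}.md`;
}

// Remove a leading YAML frontmatter block delimited by "---" lines.
function stripFrontmatter(source: string): string {
  return source.replace(/^---\r?\n[\s\S]*?\r?\n---\r?\n?/, "");
}

// Point site-internal links (absolute paths) at the generated .md files.
// Links that already end in a file extension (images, PDFs) and external
// links are left untouched; "#hash" fragments are preserved.
function rewriteInternalLinks(markdown: string): string {
  return markdown.replace(
    /\]\((\/[^)\s#]*)(#[^)\s]*)?\)/g,
    (match: string, path: string, hash?: string) =>
      /\.[a-z0-9]+$/i.test(path) ? match : `](${toMarkdownUrl(path)}${hash || ""})`,
  );
}

const rawPage =
  "---\ntitle: React\n---\n# React\n\nSee [setup](/docs/integrations/react#setup), " +
  "![logo](/img/logo.png) and [the site](https://example.com).\n";

const cleanedPage = rewriteInternalLinks(stripFrontmatter(rawPage));
```

In this example only the `/docs/integrations/react#setup` link is rewritten, to `/docs/integrations/react.md#setup`; the image and the external link stay as they were.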
You may want to take a look at the plugin we just created, as it does a lot of what you're looking for (link conversion, md file generation, ability to pass your own remark & rehype plugins to alter the generation, etc.)
The big difference for our approach is we work with the routes that are passed to us during the postBuild cycle. We then use unified to help find the html file and convert it back to markdown. My decision to convert the rendered HTML back to markdown is to handle MDX partials and React components. Content may not be displayed properly since they are not rendered yet.
I think the only things you are looking for that we don't currently have are:
- An llms-full.txt generation option
- The ability to overwrite the description for a page (route).
With how the code is made, both of these options would be easy to add.
Feel free to open any feature request or issues at: https://github.com/signalwire/docusaurus-plugins
@Devon-White your plugin works great! Thanks for sharing. Implemented here https://github.com/cedarjs/cedar/pull/118/files Exposed at this url https://cedarjs.com/llms.txt
Joining the party as well and sharing this in case it's useful to anyone: I also followed the path of creating a custom plugin to generate /llms.txt and /llms-full.txt, tailored to my needs for Juno.
Note: I'm not an AI expert at all, more a real noob.
I may have taken a different approach by actually retro-engineering the Markdown files from the generated HTML files - i.e., once the site is generated, I filter the files to build a tree of the information I consider important for the language model, and then generate a Markdown file for each of those HTML files using Turndown. In addition, I also use JSDOM to extract the title and description for each of those links.
Worth noting: when I generate the Markdown files, I also manipulate the links to construct a navigation structure related to those files - i.e., Markdown files link to other Markdown files, not to the original HTML.
The solution definitely needs more iteration, and it's not my most brilliant work - I left performance considerations aside and haven't written any tests yet - but it's a start.
I scoped all the logic and functions into a single module.
You can find the plugin here: https://github.com/junobuild/docs/blob/main/plugins/docusaurus.llms.plugin.ts
Interesting blog post from a Sentry engineer implementing llms.txt: https://byk.im/posts/marking-it-up-and-down/
Since we cannot go directly from MDX to Markdown, we had to render the HTML from MDX first and then convert it to Markdown, essentially doubling the work.
So, in addition to the custom plugin I shared above, I've added a CI job that snapshots the generated llms.txt files and commits any changes with the PR. It slows things down a bit, but this way I can review the diff and make sure the LLM files still look good. No formal tests, but some sort of assertions. Long story short: if that could be useful to anyone in addition to the plugin, here you go: https://github.com/junobuild/docs/blob/main/.github/workflows/llms.yml
where did we end up with this?
Additionally, for the plugin I created above, I have created a theme that adds a Copy Page button with a dropdown for asking an LLM provider's cloud chat about the page contents directly. This mimics behavior we are beginning to see from big documentation providers like Mintlify.
The copyButton will render on any page that has a MD file generated for it.
To accompany this, I also pushed a major upgrade for the plugin, which includes an overhaul of the config interface. For the most part it's just a reorganization, but I also introduced Sections as a feature. Sections allow you to specify exactly which docs are housed under a section. You will also have the ability to add descriptions, title names, and subsections in a section object.
I also added the ability to add attachedFiles, which will parse the contents of a file and create a markdown file with its contents inside of it. This is helpful for items that are useful for LLMs but not handy as a doc (an example would be an OpenAPI spec).
Right now these additions are on the alpha tagged release, in case feedback is provided and changes are needed. Once it has been in a comfortable state for a while, I will do the official release.
The alpha packages however can be found here:
Plugin: https://www.npmjs.com/package/@signalwire/docusaurus-plugin-llms-txt/v/2.0.0-alpha.2
Theme: https://www.npmjs.com/package/@signalwire/docusaurus-theme-llms-txt/v/1.0.0-alpha.3
@Devon-White Thanks for the plugin! It works like a charm: https://github.com/mlflow/mlflow/pull/18676