Use Jina AI Reader for indexing docs
Validations
- [X] I believe this is a way to improve. I'll try to join the Continue Discord for questions
- [X] I'm not able to find an open issue that requests the same enhancement
Problem
The current method of parsing website contents via JSDOM is error-prone and not optimal for LLMs. It fails to parse sites such as https://angular.io/docs/.
Solution
Use the Reader service by Jina AI to turn websites into LLM-friendly markdown before indexing them. This works by prefixing the site URL with https://r.jina.ai/, e.g. https://r.jina.ai/https://angular.io/docs/.
The feature would be a simple string prefix as in https://github.com/continuedev/continue/commit/850458ba833eeb12bf18c62dfeca2489737750c5.
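A minimal sketch of what that prefixing could look like, assuming a helper named `toReaderUrl` (hypothetical, not taken from the linked commit). The only value taken from this issue is the `https://r.jina.ai/` prefix; the scheme check and the guard against double-prefixing are extra assumptions.

```typescript
// Prefix from the issue description; everything else here is illustrative.
const JINA_READER_PREFIX = "https://r.jina.ai/";

// Hypothetical helper: turns a docs page URL into its Reader URL.
function toReaderUrl(pageUrl: string): string {
  // Leave URLs alone that already point at the Reader service.
  if (pageUrl.startsWith(JINA_READER_PREFIX)) {
    return pageUrl;
  }
  // Only http(s) pages make sense to index this way.
  if (!/^https?:\/\//i.test(pageUrl)) {
    throw new Error(`Unsupported URL for indexing: ${pageUrl}`);
  }
  return JINA_READER_PREFIX + pageUrl;
}
```

The indexer would then request `toReaderUrl(url)` instead of `url` and treat the response as markdown rather than HTML.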
@frederikschubert This is a great find! The only thing we'll have to think about is rate limiting. The 20 RPM limit probably won't be enough to index most sites. I'll have to check out the pricing on this, and at the very least we can use it for our pre-indexed docs.
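To make the rate-limit concern concrete: at 20 requests per minute a crawler can send at most one request every 3 seconds, so a docs site with 1,000 pages (an illustrative number, not a measurement) would need at least 50 minutes. A rough sketch of evenly spaced fetching, where `fetchAllThrottled` and its parameters are hypothetical names:

```typescript
// Minimum spacing between requests that stays under a requests-per-minute cap.
function minIntervalMs(requestsPerMinute: number): number {
  if (requestsPerMinute <= 0) {
    throw new RangeError("requestsPerMinute must be positive");
  }
  return Math.ceil(60000 / requestsPerMinute);
}

// Lower bound on crawl time in minutes, ignoring the time each request takes.
function estimateCrawlMinutes(pageCount: number, requestsPerMinute: number): number {
  return (pageCount * minIntervalMs(requestsPerMinute)) / 60000;
}

// Fetches URLs one at a time, pausing between requests.
// `fetchOne` and `sleep` are injected so the caller decides how pages are loaded.
async function fetchAllThrottled<T>(
  urls: string[],
  requestsPerMinute: number,
  fetchOne: (url: string) => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T[]> {
  const interval = minIntervalMs(requestsPerMinute);
  const results: T[] = [];
  let first = true;
  for (const url of urls) {
    if (!first) {
      await sleep(interval); // wait before every request except the first
    }
    first = false;
    results.push(await fetchOne(url));
  }
  return results;
}
```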
Yes, that is true. They are essentially using turndown, which could be integrated into the Continue plugin.
So, after further investigation, I came up with a minimal reproduction of their approach:
0. (optional) Use playwright or another browser to access dynamic web pages.
1. Inject the readability script into each page.
2. Use turndown to convert the readability output into readable markdown.
3. (optional) Use prettier or another formatter to clean up the markdown.
In local tests this works quite well for all crawled websites.
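The steps above can be sketched as a small pipeline. This is not the code used in the local tests. Every type and function name below is hypothetical, and each stage is passed in as a plain function so the sketch does not depend on the exact API of playwright, readability, turndown or prettier.

```typescript
// Result of the readability step: a title plus cleaned-up article HTML.
interface Article {
  title: string;
  contentHtml: string;
}

// Hypothetical stage signatures; real libraries would be wrapped to fit them.
interface PipelineStages {
  load: (url: string) => Promise<string>; // step 0: browser or plain HTTP
  extract: (html: string, url: string) => Article | null; // step 1: readability
  toMarkdown: (html: string) => string; // step 2: turndown
  format?: (markdown: string) => string | Promise<string>; // step 3: prettier
}

// Synchronous core: readable article HTML in, markdown document out.
function articleToMarkdown(
  html: string,
  url: string,
  stages: Pick<PipelineStages, "extract" | "toMarkdown">,
): string | null {
  const article = stages.extract(html, url);
  if (article === null) {
    return null; // nothing readable found on this page
  }
  // Use the page title as the top-level heading, as in the example output.
  return `# ${article.title}\n\n${stages.toMarkdown(article.contentHtml)}`;
}

// Full pipeline: load the page, convert it, optionally format the result.
async function pageToMarkdown(
  url: string,
  stages: PipelineStages,
): Promise<string | null> {
  const html = await stages.load(url);
  const markdown = articleToMarkdown(html, url, stages);
  if (markdown === null || !stages.format) {
    return markdown;
  }
  return await stages.format(markdown);
}
```

Keeping the stages injectable would also let the optional steps be skipped, or moved to a different place than the conversion, without touching the rest.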
Example
Angular documentation https://angular.dev/guide/pipes/template#chaining-pipes
This approach
# Using a pipe in a template • Angular
To apply a pipe, use the pipe operator (`|`) within a template expression as shown in the following code example.
`<p>The hero's birthday is {{ birthday | date }}</p> `
The component's `birthday` value flows through the pipe operator (`|`) to the [`DatePipe`](https://angular.dev/api/common/DatePipe) whose pipe name is `date`. The pipe renders the date in the default format like **Apr 07, 2023**.
app.component.ts
`import { Component } from '@angular/core'; import { DatePipe } from '@angular/common'; @Component({ standalone: true, templateUrl: './app.component.html', imports: [DatePipe], }) export class AppComponent { birthday = new Date(); } `
## [Additional parameters for pipes](https://angular.dev/guide/pipes/template#additional-parameters-for-pipes)
Pipes can take additional parameters that configure the transformation. Parameters can be optional or required.
For example, the `date` pipe takes optional parameters that control the date's display format. To specify the parameter, follow the pipe name with a colon (`:`) and the parameter value (the format).
`<p>The hero's birthday is in {{ birthday | date:'yyyy' }}</p> `
Pipes can also take multiple parameters. You can pass multiple parameters by separating these via colons (`:`). For example, the `date` pipe accepts a second optional parameter for controlling the timezone.
`<p>The current time is: {{ currentTime | date:'hh:mm':'UTC' }}</p> `
This will display the current time (like `10:53`) in the `UTC` timezone.
## [Chaining pipes](https://angular.dev/guide/pipes/template#chaining-pipes)
You can connect multiple pipes so that the output of one pipe becomes the input to the next.
The following example passes a date to the `DatePipe` and then forwards the result to the [`UpperCasePipe`](https://angular.dev/api/common/UpperCasePipe "API") pipe.
`<p>The hero's birthday is {{ birthday | date }}</p> <p>The hero's birthday is in {{ birthday | date:'yyyy' | uppercase }}</p> `
Reader
Title: Angular
URL Source: https://angular.dev/guide/pipes/template
Markdown Content:
To apply a pipe, use the pipe operator (`|`) within a template expression as shown in the following code example.
`<p>The hero's birthday is {{ birthday | date }}</p> `
The component's `birthday` value flows through the pipe operator (`|`) to the [`DatePipe`](https://angular.dev/api/common/DatePipe) whose pipe name is `date`. The pipe renders the date in the default format like **Apr 07, 2023**.
`import { Component } from '@angular/core'; import { DatePipe } from '@angular/common'; @Component({ standalone: true, templateUrl: './app.component.html', imports: [DatePipe], }) export class AppComponent { birthday = new Date(); } `
[Additional parameters for pipes](https://angular.dev/#additional-parameters-for-pipes)
---------------------------------------------------------------------------------------
Pipes can take additional parameters that configure the transformation. Parameters can be optional or required.
For example, the `date` pipe takes optional parameters that control the date's display format. To specify the parameter, follow the pipe name with a colon (`:`) and the parameter value (the format).
`<p>The hero's birthday is in {{ birthday | date:'yyyy' }}</p> `
Pipes can also take multiple parameters. You can pass multiple parameters by separating these via colons (`:`). For example, the `date` pipe accepts a second optional parameter for controlling the timezone.
`<p>The current time is: {{ currentTime | date:'hh:mm':'UTC' }}</p> `
This will display the current time (like `10:53`) in the `UTC` timezone.
[Chaining pipes](https://angular.dev/#chaining-pipes)
-----------------------------------------------------
You can connect multiple pipes so that the output of one pipe becomes the input to the next.
The following example passes a date to the `DatePipe` and then forwards the result to the [`UpperCasePipe`](https://angular.dev/api/common/UpperCasePipe "API") pipe.
`<p>The hero's birthday is {{ birthday | date }}</p> <p>The hero's birthday is in {{ birthday | date:'yyyy' | uppercase }}</p> `
@sestinj What do you think about adding playwright for this use-case as an optional dependency?
@frederikschubert This is great! Readability is what we are currently using, and I like the other ideas in the following steps to clean it up.
Playwright is going to take up a lot of space to ship alongside, though (assuming we would ship Chrome). But one thing I'm thinking is that we could just host a service that does the scraping, especially because this isn't a sensitive operation in most cases.
We could still use the cleanup logic on the client side.