pg-aiguide

Make sure inline links are fully qualified URLs during scrape process

Open MasterOdin opened this issue 3 months ago • 0 comments

When scraping documentation pages from the web, we should make sure that any links are converted to their fully qualified versions, e.g. going from something like:

[migrate your entire database at once](/self-hosted/latest/migration/entire-database/)

to

[migrate your entire database at once](https://docs.tigerdata.com/self-hosted/latest/migration/entire-database/)

Right now the LLM likes to quote the returned markdown chunks, and relative links like the former end up showing as weird broken text, while fully qualified links like the latter do not. While we could maybe fix this via prompting as well, I think it's better to just eat the extra tokens in the embedding and make the chunks easier for the LLMs to use.

It'll probably be easier/better, though, to do this manipulation against the HTML source rather than after we convert it to markdown.
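
For illustration, here is a minimal sketch of what rewriting links at the HTML stage could look like. It is not taken from the scraper's code: the function name `absolutize_links` and the `page_url` parameter are hypothetical, and it assumes the scraper knows the URL each page was fetched from. It uses a deliberately simple regex to stay dependency-free; a real implementation would more likely use whatever HTML parser the scraper already has.

```python
import re
from urllib.parse import urljoin

# Hypothetical helper, not part of the existing scraper.
# Matches the href attribute of <a ...> tags, quoted with either " or '.
# This is a simplification: unquoted attribute values are not handled.
_HREF_RE = re.compile(
    r"""(<a\b[^>]*?\shref\s*=\s*)(["'])(.*?)\2""",
    re.IGNORECASE | re.DOTALL,
)


def absolutize_links(html: str, page_url: str) -> str:
    """Return html with every <a href> resolved against page_url."""

    def _replace(match: re.Match) -> str:
        prefix, quote, href = match.groups()
        # urljoin leaves already-absolute URLs (https:, mailto:, ...) unchanged
        # and resolves root-relative or relative paths against page_url.
        return f"{prefix}{quote}{urljoin(page_url, href)}{quote}"

    return _HREF_RE.sub(_replace, html)


if __name__ == "__main__":
    # Assumed page URL; only the docs.tigerdata.com host comes from the issue.
    page_url = "https://docs.tigerdata.com/self-hosted/latest/"
    source = (
        '<a href="/self-hosted/latest/migration/entire-database/">'
        "migrate your entire database at once</a>"
    )
    print(absolutize_links(source, page_url))
```

Running this before the HTML-to-markdown conversion means the converter only ever sees absolute URLs, so the markdown chunks need no separate link handling.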

MasterOdin • Sep 20 '25 01:09