
Restore the useLocalCrawling & maxDepth settings for indexed documents

Open vincentkelleher opened this issue 5 months ago • 5 comments

Description

This re-introduces the useLocalCrawling & maxDepth configuration parameters for document indexing, as they have been ignored since the JSON-to-YAML configuration migration.
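
For context, a minimal sketch of how the restored parameters might look on a `docs` entry in `config.yaml` (the `name`/`startUrl` fields already exist; the values below are purely illustrative):

```yaml
docs:
  - name: Example Docs
    startUrl: https://docs.example.com/
    # Restored by this PR: crawl with the built-in local crawler
    useLocalCrawling: true
    # Restored by this PR: only follow links up to 1 hop from startUrl
    maxDepth: 1
```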

Checklist

  • [x] I've read the contributing guide
  • [x] The relevant docs, if any, have been updated or created
  • [x] The relevant tests, if any, have been updated or created

Tests

DocsService tests are currently skipped and commented out :point_right: https://github.com/continuedev/continue/blob/bbb81ff032608e03a2208be908c1394da228ad6a/core/indexing/docs/DocsService.skip.ts

vincentkelleher avatar Jun 03 '25 09:06 vincentkelleher

Your cubic subscription is currently inactive. Please reactivate your subscription to receive AI reviews and use cubic.

cubic-dev-ai[bot] avatar Jun 03 '25 09:06 cubic-dev-ai[bot]

Deploy Preview for continuedev ready!

Latest commit: 16b80d9117acd481e3b6ad531bb8649611f1bb77
Latest deploy log: https://app.netlify.com/projects/continuedev/deploys/68516b828b9d6c00083ce358
Deploy Preview: https://deploy-preview-5958--continuedev.netlify.app

netlify[bot] avatar Jun 03 '25 09:06 netlify[bot]

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

github-actions[bot] avatar Jun 03 '25 09:06 github-actions[bot]

I have read the CLA Document and I hereby sign the CLA

vincentkelleher avatar Jun 03 '25 09:06 vincentkelleher

😱 Found 1 issue. Time to roll up your sleeves! 😱

recurseml[bot] avatar Jun 13 '25 20:06 recurseml[bot]

All tests are green :tada:

So recurseml, is that enough sleeve rolling for you? :smirk: :rofl:

vincentkelleher avatar Jun 18 '25 06:06 vincentkelleher

Bump

Could someone do a quick review of this PR? :innocent: It's very simple and is highly needed by my team :blush:

vincentkelleher avatar Jun 19 '25 07:06 vincentkelleher

This PR is now one month old; could anyone check it and merge it? :cry:

vincentkelleher avatar Jul 03 '25 14:07 vincentkelleher

@vincentkelleher do you need the maxDepth param specifically? We want to merge this with useLocalCrawling but maybe deprecate maxDepth in favor of an allowList/blockList pattern

Apologies for the delays!

RomneyDa avatar Jul 10 '25 01:07 RomneyDa

@RomneyDa I was just aware of maxDepth because it was there historically.

I imagine allowList/blockList would be lists of regexes?

Thanks for the feedback :blush:

vincentkelleher avatar Jul 10 '25 07:07 vincentkelleher

Got it! So would it solve your issue if I merged and then removed maxDepth and kept useLocalCrawling?

(Or if you'd like to)

Yes, I think glob patterns for allow/block

RomneyDa avatar Jul 10 '25 08:07 RomneyDa

@RomneyDa I have the feeling that maxDepth requires less thinking and is safer, since with globs you won't know in advance how many pages each pattern will index, don't you think?

vincentkelleher avatar Jul 10 '25 11:07 vincentkelleher

We're also thinking about adding a maxPages to give a more direct limit, but I think people generally want to index all docs that match a pattern (perhaps with a hard limit). maxDepth doesn't create any hard limit; it could yield tens of thousands of pages in somewhat edge-case scenarios.

RomneyDa avatar Jul 11 '25 00:07 RomneyDa

Would maxPages and useLocalCrawling be sufficient? The other issue with maxDepth is that it's not super clear how it works, i.e. as a dev I can't keep a 3-link-deep map of the docs pages I want in my head.
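
For illustration only, a rough sketch of what that shape could look like; `maxPages`, `allowList`, and `blockList` are the names floated in this thread, not an existing API:

```yaml
docs:
  - name: Example Docs
    startUrl: https://docs.example.com/
    useLocalCrawling: true
    # Proposed hard limit on the total number of pages indexed
    maxPages: 500
    # Proposed glob patterns deciding which URLs get crawled
    allowList:
      - "https://docs.example.com/reference/**"
    blockList:
      - "https://docs.example.com/blog/**"
```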

RomneyDa avatar Jul 11 '25 00:07 RomneyDa

It's true that there are clearly two types of limits:

  • hard limits with maxPages
  • soft limits with maxDepth, allowList and blockList

A max depth of 1 usually seems like a reasonable case, as you want everything directly linked from the subject page; going beyond 1 or 2 would, in most cases, end up indexing the whole website IMHO. I also think that an allow or block list would be about the same as, if not worse than, a max depth over 1, since you would have to know the sitemap in detail.

Having a maximum number of pages would be a good guardrail against using too many hardware resources; that seems like a good feature :+1:

I would go for useLocalCrawling, maxDepth and maxPages :innocent:

vincentkelleher avatar Jul 11 '25 07:07 vincentkelleher

@vincentkelleher appreciate the feedback! Do you currently have cases for which you set maxDepth > 1?

RomneyDa avatar Jul 13 '25 19:07 RomneyDa

@RomneyDa I don't have any in mind right now :thinking:

vincentkelleher avatar Jul 15 '25 14:07 vincentkelleher

After running it by the team, I opened a new PR to remove maxDepth for YAML and opened a ticket to add maxPages/allowList/blockList or something similar as a replacement. Will leave useLocalCrawling. Thanks for the contribution!

RomneyDa avatar Jul 17 '25 14:07 RomneyDa

:tada: This PR is included in version 1.1.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

sestinj avatar Jul 22 '25 05:07 sestinj