
Large TypeScript files are silently not indexed

breynolds3 opened this issue 1 year ago • 0 comments


Relevant environment info

- OS: Windows 10
- Continue: v0.9.199-vscode
- IDE: VSCode 1.92.2
- Model: Any model besides llama3
- config.json:
  
{
  "models": [
    {
      "title": "Llama 3",
      "provider": "ollama",
      "model": "llama3"
    },
    {
      "title": "Ollama",
      "provider": "ollama",
      "model": "AUTODETECT"
    },
    {
      "title": "GPT4",
      "model": "gpt-4",
      "apiBase": "x",
      "apiKey": "x",
      "systemMessage": "You are an expert software developer. You give helpful and concise responses.",
      "useLegacyCompletionsEndpoint": false,
      "completionOptions": {
        "maxTokens": 4096,
        "temperature": 0.5,
        "topP": 0.8
      },
      "contextLength": 128000,
      "provider": "openai"
    },
    {
      "title": "Code Llama",
      "model": "phind-codellama-34b-v2",
      "apiBase": "x",
      "apiKey": "x",
      "useLegacyCompletionsEndpoint": false,
      "completionOptions": {
        "maxTokens": 4096,
        "temperature": 0.5,
        "topP": 0.8
      },
      "contextLength": 128000,
      "provider": "openai"
    },
    {
      "title": "Llama3-70b",
      "model": "llama3-70b",
      "apiBase": "x",
      "apiKey": "x",
      "useLegacyCompletionsEndpoint": false,
      "contextLength": 128000,
      "provider": "openai"
    },
    {
      "title": "Llama3-8b",
      "model": "llama3-8b",
      "apiBase": "x",
      "apiKey": "x",
      "useLegacyCompletionsEndpoint": false,
      "completionOptions": {
        "maxTokens": 2048,
        "temperature": 0.5,
        "topP": 0.8,
        "stop": [
          "<|start_header_id|>",
          "<|end_header_id|>",
          "<|eot_id|>"
        ]
      },
      "contextLength": 128000,
      "provider": "openai"
    }
  ],
  "customCommands": [
    {
      "name": "test",
      "prompt": "{{{ input }}}\n\nWrite a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
      "description": "Write unit tests for highlighted code"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Starcoder 3b",
    "provider": "ollama",
    "model": "starcoder2:3b"
  },
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "BAAI/bge-small-en-v1.5:latest",
    "apiBase": "http://localhost:11434"
  },
  "allowAnonymousTelemetry": false,
  "docs": []
}

Description

Larger TypeScript files produce a chunk that exceeds the max chunk size. CodeChunker passes these oversized chunks along, they are rejected, and the file never makes it into the index. getSmartCollapsedChunks / tree-sitter doesn't seem to split these files properly. One workaround is to fall back to basicChunker whenever codeChunker produces an oversized chunk. This change at least allowed me to retrieve context from the larger files and greatly improved the context provided. https://github.com/breynolds3/continue/commit/230dbd84967d8f42ee43eaa9d6cd1989cf0d0a64
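The fallback idea can be sketched roughly as follows. This is a hedged illustration, not Continue's actual code (the real fix is TypeScript, in the linked commit): `basic_chunks` stands in for basicChunker and `smart_chunker` for the tree-sitter based codeChunker; names and signatures here are hypothetical.

```python
# Hypothetical sketch of "fall back to a basic chunker if the smart
# chunker emits an oversized chunk" — not Continue's actual API.

def basic_chunks(text: str, max_size: int) -> list[str]:
    """Split on line boundaries, keeping every chunk within max_size
    (assumes no single line exceeds the limit)."""
    chunks: list[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        if current and len(current) + len(line) > max_size:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


def chunk_with_fallback(smart_chunker, text: str, max_size: int) -> list[str]:
    """Prefer the smart (AST-based) chunker, but if any chunk it emits
    is oversized — which the indexer would reject, silently dropping the
    whole file — fall back to the line-based splitter so the file still
    gets indexed."""
    smart = smart_chunker(text)
    if any(len(c) > max_size for c in smart):
        return basic_chunks(text, max_size)
    return smart
```

A degenerate smart chunker that returns the whole file as a single chunk would trigger the fallback here, which mirrors what happens with very large TypeScript files.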

To reproduce

  1. Copy the following file into a folder https://github.com/microsoft/vscode/blob/main/src/vs/workbench/browser/workbench.contribution.ts
  2. Open the folder in VSCode
  3. Index the project
  4. Close VSCode
  5. Open ~/.continue/index.sqlite with sqlite3
  6. Run select distinct path from chunks;
  7. Observe the file is not present in the table.
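Steps 5–7 can also be scripted. A minimal sketch, assuming only the schema observed in the repro (a `chunks` table with a `path` column); the demo below runs against a throwaway database so it is self-contained — point `db` at ~/.continue/index.sqlite (with VSCode closed) to inspect the real index.

```python
import os
import sqlite3
import tempfile

def indexed_paths(db_path: str) -> set[str]:
    """Distinct file paths recorded in the chunk index."""
    conn = sqlite3.connect(db_path)
    try:
        return {row[0] for row in conn.execute("select distinct path from chunks")}
    finally:
        conn.close()

# Demo against a throwaway db mirroring the assumed schema.
fd, db = tempfile.mkstemp(suffix=".sqlite")
os.close(fd)
conn = sqlite3.connect(db)
conn.execute("create table chunks(path text)")
conn.executemany("insert into chunks values (?)",
                 [("src/a.ts",), ("src/a.ts",), ("src/b.ts",)])
conn.commit()
conn.close()
paths = indexed_paths(db)  # a silently skipped file simply won't appear here
os.unlink(db)
```

Against the real index, a large file such as workbench.contribution.ts not appearing in the returned set is the symptom described above.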

Log output

No response

breynolds3 avatar Aug 25 '24 20:08 breynolds3