Add file truncation

Open klntsky opened this issue 1 year ago • 5 comments

The use case is: I have multiple JSON data files. I want to include them in the LLM input, but only to show their structure, not the contents. I'd like to be able to specify that I just want to include the first N lines.

klntsky Nov 02 '24 13:11

Hi @klntsky!

I'm thinking of implementing this with a new process config option. Does this kind of structure match what you had in mind?

repomix.config.json

{
  "output": {
    // ... output config
  },
  "process": {
    "maxLines": 100,             // Default limit for all files
    "patterns": [
      {
        "pattern": "**/*.json",  // Special limits for JSON files
        "maxLines": 20
      }
    ]
  }
}

The output would look like:

{
  "users": [
    {
      "id": 1,
      "name": "John"
    }
  ]
... (truncated)
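
A minimal sketch of how line-based truncation like this might work. This is not repomix's implementation: `truncateLines` and `TRUNCATION_MARKER` are hypothetical names, and the marker text is copied from the example output above.

```typescript
// Hypothetical helper, not part of repomix: keep only the first `maxLines` lines.
const TRUNCATION_MARKER = "... (truncated)";

function truncateLines(content: string, maxLines: number): string {
  const lines = content.split("\n");
  // A trailing newline yields a final empty element; do not count it as a line.
  const lineCount = content.endsWith("\n") ? lines.length - 1 : lines.length;
  if (lineCount <= maxLines) {
    return content; // Short enough, nothing to cut.
  }
  // Keep the head of the file and append a marker so the cut is visible.
  return [...lines.slice(0, maxLines), TRUNCATION_MARKER].join("\n");
}

// Usage: show only the top of a JSON file so its structure stays visible.
const sample = ["{", '  "users": [', '    { "id": 1, "name": "John" }', "  ]", "}"].join("\n");
console.log(truncateLines(sample, 2));
```

Cutting by lines only helps when the file is formatted; a minified JSON file is a single line and would pass through untouched.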

Let me know if this is heading in the right direction!

yamadashy Nov 03 '24 03:11

In some cases it may be useful to limit characters or words rather than lines (e.g. for unformatted JSON). Maybe all three should be configurable?

klntsky Nov 03 '24 05:11

@klntsky If I'm understanding your intention correctly, I think the underlying issue here is that including entire file contents can consume a large number of tokens, which is a common problem for projects using repomix with LLMs.

Given this context and considering how LLMs process text, I think focusing on token count would be the most appropriate approach initially. Something like:

{
  "process": {
    "maxTokens": 1000,          // Global token limit
    "patterns": [
      {
        "pattern": "**/*.json",  
        "maxTokens": 500        // Pattern-specific token limit
      }
    ]
  }
}
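
A rough sketch of what a token-based cut could look like, under stated assumptions. The `Tokenizer` interface and `truncateTokens` are hypothetical, and `charTokenizer` is a toy stand-in that treats every UTF-16 code unit as one token; a real setup would plug in an actual LLM tokenizer behind the same interface.

```typescript
// Hypothetical interface: any tokenizer exposing encode/decode could be adapted to it.
interface Tokenizer {
  encode(text: string): number[];
  decode(tokens: number[]): string;
}

// Toy stand-in (assumption): one token per UTF-16 code unit. Real tokenizers differ.
const charTokenizer: Tokenizer = {
  encode: (text) => text.split("").map((ch) => ch.charCodeAt(0)),
  decode: (tokens) => String.fromCharCode(...tokens),
};

function truncateTokens(content: string, maxTokens: number, tokenizer: Tokenizer): string {
  const tokens = tokenizer.encode(content);
  if (tokens.length <= maxTokens) {
    return content; // Within budget, keep the file as-is.
  }
  // Decode only the first `maxTokens` tokens and mark the cut.
  return tokenizer.decode(tokens.slice(0, maxTokens)) + "\n... (truncated)";
}

// Usage: unlike a line limit, this also shortens a single-line (minified) JSON file.
const minified = '{"users":[{"id":1,"name":"John"},{"id":2,"name":"Jane"}]}';
console.log(truncateTokens(minified, 20, charTokenizer));
```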

I'd like to start with this simpler requirement to minimize potential bugs.

What do you think about this approach?

yamadashy Nov 03 '24 07:11

Yep, token limits seem to cover both cases, but I'd like to have line limits too, because it's not immediately clear how many tokens there are in a given part of a file, while lines can be inspected visually.

klntsky Nov 03 '24 14:11

That makes sense. We could support both maxLines and maxTokens, truncating when either limit is reached.
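
One way "truncating when either limit is reached" could be sketched, purely as an illustration. `Limits`, `TokenCounter`, `truncate`, and the whitespace-based `countWords` stand-in are all hypothetical, and the cut is made at line boundaries, which is a simplification: counting tokens per line only approximates what a real tokenizer would report for the whole file.

```typescript
// Hypothetical shapes mirroring the proposed maxLines / maxTokens options.
interface Limits {
  maxLines?: number;
  maxTokens?: number;
}
type TokenCounter = (text: string) => number;

// Stand-in counter (assumption): one token per whitespace-separated word.
const countWords: TokenCounter = (text) => text.split(/\s+/).filter(Boolean).length;

function truncate(content: string, limits: Limits, countTokens: TokenCounter): string {
  const maxLines = limits.maxLines ?? Infinity;
  const maxTokens = limits.maxTokens ?? Infinity;
  const lines = content.split("\n");
  // A trailing newline yields a final empty element; drop it so it is not counted.
  if (lines[lines.length - 1] === "") lines.pop();

  const kept: string[] = [];
  let tokens = 0;
  for (const line of lines) {
    const lineTokens = countTokens(line);
    // Stop as soon as either limit would be exceeded by taking this line.
    if (kept.length >= maxLines || tokens + lineTokens > maxTokens) {
      return [...kept, "... (truncated)"].join("\n");
    }
    kept.push(line);
    tokens += lineTokens;
  }
  return content; // Neither limit was hit.
}

const text = "one two\nthree four\nfive six";
console.log(truncate(text, { maxLines: 2, maxTokens: 3 }, countWords));
```

Because this sketch never splits a line, a single-line file that exceeds `maxTokens` is reduced to just the marker; a fuller version would fall back to cutting inside the line.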

Let me think about this a bit more.

yamadashy Nov 04 '24 15:11