semchunk splitting Markdown-formatted outlines in an odd way

My markdown doc is structured as:

# header1

## header2

Some text

## header2 

Some more text


### Step 0: this is pre-planning step

* ⚠️ this is a warning
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 1: the first actual step

▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 2: the second step

etc...

My code:

import semchunk
chunker = semchunk.chunkerify('gpt-4', chunk_size = 2000)
chunker(text)

I would expect the chunker to split by headers, when possible; however, the chunks generally END with a header.

An example chunk:

▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 2: the second step

...instead of:

### Step 1: the first actual step
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

Any idea why this is happening?

Jan 01 '25 22:01 nick-youngblut

I’m assuming the text in question has two newlines separating each header from the content that follows it? Like this:

# Header

Content.

Instead of like this:

# Header
Content.

If that’s the case, then what is happening under the hood is that semchunk is splitting your text at each occurrence of two consecutive newlines into:

[
    "# Header 1", "Content 1.",
    "# Header 2", "Content 2."
]

And then when semchunk goes to rejoin the splits to form new chunks meeting your desired chunk size, you might end up with:

[
    "# Header 1\n\nContent 1.\n\n# Header 2",
    "Content 2."
]

semchunk heuristically leverages the fact that normal English text tends to use newlines and other delimiters like punctuation to indicate varying degrees of semantic separation, but when it comes to Markdown, specialised syntax might take the place of those patterns.
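
The split-then-rejoin behaviour described above can be reproduced with a toy chunker. This is only a minimal sketch, not semchunk's actual algorithm: `naive_chunk` is a hypothetical helper, and it counts whitespace-separated words instead of real tokens so that it runs without a tokenizer.

```python
# Simplified illustration only: this is NOT semchunk's real implementation.
# Chunk size is measured in words here purely to keep the example self-contained.

def naive_chunk(text: str, chunk_size: int) -> list[str]:
    """Split on blank lines, then greedily rejoin splits up to chunk_size words."""
    splits = text.split("\n\n")
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for split in splits:
        split_len = len(split.split())
        # Start a new chunk once adding this split would exceed the budget.
        if current and current_len + split_len > chunk_size:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(split)
        current_len += split_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Blank line after each header: the header is its own split.
loose = "# Header 1\n\nContent 1.\n\n# Header 2\n\nContent 2."
loose_chunks = naive_chunk(loose, chunk_size=8)

# No blank line after each header: header and content stay in one split.
tight = "# Header 1\nContent 1.\n\n# Header 2\nContent 2."
tight_chunks = naive_chunk(tight, chunk_size=8)

print(loose_chunks)
print(tight_chunks)
```

With a budget of 8 words, the second header still fits at the end of the first chunk while its content does not, which is the pattern seen in the question. Once the header and its content share a split, each chunk starts with its own header.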

I myself have run into this problem with Markdown. There's an easy solution, however.

Before passing your text to semchunk, you can preprocess it with this code:

import re

# Remove empty lines after Markdown headings.
text = re.sub(r'(^#+[^\n]+\n)\n', r'\1', text, flags = re.MULTILINE)

With that code, your original text ends up looking like this:

# header1
## header2
Some text

## header2 
Some more text


### Step 0: this is pre-planning step
* ⚠️ this is a warning
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 1: the first actual step
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 2: the second step
etc...

Which then produces much nicer chunks:

import semchunk

chunker = semchunk.chunkerify('gpt-4', chunk_size = 100)
chunks = chunker(text)

for chunk in chunks:
    print(chunk)
    print('-'*80)

# header1
## header2
Some text

## header2 
Some more text
--------------------------------------------------------------------------------
### Step 0: this is pre-planning step
* ⚠️ this is a warning
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list
--------------------------------------------------------------------------------
### Step 1: the first actual step
▶ the top line
    ▶ the next line
    ▶ the next line
    ▶ the next line
        1. numbered list
        1. numbered list

### Step 2: the second step
etc...
--------------------------------------------------------------------------------
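
If it helps, the preprocessing step above can be wrapped in a small helper so that it is applied consistently before chunking. This is a minimal sketch: the function name `strip_blank_line_after_headings` and the sample string are made up for illustration, and the pattern is the same one shown earlier in this reply.

```python
import re

# Same pattern as above: a line starting with one or more '#', its newline,
# and then exactly one further newline (the empty line after the heading).
HEADING_BLANK_LINE = re.compile(r'(^#+[^\n]+\n)\n', flags=re.MULTILINE)


def strip_blank_line_after_headings(text: str) -> str:
    """Remove one empty line directly after each Markdown ATX heading."""
    return HEADING_BLANK_LINE.sub(r'\1', text)


sample = "# header1\n\n## header2\n\nSome text\n\n## header2\n\nSome more text\n"
cleaned = strip_blank_line_after_headings(sample)
print(cleaned)
```

Note that the pattern treats any line beginning with `#` as a heading, so a `#` comment line inside a code block would also lose the empty line after it.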

Given that I can see an opportunity to improve Markdown chunking even further by introducing some new specialised rules, I'm going to leave this issue open for now and work on adding an extra markdown argument that can be used to invoke those rules 😊

Jan 02 '25 01:01 umarbutler

Thanks! I'll give text = re.sub(r'(^#+[^\n]+\n)\n+', r'\1', text, flags = re.MULTILINE) a try.
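
For comparison, the two patterns only differ when a heading is followed by more than one empty line. A quick sketch on a made-up sample string (the variable names are illustrative):

```python
import re

# A heading followed by two empty lines before its content.
sample = "## header2\n\n\nSome more text"

# Original pattern: removes a single empty line after the heading.
one_removed = re.sub(r'(^#+[^\n]+\n)\n', r'\1', sample, flags=re.MULTILINE)

# Variant with a trailing '+': removes every empty line after the heading.
all_removed = re.sub(r'(^#+[^\n]+\n)\n+', r'\1', sample, flags=re.MULTILINE)

print(repr(one_removed))
print(repr(all_removed))
```

With the original pattern, one empty line is left between the heading and its content, so the heading could still be split off from it. The `\n+` variant attaches the content directly to the heading.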

Jan 02 '25 16:01 nick-youngblut