
failing test: src/test/local.e2etest.ts:1:468 - Unminifies an example file successfully - AssertionError [ERR_ASSERTION]: Expected "UNREADABLE" but got GOOD

0xdevalias opened this issue 9 months ago · 7 comments

This test has started failing in main since this PR bumping node-llama-cpp was auto-merged:

  • https://github.com/jehna/humanify/pull/382
    • https://github.com/jehna/humanify/pull/382#issuecomment-2760104898
✖ failing tests:

test at src/test/local.e2etest.ts:1:468
✖ Unminifies an example file successfully (10744.814553ms)
  AssertionError [ERR_ASSERTION]: Expected "UNREADABLE" but got GOOD

  The function given performs a task that appears to split a given string (`e`) into chunks of a specified size (`t`)
      at expectStartsWith (/Users/devalias/dev/jehna/humanify/src/test/local.e2etest.ts:24:5)
      at TestContext.<anonymous> (/Users/devalias/dev/jehna/humanify/src/test/local.e2etest.ts:30:9)
      at async Test.run (node:internal/test_runner/test:866:9)
      at async Test.processPendingSubtests (node:internal/test_runner/test:574:7) {
    generatedMessage: false,
    code: 'ERR_ASSERTION',
    actual: false,
    expected: true,
    operator: '=='
  }

This test had been failing in node-llama-cpp update PRs for a long time:

  • https://github.com/jehna/humanify/issues/110
  • https://github.com/jehna/humanify/issues/113
  • https://github.com/jehna/humanify/issues/119
  • https://github.com/jehna/humanify/issues/123
  • https://github.com/jehna/humanify/issues/126
  • https://github.com/jehna/humanify/issues/148
  • https://github.com/jehna/humanify/issues/194
  • https://github.com/jehna/humanify/issues/235
  • https://github.com/jehna/humanify/issues/243
  • https://github.com/jehna/humanify/issues/259
  • https://github.com/jehna/humanify/issues/270
  • https://github.com/jehna/humanify/issues/289
  • https://github.com/jehna/humanify/issues/298
  • https://github.com/jehna/humanify/issues/301

Which I started looking into in this comment:

  • https://github.com/jehna/humanify/pull/301#issuecomment-2642723208

Because it was failing, the automerge GitHub Action refused to merge those PRs; but for some reason this latest PR was merged anyway, and now the code on the main branch fails this test.

I haven't had a chance to debug more specifically what changed between the last version of node-llama-cpp that worked and the version it started breaking in, but I suspect it has something to do with the specifics of the GBNF grammar (or similar) that's used to constrain the model's output.

The following links may be relevant for understanding if/what changes need to be made to the GBNF grammar (or other ways to achieve similar):

  • https://node-llama-cpp.withcat.ai/guide/grammar
    • https://node-llama-cpp.withcat.ai/guide/grammar#reducing-json-schema-hallucinations
  • https://node-llama-cpp.withcat.ai/api/classes/Llama#getgrammarfor
  • https://github.com/ggerganov/llama.cpp/tree/master/grammars
  • https://node-llama-cpp.withcat.ai/guide/chat-session#response-json-schema
  • https://node-llama-cpp.withcat.ai/guide/function-calling
  • https://node-llama-cpp.withcat.ai/guide/chat-session#function-calling
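For context on what such a grammar does: a minimal llama.cpp-style GBNF grammar constraining the rating reply would look roughly like the following. This is a sketch of the idea only, not the repo's actual grammar in `src/plugins/local-llm-rename/gbnf.ts`; the trailing free-text rule is assumed, based on the failure output above including a justification sentence after the rating.

```gbnf
# Sketch only: force the reply to start with one of the three ratings,
# optionally followed by a short single-line justification.
root ::= rating (" " [^\n]*)?
rating ::= "EXCELLENT" | "GOOD" | "UNREADABLE"
```

If something changed between node-llama-cpp versions in how such grammars are parsed or applied to sampling, that could explain the rating drift without any change to the model or prompt.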

Background Context

Interesting that this upgrade was able to run the tests and merge successfully when they had been failing for so long prior. I wonder what the root issue was and what fixed it, as looking at the main commits in this repo, it doesn't look like the tests were fixed at all.

Background context:

See my previous exploration/analysis into this build failure in the following comments:

  • https://github.com/jehna/humanify/pull/301#issuecomment-2642041534
  • https://github.com/jehna/humanify/pull/301#issuecomment-2642723208

The following links may be relevant for understanding if/what changes need to be made to the Gbnf grammar (or other ways to achieve similar):

  • https://node-llama-cpp.withcat.ai/guide/grammar
    • https://node-llama-cpp.withcat.ai/guide/grammar#reducing-json-schema-hallucinations
  • https://node-llama-cpp.withcat.ai/api/classes/Llama#getgrammarfor
  • https://github.com/ggerganov/llama.cpp/tree/master/grammars
  • https://node-llama-cpp.withcat.ai/guide/chat-session#response-json-schema
  • https://node-llama-cpp.withcat.ai/guide/function-calling
  • https://node-llama-cpp.withcat.ai/guide/chat-session#function-calling

Originally posted by @0xdevalias in https://github.com/jehna/humanify/issues/301#issuecomment-2642723208


Resolving this will likely also resolve the following from this issue:

  • https://github.com/jehna/humanify/issues/52#issuecomment-2679997649

Originally posted by @0xdevalias in https://github.com/jehna/humanify/issues/337#issuecomment-2679998964

Actually, cloning locally (7beba2d32433e58bb77d0e1b0eda01c470fec3e2) and running npm test fails with the same error as previously:

⇒ npm test

..snip..

> [email protected] test:e2e
> npm run build && find src -name '*.e2etest.ts' | xargs tsx --test --test-concurrency=1

..snip..

✖ failing tests:

test at src/test/local.e2etest.ts:1:468
✖ Unminifies an example file successfully (10744.814553ms)
  AssertionError [ERR_ASSERTION]: Expected "UNREADABLE" but got GOOD

  The function given performs a task that appears to split a given string (`e`) into chunks of a specified size (`t`)
      at expectStartsWith (/Users/devalias/dev/jehna/humanify/src/test/local.e2etest.ts:24:5)
      at TestContext.<anonymous> (/Users/devalias/dev/jehna/humanify/src/test/local.e2etest.ts:30:9)
      at async Test.run (node:internal/test_runner/test:866:9)
      at async Test.processPendingSubtests (node:internal/test_runner/test:574:7) {
    generatedMessage: false,
    code: 'ERR_ASSERTION',
    actual: false,
    expected: true,
    operator: '=='
  }

So it seems this never should have passed and been merged; probably a race condition or similar bug in the automerge GitHub Action, which would likely be resolved by the improvements raised in:

  • https://github.com/jehna/humanify/pull/346

Originally posted by @0xdevalias in https://github.com/jehna/humanify/issues/382#issuecomment-2760104898

0xdevalias commented on Mar 28 '25

For those who haven't taken a look yet, the test runs humanify over this file: https://github.com/jehna/humanify/blob/main/fixtures/example.min.js

It:

  • starts by confirming example.min.js is unreadable (it is) -> asserts this minified code is unreadable
  • then humanifies it, and
  • confirms that the output of the humanified JS file is readable 👍

The test is failing because it thinks example.min.js, in its unmodified form, is readable/good (???).

@0xdevalias do you suspect the issue is in https://github.com/jehna/humanify/blob/main/src/plugins/local-llm-rename/gbnf.ts, or with node-llama-cpp itself? I mean, the current version of node-llama-cpp is @ 3.7.0 and this lib is @ 3.0.0.

skilbjo commented on Apr 21 '25

ok, version bump had no impact (but not a bad idea)

anyways i got it to work by editing the system prompt,

from

""" Your job is to read the following code and rate it's readabillity and variable names. Answer "EXCELLENT", "GOOD" or "UNREADABLE". """

to

""" Your job is to read the following code and rate its HUMAN readability and variable names. If it is undecipherable or looks like minified code, answer "UNREADABLE". If it looks like something a human wrote and is very readable, answer "EXCELLENT". Your answer options are "EXCELLENT", "GOOD" or "UNREADABLE". """

skilbjo commented on Apr 21 '25

do you suspect the issue is in main/src/plugins/local-llm-rename/gbnf.ts, or with node-llama-cpp itself? I mean, the current version of node-llama-cpp is @ 3.7.0 and this lib is @ 3.0.0

@skilbjo So the test started failing in main since this PR got merged when it really shouldn't have (likely due to a race condition/similar due to how the automerge GitHub action works, as called out in https://github.com/jehna/humanify/pull/346#issuecomment-2760146617):

  • https://github.com/jehna/humanify/pull/382

But node-llama-cpp version upgrade PRs had been failing on that same test for ages, which is why it was on such an old version.

I'd have to go back and review my detailed notes (referenced/linked to in 'Background Context' in the main issue description above), but from memory, I believe that node-llama-cpp version 3.0.0-beta.44 worked fine; and then seemingly anything after that version (even a minor bump from it) stopped working.

Not sure if this was the very first place I started looking into it, but it looks like it might have been: https://github.com/jehna/humanify/pull/301#issuecomment-2642041534

There's a vague overview of the code path of that failing test, leading through to the call to llama and the gbnf helper.

So yeah, based on that and what I remember, I think my theory was that something may have changed in how GBNF was meant to be called/used with node-llama-cpp.

Skimming through https://github.com/jehna/humanify/pull/301#issuecomment-2642041534 more, it looks like I went looking through the changelog to see if anything jumped out at me, but I don't think I found anything that was obviously a breaking change.

The next comment (https://github.com/jehna/humanify/pull/301#issuecomment-2642723208) seems to suggest that I thought it was breaking since v3.0.0-beta.45; though I didn't seemingly come up with any concrete answers there.

This was the shortlist of links I left in that comment though, as places to read through more to understand if/what changes might have occurred in how to properly use GBNF/similar forms of structured outputs:

The following links may be relevant for understanding if/what changes need to be made to the Gbnf grammar (or other ways to achieve similar):

  • https://node-llama-cpp.withcat.ai/guide/grammar
    • https://node-llama-cpp.withcat.ai/guide/grammar#reducing-json-schema-hallucinations
  • https://node-llama-cpp.withcat.ai/api/classes/Llama#getgrammarfor
  • https://github.com/ggerganov/llama.cpp/tree/master/grammars
  • https://node-llama-cpp.withcat.ai/guide/chat-session#response-json-schema
  • https://node-llama-cpp.withcat.ai/guide/function-calling
  • https://node-llama-cpp.withcat.ai/guide/chat-session#function-calling

Originally posted by @0xdevalias in https://github.com/jehna/humanify/issues/301#issuecomment-2642723208

Unfortunately I have been pretty busy/distracted in other directions since then, so haven't had the time to dig deeper into those specifics to try and figure out what exactly was causing it/how to fix it/etc.


anyways i got it to work by editing the system prompt

@skilbjo Interesting.. does that prompt change work on both the old version of node-llama-cpp (3.0.0-beta.44), the next version that failed (3.0.0-beta.45), and the newer/latest versions (eg. 3.7.0)?

0xdevalias commented on Apr 21 '25

Didn't try it on older versions. I'm no expert, but my guess is that increased specificity in the system prompt goes a long way (you are trying to identify BOT or HUMAN, you stupid robot), and so it's probably not that important to chase down why it worked in the pre-3.0.0-beta.xx versions, and just move on to a better system prompt and bring all the deps to the current versions.

skilbjo commented on Apr 21 '25

didn't try it on older versions

@skilbjo nods, fair.

but my guess is that increased specificity in the system prompt goes a long way

@skilbjo nods, yeah; it definitely seems like a better prompt overall.

so it's probably not that important to chase down why it worked in the pre-3.0.0-beta.xx versions

@skilbjo On one hand I sort of agree; but on the other hand, given we're using the same model/prompt/etc, the engineer in me really wants to understand and narrow down the root cause of whatever broke between 3.0.0-beta.44 and 3.0.0-beta.45, particularly if something changed in the GBNF handling that could be improved. Even though the improved system prompt does seem to resolve/work around the issue, it's possible that it is masking the root cause of the failure.

and just move on to a better system prompt and bring all the deps to the current versions

@skilbjo Yeah, from a 'get things unbroken' perspective I agree. That said, I don't have merge/push access to fix things on this repo at this stage anyway (see https://github.com/jehna/humanify/issues/358); and while I have been thinking about forking the project (see https://github.com/jehna/humanify/issues/358#issuecomment-2817712741), I don't have capacity to work on that just yet due to competing priorities, so it makes little real impact one way or the other at this stage.

0xdevalias commented on Apr 21 '25

a common criticism of LLMs is that they are non-deterministic. if you do find the smoking gun, i tip my hat to you.

skilbjo commented on Apr 21 '25

a common criticism of LLMs is that they are non-deterministic

@skilbjo Sure, but that prompt was seemingly stable and good enough to work from whenever it was first committed right up until the version where it started failing; so it was deterministic enough, particularly with the guidance from the GBNF grammar.

And given the underlying model didn't change (to my knowledge), there'd have to be a reason why it suddenly stopped working around that version. That might have been something in node-llama-cpp, or perhaps something in the underlying llama.cpp project that it wraps; but it is very unlikely that the root cause is just akin to "lol, it just decided its vibes were off from that point on, and then consistently failed every test run since".

Whether I/someone else is able to narrow down what that change was or not is another story; but it is something I am interested in looking into deeper when I have capacity.

0xdevalias commented on Apr 21 '25