
VAL always responds with ""

Open • Zxy-MLlab opened this issue 10 months ago • 6 comments

Does VAL evaluate correctly for you? I downloaded VAL and set the environment variable: os.environ['VAL'] = '/home/zhouxueyang/agent/plan/planbench/LLMs-Planning-main/planner_tools/VAL'

evaluate file: {'instance_id': 2, 'example_instance_ids': [1], 'query': 'I am playing with a set of blocks where I need to arrange the blocks into stacks. Here are the actions I can do\n\nPick up a block\nUnstack a block from on top of another block\nPut down a block\nStack a block on top of another block\n\nI have the following restrictions on my actions:\nI can only pick up or unstack one block at a time.\nI can only pick up or unstack a block if my hand is empty.\nI can only pick up a block if the block is on the table and the block is clear. A block is clear if the block has no other blocks on top of it and if the block is not picked up.\nI can only unstack a block from on top of another block if the block I am unstacking was really on top of the other block.\nI can only unstack a block from on top of another block if the block I am unstacking is clear.\nOnce I pick up or unstack a block, I am holding the block.\nI can only put down a block that I am holding.\nI can only stack a block on top of another block if I am holding the block being stacked.\nI can only stack a block on top of another block if the block onto which I am stacking the block is clear.\nOnce I put down or stack a block, my hand becomes empty.\nOnce you stack a block on top of a second block, the second block is no longer clear.\n\n[STATEMENT]\nAs initial conditions I have that, the red block is clear, the blue block is clear, the yellow block is clear, the hand is empty, the blue block is on top of the orange block, the red block is on the table, the orange block is on the table and the yellow block is on the table.\nMy goal is to have that the orange block is on top of the blue block.\n\nMy plan is as follows:\n\n[PLAN]\nunstack the blue block from on top of the orange block\nput down the blue block\npick up the orange block\nstack the orange block on top of the blue block\n[PLAN END]\n\n[STATEMENT]\nAs initial conditions I have that, the red block is clear, the yellow block is clear, the hand is empty, the red block is on top of the blue block, the yellow block is on top of the orange block, the blue block is on the table and the orange block is on the table.\nMy goal is to have that the orange block is on top of the red block.\n\nMy plan is as follows:\n\n[PLAN]', 'ground_truth_plan': '(unstack yellow orange)\n(put-down yellow)\n(pick-up orange)\n(stack orange red)\n', 'llm_raw_response': "Based on the initial conditions and the goal, I will create a plan to achieve the goal.\n\n[PLAN]\nunstack the yellow block from on top of the orange block\nput down the yellow block\nunstack the red block from on top of the blue block\nput down the red block\npick up the orange block\nstack the orange block on top of the red block\n[PLAN END]\n\nThis plan should achieve the goal of having the orange block on top of the red block. Let me know if you'd like me to explain the reasoning behind each step!", 'extracted_llm_plan': '(unstack d c)\n(put-down d)\n(pick-up c)\n(stack c a)\n', 'llm_correct': False}

But when I evaluate the response, it always returns ' '.

Is there something wrong with my deployment?

Zxy-MLlab commented on Feb 20, 2025

My sense is that the compiled VAL might not work on your machine. Kindly re-make VAL from https://github.com/KCL-Planning/VAL and then set the env variable. That should help.
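
Before rebuilding, a quick way to confirm the shipped binary is the problem is to try executing it directly. A minimal sketch (it assumes the VAL env variable points at the directory containing a binary named validate, as used elsewhere in this repo):

import os
import subprocess

# Assumes $VAL points at the directory containing the binary and that the
# binary is named `validate`, matching the repo's usage.
validate_bin = os.path.join(os.environ["VAL"], "validate")

# Running validate with no arguments should print its usage text. A
# FileNotFoundError or an "exec format error" here means the shipped binary
# does not run on this machine and VAL needs to be rebuilt.
result = subprocess.run([validate_bin], capture_output=True, text=True)
print(result.stdout or result.stderr)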

karthikv792 commented on Mar 16, 2025

@Zxy-MLlab @karthikv792 Have you fixed this bug yet? I ran into the same issue. When evaluating BlocksWorld, the ground truth plan keeps natural names (red/blue/yellow/orange) while the LLM-extracted plan is normalized to symbolic IDs (a, b, c, d). Because the comparison happens across two naming namespaces, otherwise-correct plans are marked incorrect, yielding Accuracy = 0.0 (0/500).

In the author’s sample results, both ground_truth_plan and extracted_llm_plan are normalized to a,b,c,d. On my run, only the extracted plan is normalized. Example (trimmed from an instance result):

{
  "ground_truth_plan": [
    "(unstack yellow orange)",
    "(put-down yellow)",
    "(pick-up orange)",
    "(stack orange red)"
  ],
  "extracted_llm_plan": [
    "(unstack d c)",
    "(put-down d)",
    "(pick-up c)",
    "(stack c a)"
  ],
  "llm_correct": false
}

I think this is the reason for accuracy = 0.


I still don't know how to fix this error. Any suggestion is appreciated. Thanks!!!

chihuy124 commented on Aug 28, 2025

May I know the exact command you are running for this? I am assuming that there is an issue in your VAL setup, but I want to make sure that's the case. Could you paste the extracted plan into a file (with the list joined into a single string with '\n'), name it, say, plan_file, and then run the following command?

$VAL/validate instances/blocksworld/domain.pddl instances/blocksworld/generated_basic/instance-<number>.pddl plan_file

The <number> should be the corresponding instance number for which the extracted plan is the proposed solution.
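
If it's easier to drive from Python, the same check looks roughly like this (a sketch, not the pipeline's own code; the paths follow the command above and the plan list is the one from your result file):

import os
import subprocess

# Plan taken from the result file; the instance number is illustrative.
extracted_llm_plan = [
    "(unstack d c)",
    "(put-down d)",
    "(pick-up c)",
    "(stack c a)",
]
instance = 2

# Join the list into one action per line and write it out.
with open("plan_file", "w") as f:
    f.write("\n".join(extracted_llm_plan))

# The same invocation as the command above, driven from Python.
cmd = [
    os.path.join(os.environ["VAL"], "validate"),
    "instances/blocksworld/domain.pddl",
    f"instances/blocksworld/generated_basic/instance-{instance}.pddl",
    "plan_file",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout or result.stderr)  # VAL reports whether the plan executes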

karthikv792 commented on Aug 28, 2025

The point is that we don't do a one-to-one comparison with the ground truth. We essentially simulate the plan in the world (which is defined by the domain) and validate it. VAL, when presented with the domain and the problem file, acts as the simulator in which the plan is executed to find out whether it succeeds.

karthikv792 commented on Aug 28, 2025

Command I’m running:

python3 plan-bench/llm_plan_pipeline.py \
  --task t1 \
  --config blocksworld \
  --engine qwen3-4B-Instruct-2507 \
  --verbose True

Engine wiring (in llm_utils.py):

elif engine == "qwen3-4B-Instruct-2507":
    model_name = "Qwen/Qwen3-4B-Instruct-2507"
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    qwen_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        torch_dtype=torch.float16
    )
    messages = [
        {"role": "system", "content": "You are the planner assistant who comes up with correct plans."},
        {"role": "user", "content": query}
    ]
    # return_dict=True makes apply_chat_template return a dict with
    # input_ids and attention_mask; without it, a bare tensor is returned
    # and the **inputs unpacking below raises a TypeError.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True
    ).to(qwen_model.device)
    # Note: with do_sample left at its default (False), temperature and
    # top_p are ignored and greedy decoding is used either way.
    outputs = qwen_model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=1e-5,
        top_p=1.0
    )
    # Decode only the newly generated tokens; decoding outputs[0] from the
    # start would prepend the whole prompt (including the example plan)
    # to the returned response, which can confuse plan extraction.
    response = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[-1]:],
        skip_special_tokens=True
    )
    return response.strip()

Result file results/blocksworld/qwen3-4B-Instruct-2507/task_1_plan_generation.json contains (example):

{
  "ground_truth_plan": [
    "(unstack yellow orange)",
    "(put-down yellow)",
    "(pick-up orange)",
    "(stack orange red)"
  ],
  "extracted_llm_plan": [
    "(unstack d c)",
    "(put-down d)",
    "(pick-up c)",
    "(stack c a)"
  ],
  "llm_correct": false
}

My observation: the ground_truth_plan is rendered with color names (yellow, orange, red), while the extracted_llm_plan is in object symbols (a, b, c, d).

Questions:

  1. Could this namespace mismatch (colors vs a/b/c/d) be the reason llm_correct becomes false on my setup?
  2. From your comment, I understand the evaluation doesn’t string-match to a ground truth but uses VAL to simulate the extracted plan against the domain/problem. In that case, is the ground truth in the JSON purely informational? Or does the pipeline also normalize both GT and LLM plans into the problem’s object namespace before any comparison?
  3. If normalization is expected: which part of the code performs the mapping from natural names (e.g., colors) to the problem objects (often a/b/c/d)—is it text_to_plan(...) in utils? I’m trying to confirm whether my environment failed to map the GT side (I saw typed objects in some problems).

chihuy124 commented on Aug 28, 2025

  1. No, the namespace being tested is a/b/c/d. The ground truth plan is mostly informational.
  2. It depends on whether you are sending a natural-language prompt or a PDDL prompt. If it is natural language, the namespace within the prompt is colors, which then get translated back to a/b/c/d. If it is a PDDL prompt, there is no translation.
  3. Yes, text_to_plan() is the one; a rough sketch of what that translation does follows.
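
For intuition, the translation is essentially a lookup from the natural-language names in the prompt back to the problem's object symbols. Below is a stripped-down sketch of that idea (the mapping dict, regex, and function name are all hypothetical; the repo's actual text_to_plan() is more involved and builds the mapping from the instance itself):

import re

# Hypothetical mapping; the real code derives this from the instance,
# it is not hard-coded.
NAME_TO_OBJECT = {"red": "a", "blue": "b", "orange": "c", "yellow": "d"}
ACTION_NAMES = {"pick up": "pick-up", "put down": "put-down",
                "unstack": "unstack", "stack": "stack"}
ACTION_PATTERN = re.compile(
    r"(pick up|put down|unstack|stack) the (\w+) block"
    r"(?: (?:from on top of|on top of) the (\w+) block)?"
)

def text_line_to_action(line):
    """Translate one natural-language plan line into a PDDL-style action."""
    m = ACTION_PATTERN.search(line)
    if m is None:
        return None
    action, obj1, obj2 = m.groups()
    args = [NAME_TO_OBJECT[obj1]] + ([NAME_TO_OBJECT[obj2]] if obj2 else [])
    return f"({ACTION_NAMES[action]} {' '.join(args)})"

print(text_line_to_action("unstack the yellow block from on top of the orange block"))
# -> (unstack d c)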

karthikv792 commented on Sep 4, 2025