shecc icon indicating copy to clipboard operation
shecc copied to clipboard

Fix array compound literal parsing

Open lorettayao opened this issue 2 months ago • 16 comments

Summary Fix a segfault when evaluating array compound literals like (int[]){1,2,3,4,5} in expression context. The literal now yields the temporary array’s address (via decay) instead of collapsing to a single element, so indexing and reads work correctly.

Motivation Previously, code like the snippet below crashed because the array literal was reduced to a scalar and later treated as a pointer.

Reproduction (manual)

#include <stdio.h>

int main(void) {
    int *a = (int[]){1,2,3,4,5};   // used to crash later
    printf("Testing basic array compound literal:\n");
    for (int i = 0; i < 5; i++)
        printf("  a[%d] = %d\n", i, a[i]);

    int sum = a[0] + a[4];
    printf("Sum = %d (expect 6)\n", sum);
    return 0;
}

Before: segfault After: prints a[0..4] and Sum = 6 as expected

Approach (high level)

  • Ensure (type[]){…} produces a real temporary array object and the expression value decays to its address.
  • Only collapse to a scalar when a scalar is actually required by context.
  • No changes to struct compound literals or plain scalars.

Scope

  • This PR addresses the array side of #299 sufficient to eliminate the segfault in basic use.
  • Struct compound literals and broader initializer behaviors are intentionally out of scope here.

Tests

  • No test files are added in this PR. The manual snippet above demonstrates the fix.

Compatibility

  • Restores standard pointer semantics for array compound literals.
  • No behavioral changes for structs or scalars.

Issue

  • Partially addresses #299 (array compound literals).

Summary by cubic

Fix parsing of array compound literals so they allocate a temporary array and decay to its address, preserving pointer semantics and preventing segfaults.

  • Bug Fixes

    • Eliminates crashes when using (int[]){...} in expressions, assignments, and parameter passing.
    • Correct behavior across binary ops, pointer arithmetic, ternary results, and function-call (incl. variadic) arguments.
    • Zero-length array literals yield 0 for scalar uses to avoid garbage reads.
    • Addresses #299.
  • Refactors

    • Added parse_array_compound_literal and scalarize_array_literal to centralize handling; scalarize only when a scalar is required.
    • Emits element writes and counts initializers; tracks brace consumption to avoid double reads.
    • Struct and scalar compound literals remain unchanged.

Written for commit a4aba549e0f1755ad4105ffb4fc503a77922a17b. Summary will update automatically on new commits.

lorettayao avatar Oct 06 '25 13:10 lorettayao

You MUST ensure bootstrapping is fully functional before submitting pull requests.

jserv avatar Oct 06 '25 14:10 jserv

IIUC, compound literals are a feature supported only since C99, and the shecc README mentions from the very beginning that this project aims to support ANSI C. Therefore, IMO, this at least does not "fix" anything.

visitorckw avatar Oct 06 '25 14:10 visitorckw

IIUC, compound literals are a feature supported only since C99, and the shecc README mentions from the very beginning that this project aims to support ANSI C. Therefore, IMO, this at least does not "fix" anything.

As far as I know, shecc is planned to fully support the C99 standard, so I think the term "fix" is acceptable.

DrXiao avatar Oct 06 '25 14:10 DrXiao

Fine, but since we haven't claimed to fully support C99 and, IIUC, array compound literals were never supported before, this seems more like supporting a new feature to me, rather than fixing an existing problem.

visitorckw avatar Oct 06 '25 14:10 visitorckw

In fact, shecc has ability to handle array compound literals, but it only captures the first element (#299).

Therefore, this pull request specifically aims to fix it.

DrXiao avatar Oct 06 '25 15:10 DrXiao

In fact, shecc has ability to handle array compound literals, but it only captures the first element (#299).

Therefore, this pull request specifically aims to fix it.

Thanks, that resolves my doubt. However, should we also mention the issue in the first commit?

visitorckw avatar Oct 06 '25 15:10 visitorckw

Looking at the changes, the core logic for handling array compound literals looks solid—good job centralizing the decay behavior to avoid ad-hoc fixes everywhere. But yeah, there are some style inconsistencies and comment opportunities that could make this even tighter. Here's what I'd suggest, focused on parser.c (the bulk of the changes) and the test additions. I'll keep it practical: specific line tweaks for consistency with the rest of the codebase (e.g., brace placement, naming, spacing) and punchier comments.

Style Fixes for parser.c

  1. Brace placement and indentation in new functions:

    • In parse_array_compound_literal, the if (!lex_peek(T_close_curly, NULL)) block has inconsistent indentation for the inner for loop—align it to match the function's style (2-space indents). Also, add braces around single-statement blocks for consistency (e.g., the count++ line).
      if (!lex_peek(T_close_curly, NULL)) {
          for (;;) {
              read_expr(parent, bb);
              read_ternary_operation(parent, bb);
              var_t *value = opstack_pop();
              if (count == 0)
                  var->init_val = value->init_val;
              if (emit_code) {
                  // ... (existing code)
              }
              count++;  // Add braces if expanding later
              if (!lex_accept(T_comma))
                  break;
              if (lex_peek(T_close_curly, NULL))
                  break;
          }
      }
      
      This matches the style in read_expr and avoids future bugs.
  2. Variable naming consistency:

    • vd is overused as a generic var_t pointer—rename to something descriptive in read_ternary_operation (e.g., result_var for the final vd). It's clear in context, but vd feels like a holdover from declarations elsewhere.
      var_t *result_var = require_var(parent);
      gen_name_to(result_var->var_name);
      add_insn(parent, then_, OP_assign, result_var, true_val, NULL, 0, NULL);
      add_insn(parent, else_, OP_assign, result_var, false_val, NULL, 0, NULL);
      // ...
      result_var->is_ternary_ret = true;
      opstack_push(result_var);
      
    • Similarly, in read_body_assignment, rhs_val and rhs are redundant—use rhs consistently and drop the intermediate.
  3. Spacing and operator alignment:

    • In is_array_literal_placeholder, tighten the return condition: return var && var->array_size > 0 && !var->ptr_level && var->var_name[0] == '.';
    • In binary op handling (around line 2908), the new rs1_is_ptr_like checks have good logic but uneven spacing—align the bool defs:
      bool rs1_is_ptr_like = rs1 && (rs1->ptr_level || rs1->array_size);
      bool rs2_is_ptr_like = rs2 && (rs2->ptr_level || rs2->array_size);
      
    • Remove extra blank lines after function defs (e.g., after scalarize_array_literal) to match the compact style in the original parser.
  4. Minor: Unused params and locals:

    • In scalarize_array_literal, hint_type defaults to TY_int if invalid, but the null check for elem_type could be simplified: type_t *elem_type = hint_type ? hint_type : array_var->type ? array_var->type : TY_int;. Drops the nested if.
    • var->init_val = 0; in a few spots (e.g., zero-length handling) is fine, but ensure it's set consistently—it's already good, just a nit.

Comment Improvements

The inline comments are mostly helpful, but some are wordy or could be more precise. Aim for concise, factual notes that explain why not just what.

  1. Update the placeholder comment (around line 1364):

    • Current: "Identify compiler-emitted temporaries that hold array compound literals. Parsing assigns these temporaries synthetic names via gen_name_to (".tN") and they keep array metadata without pointer indirection."
    • Suggested: "Detect compiler temporaries ('.tN' names) holding array compound literals. These retain array metadata (size/type) without decaying to pointers until needed."
      • Shorter, focuses on the key invariant (no auto-decay).
  2. Add a brief header comment to parse_array_compound_literal:

    /* Parse array compound literal contents: emit element stores to a temporary array.
     * Counts initializers and consumes the closing brace. Used only for array types. */
    void parse_array_compound_literal(var_t *var, /* ... */)
    
    • Explains purpose without repeating the PR description.
  3. Clarify scalarization calls:

    • In function params (around line 1650), add: // Decay array literals to scalars for non-pointer params, per C semantics.
    • In ternary handling: // Scalarize array literals if the other branch isn't pointer-like, to match types.
    • In assignment: // Decay RHS array literal to scalar if target is non-pointer (e.g., int x = (int[]){1}; uses first elem).
      • These tie back to the motivation (avoiding scalar collapse in pointer contexts) without verbosity.
  4. Zero-length handling (in read_expr_operand):

    • Current implicit via if (compound_var->array_size == 0). Add: // Zero-length arrays default to 0 constant (avoids garbage reads in scalar contexts).

For Tests in driver.sh

  • The new tests are great for coverage, but add a comment block at the top of the array sections:
    # Array compound literal tests: verify decay to pointer in assignments and args
    
  • In the sum test, the 0, 0 padding is clever for partial init, but add an expected output comment: # Expects sum=6 from first 3 elements.
  • Consistent with existing tests: Use try_ 0 everywhere (you did), and keep EOF indents clean.

These tweaks keep the code readable and maintainable without overhauling anything. Total changes: ~20 lines. Run a style check (assuming you have one) after, and it'll align nicely with the rest of the parser. If you push an updated commit, I can re-review the diffs.

cubic-dev-ai[bot] avatar Nov 07 '25 12:11 cubic-dev-ai[bot]

I defer to @ChAoSUnItY for confirmation.

jserv avatar Nov 14 '25 08:11 jserv

@lorettayao You can reply to or chat with @cubic-dev-ai so the AI bot can learn from your feedback and provide smarter code reviews over time.

jserv avatar Nov 19 '25 16:11 jserv

The CI error is caused by IR regression test, if you happened to have this test failed, you can execute make update-snapshots to update the IR snapshots, then push the updated snapshots so that the test would proceed.

ChAoSUnItY avatar Nov 23 '25 17:11 ChAoSUnItY

The CI error is caused by IR regression test, if you happened to have this test failed, you can execute make update-snapshots to update the IR snapshots, then push the updated snapshots so that the test would proceed.

@ChAoSUnItY When i update-snapshots, there are only the fib-riscv.json & hello-riscv.json been updated, is that normal? I wonder if the CI fail is because I did not update fib-arm.json and hello-arm.json.

lorettayao avatar Nov 25 '25 01:11 lorettayao

there are only the fib-riscv.json & hello-riscv.json been updated, is that normal?

Sometimes some changes only affect certain backend, which is normal. If you're not sure if this is correct, you can perform make check-snapshots to see the result first on local machine, if it succeeds, then it should be correct on CI, too.

ChAoSUnItY avatar Nov 25 '25 03:11 ChAoSUnItY

@jserv I have rebased to master, and I saw my branch is up to date with 'origin/master'. I am confused if the conflict is because of the rebase issue.

lorettayao avatar Nov 25 '25 08:11 lorettayao

I guess your origin points to your own repository on github instead of the upstream repository under sysprog21?

visitorckw avatar Nov 25 '25 08:11 visitorckw

I guess your origin points to your own repository on github instead of the upstream repository under sysprog21?

yes! Thank you for pointing out!

lorettayao avatar Nov 25 '25 08:11 lorettayao

@lorettayao, note that the commit message body should be wrapped at 72 characters.

DrXiao avatar Nov 27 '25 14:11 DrXiao