
Documentation + clearing lots of unknowns

Open floriandotorg opened this issue 11 months ago • 13 comments

Hey everyone!

First off, a big thank you for your amazing work! LEGO Island was one of my childhood favorites, and getting to see how it all works now is brilliant.

Now, based on this code, a friend and I are remaking LEGO Island for the browser (https://github.com/floriandotorg/brickolini-island). To do that, I’ve been digging into the codebase as much as possible.

In this PR, I’ve added names for probably hundreds of unknowns. I also started documenting the code a bit, since I thought that might be useful to share as well.

Additionally, I’ve included a small AI script I wrote. It uses Gemini to ask questions about the codebase. It’s not perfect, but it helps clarify more complex parts here and there.

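#!/bin/bash

# Concatenate every C++ source and header file into one text file.
output="all_code.txt"
> "$output"  # truncate any previous output

# IFS= and -r keep leading whitespace and backslashes in file names intact.
find . -type f \( -name "*.cpp" -o -name "*.h" \) | while IFS= read -r file; do
  echo "<$file>" >> "$output"  # mark where each file starts
  cat "$file" >> "$output"
  echo "" >> "$output"
done

import os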

import dotenv
from google import genai
from google.genai import types

dotenv.load_dotenv()

with open("all_code.txt", "r", encoding="utf-8") as f:
    code_content = f.read()

question = "What does MxDSBuffer do? Explain in detail."

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

res = client.models.generate_content(
    model="gemini-1.5-pro",
    contents=f"Here is the code:\n\n{code_content}\n\n{question}",
    config=types.GenerateContentConfig(
        temperature=1,
        system_instruction="You are a master reverse engineer. You will be given a whole codebase and a question. You will need to answer the question based on the codebase.",
    ),
)

print(question)
print(res.text)

Unfortunately, the codebase is too large (about 1.5 million tokens) to feed into more powerful LLMs like GPT-4.1 at once.
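One way to work around the size limit is to split the sources into batches that each fit a model's context window. Below is a minimal sketch of that idea, not something from this PR: the 4-characters-per-token ratio is a rough assumption (real tokenizers differ), and `estimate_tokens` and `batch_files` are hypothetical helper names.

```python
CHARS_PER_TOKEN = 4  # rough heuristic (assumption); real tokenizers differ


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def batch_files(files: dict[str, str], budget: int) -> list[list[str]]:
    """Greedily group file names into batches whose estimated size fits the budget.

    A single file that exceeds the budget still gets a batch of its own.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for name, text in sorted(files.items()):
        cost = estimate_tokens(text)
        # Start a new batch when adding this file would exceed the budget.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each batch could then be sent with the same question, though answers that depend on code in another batch would be missed, so this trades completeness for fitting the limit.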

This PR isn’t really meant to be merged; it's more of a reference. That said, I think the names I’ve added for the unknowns are pretty accurate and could likely be merged into the main codebase.

I'll keep this up to date as I go.

Cheers.

floriandotorg, May 07 '25 09:05

I personally like the m_buttonX and m_buttonY names instead of just m_X and m_Y. In the context of what you are working with it does not matter, but it feels like it's an X and Y value attached to something instead of just a random value.

JPeisach, May 07 '25 11:05

I found it a bit confusing because it looks like a button press state or something similar, when it's really just the mouse position. Maybe GetXPosition() is a better name?

floriandotorg, May 07 '25 12:05

Maybe?

I also floated the idea a while ago on Matrix of using Doxygen, maybe others would like that - though it might require annotating literally everything

JPeisach, May 07 '25 12:05

@floriandotorg thanks for this!

There's definitely valuable stuff in this PR, but as you said yourself, it's not ideal to be merged directly. What I'd suggest:

  1. Break up naming improvements into individual PRs, i.e. have 1 PR for each compilation unit. This makes it easier to digest and review.
  2. Keep and maintain documentation and related utilities outside of this repository. I'm not a fan of having it all inside one repository, since it adds considerable overhead over time (as more and more is added to the docs). Better to keep it separate (e.g. on a website, another repo, ...).
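To illustrate the first suggestion, a list of changed files could be split into per-PR groups roughly like this. This is only a sketch: it approximates a compilation unit by pairing a `.cpp` file with a header of the same base name, which is an assumption, and `group_by_compilation_unit` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_compilation_unit(paths: list[str]) -> dict[str, list[str]]:
    """Group changed C++ files by base name, so foo.cpp and foo.h land together."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        p = PurePosixPath(path)
        # Only C++ sources and headers count; other files are ignored.
        if p.suffix in {".cpp", ".h"}:
            groups[p.stem].append(path)
    return {unit: sorted(files) for unit, files in sorted(groups.items())}
```

Each resulting group would then map to one branch and one PR.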

With regard to documentation, maybe the isledecomp team will come up with a format that everyone can contribute to as part of this organization, but we don't have an initiative for it at this time.

foxtacles, May 07 '25 18:05

Good evening @foxtacles,

  1. Makes sense. My time is a bit limited, but I'll see what I can do.
  2. Alright, as soon as you've decided how you wanna do it, I'll happily contribute. Have to say, though, I like the idea of using Doxygen, especially since that way everything is also documented inside the code, making it easier to follow. But you decide and let me know.

floriandotorg, May 07 '25 18:05

  2. Alright, as soon as you've decided how you wanna do it, I'll happily contribute. Have to say, though, I like the idea of using Doxygen, especially since that way everything is also documented inside the code, making it easier to follow. But you decide and let me know.

We'll consider it, nothing is ruled out yet. My main concern with in-code documentation is that it would add too much noise and distract from the primary goal of this repo, which is to provide an accurate decompilation. We already have a lot of "decomp" documentation (reccmp annotations) embedded in the code, so the documentation would be added on top of that.

Maybe we could maintain a fork that has the documentation in the code. This way we can cleanly separate the efforts of decompilation and documentation.

Happy to hear opinions from everyone else though.

foxtacles, May 07 '25 18:05

Good point. Maybe a possibility would be a repo with just skeleton code + annotations.

floriandotorg, May 07 '25 18:05

You gave me ideas. I wanted to try the skeleton idea and thought I could use AI to create some basic annotations. But man, it just completely documented the code. Now we have a fully functional Doxygen doc—including class hierarchies and diagrams. My hand-made docs are also included: https://github.com/floriandotorg/isle-documentation

[Screenshot: CleanShot 2025-05-07 at 22 54 35]

floriandotorg, May 07 '25 20:05

It even added documentation for unknown functions:

[Screenshot: CleanShot 2025-05-07 at 22 57 10]

This needs to be validated, of course.

floriandotorg, May 07 '25 20:05

Looks like a viable approach to me. We could take a version of that into the isledecomp org, but if we want to bootstrap it with AI I'd suggest that all AI-generated parts should be marked as such. Ideally everything would eventually be manually verified (and improved) - generally we are striving for maximum accuracy and correctness in everything we do.

foxtacles, May 07 '25 21:05

@floriandotorg Just some AI related insight. To take advantage of smaller, smarter models, you could index the codebase by function or similar into a vector store. Using an agentic approach, you could then fetch the relevant functions and their relationships, providing more targeted context.
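As a rough illustration of the retrieval step, here is a minimal in-memory sketch. It is not a real vector store: a bag-of-words count stands in for learned embeddings, the class `FunctionIndex` and its methods are hypothetical, and splitting the C++ sources into functions is assumed to have happened already.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in 'embedding': counts of lowercased identifier-like tokens."""
    return Counter(t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class FunctionIndex:
    """Tiny index mapping function names to their source and vector."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, str, Counter]] = []

    def add(self, name: str, source: str) -> None:
        self.entries.append((name, source, embed(source)))

    def search(self, query: str, k: int = 3) -> list[str]:
        """Return the names of the k functions most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [name for name, _, _ in ranked[:k]]
```

The top-ranked functions would then be pasted into the prompt in place of the whole codebase. With real embeddings the ranking should capture meaning rather than just shared identifiers.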

I'd also advise giving Llama 4 Scout a go. It's got a 10M-token context window :)

PyroFilmsFX, May 08 '25 03:05

Sounds good, so I'll do the following:

  1. I'm gonna prepare some PRs to clear the unknowns; I'd like to have those in the code base first.
  2. I like the idea of the vector store, so I'll give it a try. (I already tried models with larger context windows; even Gemini 1.5 works, but it turned out that these models are not smart enough to give real insights.)
  3. I will mark everything AI generated with “[AI]” (good point).

After this, when everybody's happy, feel free to move the repo into the org.

floriandotorg, May 08 '25 10:05

64,830,386 tokens later...

I tinkered around a lot with the prompt, but there is still A TON of cleaning up to do. Everything is marked with [AI], so we can verify step-by-step: https://floriandotorg.github.io/isle-documentation/
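To track that step-by-step verification, something like the following could report how many comment lines still carry the marker. This is a sketch under assumptions: it only looks at lines that begin with a comment marker, it assumes the “[AI]” tag appears literally inside the comment, and `count_ai_comments` is a hypothetical helper.

```python
import re

AI_TAG = "[AI]"
# Lines that begin with //, /*, or * (the continuation of a block comment).
COMMENT_START = re.compile(r"^\s*(//|/\*|\*)")


def count_ai_comments(source: str) -> tuple[int, int]:
    """Return (ai_tagged, total) for lines that begin with a comment marker."""
    total = tagged = 0
    for line in source.splitlines():
        if COMMENT_START.match(line):
            total += 1
            if AI_TAG in line:
                tagged += 1
    return tagged, total
```

Once a comment has been checked by hand, removing its tag would make the tagged count drop, which gives a simple progress measure per file.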

floriandotorg, May 10 '25 21:05

I think for now no one in the team has time to dedicate to this. I still think we should set up some ways to host/provide documentation though, so I'm hoping we will get back to this in the future.

foxtacles, Jun 12 '25 00:06