
Feature request: Graphical GGUF viewer

Open ngxson opened this issue 1 year ago • 16 comments

Motivation

With the recent introduction of the eval-callback example, we now have more tools for debugging when working with llama.cpp. However, one tool that I feel is missing is the ability to dump everything inside a GGUF file into a human-readable (and interactive) interface.

Inspired by huggingface.js, where users can visualize the KV pairs and the list of tensors on huggingface.com, I would like to implement the same thing in llama.cpp. I find this helpful in these situations:

  • Debugging convert.py script when adding a new architecture
  • Debugging tokenizers
  • Debugging changes related to gguf (model splits for example)
  • Debugging tensors (i.e. display N first elements of a tensor, just like eval-callback)
  • Debugging control vectors
  • ... (maybe other usages in the future)

The reason I can't use huggingface.js is that it's browser-based, which makes reading a huge local file tricky. It also doesn't have access to quantized types (same for gguf-py).

Possible Implementation

Ideally, I want the implementation to be a binary named gguf-viewer that, when run, opens a web page at localhost:8080. Users can then go to the web page to explore the GGUF file. It will have these sections:

  • Complete list of KV
  • Tokenizer-related info (for example: list all tokens, lookup one token)
  • List of all tensors

ngxson avatar Apr 17 '24 04:04 ngxson

Have you seen:

gguf-dump for printing metadata ?

Or do you want something dynamic during the forward pass?

phymbert avatar Apr 17 '24 06:04 phymbert

Yes, I tried gguf-py, but it does not have access to quantized types.

ngxson avatar Apr 17 '24 06:04 ngxson

This could be quite fun. The web page can also generate a set of useful llama.cpp commands for that specific model (e.g. run main, server, etc) that can be copy-pasted for convenience.
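A minimal sketch of what that command generation could look like, in Python. The metadata key and the make_commands helper are hypothetical, purely for illustration; the main/server binary names are the ones mentioned above.

```python
# Hypothetical sketch: turn GGUF metadata into copy-pasteable llama.cpp
# commands. The metadata key and helper name are illustrative only.

def make_commands(model_path: str, metadata: dict) -> list[str]:
    """Suggest llama.cpp invocations tuned to the model's metadata."""
    ctx = metadata.get("llama.context_length", 2048)
    return [
        f'./main -m {model_path} -c {ctx} -p "Hello"',
        f"./server -m {model_path} -c {ctx} --port 8080",
    ]

cmds = make_commands("model.gguf", {"llama.context_length": 4096})
# Each suggested command embeds the model path and its native context length.
```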

ggerganov avatar Apr 17 '24 06:04 ggerganov

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jun 03 '24 01:06 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jul 18 '24 01:07 github-actions[bot]

@ngxson reopen? Also, I'd like to suggest similar functionality for imatrices. Or should I open a parallel FR?

oldgithubman avatar Jul 18 '24 02:07 oldgithubman

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Sep 04 '24 01:09 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Oct 25 '24 01:10 github-actions[bot]

This is something I have been planning on working on so I took the liberty to assign this task to myself.

I am putting together some designs and will post a link to them here soon. I am going to be requesting a bit of stakeholder info on this one after my initial designs to make sure the use cases are covered.

bandoti avatar Mar 10 '25 21:03 bandoti

Yes, feel free to take this task. Things have changed quite a lot since I created this issue; I feel it no longer serves my initial goal (to ease the process of adding new models), but it would be nice to have something like @ggerganov suggested above!

ngxson avatar Mar 10 '25 21:03 ngxson

I came up with an initial set of high-level features regarding the gguf-viewer program (see below). However, I am in need of help generating ideas regarding the issue-reporting process. I am trying to figure out what should go in with a GGUF viewer, and whether a separate tool should be created with broader scope to capture diagnostic information/issue reporting.

While my intent is not necessarily to discuss implementation details at the moment, I think a good solution for the tool is Python and Tkinter with a custom C extension to expose access to the GGML library. This also goes for the potential diagnostic tool, as Python would be great to: (1) spawn the server process; (2) use the OpenAI APIs; (3) tee the logs (if necessary); (4) load C extensions directly (to access the GGUF/GGML libraries).

  1. I would like to interactively explore GGUF files.
  2. Meta-Data, tensor info, and the actual tensors should be explorable.
  3. It should be possible to open more than one GGUF file at once.
  4. Tensors should be visible as both a summary and as visual blocks representing the binary data.
  5. Tensors should be shown as a high level overview and may be “zoomed into” for more details.
  6. The GGUF viewer should be minimal on dependencies and be simply deployable with the llama.cpp suite of programs. It should have access to the GGML/GGUF C APIs.
  7. Complete list of tokens should be explorable, and should be visible as both strings and numeric values.
  8. Use some sort of heatmap to relate tensor types to visual blocks—colour-coded by category.
  9. Generate formatted report of loaded model (HTML/Markdown/XML/JSON).
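As an illustration of point 9, a tiny Python sketch of rendering loaded-model info as a Markdown report. The metadata/tensor structures and the to_markdown helper are hypothetical, not an existing API.

```python
# Hypothetical sketch of feature 9: render loaded-model info as Markdown.
# The metadata dict and tensor records are illustrative placeholders.

def to_markdown(name: str, metadata: dict, tensors: list) -> str:
    """Build a Markdown report: a metadata list plus a tensor table."""
    lines = [f"# GGUF report: {name}", "", "## Metadata", ""]
    lines += [f"- **{k}**: {v}" for k, v in metadata.items()]
    lines += ["", "## Tensors", "", "| name | shape | type |", "|---|---|---|"]
    lines += [f"| {t['name']} | {t['shape']} | {t['type']} |" for t in tensors]
    return "\n".join(lines)

report = to_markdown(
    "model.gguf",
    {"general.architecture": "llama"},
    [{"name": "token_embd.weight", "shape": (4096, 32000), "type": "Q4_0"}],
)
```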

bandoti avatar Mar 31 '25 20:03 bandoti

Yeah the idea seems good.

Python and TKinter with a custom C extension to expose access to the GGML library

In fact, when I initially created this issue, the reason I proposed doing this in C++ was that there was no implementation of quantization outside of C++ at the time.

But things have changed a lot since then: gguf-py now has quants.py, which allows quantizing and dequantizing using numpy.
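To illustrate what such numpy-based quantization looks like, here is a self-contained sketch of the Q8_0 scheme (blocks of 32 weights sharing one fp16 scale). This is a simplified re-implementation for illustration, not the gguf-py code itself.

```python
import numpy as np

QK8_0 = 32  # Q8_0 block size in ggml

def quantize_q8_0(x: np.ndarray):
    """Toy Q8_0: per 32-element block, scale d = amax/127, q = round(x/d)."""
    x = x.reshape(-1, QK8_0).astype(np.float32)
    d = np.abs(x).max(axis=1, keepdims=True) / 127.0
    d[d == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.round(x / d).astype(np.int8)
    return d.astype(np.float16), q  # fp16 scale per block, int8 quants

def dequantize_q8_0(d, q):
    """Reconstruct weights: x ≈ d * q, flattened back to 1-D."""
    return (d.astype(np.float32) * q.astype(np.float32)).reshape(-1)

x = np.linspace(-1.0, 1.0, 64, dtype=np.float32)
d, q = quantize_q8_0(x)
y = dequantize_q8_0(d, q)
# Round-trip error stays within about half a quantization step per block.
```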

Going a bit further, I think it's also possible to do this entirely in a web environment (a bit like https://netron.app/ but with many more GGUF-specific functions). We could:

  • Build on top of the huggingface/gguf package, which allows access to KV metadata
  • Use custom dequantization functions (either reimplemented from the Python code, or I can expose these methods via my wasm binding)
  • Use the FileReader API to read the file chunk by chunk, allowing even large GGUF files to be loaded

ngxson avatar Mar 31 '25 21:03 ngxson

@ngxson Interesting projects—I will keep an eye on them!

I notice that in the default install we are not bundling the gguf-py libraries. Is this something we should bundle with the install? The main reason I ask is that I think it's important to make these diagnostic tools work out of the box on a llama.cpp install for those who are not necessarily interested in pulling in several ML Python dependencies. If it is not something we want to include by default, then naturally a Python extension would make more sense, as it can just wrap the ggml library directly.

The "user journey" I imagine is: (1) I have an issue with my model—or I'm curious about a new model; (2) double-click gguf-viewer.py to open a GUI; (3) open a model file—(or select an HF URI to download); (4) explore the model; (5) generate a report and attach to github issue.

I think a lot of people can benefit from this local-first approach, as it reduces the barrier to entry and makes the diagnostic tools more portable in that sense. Even popping open a browser and having a separate server process introduces cognitive load with firewall warnings, and so forth.

That being said, I would like to understand user journeys from the online-first perspective as well, as in on the cloud. I think it is fair to pull metadata from the models (as we have now), but "zooming in" to view the tensor blocks might pose some other issues. Perhaps we can satisfy both needs somehow.

bandoti avatar Apr 01 '25 17:04 bandoti

Closing because the work was already done in #12930

bandoti avatar Apr 19 '25 12:04 bandoti

I am reopening this issue as I closed it prematurely; several of the proposed features were not added in #12930, so this should remain open to iterate using that as a baseline.

bandoti avatar Apr 21 '25 13:04 bandoti

Yeah, @bandoti, I do think (4) on your user-journey list would be quite useful. Being able to view and explore the high-level graph representation of a "model", rather than just the meta and tensor data in a GGUF, would be very cool. I'm envisioning some equivalent of opening an .onnx in netron. Basically, a more seamless version of putting ggml_graph_dump_dot in your code.

Is that similar to what you meant, or am I off on my own tangent?

LukeRouleau avatar Apr 24 '25 21:04 LukeRouleau

@LukeRouleau My initial vision for it is not so much a graph, but more a block-level visual explorer: initially a birds-eye view of the data, where double-clicking a block opens a new window with a view of the selected tensor block. The blocks would be color-coded (or greyscale shades), and when one is moused over in the graphical view, the corresponding entry in the tree view of tensors would be highlighted, and vice versa. Think a sort of disassembly view of the data.

Certainly this new window could expand to provide a graph/tree view of the selected tensors. I am going to (at some point soon) post some drawings of the design I was considering. But a quick description: a column view with three columns: (a) a list view of metadata; (b) a tree view of tensors by grouping; (c) a block view of tensor data, scaled down but proportional to the actual data.
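The scaled-down, proportional block view could be sketched roughly like this in numpy: mean-pool a 2-D tensor down to a small grid whose cells summarize proportional tiles of the data. The block_overview helper is hypothetical.

```python
import numpy as np

def block_overview(t: np.ndarray, max_cells: int = 64) -> np.ndarray:
    """Downsample a 2-D tensor to at most max_cells x max_cells cells by
    mean-pooling, so each cell summarizes a proportional tile of data."""
    rows, cols = t.shape
    rf = max(1, rows // max_cells)  # pooling factor per axis
    cf = max(1, cols // max_cells)
    # trim so the shape divides evenly, then average each rf x cf tile
    t = t[: rows - rows % rf, : cols - cols % cf]
    return t.reshape(t.shape[0] // rf, rf, t.shape[1] // cf, cf).mean(axis=(1, 3))

w = np.arange(256 * 128, dtype=np.float32).reshape(256, 128)
view = block_overview(w)  # 64 x 64 grid of tile means
```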

bandoti avatar Apr 28 '25 19:04 bandoti

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jun 13 '25 01:06 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jul 28 '25 01:07 github-actions[bot]

I love the idea !!!!

ServeurpersoCom avatar Oct 04 '25 19:10 ServeurpersoCom

I built this as an exercise; now I'll be able to add it to my llama.cpp dev server page, so when I click on a .gguf, it'll launch the backend and open the proxied viewer page lol. I'd love some ideas on how to visualize the weights, maybe as heatmaps or graphs; I could even make some in D3.js, like the one on the root page of my domain (showing a real mesh network). Next, I'm planning to experiment with heatmaps to visualize quantization quality and block entropy. I'm already working on a raw visualization using a normalized canvas (-1 to +1): when you drag the view, the backend streams the tensor weights as a raw raster, allowing me to plug any algorithm into the backend side, with the projection streamed in real time so I can navigate large volumes of data smoothly.

https://github.com/ggml-org/llama.cpp/compare/master...ServeurpersoCom:llama.cpp:llama-gguf-viewer

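The normalized canvas (-1 to +1) streaming described above could be sketched as follows; to_raster is a hypothetical helper that clips weights into [-1, +1] and maps them onto 8-bit grayscale pixels for the canvas.

```python
import numpy as np

def to_raster(weights: np.ndarray) -> bytes:
    """Map weights clipped to [-1, 1] onto 8-bit grayscale pixels
    (-1 -> 0, +1 -> 255) suitable for streaming to a canvas."""
    clipped = np.clip(weights.astype(np.float32), -1.0, 1.0)
    pixels = np.round((clipped + 1.0) * 127.5).astype(np.uint8)
    return pixels.tobytes()

# Out-of-range values saturate at the edges of the grayscale ramp.
raw = to_raster(np.array([-1.0, 0.0, 1.0, 2.0]))
```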

ServeurpersoCom avatar Oct 04 '25 23:10 ServeurpersoCom

Next, I’ll try to visualize the quantization blocks, those little per-group slices (like 32×N tiles), and maybe add some filters to highlight scale or residual patterns. Later, I’ll make it possible to compare the differences between two GGUF models.


ServeurpersoCom avatar Oct 05 '25 03:10 ServeurpersoCom

@ServeurpersoCom I am somewhat behind on this, but I will try to get the drawings done for my initial vision, along with some ideas for the graph view, which I would be happy to contribute as soon as I am able.

Regarding the implementation, I would suggest that if you are experimenting on the server side of things, it would probably be beneficial to follow the new SvelteKit #14839 changes to maintain similar Look & Feel as the server along with the other benefits SvelteKit provides.

bandoti avatar Oct 05 '25 20:10 bandoti

Built this in one day, as KISS as possible: pure C++/GGML + vanilla JS. I still need to transfer FP32 weights in binary instead of JSON to squeeze out more performance. Next step: time to Sveltify it!

https://www.youtube.com/watch?v=gJSp-Ske-Vc

ServeurpersoCom avatar Oct 05 '25 20:10 ServeurpersoCom

@ServeurpersoCom Very cool! Showing the weight values when hovering on the pixels would be useful.

ggerganov avatar Oct 06 '25 07:10 ggerganov

@ServeurpersoCom Very cool! Showing the weight values when hovering on the pixels would be useful.

Sure, on it :) OBS doesn’t capture Firefox tooltips, so when I tried to show the token hex and info, it didn’t show up in the video 😅

ServeurpersoCom avatar Oct 06 '25 08:10 ServeurpersoCom

I'm noticing some strange artifacts on certain slices of specific models: repeated patterns along one axis, which could either be mathematically expected or a quantization glitch. When hovering over a pixel, I'll display the raw pre-dequantization value, the exact quantization formula applied, and the resulting FP32 value. To make visual comparison between different GGUF quantizations easier, I'll upgrade the backend to support multi-threaded tensor streaming. That way, we can open several browser tabs with different quant levels of the same model and compare them at the same slice.

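As a reference for the per-pixel tooltip, decoding a single Q4_0 block could look like the sketch below. It follows my reading of ggml's dequantize_row_q4_0: a 2-byte fp16 scale, then 16 bytes of packed 4-bit quants, low nibbles filling the first half of the block and high nibbles the second, each offset by 8. Treat the layout details as an assumption to verify against the ggml source.

```python
import numpy as np

QK4_0 = 32  # weights per Q4_0 block

def dequantize_q4_0_block(block: bytes) -> np.ndarray:
    """Decode one Q4_0 block: fp16 scale d, then 16 bytes of packed
    4-bit quants; each weight is d * (q - 8)."""
    assert len(block) == 2 + QK4_0 // 2
    d = np.frombuffer(block[:2], dtype=np.float16)[0].astype(np.float32)
    qs = np.frombuffer(block[2:], dtype=np.uint8)
    lo = (qs & 0x0F).astype(np.int32) - 8  # -> outputs [0:16]
    hi = (qs >> 4).astype(np.int32) - 8    # -> outputs [16:32]
    return d * np.concatenate([lo, hi]).astype(np.float32)

# d = 1.0 (fp16); every byte 0x98: low nibble 8 -> 0.0, high nibble 9 -> 1.0
blk = np.float16(1.0).tobytes() + bytes([0x98] * 16)
vals = dequantize_q4_0_block(blk)
```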

ServeurpersoCom avatar Oct 06 '25 08:10 ServeurpersoCom

Btw can this program stream tensor data directly from HF? If yes, it could become a static web app that would run straight in the browser.

ggerganov avatar Oct 06 '25 09:10 ggerganov

For sure, we can run the tiny backend on an HF Space. I just need to optimize the communication between the frontend and the streaming layer, avoiding resending data that’s already displayed on the client, and possibly using some lightweight image-compression technique to speed things up.

ServeurpersoCom avatar Oct 06 '25 09:10 ServeurpersoCom

If we rely only on the GGML public API, the tooltip can safely decode any block using ggml_get_type_traits(type)->to_float, which gives us the FP32 values directly. That works fine and remains fully future-proof, no need to hardcode every quantization format.

However, if we want to display more detailed info (like scale, zero-point, bias, etc.), I currently have to re-implement each quantization layout manually (q4_0, q5_1, q6_K, ...). That's fragile, because every time a new format appears in GGML, the viewer will lag behind until I update the parsing logic.

So either we settle for showing only the dequantized FP32 output, or, if there's a cleaner way to access the raw quantization parameters through GGML without duplicating the internal structures, I'd love to use that instead. Does GGML expose anything like that, or is to_float really the only safe public entry point?


ServeurpersoCom avatar Oct 06 '25 13:10 ServeurpersoCom