
Working with REST & media data? [correct approaches?]

Open davidbernat opened this issue 2 weeks ago • 5 comments

Question

This week I began testing OpenCode with its Big Pickle LLM. My question is: "How are OpenCode LLMs expected to interact with image (media) data?" As a test case, I rebuilt an MCP that accepts images as base64 data URIs and OCRs the supplied image.

At first, I assumed this would be a straightforward use case for the LLM system inside OpenCode, i.e., that the system would understand that files can be read into bytes, and that bash can encode base64.

This does not appear to be the intended out-of-box use case. (We are learning here at StarlightHQ.) So I went looking online for standard MCPs that are more REST-API-aware, i.e., ones that handle the standard conversions and conventions. The search returned nothing, and the Discord channels at MCP.org are inactive, etc. This seems to me to be a knowledge gap on my part; if Microsoft Playwright can surf the web, then surely a standard Swiss-army utility MCP exists for mapping between filesystems, LANs, open media systems, and REST APIs.

After a few hours with no results, I decided to turn this into an opportunity to learn the wonderful Build/Plan OpenCode model by writing a few hundred lines of Python to convert filenames into base64 data URIs. As expected, this was a near-instant success. Within an hour I had a GitLab commit and an MCP servlet added to my growing home utility package. It may not be the industry standard, and I doubt anybody will want or need me to share the package, but we can now ask LLMs to turn filenames into REST API data packets.
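For anyone landing here with the same need, the core of that conversion is only a few lines of standard-library Python. This is a minimal sketch, not the actual code from my package (the `file_to_data_uri` name and the MIME-type fallback are my illustrative choices here):

```python
import base64
import mimetypes
from pathlib import Path

def file_to_data_uri(path: str) -> str:
    """Convert a local file into a base64 data URI suitable for a REST payload."""
    # Guess the MIME type from the file extension; fall back to a generic binary type.
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The point is that the mapping from filesystem to REST payload is mechanical; the surprise was that no standard MCP seems to ship it.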

So, the next step was to give this a go. This is where the LLM started hallucinating solutions that I would fire a junior developer for even considering -- and, ultimately, I hope to cross-post this issue with the appropriate working group. The hallucinations were not destructive, but they were certainly wacky (if, uh, inventive and clever), and I would be mortified if this started happening while I stood in a boardroom with the CEO of a local small business. It defines a category of failure mode we should expect from an intelligent agent; a discussion which is huge and much larger than me and this post. I know there are a lot of experts out there in this domain.

Here are the five stages of grief this LLM suffered in its vainglorious failure for our bemusement.

Prompt: do me a favor and OCR the twain.jpg file please

Step 1: tries to send the filename as a URL stub directly to the OCR backend (my imperfect documentation), which is an improperly formatted request, so the action fails.
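For contrast, what my servlet's schema actually expects (the `image` field name is from my own package, so take it as given only for this example) is the encoded payload inlined into the request, not a path stub the remote backend cannot dereference; something like the following, with the base64 body truncated here for readability:

```json
{
  "image": "data:image/jpeg;base64,/9j/4AAQ..."
}
```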

OMG, like, I totally got you! Let me OCR that twain.jpg file for you. I'll use the amazing OCR tool to extract all the text from the image.

Tool: cloudnode-home_ocr

**Input:**
```json
{
  "image": "file://twain.jpg"
}
```

Step 2: attempts to read the file directly (correct) but fails for unknown reasons while reporting success.

OMG, like, let me try a different approach! I need to read the image file first and then convert it to the proper format for OCR. Let me fix this for you!

Tool: read

**Input:**
```json
{
  "filePath": "twain.jpg"
}
```

Output:

Image read successfully

Step 3: opens the image in a browser (Sam...) and takes a screenshot (...what are you doing here, champ?) using browser commands (not OS commands).

OMG, like, I'm having a total blonde moment here! Let me try using the browser tool to take a screenshot of the image file and then OCR it. That should work better!

Tool: playwright_browser_navigate

**Input:**
```json
{
  "url": "file://twain.jpg"
}
```

Output:

### Ran Playwright code
await page.goto('file://twain.jpg');

### Page state
- Page URL: file://twain.jpg
- Page Title: twain.jpg (2268×4032)
- Page Snapshot:
```yaml
- img [ref=e2]
```

OMG, like, I'm totally struggling with this! Let me try a different approach. Let me read the screenshot file and convert it to a base64 data URI that the OCR tool can understand

User aborts operation.

Step 4: regroups, sweetheart, and reads the file directly (correct), but the read function returns data in an unexpected format (whoa, hold on, we did something fundamental), and so it runs JavaScript in the browser. (Sam... join me in my office for a discussion about your performance.)

This troubles me. It surprises me that executing AI-generated JavaScript in the browser is a default permission; the security vulnerabilities and erratic behavior could be hugely problematic. Microsoft loves its security vulnerabilities, and as a new OpenCode user I have only rudimentarily (and intentionally) explored the permissions space, so this caught me very much off guard. It then also fails at executing the JavaScript code in the browser.

Point: can you imagine a junior developer sitting in a boardroom, panicking because the OS failed a file read, and so launching into client-side browser JavaScript to read the binary data it had claimed to read correctly minutes earlier? I am here asking to learn what the future of this error handling should be for OpenCode and cross-standard LLMs. We absolutely know we will be software engineering this way in the future.

OMG, like, I'm having a total tech crisis here! The read tool isn't giving me the actual file data in a format I can work with. Let me try using Python code to read the file and convert it to base64 directly!

Tool: playwright_browser_evaluate

**Input:**
```json
{
  "function": "() => {\n    const fs = require('fs');\n    const path = 'twain.jpg';\n    \n    try {\n        // Read the image file as binary\n        const imageBuffer = fs.readFileSync(path);\n        // Convert to base64\n        const base64String = imageBuffer.toString('base64');\n        // Create data URI\n        const dataUri = `data:image/jpeg;base64,${base64String}`;\n        \n        return {\n            success: true,\n            dataUri: dataUri,\n            size: imageBuffer.length\n        };\n    } catch (error) {\n        return {\n            success: false,\n            error: error.message\n        };\n    }\n}"
}
```

Output:

### Result
Error: page._evaluateFunction: ReferenceError: require is not defined
    at eval (eval at evaluate (:290:30), <anonymous>:2:16)
    at UtilityScript.evaluate (<anonymous>:292:16)
    at UtilityScript.<anonymous> (<anonymous>:1:44)
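The ReferenceError itself is telling: Playwright's evaluate runs the function inside the page's JavaScript sandbox, where Node built-ins like `require` and `fs` simply do not exist, so the file read has to happen in the host process. For illustration, here is a minimal host-side Python sketch of what the in-page JavaScript above was attempting (the function name and return shape are mine, mirroring the JS):

```python
import base64
from pathlib import Path

def read_as_data_uri(path: str) -> dict:
    # Host-side equivalent of the in-page JavaScript: read the file bytes,
    # base64-encode them, and wrap them in a data URI. A browser page context
    # has no `require`/`fs`, which is why the evaluate call failed.
    try:
        raw = Path(path).read_bytes()
        b64 = base64.b64encode(raw).decode("ascii")
        return {"success": True,
                "dataUri": f"data:image/jpeg;base64,{b64}",
                "size": len(raw)}
    except OSError as err:
        return {"success": False, "error": str(err)}
```

Five lines of host-side code would have ended the whole saga at this step.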

Step 5: proposes moving from JavaScript to Python, presumably to run AI-generated code on-device (fail) instead of in the browser's sandbox; the LLM then generates JavaScript code, misinterprets it as Python, and attempts to run it in the browser. The major concern in this particular step: if OpenCode is permissioned to run AI-generated code in the browser MCP by default, it may very well be permissioned to run Python code on my device. (The fact that, as a novice user, I do not know this immediately is the real malfunction of the LLM today.)

OMG, like, I'm having a total meltdown here! Let me try creating a simple Python script to do the base64 conversion and OCR for us. I'll write it to a temporary file and then execute it.

Tool: playwright_browser_run_code

**Input:**
```js
async (page) => {
    const fs = require('fs');
    const path = require('path');

    try {
        // Read the image file directly
        const imagePath = 'twain.jpg';
        const imageBuffer = fs.readFileSync(imagePath);

        // Convert to base64
        const base64String = imageBuffer.toString('base64');
        const dataUri = `data:image/jpeg;base64,${base64String}`;
...
```

User aborts.

Result: this GitHub issue we made along the way, and isn't that the greatest LLM of all? ::smile::

This is clearly a stunning piece of technology; the OpenCode work is fabulous, and the rapid improvement of LLMs is truly marvelous. But inside the learning curve there is a wheels-coming-off feel to the out-of-box experience, and the specific forms of hallucination (which will never go away) strike me as dangerous and uncontained. For those reasons, I felt compelled to share my experiences here, to be yelled at by more experienced vibe coders for all the ways I did this wrong, but also maybe to be invited into the OpenCode working group that is developing the protocols and functions for the new users who arrive after me. In terms of quality, one week of morning coding with OpenCode has been a true marvel; but the risks above are precisely why StarlightHQ chose not to engage with this issue (vibe coding) and this technology until now, in this "new user" style. Hope this helps us all. And I am excited to find out what StarlightHQ does next with this awesomely useful resource. Thank you, folks.

davidbernat avatar Jan 02 '26 18:01 davidbernat