Accept image pastes from the clipboard for multi-modal LLMs to consume
Description of the new feature
When using CLI-based multi-modal LLMs for coding, it is common to paste a screenshot or image to the LLM for reference. On Windows I have to take a screenshot, then drag and drop the file into the terminal chat, which works but is awkward and requires context switching.
Proposed technical implementation details
I'm not sure how *nix-based shells handle this, but it is very cool and handy. In the original Claude Code demo, you can see them paste the image directly into the shell.
Hooking up the previous discussion: https://github.com/microsoft/terminal/discussions/19397
Summing it up: We don't have any idea what format the application might be expecting to receive an image in. Would we paste a path? Raw image data?
@lhecker suggested that we could write the image to a temporary path, hold the handle open for ~10 minutes, yeet the filename into the input stream, and let the application do with it what it will.
This absolutely will not work if you paste into e.g. a process running in WSL.
If we wanted to support something like this, I was thinking a data URI with the encoded contents would be more useful, because it at least has the potential to work across a remote connection. Otherwise the app might as well just access the clipboard directly.
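To make the data URI idea concrete, here is a minimal sketch in Python (for brevity only; nothing like this exists in Terminal today). `image_to_data_uri` is a hypothetical helper, and it assumes the clipboard image has already been encoded as PNG bytes:

```python
import base64


def image_to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI (RFC 2397 style)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The 8-byte PNG signature stands in for a real clipboard image here.
png_signature = b"\x89PNG\r\n\x1a\n"
uri = image_to_data_uri(png_signature)
print(uri)
```

Because the result is plain ASCII text, it can travel over anything that carries terminal input, including SSH or WSL. The cost is size: base64 inflates the payload by roughly a third.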
It would probably require a warning beforehand so the user knows they're about to paste something large, and it should maybe only be allowed when bracketed paste is enabled. It's also only worth doing if there are actually apps that want to make use of this functionality.
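A rough sketch of that gating logic, again in Python purely for illustration. The escape sequences are the standard bracketed paste markers (`ESC[200~` and `ESC[201~`), but `prepare_image_paste`, the `user_confirmed` flag, and the 64 KiB threshold are all assumptions made up for this example:

```python
from typing import Optional

BRACKETED_PASTE_START = "\x1b[200~"
BRACKETED_PASTE_END = "\x1b[201~"
LARGE_PASTE_THRESHOLD = 64 * 1024  # hypothetical cutoff, in characters


def prepare_image_paste(
    payload: str,
    bracketed_paste_enabled: bool,
    user_confirmed: bool = False,
) -> Optional[str]:
    """Return the text to write to the input stream, or None to refuse.

    Refuses when the application has not enabled bracketed paste, and
    when the payload is large and the user has not confirmed the warning.
    """
    if not bracketed_paste_enabled:
        return None
    if len(payload) > LARGE_PASTE_THRESHOLD and not user_confirmed:
        return None
    return f"{BRACKETED_PASTE_START}{payload}{BRACKETED_PASTE_END}"


small = prepare_image_paste("data:image/png;base64,AAAA", bracketed_paste_enabled=True)
print(repr(small))
```

Requiring bracketed paste means the application has opted in (via `ESC[?2004h`) and can tell where the pasted payload starts and ends, instead of treating a huge blob as typed input.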
For a good example of how to handle this, see how iTerm2 does it when using Claude Code.