llava-cli: format batch --image descriptions according to --template
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [x] Low
- [ ] Medium
- [ ] High
Problem. Running llava-cli --image 1.jpg --image 2.jpg ... in batch mode generates several image descriptions in succession. Keeping the model in memory makes generation faster, but the output format is not very useful: the image file name is never mentioned.
Rationale. It would have been expedient to pick one specific output format, such as JSON. But in keeping with the versatility of llama.cpp's command-line options, it seemed reasonable to let users specify their own data format.
Improvisation. This pull request introduces an optional --template argument that formats the output of bulk image descriptions. If --template is not supplied, output is exactly as it was before, so existing behavior is unchanged.
Help screen. The following line is added to the -h help message.
--template STRING output template replaces [image] and [description] with generated output
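To make the placeholder semantics concrete, here is a minimal bash sketch of the substitution the help line describes. It is not the code from this pull request: render_template is a hypothetical helper, and the file name and description are made-up sample values.

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only: replaces every [image] and
# [description] placeholder in a template string.
render_template() {
    local template=$1 image=$2 description=$3
    # Quoting the pattern makes the square brackets literal instead of a glob class.
    local out=${template//'[image]'/"$image"}
    out=${out//'[description]'/"$description"}
    printf '%s' "$out"
}

# Sample values (assumptions, not output from a real model run).
rendered=$(render_template '<img src="[image]" alt="[description]">' 'cat.png' 'A cat on a sofa.')
printf '%s\n' "$rendered"
```

Placeholders are replaced wherever they occur, so the same template can mention [image] more than once, as the example further down does.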
Prerequisites. For this example, we create a shell script, describe.sh, to launch any particular llava model and options (yours will be different).
llama-llava-cli -ngl 16 \
-m ~/.local/share/models/Obsidian/obsidian-q6.gguf \
--mmproj ~/.local/share/models/Obsidian/mmproj-obsidian-f16.gguf \
-c 4096 "$@"
Next, we cd to a directory containing a few images and demonstrate the new --template option.
shopt -s nullglob
cd Pictures
printf -- "--image %q " *.png *.webm *.jpg *.jpeg | xargs describe.sh -p "Write a one paragraph caption for the image." --template '<figure><img src="[image]" alt="[image]"><figcaption>[description]</figcaption></figure>' --log-disable | tee data
The %q format of printf prints file names with spaces and special characters properly escaped. We could have used find instead. The nullglob option to shopt is needed because, by default, bash leaves a glob pattern that matches no files unexpanded, so the literal pattern would be passed along as if it were an image name. Setting nullglob turns that behavior off.
Images are processed one by one, and each description is formatted according to the template. The resulting data file looks like this.
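The effect of nullglob can be checked in isolation. The sketch below is an alternative way to build the --image arguments, using a bash array rather than printf %q. The scratch directory and the two empty sample files are assumptions made for the demo.

```shell
#!/usr/bin/env bash
# Sketch: collect matching images into an array and build --image arguments.
shopt -s nullglob                 # unmatched patterns expand to nothing

workdir=$(mktemp -d)              # scratch directory for the demo (assumption)
touch "$workdir/test pattern.png" "$workdir/ferry.jpg"
cd "$workdir" || exit 1

images=(*.png *.jpg *.jpeg)       # *.jpeg matches nothing and simply disappears
args=()
for f in "${images[@]}"; do
    args+=(--image "$f")          # quoting keeps names with spaces intact
done

printf '%s\n' "${args[@]}"        # one argument per line

# Clean up the scratch directory.
cd / && rm -rf -- "$workdir"
```

With the array in hand, the script could be called directly as describe.sh "${args[@]}" followed by the remaining options, which avoids the escaping step altogether. Without nullglob, the array would also contain the literal string *.jpeg.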
<figure><img src="test pattern.png" alt="test pattern.png"><figcaption> The colorful television screen displays the image of a fish tank with blue, red, yellow, green, and blue elements.
</figcaption></figure><figure><img src="trading patterns.png" alt="trading patterns.png"><figcaption> A computer monitor displaying a variety of graphs and diagrams.
</figcaption></figure><figure><img src="Youtube-button.png" alt="Youtube-button.png"><figcaption> The YouTube logo is red and white.
</figcaption></figure><figure><img src="20230218_215924.jpg" alt="20230218_215924.jpg"><figcaption> A small digital scale shows the number 378.
</figcaption></figure><figure><img src="dad.jpg" alt="dad.jpg"><figcaption> A person plays the grand piano in an exhibition hall.
</figcaption></figure><figure><img src="ferry.jpg" alt="ferry.jpg"><figcaption> A boat is docked at a port near a forest.
</figcaption></figure><figure><img src="github_error.jpg" alt="github_error.jpg"><figcaption> The image shows a screenshot of a screenshot of a screenshot of a screenshot of a screenshot of a screenshot of a screenshot of a screenshot of a screenshot</figcaption></figure>
We could have cleaned this up into a proper HTML page ourselves, but tools like HTML Tidy already exist for that.
tidy -i -o album.html data
<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.8.0">
<title></title>
</head>
<body>
<figure>
<img src="test%20pattern.png" alt="test pattern.png">
<figcaption>
The colorful television screen displays the image of a fish
tank with blue, red, yellow, green, and blue elements.
</figcaption>
</figure>
<figure>
<img src="trading%20patterns.png" alt="trading patterns.png">
<figcaption>
A computer monitor displaying a variety of graphs and
diagrams.
</figcaption>
</figure>
...
As you can see, the new --template feature makes it much easier to build web pages from AI-generated image descriptions.