Automatic Output Name from `<title>`
`monolith http://www.example.com/ --some-flag` could automatically save the page to `Example Domain.html`,
i.e. default to the page's `<title>` as the output file's name, with a configurable format (case, components such as `{{todaysDate}} - {{title}}.html`) set in the config file `$HOME/.config/monolith/config`.
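A config along these lines could look something like this (entirely hypothetical keys and syntax, just to sketch the idea, reusing the `{{…}}` placeholders from above):

```
# $HOME/.config/monolith/config  (hypothetical)
output-name = {{todaysDate}} - {{title}}.html
# how to treat the title's case: keep | lower | kebab
title-case = keep
```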
Hi, I wrote a small bash/python script that does something along those lines.
It uses BeautifulSoup4, a Python library, to get the title of a web page and inserts it as the name of monolith's output file.
It takes as input a link you have in the clipboard.
get_title() {
  CLIP_URL="$(xclip -o -selection clipboard)" python - <<'END'
import os
from urllib.request import urlopen

from bs4 import BeautifulSoup

# the target URL is handed in via the environment
url = os.environ['CLIP_URL']

# fetch and parse the page, then print its <title>
soup = BeautifulSoup(urlopen(url), 'html.parser')
print(soup.title.get_text().strip())
END
}

monolith "$(xclip -o -selection clipboard)" -o "$(get_title).html"
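If pulling in BeautifulSoup is unwanted, the same title lookup can be done with just the standard library's `html.parser` — a minimal sketch (fetching would still happen via `urlopen` as above):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside <title> elements."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```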
Hey @fretzo, thank you for the code! I think I'll start with something like %T first, then add %U for URL, %D for date, etc.
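A rough sketch of how that expansion could work (the placeholder names come from the comment above; `expand_name` and the exact semantics of `%U`/`%D` are made up for illustration):

```python
from datetime import date
from urllib.parse import urlparse

def expand_name(template: str, title: str, url: str) -> str:
    """Fill %T/%U/%D placeholders in an output-name template."""
    values = {
        '%T': title,                     # page title
        '%U': urlparse(url).netloc,      # host part of the URL
        '%D': date.today().isoformat(),  # today's date, ISO format
    }
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template
```

So `expand_name('%T.html', 'Example Domain', 'http://www.example.com/')` would yield `Example Domain.html`.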
Mhh, title is one (good) option.. :thinking:
I would actually suggest that, by default, it download to a file whose name encodes the original URL, and write to stdout only when invoked with `-c` (I'd remap `--no-css` to something else; `-c` is used this way in many common shell tools)..
Here's some inspiration for how this could look. I've been using this shell snippet, `/usr/local/bin/wa`, for years, mostly to save PDFs off the internet:
#!/bin/bash
# download a file, using a simplified form of its origin URL as the name.
URL="$*"
# strip the scheme, turn '/' into '--', decode %XX escapes via echo -e,
# replace spaces, then flatten any slashes the decoding revealed.
file=$(echo "${URL}" | echo -e "$(sed -r 's-.*://--;s+/+--+g;s/%/\\x/g;s/ /-/g')" | sed 's|/|__|g')
# tuck a leading "www." in front of the extension so names sort by domain.
[[ "$file" =~ ^www\. ]] && suffix="${file##*.}" && file="${file#www.}" && file="${file%.*}.www.${suffix}"
wget --server-response --verbose --user-agent="Mozilla/5.0 (Linux; KHTML, like Gecko)" --output-document="$file" "${URL}"
and another script, `/usr/local/bin/unwa`, to undo the transform, which usually works:
#!/bin/bash
# try to regenerate the original URL from a file name (reverse of wa).
file="$*"
# move "www." back to the front if wa tucked it before the extension.
[[ "$file" =~ \.www\. ]] && suffix="${file##*.}" && file="${file%www.*}" && file="www.${file%.*}.${suffix}"
# drop any directory prefix, restore the scheme and the '/' separators.
echo "$file" | sed -r 's+^(.*/)*+http://+;s+--+/+g'
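For what it's worth, the same pair of transforms can be sketched in Python (an approximation of the sed/bash logic above, not byte-for-byte identical):

```python
import re
from urllib.parse import unquote

def wa_name(url: str) -> str:
    """Simplify a URL into a flat file name (the `wa` transform)."""
    name = re.sub(r'.*://', '', url)   # strip the scheme
    name = name.replace('/', '--')     # encode path separators
    name = unquote(name)               # decode %XX escapes
    name = name.replace(' ', '-')      # spaces -> dashes (lossy)
    name = name.replace('/', '__')     # slashes revealed by decoding
    if name.startswith('www.'):
        # tuck "www." in front of the extension
        stem, _, suffix = name[len('www.'):].rpartition('.')
        name = f'{stem}.www.{suffix}'
    return name

def unwa_url(name: str) -> str:
    """Try to regenerate the original URL (the `unwa` transform)."""
    if '.www.' in name:
        stem, _, suffix = name.rpartition('.')
        name = f"www.{stem[:-len('.www')]}.{suffix}"
    return 'http://' + name.replace('--', '/')
```

Note that the space-to-dash step isn't reversible, which is one reason the reverse direction only usually works.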
For web URLs, some additional cleaning would probably be useful (removal of special characters such as `?` and `&`, addition of a missing file-type extension), and the original URL might be saved into a second file, `${FILE}.URL`..
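That cleaning could look roughly like this (hypothetical rules, continuing the Python sketch; `sanitize` is a made-up name):

```python
import re

def sanitize(name: str) -> str:
    """Clean a generated file name: replace URL metacharacters,
    add a file-type extension when one seems to be missing."""
    name = re.sub(r'[?&#=]', '_', name)
    # crude check for a short trailing extension such as .pdf or .html
    if not re.search(r'\.[A-Za-z0-9]{1,5}$', name):
        name += '.html'
    return name
```

and the original URL could simply be written out alongside, e.g. `echo "$URL" > "${FILE}.URL"`.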