Automatic Output Name from `<title>`
`monolith http://www.example.com/ --some-flag` could automatically save the page to `Example Domain.html`,
i.e. default to the page's `<title>` as the output file's name, with a configurable format (case, components such as `{{todaysDate}} - {{title}}.html`) set in the config file `$HOME/.config/monolith/config`.
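A config along these lines could look something like this (entirely hypothetical keys and syntax, just to sketch the idea, reusing the `{{…}}` placeholders from above):

```
# $HOME/.config/monolith/config  (hypothetical)
output-name = {{todaysDate}} - {{title}}.html
# how to treat the title's case: keep | lower | kebab
title-case = keep
```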
Hi, I wrote a small bash/python script that does something along those lines.
It uses BeautifulSoup4, a Python library, to get the title of a web page and inserts it as the name of monolith's output file.
It takes as input a link you have in the clipboard.
get_title() {
  CLIP_URL="$(xclip -o -selection clipboard)" python - <<'END'
import os
from urllib.request import urlopen

from bs4 import BeautifulSoup

# the target URL is handed in via the environment
url = os.environ['CLIP_URL']

# fetch and parse the page, then print its <title>
soup = BeautifulSoup(urlopen(url), 'html.parser')
print(soup.title.get_text().strip())
END
}

monolith "$(xclip -o -selection clipboard)" -o "$(get_title).html"
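If pulling in BeautifulSoup is unwanted, the same title lookup can be done with just the standard library's `html.parser` — a minimal sketch (fetching would still happen via `urlopen` as above):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside <title> elements."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```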
Hey @fretzo, thank you for the code! I think I'll start with something like %T first, then add %U for URL, %D for date, etc.
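A rough sketch of how that expansion could work (the placeholder names come from the comment above; `expand_name` and the exact semantics of `%U`/`%D` are made up for illustration):

```python
from datetime import date
from urllib.parse import urlparse

def expand_name(template: str, title: str, url: str) -> str:
    """Fill %T/%U/%D placeholders in an output-name template."""
    values = {
        '%T': title,                     # page title
        '%U': urlparse(url).netloc,      # host part of the URL
        '%D': date.today().isoformat(),  # today's date, ISO format
    }
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template
```

So `expand_name('%T.html', 'Example Domain', 'http://www.example.com/')` would yield `Example Domain.html`.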
Mhh, title is one (good) option.. :thinking:
I would actually suggest that, by default, it download to a file whose name encodes the original URL, and write to stdout only when invoked with `-c` (I'd remap `--no-css` to something else; `-c` is used this way in many common shell tools)..
Here's some inspiration for how this could look. I've been using this shell snippet, `/usr/local/bin/wa`, for years, mostly to save PDFs off the internet:
#!/bin/bash
# download a file, using a simplified form of its origin URL as the name.
URL="$*"
# strip the scheme, turn '/' into '--', decode %XX escapes via echo -e,
# replace spaces, then flatten any slashes the decoding revealed.
file=$(echo "${URL}" | echo -e "$(sed -r 's-.*://--;s+/+--+g;s/%/\\x/g;s/ /-/g')" | sed 's|/|__|g')
# tuck a leading "www." in front of the extension so names sort by domain.
[[ "$file" =~ ^www\. ]] && suffix="${file##*.}" && file="${file#www.}" && file="${file%.*}.www.${suffix}"
wget --server-response --verbose --user-agent="Mozilla/5.0 (Linux; KHTML, like Gecko)" --output-document="$file" "${URL}"
and another script, `/usr/local/bin/unwa`, to undo the transform, which usually works:
#!/bin/bash
# try to regenerate the original URL from a file name (reverse of wa).
file="$*"
# move "www." back to the front if wa tucked it before the extension.
[[ "$file" =~ \.www\. ]] && suffix="${file##*.}" && file="${file%www.*}" && file="www.${file%.*}.${suffix}"
# drop any directory prefix, restore the scheme and the '/' separators.
echo "$file" | sed -r 's+^(.*/)*+http://+;s+--+/+g'
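For what it's worth, the same pair of transforms can be sketched in Python (an approximation of the sed/bash logic above, not byte-for-byte identical):

```python
import re
from urllib.parse import unquote

def wa_name(url: str) -> str:
    """Simplify a URL into a flat file name (the `wa` transform)."""
    name = re.sub(r'.*://', '', url)   # strip the scheme
    name = name.replace('/', '--')     # encode path separators
    name = unquote(name)               # decode %XX escapes
    name = name.replace(' ', '-')      # spaces -> dashes (lossy)
    name = name.replace('/', '__')     # slashes revealed by decoding
    if name.startswith('www.'):
        # tuck "www." in front of the extension
        stem, _, suffix = name[len('www.'):].rpartition('.')
        name = f'{stem}.www.{suffix}'
    return name

def unwa_url(name: str) -> str:
    """Try to regenerate the original URL (the `unwa` transform)."""
    if '.www.' in name:
        stem, _, suffix = name.rpartition('.')
        name = f"www.{stem[:-len('.www')]}.{suffix}"
    return 'http://' + name.replace('--', '/')
```

Note that the space-to-dash step isn't reversible, which is one reason the reverse direction only usually works.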
For web URLs, some additional cleaning would probably be useful (removal of special characters such as `?` and `&`, addition of a missing file-type extension), and the original URL might be saved into a second file, `${FILE}.URL`..
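That cleaning could look roughly like this (hypothetical rules, continuing the Python sketch; `sanitize` is a made-up name):

```python
import re

def sanitize(name: str) -> str:
    """Clean a generated file name: replace URL metacharacters,
    add a file-type extension when one seems to be missing."""
    name = re.sub(r'[?&#=]', '_', name)
    # crude check for a short trailing extension such as .pdf or .html
    if not re.search(r'\.[A-Za-z0-9]{1,5}$', name):
        name += '.html'
    return name
```

and the original URL could simply be written out alongside, e.g. `echo "$URL" > "${FILE}.URL"`.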