surfraw-tools: Support POST method searches

This might make generating OpenSearch elvi easier.

As for implementation, the elvis.in template could be modified to create a temporary file with a filled-in HTML form which auto-submits on page-load (using JavaScript). It would then output the name of this file, perhaps with a file:// scheme.

Feb 22 '21 07:02 Hoboneer

This could also be implemented using curl, wget, libwww-perl's POST (the counterpart of its GET command), or some other command-line HTTP program.

Either way, tempfiles are going to be used.

One problem I've had when testing POST method searches: the first search works fine, but searching again from the results page or visiting the next page only works inconsistently.

This was for my tests with DuckDuckGo's Lite search. It works fine with w3m and firefox but not qutebrowser (error 403). I wonder why that is. Maybe something to do with the Referer or User-Agent headers? It might just be a DuckDuckGo thing.
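
One way to narrow this down might be to replay the same POST with curl while varying one header at a time and comparing the status codes. A rough sketch: probe_status is a made-up helper name, and the header values passed to it are placeholders, not what qutebrowser actually sends.

```shell
#!/bin/sh
# Hypothetical helper: probe_status USER_AGENT REFERER
# Prints only the HTTP status code of the POST (e.g. 200 or 403).
probe_status() {
	curl -s -o /dev/null -w '%{http_code}\n' \
		-A "$1" -e "$2" \
		-d 'q=test' https://lite.duckduckgo.com/lite/
}

# Example calls (values are placeholders):
#   probe_status 'Mozilla/5.0 (X11; Linux x86_64)' 'https://lite.duckduckgo.com/'
#   probe_status 'some-other-agent' 'file:///tmp/results.html'
```

If the status flips between two calls that differ in a single header, that header is the likely culprit. It says nothing about cookies or other state a browser might send.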

Here, I had to resolve two kinds of relative URLs: scheme-relative ones (//host/path, which only need the scheme) and root-relative ones (/path, which need scheme+domain):

# TODO: be more portable (i.e., not use --suffix)
#INFILE="$(mktemp --suffix=.html)"
INFILE=in.html
curl -d 'q=test' https://lite.duckduckgo.com/lite/ -o "$INFILE"
hxnormalize -x "$INFILE" | hxwls |  grep '^/' | awk -v OFS='\t' '/^\/\//{print $0, "https:" $0; next} /^\//{print $0, "https://lite.duckduckgo.com" $0}' >urls.txt
sed "$(while read before after; do echo "s,$before,$after,g"; done < urls.txt | sort | uniq)" "$INFILE" >out.html

Elvi could do something similar.
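
The per-URL rule used above can be written as a tiny shell function, which might be easier to reuse in an elvis than the awk one-liner. This is only a sketch: resolve_url is a hypothetical name, and it deliberately leaves document-relative URLs (e.g. "next.html") alone, just like the pipeline above.

```shell
#!/bin/sh
# Hypothetical helper: resolve_url ORIGIN URL
# ORIGIN is scheme+domain without a trailing slash, e.g. https://lite.duckduckgo.com
resolve_url() {
	origin=$1 url=$2
	case "$url" in
		//*) printf '%s\n' "${origin%%//*}$url" ;;  # scheme-relative: reuse the origin's scheme
		/*)  printf '%s\n' "$origin$url" ;;          # root-relative: prepend scheme+domain
		*)   printf '%s\n' "$url" ;;                 # absolute or document-relative: unchanged
	esac
}

resolve_url https://lite.duckduckgo.com /lite/
resolve_url https://lite.duckduckgo.com //duckduckgo.com/y.js
```

The //* pattern has to come before /*, since a scheme-relative URL also starts with a single slash.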

Feb 26 '21 08:02 Hoboneer

Pros and cons of both ways

Note: "runtime" here means when an elvis is executed, not when it is generated by mkelvis, opensearch2elvis, or some other program.

HTML form + JavaScript auto-submit

Pros

No extra runtime dependencies

Cons

Text browser users (or anyone with JavaScript disabled) will have to manually submit the form to execute their search

Curl/Wget/GET/... + HTML-XML-utils

Pros

No JavaScript required at runtime

Cons

Two extra runtime dependencies:
1. An HTTP program: to get the results page
2. HTML-XML-utils: to resolve any relative URLs in the retrieved page

It's not so bad for the HTTP program dependency because the -o option for surfraw requires one, so, in effect, there's only one extra runtime dependency.

HTML-XML-utils, AFAIK, is in most repos, so it's not much of a problem to install.

Both ways use tempfiles, which isn't too bad, but how are they going to be deleted? The tempfile could be deleted after the w3_browse_url call returns, but that prevents surfraw from using exec to open the browser (after exec, no shell is left running to do the cleanup). Or the same approach could be taken as in this answer: https://unix.stackexchange.com/a/181938

tmpfile=$(mktemp /tmp/abc-script.XXXXXX)
exec 3>"$tmpfile"
rm "$tmpfile"
: ...
echo foo >&3
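
To see that the data really stays reachable after the rm, the snippet can be extended with a second descriptor for reading. A small self-contained sketch (the fd numbers and the fd-demo file name are arbitrary choices, and mktemp is assumed to be installed):

```shell
#!/bin/sh
tmpfile=$(mktemp "${TMPDIR:-/tmp}/fd-demo.XXXXXX") || exit 1
exec 3>"$tmpfile" 4<"$tmpfile"  # fd 3 for writing, fd 4 for reading
rm -f -- "$tmpfile"             # the directory entry is gone, the open fds are not
echo foo >&3
exec 3>&-                       # done writing
content=$(cat <&4)              # still readable through fd 4
exec 4<&-
printf '%s\n' "$content"
```

This works within one shell and its children, but a browser would have to be told to read from fd 4, which is exactly the unportable part.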

Portably having the browser read a pipe, or any file descriptor for that matter, is a pain. Might just do:

__mkelvis_cleanup() { rm -f -- "$__mkelvis_tmp" "$__mkelvis_out"; }
__mkelvis_tmp=$(mktemp ...)
__mkelvis_out=$(mktemp ...)
trap '__mkelvis_cleanup; exit 1' HUP TERM  # in case anything fails; exit so the script doesn't carry on after cleanup
trap '__mkelvis_cleanup; trap - INT; kill -s INT "$$"' INT  # handled separately, see https://mywiki.wooledge.org/SignalTrap#Special_Note_On_SIGINT_and_SIGQUIT
trap '__mkelvis_cleanup; trap - QUIT; kill -s QUIT "$$"' QUIT  # handled separately, see https://mywiki.wooledge.org/SignalTrap#Special_Note_On_SIGINT_and_SIGQUIT
# generate page
rm -f "$__mkelvis_tmp"
w3_browse_url "$__mkelvis_out" &
sleep 1  # allow browser to open page
rm -f "$__mkelvis_out"  # delete filesystem entry (the browser should retain access until it closes the fd)

Regarding tempfiles: apparently mktemp isn't POSIX... I guess that's another dependency. No way am I going to reimplement this.

Feb 26 '21 09:02 Hoboneer

Improved code:

__mkelvis_in="$(mktemp)"
__mkelvis_tmp="$(mktemp)"
__mkelvis_out="$(mktemp tmp.XXXXXXXXXX.html)"
__mkelvis_cleanup() {
	rm -f -- "$__mkelvis_in" "$__mkelvis_tmp" "$__mkelvis_out"
}
trap '__mkelvis_cleanup; exit 1' HUP TERM  # in case anything fails; exit so the script doesn't carry on after cleanup
trap '__mkelvis_cleanup' EXIT  # data cleanup on no failure (POSIX)--JUST IN CASE
trap '__mkelvis_cleanup; trap - INT; kill -s INT "$$"' INT  # handled separately, see https://mywiki.wooledge.org/SignalTrap#Special_Note_On_SIGINT_and_SIGQUIT
trap '__mkelvis_cleanup; trap - QUIT; kill -s QUIT "$$"' QUIT  # handled separately, see https://mywiki.wooledge.org/SignalTrap#Special_Note_On_SIGINT_and_SIGQUIT

curl -d 'q=surfraw' https://lite.duckduckgo.com/lite/ -o "$__mkelvis_in"
hxnormalize -x "$__mkelvis_in" | hxwls | grep '^/' | awk -v OFS='\t' '/^\/\//{print $0, "https:" $0; next} /^\//{print $0, "https://lite.duckduckgo.com" $0}' >"$__mkelvis_tmp"
sed "$(while read before after; do echo "s,$before,$after,g"; done < "$__mkelvis_tmp" | sort | uniq)" "$__mkelvis_in" >"$__mkelvis_out"

rm -f -- "$__mkelvis_in" "$__mkelvis_tmp"
. surfraw
w3_config
w3_parse_args
w3_browse_url "$__mkelvis_out"
sleep 1  # allow browser to take fd before deleting filesystem entry
rm -f "$__mkelvis_out"

Feb 26 '21 10:02 Hoboneer

Another problem: graphical browsers' shell commands exit immediately (at least if already opened), but text browsers don't exit until the user manually quits.

This causes an annoying delay of 1 second after quitting a text browser.

A potential solution is to time the execution of the w3_browse_url call, and to avoid sleeping if more than three seconds have passed. That's an arbitrary number but it should be good enough.

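Calling srand() twice in POSIX awk seems to work for this: the first call (note the lack of args) seeds from the time of day, and the second call returns that previous seed, which in practice is the epoch time in seconds.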

Feb 27 '21 00:02 Hoboneer

Improved code that does the timing, reduces the number of mktemp calls, and removes the need for intermediate files (html-xml-utils is great!).

#!/bin/sh
__mkelvis_out="$(mktemp "${TMPDIR:-/tmp}/surfraw-results.XXXXXXXXXX.html")" || exit 1
__mkelvis_cleanup() {
	rm -f -- "$__mkelvis_out"
}
trap '__mkelvis_cleanup; exit 1' HUP TERM  # in case anything fails; exit so the script doesn't carry on after cleanup
trap '__mkelvis_cleanup' EXIT  # data cleanup on no failure (POSIX)--JUST IN CASE
trap '__mkelvis_cleanup; trap - INT; kill -s INT "$$"' INT  # handled separately, see https://mywiki.wooledge.org/SignalTrap#Special_Note_On_SIGINT_and_SIGQUIT
trap '__mkelvis_cleanup; trap - QUIT; kill -s QUIT "$$"' QUIT  # handled separately, see https://mywiki.wooledge.org/SignalTrap#Special_Note_On_SIGINT_and_SIGQUIT

__mkelvis_do_post () {
	curl -d 'q=surfraw' https://lite.duckduckgo.com/lite/
}
__mkelvis_resolve_urls () {
	# I love html-xml-utils
	hxnormalize -x |
		hxpipe |
		# does it matter what the attribute is?  should it just be "A.+ CDATA ..."?
		awk '/^A(href|action) CDATA \/\//{print $1, $2, "https:" $3; next} /^A(href|action) CDATA \//{print $1, $2, "https://lite.duckduckgo.com" $3; next} {print}' |
		hxunpipe
}
__mkelvis_do_post | __mkelvis_resolve_urls >"$__mkelvis_out"

__mkelvis_get_time() {
	# POSIX doesn't specify the initial seed, so call srand() twice: the first call seeds from the
	# time of day and the second returns that seed (the unix epoch time in practice).
	awk 'BEGIN{srand(); print srand()}'
}
. surfraw
w3_config
w3_parse_args
__mkelvis_before="$(__mkelvis_get_time)"
if ok SURFRAW_dump; then
	[ "${SURFRAW_dump_file:=-}" = "-" ] && SURFRAW_dump_file=/dev/stdout
	cat "$__mkelvis_out" > "$SURFRAW_dump_file"
else
	w3_browse_url "$__mkelvis_out"
fi
__mkelvis_after="$(__mkelvis_get_time)"
# only sleep if the command exited almost immediately (so probably a graphical browser);
# a text browser has already been quit by the user by the time the call returns
if ! ok SURFRAW_dump && ! ok SURFRAW_print && [ "$(( __mkelvis_after - __mkelvis_before ))" -lt 3 ]; then
	sleep 1  # allow browser to take fd before deleting filesystem entry
fi

Because of the use of trap, this timing code is unnecessary:

#!/bin/sh
__mkelvis_out="$(mktemp "${TMPDIR:-/tmp}/surfraw-results.XXXXXXXXXX.html")" || exit 1
__mkelvis_cleanup() {
	rm -f -- "$__mkelvis_out"
}
trap '__mkelvis_cleanup; exit 1' HUP TERM  # in case anything fails; exit so the script doesn't carry on after cleanup
trap '__mkelvis_cleanup' EXIT  # data cleanup on no failure (POSIX)--JUST IN CASE
trap '__mkelvis_cleanup; trap - INT; kill -s INT "$$"' INT  # handled separately, see https://mywiki.wooledge.org/SignalTrap#Special_Note_On_SIGINT_and_SIGQUIT
trap '__mkelvis_cleanup; trap - QUIT; kill -s QUIT "$$"' QUIT  # handled separately, see https://mywiki.wooledge.org/SignalTrap#Special_Note_On_SIGINT_and_SIGQUIT

__mkelvis_do_post () {
	curl -d 'q=surfraw' https://lite.duckduckgo.com/lite/
}
__mkelvis_resolve_urls () {
	# I love html-xml-utils
	hxnormalize -x |
		hxpipe |
		# does it matter what the attribute is?  should it just be "A.+ CDATA ..."?
		awk '/^A(href|action) CDATA \/\//{print $1, $2, "https:" $3; next} /^A(href|action) CDATA \//{print $1, $2, "https://lite.duckduckgo.com" $3; next} {print}' |
		hxunpipe
}
__mkelvis_do_post | __mkelvis_resolve_urls >"$__mkelvis_out"

. surfraw
w3_config
w3_parse_args
if ok SURFRAW_dump; then
	[ "${SURFRAW_dump_file:=-}" = "-" ] && SURFRAW_dump_file=/dev/stdout
	cat "$__mkelvis_out" > "$SURFRAW_dump_file"
else
	w3_browse_url "$__mkelvis_out"
fi
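
The awk filter in __mkelvis_resolve_urls hard-codes the DuckDuckGo host and rebuilds each line from $1, $2, $3, so an attribute value containing a space would be cut short. A possible variant, sketched under the assumption that hxpipe emits attribute lines as "Aname CDATA value" (as the regexes above already rely on): __mkelvis_rewrite_attrs is a hypothetical name, and the origin is passed as an argument.

```shell
#!/bin/sh
# Hypothetical helper: __mkelvis_rewrite_attrs ORIGIN
# Reads hxpipe-style lines on stdin; ORIGIN is scheme+domain without a trailing slash.
__mkelvis_rewrite_attrs() {
	awk -v origin="$1" '
		BEGIN { scheme = origin; sub(/\/\/.*/, "", scheme) }  # "https://host" -> "https:"
		# scheme-relative value: prepend only the scheme
		/^A(href|action) CDATA \/\// { sub(/ CDATA /, " CDATA " scheme); print; next }
		# root-relative value: prepend scheme+domain
		/^A(href|action) CDATA \//   { sub(/ CDATA /, " CDATA " origin); print; next }
		{ print }
	'
}

# Intended use inside the pipeline:
#   hxpipe | __mkelvis_rewrite_attrs 'https://lite.duckduckgo.com' | hxunpipe
```

Editing $0 with sub() instead of reprinting fields keeps the rest of the line untouched. The origin must not contain "&" or backslashes, since awk would interpret them.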

Feb 27 '21 12:02 Hoboneer

See this stackexchange answer for portable (POSIX) tempfile creation: https://unix.stackexchange.com/a/181996

tmpfile=$(
  echo 'mkstemp(template)' |
    m4 -D template="${TMPDIR:-/tmp}/baseXXXXXX"
) || exit

Feb 27 '21 12:02 Hoboneer

Turns out that the "Next page" buttons stopped working--why....

Feb 27 '21 12:02 Hoboneer

hxnormalize broke the next page form by closing it early!

This was for DuckDuckGo Lite. I haven't tested this with other sites... should hxnormalize not be called then? What about malformed HTML? Maybe it could be configured by the user if needed.
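
If it does end up being user-configurable, the switch could be as small as this. A sketch only: the variable name __mkelvis_normalize and its yes/no values are invented here, and defaulting to "no" is just a guess based on the DuckDuckGo Lite breakage.

```shell
#!/bin/sh
# Hypothetical switch: set __mkelvis_normalize=yes to run hxnormalize, anything else passes the page through.
__mkelvis_maybe_normalize() {
	case "${__mkelvis_normalize:-no}" in
		yes) hxnormalize -x ;;  # may repair malformed HTML, but can also close forms early
		*)   cat ;;             # leave the page exactly as the site sent it
	esac
}

# Intended use: __mkelvis_do_post | __mkelvis_maybe_normalize | hxpipe | ... | hxunpipe
```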

Feb 27 '21 12:02 Hoboneer

Include the no-extra-dependencies version:

#!/bin/sh
# POSIX tempfile creation in shell: https://unix.stackexchange.com/a/181996
case "${__mkelvis_type:=no-js}" in
	no-js) 	__mkelvis_template="${TMPDIR:-/tmp}/surfraw-results.XXXXXX" ;;
	js) 	__mkelvis_template="${TMPDIR:-/tmp}/surfraw-load.XXXXXX" ;;
esac
__mkelvis_out="$(echo 'mkstemp(template)'| m4 -D template="$__mkelvis_template")" || exit 1
mv "$__mkelvis_out" "$__mkelvis_out.html" || exit 1  # can't specify suffix, so rename for browser's benefit
__mkelvis_out="$__mkelvis_out.html"
__mkelvis_cleanup() {
	rm -f -- "$__mkelvis_out"
}
trap '__mkelvis_cleanup; exit 1' HUP TERM  # in case anything fails; exit so the script doesn't carry on after cleanup
trap '__mkelvis_cleanup' EXIT  # data cleanup on no failure (POSIX)--JUST IN CASE
trap '__mkelvis_cleanup; trap - INT; kill -s INT "$$"' INT  # handled separately, see https://mywiki.wooledge.org/SignalTrap#Special_Note_On_SIGINT_and_SIGQUIT
trap '__mkelvis_cleanup; trap - QUIT; kill -s QUIT "$$"' QUIT  # handled separately, see https://mywiki.wooledge.org/SignalTrap#Special_Note_On_SIGINT_and_SIGQUIT

__mkelvis_query='test'
__mkelvis_qstring="q=$__mkelvis_query"
__mkelvis_post_url='https://lite.duckduckgo.com/lite/'
__mkelvis_do_post () {
	if command -v curl >/dev/null; then
		curl -d "$__mkelvis_qstring" "$__mkelvis_post_url"
	elif command -v wget >/dev/null; then
		wget -q -O - --post-data "$__mkelvis_qstring" "$__mkelvis_post_url"  # -O -: write the page to stdout, not a file
	elif command -v POST >/dev/null; then
		printf '%s\n' "$__mkelvis_qstring" | POST "$__mkelvis_post_url"
	else
		# surfraw's err isn't defined yet (surfraw is sourced further down), so print directly
		printf '%s\n' 'no POST-capable program found: install curl, wget, or libwww-perl; or use the js version' >&2
		exit 1
	fi
}
__mkelvis_resolve_urls () {
	# I love html-xml-utils
	hxpipe |  # `hxnormalize` broke (DuckDuckGo Lite's) next page form...
		# does it matter what the attribute is?  should it just be "A.+ CDATA ..."?
		awk '/^A(href|action) CDATA \/\//{print $1, $2, "https:" $3; next} /^A(href|action) CDATA \//{print $1, $2, "https://lite.duckduckgo.com" $3; next} {print}' |
		hxunpipe
}
__mkelvis_make_page () {
	cat <<EOF
<!DOCTYPE html>
<html>
  <head>
    <title>POST method search</title>
    <meta charset="utf-8" />
    <script type="text/javascript">
      function doSearch() { document.forms["search"].submit(); }	
    </script>
  </head>
  <body onload="doSearch()">
    <p>Submit the form manually if JavaScript is disabled (or using a text browser)</p>
    <form name="search" action="$__mkelvis_post_url" method="post">
      <input name="q" type="text" value="$__mkelvis_query" />
      <input type="submit" value="Search" />
    </form>  
  </body>
</html>
EOF
}
case "${__mkelvis_type:=no-js}" in
	no-js) __mkelvis_do_post | __mkelvis_resolve_urls ;;
	js) __mkelvis_make_page ;;
esac >"$__mkelvis_out"

. surfraw
w3_config
w3_parse_args

if ok SURFRAW_dump; then
	[ "${SURFRAW_dump_file:=-}" = "-" ] && SURFRAW_dump_file=/dev/stdout
	cat "$__mkelvis_out" > "$SURFRAW_dump_file"
	exit
elif ok SURFRAW_print; then
	# FIXME: should the cleanups not be done in this case?
	err "POST method HTML pages are temporary files: they don't exist after program exit"
fi

__mkelvis_get_time() {
	# POSIX doesn't specify the initial seed, so call srand() twice: the first call seeds from the
	# time of day and the second returns that seed (the unix epoch time in practice).
	awk 'BEGIN{srand(); print srand()}'
}
__mkelvis_before="$(__mkelvis_get_time)"
w3_browse_url "$__mkelvis_out"
__mkelvis_after="$(__mkelvis_get_time)"
# some graphical browsers are slow to access the file (e.g., Firefox when it's already open)
if [ "$(( __mkelvis_after - __mkelvis_before ))" -lt 3 ]; then
	sleep 1  # allow browser to take fd before deleting filesystem entry
fi
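
One thing the script above doesn't handle yet: the query is pasted into the generated form as-is, so a search containing a double quote or "<" would break the value attribute. A minimal sketch of an escaping helper (the name __mkelvis_html_escape is made up), which the heredoc could call before interpolating the query:

```shell
#!/bin/sh
# Hypothetical helper: escape the characters that are special in HTML text and attribute values.
__mkelvis_html_escape() {
	# "&" must be replaced first, otherwise the entities added afterwards would be escaped again
	printf '%s\n' "$1" |
		sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}

__mkelvis_html_escape 'say "hi" & <bye>'
```

The no-js branch has the mirror-image problem with URL encoding of the POST body. With curl, the --data-urlencode option can take care of that in place of -d.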

Feb 28 '21 02:02 Hoboneer