Atomic tee

#+EXPORT_EXCLUDE_TAGS: noexport

#+begin_src python :exports output :results replace raw
import arctee
return arctee.__doc__
#+end_src

#+RESULTS:

Helper script to run your data exports. It works kind of like [[https://en.wikipedia.org/wiki/Tee_(command)][tee command]], but:

  • a: writes output atomically
  • r: supports retrying command
  • c: supports compressing output

You can read more on how it's used [[https://beepb00p.xyz/exports.html#arctee][here]].

* Motivation

Many steps are common to all data exports, regardless of the source. In the vast majority of cases, you want to fetch some data, save it in a file (e.g. JSON) along with a timestamp, and potentially compress it.

This script aims to minimize the common boilerplate:

  • the =path= argument allows easy ISO8601 timestamping and guarantees atomic writing, so you never end up with corrupted exports.
  • =--compression= lets you compress the output simply by passing the extension. No more =tar -zcvf=!
  • =--retries= allows easy exponential backoff in case the service you're querying is flaky.
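
To illustrate the =path= placeholders, here is a minimal sketch of how such a template could be expanded. It is not taken from arctee's source: the function name =expand_path= is hypothetical, the timestamp layout is copied from the example filename in this README, and what ={platform}= expands to is my assumption.

```python
import platform
import socket
from datetime import datetime, timezone
from typing import Optional


def expand_path(template: str, now: Optional[datetime] = None) -> str:
    """Fill borg-style placeholders in an output path template (illustrative only)."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Normalise to UTC so the trailing 'Z' in the timestamp is truthful.
    now = now.astimezone(timezone.utc)
    return template.format(
        utcnow=now.strftime("%Y%m%dT%H%M%SZ"),
        hostname=socket.gethostname(),
        # Assumption: the OS name; the real tool may use a different value.
        platform=platform.system(),
    )
```

Passing a fixed, timezone-aware =now= makes the result reproducible, which is handy when checking a template before putting it into a crontab.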

Example:

: arctee '/exports/rtm/{utcnow}.ical.zstd' --compression zstd --retries 3 -- /soft/export/rememberthemilk.py

  1. runs =/soft/export/rememberthemilk.py=, making up to three attempts in total if it fails

    The script is expected to dump its result to stdout; stderr is simply passed through.

  2. once the data is fetched it's compressed as =zstd=

  3. timestamp is computed and compressed data is written to =/exports/rtm/20200102T170015Z.ical.zstd=
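
The three steps above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not arctee's actual implementation: the function names are hypothetical, =gzip= from the standard library stands in for =zstd=, and the atomic write is done with a temporary file plus =os.replace= rather than the =atomicwrites= package.

```python
import gzip
import os
import subprocess
import tempfile
import time


def run_with_retries(command, tries=3, base_delay=1.0):
    """Run command and return its stdout, retrying with exponential backoff."""
    for attempt in range(tries):
        # stderr is not captured, so it passes straight through to the caller.
        result = subprocess.run(command, stdout=subprocess.PIPE)
        if result.returncode == 0:
            return result.stdout
        if attempt < tries - 1:
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"command failed after {tries} tries: {command}")


def write_atomically(path, data):
    """Write to a temporary file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # The rename is atomic on POSIX when both paths are on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def export(command, path, tries=3):
    """Fetch, compress (gzip as a stand-in for zstd), and write atomically."""
    data = run_with_retries(command, tries=tries)
    write_atomically(path, gzip.compress(data))
```

Because the data only reaches its final name through the rename, a reader of the export directory sees either the complete file or no file at all.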

* Do you really need a special script for that?
  • why not use the =date= command for timestamps?

    Passing =$(date -Iseconds --utc).json= as =path= works, but I need it for most of my exports, so it ends up polluting my crontabs.

Next, I want to do several things one after another here. That sounds like a perfect candidate for pipes, right? Sadly, there are serious caveats:

  • pipe errors don't propagate. If one part of your pipe fails, the pipeline as a whole doesn't fail

    That's a major problem that often leads to unexpected behaviours.

    In bash you can fix this with =set -o pipefail=. However:

    • default cron shell is =/bin/sh=. Ok, you can change it to ~SHELL=/bin/bash~, but

    • you can't set it to =/bin/bash -o pipefail=

      You'd have to prepend all of your pipes with =set -o pipefail=, which is quite boilerplaty

  • you can't use pipes for retrying; you need some wrapper script anyway

    E.g. similar to how you need a wrapper script when you want to stop your program on timeout.

  • it's possible to use pipes for atomically writing output to a file, however I haven't found any existing tools to do that

    E.g. I want something like =curl https://some.api/get-data | tee --atomic /path/to/data.json=.

    If you know any existing tool please let me know!

  • it's possible to pipe compression

    However due to the above concerns (timestamping/retrying/atomic writing), it has to be part of the script as well.
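
The first caveat in the list above is easy to check for yourself. The sketch below is only a demonstration, with a hypothetical helper name =pipeline_status=; it assumes =sh= is on your =PATH= (and =bash= for the =pipefail= case) and reports the exit status the shell gives for a pipeline whose first command fails.

```python
import subprocess


def pipeline_status(pipeline: str, pipefail: bool = False) -> int:
    """Return the exit status a shell reports for the given pipeline."""
    if pipefail:
        # bash accepts '-o pipefail' on its own command line.
        argv = ["bash", "-o", "pipefail", "-c", pipeline]
    else:
        # Plain sh: the status of a pipeline is that of its last command.
        argv = ["sh", "-c", pipeline]
    return subprocess.run(argv).returncode
```

=pipeline_status("false | cat")= returns 0 even though =false= failed, while =pipeline_status("false | cat", pipefail=True)= returns 1.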

Cron doesn't feel like a suitable tool for my needs because of its pipe handling and the need for retries, but I haven't found a better alternative. If you think any of these things can be simplified, I'd be happy to know and remove them in favor of more standard solutions!

* Installation

This can be installed with pip by running: =pip3 install --user git+https://github.com/karlicoss/arctee=

You can also install it manually: install =atomicwrites= (=pip3 install atomicwrites=), then download =arctee.py= and run it directly.

** Optional Dependencies

  • =pip3 install --user backoff=

    [[https://github.com/litl/backoff][backoff]] is a library to simplify backoff and retrying. Only necessary if you want to use =--retries=.

  • =apt install atool=

    [[https://www.nongnu.org/atool][atool]] is a tool to create archives in any format. Only necessary if you want to use compression.

end of autogenerated stuff

* Usage

#+begin_src sh :results output :exports output
arctee --help
#+end_src

TODO ugh. seems that github chokes over #+RESULT: here

#+begin_example
usage: arctee [-h] [-r RETRIES] [-c COMPRESSION] path

Wrapper for automating boilerplate for reliable and regular data exports.

Example: arctee '/exports/rtm/{utcnow}.ical.zstd' --compression zstd --retries 3 -- /soft/export/rememberthemilk.py --user "[email protected]"

Arguments past '--' are the actual command to run.

positional arguments:
  path                  Path with borg-style placeholders. Supported: {utcnow}, {hostname}, {platform}.

                        Example: '/exports/pocket/pocket_{utcnow}.json'

                        (see https://manpages.debian.org/testing/borgbackup/borg-placeholders.1.en.html)

optional arguments:
  -h, --help            show this help message and exit
  -r RETRIES, --retries RETRIES
                        Total number of tries, 1 (default) means only try once. Uses exponential backoff.
  -c COMPRESSION, --compression COMPRESSION
                        Set compression format.

                        See 'man apack' for list of supported formats. In addition, 'zstd' is also supported.
#+end_example

* TODOs :noexport: