arctee
Atomic tee
#+EXPORT_EXCLUDE_TAGS: noexport
#+begin_src python :exports output :results replace raw
import arctee
return arctee.__doc__
#+end_src
#+RESULTS:
Helper script to run your data exports. It works kind of like [[https://en.wikipedia.org/wiki/Tee_(command)][tee command]], but:
- a: writes output atomically
- r: supports retrying the command
- c: supports compressing output
You can read more on how it's used [[https://beepb00p.xyz/exports.html#arctee][here]].
* Motivation
Many things are common to all data exports, regardless of the source. In the vast majority of cases, you want to fetch some data, save it in a file (e.g. JSON) along with a timestamp, and potentially compress it.
This script aims to minimize the common boilerplate:
- the =path= argument allows easy ISO8601 timestamping and guarantees atomic writing, so you never end up with corrupted exports.
- =--compression= lets you compress the output simply by passing the extension. No more =tar -zcvf=!
- =--retries= allows easy exponential backoff in case the service you're querying is flaky.
Example:
: arctee '/exports/rtm/{utcnow}.ical.zstd' --compression zstd --retries 3 -- /soft/export/rememberthemilk.py
- runs =/soft/export/rememberthemilk.py=, retrying it up to three times if it fails

  The script is expected to dump its result to stdout; stderr is simply passed through.
- once the data is fetched, it's compressed as =zstd=
- the timestamp is computed and the compressed data is written to =/exports/rtm/20200102T170015Z.ical.zstd=
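
The timestamping and atomic-write steps above can be sketched in Python. This is a minimal illustration, not arctee's actual code: =expand_path= and =write_atomically= are hypothetical helper names, only the ={utcnow}= placeholder is handled, and writing to a temporary file followed by a rename is an assumption about one way to get atomic writes with the standard library alone.

```python
import contextlib
import os
import tempfile
from datetime import datetime, timezone


def expand_path(template, now=None):
    """Replace '{utcnow}' with a compact ISO8601 UTC timestamp.

    `now` (hypothetical parameter, for testing) must be timezone-aware.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return template.replace("{utcnow}", now.strftime("%Y%m%dT%H%M%SZ"))


def write_atomically(path, data):
    """Write bytes to a temp file in the target directory, then rename it.

    os.replace is atomic when source and destination are on the same
    filesystem, so readers see either the old file or the complete new one.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        # Clean up the partial temp file; the target path is left untouched.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp)
        raise
```

If the process dies halfway through, only the temporary file is affected, so a half-written export never appears under the final name.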
* Do you really need a special script for that?
- why not use the =date= command for timestamps?

  Passing =$(date -Iseconds --utc).json= as =path= works; however, I need it for most of my exports, so it ends up polluting my crontabs.

Next, I want to do several things one after another here. That sounds like a perfect candidate for pipes, right? Sadly, there are serious caveats:

- pipe errors don't propagate. If one part of your pipe fails, it doesn't fail everything

  That's a major problem that often leads to unexpected behaviours.

  In bash you can fix this by setting =set -o pipefail=. However:

  - the default cron shell is =/bin/sh=. Ok, you can change it to ~SHELL=/bin/bash~, but
  - you can't set it to =/bin/bash -o pipefail=

    You'd have to prepend =set -o pipefail= to all of your pipes, which adds quite a lot of boilerplate
- you can't use pipes for retrying; you need some wrapper script anyway

  E.g. similar to how you need a wrapper script when you want to stop your program on timeout.
- it's possible to use pipes for atomically writing output to a file, however I haven't found any existing tools to do that

  E.g. I want something like =curl https://some.api/get-data | tee --atomic /path/to/data.json=.

  If you know of any existing tool, please let me know!
- it's possible to pipe compression

  However, due to the above concerns (timestamping/retrying/atomic writing), it has to be part of the script as well.
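
The first caveat is easy to reproduce. The sketch below assumes a POSIX =/bin/sh= with =false= and =cat= available on the =PATH=; =pipeline_status= is a hypothetical helper name. A pipeline reports the exit status of its last command, so a failing first stage goes unnoticed.

```python
import subprocess


def pipeline_status(command):
    """Run `command` through /bin/sh (cron's default shell) and return its exit status."""
    return subprocess.run(command, shell=True, executable="/bin/sh").returncode


# `false` on its own fails with a non-zero status...
print(pipeline_status("false"))
# ...but in a pipeline only the last stage (`cat`) determines the status.
print(pipeline_status("false | cat"))
```

A cron job built on such a pipeline would look successful even though the export command itself failed.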
It feels like cron isn't a suitable tool for my needs because of pipe handling and the need for retries, but I haven't found a better alternative. If you think any of these things can be simplified, I'd be happy to know and remove them in favor of more standard solutions!
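
For illustration, the retrying part of such a wrapper could look roughly like this. It is a standard-library sketch, not arctee's implementation (which relies on the =backoff= library for =--retries=); =run_with_retries= and its =base_delay= parameter are hypothetical.

```python
import subprocess
import time


def run_with_retries(command, tries=3, base_delay=1.0):
    """Run `command` (a list of arguments) and return its captured stdout.

    On a non-zero exit status, wait base_delay * 2**attempt seconds and try
    again, up to `tries` attempts in total. stderr is not captured, so it is
    passed through to the caller's stderr.
    """
    if tries < 1:
        raise ValueError("tries must be at least 1")
    for attempt in range(tries):
        result = subprocess.run(command, stdout=subprocess.PIPE)
        if result.returncode == 0:
            return result.stdout
        if attempt + 1 < tries:
            time.sleep(base_delay * 2 ** attempt)
    # Every attempt failed: surface the last exit status to the caller.
    raise subprocess.CalledProcessError(result.returncode, command)
```

Because stdout is only returned after a successful run, output from failed attempts is never written to the export file.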
* Installation
This can be installed with pip by running: =pip3 install --user git+https://github.com/karlicoss/arctee=

You can also install it manually: install =atomicwrites= (=pip3 install atomicwrites=), then download and run =arctee.py= directly.
** Optional Dependencies
- =pip3 install --user backoff=

  [[https://github.com/litl/backoff][backoff]] is a library to simplify backoff and retrying. Only necessary if you want to use =--retries=.
- =apt install atool=

  [[https://www.nongnu.org/atool][atool]] is a tool to create archives in any format. Only necessary if you want to use compression.
# end of autogenerated stuff
* Usage
#+begin_src sh :results output :exports output
arctee --help
#+end_src
TODO ugh. seems that github chokes over #+RESULT: here
#+begin_example
usage: arctee [-h] [-r RETRIES] [-c COMPRESSION] path

Wrapper for automating boilerplate for reliable and regular data exports.

Example: arctee '/exports/rtm/{utcnow}.ical.zstd' --compression zstd --retries 3 -- /soft/export/rememberthemilk.py --user "[email protected]"

Arguments past '--' are the actuall command to run.

positional arguments:
  path                  Path with borg-style placeholders. Supported: {utcnow}, {hostname}, {platform}.

                        Example: '/exports/pocket/pocket_{utcnow}.json'

                        (see https://manpages.debian.org/testing/borgbackup/borg-placeholders.1.en.html)

optional arguments:
  -h, --help            show this help message and exit
  -r RETRIES, --retries RETRIES
                        Total number of tries, 1 (default) means only try once. Uses exponential backoff.
  -c COMPRESSION, --compression COMPRESSION
                        Set compression format.

                        See 'man apack' for list of supported formats. In addition, 'zstd' is also supported.
#+end_example
* TODOs :noexport: