
AWK crashcourse
This AWK language course aims to explain AWK in 15 minutes, so that you can find an awesome tool and friend in it despite its given name. The correct pronunciation is [auk], like the auks, small seabirds such as the parakeet auklet.
General language description
AWK language (is):
- (mainly) a text-processing language
- available on most UNIX-like systems by default; on Windows there is either a native binary or the cygwin one
- syntax is influenced by the `c` and `shell` programming languages
- programs range from a single line to multiple library files
- several implementations are available, notably `gawk` and `mawk`
- solves generally the same problems as similar text-processing tools `sed`, `grep`, `wc`, `tr`, `cut`, `printf`, `tail`, `head`, `cat`, `tac`, `bc`, `column`, ...
AWK language use-cases are:
- computing int / floating point math formulas (based on input)
- general text-processing
- cutting pieces from input text stream
- reformatting input text stream
- (shell) meta-programming generator
AWK language capabilities:
- text-processing functions
- regular expression support
- math functions
- dynamic typing, support for
  - integer / long
  - floats
  - associative arrays (including multi-dimensional array support)
- external execution support
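A minimal sketch of a few of these capabilities (associative arrays, dynamic typing and a math function). The input data and the array name `count` are made up for illustration only:

```shell
# Count how often each word occurs (associative array) and do some math
# on the second column (the text field is used directly as a number).
result=$(printf 'apple 3\npear 4.5\napple 1.5\n' | awk '
    { count[$1]++; sum += $2 }    # $2 arrives as text and is used as a number
    END {
        printf("apple=%d pear=%d avg=%.1f sqrt=%.0f\n",
               count["apple"], count["pear"], sum / NR, sqrt(sum))
    }')
echo "$result"    # apple=2 pear=1 avg=3.0 sqrt=3
```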
Processing workflow aka main()
Every AWK execution consists of the following three phases:
- [1] `BEGIN{ ... }` are actions performed at the beginning, before the first text character is read
  - multiple blocks allowed (normally single)
- [2] `[condition]{ ... }` are actions performed on every AWK record (by default a text line)
  - every AWK record is automatically split into AWK fields (by default words)
  - multiple blocks allowed
- [3] `END{ ... }` are actions performed at the end of the execution, after the last text character is read
  - multiple blocks allowed (normally single)
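The three phases can be sketched with a condition in front of the middle block, so that it runs only for matching records. The input lines and the `errors` counter are illustrative assumptions:

```shell
out=$(printf 'ok 1\nerror 2\nok 3\nerror 4\n' | awk '
    BEGIN         { print "start" }                       # [1] once, before any input
    $1 == "error" { errors++; print "line " NR ": " $2 }  # [2] only for matching records
    END           { print "errors: " errors }             # [3] once, after all input
')
echo "$out"
# start
# line 2: 2
# line 4: 4
# errors: 2
```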


warm-up basic example
$ echo -e "AWK is still useful\ntext-processing technology!" | \
> awk 'BEGIN{wcnt=0;print "lineno/#words/3rd-word:individual words\n"}
> {printf("% 6d/% 6d/% 8s:%s\n",NR,NF,$3,$0);wcnt+=NF}
> END{print "\nSummary:", NR, "lines/records,", wcnt, "words/fields"}'
lineno/#words/3rd-word:individual words

     1/     4/   still:AWK is still useful
     2/     2/        :text-processing technology!

Summary: 2 lines/records, 6 words/fields
Command-line basics
- Passing text data to AWK:
  - from pipe: `cat input-data.txt | awk <app>`
  - from file[s] read by awk itself: `awk <app> input-data.txt`
- AWK application execution styles (`-f`):
  - on command-line: `awk '{ ... }' input-data.txt`
  - in separate files: `awk -f myapp.awk input-data.txt`
- specifying an AWK variable on command-line: `-v var=val`
- specifying the AWK field separator: `FS` variable or `-F <FS>` switch
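The options above can be combined as in the following sketch. The file `users.txt`, its content and the variable `min` are assumptions made up for this example; `myapp.awk` is just the placeholder name used above:

```shell
tmpdir=$(mktemp -d)
printf 'root:x:0\nalice:x:1000\nbob:x:1001\n' > "$tmpdir/users.txt"

# Program on the command line: -F sets FS, -v passes a value into AWK.
inline=$(awk -F: -v min=1000 '$3 >= min { print $1 }' "$tmpdir/users.txt")

# The same program stored in a separate file and run with -f.
printf '%s\n' '$3 >= min { print $1 }' > "$tmpdir/myapp.awk"
fromfile=$(awk -F: -v min=1000 -f "$tmpdir/myapp.awk" "$tmpdir/users.txt")

# Input coming from a pipe instead of a file argument.
piped=$(cat "$tmpdir/users.txt" | awk -F: '{ print $1 }' | head -n 1)

rm -r "$tmpdir"
echo "$inline"    # alice and bob, one per line
echo "$piped"     # root
```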
Global variables
Global variables are documented in the AWK manual; the most common ones are:
- `$0` value of the current AWK record (whole line without line-break)
- `$1`, `$2`, ... `$NF` values of the first, second, ... last AWK field (word)
- `FS` specifies the input AWK field separator, i.e. how AWK breaks an input record into fields (default: whitespace)
- `RS` specifies the input AWK record separator, i.e. how AWK breaks the input stream into records (default: a line break)
- `OFS` specifies the output field separator, i.e. how AWK prints fields to the output stream using `print` (default: single space)
- `ORS` specifies the output record separator, i.e. how AWK prints records to the output stream using `print` (default: line break)
- `FILENAME` contains the name of the input file read by awk (treat it as a read-only global variable)
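A short sketch of how the separator variables interact; the sample input is made up for illustration:

```shell
# FS splits input on commas; OFS and ORS shape what print emits.
out1=$(printf 'a,b,c\nd,e,f\n' | awk '
    BEGIN { FS = ","; OFS = "-"; ORS = ";\n" }
    { print NR, $1, $NF }')
echo "$out1"
# 1-a-c;
# 2-d-f;

# RS changes what counts as a record: here records end at ";" instead of a line break.
out2=$(printf 'x y;z w\n' | awk 'BEGIN { RS = ";" } { print $2 }')
echo "$out2"
# y
# w
```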
Built-in functions
AWK functions are documented in the AWK manual; the most important ones are:
- `print`, `printf()` and `sprintf()` - printing functions
- `length()` - length of a string argument
- `substr()` - extract a substring from a string
- `split()` - split a string into an array of strings
- `index()` - find the position of a substring in a string
- `sub()` and `gsub()` - (regexp) search and replace (once or globally, respectively)
- `~` operator and `match()` - regexp search
- `tolower()` and `toupper()` - convert text to lowercase or uppercase, respectively
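The functions listed above in action, as a minimal sketch on a made-up input line (variable names such as `parts`, `s` and `g` are illustrative):

```shell
out=$(echo "Hello, AWK world" | awk '{
    print length($0)                      # 16
    print substr($0, 8, 3)                # AWK
    print index($0, "AWK")                # 8
    n = split($0, parts, " ")
    print n, parts[2]                     # 3 AWK
    s = $0; sub(/o/, "0", s); print s     # Hell0, AWK world (first match only)
    g = $0; gsub(/o/, "0", g); print g    # Hell0, AWK w0rld (all matches)
    if ($0 ~ /AWK/) print "matches"       # matches
    print match($0, /w[a-z]+/), RLENGTH   # 12 5
    print toupper($1), tolower($2)        # HELLO, awk
    print sprintf("%05.1f", 3.14159)      # 003.1
}')
echo "$out"
```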
Learn by examples
- Hello world
- Word count using wc and awk
- Pattern search using grep and awk
- Uniq words in awk
- Computing the average
- Text stream FSM machine
- Manipulation with text columns
- Shell metaprogramming with awk
- Why is cut very limited compared to awk
- Memory hungry application
- CPU intensive application
- Debugging / profiling AWK application
- GNU AWK network programming
- 30 seconds of AWK code
Best practices
Portability
Prefer general awk over a specific AWK implementation:
- use general `awk` for portable programs
- otherwise use the particular implementation, e.g. `gawk`
AWK programs extension and readability
A general rule of thumb is to create the AWK program as a `*.awk` file if the equivalent one-liner is not easily readable.
If you have trouble understanding a one-line awk program, feel free to use GNU AWK's profiling functionality, i.e. the `-p` option, to get pretty-printed AWK code (in `awkprof.out`).
Code quality
- comment properly
- indent similarly as in the c/c++ programming languages
- use functions whenever possible
- stay explicit and avoid awk default (implicit) actions, which make an AWK application hard to understand
  - example: `length > 80` should rather be written `'length($0) > 80 { print }'` or `'length($0) > 80 { print $0 }'`
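A small sketch of the last two points, using a limit of 10 instead of 80 to keep the input short. The function name `too_long` and the variable `max` are hypothetical names chosen for this example:

```shell
input=$(printf 'short\nthis line is long\n')

# Implicit style: no argument to length, no action block.
implicit=$(echo "$input" | awk 'length > 10')

# Explicit style: a named function, an explicit argument and an explicit action.
explicit=$(echo "$input" | awk -v max=10 '
    # Return 1 when the string is longer than the limit, 0 otherwise.
    function too_long(s, limit) {
        return length(s) > limit
    }
    too_long($0, max) { print $0 }
')
echo "$explicit"    # this line is long
```

Both variants print the same line; the explicit one is longer but says what it checks and what it prints.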
Pitfalls
- don't forget to always use apostrophe `'` quotation when writing awk one-line applications to avoid shell expansion (for instance of `$1`)
  - `awk "{print $1}"` should be `awk '{print $1}'`
- use one of the recommended implementations, as old implementations are quite limited (old `awk` or `nawk`)
- string / array indexing starts from `1` (`index()`, `split()`, `$i`, ...)
- the GNU AWK implementation understands localization & utf-8/unicode, and thus replacing with `[g]sub()` can lead to unwanted behavior unless you force gawk to drop such support by exporting the environment variable `LC_ALL=C`
- other awk implementations may not support utf-8/unicode:
# awk implementation versions
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.5, GNU MP 6.1.1)
mawk 1.3.4 20161107
BusyBox v1.22.1 (2016-02-03 18:22:11 UTC) multi-call binary.
$ echo "Zřetelně" | gawk '{print toupper($0)}'
ZŘETELNĚ
$ echo "Zřetelně" | mawk '{print toupper($0)}'
ZřETELNě
$ echo "Zřetelně" | busybox awk '{print toupper($0)}'
ZřETELNě
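- interval expressions (`{n,m}`) in regular expressions are not available in every implementation: gawk supports them (older versions need them explicitly enabled with `--re-interval`), while the mawk version listed above does not: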
$ ps auxwww | gawk '{if($2~/^[0-9]{1,1}$/){print}}'
root 1 0.0 0.0 197064 4196 ? Ss Oct31 2:21 /usr/lib/systemd/systemd --switched-root --system --deserialize 24
root 4 0.0 0.0 0 0 ? S< Oct31 0:00 [kworker/0:0H]
$ ps auxwww | gawk --re-interval '{if($2~/^[0-9]{1,1}$/){print}}'
root 1 0.0 0.0 197064 4196 ? Ss Oct31 2:21 /usr/lib/systemd/systemd --switched-root --system --deserialize 24
root 4 0.0 0.0 0 0 ? S< Oct31 0:00 [kworker/0:0H]
$ ps auxwww | mawk '{if($2~/^[0-9]{1,1}$/){print}}'
$
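The quoting and indexing pitfalls from the list above can be reproduced with the following sketch. The helper functions `double_quoted` and `single_quoted` are hypothetical wrappers, used only so that the shell's own `$1` has a known (empty) value:

```shell
# Inside double quotes the shell replaces $1 before awk ever sees the program.
double_quoted() { echo "a b c" | awk "{print $1}"; }
# Inside apostrophes the program reaches awk unchanged.
single_quoted() { echo "a b c" | awk '{print $1}'; }

wrong=$(double_quoted)    # called without arguments: awk runs {print } and prints the whole line
right=$(single_quoted)
echo "$wrong"    # a b c
echo "$right"    # a

# Strings and arrays filled by split() are indexed from 1, not 0.
first=$(awk 'BEGIN {
    n = split("x y z", a, " ")
    print a[1], substr("hello", 1, 1), index("hello", "h")
}')
echo "$first"    # x h 1
```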