ll_asm
linux_logo in 26+ kinds of assembly language
yes, I am insane
Tired of waiting many milliseconds for linux_logo to run? Tired of wasting 35k of disk space? Upset that to run linux_logo you need huge GLIBC?
Well your worries are over!
With "ll" [even the name was shortened to save space!] you get all of the benefits of linux_logo in a smaller, faster package!
"ll" is written entirely in native Linux assembly language!
Some Statistics
*NOTE* not all architectures implement the same feature-set
(i.e., not all have MHz in /proc/cpuinfo) so this is only
a rough comparison.
Processor     lzss executable
---------     ---------------
ia64          2826 bytes (as 2.18.0.20080103)
alpha         1821 bytes (as 2.19.51.20090723)
RiSC          1418 bytes (RiSC-as 0.0.2)
parisc        1400 bytes (as 2.17.50)
mips          1314 bytes (as 2.28)
microblaze    1298 bytes (as 2.16)
m88k          1240 bytes (as 1.92.3)
SPARC         1221 bytes (as 2.22)
arm.oabi      1186 bytes (as 2.18.0.20080103)
PPC           1165 bytes (as 2.19.51.20090805)
riscv64-im    1161 bytes (as 2.28.0.20170505)
arm.eabi      1161 bytes (as 2.24.51.20141001)
micromips     1147 bytes (as 2.28)
6502          1130*bytes (ca64 2.12.0)
riscv32-im    1125 bytes (as 2.28.0.20170505)
mips16        1104 bytes (as 2.28)
arm64         1094 bytes (as 2.23.1)
s390          1064 bytes (as 2.19.51.20090805)
x86_64        1027 bytes (as 2.29)
riscv64-imc   1019 bytes (as 2.28.0.20170505)
x86_x32       1003 bytes (as 2.29)
sh3            994 bytes (as 2.17.50.0.5)
x86            968 bytes (as 2.29)
riscv32-imc    961 bytes (as 2.28.0.20170505)
vax            950 bytes (as 2.16.1)
arm_thumb      920 bytes (as 2.25)
1802           915 bytes (asmx)
avr32          914 bytes (as 2.16.1)
arm_thumb2     908 bytes (as 2.25)
crisv32        905 bytes (as 2.12.1)
z80            891 bytes (as 2.20.1.20100303)
pdp-11         890 bytes (as 2.19)
m68k           870 bytes (as 2.28)
8086           780 bytes (as 2.29)
* the 6502 results were adjusted to match the code present in other
architectures (i.e., not counting the graphical routines)
The various implementations have varying functionality and often
use different methods to get system info. Still, some gross
comparisons between the architectures can be made.
Individual architectural comments, in descending order of executable size:
ia64: ia64 is VLIW.
no divide at all.
you use the fp unit to do int multiply
no unaligned words.
on the plus side, you basically have infinite (well, 128) registers
does have an auto-incrementing load/store.
The actual ia64 architecture is too bizarre for words.
It probably doesn't make any sense at all unless you are a
computer architect.
The instructions come in groups of 3 (41 bits each, plus a 5-bit
template, for a total bundle size of 128 bits). These run in
parallel. So if your instructions don't parallelize,
you end up running lots of nops.
In any case it's no surprise the executable ends up being
so huge. It very probably can be optimized a lot from here.
If I really cared I'd turn off the automatic bundling
and set all of the instruction bundles by hand.
I found a bug in the assembler where it puts two
call instructions in the same bundle (which can't work).
I wonder if this means I am one of the few people who
writes programs entirely in assembly for ia64....
alpha:
Alpha is hurt because it implements some "optional" features,
such as MHz (which added a lot of code) and also counting
the number of CPUs, with proper pluralization of Processor.
The *big* hurt though is lack of byte-manipulating instructions.
The original alpha architecture did not support operations on bytes;
you have to do a lot of shifting and masking.
Unfortunately ll uses a lot of byte-sized memory operations.
On the x86 the instruction "lodsb" is 1 byte in size.
On the PPC the equivalent is "lbzu" which is 4 bytes in size.
On the alpha, the instruction is "ldb" which expands to
a "lda","ldq_u","extbl","sll","sra" sequence, plus
an add instruction that x86 and ppc do automatically.
Thus it takes 24 bytes.
It's a bit better if you do an unsigned load, which is only 16
bytes, but still.
A store byte actually does a 32-bit load, masks in the byte by
hand, and then does the actual 32-bit store :(
The "jump if bit one set" type instructions help the lzss.
The immediate field for ALU ops of only 8 bits really hurts.
There is no native integer divide instruction on alpha.
The GOT section is a huge space-sink. It has 64-bit constants,
which most of the time you shouldn't really need.
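The byte-load and byte-store expansions described above can be modelled roughly as follows. This is only a sketch in Python, not the Alpha code from ll: the function names borrow the Alpha mnemonics, while the memory model (a bytearray accessed in aligned 8-byte chunks) and the store helper are assumptions made for illustration.

```python
def ldq_u(memory, addr):
    """Load the aligned 8-byte little-endian chunk that contains addr."""
    base = addr & ~7
    return int.from_bytes(memory[base:base + 8], "little")


def extbl(quad, addr):
    """Extract the byte selected by the low three bits of addr."""
    return (quad >> (8 * (addr & 7))) & 0xFF


def load_byte_unsigned(memory, addr):
    # Models the shorter ldq_u + extbl form.
    return extbl(ldq_u(memory, addr), addr)


def load_byte_signed(memory, addr):
    # Models the extra sll/sra pair: shift the byte to the top of the
    # register and arithmetic-shift it back down to sign-extend it.
    value = load_byte_unsigned(memory, addr)
    return value - 0x100 if value & 0x80 else value


def store_byte(memory, addr, value):
    # Read-modify-write: load the whole chunk, mask the byte in, store it back.
    base = addr & ~7
    shift = 8 * (addr & 7)
    quad = int.from_bytes(memory[base:base + 8], "little")
    quad = (quad & ~(0xFF << shift)) | ((value & 0xFF) << shift)
    memory[base:base + 8] = quad.to_bytes(8, "little")
```

Every byte access touches a full aligned chunk, which is why each one costs several 4-byte instructions instead of one.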
RiSC:
This is a 16-bit architecture used in undergrad
computer courses I took at the University of Maryland.
http://www.ece.umd.edu/~blj/RiSC/
No logic instructions besides nand. This means
masking with an "and" takes at least 2 instructions
(because there is no immediate form).
16-bit memory accesses only. It is quite complicated to load a byte.
Only 7 registers, one of which is always 0, and one of which
is the stack pointer.
No shift instructions. Left shift can be done with add, but
right shift takes a fairly complicated routine.
Only able to branch +/- 64 instructions, which can be a limit.
Only a 6-bit immediate, though the lui instruction helps.
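To make the nand-only and no-shift restrictions concrete, here is a rough Python model of 16-bit values. It is a sketch, not the RiSC code used in ll; the helper names and the bit-by-bit right shift are hypothetical, chosen only to show why these operations cost extra instructions.

```python
MASK16 = 0xFFFF


def nand(a, b):
    """The only logic operation available."""
    return ~(a & b) & MASK16


def and_via_nand(a, b):
    # AND takes two nands: nand(a, b), then nand of that result with itself.
    t = nand(a, b)
    return nand(t, t)


def shift_left_one(value):
    # A left shift is just an add of the value to itself.
    return (value + value) & MASK16


def shift_right_one(value):
    # No shift instruction: test each input bit with a mask and, if it is
    # set, add the next-lower bit into the result.
    result = 0
    in_bit = 2
    out_bit = 1
    for _ in range(15):
        if and_via_nand(value, in_bit):
            result += out_bit
        in_bit = shift_left_one(in_bit)
        out_bit = shift_left_one(out_bit)
    return result
```

On the real machine the mask also has to be loaded into a register first, since nand has no immediate form.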
parisc: + really hurt by its short immediate field. Most addresses
require 2 instructions to load, even if you try to use relative-data add.
+ no integer divide, have to code it up... not so bad in loop form
+ delay slot that can be nulled out
+ some ALU instructions can also null out following instructions
conditionally
+ compare immediate instruction can only handle a 5-bit immediate
+ loads/stores must be aligned
+ no AND immediate instruction
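Coding up integer divide "in loop form" usually means one shift, compare and subtract per quotient bit. The Python sketch below shows that generic shift-and-subtract loop for 32-bit unsigned values. It is an illustration under that assumption, not the parisc routine from ll, and the function name is hypothetical.

```python
def divide_u32(dividend, divisor):
    """Unsigned 32-bit division, one quotient bit per loop iteration."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for bit in range(31, -1, -1):
        # Bring the next dividend bit down into the running remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        # If the divisor fits, subtract it and set this quotient bit.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1 << bit
    return quotient, remainder
```

The loop body is short, so the size cost is mostly the loop setup rather than the divide itself.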
sparc:
Condition codes make for tighter code.
register/register load address calcs also help.
13-bit immediate hurts a bit.
Crazy register windows are a bit hard to understand.
The SPARC comparison is a bit unfair, because my test machine is a 24-proc
niagara, so it has extra code to handle that many chips properly.
65c816:
Another non-Linux addition.
The method of switching the A and X, Y registers into and out
of 16 bit mode is a PAIN. A non-intuitive opcode, and if you
aren't careful your routines break if you're in an unexpected mode.
Would have been much better if somehow there were separate opcodes
for 8 vs 16-bit instructions.
riscv:
This is obviously an academic setup. Things changing all the time,
some parts of the arch extremely over-analyzed while others
just handwaving.
It's more or less the same as MIPS though.
Lack of addressing modes hurts. Also lack of short increment.
Also annoying the assembler won't let you do pointer math.
riscv-r64c:
This is the compressed RISC-V.
No short byte/half load/store instructions which hurts this benchmark.
Also no logical immediate ones either.
Also no 3-operand small adds.
Couldn't figure out C.JAL
The assembler does a good job of auto-using the small instructions,
so it was mostly a matter of register choice.
6502:
So obviously this isn't running on Linux. I was curious
how an 8-bit processor would compare.
The big problem is that the LZSS algorithm and the ll data
set are very much 16-bits in size. So there is a lot of
overhead having to increment 16-bit values on an 8-bit processor.
Only having 3 registers is a handicap, but the zero page (the
first 256 bytes of memory which can be accessed in one less byte
and in fewer cycles) acts almost as virtual registers.
Some potentially useful instructions that would have helped (and
that are actually implemented in the later 65C02 version of the
chip):
phx/plx (push/pull X directly)
ina/dea (increment/decrement A... otherwise need 2 instructions)
bra (branch always, relative jump; otherwise need a 3-byte 16-bit jmp)
stz (store zero)
The 6502 code uses high-res graphics to approximate the color
ascii art. When calculating total image size, we don't count
the overly-large graphics code and instead only count the code
that would be used to print the text, say, out over a serial port
to an ANSI capable term client.
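The overhead of 16-bit pointers on an 8-bit processor shows up even in a plain increment. The Python sketch below models the usual pattern (increment the low byte, and bump the high byte only when the low byte wraps). It is an illustration, not the 6502 source from ll, and the function name is hypothetical.

```python
def inc16(lo, hi):
    """Increment a 16-bit value held as two separate 8-bit bytes."""
    lo = (lo + 1) & 0xFF          # increment the low byte
    if lo == 0:                   # it wrapped, so carry into the high byte
        hi = (hi + 1) & 0xFF
    return lo, hi
```

What is a single instruction on a 16-bit machine becomes an increment, a branch and a second increment here, repeated for every pointer the LZSS loop advances.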
microblaze:
Similar to MIPS.
Branch delay slot is optional, which helps.
Only register+register and register+immediate address modes.
The branch+link instruction returns to the same instruction
that it left from, so for the return you have to specify
to add 4 or 8? Very weird.
The signed 16-bit immediate makes it impossible (I think) to load
0xff00 into a register w/o two instructions.
Can't get the assembler to do pointer math into an immediate.
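The 0xff00 problem comes from sign extension of the 16-bit immediate. A small Python sketch, assuming the immediate is sign-extended to 32 bits as described above (the helper names are hypothetical):

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a signed value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def fits_signed16(value32):
    """True if a 32-bit value survives a trip through a signed 16-bit immediate."""
    return (sign_extend16(value32) & 0xFFFFFFFF) == (value32 & 0xFFFFFFFF)
```

Here fits_signed16(0xff00) is False because the immediate comes back as 0xffffff00, so a second instruction is needed to fix up the upper half.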
mips: Recent binutils has made mips come in line with the
other architectures.
It is the most RISC of the RISC architectures. Thus it ends
up having a very non-dense instruction set.
On the plus side, it has hardware support for unaligned loads,
plus hardware integer divide, which help a lot.
mips16: Reduced size instructions should help, but as always the
limitations make it hard.
Mostly the constant size, too easy to kick into extended
instructions and those won't fit in delay slots.
Also was fighting the assembler the whole way.
micromips:
I thought this would be a straightforward port of mips16, but no.
Way larger, mostly because loading addresses with "la" regressed.
They dropped a lot of useful 16-bit instructions.
The constants are all weird too, and odd sized things like 5 or 10
can no longer be added in 16-bits.
Fighting the assembler all the way. In theory ADDIUPC could do
smaller loads but couldn't convince it to do so.
It is helpful having easy access to the higher regs.
Jals with short delay slot helped.
m88k: This chip came after m68k and was Motorola's first RISC chip.
Didn't go well, they moved on to PPC.
It's similar to a cross between MIPS and PPC.
Runs under OpenBSD because there is no Linux port to m88k,
using gxemul.
OpenBSD forces syscall error handling by skipping a
jump-to-error-handler instruction on return. This
makes each syscall in effect take 8 bytes.
Branch delay slot is optional (for performance only?)
Starts at low address space. This is a huge win as it
makes each pointer load 4-bytes instead of 8 assuming
we stay smaller than 64k.
bcnd instruction saves a cmp usually
cmp quite elaborate, setting 16-bits of comparison info in
a reg, not in flags. A more powerful version of the equivalent
Alpha instruction.
arm64:
New ISA different from arm32/thumb/thumb2
Fixed-width 32-bit instructions
Wider constants help *a lot*
Unaligned loads help too
New instructions, such as tbz also help
The conditional instructions like csel might, but at least in the
lzss code doesn't make a difference.
The compare/branch/zero instructions also help.
Lack of conditional execution hurts
What we really need is an "increment multiple" instruction.
inc {r1,r2,r4}
And also an equivalent of x86 loop
No open syscall, have to waste an instruction setting
AT_FDCWD and use openat() instead.
arm: no integer divide instruction
Really painful to load constants > 8 bits that aren't powers of two or
else 8-bit values rotated by an even number of bits.
If we had integer divide, saner constant support, and unaligned loads
we could probably beat x86 even with 32 bit instructions.
OABI is smaller than EABI because it doesn't need to load r7 with
syscall number.
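To see which constants are cheap, the Python sketch below checks whether a 32-bit value fits the classic 32-bit ARM data-processing immediate, which (as far as I know) is an 8-bit value rotated right by an even number of bits. The function names are hypothetical and this is not part of ll.

```python
def rotate_left32(value, amount):
    """Rotate a 32-bit value left by amount bits."""
    amount %= 32
    value &= 0xFFFFFFFF
    return ((value << amount) | (value >> (32 - amount))) & 0xFFFFFFFF


def is_arm_immediate(value):
    # Undo each possible even rotation and see if 8 bits are enough.
    return any(rotate_left32(value, rot) < 0x100 for rot in range(0, 32, 2))
```

For example 0x204 is encodable (0x81 rotated by 2) but 0x102 and 0x1234 are not, so they need a second instruction or a literal load.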
ppc: The PowerPC has very CISC-like opcodes as well.
Despite being load-store with 3 operand instructions, you almost
wouldn't know it was considered RISC. I also think I could
optimize the code a bit more and challenge x86.
The big help is auto-incrementing load/store byte instructions.
s390: This is the most CISC architecture I've ever seen. If only it had
a "load byte" opcode it would definitely beat out x86. I am sure
it can be optimized even smaller than x86 by a s390 expert.
Being able to do "strcat" in 2 or 3 op-codes and strlen in not
more than 5 is a big plus.
+ the mix of opcode sizes (often 16-bit, often 32-bit) is annoying
+ not having 3-operand opcodes also hurts
+ crazy CISC operations are amazing, but often don't do what I need
+ would be nice if offsets could be negative
+ would be nice if there was a relative branch shorter than 32-bits
x86_64: A straight x86 -> x86_64 conversion (which involves making
all of the push %e?? instructions into push %r??, as well as jmp *%edx
into jmp *%rdx) makes the code 28 bytes longer, due to the "inc"
instruction becoming 2 bytes, and extra addr32 prefixes being added to
various move instructions.
Switching the syscalls to native syscalls is about neutral.
You do have to make sure to save %ecx across syscalls then.
The sad part is we have 8 extra regs, but can't use any of them
because the extra byte prefix is a killer.
Also added in a few bytes extra to print the name better
(gratuitous spaces on some cpuinfos). Also we have to
handle 4GB of RAM so we lose a few bytes for a 64-bit load.
x86_x32: This code is more or less the same as x86_64
The x32 changes don't really help us much, as we weren't
using 64-bit pointers before, and we try not to use
the R8-R15 because they take extra bytes.
In fact, almost all of the size saving comes from the fact
that a 32-bit ELF header is smaller than a 64-bit one.
The move to having all syscalls with bit 30 set hurts us
by about 10 bytes or so.
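A quick Python sketch of what setting bit 30 does to a syscall number. The constant matches the kernel's x32 syscall bit (0x40000000); the helper names are hypothetical, and the link to instruction size is my assumption: a number that no longer fits in a signed byte cannot use the short byte-immediate encodings.

```python
X32_SYSCALL_BIT = 1 << 30   # 0x40000000


def x32_syscall_number(nr):
    """Syscall number as seen by the kernel under the x32 ABI."""
    return nr | X32_SYSCALL_BIT


def fits_signed8(value):
    """True if the value can be encoded as a sign-extended 8-bit immediate."""
    return -128 <= value <= 127
```

Under this assumption a small number such as 1 fits in a byte immediate, while x32_syscall_number(1) needs a full 32-bit immediate, costing a couple of extra bytes at each syscall site.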
1802: RCA COSMAC 1802
This is an interesting architecture.
No dedicated PC (you can set any of the 16 index registers to be PC)
No stack (though there are auto-inc/dec insns to help).
This makes function calls interesting. The traditional way
is to have dedicated index registers for each function, then
jump to just before the beginning before returning
(to "reset" that particular PC).
This is only workable if you have < 8 or so functions and
no leaf functions. Otherwise you have to emulate a stack
with fairly high overhead instruction count wise.
Otherwise nothing unusual when optimizing, except for the
always troublesome problem of doing 16-bit math when you
can only do math in the 8-bit accumulator.
vax:
vax is crazy CISC.
Some of the CISC instructions:
+ can operate on variable sized bit-fields
+ an asm instruction that implements switch/case statements
+ a fp instruction that calculates polynomials
+ special instructions for handling queues
+ various opcodes to accelerate COBOL (edit, etc)
+ xfc - extended function call, create your own opcodes
You can do strlen with essentially one instruction, though it's
a long one. vax could easily beat x86 if it had a few one-byte
instructions.
sh3:
auto-increment addressing for loads but not for stores?
-> yes, auto-increment/decrement set up for stack
accesses so it decrements on store (push) and incs on load (pop)
branch delay slots make things difficult
could really use a compare-with-zero instruction for reg other than r0
pretty compact code, even with lots of wasted branch delay slots
Really wish I could put the divide instructions in a loop (like parisc)
m68k: is even more CISC than x86 if such a thing is possible
(if you don't count Vector instructions). In addition
to BCD instructions it also has a wide variety of
bit-field manipulation instructions, plus full ALU complement.
Bizarrely, m68k assembly is very similar to THUMB.
weird having separate address and data registers.
can't shift by an immediate more than 8?
can't add carry with immediate?
have to clear upper parts of words when doing byte math;
no equivalent of the mips "lbu" instruction.
x86: The x86 code is currently the smallest, mainly because I had a
running contest for a while with Stephan Walter until
we got it below 1k. It does help that there are a lot of
useful 1-byte instructions in the x86 command set, which give
it an instant advantage over all of the RISC chips.
Lack of alignment makes string manipulating programs (like ll) a lot
easier, as you can store 16 and 32 bit values w/o having to worry
if the string is properly aligned.
arm_thumb:
I tried my hardest to beat x86, even though the arm port
doesn't have to do things x86 does (SMP support for example).
I came close, but not close enough. The lack of an integer
divide instruction and the lack of unaligned memory reads
killed it.
I do like the thumb instruction set, it is in many ways more
powerful than x86 while cleaner at the same time.
There are powerful push/pop instructions that can push/pop
any combination of registers in 16 bits.
The "blx" instruction to branch to a register (even a high one!)
is great. I cheated a bit by using the Arm5 instruction subset.
The code wouldn't be anywhere near as small if I had to use
generic arm4 thumb.
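The 16-bit push/pop works because the register list is a bitmask inside the instruction. The Python sketch below builds the 16-bit Thumb encodings as I understand them: eight bits select r0-r7 and one extra bit adds lr (for push) or pc (for pop). The function names are hypothetical and this is not code from ll.

```python
def _register_mask(low_regs):
    """Build the 8-bit register list for r0-r7."""
    mask = 0
    for reg in low_regs:
        if not 0 <= reg <= 7:
            raise ValueError("only r0-r7 fit in the 16-bit register list")
        mask |= 1 << reg
    return mask


def thumb_push(low_regs, lr=False):
    """16-bit Thumb PUSH: 0xB400, bit 8 adds lr, low 8 bits pick r0-r7."""
    return 0xB400 | (int(lr) << 8) | _register_mask(low_regs)


def thumb_pop(low_regs, pc=False):
    """16-bit Thumb POP: 0xBC00, bit 8 adds pc, low 8 bits pick r0-r7."""
    return 0xBC00 | (int(pc) << 8) | _register_mask(low_regs)
```

So thumb_push([4], lr=True) gives 0xB510 (push {r4, lr}) and thumb_pop([4], pc=True) gives 0xBD10, a two-byte function return that also restores a register.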
arm_thumb2:
Is same size as thumb if you make sure to use the narrow
forms of all the opcodes.
The "cbz" instruction saves space, but oddly only works
for forward branches.
The "itt" conditional instruction does help, as does
"movw" and "movt" to load 32-bit constants.
z80:
could really use a 16-bit dec that updated flags
could also use more useful 16-bit arithmetic insns
could use reg+reg addressing mode
nice if we could do ALU ops on other than 8-bit accumulator
z80 is nice for pascal-type strings, not C-type
Need 16-bit shifts. And shifts by more than 1
Need better way to set regs to zero.
Some extra bytes taken to handle CR/LF instead of just LF
The string copy routines are almost useless, having almost
as many bytes to set up as doing things discretely.
Really hurting not having just one more register that
can be accessed with one-byte opcodes (or else
an IX+REG addressing mode)
avr32: They specifically designed the arch to have compact assembly.
The "ret" return instruction is the most useful ever. It
can handle returning a value, as well as having a special
case to return 0 or -1, and also sets the status flags.
The one weakness is that almost no instructions can take
immediate values.
It also has a great "load halfword and swap bytes" which
would be great, only it has to be an aligned halfword
so we can't use it :(
Has the advantage that binaries start at a low address,
so the addresses of functions fit in a small number of bits.
The new champion for size ;) And there's probably a few
bytes lurking that can be removed still.
crisv32:
mostly 16-bit instructions
If constants are > 6bits need longer encoding
reg+offset addressing mode only available for acr (r15)
register, which increases code size. This is an improvement
on original cris where r15 was the PC.
VM usage starts at 0
No hw divide instruction
Branch delay slot
branch instructions always 3bytes. Use jump (register val) instead?
pdp-11: Despite wealth of addressing modes, some critical ones
are missing, such as being able to add two regs to
equal address, or reg and value *before* dereferencing.
really could use logical shifts
really could use a real AND instruction
severe register pressure
UGH it's a pain working in octal
extensive use of "adb"
the a.out executable format makes for small binaries
8086: This port targets a DOS COM file.
A COM file basically has zero overhead, which is why
things are very small.
A COM file only has one 16-bit segment, so pointers are
all 16 bits or less, which helps size.
8086 support is simpler than the x86 support which has
to handle more complicated cpu_info files, plus some
things DOS doesn't support like hostname.
8086 was a direct port of the 386 code, with 32-bit
registers switched to 16. Some changes had to be
made, as the 16-bit instruction set isn't as orthogonal
as the 32-bit one so some register combinations
(especially in effective address calculation) aren't allowed.
Features:
--------
+ Runs in 4 milliseconds, more than twice as fast as the 10 milliseconds
linux_logo takes on a K6-2+ 450!
+ Takes up only 969 bytes when super-stripped on x86!
Amaze your enemies! Impress your friends!
BUGS:
-----
No pretty-printing: This means that your computer is reported just
as /proc/cpuinfo reports, ugly model-name, off
MHz, and all.
Possibly kernel-dependent: I only tested this on 2.4 and 2.6 kernels.
The sysinfo() syscall changed between 2.2 and 2.4
Custom Logo:
------------
Point the "ANSI_TO_USE" variable in the Makefile to any text
or ansi file you want when building.
HOW TO HELP:
------------
If you have a Linux box running on an unsupported architecture,
offer the author a shell-account so he can create a version
for your type of machine!
Useful Resources:
-----------------
http://deater.net/weave/vmwprod/asm/ll/ll.html
http://www.linuxassembly.org
http://www.deater.net/weave/vmwprod/asm
http://www.deater.net/weave/vmwprod/linux_logo
Publications using ll:
----------------------
* V.M. Weaver, S.A. McKee. "Code Density Concerns for New Architectures",
27th IEEE International Conference on Computer Design (ICCD 2009),
Lake Tahoe, California, October 2009.
* V.M. Weaver, S.A. McKee. "Optimizing for Size: Exploring the Limits of
Code Density" (Poster), Architectural Support for Programming Languages
and Operating Systems (ASPLOS '09), Washington DC, March 2009.
Thanks to:
----------
Shellcoders. You seem to be the only useful resource for
linux assembly on the various platforms.
Special Thanks to:
------------------
my lovely wife
my beautiful daughter
my two sons
my three squeaking guinea pigs
AUTHOR:
-------
Vince Weaver <vince _at_ deater.net> http://www.deater.net/weave