String literals codegen
Right now it seems like we're allocating heap objects and then generating very long runs of single-byte stores:
```
4fa534: be 2e 00 00 00  mov    $0x2e,%esi
4fa539: 40 88 30        mov    %sil,(%rax)
4fa53c: 4c 89 f0        mov    %r14,%rax
4fa53f: 48 89 fe        mov    %rdi,%rsi
4fa542: 48 c1 ee 09     shr    $0x9,%rsi
4fa546: 48 01 f0        add    %rsi,%rax
4fa549: 48 83 c0 0e     add    $0xe,%rax
4fa54d: be 66 00 00 00  mov    $0x66,%esi
4fa552: 40 88 30        mov    %sil,(%rax)
4fa555: 4c 89 f0        mov    %r14,%rax
4fa558: 48 89 fe        mov    %rdi,%rsi
4fa55b: 48 c1 ee 09     shr    $0x9,%rsi
4fa55f: 48 01 f0        add    %rsi,%rax
4fa562: 48 83 c0 0f     add    $0xf,%rax
4fa566: be 69 00 00 00  mov    $0x69,%esi
4fa56b: 40 88 30        mov    %sil,(%rax)
4fa56e: 4c 89 f0        mov    %r14,%rax
4fa571: 48 89 fe        mov    %rdi,%rsi
4fa574: 48 c1 ee 09     shr    $0x9,%rsi
4fa578: 48 01 f0        add    %rsi,%rax
4fa57b: 48 83 c0 10     add    $0x10,%rax
4fa57f: be 6c 00 00 00  mov    $0x6c,%esi
4fa584: 40 88 30        mov    %sil,(%rax)
4fa587: 4c 89 f0        mov    %r14,%rax
4fa58a: 48 89 fe        mov    %rdi,%rsi
4fa58d: 48 c1 ee 09     shr    $0x9,%rsi
4fa591: 48 01 f0        add    %rsi,%rax
4fa594: 48 83 c0 11     add    $0x11,%rax
4fa598: be 65 00 00 00  mov    $0x65,%esi
4fa59d: 40 88 30        mov    %sil,(%rax)
4fa5a0: 4c 89 f0        mov    %r14,%rax
4fa5a3: 48 89 fe        mov    %rdi,%rsi
4fa5a6: 48 c1 ee 09     shr    $0x9,%rsi
4fa5aa: 48 01 f0        add    %rsi,%rax
4fa5ad: 48 83 c0 12     add    $0x12,%rax
4fa5b1: be 20 00 00 00  mov    $0x20,%esi
4fa5b6: 40 88 30        mov    %sil,(%rax)
4fa5b9: 4c 89 f0        mov    %r14,%rax
4fa5bc: 48 89 fe        mov    %rdi,%rsi
4fa5bf: 48 c1 ee 09     shr    $0x9,%rsi
4fa5c3: 48 01 f0        add    %rsi,%rax
4fa5c6: 48 83 c0 13     add    $0x13,%rax
4fa5ca: be 20 00 00 00  mov    $0x20,%esi
4fa5cf: 40 88 30        mov    %sil,(%rax)
4fa5d2: 4c 89 f0        mov    %r14,%rax
4fa5d5: 48 89 fe        mov    %rdi,%rsi
4fa5d8: 48 c1 ee 09     shr    $0x9,%rsi
4fa5dc: 48 01 f0        add    %rsi,%rax
4fa5df: 48 83 c0 14     add    $0x14,%rax
4fa5e3: be 20 00 00 00  mov    $0x20,%esi
4fa5e8: 40 88 30        mov    %sil,(%rax)
4fa5eb: 4c 89 f0        mov    %r14,%rax
4fa5ee: 48 89 fe        mov    %rdi,%rsi
4fa5f1: 48 c1 ee 09     shr    $0x9,%rsi
4fa5f5: 48 01 f0        add    %rsi,%rax
4fa5f8: 48 83 c0 15     add    $0x15,%rax
4fa5fc: be 20 00 00 00  mov    $0x20,%esi
4fa601: 40 88 30        mov    %sil,(%rax)
4fa604: 4c 89 f0        mov    %r14,%rax
4fa607: 48 89 fe        mov    %rdi,%rsi
4fa60a: 48 c1 ee 09     shr    $0x9,%rsi
4fa60e: 48 01 f0        add    %rsi,%rax
4fa611: 48 83 c0 16     add    $0x16,%rax
4fa615: be 20 00 00 00  mov    $0x20,%esi
4fa61a: 40 88 30        mov    %sil,(%rax)
4fa61d: 4c 89 f0        mov    %r14,%rax
4fa620: 48 89 fe        mov    %rdi,%rsi
4fa623: 48 c1 ee 09     shr    $0x9,%rsi
4fa627: 48 01 f0        add    %rsi,%rax
4fa62a: 48 83 c0 17     add    $0x17,%rax
```
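CSE and instruction selection would clean this up dramatically: each byte costs seven instructions, five of which recompute the same base address (%r14 + (%rdi >> 9)) plus a constant offset before the byte is loaded into %esi and stored. But it would be better still to generate the string (here ".file" followed by spaces) as read-only data and perform a block copy.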
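A minimal C sketch contrasting the two shapes, assuming a hypothetical alloc_bytes helper as a stand-in for the runtime's heap allocator (object headers and GC interaction are not modeled). init_bytewise mirrors the current output, one immediate store per byte; init_blockcopy keeps the bytes in a static const array, which C compilers on typical ELF targets place in a read-only section such as .rodata, and copies them with a single memcpy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the runtime's heap allocator. */
static char *alloc_bytes(size_t n) {
    char *p = malloc(n);
    assert(p != NULL);
    return p;
}

/* Current shape: one immediate store per byte of the literal. */
char *init_bytewise(void) {
    char *s = alloc_bytes(6);
    s[0] = '.';
    s[1] = 'f';
    s[2] = 'i';
    s[3] = 'l';
    s[4] = 'e';
    s[5] = ' ';
    return s;
}

/* Proposed shape: the bytes live in read-only data; one block copy. */
static const char lit_file[] = {'.', 'f', 'i', 'l', 'e', ' '};

char *init_blockcopy(void) {
    char *s = alloc_bytes(sizeof lit_file);
    memcpy(s, lit_file, sizeof lit_file);
    return s;
}
```

Both produce the same bytes; the difference is that the code size of init_bytewise grows with the literal's length, while init_blockcopy's does not.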
We should also make sure that literals are being hoisted to the top level; it doesn't make sense to construct new string literals in a loop.
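A minimal C sketch of what hoisting buys, assuming string values are immutable so that sharing one object across iterations is unobservable, and assuming the runtime can fill a global slot once at startup. literal_hello, init_literals, and the allocations counter are hypothetical names for illustration; free() stands in for the garbage collector.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

size_t allocations = 0; /* counts heap allocations, for illustration only */

/* Stand-in for "allocate a heap string and fill it from the literal". */
static char *alloc_copy(const char *src, size_t n) {
    char *p = malloc(n);
    assert(p != NULL);
    memcpy(p, src, n);
    allocations++;
    return p;
}

static size_t count_l(const char *s, size_t n) {
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        if (s[i] == 'l') c++;
    return c;
}

/* Not hoisted: a fresh copy of "hello" is built on every iteration. */
size_t count_unhoisted(int iters) {
    size_t total = 0;
    for (int i = 0; i < iters; i++) {
        char *s = alloc_copy("hello", 5);
        total += count_l(s, 5);
        free(s); /* stands in for the GC reclaiming the garbage copy */
    }
    return total;
}

/* Hoisted: the literal is built once into a hypothetical global slot. */
static char *literal_hello = NULL;

void init_literals(void) {
    literal_hello = alloc_copy("hello", 5);
}

size_t count_hoisted(int iters) {
    size_t total = 0;
    for (int i = 0; i < iters; i++)
        total += count_l(literal_hello, 5);
    return total;
}
```

With ten iterations the unhoisted loop allocates ten copies, while the hoisted version allocates one at startup and never again.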
There is not currently any mechanism for this kind of general rodata. One would have to be added, most likely by generalizing the bitmap table.
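The existing bitmap table is not modeled here. As a hypothetical illustration of what "general rodata" could look like on the compiler side, a pool can append each literal's bytes to one blob, deduplicate identical literals, and hand the code generator an offset to reference; the blob would then be emitted once into a read-only section. RodataPool and rodata_add are illustrative names, not existing compiler APIs, and only exact duplicates are shared (no suffix merging).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One entry per distinct literal: where its bytes start in the blob. */
typedef struct {
    size_t offset;
    size_t len;
} RodataEntry;

typedef struct {
    char *blob;            /* concatenated literal bytes */
    size_t blob_len;
    size_t blob_cap;
    RodataEntry *entries;
    size_t n_entries;
    size_t entries_cap;
} RodataPool;

/* Returns the offset of `bytes` in the pool, adding them if not present. */
size_t rodata_add(RodataPool *p, const char *bytes, size_t len) {
    for (size_t i = 0; i < p->n_entries; i++) {
        RodataEntry e = p->entries[i];
        if (e.len == len && memcmp(p->blob + e.offset, bytes, len) == 0)
            return e.offset; /* identical literal already pooled */
    }
    if (p->blob == NULL || p->blob_len + len > p->blob_cap) {
        size_t cap = p->blob_cap ? p->blob_cap : 64;
        while (cap < p->blob_len + len) cap *= 2;
        p->blob = realloc(p->blob, cap);
        assert(p->blob != NULL);
        p->blob_cap = cap;
    }
    if (p->n_entries == p->entries_cap) {
        p->entries_cap = p->entries_cap ? p->entries_cap * 2 : 16;
        p->entries = realloc(p->entries, p->entries_cap * sizeof *p->entries);
        assert(p->entries != NULL);
    }
    size_t off = p->blob_len;
    memcpy(p->blob + off, bytes, len);
    p->blob_len += len;
    p->entries[p->n_entries++] = (RodataEntry){off, len};
    return off;
}
```

Adding "file", "text", then "file" again yields offsets 0, 4, and 0, leaving an 8-byte blob.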
1. Stretch goal: If we can put string literals in consecutive global slots, we can put the lengths in rodata and use a single stub call to copy all of them to the heap.
2. Alternate stretch goal: If a program has many string literals that are rarely used, making them lazy would save startup time. This is probably not often relevant.
3. Alternate stretch goal: At the cost of significant changes to the data representation invariants, string literals could live outside the heap entirely.
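A sketch of the first stretch goal, under these assumptions: all literals' bytes are concatenated in one read-only blob, their lengths sit in a parallel read-only array, and the literals occupy consecutive global slots. A single hypothetical stub, copy_literals_to_heap, then walks the table once at startup in place of one call per literal. Heap objects are modeled as a length followed by bytes; the real object header is not.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified heap string: length followed by the bytes. */
typedef struct {
    size_t len;
    char bytes[];
} HeapStr;

/* What the compiler would emit into read-only data. */
static const char lit_blob[] = "hello" "world!" ".file";
static const size_t lit_lens[] = {5, 6, 5};
#define N_LITS (sizeof lit_lens / sizeof lit_lens[0])

/* Consecutive global slots, one per literal. */
HeapStr *lit_slots[N_LITS];

/* Single stub: copy every literal from rodata into its heap slot. */
void copy_literals_to_heap(HeapStr **slots, const char *blob,
                           const size_t *lens, size_t n) {
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        HeapStr *s = malloc(sizeof *s + lens[i]);
        assert(s != NULL);
        s->len = lens[i];
        memcpy(s->bytes, blob + off, lens[i]);
        slots[i] = s;
        off += lens[i];
    }
}
```

Usage: one call, copy_literals_to_heap(lit_slots, lit_blob, lit_lens, N_LITS), fills every slot; the per-literal work is a table entry rather than a call site.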
Preliminary estimate: the bootstrapped compiler has 5722 patterns that look like string initializations (two or more byte writes separated by a constant 7 instructions), setting 31624 bytes using 758976 bytes of code. The new two-argument _StrLit fits in exactly the space of the old _RefByte; the first or third stretch goal would allow most of these calls to be eliminated entirely.
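Dividing out the figures in the estimate (plain arithmetic on the numbers above, no new measurements):

$$
\frac{758976}{31624} = 24\ \text{bytes of code per byte of string data},\qquad
\frac{31624}{5722} \approx 5.53\ \text{bytes per literal},\qquad
\frac{758976}{5722} \approx 132.6\ \text{bytes of code per literal}.
$$

Literals are short on average, so once their bytes move into rodata, the remaining per-literal cost is presumably dominated by the call itself, which is why eliminating the calls matters.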