itoa used correctly is faster than numtoa used correctly
https://github.com/mmstick/numtoa/issues/4#issuecomment-274293855
Based on the benchmark code in #4, but writing to a Vec<u8>, itoa is about 10% faster.
let mut null = Vec::new();
for number in 0..100_000_000u64 {
    null.clear();
    /* write */
}
numtoa: 995 ms
itoa: 866 ms
std: 2803 ms
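For concreteness, here is a minimal sketch of what the /* write */ step looks like for each crate, using the same pre-1.0 APIs as the full programs below (numtoa returns the index in the scratch buffer where the digits begin):

extern crate itoa;
extern crate numtoa;

use numtoa::NumToA;

fn main() {
    let mut null: Vec<u8> = Vec::new();
    let mut scratch = [0u8; 20];
    for number in 0..100u64 {
        // itoa: Vec<u8> implements io::Write, so digits go straight in.
        null.clear();
        let _ = itoa::write(&mut null, number);

        // numtoa: format into the scratch buffer, then copy the digits,
        // which occupy the tail of the buffer starting at `start`.
        null.clear();
        let start = number.numtoa(10, &mut scratch);
        null.extend_from_slice(&scratch[start..]);
    }
}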
This is an important use case because it is a good model of what serde_json::to_string is doing, for example.
It looks like in the /dev/null situation numtoa spends about 70% of its time in the kernel, so the lookup table is not a bottleneck; but for Vec<u8> it spends practically zero time in the kernel and the enormous lookup table becomes really expensive. That explains why itoa does a lot better for that use case.
Side note: even when writing to /dev/null, buffers are a magical thing.
extern crate itoa;
extern crate numtoa;
extern crate time;

use numtoa::NumToA;
use std::fs::File;
use std::io::Write;

fn main() {
    // /dev/null must be opened for writing; File::open would return a
    // read-only handle and every write below would silently fail.
    let mut null = File::create("/dev/null").unwrap();
    let mut scratch = [0u8; 20];

    // Unbuffered: one write syscall per number.
    let numtoa_start = time::precise_time_ns() / 1_000_000;
    for number in 0..100_000_000u64 {
        let bytes = number.numtoa(10, &mut scratch);
        let _ = null.write(&scratch[bytes..]);
    }
    let numtoa_end = time::precise_time_ns() / 1_000_000;
    println!("numtoa: {} ms", numtoa_end - numtoa_start);

    // Buffered: accumulate a million numbers in a Vec, then one big write.
    let numtoa_start = time::precise_time_ns() / 1_000_000;
    let mut number = 0;
    let mut buf = Vec::with_capacity(8_000_000);
    while number < 100_000_000u64 {
        for number in number..(number + 1_000_000) {
            let bytes = number.numtoa(10, &mut scratch);
            let _ = buf.write(&scratch[bytes..]);
        }
        let _ = null.write_all(&buf);
        buf.clear();
        number += 1_000_000;
    }
    let numtoa_end = time::precise_time_ns() / 1_000_000;
    println!("numtoa buffered: {} ms", numtoa_end - numtoa_start);

    let itoa_start = time::precise_time_ns() / 1_000_000;
    let mut number = 0;
    let mut buf = Vec::with_capacity(8_000_000);
    while number < 100_000_000u64 {
        for number in number..(number + 1_000_000) {
            let _ = itoa::write(&mut buf, number);
        }
        let _ = null.write_all(&buf);
        buf.clear();
        number += 1_000_000;
    }
    let itoa_end = time::precise_time_ns() / 1_000_000;
    println!("itoa buffered: {} ms", itoa_end - itoa_start);
}
numtoa: 6073 ms
numtoa buffered: 1038 ms
itoa buffered: 956 ms
You could store the generated values in a [u8; 8192] buffer for better performance than you would get with a Vec. Having a buffer larger than 8192 bytes has pretty much no impact on performance, but storing your data in a Vec halves the performance you could get with an array. I've done this in my Parallel application, where a stack-allocated buffer only performs a write when it can no longer hold more data. No vectors.
Basically, each write syscall costs some overhead, and buffering locally can mitigate that overhead by performing fewer write syscalls, if you don't need immediate results. But there's a limit to the size of a write: 16K on Windows and 64K on Linux. 8K is typically chosen as the default buffer size in applications though (BufWriter uses 8K by default), because there's no measurable difference in performance between eight 8K writes and one 64K write.
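A minimal sketch of that pattern using only the standard library: BufWriter is the stock buffer-then-flush implementation, and the 8 * 1024 capacity here just makes its default explicit.

extern crate itoa;

use std::fs::File;
use std::io::{BufWriter, Write};

fn main() {
    let null = File::create("/dev/null").unwrap();
    // Small writes accumulate in the 8K buffer and reach the file as
    // one write syscall each time the buffer fills up.
    let mut out = BufWriter::with_capacity(8 * 1024, null);
    for number in 0..1_000u64 {
        let _ = itoa::write(&mut out, number);
    }
    let _ = out.flush();
}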
That's great in theory but I couldn't get it to work. The resulting code using numtoa is more complicated than itoa, more unsafe than itoa, and slower than itoa. Can you do better?
extern crate itoa;
extern crate numtoa;
extern crate time;

use numtoa::NumToA;
use std::fs::File;
use std::io::Write;
use std::{cmp, ptr};

fn main() {
    let mut null = File::create("/dev/null").unwrap();

    let numtoa_start = time::precise_time_ns() / 1_000_000;
    let mut number = 0;
    let mut buf = [0; 8192];
    let mut pos = 0;
    while number < 100_000_000u64 {
        for number in number..cmp::min(number + 1024, 100_000_000) {
            // numtoa writes the digits at the *end* of the slice and
            // returns the index where they begin, so shift them back
            // down to `pos` to keep the buffer contiguous.
            let bytes = number.numtoa(10, &mut buf[pos..]);
            let len = 8192 - pos - bytes;
            unsafe {
                ptr::copy(buf[pos + bytes..].as_ptr(),
                          buf[pos..].as_mut_ptr(),
                          len);
            }
            pos += len;
        }
        let _ = null.write_all(&buf[..pos]);
        pos = 0;
        number += 1024;
    }
    let numtoa_end = time::precise_time_ns() / 1_000_000;
    println!("numtoa buffered: {} ms", numtoa_end - numtoa_start);

    let itoa_start = time::precise_time_ns() / 1_000_000;
    let mut number = 0;
    let mut buf = Vec::with_capacity(8192);
    while number < 100_000_000u64 {
        for number in number..cmp::min(number + 1024, 100_000_000) {
            let _ = itoa::write(&mut buf, number);
        }
        let _ = null.write_all(&buf);
        buf.clear();
        number += 1024;
    }
    let itoa_end = time::precise_time_ns() / 1_000_000;
    println!("itoa buffered: {} ms", itoa_end - itoa_start);
}
numtoa buffered: 982 ms
itoa buffered: 933 ms
Basically I am starting to wonder whether the readme is saying "if you use numtoa wrong, it is faster than if you use itoa wrong" when the reality seems to be "if you use numtoa right, it is slower than if you use itoa right." I am honestly happy to use either one for serde_json but I want it to be the faster one.
You basically create a data structure like so:
use std::io::{Error, Write};

struct DiskBuffer<IO: Write> {
    device: IO,
    read: usize,
    buffer: [u8; 8 * 1024]
}

impl<IO: Write> DiskBuffer<IO> {
    fn new(device: IO) -> DiskBuffer<IO> {
        DiskBuffer { buffer: [0u8; 8 * 1024], read: 0, device: device }
    }

    fn write(&mut self, data: &[u8]) -> Result<usize, Error> {
        let length = data.len();
        if self.read + length > 8 * 1024 {
            // Buffer is full: flush it, then start over with `data`.
            self.device.write(&self.buffer[0..self.read])?;
            self.buffer[0..length].copy_from_slice(data);
            self.read = length;
        } else {
            self.buffer[self.read..self.read + length].copy_from_slice(data);
            self.read += length;
        }
        Ok(length)
    }
}
No need for unsafe
Okay but... it's really slow despite the stack buffer. What am I doing wrong?
numtoa buffered: 6153 ms
itoa buffered: 898 ms
extern crate itoa;
extern crate numtoa;
extern crate time;

use numtoa::NumToA;
use std::fs::File;
use std::io::{Error, Write};
use std::{cmp, mem};

struct DiskBuffer<IO: Write> {
    device: IO,
    read: usize,
    buffer: [u8; 8 * 1024]
}

impl<IO: Write> DiskBuffer<IO> {
    fn new(device: IO) -> DiskBuffer<IO> {
        DiskBuffer { buffer: [0u8; 8 * 1024], read: 0, device: device }
    }

    fn write(&mut self, data: &[u8]) -> Result<usize, Error> {
        let length = data.len();
        if self.read + length > 8 * 1024 {
            self.device.write(&self.buffer[0..self.read])?;
            self.buffer[0..length].copy_from_slice(data);
            self.read = length;
        } else {
            self.buffer[self.read..self.read + length].copy_from_slice(data);
            self.read += length;
        }
        Ok(length)
    }

    fn flush(&mut self) -> Result<usize, Error> {
        self.device.write(&self.buffer[..self.read])?;
        Ok(mem::replace(&mut self.read, 0))
    }
}

fn main() {
    // As above, /dev/null must be opened for writing, not reading.
    let null = File::create("/dev/null").unwrap();

    let numtoa_start = time::precise_time_ns() / 1_000_000;
    let mut buffered = DiskBuffer::new(null);
    let mut scratch = [0; 20];
    for number in 0..100_000_000u64 {
        let bytes = number.numtoa(10, &mut scratch);
        let _ = buffered.write(&scratch[bytes..]);
    }
    let _ = buffered.flush();
    let numtoa_end = time::precise_time_ns() / 1_000_000;
    println!("numtoa buffered: {} ms", numtoa_end - numtoa_start);

    let mut null = File::create("/dev/null").unwrap();
    let itoa_start = time::precise_time_ns() / 1_000_000;
    let mut number = 0;
    let mut buf = Vec::with_capacity(8192);
    while number < 100_000_000u64 {
        for number in number..cmp::min(number + 1024, 100_000_000) {
            let _ = itoa::write(&mut buf, number);
        }
        let _ = null.write_all(&buf);
        buf.clear();
        number += 1024;
    }
    let itoa_end = time::precise_time_ns() / 1_000_000;
    println!("itoa buffered: {} ms", itoa_end - itoa_start);
}
Also, if you copied that DiskBuffer code from some project of yours, you should update it to use Write::write_all, because the existing code will silently lose data on a partial write. Unfortunately :panda_face: that makes it even slower.
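For reference, a sketch of that fix: the same DiskBuffer as above, with the two lossy write calls swapped for write_all so partial writes are retried rather than dropped.

use std::io::{Error, Write};
use std::mem;

struct DiskBuffer<IO: Write> {
    device: IO,
    read: usize,
    buffer: [u8; 8 * 1024]
}

impl<IO: Write> DiskBuffer<IO> {
    fn write(&mut self, data: &[u8]) -> Result<usize, Error> {
        let length = data.len();
        if self.read + length > 8 * 1024 {
            // write_all loops until every byte is written, so a short
            // write no longer silently loses the tail of the buffer.
            self.device.write_all(&self.buffer[0..self.read])?;
            self.buffer[0..length].copy_from_slice(data);
            self.read = length;
        } else {
            self.buffer[self.read..self.read + length].copy_from_slice(data);
            self.read += length;
        }
        Ok(length)
    }

    fn flush(&mut self) -> Result<usize, Error> {
        self.device.write_all(&self.buffer[..self.read])?;
        Ok(mem::replace(&mut self.read, 0))
    }
}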
It's not from existing code. I'm not sure why that would be slower though.
I'm not sure either but thanks for your help in this issue and my many other ones :smile:. At this point I have spent as many hours as I am willing honestly trying to get numtoa to be faster than itoa for the use case described in the readme and I am ready to give up. I will be interested to hear if you come up with anything to solve this in the future.
I'll investigate for performance improvements at a later time when I can.