
Incorrect Characterization

Open jbenet opened this issue 10 years ago • 12 comments

Hello!

Cool work! Hacking with v8 is great :) and I'm glad you're making cool stuff. Though I couldn't help but cringe when I read:

Take the popular Node.js as an example. Node.js mixes file I/O and file system operations, two distinct concepts, in one File System module. In the module, we can either read an entire file or a fixed-length data blob, but are unable to read a line as is provided by most other programming languages. Many other JS shell implementations follow the CommonJS APIs, which have a similar problem: no usable APIs for general-purpose file I/O. After all these efforts, on file I/O, we even do not have a JS shell matching the usability of C, let alone high-level programming languages such as Perl and Python.

(bold mine)

This is an incorrect characterization of Node.js; it is simply not true. Please check out node Streams: this is how I/O is done in node.

Let's take your example, line splitting. Here's how you do it:

var fs = require('fs')
var split = require('split')

var charStream = fs.createReadStream('foo')
var lineStream = charStream.pipe(split())
lineStream.on('data', function (line) {
  console.log('hey, look! a line: ' + line)
})

// or: lineStream.pipe(process.stdout)

And it works beyond shells: you can use process to handle stdio:

var split = require('split')

process.stdin.
  pipe(split()).
  on('data', function (line) {
    console.log('hey, look! a line: ' + line)
  })

And you can use through to reverse them:

var split = require('split')
var through = require('through')

var reverse = through(function (line) {
  this.queue(line.split('').reverse().join('') + '\n')
})

process.stdin.
  pipe(split()).
  pipe(reverse).
  pipe(process.stdout)

Put that in foo.js and run it:

cat foo.js | node foo.js

I recommend you learn more about node Streams (they're almost magical!); a great place to start is https://github.com/substack/stream-handbook -- streams in node are much closer to UNIX pipes than C is. Yep, I'm well aware C was built for UNIX. Node's philosophy is really, really close to UNIX.

Oh I almost forgot, you'll definitely have to install split as it's not in core. How? You search npm for "stream line split" and find many modules, including split, which has a ton of downloads. You install it with:

npm install split

I think maybe your real beef with node is that when you run node it doesn't come pre-loaded with lots of shiny things like split and so on. This is really a deliberate tradeoff: the idea is to keep a minimal core in Node and leave almost everything up to the modules.

This is the core philosophy of how node and npm do their magic. Having programmed in many environments for a while, I have to say, this is actually amazingly productive. It's The Right Thing To Do.

I agree that having to search and install a particular package is a bit hard to wrap your head around the first time you use node. It's a bit of friction when you first get started. Perhaps this will be fixed by better education: thankfully, core node folks have made nodeschool.io to help introduce the core concepts. Or by providing custom startup imports, i.e. a set of blessed modules you want with you on every shell execution (see the sketch below). Or when we no longer have to install packages at all.
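For that custom startup imports idea, here's a minimal sketch of what such a wrapper could look like using node's built-in repl module (the startup.js name and the particular preloaded modules are just illustrative, not an existing feature):

var repl = require('repl')

// start an interactive shell and preload a few "blessed" modules into
// its context, so they are available without require()
var r = repl.start({ prompt: 'node+ > ' })
r.context.split = require('split')
r.context.through = require('through')

Run it with node startup.js and split/through are already there in the shell.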

Cheers! :)

jbenet avatar Jun 19 '14 05:06 jbenet

This is async I/O, the same as in Google's Dart, which I always have problems with. I know there are areas where async I/O is better, especially in server-side applications, but there are also areas where it is cumbersome. Just try to implement a Unix paste or join: it is trivial with Perl/Python, but much harder with async I/O.
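To make the contrast concrete, here is roughly what a two-file Unix paste looks like with synchronous line reading, sketched against k8's File/Bytes API shown below (I am assuming here that the File constructor also accepts a file name and that k8 exposes command-line arguments as a global arguments array):

// paste.js: join two files line by line with a tab (files assumed equal
// length here; real paste pads the shorter one, an easy extension)
var f1 = new File(arguments[0]), f2 = new File(arguments[1]);
var b1 = new Bytes(), b2 = new Bytes();
while (f1.readline(b1) >= 0 && f2.readline(b2) >= 0)
    print(b1.toString() + "\t" + b2.toString());
b1.destroy(); b2.destroy(); f1.close(); f2.close();

With async streams, the same thing requires buffering and manually interleaving two independent 'data' event sources.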

Your line reading also adds noticeable overhead. For a naive implementation of "wc -l" on a ~9GB file without long lines, node.js takes 56s using 54MB RSS, k8 23s/8.1MB and perl 23s/1.6MB.

At the end of the day, is it that hard for node.js and dart to provide a proper readline()? I am not demanding much.

lh3 avatar Jun 19 '14 14:06 lh3

@lh3 just curious, are your benchmarks for wc -l online? the node one (from the speed) sounds like it was implemented incorrectly

max-mapper avatar Jun 19 '14 15:06 max-mapper

@maxogden If you want to take this more seriously, here is a slightly better evaluation. This time, I am implementing wc -L, which gives the maximum line length in a file. The input text file is ~9GB in size. The maximum line length is 70. The file has been cached in RAM. (PS: the purpose is to evaluate the performance of line reading. To compute the maximum line length, the better way is to read by block and then compute the interval length between newlines, but this type of approach is not easily applicable to generic line-based text processing.)

Program   user time   sys time   Peak RSS   Comment
c-1       6.3s        1.6s       952k       fgets()
c-3       9.5s        1.6s       544k       kseq.h+read()
k8-1      25.3s       1.4s       8.1M       built on top of c-2
c-2       27.5s       1.6s       1.1M       kseq.h+gzread()
wc -L     40.3s       1.0s       564k       wc -l: 1.0+1.3s
perl      41.1s       1.5s       1.6M
k8-2      50.6s       1.5s       10.2M      a fair version of k8-1
node-3    58.4s       4.9s       14.2M      @mafintosh's
node-4    69.4s       4.4s       55M        @paulfitz's
node-1    70.8s       5.1s       57M
node-2    203.4s      5.0s       50M        @maxogden's

Implementations:

# perl
my $max = 0;
$max = $max > length($_) - 1? $max : length($_) - 1 while (<>); # explicit ($_) avoids the ambiguous bare "length - 1" parse
print("$max\n");
// node-1, modified from @jbenet's examples
var split = require('split'), max = 0;
process.stdin.pipe(split()).on('data', function(line) {
    max = max > line.length? max : line.length; }); 
process.stdin.on('end', function() { console.log(max); });
// k8-1; this is not quite fair, as b is a byte array while in practice we more often use a string
var f = new File(), b = new Bytes(), max = 0;
while (f.readline(b) >= 0) max = max > b.length? max : b.length;
print(max); b.destroy(); f.close();
// k8-2; similar to k8-1 except that the byte array is converted to a string
var f = new File(), b = new Bytes(), max = 0;
while (f.readline(b) >= 0) {
        var line = b.toString();
        max = max > line.length? max : line.length;
}
print(max); b.destroy(); f.close();
// c-1
#include <stdio.h>
#include <string.h>
int main()
{
    char buf[0x10000];
    int max = 0;
    while (fgets(buf, 0x10000, stdin) != 0) {
        int l = strlen(buf) - 1;
        max = max > l? max : l;
    }   
    printf("%d\n", max);
    return 0;
}
// c-3; c-2 is similar
#include <stdio.h>
#include "kseq.h"
KSTREAM_INIT(int, read, 0x10000)
int main()
{
    kstream_t *ks;
    int max = 0, dret;
    kstring_t str = {0,0,0};
    ks = ks_init(fileno(stdin));
    while (ks_getuntil(ks, 2, &str, &dret) >= 0) max = max > str.l? max : str.l;
    ks_destroy(ks);
    printf("%d\n", max);
    return 0;
}

lh3 avatar Jun 19 '14 16:06 lh3

this may be a fairer entry, written in a more modern node style

var split = require('binary-split')
var through = require('through2')

var max = 0
var counter = through(function(line, enc, next) {
  max = max > line.length ? max : line.length
  next()
}, function(next) {
  console.log(max)
  next()
})

process.stdin.pipe(split()).pipe(counter)

max-mapper avatar Jun 19 '14 17:06 max-mapper

@maxogden I have updated the table with your implementation. It is actually slower than the implementation modified from @jbenet's examples.

lh3 avatar Jun 19 '14 17:06 lh3

LOL

max-mapper avatar Jun 19 '14 17:06 max-mapper

@lh3 try me! try me!

var fs = require('fs')

var buf = new Buffer(65536)
var max = 0
var line = 0
var prevByte = 0

var ondata = function(err, read) {
  if (err) throw err
  if (!read) return console.log(max)

  for (var i = 0; i < read; i++) {
    if (buf[i] === 10) {
      if (prevByte === 13) line--
      if (max < line) max = line
      line = 0
    } else {
      line++
    }
    prevByte = buf[i]
  }

  fs.read(0, buf, 0, buf.length, null, ondata)
}

fs.read(0, buf, 0, buf.length, null, ondata)

mafintosh avatar Jun 19 '14 18:06 mafintosh

@mafintosh added.

lh3 avatar Jun 19 '14 18:06 lh3

ah, the versions using split and binary-split create new objects for each line, which is very expensive. the version by @mafintosh just scans a big buffer and counts the bytes between newlines, a useful approach for this kind of problem

max-mapper avatar Jun 19 '14 19:06 max-mapper

I'm with @lh3 on this one: I wish readLine and other standard stuff were in node's API; I basically gave up on node for scripting because of this. But anyway, to add to the table:

var readline = require('readline');

var rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
  terminal: false
});

var top = 0;
rl.on('line', function(cmd) {
    top = Math.max(top,cmd.length);
}).on('close', function() {
    console.log(top);
});

paulfitz avatar Jun 19 '14 19:06 paulfitz

@paulfitz added. I also gave up on dart for scripting due to the lack of C/perl/python/ruby-like synchronous I/O. I know async I/O is important, but it does not conveniently replace other types of I/O routines in all scenarios.

lh3 avatar Jun 19 '14 20:06 lh3

Just came across this thing. You can make @mafintosh's solution a bit faster by scheduling multiple reads in parallel:

var fs          = require('fs')

var BUFFER_SIZE = (1<<20)

var front       = new Buffer(BUFFER_SIZE)
var back        = new Buffer(BUFFER_SIZE)
var max         = 0
var last        = 0
var prev        = 0

function process(buffer, size) {
  var p = prev
  for(var i=0; i<size; ++i) {
    var b = buffer[i]
    if(b === 10) {
      if(p === 13) {
        last += 1
      }
      max = Math.max(max, i-last)|0
      last = i + 1  // the next line starts right after this newline
    }
    p = b
  }
  last -= size
  prev = p
}

function ondata(err, read) {
  if (err) {
    throw err
  }
  if (!read) {
    console.log(max)
    return
  }
  fs.read(0, back, 0, BUFFER_SIZE, null, ondata)
  process(front, read)
  var tmp = front
  front = back
  back = tmp
}

fs.read(0, front, 0, BUFFER_SIZE, null, ondata)

mikolalysenko avatar Nov 10 '14 22:11 mikolalysenko