Chez Scheme performance and memory issue when dealing with strings and file I/O

I wrote a CSV parser and found that it performs very poorly (in both speed and memory). When opening a 20 kB CSV file, it can exhaust all memory and cause the system to halt. By contrast, the same code performs well in Racket.

Platform: Chez Scheme 64-bit threaded on Windows

Source: csv.ss https://gist.github.com/shiliang-hust/50364a0cb799246f8b259a70f68ad626

I found that if I replace the definitions

    (define csv:field-cons
      (lambda (new-char old-string)
        (string-append (string new-char) old-string)))

    (define csv:field-empty "")

with

    (define csv:field-cons
      (lambda (new-char old-string)
        (cons new-char old-string)))

    (define csv:field-empty '())

the issue is resolved.

I know both implementations are very inefficient, but I wonder why the first one can cause Chez to halt.

May 09 '19 13:05 shih-liang

Take a look at string ports https://cisco.github.io/ChezScheme/csug9.5/io.html#./io:h5; they're a much more efficient way to build strings incrementally. As to your question I can think of two possible issues. The first is that you're constantly adding a growing string to a (very) short one; each time you do that you have to traverse the growing string. Basically, you have a hidden nested loop. By contrast, your second approach just adds a new element to the front of an existing list; that doesn't involve any traversal at all. The second possible issue is garbage collection. Turn on collect-notify https://cisco.github.io/ChezScheme/csug9.5/smgmt.html#./smgmt:s11 and compare the statistics for the two results.
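To make the "hidden nested loop" concrete, here is a minimal sketch that builds the same string three ways. The procedure names (`build-by-append`, `build-by-cons`, `build-by-port`) are made up for illustration and do not come from csv.ss; the sketch only assumes the standard Chez Scheme procedures `open-output-string`, `write-char`, and `get-output-string`.

```scheme
;; Quadratic: every call to string-append allocates a new string
;; and copies everything accumulated so far.
(define (build-by-append n)
  (let loop ([i 0] [acc ""])
    (if (fx= i n)
        acc
        (loop (fx+ i 1) (string-append (string #\x) acc)))))

;; Linear: cons is constant-time; the list is converted once at the end.
(define (build-by-cons n)
  (let loop ([i 0] [acc '()])
    (if (fx= i n)
        (list->string acc)
        (loop (fx+ i 1) (cons #\x acc)))))

;; Linear (amortized): characters go into the port's growing buffer.
(define (build-by-port n)
  (let ([out (open-output-string)])
    (do ([i 0 (fx+ i 1)])
        ((fx= i n) (get-output-string out))
      (write-char #\x out))))
```

To compare the garbage-collection behavior, one option is to wrap a call in `(parameterize ([collect-notify #t]) ...)` or in `(time ...)` and look at the reported collections for each variant.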

May 09 '19 14:05 michaellenaghan

By the way, here's an old (module-based) implementation of a CSV parser (reader) and unparser (writer). The implementation uses syntax to generate efficient parsers/unparsers that support specific features. For example, here are three generated parsers:

    (define-csv-parser csv-parse #\,
      '(parse-platform-newline parse-slash-n parse-slash-newline
        parse-slash-quote parse-slash-slash parse-slash-t
        parse-quoted-field parse-quoted-newline parse-quote-quote))

    (define-csv-parser csv-parse-rfc4180 #\,
      '(parse-platform-newline parse-quoted-field parse-quoted-newline
        parse-quote-quote))

    (define-csv-parser tsv-parse #\tab
      '(parse-platform-newline parse-slash-n parse-slash-slash parse-slash-t))

The first is a lenient CSV parser, the second is a strict CSV parser, the third is a TSV parser.

===

    (module util-csv (define-csv-parser csv-parse csv-parse-rfc4180 tsv-parse
                      define-csv-unparser csv-unparse csv-unparse-rfc4180 tsv-unparse)

    ;; Follows RFC 4180 -- with the following deviations.

    ;; First, white space around quoted and unquoted fields
    ;; is trimmed. (White space inside a quoted field isn't
    ;; trimmed; only the space around it is.) It seems that
    ;; the vast majority of implementations do that and I
    ;; wanted to follow their lead. (If you want meaningful
    ;; white space enclose the field in quotes.)

    ;; Second, according to RFC 4180, "if fields are not enclosed
    ;; with double quotes, then double quotes may not appear
    ;; inside the fields." That seems like a rather arbitrary
    ;; restriction, and I didn't enforce it.

    ;; Third, RFC 4180 requires CRLF line terminators. This
    ;; implementation accepts any combination of CR and LF.
    ;; (Line terminators within a quoted field are always turned
    ;; into a CRLF combination. Otherwise the terminator would
    ;; be platform-dependent.)

    ;; Btw, non-space control characters are always ignored
    ;; when parsing. (They're written when unparsing though.)

    ;; Also, this implementation is meant to be UTF8-safe.

    (define-syntax define-csv-parser
      (syntax-rules ()
        [(define-csv-parser name the-delim the-options)
         (define name
           (let ([delim the-delim]
                 [options the-options])
             ;; don't allow just any old delim since some don't
             ;; make sense. for example, we aren't set up for
             ;; a #\" delim, or a #\return or #\linefeed delim.
             (unless (and (char? delim)
                          (memv delim '(#\, #\; #\tab)))
               (error 'define-csv-parser "unexpected delimiter: ~s" delim))
             (unless (and (list? options)
                          (andmap symbol? options))
               (error 'define-csv-parser "unexpected options: ~s" options))
             (let ([parse-platform-newline? (memq 'parse-platform-newline options)]
                   [parse-slash-n? (memq 'parse-slash-n options)]
                   [parse-slash-newline? (memq 'parse-slash-newline options)]
                   [parse-slash-quote? (memq 'parse-slash-quote options)]
                   [parse-slash-slash? (memq 'parse-slash-slash options)]
                   [parse-slash-t? (memq 'parse-slash-t options)]
                   [parse-quoted-field? (memq 'parse-quoted-field options)]
                   [parse-quoted-newline? (memq 'parse-quoted-newline options)]
                   [parse-quote-quote? (memq 'parse-quote-quote options)]
                   [parse-whitespace? (memq 'parse-whitespace options)])
               (let ([parse-slash? (or parse-slash-n?
                                       parse-slash-newline?
                                       parse-slash-quote?
                                       parse-slash-slash?
                                       parse-slash-t?)])
                 (when (and (char=? delim #\tab) parse-quoted-field?)
                   (error 'define-csv-parser "unexpected option for tab-separated parser: ~s" 'parse-quoted-field))
                 (when (and (char=? delim #\tab) parse-quoted-newline?)
                   (error 'define-csv-parser "unexpected option for tab-separated parser: ~s" 'parse-quoted-newline))
                 (when (and (char=? delim #\tab) parse-quote-quote?)
                   (error 'define-csv-parser "unexpected option for tab-separated parser: ~s" 'parse-quote-quote))
                 ;; whenever we use slash to unescape a char we have to
                 ;; make sure we unescape slashes...
                 (when (and parse-slash-n? (not parse-slash-slash?))
                   (error 'define-csv-parser "missing option for ~s: ~s" 'parse-slash-n 'parse-slash-slash))
                 (when (and parse-slash-quote? (not parse-slash-slash?))
                   (error 'define-csv-parser "missing option for ~s: ~s" 'parse-slash-quote 'parse-slash-slash))
                 (when (and parse-slash-t? (not parse-slash-slash?))
                   (error 'define-csv-parser "missing option for ~s: ~s" 'parse-slash-t 'parse-slash-slash))
                 (lambda (port)
                   (import util-read)
                   (let ([field (open-output-string)])
                     (define %%newline
                       (lambda (port)
                         ;; either write a platform-specific
                         ;; newline or write a standard-specific
                         ;; newline...
                         (cond
                           [parse-platform-newline?
                            (meta-cond
                              [(memq (machine-type) '(i3nt ti3nt))
                               (write-char #\return port)
                               (write-char #\linefeed port)]
                              [else
                               (write-char #\linefeed port)])]
                           [else
                            (write-char #\return port)
                            (write-char #\linefeed port)])))
                     (define %%parse-whitespace
                       (lambda ()
                         (when parse-whitespace?
                           (letrec ([start (lambda () (whitespace))]
                                    [whitespace
                                     (lambda ()
                                       (let ([c (peek-char port)])
                                         (cond
                                           [(eof-object? c) (finish)]
                                           [(char<=? c #\space) (read-char port) (whitespace)]
                                           [(char=? c #\#) (read-char port) (comment)]
                                           [(char=? c #\;) (read-char port) (comment)]
                                           [else (finish)])))]
                                    [comment
                                     (lambda ()
                                       (let ([c (peek-char port)])
                                         (cond
                                           [(eof-object? c) (finish)]
                                           [(char=? c #\return)
                                            (read-char port)
                                            (let ([c (peek-char port)])
                                              (when (and (char? c) (char=? c #\linefeed))
                                                (read-char port)))
                                            (whitespace)]
                                           [(char=? c #\linefeed)
                                            (read-char port)
                                            (let ([c (peek-char port)])
                                              (when (and (char? c) (char=? c #\return))
                                                (read-char port)))
                                            (whitespace)]
                                           [else (read-char port) (comment)])))]
                                    [finish (lambda () (void))])
                             (start)))))
                     (define %%parse-trim
                       (lambda (s)
                         (let* ([old-i 0]
                                [old-j (fx1- (string-length s))]
                                [new-i (do ([i old-i (fx1+ i)])
                                           ((or (fx> i old-j)
                                                (not (char=? (string-ref s i) #\space)))
                                            i))]
                                [new-j (do ([j old-j (fx1- j)])
                                           ((or (fx< j new-i)
                                                (not (char=? (string-ref s j) #\space)))
                                            j))])
                           (if (and (fx= new-i old-i) (fx= new-j old-j))
                               s
                               (if (fx<= new-i new-j)
                                   (substring s new-i (fx1+ new-j))
                                   "")))))
                     (define %%parse-quoted-field
                       (lambda ()
                         (let ([c (read-char port)])
                           (let loop ([c (read-char port)])
                             (if (char? c)
                                 (cond
                                   [(char=? c #\")
                                    ;; one quote ends the field...
                                    (let ([c (peek-char port)])
                                      (when (char? c)
                                        (when (and (char=? c #\") parse-quote-quote?)
                                          ;; ...but two quotes potentially
                                          ;; represent an escaped quote.
                                          (write-char (read-char port) field)
                                          (loop (read-char port)))))]
                                   [(char=? c #\\)
                                    (cond
                                      [parse-slash?
                                       (let ([c (peek-char port)])
                                         (cond
                                           [(eof-object? c)
                                            (write-char #\\ field)
                                            (loop (read-char port))]
                                           [(and (char=? c #\n) parse-slash-n?)
                                            (read-char port)
                                            (%%newline field)
                                            (loop (read-char port))]
                                           [(and (char=? c #\return) parse-slash-newline?)
                                            ;; replace CR[LF] with CRLF
                                            (read-char port)
                                            (let ([c (peek-char port)])
                                              (when (and (char? c) (char=? c #\linefeed))
                                                (read-char port)))
                                            (%%newline field)
                                            (loop (read-char port))]
                                           [(and (char=? c #\linefeed) parse-slash-newline?)
                                            ;; replace LF[CR] with CRLF
                                            (read-char port)
                                            (let ([c (peek-char port)])
                                              (when (and (char? c) (char=? c #\return))
                                                (read-char port)))
                                            (%%newline field)
                                            (loop (read-char port))]
                                           [(and (char=? c #\") parse-slash-quote?)
                                            (read-char port)
                                            (write-char #\" field)
                                            (loop (read-char port))]
                                           [(and (char=? c #\\) parse-slash-slash?)
                                            (read-char port)
                                            (write-char #\\ field)
                                            (loop (read-char port))]
                                           [(and (char=? c #\t) parse-slash-t?)
                                            (read-char port)
                                            (write-char #\tab field)
                                            (loop (read-char port))]
                                           [else
                                            (write-char #\\ field)
                                            (loop (read-char port))]))]
                                      [else
                                       (write-char c field)
                                       (loop (read-char port))])]
                                   [(char=? c #\tab)
                                    (write-char c field)
                                    (loop (read-char port))]
                                   [(char=? c #\return)
                                    (cond
                                      [parse-quoted-newline?
                                       ;; replace CR[LF] with CRLF
                                       (let ([c (peek-char port)])
                                         (when (and (char? c) (char=? c #\linefeed))
                                           (read-char port)))
                                       (%%newline field)
                                       (loop (read-char port))]
                                      [else
                                       (warning 'name "missing closing quote")])]
                                   [(char=? c #\linefeed)
                                    (cond
                                      [parse-quoted-newline?
                                       ;; replace LF[CR] with CRLF
                                       (let ([c (peek-char port)])
                                         (when (and (char? c) (char=? c #\return))
                                           (read-char port)))
                                       (%%newline field)
                                       (loop (read-char port))]
                                      [else
                                       (warning 'name "missing closing quote")])]
                                   [else
                                    (when (and (char>=? c #\space)
                                               (not (char=? c #\rubout)))
                                      (write-char c field))
                                    (loop (read-char port))])
                                 (warning 'name "missing closing quote"))))
                         (get-output-string field)))
                     (define %%parse-unquoted-field
                       (lambda ()
                         (let loop ([c (read-char port)])
                           (when (char? c)
                             (cond
                               [(char=? c delim)
                                (unread-char c port)]
                               [(char=? c #\\)
                                (cond
                                  [parse-slash?
                                   (let ([c (peek-char port)])
                                     (cond
                                       [(eof-object? c)
                                        (write-char #\\ field)
                                        (loop (read-char port))]
                                       [(and (char=? c #\n) parse-slash-n?)
                                        (read-char port)
                                        (%%newline field)
                                        (loop (read-char port))]
                                       [(and (char=? c #\return) parse-slash-newline?)
                                        ;; replace CR[LF] with CRLF
                                        (read-char port)
                                        (let ([c (peek-char port)])
                                          (when (and (char? c) (char=? c #\linefeed))
                                            (read-char port)))
                                        (%%newline field)
                                        (loop (read-char port))]
                                       [(and (char=? c #\linefeed) parse-slash-newline?)
                                        ;; replace LF[CR] with CRLF
                                        (read-char port)
                                        (let ([c (peek-char port)])
                                          (when (and (char? c) (char=? c #\return))
                                            (read-char port)))
                                        (%%newline field)
                                        (loop (read-char port))]
                                       [(and (char=? c #\") parse-slash-quote?)
                                        (read-char port)
                                        (write-char #\" field)
                                        (loop (read-char port))]
                                       [(and (char=? c #\\) parse-slash-slash?)
                                        (read-char port)
                                        (write-char #\\ field)
                                        (loop (read-char port))]
                                       [(and (char=? c #\t) parse-slash-t?)
                                        (read-char port)
                                        (write-char #\tab field)
                                        (loop (read-char port))]
                                       [else
                                        (write-char #\\ field)
                                        (loop (read-char port))]))]
                                  [else
                                   (write-char c field)
                                   (loop (read-char port))])]
                               [(char=? c #\tab)
                                (write-char c field)
                                (loop (read-char port))]
                               [(char=? c #\return)
                                (unread-char c port)]
                               [(char=? c #\linefeed)
                                (unread-char c port)]
                               [else
                                (when (and (char>=? c #\space)
                                           (not (char=? c #\rubout)))
                                  (write-char c field))
                                (loop (read-char port))])))
                         (cond
                           [(char=? delim #\,)
                            (%%parse-trim (get-output-string field))]
                           [(char=? delim #\;)
                            (%%parse-trim (get-output-string field))]
                           [(char=? delim #\tab)
                            ;; since tab-separated values are never quoted
                            ;; we don't want to trim them; otherwise there'd
                            ;; never be a way to get leading and trailing
                            ;; spaces into a field.
                            (get-output-string field)]
                           [else
                            (error 'name "unexpected delimiter: ~s" delim)])))
                     (define %%parse-field
                       (lambda ()
                         (if (and parse-quoted-field?
                                  (let ([c (peek-char-skipping-xspace port)])
                                    (and (char? c)
                                         (char=? c #\"))))
                             (%%parse-quoted-field)
                             (%%parse-unquoted-field))))
                     (define %%parse-field-delim
                       (lambda ()
                         (let ([c (read-char-skipping-xspace port)])
                           (and (char? c)
                                (let ([delim? (char=? c delim)])
                                  (unless delim?
                                    (unread-char c port))
                                  (and delim? c))))))
                     (define %%parse-fields
                       (lambda ()
                         (%%parse-whitespace)
                         ;; because the final line terminator is optional
                         ;; we can't tell the difference between a single
                         ;; empty field and no fields at all. that means
                         ;; we always parse and return at least one field.
                         (let loop ([fields (cons (%%parse-field) '())])
                           (if (%%parse-field-delim)
                               (loop (cons (%%parse-field) fields))
                               (reverse! fields)))))
                     (and (char? (peek-char port))
                          (let ([fields (%%parse-fields)])
                            (skip-char-until-sol port)
                            fields))))))))]))

    (define-csv-parser csv-parse #\,
      '(parse-platform-newline parse-slash-n parse-slash-newline
        parse-slash-quote parse-slash-slash parse-slash-t
        parse-quoted-field parse-quoted-newline parse-quote-quote))

    (define-csv-parser csv-parse-rfc4180 #\,
      '(parse-platform-newline parse-quoted-field parse-quoted-newline
        parse-quote-quote))

    (define-csv-parser tsv-parse #\tab
      '(parse-platform-newline parse-slash-n parse-slash-slash parse-slash-t))

    (define-syntax define-csv-unparser
      (syntax-rules ()
        [(_ name the-delim the-options)
         (define name
           (let ([delim the-delim]
                 [options the-options])
             (unless (and (char? delim)
                          (memv delim '(#\, #\; #\tab)))
               (error 'define-csv-unparser "unexpected delimiter: ~s" delim))
             (unless (and (list? options)
                          (andmap symbol? options))
               (error 'define-csv-unparser "unexpected options: ~s" options))
             (let ([unparse-platform-newline? (memq 'unparse-platform-newline options)]
                   [unparse-slash-n? (memq 'unparse-slash-n options)]
                   [unparse-slash-quote? (memq 'unparse-slash-quote options)]
                   [unparse-slash-slash? (memq 'unparse-slash-slash options)]
                   [unparse-slash-t? (memq 'unparse-slash-t options)]
                   [unparse-quote-field? (memq 'unparse-quote-field options)]
                   [unparse-quote-field-always? (memq 'unparse-quote-field-always options)]
                   [unparse-quote-newline? (memq 'unparse-quote-newline options)]
                   [unparse-quote-quote? (memq 'unparse-quote-quote options)])
               (when (and (char=? delim #\tab) unparse-quote-field?)
                 (error 'define-csv-unparser "unexpected option for tab-separated unparser: ~s" 'unparse-quote-field))
               (when (and (char=? delim #\tab) unparse-quote-field-always?)
                 (error 'define-csv-unparser "unexpected option for tab-separated unparser: ~s" 'unparse-quote-field-always))
               (when (and (char=? delim #\tab) unparse-quote-newline?)
                 (error 'define-csv-unparser "unexpected option for tab-separated unparser: ~s" 'unparse-quote-newline))
               (when (and (char=? delim #\tab) unparse-quote-quote?)
                 (error 'define-csv-unparser "unexpected option for tab-separated unparser: ~s" 'unparse-quote-quote))
               (when (and unparse-quote-quote? unparse-slash-quote?)
                 (error 'define-csv-unparser "conflicting options: ~s and ~s" 'unparse-quote-quote 'unparse-slash-quote))
               (when (and unparse-quote-newline? unparse-slash-n?)
                 (error 'define-csv-unparser "conflicting options: ~s and ~s" 'unparse-quote-newline 'unparse-slash-n))
               (when (and unparse-quote-field? unparse-quote-field-always?)
                 (error 'define-csv-unparser "conflicting options: ~s and ~s" 'unparse-quote-field 'unparse-quote-field-always))
               ;; whenever we use slash to escape a char we have to
               ;; make sure we escape slashes...
               (when (and unparse-slash-n? (not unparse-slash-slash?))
                 (error 'define-csv-unparser "missing option for ~s: ~s" 'unparse-slash-n 'unparse-slash-slash))
               (when (and unparse-slash-quote? (not unparse-slash-slash?))
                 (error 'define-csv-unparser "missing option for ~s: ~s" 'unparse-slash-quote 'unparse-slash-slash))
               (when (and unparse-slash-t? (not unparse-slash-slash?))
                 (error 'define-csv-unparser "missing option for ~s: ~s" 'unparse-slash-t 'unparse-slash-slash))
               (lambda (fields port)
                 (define %%string
                   (lambda (field)
                     (cond
                       ;; we don't handle booleans because the caller
                       ;; should decide how booleans are to be represented
                       ;; as strings.
                       [(symbol? field) (symbol->string field)]
                       [(string? field) field]
                       [(fixnum? field) (number->string field)]
                       [(flonum? field) (number->string field)]
                       [(char? field) (string field)]
                       [else (error 'name "unexpected value: ~s" field)])))
                 (define %%newline
                   (lambda (port)
                     ;; either write a platform-specific
                     ;; newline or write a standard-specific
                     ;; newline...
                     (cond
                       [unparse-platform-newline?
                        (meta-cond
                          [(memq (machine-type) '(i3nt ti3nt))
                           (write-char #\return port)
                           (write-char #\linefeed port)]
                          [else
                           (write-char #\linefeed port)])]
                       [else
                        (write-char #\return port)
                        (write-char #\linefeed port)])))
                 (define %%unparse-quote-field-always?
                   (lambda (field)
                     (let ([field-len (string-length field)])
                       (fx> field-len 0))))
                 (define %%unparse-quote-field?
                   (lambda (field)
                     (let ([field-len (string-length field)])
                       (and (fx> field-len 0)
                            (or
                              ;; field begins with whitespace
                              (let ([c (string-ref field 0)])
                                (or (char=? c #\space)
                                    (and (not unparse-slash-t?)
                                         (char=? c #\tab))))
                              ;; field ends with whitespace
                              (let ([c (string-ref field (fx1- field-len))])
                                (or (char=? c #\space)
                                    (and (not unparse-slash-t?)
                                         (char=? c #\tab))))
                              ;; field contains special char
                              (let loop ([i 0])
                                (and (fx< i field-len)
                                     (let ([c (string-ref field i)])
                                       (or (char=? c delim)
                                           ;; technically an embedded quote
                                           ;; shouldn't force quoting unless
                                           ;; it's the first char--but since
                                           ;; that seems a little obscure we
                                           ;; always quote fields with quotes
                                           ;; (if we're quoting them at all).
                                           (and unparse-quote-quote?
                                                (char=? c #\"))
                                           (and unparse-quote-newline?
                                                (or (char=? c #\return)
                                                    (char=? c #\linefeed)))
                                           (loop (fx1+ i))))))))))) 
                 (define %%unparse-quoted
                   (lambda (field)
                     (write-char #\" port)
                     (let ([field-len (string-length field)])
                       (let loop ([i 0])
                         (define %%peek-char
                           (lambda ()
                             (let ([i (fx1+ i)])
                               (if (fx< i field-len)
                                   (string-ref field i)
                                   (eof-object)))))
                         (unless (fx= i field-len)
                           (let ([c (string-ref field i)])
                             (cond
                               [(char=? c #\\)
                                (cond
                                  ;; \ => \\
                                  [unparse-slash-slash?
                                   (write-char #\\ port)
                                   (write-char #\\ port)
                                   (loop (fx1+ i))]
                                  ;; \ => \
                                  [else
                                   (write-char c port)
                                   (loop (fx1+ i))])]
                               [(char=? c #\")
                                (cond
                                  ;; " => ""
                                  [unparse-quote-quote?
                                   (write-char #\" port)
                                   (write-char #\" port)
                                   (loop (fx1+ i))]
                                  ;; " => \"
                                  [unparse-slash-quote?
                                   (write-char #\\ port)
                                   (write-char #\" port)
                                   (loop (fx1+ i))]
                                  ;; " => error
                                  [else
                                   (error 'name "value contains quote: ~s" field)])]
                               [(char=? c #\tab)
                                (cond
                                  ;; #\tab => \t
                                  [unparse-slash-t?
                                   (write-char #\\ port)
                                   (write-char #\t port)
                                   (loop (fx1+ i))]
                                  [else
                                   (cond
                                     ;; #\tab => error
                                     [(char=? c delim)
                                      (error 'name "value contains delimeter: ~s" field)]
                                     ;; #\tab => #\tab
                                     [else
                                      (write-char c port)
                                      (loop (fx1+ i))])])]
                               [(char=? c #\return)
                                (cond
                                  ;; #\return => \n
                                  [unparse-slash-n?
                                   (write-char #\\ port)
                                   (write-char #\n port)
                                   (let ([c (%%peek-char)])
                                     (if (and (char? c) (char=? c #\linefeed))
                                         (loop (fx+ i 2))
                                         (loop (fx+ i 1))))]
                                  ;; #\return => #\return #\linefeed
                                  [unparse-quote-newline?
                                   (%%newline port)
                                   (let ([c (%%peek-char)])
                                     (if (and (char? c) (char=? c #\linefeed))
                                         (loop (fx+ i 2))
                                         (loop (fx+ i 1))))]
                                  ;; #\return => error
                                  [else
                                   (error 'name "value contains newlines: ~s" field)])]
                               [(char=? c #\linefeed)
                                (cond
                                  ;; #\linefeed => \n
                                  [unparse-slash-n?
                                   (write-char #\\ port)
                                   (write-char #\n port)
                                   (let ([c (%%peek-char)])
                                     (if (and (char? c) (char=? c #\return))
                                         (loop (fx+ i 2))
                                         (loop (fx+ i 1))))]
                                  ;; #\linefeed => #\return #\linefeed
                                  [unparse-quote-newline?
                                   (%%newline port)
                                   (let ([c (%%peek-char)])
                                     (if (and (char? c) (char=? c #\return))
                                         (loop (fx+ i 2))
                                         (loop (fx+ i 1))))]
                                  ;; #\linefeed => error
                                  [else
                                   (error 'name "value contains newlines: ~s" field)])]
                               ;; char => char
                               [else
                                (write-char c port)
                                (loop (fx1+ i))])))))
                     (write-char #\" port)))
                 (define %%unparse-unquoted
                   (lambda (field)
                     (let ([field-len (string-length field)])
                       (when unparse-quote-field?
                         (unless unparse-slash-quote?
                           (when (and (fx> field-len 0)
                                      (char=? (string-ref field 0) #\"))
                             (error 'name "unquoted value begins with quote: ~a" field))))
                       (let loop ([i 0])
                         (define %%peek-char
                           (lambda ()
                             (let ([i (fx1+ i)])
                               (if (fx< i field-len)
                                   (string-ref field i)
                                   (eof-object)))))
                         (unless (fx= i field-len)
                           (let ([c (string-ref field i)])
                             (cond
                               [(char=? c #\\)
                                (cond
                                  ;; \ => \\
                                  [unparse-slash-slash?
                                   (write-char #\\ port)
                                   (write-char #\\ port)
                                   (loop (fx1+ i))]
                                  ;; \ => \
                                  [else
                                   (write-char c port)
                                   (loop (fx1+ i))])]
                               [(char=? c #\")
                                (cond
                                  ;; " => \"
                                  [unparse-slash-quote?
                                   (write-char #\\ port)
                                   (write-char #\" port)
                                   (loop (fx1+ i))]
                                  ;; " => "
                                  [else
                                   (write-char c port)
                                   (loop (fx1+ i))])]
                               [(char=? c #\tab)
                                (cond
                                  ;; #\tab => \t
                                  [unparse-slash-t?
                                   (write-char #\\ port)
                                   (write-char #\t port)
                                   (loop (fx1+ i))]
                                  [else
                                   (cond
                                     ;; #\tab => error
                                     [(char=? c delim)
                                      (error 'name "value contains delimeter: ~s" field)]
                                     ;; #\tab => #\tab
                                     [else
                                      (write-char c port)
                                      (loop (fx1+ i))])])]
                               [(char=? c #\return)
                                (cond
                                  ;; #\return => \n
                                  [unparse-slash-n?
                                   (write-char #\\ port)
                                   (write-char #\n port)
                                   (let ([c (%%peek-char)])
                                     (if (and (char? c) (char=? c #\linefeed))
                                         (loop (fx+ i 2))
                                         (loop (fx+ i 1))))]
                                  ;; #\return => error
                                  [else
                                   (error 'name "value contains newlines: ~s" field)])]
                               [(char=? c #\linefeed)
                                (cond
                                  ;; #\linefeed => \n
                                  [unparse-slash-n?
                                   (write-char #\\ port)
                                   (write-char #\n port)
                                   (let ([c (%%peek-char)])
                                     (if (and (char? c) (char=? c #\return))
                                         (loop (fx+ i 2))
                                         (loop (fx+ i 1))))]
                                  ;; #\linefeed => error
                                  [else
                                   (error 'name "value contains newlines: ~s" field)])]
                               ;; #\delim => error
                               [(char=? c delim)
                                (error 'name "value contains delimeter: ~s" field)]
                               ;; char => char
                               [else
                                (write-char c port)
                                (loop (fx1+ i))])))))))
                 (define %%unparse-field
                   (lambda (field)
                     (if (or (and unparse-quote-field-always?
                                  (%%unparse-quote-field-always? field))
                             (and unparse-quote-field?
                                  (%%unparse-quote-field? field)))
                         (%%unparse-quoted field)
                         (%%unparse-unquoted field))))
                 (define %%unparse-fields
                   (lambda ()
                     (when (pair? fields)
                       (%%unparse-field (%%string (car fields)))
                       (let loop ([fields (cdr fields)])
                         (when (pair? fields)
                           (write-char delim port)
                           (%%unparse-field (%%string (car fields)))
                           (loop (cdr fields)))))
                     (%%newline port)))
                 (if (port? port)
                     (%%unparse-fields)
                     (error 'name "expected a port: ~s" port))))))]))

    (define-csv-unparser csv-unparse #\,
      '(unparse-platform-newline unparse-slash-n unparse-slash-slash
        unparse-slash-t unparse-quote-field unparse-quote-quote))

    (define-csv-unparser csv-unparse-rfc4180 #\,
      '(unparse-platform-newline unparse-quote-field unparse-quote-newline
        unparse-quote-quote))

    (define-csv-unparser tsv-unparse #\tab
      '(unparse-platform-newline unparse-slash-n unparse-slash-slash unparse-slash-t)))

May 09 '19 15:05 michaellenaghan

Thank you for your kind reply. BTW, is there anything like a string port for bytevectors? I need to convert a bytevector's endianness but don't know whether there is anything suitable for that.

May 14 '19 13:05 shih-liang

Yes, there is open-bytevector-output-port.
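A minimal sketch of how that might be used for an endianness conversion, assuming the data is a sequence of 32-bit words and that the bytevector's length is a multiple of 4. The procedure name `swap-u32-endianness` is hypothetical; `open-bytevector-output-port` returns the port together with an extraction procedure.

```scheme
;; Reverse the byte order of each 4-byte word in bv.
;; Assumes (bytevector-length bv) is a multiple of 4.
(define (swap-u32-endianness bv)
  (let-values ([(out extract) (open-bytevector-output-port)])
    (let ([n (bytevector-length bv)])
      (do ([i 0 (fx+ i 4)])
          ((fx>= i n) (extract))
        ;; write the four bytes of this word from last to first
        (do ([j 3 (fx- j 1)])
            ((fx< j 0))
          (put-u8 out (bytevector-u8-ref bv (fx+ i j))))))))
```

For fixed-width values, reading with `bytevector-u32-ref` in one endianness and writing with `bytevector-u32-set!` in the other is another option that avoids the port entirely.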

May 14 '19 13:05 burgerrg

Thank you very much.

May 14 '19 16:05 shih-liang

I've tried string ports, but I don't get any performance gain in this case. In fact, using string ports makes memory usage extremely high. Is there something suitable for building many small strings?

May 15 '19 05:05 shih-liang

String output ports are probably your best option here, but you may need to restructure your code a bit to see the benefits. open-string-output-port performs increasingly better than string-append the more you use the port. I see in your csv code there is a function csv:field-cons which is basically just string-append. If you tried to implement csv:field-cons in terms of open-string-output-port I would expect it to perform worse than string-append because there is extra work done when setting up the output port. I also see that you use csv:field-cons from two locations: csv:get-escaped and csv:get-non-escaped. In both places you are looping and appending only two strings together. If you move the string ports into those functions so that the port lives longer you should see better performance and lower memory usage.
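As a rough sketch of that restructuring, the loop below opens one port per field and writes every character of the field into it, instead of opening a port (or calling string-append) per character. `read-field` is a hypothetical stand-in, not the actual csv:get-non-escaped from csv.ss, and it ignores quoting and escapes.

```scheme
;; Read characters from the input port `in` up to (but not including)
;; the delimiter, a newline, or end of file, and return them as a string.
(define (read-field in delim)
  (let-values ([(out extract) (open-string-output-port)])
    (let loop ()
      (let ([c (peek-char in)])
        (cond
          [(or (eof-object? c) (char=? c delim) (char=? c #\newline))
           (extract)]
          [else
           (write-char (read-char in) out)
           (loop)])))))
```

For example, `(read-field (open-input-string "abc,def") #\,)` consumes `abc` and leaves the port positioned at the comma.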

May 15 '19 12:05 gwatt

Picking up on Graham's point, if you look at the code I sent you'll see it creates a single string port for the entire import. If you look at the docs for get-output-string https://cisco.github.io/ChezScheme/csug9.5/io.html#./io:s43 you'll see this:

"As a side effect, get-output-string resets string-output-port so that subsequent output to string-output-port is placed into a fresh string."

As you use the string port it will grow to the required size. At some point it will be "big enough" for the rest of the import and not need to grow anymore. That's what Graham was talking about.
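A small sketch of that pattern, assuming a simplified splitter with no quoting or escapes: a single string port is created once, and `get-output-string` both returns the current field and resets the port for the next one. The name `split-line` is made up for this example.

```scheme
;; Split `line` on `delim`, reusing one string port for every field.
(define (split-line line delim)
  (let ([in (open-input-string line)]
        [field (open-output-string)])   ; one port for all fields
    (let loop ([fields '()])
      (let ([c (read-char in)])
        (cond
          [(eof-object? c)
           ;; the last field ends at end of input
           (reverse (cons (get-output-string field) fields))]
          [(char=? c delim)
           ;; get-output-string returns the field and resets the port
           (loop (cons (get-output-string field) fields))]
          [else
           (write-char c field)
           (loop fields)])))))
```

Here `(split-line "a,bc,,d" #\,)` returns `("a" "bc" "" "d")`.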

Maybe you can post your updated code?

If you're looking for the best possible performance there are at least two more things to know.

First, the way you run your code matters. For example, take a look at the section on Optimization https://cisco.github.io/ChezScheme/csug9.5/use.html#./use:h6:

"To get the most out of the Chez Scheme compiler, it is necessary to give it a little bit of help. The most important assistance is to avoid the use of top-level (interaction-environment) bindings. Top-level bindings are convenient and appropriate during program development, since they simplify testing, redefinition, and tracing (Section 3.1) of individual procedures and syntactic forms. This convenience comes at a sizable price, however.

"The compiler can propagate copies (of one variable to another or of a constant to a variable) and inline procedures bound to local, unassigned variables within a single top-level expression. For the procedures it does not inline, it can avoid constructing and passing unneeded closures, bypass argument-count checks, branch to the proper entry point in a case-lambda, and build rest arguments (more efficiently) on the caller side, where the length of the rest list is known at compile time. It can also discard the definitions of unreferenced variables, so there's no penalty for including a large library of routines, only a few of which are actually used.

"It cannot do any of this with top-level variable bindings, since the top-level bindings can change at any time and new references to those bindings can be introduced at any time.

"Fortunately, it is easy to restructure a program to avoid top-level bindings..."
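One way to follow that advice, shown here only as a sketch with made-up names (`count-fields`, `delim?`, `count`), is to keep helper procedures local to a single top-level expression, for example inside a `let ()` body, and expose just the entry point. Wrapping the code in a `module` or an R6RS `library` is another option.

```scheme
;; Only count-fields is a top-level binding; delim? and count are
;; local, so the compiler is free to inline or specialize them.
(define count-fields
  (let ()
    (define (delim? c) (char=? c #\,))
    (define (count s)
      (let ([len (string-length s)])
        (let loop ([i 0] [n 1])
          (cond
            [(fx= i len) n]
            [(delim? (string-ref s i)) (loop (fx+ i 1) (fx+ n 1))]
            [else (loop (fx+ i 1) n)]))))
    count))
```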

Second, make sure you're taking advantage of the special support Chez has for optimizing I/O; see this section https://cisco.github.io/ChezScheme/csug9.5/io.html#./io:h1 which says:

"Although the fields shown and discussed above are logically present in a port, actual implementation details may differ. The current Chez Scheme implementation uses a different representation that allows read-char, write-char, and similar operations to be open-coded with minimal overhead."

Here's what that used to mean. When using buffered ports Chez would read output-size bytes into its output-buffer. At optimization level 3 (maybe below that level, but certainly at that level) a call to peek-char and read-char would turn into a few assembly instructions that copied bytes directly from the buffer without a function call, only making an actual function call when the buffer was empty.

Rather than compiling the entire program at optimization level 3, the original code I sent above actually made selective use of optimization level 3, like this:

(#3%peek-char port)

(#3%read-char port)

I removed the #3% prefixes before pasting the code because they're something you should use/do only after you're confident that your code is correct. You can read more about those prefixes here https://cisco.github.io/ChezScheme/csug9.5/system.html#./system:s101:

"If a 2 or 3 appears in the form or between the # and % in the abbreviated form, the compiler treats an application of the primitive as if it were compiled at the corresponding optimize level (see the optimize-level parameter). If no number appears in the form, an application of the primitive is treated as an optimize-level 3 application if the current optimize level is 3; otherwise, it is treated as an optimize-level 2 application."
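As a small illustration of the selective prefix, the hypothetical procedure below applies `#3%read-char` in a tight loop while the rest of the code stays at the default optimize level. At level 3 the argument is not checked, so this is only appropriate when `port` is known to be a textual input port.

```scheme
;; Count the characters remaining in a textual input port.
;; #3%read-char is compiled as an optimize-level 3 application,
;; which skips the usual argument checks.
(define (count-chars port)
  (let loop ([n 0])
    (if (eof-object? (#3%read-char port))
        n
        (loop (fx+ n 1)))))
```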

I'm pretty sure all of that is still at least roughly true, but I don't know if it's still completely true. If someone knows otherwise, maybe they can chime in?

May 15 '19 14:05 michaellenaghan

"When using buffered ports Chez would read output-size bytes into its output-buffer."

Oops: make that "input-size" and "input-buffer".

On Wed, 15 May 2019 at 10:38, Michael Lenaghan [email protected] wrote:

Picking up on Graham's point, if you look at the code I sent you'll see it creates a single string port for the entire import. If you look at the docs for get-output-string https://cisco.github.io/ChezScheme/csug9.5/io.html#./io:s43 you'll see this:

As a side effect, get-output-string resets string-output-port so that subsequent output to string-output-port is placed into a fresh string.

As you use the string port it will grow to the required size. At some point it will be "big enough" for the rest of the import and not need to grow anymore. That's what Graham was talking about.
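A minimal sketch of that pattern, assuming a simplified comma-only format with no quoting: one string port is created up front and get-output-string is called at each field boundary, which both returns the field and resets the port. The name split-line is hypothetical and is not the code referred to above.

```scheme
;; split-line is a hypothetical example, not the importer from this thread.
;; A single string output port is reused for every field; each call to
;; get-output-string returns the accumulated text and resets the port.
(define (split-line line)
  (let ([in (open-input-string line)]
        [out (open-output-string)])
    (let loop ([fields '()])
      (let ([c (read-char in)])
        (cond
          [(eof-object? c)
           (reverse (cons (get-output-string out) fields))]
          [(char=? c #\,)
           (loop (cons (get-output-string out) fields))]
          [else
           (write-char c out)
           (loop fields)])))))
```

With this sketch, (split-line "a,bc,") yields the three fields "a", "bc", and the empty string.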

Maybe you can post your updated code?

If you're looking for the best possible performance there are at least two more things to know.

First, the way you run your code matters. For example, take a look at the section on Optimization https://cisco.github.io/ChezScheme/csug9.5/use.html#./use:h6:

To get the most out of the Chez Scheme compiler, it is necessary to give it a little bit of help. The most important assistance is to avoid the use of top-level (interaction-environment) bindings. Top-level bindings are convenient and appropriate during program development, since they simplify testing, redefinition, and tracing (Section 3.1) of individual procedures and syntactic forms. This convenience comes at a sizable price, however.

The compiler can propagate copies (of one variable to another or of a constant to a variable) and inline procedures bound to local, unassigned variables within a single top-level expression. For the procedures it does not inline, it can avoid constructing and passing unneeded closures, bypass argument-count checks, branch to the proper entry point in a case-lambda, and build rest arguments (more efficiently) on the caller side, where the length of the rest list is known at compile time. It can also discard the definitions of unreferenced variables, so there's no penalty for including a large library of routines, only a few of which are actually used.

It cannot do any of this with top-level variable bindings, since the top-level bindings can change at any time and new references to those bindings can be introduced at any time.

Fortunately, it is easy to restructure a program to avoid top-level bindings...
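One way to follow that advice, sketched here under the assumption that wrapping helpers in a single let body is acceptable for your program (a module or library form is another option): only one name is exported to the top level, and the helpers stay local and unassigned. The names count-fields and delimiter? are hypothetical.

```scheme
;; Only count-fields is bound at top level. delimiter? and the inner
;; count-fields are local, unassigned bindings inside one top-level
;; expression, which is the situation the quoted passage describes.
(define count-fields
  (let ()
    (define (delimiter? c) (char=? c #\,))
    (define (count-fields line)
      (let loop ([i 0] [n 1])
        (cond
          [(= i (string-length line)) n]
          [(delimiter? (string-ref line i)) (loop (+ i 1) (+ n 1))]
          [else (loop (+ i 1) n)])))
    count-fields))
```

Calling (count-fields "a,b,c") counts three comma-separated fields. Whether the compiler actually inlines delimiter? here depends on the optimize level and compiler version, so treat this as a structural sketch rather than a guaranteed speedup.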

Second, make sure you're taking advantage of the special support Chez has for optimizing I/O; see this section https://cisco.github.io/ChezScheme/csug9.5/io.html#./io:h1 which says:

Although the fields shown and discussed above are logically present in a port, actual implementation details may differ. The current Chez Scheme implementation uses a different representation that allows read-char, write-char, and similar operations to be open-coded with minimal overhead.

Here's what that used to mean. When using buffered ports Chez would read output-size bytes into its output-buffer. At optimization level 3 (maybe below that level, but certainly at that level) a call to peek-char and read-char would turn into a few assembly instructions that copied bytes directly from the buffer without a function call, only making an actual function call when the buffer was empty.

Rather than compiling the entire program at optimization level 3, the original code I sent above actually made selective use of optimization level 3, like this:
(#3%peek-char port)

(#3%read-char port)
I removed the #3% prefixes before pasting the code because they're something you should use only after you're confident that your code is correct. You can read more about those prefixes here https://cisco.github.io/ChezScheme/csug9.5/system.html#./system:s101:

If a 2 or 3 appears in the form or between the # and % in the abbreviated form, the compiler treats an application of the primitive as if it were compiled at the corresponding optimize level (see the optimize-level parameter). If no number appears in the form, an application of the primitive is treated as an optimize-level 3 application if the current optimize level is 3; otherwise, it is treated as an optimize-level 2 application.

I'm pretty sure all of that is still at least roughly true, but I don't know if it's still completely true. If someone knows otherwise, maybe they can chime in?


May 15 '19 14:05 michaellenaghan

string-append performs badly here. Try replacing it.

May 17 '19 23:05 evilbinary
