
allocation accounting (`time` reports # of bytes, SCM cells and quantity of allocations)

Open jpellegrini opened this issue 1 year ago • 9 comments

Hi @egallesio !

This is one thing I have been thinking about, and I've finally implemented it. Not sure if it's worth it. What do you think -- is this interesting?

The first commit:

  • Adds two fields to the VM structure, allocations and bytes_allocated. These are initialized when the thread starts.

  • Changes the standard STk_must_malloc and STk_must_malloc_atomic to increment those two counters for the current thread.

  • Adds a %thread-allocations primitive that returns three values: the number of allocated bytes, the approximate number of SCM cells, and the number of allocation calls since the thread started.

  • Adds a %thread-allocations-reset! primitive that zeroes the two counters (just in case someone wants it).

A helper function thread_inc_allocs(SCM thr, int size) was added to vm.c.
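
For readers not familiar with the STklos C internals, the gist of the mechanism is roughly the following. This is a sketch only: the real field names, the helper's signature, the per-thread VM accessor and the error handling in vm.c are simplified here (current_vm() is a stand-in, not the actual accessor).

#include <gc/gc.h>                 /* or <gc.h>, depending on the installation */
#include <stdlib.h>

typedef struct {
  unsigned long allocations;       /* number of allocation calls since thread start */
  unsigned long bytes_allocated;   /* total bytes requested since thread start      */
  /* ... the rest of the real VM structure ... */
} vm_t;

static vm_t the_vm;                /* stand-in for the real per-thread VM */
static vm_t *current_vm(void) { return &the_vm; }

static void thread_inc_allocs(vm_t *vm, size_t size)
{
  vm->bytes_allocated += size;
  vm->allocations     += 1;
}

void *STk_must_malloc(size_t size)              /* simplified */
{
  void *p = GC_MALLOC(size);
  if (!p) abort();                              /* the real code signals a Scheme error */
  thread_inc_allocs(current_vm(), size);
  return p;
}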

Example:

stklos> (%thread-allocations (current-thread))
2348284
293535
87928

In this case, 2348284 bytes (approximately 293535 SCM objects) were allocated in 87928 calls since the start of the thread.

  • Caveat I: the counters are kept in unsigned long ints, and there is no provision for recovering from overflow (they will simply wrap around) -- although I don't think this is an issue (the computed differences in allocations will still be correct mod $2^{64}$).

  • Caveat II: reallocations are not counted, since it would be ~a bit~ a lot more complicated to get a meaningful number from them (or not?).

  • Caveat III: we only report the approximate number of SCM objects as a convenience. It is not precise -- it is more like "the equivalent number, in SCM cells" -- since not all allocations are of type SCM (a string, for example, has its data allocated in bytes). We do bytes / sizeof(SCM) to obtain that number. Is this useful? Or should only the other two values be returned?
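
(For concreteness: assuming SCM is a pointer type, so 8 bytes on a 64-bit build, the first example above gives 2348284 / 8 = 293535, which matches the second value returned.)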

The second commit changes the time macro so it reports all that data, using the new %thread-allocations primitive.

Example:

stklos> (time (begin (make-list 200) 'finish))
Elapsed time: 0.0640000000000072 ms
Allocated: 6704 bytes (~ 838 Scheme objects) in 209 allocation calls
finish
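
Conceptually the change is just a before/after diff of the per-thread counters around the evaluated expression. A rough sketch of that part only (this is not the actual STklos definition of time, the pre-existing elapsed-time measurement is elided, and the macro name here is made up):

(define-macro (report-allocs expr)
  `(let-values (((b0 s0 k0) (%thread-allocations (current-thread))))
     (let ((result ,expr))
       (let-values (((b1 s1 k1) (%thread-allocations (current-thread))))
         ;; differences since entering the form
         (format #t "Allocated: ~a bytes (approx. ~a Scheme objects) in ~a allocation calls~%"
                 (- b1 b0) (- s1 s0) (- k1 k0))
         result))))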

The accounting is local to each thread:

(let ((T (make-thread (lambda ()
                        (let-values (( (b s k)
                                       (%thread-allocations (current-thread))))
                          (make-list 1_000_000)
                          (let-values (( (B S K)
                                         (%thread-allocations (current-thread))))
                            (format #t "In child thread: ~a bytes, ~a objects in ~a calls~%"
                                    (- B b) (- S s) (- K k))))))))
  (let-values (( (b s k)
                 (%thread-allocations (current-thread))))
    ;; The million allocations in thread T won't affect
    ;; the counting in parent thread:
    (thread-start! T)
    (thread-sleep! 2) ;; wait for the child thread to allocate a million cells
    (let-values (( (B S K)
                   (%thread-allocations (current-thread))))
      (format #t "In parent thread: ~a bytes, ~a objects in ~a calls~%"
              (- B b) (- S s) (- K k))))
  (thread-join! T))

In parent thread: 800672 bytes, 100084 objects in 17 calls
In child thread: 32000272 bytes, 4000034 objects in 1000008 calls

jpellegrini avatar Mar 06 '23 03:03 jpellegrini

Hi @jpellegrini,

In fact, it is not really useful to do this ourselves, since the Boehm GC already keeps a lot of statistics on its side. They are very complete, and I think we can just pick the relevant data from its internal structures. The difficulty here is to find a pleasant Scheme interface:

In gc.h, we have:


/* Return the number of bytes in the heap.  Excludes collector private  */
/* data structures.  Excludes the unmapped memory (returned to the OS). */
/* Includes empty blocks and fragmentation loss.  Includes some pages   */
/* that were allocated but never written.                               */
/* This is an unsynchronized getter, so it should be called typically   */
/* with the GC lock held to avoid data races on multiprocessors (the    */
/* alternative is to use GC_get_heap_usage_safe or GC_get_prof_stats    */
/* API calls instead).                                                  */
/* This getter remains lock-free (unsynchronized) for compatibility     */
/* reason since some existing clients call it from a GC callback        */
/* holding the allocator lock.  (This API function and the following    */
/* four ones below were made thread-safe in GC v7.2alpha1 and           */
/* reverted back in v7.2alpha7 for the reason described.)               */
GC_API size_t GC_CALL GC_get_heap_size(void);

/* Return a lower bound on the number of free bytes in the heap         */
/* (excluding the unmapped memory space).  This is an unsynchronized    */
/* getter (see GC_get_heap_size comment regarding thread-safety).       */
GC_API size_t GC_CALL GC_get_free_bytes(void);

/* Return the size (in bytes) of the unmapped memory (which is returned */
/* to the OS but could be remapped back by the collector later unless   */
/* the OS runs out of system/virtual memory). This is an unsynchronized */
/* getter (see GC_get_heap_size comment regarding thread-safety).       */
GC_API size_t GC_CALL GC_get_unmapped_bytes(void);

/* Return the number of bytes allocated since the last collection.      */
/* This is an unsynchronized getter (see GC_get_heap_size comment       */
/* regarding thread-safety).                                            */
GC_API size_t GC_CALL GC_get_bytes_since_gc(void);

/* Return the number of explicitly deallocated bytes of memory since    */
/* the recent collection.  This is an unsynchronized getter.            */
GC_API size_t GC_CALL GC_get_expl_freed_bytes_since_gc(void);

/* Return the total number of bytes allocated in this process.          */
/* Never decreases, except due to wrapping.  This is an unsynchronized  */
/* getter (see GC_get_heap_size comment regarding thread-safety).       */
GC_API size_t GC_CALL GC_get_total_bytes(void);

/* Return the total number of bytes obtained from OS.  Includes the     */
/* unmapped memory.  Never decreases.  It is an unsynchronized getter.  */
GC_API size_t GC_CALL GC_get_obtained_from_os_bytes(void);

/* Return the heap usage information.  This is a thread-safe (atomic)   */
/* alternative for the five above getters.   (This function acquires    */
/* the allocator lock thus preventing data racing and returning the     */
/* consistent result.)  Passing NULL pointer is allowed for any         */
/* argument.  Returned (filled in) values are of word type.             */
GC_API void GC_CALL GC_get_heap_usage_safe(GC_word * /* pheap_size */,
                                           GC_word * /* pfree_bytes */,
                                           GC_word * /* punmapped_bytes */,
                                           GC_word * /* pbytes_since_gc */,
                                           GC_word * /* ptotal_bytes */);

/* Structure used to query GC statistics (profiling information).       */
/* More fields could be added in the future.  To preserve compatibility */
/* new fields should be added only to the end, and no deprecated fields */
/* should be removed from.                                              */
struct GC_prof_stats_s {
  GC_word heapsize_full;
            /* Heap size in bytes (including the area unmapped to OS).  */
            /* Same as GC_get_heap_size() + GC_get_unmapped_bytes().    */
  GC_word free_bytes_full;
            /* Total bytes contained in free and unmapped blocks.       */
            /* Same as GC_get_free_bytes() + GC_get_unmapped_bytes().   */
  GC_word unmapped_bytes;
            /* Amount of memory unmapped to OS.  Same as the value      */
            /* returned by GC_get_unmapped_bytes().                     */
  GC_word bytes_allocd_since_gc;
            /* Number of bytes allocated since the recent collection.   */
            /* Same as returned by GC_get_bytes_since_gc().             */
  GC_word allocd_bytes_before_gc;
            /* Number of bytes allocated before the recent garbage      */
            /* collection.  The value may wrap.  Same as the result of  */
            /* GC_get_total_bytes() - GC_get_bytes_since_gc().          */
  GC_word non_gc_bytes;
            /* Number of bytes not considered candidates for garbage    */
            /* collection.  Same as returned by GC_get_non_gc_bytes().  */
  GC_word gc_no;
            /* Garbage collection cycle number.  The value may wrap     */
            /* (and could be -1).  Same as returned by GC_get_gc_no().  */
  GC_word markers_m1;
            /* Number of marker threads (excluding the initiating one). */
            /* Same as returned by GC_get_parallel (or 0 if the         */
            /* collector is single-threaded).                           */
  GC_word bytes_reclaimed_since_gc;
            /* Approximate number of reclaimed bytes after recent GC.   */
  GC_word reclaimed_bytes_before_gc;
            /* Approximate number of bytes reclaimed before the recent  */
            /* garbage collection.  The value may wrap.                 */
  GC_word expl_freed_bytes_since_gc;
            /* Number of bytes freed explicitly since the recent GC.    */
            /* Same as returned by GC_get_expl_freed_bytes_since_gc().  */
  GC_word obtained_from_os_bytes;
            /* Total amount of memory obtained from OS, in bytes.       */
};

/* Atomically get GC statistics (various global counters).  Clients     */
/* should pass the size of the buffer (of GC_prof_stats_s type) to fill */
/* in the values - this is for interoperability between different GC    */
/* versions, an old client could have fewer fields, and vice versa,     */
/* client could use newer gc.h (with more entries declared in the       */
/* structure) than that of the linked libgc binary; in the latter case, */
/* unsupported (unknown) fields are filled in with -1.  Return the size */
/* (in bytes) of the filled in part of the structure (excluding all     */
/* unknown fields, if any).                                             */
GC_API size_t GC_CALL GC_get_prof_stats(struct GC_prof_stats_s *,
                                        size_t /* stats_sz */);
#ifdef GC_THREADS
  /* Same as above but unsynchronized (i.e., not holding the allocation */
  /* lock).  Clients should call it using GC_call_with_alloc_lock to    */
  /* avoid data races on multiprocessors.                               */
  GC_API size_t GC_CALL GC_get_prof_stats_unsafe(struct GC_prof_stats_s *,
                                                 size_t /* stats_sz */);
#endif

/* Get the element value (converted to bytes) at a given index of       */
/* size_map table which provides requested-to-actual allocation size    */
/* mapping.  Assumes the collector is initialized.  Returns -1 if the   */
/* index is out of size_map table bounds. Does not use synchronization, */
/* thus clients should call it using GC_call_with_alloc_lock typically  */
/* to avoid data races on multiprocessors.                              */
GC_API size_t GC_CALL GC_get_size_map_at(int i);

/* Count total memory use in bytes by all allocated blocks.  Acquires   */
/* the lock.                                                            */
GC_API size_t GC_CALL GC_get_memory_use(void);

/* Disable garbage collection.  Even GC_gcollect calls will be          */
/* ineffective.                                                         */
GC_API void GC_CALL GC_disable(void);

/* Return non-zero (TRUE) if and only if garbage collection is disabled */
/* (i.e., GC_dont_gc value is non-zero).  Does not acquire the lock.    */
GC_API int GC_CALL GC_is_disabled(void);

/* Try to re-enable garbage collection.  GC_disable() and GC_enable()   */
/* calls nest.  Garbage collection is enabled if the number of calls to */
/* both functions is equal.                                             */
GC_API void GC_CALL GC_enable(void);

egallesio avatar Mar 06 '23 10:03 egallesio

But libgc wouldn't help report allocations separately for each thread, right?

jpellegrini avatar Mar 06 '23 10:03 jpellegrini

But libgc wouldn't help report allocations separately for each thread, right?

Indeed. However, do you think it is a problem? (I'm not sure that threads are heavily used).

Furthermore, this PR implies that all allocations become more expensive (adding counter management, but more importantly accessing the current thread, which in turn uses pthread_getspecific, which as far as I remember is not really efficient). That means you pay that cost even if you are not interested in counting allocations. On the other hand, libgc already does some accounting (which cannot be disabled, as far as I know), so using the libgc statistics would be free (since it is already done).

Coupling allocation reports with the time macro is definitely a good idea. It is simple to use, and when you are interested in the time taken by an expression, having an idea of the allocations it implies is really important.

BTW, do you know a Scheme that differentiates allocations by thread?

egallesio avatar Mar 07 '23 10:03 egallesio

Looking at gc.h, it seems that GC_get_heap_usage_safe is a good candidate. We could return a plist such as

(:heap-size xxx :free-bytes yyyy ....)

Everything could easily be written in Scheme once we have it. Do you want me to try that?
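
For reference, here is a minimal standalone sketch that reads those counters atomically and prints them in the shape of the proposed plist. It only assumes gc.h (the include path may be <gc.h> or <gc/gc.h> depending on the installation); wiring it into an actual STklos primitive that conses a keyword list is not shown.

#include <stdio.h>
#include <gc/gc.h>

int main(void)
{
  GC_word heap, free_b, unmapped, since_gc, total;

  GC_INIT();
  /* Thread-safe alternative to the five individual getters above. */
  GC_get_heap_usage_safe(&heap, &free_b, &unmapped, &since_gc, &total);
  printf("(:heap-size %lu :free-bytes %lu :unmapped-bytes %lu"
         " :bytes-since-gc %lu :total-bytes %lu)\n",
         (unsigned long) heap, (unsigned long) free_b,
         (unsigned long) unmapped, (unsigned long) since_gc,
         (unsigned long) total);
  return 0;
}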

egallesio avatar Mar 07 '23 11:03 egallesio

BTW, do you know a Scheme that differentiates allocations by thread?

I think Gauche does (I need to confirm). I remember I checked this before coming up with the patch. I'll do a survey and get back to this later.

jpellegrini avatar Mar 07 '23 11:03 jpellegrini

I think Gauche does

Sorry, nope. Gauche reports time only, not memory. Be back later.

jpellegrini avatar Mar 07 '23 11:03 jpellegrini

Ok, as promised... I did a survey. There aren't many Lisps that have a "time" macro that reports the number of allocations (not only "GC time") and also support threads. I found the ones below.

The idea of the test is to start two threads: one allocates a ten-million-element list, and the other does not cons at all. We call time inside the non-consing thread, timed so that the allocation happens while the non-consing thread is measuring. Then we also measure an isolated call to make-list and compare the consing numbers.

| System  | sibling thread | isolated make-list | obs                    | conclusion |
|---------|----------------|--------------------|------------------------|------------|
| Chez    | 0              | 160012224          |                        | per thread |
| Chicken | 53/8           | 18011/3001         | (mutations, not bytes) | per thread |
| Gambit  | 114            | 480000272          |                        | per thread |
| SBCL    | 160,089,344    | 160,024,064        |                        | unified    |
| Clisp   | 160005768      | 160000000          |                        | unified    |
| ABCL    | 10000181       | 10000000           |                        | unified    |

edit: included Clisp

I'm actually quite surprised that the Common Lisps do unified accounting.

The code used is below.

;;;
;;; Chez
;;;

(define a #f)
(define b #f)

(define (P)
  (sleep 0.5) ;; wait for Q to start timing
  (set! a (make-list 10000000 -1)))

(define (Q)
  (time
   (lambda ()
     (sleep 2)  ;; wait for P to cons a million cells
     (set! b 'ok))))

;; the consing cost of make-list, in a sibling thread:
(begin (fork-thread Q)
       (fork-thread P))

;; the consing cost of make-list, outside any threads:
(time (set! a (make-list 10000000 -1)))
;;;
;;; Chicken & Gambit
;;;
(import r7rs) ;; Chicken

(import (srfi 18))

(define a #f)
(define b #f)

(define P (make-thread
           (lambda ()
             (thread-sleep! 0.5)
             (set! a (make-list 10000000 -1)))))

(define Q (make-thread
           (lambda ()
             (time
              (lambda ()
                (thread-sleep! 2)
                (set! b 'ok))))))

;; the consing cost of make-list, in a sibling thread:
(begin
  (thread-start! Q)
  (thread-start! P)
  (thread-join! P)
  (thread-join! Q))

;; the consing cost of make-list, outside any threads:
(time (set! a (make-list 10000000 -1)))
;;;
;;; SBCL
;;;

(import :sb-thread)

(setq a nil)
(setq b nil)

(defun P ()
  (sleep 0.5)
  (setq a (make-list 10000000)))

(defun Q ()
  (time
   (progn
    (sleep 2)
    (setq b 'OK))))

(let ((qq (sb-thread:make-thread #'Q))
      (pp (sb-thread:make-thread #'P)))
  (sb-thread:join-thread pp)
  (sb-thread:join-thread qq))

;; Isolated make-list
(time (progn (setq a (make-list 10000000)) 'ok))

jpellegrini avatar Mar 07 '23 18:03 jpellegrini

All that said... I suppose it may be OK to unify the accounting and disregard threads. After all, time is usually something one calls at the REPL. But someone could perhaps someday want a procedure that can be called inside the code to measure the timing and consing of specific parts (a sketch of such a helper is below). But it's up to you :grin: (For example, Guile has a REPL command ,time that cannot be used inside the code...)
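
Such a helper would just be the same before/after diff as the sketch earlier in this thread, wrapped as an ordinary procedure (the name is illustrative, not part of STklos):

(define (call-with-allocation-count thunk)
  (let-values (((b0 s0 k0) (%thread-allocations (current-thread))))
    (let ((result (thunk)))
      (let-values (((b1 s1 k1) (%thread-allocations (current-thread))))
        (format #t "~a bytes in ~a allocation calls~%" (- b1 b0) (- k1 k0))
        result))))

;; e.g. (call-with-allocation-count (lambda () (make-list 1000)))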

accessing the current thread, which in turn uses pthread_getspecific, which as far as I remember is not really efficient

Indeed, I added these to the definition of the thread structure:

  unsigned long allocations;       /* number of allocation calls since thread start */
  unsigned long bytes_allocated;   /* total bytes allocated since thread start      */

I'll do some benchmarks and get back (again :smile: )

jpellegrini avatar Mar 07 '23 18:03 jpellegrini

There seems to be something like a 5% increase in running time with the thread-specific code, as compared to the current HEAD version of STklos -- for one specific benchmark.

(define (F)
  (time
   (repeat 10
           (set! L '())
           (let loop ((i 10000000))
             (if (> i 0)
                 (begin
                   (set! L (cons i L))
                   (loop (- i 1)))))))
  (format #t "length = ~a~%" (length L)))

(define ff #f)
(define L '())

(repeat 10
        (set! L '())
        (gc)
        (set! ff (make-thread F))
        (thread-start! ff)
        (thread-join! ff)
        (thread-terminate! ff))

I included the gc call because STklos was consuming a bit too much memory (but this call is outside the time measurement).

For STklos-HEAD, I called the above twice (hence 20 timings):

10291.281,
10420.08,
11502.345,
11053.034,
12055.191,
10922.766,
10976.697,
11970.819,
10234.252,
10517.499,
10287.137,
10553.226,
11439.516,
11052.777,
11934.334,
10953.393,
10930.94,
11776.797,
10219.99,
10525.149

avg   = 10980.86115

For the version with the per-thread memory accounting:

10961.646,
11017.736,
12121.469,
11811.318,
12134.749,
12057.604,
11900.644,
12566.177,
11021.459,
11193.761,
10866.668,
11085.077,
12079.91,
11495.933,
12078.783,
11918.455,
11681.912,
12446.421,
10824.307,
11192.771

avg   = 11622.84

So there was an increase of $641.97885$ ms in the average time -- about 5.8% of the average for the plain STklos timings.

Not that this is a very good benchmark... But there are the numbers! :)

jpellegrini avatar Mar 07 '23 20:03 jpellegrini

I have finally merged this PR. As I already said, I was reluctant to add code to the allocation functions if it incurred a performance penalty. Unluckily, I spent a lot of time comparing a version with memory allocation accounting and a v2.0 compiler. The problem was that in the new version I had forgotten to insert the comparison functions (<, >, ...) in the inline table :disappointed: . So the new version was really slower on my benchmarks, but not because of the GC accounting.

The new time macro also reports the number of GC runs during the evaluation. Allocation counting is done per thread.

stklos> (import (srfi 1))
stklos> (define lst (iota 10_000))
;; lst
stklos> (time (reverse lst) 'end)
 ;   Elapsed time: 0.884 ms
 ;   Allocations: 240000 bytes in 10000 allocation calls (GC: 0)
end
stklos> (time (reverse! lst) 'end)
;    Elapsed time: 0.174 ms
;    Allocations: 0 byte in 0 allocation call (GC: 0)
end
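
(As for where the GC run count can come from: gc.h also exposes GC_get_gc_no(), mentioned in the excerpt above, so one plausible mechanism, not necessarily the one actually merged, is to sample that counter before and after the timed expression. A minimal standalone illustration, using only libgc:)

#include <gc/gc.h>   /* or <gc.h>, depending on the installation */
#include <stdio.h>

int main(void)
{
  GC_INIT();
  GC_word before = GC_get_gc_no();

  for (int i = 0; i < 1000000; i++)   /* some allocating work */
    (void) GC_MALLOC(24);

  /* Difference of the GC cycle counter around the work. */
  printf("GC runs during the work: %lu\n",
         (unsigned long) (GC_get_gc_no() - before));
  return 0;
}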

I have also removed your count of Scheme objects, since I found it misleading (allocating a big string is a single Scheme object, and converting the byte count to a multiple of the pointer size is not really related to the number of objects).

egallesio avatar May 01 '24 14:05 egallesio

That's great @egallesio ! I just don't get why the PR wasn't closed, since you actually did merge it (17553c66e05899f8e4f3e903a03c2ad4aebf3ab7 is exactly "Merge branch 'jpellegrini-gc-accounting'").

jpellegrini avatar May 01 '24 15:05 jpellegrini

Weird. This is the second time GitHub doesn't close a merged PR. Thanks for pointing it out. I'll close it.

egallesio avatar May 01 '24 15:05 egallesio

I'll close it.

Hm, looks like it didn't work again :smile: I'll try to close it.

jpellegrini avatar May 01 '24 16:05 jpellegrini

Don't know if this is you or me, but it is closed now :tada:

egallesio avatar May 01 '24 18:05 egallesio