Support for IEC (KiB, MiB, ...) and SI (kB, MB, ...) binary units

Open HenrikBengtsson opened this issue 8 years ago • 12 comments

Background

There are a few standards [1] for binary prefixes for byte-size units:

  • IEC: KiB (1024 bytes), MiB (1024^2 bytes), GiB (1024^3 bytes), TiB (1024^4 bytes), ...
  • JEDEC & customary standard: KB (1024 bytes), MB (1024^2 bytes), GB (1024^3 bytes)

Note that for decimal prefixes, we have:

  • SI: kB (1000 bytes), MB (1000^2 bytes), GB (1000^3 bytes), TB (1000^4 bytes), ...

For byte versus bit, we have:

  • IEC & customary standard: 'B' for 'byte' and 'bit' for 'bit' [3,4].
  • IEEE: 'b' for 'bit' [3].
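
To make the difference between the binary and decimal prefixes concrete, here is a minimal sketch that converts one byte count under each base. The `bases` lookup and the variable names are illustrative only and are not part of R.

```r
# Illustrative only: 'bases' is a hypothetical lookup, not an R API
bases <- c(IEC = 1024, SI = 1000)

n_bytes <- 40000040                  # the example size used in the Problem section

mib <- n_bytes / bases[["IEC"]]^2    # mebibytes (MiB), base 2
mb  <- n_bytes / bases[["SI"]]^2     # megabytes (MB), base 10

cat(sprintf("%.1f MiB vs %.1f MB\n", mib, mb))
```

The same number of bytes reads as roughly 38.1 under the binary prefix and 40.0 under the decimal one, which is why the unit label has to say which base was used.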

Problem

  • R uses Kb, Mb and Gb. None of these are part of the above byte standards. Note the lower case 'b' is typically used for bit and not byte.

For example,

> size <- object.size(1:1e7)
> size
40000040 bytes
> format(size, units="auto")
[1] "38.1 Mb"

This specific example illustrates a problem with utils:::format.object_size(). Another example is:

> base::gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 279622 15.0     592000 31.7   350000 18.7
Vcells 478234  3.7    1023718  7.9   786432  6.0
> str(base::gc())
 num [1:2, 1:6] 279638 478263 15 3.7 592000 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "Ncells" "Vcells"
  ..$ : chr [1:6] "used" "(Mb)" "gc trigger" "(Mb)" ...

The issue with non-standard byte units in R has been reported to R-devel [5].
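
The ambiguity of the lower-case 'b' can be illustrated with a small calculation: if a reader takes R's "38.1 Mb" to mean megabits, as the IEEE symbol suggests, the implied size is off by roughly a factor of eight. This is only a sketch; the variable names are made up for illustration.

```r
n_bytes <- 40000040                       # actual size in bytes

# What R means by "38.1 Mb": base-2 megabytes
r_value <- round(n_bytes / 1024^2, 1)

# Misreading "Mb" as SI megabits (1 megabit = 10^6 bits = 125000 bytes)
misread_bytes <- r_value * 1e6 / 8

cat(sprintf("actual: %.0f bytes, misread: %.0f bytes\n", n_bytes, misread_bytes))
```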

Wish / Suggestion

  • Use units KiB, MiB, GiB, TiB, ... everywhere in R because they are unambiguous. UPDATE: ... or SI units?
  • Migrate smoothly by:
    • [x] Add support for IEC, ~~JEDEC~~ and SI prefixes where applicable;
      • [x] IEC units for utils:::format.object_size(), cf. PR #16649. Completed as of 2016-01-06 in r69879.
      • [x] ~~JEDEC units for utils:::format.object_size()~~, cf. PR #16657. UPDATE: See discussion in comments below.
      • [x] SI units for utils:::format.object_size(). UPDATE: Added to R-devel on 2017-01-11 (r71960)
    • [ ] Add an option for the default unit standard used in R, e.g. getOption("byte.unit.standard", "legacy").
    • [ ] Make ~~IEC~~ SI units the new default, e.g. gc(), format.object_size(..., units="auto") and allocation error messages.
    • [ ] Deprecate invalid units (lower case b) with .Deprecated().
    • [ ] Eventually drop them using .Defunct().
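
A global option along these lines could be consulted roughly as follows. This is a sketch under assumptions: `format_size()` is a hypothetical wrapper, `byte.unit.standard` is only the option name proposed in this issue (R itself does not read it), and the `standard` argument of `format()` for `object_size` requires an R version that includes the SI support mentioned above.

```r
# Hypothetical wrapper; 'byte.unit.standard' is the option name proposed in
# this issue, not an option that R itself consults.
format_size <- function(x, standard = getOption("byte.unit.standard", "legacy")) {
    format(structure(x, class = "object_size"), units = "auto", standard = standard)
}

legacy_out <- format_size(40000040)       # option unset: falls back to "legacy"

old <- options(byte.unit.standard = "SI") # a user opts in to SI units
si_out <- format_size(40000040)
options(old)                              # restore the previous state

print(c(legacy = legacy_out, SI = si_out))
```

Keeping the fallback at "legacy" means existing output is unchanged until the default is deliberately switched.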

Known functions / code affected:

Note that the out-of-memory errors in the native code cannot easily be tweaked to support a global option; attempting to do so risks triggering another out-of-memory error.

Usages of IEC / SI elsewhere

  • The Ubuntu Linux distribution uses the IEC prefixes for base-2 units and SI prefixes for base-10 units [6].
  • Windows and Android use JEDEC prefixes.
  • Mac OS X uses decimal SI units kB since 2009.

References

  1. Binary prefix, Wikipedia, https://en.wikipedia.org/wiki/Binary_prefix
  2. Byte, Wikipedia, https://en.wikipedia.org/wiki/Byte#Unit_symbol
  3. Bit, Wikipedia, https://en.wikipedia.org/wiki/Bit#Unit_and_symbol
  4. Man page units(7), http://man7.org/linux/man-pages/man7/units.7.html
  5. R devel thread 'format(object.size(...), units): KB, MB, and GB instead of Kb, Mb, and Gb?' started on 2014-09-07
  6. UnitsPolicy, Ubuntu Wiki, Jan 2016, https://wiki.ubuntu.com/UnitsPolicy
  • UPDATE 2016-05-03: Added src/gnuwin32/malloc.c to the list of places that needs to be updated.
  • UPDATE 2017-01-01: Aim for SI to be the new standard.
  • UPDATE 2017-01-11: Propose option byte.unit.standard for smooth transition.
  • UPDATE 2017-05-17: Identified more (all?) locations in R and native code that require updating.

HenrikBengtsson avatar Dec 30 '15 21:12 HenrikBengtsson

As a first step, I just filed a backward-compatible patch to add support for IEC units in utils:::format.object_size(), cf. PR #16649.

UPDATE: This has been implemented as of 2016-01-06 in r69879.

HenrikBengtsson avatar Dec 30 '15 21:12 HenrikBengtsson

IEC units are now supported by R. As the next step, I filed a backward-compatible patch to add support for JEDEC units in utils:::format.object_size(), cf. PR #16657.

HenrikBengtsson avatar Jan 06 '16 15:01 HenrikBengtsson

  • Can you give an example and reference for "The Ubuntu Linux distribution uses the IEC prefixes since 2010" ? Personally, I find the 'KiB' notation quite ugly. I see df -h, du -h, ls -h all use suffixes K, M, G .. but no "iB" (or "B" or "b").
  • The real problem is that the SI standard really wants "KB" or "MB" to mean something different than "KiB" or "MiB" and JEDEC does not.... But really the SI system is the world standard one, and JEDEC is mainly "industry" and not science based (which the SI is). So, in principle --- if we are willing to break backward compatibility --- we should really move towards the real world standard, i.e., the SI standard system.... and consequently, I'd be against endorsing JEDEC any more than we do now (by accepting it on "input").

mmaechler avatar Jan 07 '16 16:01 mmaechler

Thanks for the comments.

  • I got the "Ubuntu" statement from [1], but must have been sloppy. I've now clarified it to say: "The Ubuntu Linux distribution uses the IEC prefixes for base-2 units and SI prefixes for base-10 units" which reflects Ubuntu's official UnitsPolicy.
  • Searching the web, there are references starting ~2010 (around Ubuntu 10.10) saying Ubuntu will move to using decimal/base-10 units with SI prefixes throughout. I don't know where they are regarding that goal.
  • SI vs JEDEC confusion: If I understand your comment correctly, you're saying we'll introduce more confusion if we explicitly add support for JEDEC. If so, I agree with you. My idea was to introduce it properly, to make it explicit that the old R units are home brewed. I'm happy to skip JEDEC.
  • Long-term for R: If this is what you are saying, I agree, supporting both decimal/base-10 and binary/base-2 units, using SI and IEC prefixes respectively, would be ideal. I'm all for that as well. Since R has only a single API entry (=utils::format.object_size()) we could even introduce argument base=getOption("object.size.base", 2) controlling whether base 2 or base 10 should be displayed (when units="auto"). It would also allow us to migrate from current base 2 to base 10 smoothly (and allow users to undo via the option), if that is where we are heading. BTW, gc() should utilize utils::format.object_size().
  • To implement the transition from R's current base-2 units (Kb, Mb, Gb) to SI/base-10 units (kB, MB, GB), it might be less of a shock if one does this over a few release cycles:
    1. Switch to using IEC/base-2 units (KiB, MiB, GiB, ...) for units="auto".
    2. Deprecate explicit usage of units="Kb", units="Mb", ...
    3. Switch to using SI/base-10 units (kB, MB, GB, ...) for units="auto".
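
Step 2 of this transition could be sketched as below. `check_units()` is a hypothetical helper and not part of R; only `.Deprecated()` is an existing base R function, used here with its `msg` argument.

```r
# Hypothetical helper: warn when a legacy unit (lower-case 'b') is requested.
check_units <- function(units) {
    if (grepl("^[KMGTP]b$", units)) {
        .Deprecated(msg = sprintf(
            "Unit '%s' is deprecated; use an IEC (e.g. 'KiB') or SI (e.g. 'kB') unit instead",
            units))
    }
    invisible(units)
}

check_units("MiB")                        # standard unit: no warning

# Capture the deprecation warning instead of letting it print
res <- tryCatch(check_units("Mb"),
                warning = function(w) conditionMessage(w))
print(res)
```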

What do you think?

HenrikBengtsson avatar Jan 07 '16 16:01 HenrikBengtsson

Another approach that could work is to add support for units="IEC", units="SI" and units="legacy". That can be done without breaking backward compatibility. The units="auto" can equal units="legacy" and any future transitions can be in what units="auto" corresponds to.

UPDATE: The issue with this is that it's not possible to control whether units="MB" is meant to be current R "legacy" (base-2) units or SI (base-10) units.
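
The ambiguity of units="MB" shows up when the same size is formatted under both interpretations. This sketch assumes an R version whose format() method for object_size accepts the standard argument (the SI support referred to elsewhere in this thread).

```r
size <- structure(2e6, class = "object_size")    # 2,000,000 bytes

as_legacy <- format(size, units = "MB", standard = "legacy")  # divides by 1024^2
as_si     <- format(size, units = "MB", standard = "SI")      # divides by 1000^2

print(c(legacy = as_legacy, SI = as_si))
```

The unit string alone cannot tell the two apart, so a separate argument is needed to say which standard is meant.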

HenrikBengtsson avatar Jan 14 '16 20:01 HenrikBengtsson

Here's my new proposal for supporting "legacy", IEC and SI units in a backward compatible way and such that it will be easy to switch from today's default "legacy" to SI units at some point in R's future.

The file to be updated in R is src/library/utils/R/object.size.R:

object.size <- function(x)
    structure(.Call(C_objectSize, x), class = "object_size")

format.object_size <- function(x, units = "b", standard = "auto", digits = 1L, ...)
{
    known_bases <- c(legacy = 1024, IEC = 1024, SI = 1000)
    known_units <- list(
        SI      =  c("B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"),
        IEC     =  c("B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"),
        legacy  =  c("b", "Kb", "Mb", "Gb", "Tb", "Pb"),
        LEGACY  =  c("B", "KB", "MB", "GB", "TB", "PB")
    )

    units <- match.arg(units, c("auto", unique(unlist(known_units, use.names = FALSE))))
    standard <- match.arg(standard, c("auto", names(known_bases)))

    ## Infer 'standard' from 'units'?
    if (standard == "auto") {
        standard <- "legacy"           ## default; to become "SI"
        if (units != "auto") {
            if (grepl("iB$", units)) {
                standard <- "IEC"
            } else if (grepl("b$", units)) {
                standard <- "legacy"   ## keep when "SI" is the default
            } else if (units == "kB") {
                ## SPECIAL: Drop when "SI" becomes the default
                stop("For SI units, please specify standard = \"SI\"")
            }
        }
    }

    base <- known_bases[[standard]]
    units_map <- known_units[[standard]]

    if (units == "auto") {
        power <- if (x <= 0) 0 else min(as.integer(log(x, base = base)), length(units_map) - 1L)
    } else {
        power <- match(toupper(units), toupper(units_map)) - 1L
        if (is.na(power)) {
            stop(gettextf("Unit %s is not part of standard %s", sQuote(units), sQuote(standard)))
        }
    }

    unit <- units_map[power + 1L]

    ## SPECIAL: Use suffix 'bytes' instead of 'b' for 'legacy'
    if (power == 0 && standard == "legacy") unit <- "bytes"
    
    paste(round(x / base^power, digits = digits), unit)
}

print.object_size <-
    function(x, quote = FALSE, units = "b", standard = "auto", digits = 1L, ...)
{
    y <- format.object_size(x, units = units, standard = standard, digits = digits)
    if(quote) print.default(y, ...) else cat(y, "\n", sep = "")
    invisible(x)
}
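
The units = "auto" branch above picks the prefix from the integer part of the logarithm and caps it at the largest unit the standard defines. A stand-alone sketch of just that step, with `auto_power()` as a hypothetical helper name:

```r
# Hypothetical stand-alone version of the units = "auto" branch
auto_power <- function(x, base, n_units) {
    if (x <= 0) return(0L)
    # Integer part of log_base(x) selects the largest prefix not exceeding x;
    # min() keeps the index within the available unit names
    min(as.integer(log(x, base = base)), n_units - 1L)
}

p_small  <- auto_power(1023,         base = 1024, n_units = 6L)  # stays at bytes
p_kilo   <- auto_power(1024,         base = 1024, n_units = 6L)  # first prefix
p_capped <- auto_power(3.1 * 1024^6, base = 1024, n_units = 6L)  # capped at last unit

print(c(p_small, p_kilo, p_capped))
```

The cap explains why the legacy standard, which stops at Pb, reports very large sizes as thousands of Pb instead of moving on to a larger prefix.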

Examples and tests

assert_size <- function(x, ..., expected) {
    size <- structure(x, class = "object_size")
    res <- try(format(size, ...), silent = TRUE)
    if (expected == "error") {
        if (!inherits(res, "try-error"))
            stop(sprintf("Expected %s but got %s", sQuote(expected), sQuote(res)))
    } else if (res != expected) {
        stop(sprintf("Expected %s but got %s", sQuote(expected), sQuote(res)))
    }
}

## The default is the 'legacy' standard (backward compatibility)
assert_size(0,    expected = "0 bytes")
assert_size(1,    expected = "1 bytes")
assert_size(1023, expected = "1023 bytes")
assert_size(1024, expected = "1024 bytes")

## Standard inferred from 'legacy' units
assert_size(0,            units = "b",  expected = "0 bytes")
assert_size(1,            units = "B",  expected = "1 bytes")
assert_size(999,          units = "B",  expected = "999 bytes")
assert_size(1000,         units = "Kb", expected = "1 Kb")
assert_size(1024,         units = "KB", expected = "1 Kb")
assert_size(2.0 * 1000^2, units = "MB", expected = "1.9 Mb")
assert_size(3.1 * 1000^3, units = "GB", expected = "2.9 Gb")
assert_size(4.2 * 1000^8, units = "TB", expected = "3819877747446.3 Tb")
assert_size(4.2 * 1000^9, units = "Pb", expected = "3730349362740.5 Pb")

## Standard inferred from 'IEC' units
assert_size(1000,         units = "KiB", expected = "1 KiB")
assert_size(1024,         units = "KiB", expected = "1 KiB")
assert_size(2.0 * 1000^2, units = "MiB", expected = "1.9 MiB")
assert_size(3.1 * 1000^3, units = "GiB", expected = "2.9 GiB")
assert_size(4.2 * 1000^8, units = "TiB", expected = "3819877747446.3 TiB")
assert_size(4.2 * 1000^9, units = "PiB", expected = "3730349362740.5 PiB")

## Inferring standard from 'SI' units is not possible because they
## conflict with 'legacy' units (and it would be confusing to support
## high-range SI units not covered by the legacy units)
assert_size(3.1 * 1024^1, units = "kB", expected = "error")
assert_size(3.1 * 1024^6, units = "EB", expected = "error")
assert_size(3.1 * 1024^7, units = "ZB", expected = "error")
assert_size(3.1 * 1024^8, units = "YB", expected = "error")


## Automatic 'legacy' units (default)
assert_size(0,            units = "auto", expected = "0 bytes")
assert_size(1,            units = "auto", expected = "1 bytes")
assert_size(1023,         units = "auto", expected = "1023 bytes")
assert_size(1024,         units = "auto", expected = "1 Kb")
assert_size(2.0 * 1000^2, units = "auto", expected = "1.9 Mb")

## Automatic 'legacy' units
assert_size(0,            units = "auto", standard = "legacy", expected = "0 bytes")
assert_size(1,            units = "auto", standard = "legacy", expected = "1 bytes")
assert_size(1023,         units = "auto", standard = "legacy", expected = "1023 bytes")
assert_size(1024,         units = "auto", standard = "legacy", expected = "1 Kb")
assert_size(2.0 * 1000^2, units = "auto", standard = "legacy", expected = "1.9 Mb")
assert_size(3.1 * 1024^3, units = "auto", standard = "legacy", expected = "3.1 Gb")
assert_size(3.1 * 1024^4, units = "auto", standard = "legacy", expected = "3.1 Tb")
assert_size(3.1 * 1024^5, units = "auto", standard = "legacy", expected = "3.1 Pb")
assert_size(3.1 * 1024^6, units = "auto", standard = "legacy", expected = "3174.4 Pb")

## Automatic 'IEC' units
assert_size(0,            units = "auto", standard = "IEC", expected = "0 B")
assert_size(1,            units = "auto", standard = "IEC", expected = "1 B")
assert_size(1023,         units = "auto", standard = "IEC", expected = "1023 B")
assert_size(1024,         units = "auto", standard = "IEC", expected = "1 KiB")
assert_size(2.0 * 1000^2, units = "auto", standard = "IEC", expected = "1.9 MiB")
assert_size(3.1 * 1024^3, units = "auto", standard = "IEC", expected = "3.1 GiB")
assert_size(3.1 * 1024^4, units = "auto", standard = "IEC", expected = "3.1 TiB")
assert_size(3.1 * 1024^5, units = "auto", standard = "IEC", expected = "3.1 PiB")
assert_size(3.1 * 1024^6, units = "auto", standard = "IEC", expected = "3.1 EiB")
assert_size(3.1 * 1024^7, units = "auto", standard = "IEC", expected = "3.1 ZiB")
assert_size(4.2 * 1024^8, units = "auto", standard = "IEC", expected = "4.2 YiB")
assert_size(4.2 * 1024^9, units = "auto", standard = "IEC", expected = "4300.8 YiB")

## Automatic 'SI' units
assert_size(0,            units = "auto", standard = "SI", expected = "0 B")
assert_size(1,            units = "auto", standard = "SI", expected = "1 B")
assert_size(999,          units = "auto", standard = "SI", expected = "999 B")
assert_size(1000,         units = "auto", standard = "SI", expected = "1 kB")
assert_size(1024,         units = "auto", standard = "SI", expected = "1 kB")
assert_size(2.0 * 1000^2, units = "auto", standard = "SI", expected = "2 MB")
assert_size(3.1 * 1000^3, units = "auto", standard = "SI", expected = "3.1 GB")
assert_size(3.1 * 1000^4, units = "auto", standard = "SI", expected = "3.1 TB")
assert_size(3.1 * 1000^5, units = "auto", standard = "SI", expected = "3.1 PB")
assert_size(3.1 * 1000^6, units = "auto", standard = "SI", expected = "3.1 EB")
assert_size(3.1 * 1000^7, units = "auto", standard = "SI", expected = "3.1 ZB")
assert_size(4.2 * 1000^8, units = "auto", standard = "SI", expected = "4.2 YB")
assert_size(4.2 * 1000^9, units = "auto", standard = "SI", expected = "4200 YB")

UPDATE: 2017-01-01: Forgot that SI uses 'kB'; minor tweaks above.

HenrikBengtsson avatar Jan 02 '17 01:01 HenrikBengtsson

UPDATE: SI units are now supported in R-devel, see r71960.

HenrikBengtsson avatar Jan 11 '17 18:01 HenrikBengtsson

I'll just add a link to a thread on Twitter for future reference on this topic: https://twitter.com/henrikbengtsson/status/1231986947360354305

llrs avatar Feb 24 '20 17:02 llrs

Posted PR18297 titled 'Use standard file-size units everywhere in base R (e.g., Mb -> MiB)' on 2022-02-01.

HenrikBengtsson avatar Feb 01 '22 18:02 HenrikBengtsson

Filed PR18435 adding new SI prefixes RB (ronnabytes) and QB (quettabytes) to format() for object_size.

HenrikBengtsson avatar Nov 19 '22 02:11 HenrikBengtsson

SI prefixes RB (ronnabytes) and QB (quettabytes) have been added to R-devel (to become R 4.3.0), cf. https://github.com/wch/r-source/commit/cd2d0ba6ca3dd419179de94caaafcedd47b3a855

HenrikBengtsson avatar Nov 21 '22 17:11 HenrikBengtsson

One more location to fix was just added to src/main/memory.c in R-devel, cf. https://github.com/wch/r-source/commit/459492bc14ad5a3ff735d90a70ad71f6d5fe9faa.

HenrikBengtsson avatar Nov 29 '23 21:11 HenrikBengtsson