zero out memory of uninitialized fields
Raised here: https://groups.google.com/forum/#!topic/julia-users/O5S8pPav5Ks.
Instead of initializing them to zero, we could initialize them to a large, nonsensical value to help catch access of uninitialized fields.
I guess, but this only applies to pure data types like Int or Float64. Frankly, I'm still not entirely convinced that this is a real usability problem, but unlike arrays, there's no good performance reason not to guarantee zeroed memory.
The zeroing could take a lot of time if you are inserting lots of things into a large tree or a list, or even an array of composite types.
Like Stefan, I'm not too worried about performance here. One reason is that uninitialized object fields are relatively rare. `new` calls that pass all arguments would be unaffected. LLVM might also be able to remove the extra store in `x = new(); x.fld = 1`. And if the object is heap allocated, the overhead of an extra store would be comparatively small.
One corner case that could cause problems is uninitialized bits fields in immutable types. It's a gotcha if they are zeroed when allocating individual objects, but not when allocating an array of them. Right now we consistently say "bits aren't automatically zeroed". If you like automatic zeroing, you want it everywhere, and doing it sometimes is arguably worse than doing it never.
One way out of that corner case is to disallow explicitly-uninitialized bits fields in immutables. Uninitialized references in immutables have uses (e.g. `Nullable{BigMutableThing}`), but uninitialized bits fields are less reasonable.
Frankly, I'd rather leave it as-is, or zero everything. For small arrays we can just pay the price, and allocate big arrays directly with mmap. Might not be so bad.
I'd be in favour of initializing everything. If this turns out to be a bottleneck, as measured in a validated benchmark, then we can see whether introducing `ccall(:jl_allocate_uninitialized_array)` for a few special cases wouldn't do the trick.
Regarding zeroing everything, it wouldn't be too hard to change our `malloc_a16` function (in `gc.c`, which is used to allocate arrays) to a `calloc_a16` function, which calls `calloc`, shifts the pointer, and stores the original pointer before the pointed-to data. This is how the `_aligned_malloc` function works on Windows, and how we defined a 16-byte (or 32-byte) aligned `malloc` for FFTW (which is so trivial I don't think relicensing would be an issue).
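The over-allocate-and-stash-the-pointer scheme described above can be sketched in C roughly as follows. This is a minimal illustration of the idea, not the actual code in `gc.c`; the names `calloc_a16` and `free_a16` are borrowed from the comment for clarity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Over-allocate with calloc (so the memory is zeroed), round the returned
 * pointer up to a 16-byte boundary, and stash the original pointer in the
 * word just before the aligned block so it can be recovered for free(). */
static void *calloc_a16(size_t nbytes)
{
    /* room for alignment slack plus the stored original pointer */
    void *orig = calloc(1, nbytes + 15 + sizeof(void *));
    if (orig == NULL)
        return NULL;
    uintptr_t aligned =
        ((uintptr_t)orig + sizeof(void *) + 15) & ~(uintptr_t)15;
    ((void **)aligned)[-1] = orig;   /* store original pointer before the data */
    return (void *)aligned;
}

static void free_a16(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);      /* recover and free the original pointer */
}
```

Because `calloc` zeroes the whole over-allocated block, the aligned region handed back to the caller is zeroed as well, which is the property being discussed in this thread.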
I would prefer zeroing everything rather than only in some places - or use a specific byte pattern. I guess we can start by using calloc and validate through a benchmark as suggested.
Also this is something we can presumably backport to 0.3 for Ron's class, if it all works out.
Which would be a pretty drastic semantic change within one version number...
Regarding the actual issue: I think it is a good idea to initialize with zero if the performance degradation are negligible. Would still be good to have a flag to get the malloced array if desired.
Which would be a pretty drastic semantic change within one version number...
It's a safe change though since this is not a behavior anyone could reasonably rely on.
(@tknopp, you can always call `pointer_to_array(convert(Ptr{T}, c_malloc(sizeof(T) * n)), n, true)` or similar to get a `malloc`ed array, so I don't think we necessarily need a flag. Assuming the overhead of `calloc` is normally negligible, anyone needing an uninitialized array will be working at such a low level that calling `c_malloc` won't be unreasonable.)
I tend to agree that people shouldn't rely on this behavior and it probably shouldn't even be documented; they should use `zeros(...)` if they want guaranteed zero initialization. (Of course, the implementation of `zeros` in Base can take advantage of it.)
@StefanKarpinski: Indeed. Still, I am not sure if backporting features or semantic language changes is a good idea. It's hard to keep track of which version a feature gets into. Or one might even have to distinguish minor version numbers (e.g. 0.33 and 0.34) when a new feature gets in in 0.34. This then has an impact on all packages...
@stevengj: While I use `ones` and `zeros` myself when initializing an array, I think the Array constructor should be a valid way to initialize an array. Currently I am not using it because I want zero initialization. If the constructor initialized with zero, it would IMHO be the more logical way to create an array. For every other data structure I also use the constructor.
@tknopp, I'm not saying you shouldn't use the constructor. I'm saying that if a `calloc` version is fast enough then we need not provide a high-level uninitialized-array constructor (nor "a flag" for this).
I made an experimental branch that uses `calloc` instead of `malloc`, and so far I haven't been able to detect any performance differences (all the differences are swamped by the measurement noise) on MacOS.
Interesting and tangentially related: http://research.swtch.com/sparse
Do you want users to rely on zero initialization? If yes, best to implement and document it so everyone's on the same page. If no, use some nonzero filler like 0b10101010 or just leave it uninitialized like it is today. Facts of life: if you implement zero initialization, users will rely on it, documented or not, whether you want them to or not. Either way, there should be some easy way to get uninitialized memory; e.g., NumPy has `empty()` in addition to `zeros()` and `ones()`, which you can use when you want performance.
@RauliRuohonen in the absence of explicit documentation to the contrary (and even then, not guaranteed), users will default to assuming zero initialization. This is the case in Graphs.jl, where `dijkstra_shortest_paths` can return uninitialized memory (see https://github.com/JuliaLang/Graphs.jl/issues/140 for an example).
This newbie's vote is for zero-by-default, and the sooner it's implemented, the better.
I personally would prefer a byte pattern if we were to do this.
Also it is quite safe to do this by default, and in the few performance sensitive cases, have a way to get uninitialized memory.
I personally would prefer a byte pattern if we were to do this.
I'm genuinely curious - why would a byte pattern be preferable to zeros, especially when new pages are supposedly zeroed by the OS by default?
A byte pattern makes it easier to find uses of uninitialized values. The implication is that people must make sure to manually initialize everything, or else they will get some big useless value, which at least makes the bug easier to find.
However, this strikes me as going out of our way to slap people on the wrist. If we are going to put in the effort to guarantee initialization, I'd rather do people a favor and initialize with a likely-useful value (zero). You'd never need to write `Foo(x,y) = new(0,0)`. And given `calloc`, there might be a performance advantage.
they will get some big useless value which at least makes it easier to find the bug
Or, in a worst case, they will get a big useless value that is close enough to an expected value that it slips through, and causes some catastrophic failure down the line?
Unless Julia's going to explicitly test and warn on uninitialized values using this byte pattern (thereby voiding any legitimate uses of that particular pattern), I don't see the advantage - and I see two disadvantages: 1) as you said, calloc() provides an optimized zero, and writing a specific byte pattern might result in poorer performance; and 2) the principle of "do[ing] what is expected" seems to favor zeros.
I think that either doing what we do now or initializing with zeros and having that be a specified, reliable behavior are the two best options.
I think initializing to zeros is really the way to go unless there's a serious performance cost. It simplifies the mental model of how memory allocation works and provides a lot more security.
Proposal: zero-fill by default; provide a named parameter for an option to use "raw" malloc for when performance is über-critical.
The security issue is nothing to sneeze at, especially, for example, when building out web services with authenticated sessions. Also, it would make auditing things like Crypto.jl that much more complex.
fwiw, we appear to have some bugs in pcre.jl related to the unintentional use of undefined values from an `Array(Ptr{T}, x)`, but `zeros(Ptr{T}, x)` doesn't work anymore (it's deprecated)
Hmm.. That seems like a reasonable usage of `zeros(Ptr{T}, x)`. Maybe we should have changed the documentation instead of deprecating the method?
and there's another one in socket.jl (I changed some local behavior of ccall that is causing these to become more visible, as segfaults)
Thinking more about this, I have come to believe that not zeroing the memory in `Array(Ptr{T}, x)` is a mistake, and should rather be fixed in the array constructor than in a separate `zeros` method.
I think `zero(::Ptr)` and thus `zeros(::Array{Ptr})` were not considered correct because `C_NULL` is not the additive identity for pointers.
What about using `fill(Ptr{T}(0), n)` here?
The consensus that seems to be forming is that initializing newly allocated memory should be the default, for security reasons. (There should also be a sufficiently obscure escape hatch for well-tested low-level library functions.) There's nothing wrong with `zeros` or `fill`, but rather with the `Array` constructor: it should choose safety by default.
+1
I even agree with @stevengj's original point: those who need `malloc` dirty memory can just use `ccall`.
@eschnett My point was not about the behavior of `Array`, but about `zeros`. I don't think `zeros` should be used when constructing arrays of null pointers. Other people are better placed to decide on the default behavior of `Array`.
@andreasnoack I agree.
`fill(C_NULL, n)` would be my favorite way to get an array of null pointers. But yes, `Array` should zero-fill as well.
Just checking: has this been implemented recently? I notice that newly-allocated arrays (from yesterday's master) are all getting zeros:
julia> a = Array(Float64,(6,6))
6x6 Array{Float64,2}:
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
or very close to it:
julia> a = Array(Float64,(6,6))
6x6 Array{Float64,2}:
0.0 9.88131e-324 1.4822e-323 1.97626e-323 2.96439e-323 3.45846e-323
4.94066e-324 1.4822e-323 1.4822e-323 1.97626e-323 2.96439e-323 3.45846e-323
9.88131e-324 9.88131e-324 1.4822e-323 2.47033e-323 2.96439e-323 3.95253e-323
1.4822e-323 9.88131e-324 1.4822e-323 2.47033e-323 2.96439e-323 3.45846e-323
4.94066e-324 9.88131e-324 1.97626e-323 2.47033e-323 3.45846e-323 3.95253e-323
4.94066e-324 1.4822e-323 1.97626e-323 2.47033e-323 3.45846e-323 3.95253e-323
No, not implemented yet. Close doesn't count! Very often you'll get zeros purely by accident since new pages from the OS are zero'd already.
Ah, ok. Thanks for the update :) It was weird that I was getting values < eps(Float64).
Hi all,
Given the proposed feature freeze (https://groups.google.com/d/msg/julia-dev/s2-Zj3acL_g/Nw7MV8MT3QwJ), could I suggest that we get this in prior to 0.4? Thanks.
i've put a milestone target on this so it doesn't get lost. if there's a PR sooner, then I don't see why it couldn't be added to v0.4 (or perhaps even v0.4.x)
@vtjnash @stevengj
https://github.com/JuliaLang/julia/issues/9147#issuecomment-64924076
I made an experimental branch that uses calloc instead of malloc, and so far I haven't been able to detect any performance differences (all the differences are swamped by the measurement noise) on MacOS.
Is this branch still available? If so, could it form the basis of the PR?
i had looked that over briefly. i think it was based on the old GC and was more a proof of concept than a full analysis and implementation.
I'm willing to put this in 0.4 if a PR arises.
I probably don't understand all the cases well enough, but is using calloc sufficient? I thought that memory that gets GC'ed will also need reinitialization. Stating the obvious, but it seems like if we do this, we should do it across the board.
The change shouldn't be too bad:
- use calloc instead of malloc where necessary
- add extra zero stores to emit_new_struct
- add zero stores and memsets in a couple places in alloc.c and array.c
:+1: to doing this for 0.4, but maybe just for scalar values... I'd be concerned about large arrays, esp. when they get totally filled up immediately after getting allocated (as in the string conversion functions).
Ah, spoke too soon :sad: @sbromberger 's idea is exactly what I'd want... some way of telling Julia that the `Array{Uint8,1}(100000000)` I just allocated is "raw".
@ScottPJones, for large arrays, I couldn't measure any performance penalty for `calloc` (any difference from `malloc` was in the noise). As I understand it, modern `calloc` implementations don't actually `memset` the memory to zero; they just generate copy-on-write references to a special pre-allocated page of zeros. Do you have any data to indicate there is ever a significant penalty on modern systems?
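The zero guarantee being relied on here is easy to check directly in C. A minimal sketch: `calloc` must return zeroed memory, and on modern allocators a large request is typically served by fresh copy-on-write mappings of the kernel's zero page, which is why the benchmark above saw no measurable cost (the COW mechanism itself is not directly observable from this check).

```c
#include <assert.h>
#include <stdlib.h>

/* Scan a buffer and report whether every byte is zero. Used below to
 * confirm calloc's zero-initialization guarantee on a large block. */
int is_all_zero(const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

Allocating, say, 64 MiB with `calloc` and scanning it with `is_all_zero` should always succeed; the interesting part is that the allocation itself is nearly free until pages are first written.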
@stevengj My data was probably seriously out of date :wink: I'd benchmarked exactly this issue over 20 years ago... had serious slowdowns when doing large allocations... Keyword in what you said is "modern"... If that's going to be the case even for smaller array allocations (say for a 512 byte-64K string), I'm happy. [too much of my experience goes back to the dawn of time... it's always useful to retest your performance assumptions every few years...]
Related thought: What about calling `resize!` on a `Vector`? Currently this appends uninitialized values if needed, but if the current issue is fixed as planned I think it might make more sense for it to append zeros (on types where that is possible).
@garrison I think we need to generalize this concept (zeroed memory) across the board, so that there is no chance of uninitialized memory by default for any Julia allocation.
There are two distinct arguments here, I believe. The security one is fine, but it doesn't mean that it should be the default, just that we should have a --zero-mem flag as an additional security measure (you shouldn't use uninitialized memory anyway).
The programming-error one I don't buy. Nothing makes zero a more suitable value than any other. It's actually more dangerous, because it is ok in most cases until zero has a special meaning for a bitfield or something. You should not use uninitialized memory, period, and better to catch the usage sooner rather than later. A specific bit pattern is fine, or a random one, but keep it undefined behavior so that we can go back on this. If you zero memory then you are stuck with it forever, because programs will implicitly rely on it.
A concrete example of what I'm talking about: say you have a field for which zero is a special value; you change its meaning at some point and try to refactor. Looking for every assignment to this field, you will miss every place where it's implicitly zeroed.
I think the issue is primarily one of 1) desired default behavior and 2) performance. For the former, the security issue is paramount: memory should be initialized by default (with perhaps some option for "raw/unsafe" memory if you want to guarantee that you're going to initialize it yourself). For 2), my understanding is that a) calloc is (practically) as fast as malloc, and b) calloc can't initialize with non-zero values. This makes a non-zero initial value incur a performance penalty with respect to the other options.
Finally, I'd note that zeroed memory is the default for most other modern scripting languages, so there's precedent for both 1) and 2).
I don't agree that "secure by default" trumps everything here. Especially if the alternative is to add a single cli flag when you run in production where sec is important (not too hard, you probably have other things to worry about). On the other hand, I think specifying the default value of memory is actively harmful since it will be relied on by everyone (rightly, since it's in the spec).
The performance argument is secondary, since the way I see it, you would actively garbage the memory only in "debug" mode, and only zero it in "secure" mode.
Removing undefined behavior is itself useful.
Other platforms, like .NET, have done this. What is their experience? Do they wish they hadn't done it? I don't know for sure but I doubt it.
Security by default is always important. Otherwise you get 30 years of source code vulnerabilities and your own Wikipedia article.
Also, you will make it difficult for any security researcher to take your language seriously, as this problem has been solved in pretty much every modern language due primarily to the realization that not providing initialized memory leads to compromised applications and systems.
I feel like I'm fighting a lost battle here but I'll try anyway.
My point is that undefined behavior is less dangerous than almost-always-ok implicit behavior, as long as you have good tools to catch it. If we keep it undefined we can trash the memory actively on debug builds to see if you rely on it, we can zero it on production outside-facing application as an additional layer of protection.
About security by default: we already don't do it! For example it would be IMO foolish to run any Julia application communicating with the outside world without --check-bounds=yes. As you surely know, an OOB access is arguably even more dangerous than uninitialized memory, since it can be used to take control of the process and not only leak secrets. To be honest, if you can't be bothered with adding a command-line option, your security problem is probably not limited to uninitialized memory.
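The "garbage mode" proposed in this thread (trashing memory on debug builds to surface use-before-initialize bugs) could be sketched as a debug allocator that poisons fresh memory with a conspicuous byte pattern. This is purely illustrative: the function name and the 0xAA pattern are invented for the example, not anything in Julia's runtime.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define POISON_BYTE 0xAA  /* illustrative debug fill pattern */

/* In a hypothetical debug build, fill freshly allocated memory with a
 * conspicuous poison byte instead of zeros, so that reads of uninitialized
 * values produce absurd numbers (huge integers, garbage floats) that stand
 * out immediately rather than silently behaving like zero. */
void *debug_alloc(size_t n)
{
    void *p = malloc(n);
    if (p != NULL)
        memset(p, POISON_BYTE, n);  /* deliberately trash the memory */
    return p;
}
```

The point of the pattern is that, unlike zero, 0xAA-filled integers and floats are almost never plausible values, so relying on uninitialized memory fails loudly in testing.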
Re: check-bounds: you fight the fights you can fight. This issue happens to be one that was identified by one of the world's foremost security experts and that has caused innumerable problems for other languages. IMO, the fact that there may be other security issues with Julia does not diminish the importance of this one, which appears to be a straightforward and correct fix; namely to do something that is expected and consistent. Right now the real-world experience is that Julia currently fails both of these tests with respect to memory allocation.*
*Expected: the failures of multiple packages due to their implicit assumption of initialized memory serve as the evidence here. Consistent: the fact that large memory allocations provide zeroed memory but smaller allocations may not demonstrates inconsistency.
Check bounds: we will never change the behavior here, which is security by default. But for perf reasons, code has an opt-in bounds-check removal that people do get wrong (see #11429). What we do here is arguably the best possible thing: if you're worried you only have to add a single CLI flag and you get guaranteed safe behavior, at the cost of performance. How is that a problem?
I'm actually arguing for the same thing: people should not rely on uninitialized memory. My argument is that zero is just as bad as uninitialized from a programming point of view (but not from a sec POV, obviously) because it's implicit! What I would like is to have the same thing as for check bounds, that is:
- an easy way to check if you relied on undefined behavior (--uninit-mem=garbage)
- an easy way to be sure that even if you did your secrets are still safe (--uninit-mem=zero)
Again, I agree with your points, what I don't agree with is that the command line arguments are too much of a bother and people won't use them in security critical places. You cannot protect people that don't think about security at all. If they do, then adding a CLI arg is a simple easy step, negligible compared to actually reviewing your code for higher level security mistakes.
@sbromberger So would you call numpy also insecure?
In [1]: from pylab import *
In [2]: empty(1)
Out[2]: array([ 9.])
IIRC, this is one major reason `empty` is much slower in PyPy than on CPython (if it hasn't been fixed recently).
(Edit: Actually, they've fixed it.)
My major concern, if memory allocation is always initialized, would be: how should I ask for an uninitialized array if I know what I'm doing? IMHO, a language that cares about performance (like Julia, I hope) should provide a way to do that. Should we add a `@nonzero` macro then? In that case, why shouldn't we just let `Array()` not initialize the memory and make `zero()` do that (like now)?
Okay, but preliminary benchmarks seem to point towards this change being free (or nearly so): https://github.com/JuliaLang/julia/issues/9147#issuecomment-99887609
If it truly is near-zero cost, then there's no need to ask for uninitialized memory. This is silly to debate without more evidence that there's a nonzero cost involved.
@mbauman OK. If the performance is nearly the same, I think I'll be fine with either then.
(And my impression of the difference comes from the PyPy issue I saw before. See the link in my updated comment above)
I think performance is not the problem here. Even if we initialize all memory my point is that it should always be explicit. i.e., remove the Array constructor for bits types and only allow zeros(), ones(), fill() ...
Again, the zero bit pattern has no meaning for non-numeric types. The fact that it often does is only more of a pitfall, not less. I'm not arguing for performance, I'm arguing for actively undefined behavior, coupled with good tools to catch it.
From a security perspective, how much of the risk is "oops, forgot to initialize this!" vs. "I just intercepted your credit card number because it was still left over in RAM." As @carnaval says, the first isn't automatically fixed by zeroing out, but the second category of problems is.
Isn't Julia always vulnerable to the second category, though? After all we have (as Keno pointed out) `unsafe_load` to read arbitrary memory locations....
@timholy The former is also a security issue as it can trigger (and has triggered) intermittent / unreproducible code crashes, which turn into a denial of service / data integrity issue when they slip past QA.
@yuyichao - yes, you CAN do it, but it should be a deliberate choice. I think that the ability to allocate uninitialized memory is a requirement. The only thing I'm arguing for is that initialized memory should be the default behavior, as it will only be through deliberate action that a developer will make an unsafe choice.
(And re numpy: yes, I consider that insecure, but fortunately, I'm not aware of any socket code or other code that requires numpy. The issue in Julia is that this is language-wide behavior.)
Honestly while the security argument is important it's not the most important thing to me here. If we wanted to tune every design decision for security, there's no end to what we'd need to do. Semantics and predictability matter more to me (note in some cases security argues for less predictability, as in hashes or timing variation).
I don't like the argument that unpredictable behavior is good because if you forget to initialize something, you deserve it. See Stefan's excellent description of this mindset: https://github.com/JuliaLang/julia/pull/10116#issuecomment-107717077
Also, please pay more attention to the prior art here. I'm trying to find an article that argues that .NET's zero initialization is a bad thing. Help me look.
@JeffBezanson this one, perhaps? It's a performance issue. (The article is also 10 years old.)
I saw that one, but (1) it only matters if you double-initialize things (and I think LLVM will be able to remove some of these), (2) I doubt it matters in practice, and (3) the article doesn't conclude that the zero default is overall a bad thing.
It's not about deserving it, it's exactly the opposite. You can only argue for predictability if the behavior makes sense. Think of it this way: using zero will actually be exactly as unpredictable to me as using anything else, because I'm not expecting it, because zero does not make any sense for my type. It's not a theoretical case: in the code I'm writing right now, I have a bitstype where having it be all zero should never happen and is bound to provoke subtle logic errors. What do you do then? You will argue that I should just initialize it; back to square one and the old-testament "victim blaming" argument.
If I wasn't explicit about what value a field should have, then I shouldn't read it (btw, exactly the behavior we have for pointer fields, where we fail early). Since we can't reserve a bit pattern for "uninitialized" (all bit patterns can be valid for a bits type), making it undefined is the only uniform choice IMO. Again, if it is not defined it is easy to have tools to check you are not relying on it, whereas if it is defined to be zero you never know whether the programmer really wanted this zero but was too lazy to be explicit, or if it is a logic error.
@carnaval
using zero will actually be exactly as unpredictable to me than using anything else, because I'm not expecting it
Then this change will have precisely zero impact on the way you do things now, right? You are still free to assume that all memory is "bad for your use" and initialize it with whatever variables you wish. There is literally no downside* to this change, and many positive implications for folks who don't have a problem with zeroed memory.
*assuming that calloc is more-or-less equivalent to malloc.
I'm concerned about the performance issues (remember, you'd need to deal with `realloc` as well, not just `calloc`), but if that were shown to be minor, I think it is much nicer (and safer) to have them initialized... but that's just IMO...
@sbromberger No because if the specification says that memory is zero, then we can never have a mode which returns garbage, it would break code.
For now, it often is garbage (for small objects as you noticed) which has been good enough for me to have it fail early. I agree it's not enough in general so we should have this debug mode. Again, I don't care what value the memory is, I just don't want it to be defined, so that people don't rely on implicit behavior making us unable to check for use-before-define errors.
I mean, we could even have a safe mode with a separate bitfield for every array to check if you are using undefined slots, for example. All of this becomes impossible as soon as you have a defined value for uninitialized memory.
@carnaval I don't understand. There's nothing preventing us from having a function that allocates memory and takes an explicit `unsafe_alloc=true` keyword. This can be defined in the specification, and if you want to use it, you can go ahead. It will not guarantee garbage (see below).
Array memory is the ONLY allocated variable that requires explicit zeroing as far as I can tell. This is an additional inconsistency.
You cannot test use-before-define based on garbage, either, since you are not guaranteed to get garbage when you allocate memory - if the allocation is large enough, it's zeroed. This is an additional inconsistency.
What an explicit zero-by-default specification does is provide coders moving from pretty much every other modern scripting language, who haven't grown up with explicit zeroing (a la C), an assurance that they won't inadvertently make a mistake with something as simple as defining an array. Even folks who are Julia experts have been tripped up by this: this thread is littered with examples from folks who are the leaders in this language.
In any case, I'm merely repeating previous points. I'll leave it to the experts for a decision. (I'm really hoping that my expectations for the language are not so divergent from the core team's.)
We're bound to cycle if you don't read what I'm saying. Or maybe I'm wrong, because I feel I'm alone on this, but I'll try one last time:
I don't care whether memory is initialized or not by default (!)
When I speak of garbage mode I'm talking about a slower mode where the runtime will explicitly fill memory with random garbage. This is a way to check for use before define. This is not implementable if your language guarantees zero memory.
In fact, you cannot implement a single use-before-def detection technique, statically or dynamically, because there are no undefined values anymore! You can never know again if the zero was expected by the programmer or not. Ever.
This makes sense only if the default value has an obvious "not initialized" meaning for the type : like a null pointer. To be honest, I feel that zero is not even that good for integers, but for other types it's just plain wrong.
One last time:
- I'm OK with zeroing memory. I don't care.
- I think we should not do it by default, because people will rely on it: it would be just as bad as having it be the defined behavior
- I'm reluctantly OK with doing it by default as long as we provide at least a garbage mode and run the test suites with it
I would like to ask a question about performance. We have understood that calloc is as fast as malloc on most systems. But what about garbage that we reuse in our GC? Don't we have to explicitly memset this? And wouldn't that cause a performance hit?
I think that we should do this:
1. zero uninitialized fields when constructing objects
2. make `Array{T}(n)` give you zeroed out memory
3. make `Array{T}(n, initialize=false)` give you uninitialized memory as a performance escape hatch

Change 1 only affects the behavior of bits fields, since we already zero non-bits fields, which then appear as `#undef` (i.e. null pointers). The current behavior of uninitialized bits fields in composite objects is just pointlessly junky:
julia> type Bar
a::Int
b::Int
Bar() = new()
end
julia> Bar()
Bar(1,3)
julia> Bar()
Bar(13057579512,4494323120)
julia> immutable Foo
a::Int
b::Int
Foo() = new()
end
julia> Foo()
Foo(13270996416,1)
There's no good performance reason for this since allocating just one object isn't performance critical. Undefined behaviors are annoying and unpredictable undefined behaviors are worse. If you have a type where having these fields be zero is ok, then you can just rely on that; if you don't, then you'll know that because your type will not work.
Changes 2 & 3 obviously don't really do that much: you can still get uninitialized memory if you ask for it, and `Array` for many types does the same thing as `zeros`. Why do it then? One reason is that it has become clear from experience that people are surprised and annoyed when `Array` gives them junk.
Another reason is that it means that `Foo` and `Array{Foo}(n)` do similar things – give you arrays of `Foo` objects with zeroed out fields. If/when default field values happen, `Array` should probably fill in default values and zero everything else, keeping things in sync. Yes, this will be slower than `calloc`, but it addresses @carnaval's issue with zero sometimes being the wrong value; and you still always have the `Array{T}(n, initialize=false)` option.
Yet another reason for change 2 is that there are types for which `Array{T}(n)` is more general than `zeros`. For example, there is currently no way to ask for a zeroed array of `Foo` or `Bar` objects – at least not without being forced to define a `zero` method for them. With this change, `Array{Foo}(10)` would do that, while `Array{Foo}(10, initialize=false)` would give you the current behavior.
@StefanKarpinski this doesn't address my issue in general.
I'm not talking about implementation or user interface here, but a much more fundamental point : we should never assume that a bit pattern is valid for a user type. Be it zero or something else.
My problem is the following: do you have or not, in the specs of the language (not in the implementation), a concept of "undefined (bits) value"?
If yes, then you have to provide tools to catch use-after-def. It's fine. I would argue for this. I did not say it should be easy to construct an array of uninitialized value, this is an UI problem, I don't care.
If not, then you have to have the user be specific about what the default value is. Getting a no method error for zero() is perfectly fine ! In fact it's the only sane way to go, because you cannot assume that the 0* bit pattern is the same as the conceptual zero value for your type (or however your decide to name your "default" initialized value, it could be a no-argument ctor call, again, UI, not my point).
Whether or not Array() defaults to what zeros() does, or under another name, is irrelevant. My point is, if your language allows the concept of undefined values, that is values you did not explicitely provided the code to initialize, then the value should actually be undefined (in the spec, in practice it could be whatever you want, garbage, zero, ...) and not some hand picked bit pattern.
I really don't see why saying "we zero the memory" is such a problem. This is being specific about values – your data will be set to whatever value is represented by zeroed out memory. It is not like we're allergic to working directly with the in-memory representation of things around here. Whether this value is useful to you – or even a valid value – depends on your type. If not, then initialize with something else.
Well then, what is the point of zeroing? It's just as bad as random garbage, except for integers and floats. We wouldn't have this discussion if, e.g., (float)0x0 were NaN, because you would find it ugly that Array(Float64,10) spat out a bunch of NaNs.
Again, having it be the defined behavior means that you can never have any kind of tool to catch errors where you use the zero object when you did not want to.
Why not make it explicit? If you want zeroed memory then just ask for it: `zeros`, that's what it does. Again, Array() could default to zeros; I just want to have to be explicit about what I consider to be the default of my type, if you insist on always initializing it to something.
> Well then, what is the point of zeroing? It's just as bad as random garbage, except for integers and floats. We wouldn't have this discussion if, e.g., (float)0x0 were NaN, because you would find it ugly that Array(Float64,10) spat out a bunch of NaNs.
It would be fine if `(float)0x0` was `NaN` – then uninitialized memory would poison computations.
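The poisoning behavior described here can already be seen with NaN today; any arithmetic that touches a NaN-filled slot taints the result:

```julia
# NaN propagates through arithmetic, so a NaN-filled buffer makes
# use-before-initialization visible in the output rather than silently wrong.
buf = fill(NaN, 4)   # stand-in for "uninitialized" Float64 memory
buf[1] = 1.0         # initialize only the first slot
total = sum(buf)     # the three untouched slots poison the sum
isnan(total)         # true
```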
> Again, having it be the defined behavior means that you can never have any kind of tool to catch errors where you use the zero object when you did not want to.
I'm ok with that. Apparently so is the entire .NET platform.
You're making my point! If it was undefined behavior we could actually have a mode where every uninitialized Float array would be NaN! What if (float)0x0 was 3.2e12? Would you still argue for defining uninitialized as zero?
I don't care about the .NET platform. I'd prefer a clear argument as to why zero makes a good default value for bits.
Sure, but geez, that's a weird number format. The point about .NET is that there's a whole huge programming ecosystem out there where no one seems to miss that kind of tool. You can't just argue these things purely from first principles – experience in the real world is necessary and .NET gives us a proxy for that in absence of our own experience. Programming language design is a subfield of psychology, not mathematics.
I agree that zero is fine in most cases, because as a convention we use it as a good binary representation for a decent default in many formats. That's why I don't like it: it's too subtle when that's no longer true.
They don't miss the tool because they can never have it. Instead they are careful about initializing their values if zero is not adequate for their value type.
I feel that most of the resistance here is because of how easy it is to get uninitialized values, not because of the fact that they are undefined. I'd be perfectly fine to have Array() default to zeros().
What I find weird is that if I define a new type, it breaks the abstraction that I can only ever encounter in the wild something which has been through my constructor. So before, if I had something of type A, it was either undefined behavior, which I can catch with tools, OR a valid instance of A (that has all the invariants I specified in the various construction places). Now it's going to be either a valid A or an A filled with zeros, and I have no way of making sure I never end up in the latter case.
@carnaval Correct me if I'm wrong, but is it (one of) your concerns that `Array{T}(n)` can give you a defined value that is not allowed by the constructor of `T`?
Edit: and it would be hard to catch later.
@yuyichao Essentially yes. If you have a T, it should either be a valid T or a bug to use it – not a valid T or a zero T that has no meaning, depending on T. (From a specification POV; the implementation is free to zero the memory, or give uninitialized garbage, or generated garbage, or a string of digits of pi.)
I'm convinced (which doesn't count for much ;P) that zero initialization is as bad as random initialization (and maybe worse, in the sense that problems caused by it can be harder to find....).
Somehow this reminds me of the C++ situation: every variable is initialized by the constructor, and then a few concepts (and type_traits) are introduced so that copying of arrays of simple types can be done with memcpy....
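Julia's analog of those C++ type traits is `isbitstype` (written in current Julia syntax here), which is what decides whether an array of `T` is stored flat in memory and can be block-copied:

```julia
# isbitstype(T) plays the role of C++'s is_trivially_copyable here:
# bits types are stored inline in arrays and can be copied as raw bytes.
struct Point          # plain bits: two Float64 fields, no references
    x::Float64
    y::Float64
end

isbitstype(Point)        # true  – Array{Point} is flat memory
isbitstype(Vector{Int})  # false – elements are heap references
```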
> a string of digits of pi
And I like this. But maybe binary representation. =)
wait - just so I understand - is the issue that if I define a type, say, `NegativeInteger`, that (obviously) cannot take a zero, then if we go this route, an array of these things will be immediately invalid upon instantiation?
This I sort of understand....
@sbromberger At least that is how I understand it. And I guess @carnaval has a real type with this behavior.
@yuyichao: why isn't this ok, then?
```julia
julia> type Bar
           x::Vector{Int}
       end

julia> x = Array{Bar}(3,3)
3x3 Array{Bar,2}:
 #undef  #undef  #undef
 #undef  #undef  #undef
 #undef  #undef  #undef
```

(That is: I'd likely be ok with all uninitialized memory returning `#undef`, especially since

```julia
julia> x[3,3]
ERROR: UndefRefError: access to undefined reference
 in getindex at array.jl:299
```

)
(BTW you can just use `Array{Any}(1)`)
So IMHO, the issue can be phrased as: what should `Array{T}(1)[1]` be if `T` is a user-defined bits type?
The few options proposed (AFAIK) are:
- access error (`#undef`)
- zero-filled structure
- `zero(T)`, `T()`, or some other constructor
- undefined behavior
The ways to judge them are:
- It should be possible/easy to implement.
- It should be possible to make sure the users are not using that value by mistake. This can be done either by making it never a mistake, or by making it always a mistake and catching it with verifying tools.
- For the first solution (`#undef`), there's probably no easy way to implement it (unless we keep track of whether each element has been assigned).
- For the second solution (zero-filled), it makes using that value not a mistake most of the time, but it is probably impossible to catch the few cases left in which it is a mistake (e.g. your `NegativeInteger`).
- For the third solution, it should be possible to implement, and it is never an error to use the value. I think this is what @carnaval meant by "I'd be perfectly fine to have Array() default to zeros()."
- The fourth solution is the current one: it makes using that value always a mistake, and one can use other tools/options to catch those mistakes (e.g. by getting the behavior of (1)).
A possibly interesting solution would be 3 (fill with `zero(T)`), but with a fast path for the majority of bits types for which passing zeroed memory has the same effect. This could be implemented using Tim Holy's Traits Trick. Types for which `zero` does not make sense would raise a no-method (or friendlier) error and force you to explicitly ask for uninitialized memory, or to pass an object to fill the array with.
The advantages of this solution are:
- `zero()` can be different from zeroed memory if needed
- if `zero()` does not make sense, you get an error
- places where you get uninitialized arrays which do not make sense until filled are clearly visible, and can be stress-tested in debugging mode by setting `--uninit-mem=garbage`.
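A rough sketch of how the fast-path/slow-path dispatch might look, written in current Julia syntax; the trait and function names (`ZeroedMemoryOK`, `NeedsExplicitZero`, `zero_trait`, `zeroed`) are invented for illustration, not an actual Base mechanism:

```julia
struct ZeroedMemoryOK end      # zero(T) has the same bits as zeroed memory
struct NeedsExplicitZero end   # zero(T) must be constructed per element

# Trait function in the spirit of Tim Holy's Traits Trick; this mapping
# is illustrative, not exhaustive.
zero_trait(::Type{<:Union{Int,Float64}}) = ZeroedMemoryOK()
zero_trait(::Type) = NeedsExplicitZero()

# Fast path: a calloc-style zeroed allocation is already correct.
zeroed(::ZeroedMemoryOK, ::Type{T}, n) where {T} = zeros(T, n)
# Slow path: fill explicitly; raises MethodError if zero(T) is undefined.
zeroed(::NeedsExplicitZero, ::Type{T}, n) where {T} = fill(zero(T), n)

zeroed(::Type{T}, n) where {T} = zeroed(zero_trait(T), T, n)

zeroed(Int, 3)            # [0, 0, 0] via the fast path
zeroed(Rational{Int}, 2)  # [0//1, 0//1] via zero(Rational{Int})
```

A type with no `zero` method falls through to the slow path and errors there, which is exactly the "forced to be explicit" behavior proposed above.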
One way in which zeroing out bits could be subtly problematic, IIUC, is this: say you have some code which does `Array{SomeType}(3)`, where `SomeType` is immutable and comes from a library of some sort. Say that, up to some point, `SomeType` has a `Float64` field; then a library upgrade changes that to `Rational{Int}`. Now suddenly your array needs initialization in order to even make sense, while previously it didn't, and code breaks.
More generally, whenever a field gets added which makes it mandatory to pass through the constructor, one has this kind of issue. But it could also be a problem if the internal fields change meaning, are rearranged, or change default value, etc.
Naturally, if the internals of `SomeType` were documented, then the field change would require an API bump, and code depending on it should be adapted to the new API. If the internals were undocumented, then the depending code was wrong to rely on zeroed-out initialization in the first place. In any case, whether "uninitialized" is ok or not becomes part of the API.
So in principle everything is fine. In practice, I'm not so sure.
If implementing this, @nalimilan's last proposal (if feasible) is the way to go IMO.
I'm trying to find an article that argues that .NET's zero initialization is a bad thing. Help me look.
He doesn't have a blog or anything that I'm aware of, but my dad's been using .NET for many many years so I just sent him an email and asked his opinion.
FWIW, I agree with a lot of these points: having zeroed memory guaranteed (even with an escape clause) doesn't always make sense, can lead to subtle bugs, and can have some (if small) effect on performance. But just from a user-friendliness standpoint, when somebody allocates a new array as a buffer and then sees that it is already full of leftover junk, that is not nice, and makes debugging more difficult (although `zero` is not the best fill for debugging either... `0xDEADBEEF` and the like are much better).
@yuyichao's `#undef` proposal, although conceptually very nice, doesn't work for bits types, or for collections of them, unless you allocate a defined bit (or a whole bunch of them, and then when things are reshaped... it all falls apart).
I like @StefanKarpinski's idea to have some syntax for initialized vs. uninitialized (to `zero()` or whatever), but I don't think that's the right place to do it. I think it should only be available in an inner constructor of the type (and I don't know what the best syntax should be... currently there is the special function `new()`; maybe some flag on `new` would allow 1) it to be handled at the low level, so as not to hurt performance, and 2) different values to be used, not necessarily constants).
So, if you don't care about the value but don't want random leftover garbage, `new()` might, depending on a build or runtime flag, set the memory to all zeros, a random value, or debugging markers such as `0xDEADBEEF`; but if you do know you want memory initialized (to zero or some other correct value for that type... maybe `NaN` for floats), that can be done as efficiently as possible, and can become part of the contract of the type.
I think some of @nalimilan's good ideas can work in conjunction with this as well.
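As a sketch of the debug-marker idea above (the flag-controlled behavior of `new()` is hypothetical; this only demonstrates the marker pattern itself, and `debug_buffer` is an invented helper):

```julia
# Poison a fresh buffer with a recognizable marker such as 0xDEADBEEF so
# that any slot that was never properly initialized stands out on inspection.
debug_buffer(n) = fill(0xDEADBEEF, n)

buf = debug_buffer(4)
buf[1] = 0x00000001                          # the one slot we initialized
leftover = count(x -> x == 0xDEADBEEF, buf)  # 3 slots still carry the marker
```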
I've given this quite a bit of thought since last night's revelation. The core issue for me is this: uninitialized memory can cause unintended behavior that can in turn cause catastrophic* code failure.
Because of the discussion and great explanation by @yuyichao (thanks, btw), I'm now modifying / softening my stance. As long as we can detect the use - both during compilation and in runtime - of uninitialized memory, and error out, then that's what I'm really seeking.
*by "catastrophic code failure", I mean something more than just an abend - it is entirely possible that the random values assigned to uninitialized memory are close enough to expected values that errors will propagate silently, leading to sinister corruption of downstream data/processes. This is MUCH worse than an abend / exception, since 1) it's not guaranteed to be reproducible, and 2) it's really difficult to track down.
Thanks for the discussion. I look forward to reading some more interesting approaches.
@carlobaldassi Great point for `Rational`:

```julia
julia> Array{Rational{Int}}(10)[1]
0//0
```
> Because of the discussion and great explanation by @yuyichao (thanks, btw), I'm now modifying / softening my stance. As long as we can detect the use - both during compilation and in runtime - of uninitialized memory, and error out, then that's what I'm really seeking.
@sbromberger Is it OK then if this check is not on by default? I guess it should be possible to make this an option in julia or a julia debugger (although I'm clearly not the right person to write it......)
@yuyichao - no, I'm still of the strong opinion that this needs to be a default behavior so that new coders and others don't get bitten by this.
It seems to me that this discussion is mixing multiple things:
- a security request that memory allocated not contain data from another process
- should Julia objects be initialized by default
- if so, to what value
- cost
- The security issue can't be solved by Julia so long as C can violate it, since Julia can call arbitrary C code that can examine (malloc/uninitialized stack/mmap) memory for as many credit card details as it likes. If C gets fixed (by the OS), I would expect Julia to also get fixed automatically as a result.
So I don't think the security issue is currently fixable whatever Julia does, and so shouldn't impact the rest of the discussion.
- To be safe, objects do not need to be initialized by default, only before first use; and as @sbromberger says, random garbage (or zero) can be "close enough" to an expected value that it's hard to detect that it is uninitialized, but it still sends your rocket to Saturn, not Mars.
So initialization is needed – but by default?
- For initialization by default, there is no universal initialization value that is legal for all types; and even if there were, if use of the default-initialized object occurs, there is no way of preventing it from sending the rocket to Saturn if the default value is again "close enough" to an expected initial value.
So default initialization cannot guarantee a useful object (or even a legal one, unless it is a type-dependent default), so it is of limited value. And if default initialization has limited value, it should not cost much, or it isn't worth implementing.
- Much of the discussion here concentrates on the use of zero bits as the default because it's known to be cheap (or free if the OS already does it). Aside from some possible optimisations, all other default values cost as much as normal explicit initialization, but as noted above do not help the program do the right thing.
So the conclusion I draw is that initializing objects to a default value may not have a sensible cost/benefit ratio for Julia for anything other than a zero-bits value.
Instead, all memory should be initialized explicitly before use, and attempts to use uninitialized memory should be detected by the compiler.
It is my understanding that the compiler cannot make that guarantee at the moment.
Default initialization is not a replacement for that; it does not provide any guarantees of correctness or of immediate failure, and @carnaval makes the good (if lonely) point that specifying a default value prevents the use of memory fuzzing to help detect "use before init" errors, which is a useful tool since the compiler does not guarantee to catch all misuse.
It would instead be useful to put the effort into improving the compiler's support for simplifying explicit initialization and for detecting "use before init" situations.