`fromList` is memory inefficient
The underlying arrays created by fromList can be much larger than the length of the list.
This is because fromList works by doubling the size of the underlying array when it runs out of space and then at the end it does an unsafeFreeze. So you get up to half the length of the list in excess unused space.
I would like to avoid this unnecessary memory usage. Here's a couple of ideas of how it could be done. I'd be happy to try to implement something.
-
Make
fromList xs = fromListN (length xs) xsThis is how it's implemented byData.Primitive.Arrayfrom primitive. The trade-off is that the current implementation consumes the list in a streaming manner, which means that we don't need to hold on to List constructors, whereas this would force the list before the vector can start to be filled. -
Shrink the underlying array at the end As far as I can tell there is no shrink primop defined for
MutableArrayso this would have to be implemented in terms of copying the array.
- Make
fromList xs = fromListN (length xs) xsThis is how it's implemented byData.Primitive.Arrayfrom primitive. The trade-off is that the current implementation consumes the list in a streaming manner, which means that we don't need to hold on to List constructors, whereas this would force the list before the vector can start to be filled.
Computing length xs means the list has to be traversed twice (which despite obviously being slower also probably breaks list fusion). You can always manually do fromListN (length xs) xs (it's not much more to write, often you even know the length in advance, so you can use fromListN n xs).
- Shrink the underlying array at the end As far as I can tell there is no shrink primop defined for
MutableArrayso this would have to be implemented in terms of copying the array.
Copying a large array is also quite expensive.
Both of these solutions have problems that make them not suited to be the default imo. I think for most people, the extra memory usage isn't a problem anyway (especially when they're not dealing with large arrays). There are also several options to create a Vector of a fixed length without extra memory usage, e.g. replicate, generate, iterateN, ...
Growing arrays in other languages also use the same principle of increasing the array size by some factor when the capacity is used up, although that factor doesn't have to be 2, many also use 1.5 afaik. So one possible mitigation would be to only increase the array by half the length of the current one.
There is no perfect implementation for fromList, all of them have at least one downside, so if the default implementation doesn't work for your particular situation you can always do a workaround:
fromListN (length xs) xswill solve the problem with allocating more memory at the expense of performance. @konsumlamm explained quite well this point (fusion is broken, double list iteration etc.)force (fromList xs), will make a copy of the vector without the extra redundant space at the end, thus allowing GC to cleanup the larger vector.
If anyone would like to submit a PR with improved documentation for fromList that would be awesome.
Hi, I'd like to improve the fromList documentation on the basis of your comments if no one else is working on it :)
@AlexMaryW that would be very appreciated! Please note there's 6 fromLists: generic one and 5 specializations. I wish we had better way of dealing with copy-paste
Fixed by #500