atomspace
Rename FloatValue to FloatSeq?
ProtoAtoms such as FloatValue, LinkValue, etc. actually are sequences. For instance, FloatValue is std::vector<double>. Why not rename them FloatSeq, etc.?
It would be more consistent with their meanings, and leave more room for other names the day someone wants to introduce single values as opposed to sequences.
Also, maybe LinkValue should be renamed AtomSeq.
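To make the naming issue concrete, here is a minimal sketch of the shape being described. The class name FloatSeqSketch and its methods are hypothetical, made up for illustration; this is not the actual AtomSpace class, only a stand-in whose payload is a std::vector<double>.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the class under discussion; NOT the real
// AtomSpace header. The only point is that the payload is a sequence.
class FloatSeqSketch {
public:
    // General case: any number of floats.
    explicit FloatSeqSketch(std::vector<double> values)
        : _values(std::move(values)) {}

    // A single number is stored as a sequence of length one.
    explicit FloatSeqSketch(double single) : _values{single} {}

    // Read-only access to the whole sequence.
    const std::vector<double>& value() const { return _values; }

    std::size_t size() const { return _values.size(); }

    // Bounds-checked (in debug builds) access to one element.
    double at(std::size_t i) const {
        assert(i < _values.size());
        return _values[i];
    }

private:
    std::vector<double> _values;
};
```

With a shape like this, a two-float truth value and a one-float value are the same type; only the length differs, which is what the proposed Seq suffix would advertise.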
Good idea, I was expecting FloatValue to hold one value when I saw it in the code.
As far as I understand, it is named so because it is a parent class of TruthValue, and TruthValue in turn is actually 2 floats. I agree that the name is confusing, but I think we had better change the implementation instead of renaming.
Thus FloatValue should contain a single float. TruthValue should contain two float fields instead of a vector. And they should not inherit from each other. And so on. I am not sure what the impact on compatibility is, though.
@vsbogd SimpleTruthValue has 2 values but other truth value types such as CountTruthValue have different numbers of values; this is why TruthValue inherits from FloatValue (not saying it is the best design, I simply don't know, but just to explain why it inherits from FloatValue as opposed to, say, FloatPair).
@ngeiswei, yes, sure, there are other values in this hierarchy which inherit FloatValue and for some of them using vector as internal representation makes sense.
What I am trying to say in other words:
- in my opinion TruthValue is not a FloatSequence, so simple renaming will introduce another sort of confusion
- we still need FloatValue to represent a single float instead of a vector
@vsbogd agreed, for instance generalized distributional truth values are not FloatSeq (more like float counter, or something).
Also agreed about having FloatValue eventually representing single floating number rather than vector, just need to define it strictly after this first round of renaming to not create lots of confusion.
Is anybody working on this? If not I might do it as I could use a FloatValue that's just a single float.
Let's wait for @linas' feedback.
And let me stress again that turning FloatValue into a single value will have to be in a separate PR, doing otherwise would be way too DANGEROUS!
Also beware that it might affect the opencog repo...
Hi, Sorry for late reply.
1. Regarding the renaming: Now that you know it's actually a sequence, what's the problem? Are you afraid that you'll wake up tomorrow morning, having forgotten that it's a sequence?
2. Regarding "single" values: what's wrong with a sequence of length one? Does something get easier or harder, more efficient or less efficient, by introducing "single" values?
3. Explanation of why it's a sequence: the primary, number-one driver of the value design is compute-efficiency: minimizing cpu cycles, minimizing ram usage, and for storage to disk (i.e. to sql), minimizing size. Second is implementation complexity (although, it is usually the case that small, fast things are also not complex). A distant third is human-usability. Again: the goal here is to make it easy for algorithms (not humans) to access and manipulate data. Since algorithms are stupid (algorithms don't have general intelligence), this is best accomplished by keeping data structures and APIs as simple as possible. I'm trying to avoid a system with lots of frills and features.
3a) Efficiency: given the above, if you need to store more than one float (for truth values, attention values) then how can you do this efficiently? Try to write down every possible way of doing this, and then try to count CPU cycles and RAM usage. When I did this, the answer that came back was "use c++ std::vector". If there is a better way, then it's outside of what I can currently imagine.
4. Naming of LinkValue: Yeah. That was hard. The most complicated part of 3) above was here: should values be JSON-like? Should there be a difference between sets and arrays? Should there be linked lists? Should there be IntValues and CharValues? The JS in JSON stands for JavaScript, and the atomspace really is a whole lot like JavaScript. A huge amount like JavaScript. If you don't know JS, you should take a month and code up some non-trivial app in it. You'll see what I mean.
4a) Clearly, there was a need for arrays-of-values. Or arrays of atoms. Or something. What's the difference between an array of atoms and an array of values? Ugh. So what possible candidate names are there? I listed all of those, and they were all ugly. So LinkValue was the least odious of them.
Anyway: to continue with point 2) -- re single-value float-values: again, what's the point? What's the problem? Why should you even care? If you want to store a single float, you can. Nothing is stopping you. Perhaps you think that it's more efficient to not use std::vector<double>? Barely, hardly. Try it, measure it. A microscopic amount of the performance goes there; most of it is lost in the python/guile bindings, most of the rest in the atomspace, and then a decent chunk to look up value-by-key. If you get rid of std::vector<double> then you need to write a bunch of special-case code in the SQL backend, to handle each and every different thing you replace it with. Yuck. Likewise for whatever RESTful interface, or whatever zeromq interface, or whatever you come up with. The more data formats there are, the more code you have to write, the more bugs you introduce, the more unit-tests to catch them... yuck.
Under the covers, the vector can store a single float, and to you, the user, why should it matter what happens under the covers? why should you care? How much time do you spend examining the implementation details of GCC? Worrying about how they implemented things? As long as GCC is "good enough", and "does the right thing", you mostly don't care about the assembly code falling out the back of it. Same here: you can represent a single float just fine, why should you care about how it happened, under the covers?
The biggest problem with the word Value is that it makes talking about "ordinary" values hard. So that is a really icky side-effect of this naming choice. I kind of hated it, but could not think of anything better. I really wanted the user to think of the following concepts:
- https://en.wikipedia.org/wiki/Valuation_(logic)
- https://en.wikipedia.org/wiki/Valuation_(algebra)
- https://en.wikipedia.org/wiki/Valuation_(measure_theory)
All three of those are kind-of the same thing. So, FloatValue is actually a "real vector-valued non-positive measure" but that's a mouthful. I'm not sure which wikipedia article describes that, but https://en.wikipedia.org/wiki/POVM comes close: measures can be operators, not just vectors. Well, and FloatValue can store other things, not just measures: it doesn't obey all of the axioms of a measure, just some of them.
We could rename FloatValue to VectorMeasure... but that's also icky, because while the mathematicians (who are few in number) would "get it", the programmers (the vast majority) would say "what the heck is that"? Naming things is ... hard. I want to strike a balance between what ordinary programmers encounter, and what is needed on the theoretical side.
@linas, I'm personally neutral on single value vs singleton vector, just consider that a single value takes at most 8 bytes, while an empty vector already takes 16 bytes. I would agree though that we don't need to worry about that before it becomes an actual problem.
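The exact byte counts depend on the platform and the standard library, so they are worth measuring rather than assuming. Below is a minimal sketch of how one might check; the helper names (vector_header_bytes, payload_bytes, singleton_vector_bytes) are made up for illustration, and the estimate deliberately ignores allocator bookkeeping and spare capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Size of the vector object itself (its bookkeeping), excluding the
// heap buffer that holds the elements. Implementation-specific.
constexpr std::size_t vector_header_bytes = sizeof(std::vector<double>);

// Size of one bare double, for comparison.
constexpr std::size_t single_double_bytes = sizeof(double);

// Rough heap payload for a vector currently holding v.size() doubles.
// Assumption: ignores allocator overhead and any unused capacity.
inline std::size_t payload_bytes(const std::vector<double>& v) {
    return v.size() * sizeof(double);
}

// Estimated footprint of storing exactly one double inside a vector.
inline std::size_t singleton_vector_bytes() {
    std::vector<double> v{4.0};
    assert(v.size() == 1);
    return vector_header_bytes + payload_bytes(v);
}
```

Printing vector_header_bytes and singleton_vector_bytes() on the target machine gives the real numbers to compare against the rest of an atom's footprint.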
However
- We need to allow room for single value to happen if we ever need them (maybe once the pattern matcher supports boolean predicates we'll want that, not sure, just an example).
- The names FloatValue and LinkValue are misleading, I think even for the math literate; they may resolve the apparent inconsistency quicker, but still. It's not just me saying that, I'm regularly asked about them...
I'm obviously open to other names than FloatSeq, etc., like FloatValues or such.
Regarding renaming LinkValue (I still don't understand why Link was used), it's actually a vector of proto atoms, not regular Atoms, so AtomSeq or such would be misleading too; maybe it could be ProtoAtomSeq or ProtoAtomValues.
Obviously one may ask, if TruthValue then why not FloatValue?
My answer is: the notion of truth is rather abstract, it can be a boolean to an infinite order probability, nobody really knows (even Sam Harris and Jordan Peterson disagree on its definition). The notion of float however is rather concrete...
But yeah, these objects are generally called values, as opposed to atoms (though they can be as well), so maybe we can rename FloatValue to FloatSeqValue or FloatsValue.
Linas: "I'm trying to avoid a system with lots of frills and features."
I think this is reason enough not to have a special case for a single-value implementation. Yes, there's a waste of RAM when you need to keep a single float in the vector, but trying to optimize something without knowing it's an actual performance problem is an anti-pattern.
Linas: "Now that you know it's actually a sequence, what's the problem? Are you afraid that you'll wake up tomorrow morning, having forgotten that it's a sequence?"
IMO if the name confused an experienced developer it'll probably confuse newcomers as well. I see no reason to keep bad naming here.
@ngeiswei Another name suggestion - FloatArray or FloatVec
It can be useful to have both FloatValue and FloatSequenceValue, because if there are many FloatValue instances each containing a single float, it is inefficient to keep them in vectors. Another advantage is that we can have a FloatValue.value() method returning a single float value instead of a collection, so we don't need to take the first element of the collection as an extra step. And I also think it will make the API less surprising.
Yes, it will require additional code to serialize each representation to use it in SQL, REST API or zeromq. But each new value type will require it. One needs to write serialization/deserialization code for the type once and reuse it in all APIs which need to transmit values.
OK, FloatArray or FloatArrayValue is an OK name, I guess. The second name reminds you that it's not an atom. The distinction between atoms and values is very important -- if users have any confusion about that, things go down-hill very quickly.
The problem with the name "vector" is that comp-sci/programming vectors are not actually 'vectors' in the mathematical sense; they fail to obey the axioms for a vector.
@vsbogd there are no users currently, or foreseen, for single floats. Truth values are all 2, 3, or 4 floats. Attention values are 4 floats. All of the various ad-hoc values in the language learning code are 3 or 4 or 5 floats, for example, count, normalized-count (probability), log-probability and p log p. It's useful to store all of these at once; it avoids repeated calculations (e.g. log p might be needed thousands of times).
The space-savings of not using an std::vector is minuscule: as @ngeiswei pointed out, it's 16 bytes, which is about 1% of the total atom size, and maybe half of that if one also counts the values. The performance overhead is also trivial: one can do 100 million std::vector accesses a second; one can do 20K python/scheme accesses a second - the performance impact is 1/100th of one percent.
Worse, if you decide to store two distinct floats, instead of an array of length two, the overhead is huge: storing one value takes something like 150 or 200 bytes, because the key-value system uses an std::map<smart-ptr, smart-ptr> and each smart-pointer is maybe 40 or 50 bytes, and the std-map entry itself (a rb-tree) is another 40 or 50 bytes, and the value itself is 20 or 40 or more bytes. So storing two distinct floats instead of an array uses about 10x or 20x more storage. There's just no way to win storing single floats.
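The two layouts being compared can be sketched as follows. This is only an illustration under stated assumptions: KeySketch, ValueSketch and the two store_* helpers are hypothetical names, std::shared_ptr stands in for the "smart-ptr" mentioned above, and the sketch counts map entries rather than measuring bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <vector>

// Hypothetical stand-ins; the real key and value classes are richer.
struct KeySketch { int id; };
struct ValueSketch { std::vector<double> floats; };

using KeyPtr = std::shared_ptr<KeySketch>;
using ValuePtr = std::shared_ptr<ValueSketch>;
using ValueMap = std::map<KeyPtr, ValuePtr>;  // keys ordered by pointer

// Layout A: one key, one value holding both floats.
inline ValueMap store_as_array(double a, double b) {
    ValueMap m;
    m[std::make_shared<KeySketch>(KeySketch{0})] =
        std::make_shared<ValueSketch>(ValueSketch{{a, b}});
    return m;
}

// Layout B: two keys, each value holding a single float.
// Every extra entry pays for a tree node, two smart pointers,
// a key object and a value object.
inline ValueMap store_separately(double a, double b) {
    ValueMap m;
    m[std::make_shared<KeySketch>(KeySketch{0})] =
        std::make_shared<ValueSketch>(ValueSketch{{a}});
    m[std::make_shared<KeySketch>(KeySketch{1})] =
        std::make_shared<ValueSketch>(ValueSketch{{b}});
    assert(m.size() == 2);
    return m;
}
```

Layout A adds 8 bytes of payload for the second float; layout B adds a whole second map entry, which is where the per-entry overhead described above comes from.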
This analysis is what I meant when I said "think of all the other ways you could implement this". All the other ways I could think of were losers; this way seemed like the best way. If there's a better way, that would be great, but I don't know what that better way is.
... or FloatSeqValue is OK, I guess. I'm just not very concerned about the newbie not realizing that this is an array, instead of a single value. By contrast, I am very very very concerned that the newbie has a very clear idea of the difference between atoms and values, and the fact that they have very different performance profiles, and very different mutability and associativity properties. By the time they figure this out, the fact that a FloatValue is actually an array will be kind-of a trivial fact. This is not the hard part ...
I would vote for FloatSeqValue as well, because Seq tends to be used in the code to denote a list or a vector. And similarly I would go with ProtoAtomSeqValue.
@rTreutlein, I'm happy to have you do the change (FloatValue -> FloatSeqValue, etc.), however I do agree that, as of today, introducing a single value doesn't seem necessary. Could you describe the reason you want to introduce single values?
@ngeiswei I am using single values for the distributional values but I can just use a [0] when accessing the value. It just seems less clean than if I knew that there could only be 1 number in the FloatValue.
I can still work on the renaming, but if it is just that then I don't mind if somebody else wants to do it.
Maybe for now work with a singleton vector, the representation for a DV might change anyway and we can revisit this sort of things when it appears that introducing a single value provides a practical benefit.
Adding DistributionalValue type can make finding all usages of DV value simpler in future.
"Adding DistributionalValue type can make finding all usages of DV value simpler in future."
I think we'll eventually want to do that, yeah. AtomSpace-MOSES also needs some form of distributional value.
There will be a DistributionalValue Type and, the way it is right now, also a ConditionalDistributionalValue Type. That's the way it currently works in my dev branch.
If we are renaming things, I would like to get rid of the term "proto-atom". It has outlived its purpose and conceptual usefulness.
Ideally, I want the new name to begin with the letter "a". can't think of one ... iota, dot, jot, speck ... infinitesimal ...
@rTreutlein questions:
1. For dereferencing the [0], can't you just write a wrapper routine to just do that for you? For you, it's the API that matters, and not how it is implemented under-the-covers. Please distinguish these concerns (api vs. implementation) as being distinct concerns.
2. A "distributional value" in the standard sense of probability theory would be a bin-count (aka histogram, frequency count, etc.) which requires N numbers - very naturally a sequence. https://en.wikipedia.org/wiki/Histogram
One side-effect of making FloatValue a sequence was I figured it would make @ngeiswei very happy by giving him a very natural distributional truth value. i.e. a histogram. So I'm confused by how a distribution can be just a single value??
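The wrapper routine suggested in the first question could look roughly like this. It is a sketch only: FloatSeq, single() and make_single() are hypothetical names introduced here, not existing AtomSpace functions.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a float-sequence value.
struct FloatSeq {
    std::vector<double> floats;
};

// Build a length-one sequence from a single number.
inline FloatSeq make_single(double x) {
    return FloatSeq{{x}};
}

// Present a length-one sequence as a single number, so callers never
// write [0] themselves. The assert documents the expectation that
// exactly one float is stored (checked in debug builds only).
inline double single(const FloatSeq& v) {
    assert(v.floats.size() == 1 && "expected exactly one float");
    return v.floats[0];
}
```

The caller sees a single-float API (make_single, single) while the storage stays a sequence, which keeps the API concern separate from the implementation concern.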
Just to be clear:
FloatSeqValue 4 5 6
is much much more efficient than
LinkValue
FloatValue 4
FloatValue 5
FloatValue 6
It's almost criminal to use LinkValue like this: it would bloat the atomspace, it would run 4x slower, kill database performance, etc.
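A rough way to see the difference is to count the separate value objects each layout needs. The sketch below uses hypothetical stand-ins (FloatSeqSketch, LinkSketch, flat, nested, object_count) rather than the real classes, and it counts objects only; it does not measure speed or database cost.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a float-sequence value.
struct FloatSeqSketch {
    std::vector<double> floats;
};

// Hypothetical stand-in for a value that holds other values.
struct LinkSketch {
    std::vector<std::shared_ptr<FloatSeqSketch>> children;
};

// Flat layout: all numbers live in one value object.
inline FloatSeqSketch flat(const std::vector<double>& xs) {
    return FloatSeqSketch{xs};
}

// Nested layout: one container plus one child object per number.
inline LinkSketch nested(const std::vector<double>& xs) {
    LinkSketch link;
    for (double x : xs)
        link.children.push_back(
            std::make_shared<FloatSeqSketch>(FloatSeqSketch{{x}}));
    assert(link.children.size() == xs.size());
    return link;
}

// Number of separate value objects each layout requires.
inline std::size_t object_count(const FloatSeqSketch&) { return 1; }
inline std::size_t object_count(const LinkSketch& l) {
    return 1 + l.children.size();
}
```

For the three numbers 4 5 6, the flat layout is one object and the nested layout is four, each child with its own allocation and reference count.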
@linas regarding renaming proto-atom, what about Value?
In this case LinkValue would become ValueSeqValue, kinda weird but does the job IMO.
Regarding distributional value, a histogram is good but you do need to map the random variable co-domain partition to the bins. I would like the distributional value representation to be expressive enough that it can represent distributions not just over probabilities, and not just over equidistant bins, so I was thinking more like a map from anything to float.
Instead of a map it could also be 2 vectors, one FloatSeqValue and one ValueSeqValue, linked by their indices. And one could establish a convention such that if the partition vector is missing, then it represents equidistant bins over [0, 1]. This remains to be decided I guess.
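The two-vector idea with the "missing partition" convention might be sketched like this. Everything here is an assumption for illustration: DistSketch and bin_edges are hypothetical names, and the partition is simplified to numeric bin boundaries, whereas the proposal above allows a partition over arbitrary values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: weights[i] belongs to the bin
// [edges[i], edges[i+1]). Not an existing AtomSpace type.
struct DistSketch {
    std::vector<double> weights;  // one count or probability per bin
    std::vector<double> edges;    // bin boundaries; empty = use default
};

// Returns the bin boundaries, falling back to equidistant bins over
// [0, 1] when no explicit partition was supplied.
inline std::vector<double> bin_edges(const DistSketch& d) {
    if (!d.edges.empty()) {
        // An explicit partition needs one more edge than bins.
        assert(d.edges.size() == d.weights.size() + 1);
        return d.edges;
    }
    const std::size_t n = d.weights.size();
    if (n == 0) return {};  // no bins, no edges
    std::vector<double> edges(n + 1);
    for (std::size_t i = 0; i <= n; ++i)
        edges[i] = static_cast<double>(i) / static_cast<double>(n);
    return edges;
}
```

Linking the two vectors by index keeps each of them a plain sequence, at the cost of having to keep their lengths consistent by convention.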