hledger
digit group separator parsed as decimal point
dj_ryan reports that the amount decimal point is being misparsed, contrary to http://hledger.org/manual.html#amounts . An example in shelltest format:
1/1
(a) $1,000.00 ; first $ amount, clear that . is decimal point
1/2
(a) $1,420 ; hledger should know that , is digit group separator
$ hledger -f - print
2018/01/01
(a) $1,000.000
2018/01/02
(a) $1.420 ; wrong
Adding a commodity directive does fix it, as the manual suggests it might:
commodity $1,000.00
1/1
(a) $1,000.00
1/2
(a) $1,420
$ hledger -f - print
2018/01/01
(a) $1,000.00
2018/01/02
(a) $1,420.00
but it shouldn't have been necessary in this case; with no directive, the first posting amount should set the style.
I don't think this behaviour was introduced with the last release. Note that the manual explicitly says:
However, there is some ambiguous way of representing numbers like $1.000 and $1,000 both may mean either one thousand or one dollar. By default hledger will assume that this is sole delimiter is used only for decimals.
So this behavior matches the manual.
Consider counter-example:
1/2
(a) $1,420 ; hledger doesn't know yet that we plan to use , as digit group separator
1/1
(a) $1,000.00 ; second $ amount, clear that . is decimal point
We parse $1,420 first, before $1,000.00, and treat it as before.
As I said in the comments to #487, we could make a two-pass parser. But you cannot resolve a case like this:
2016/1/1
(a) $-10.00
2017/1/1
(a) $-10,00
There is no guarantee which one should be treated as the source of information about formatting. Thus an explicit formatting specification, which has higher priority and affects the rest of the journal (up to reporting), makes more sense here, I think.
We could, though, add more strictness and reject journals where ambiguity is present. But that would significantly break compatibility with LedgerCLI. I don't mind this, but I'm not sure we want to introduce it, given the user base we have.
I agree, not a regression. I thought older versions did better on this example, but no.
I'll read your comment in detail when I have more time to think on this. This pesky somewhat related duo of controlling the number of decimal digits, and controlling the decimal point/digit grouping characters, is our current biggest usability issue I think. Would love to be really done with these by 1.6.
The problem is that the current behavior is already broken with respect to LedgerCLI: it treats things like $1,420 as one dollar and 42 cents, not one thousand four hundred and twenty as LedgerCLI does.
If hledger failed and suggested in its error message to use 'commodity' directives to resolve it, that might be more helpful than silently misinterpreting my ledger entries.
@ryanobjc, if you write $1,42 I'm pretty sure that LedgerCLI will treat it as one dollar and 42 cents.
From what I can see, LedgerCLI has a heuristic which will probably be hard to describe in the manual. It is also more biased toward some nation-specific expectations.
zsh% ledger bal --no-pager -f -
1/1
(a) $1,000
1/2
(a) $3.1415
; end
$1,003.1415 a
zsh% ledger bal --no-pager -f -
1/1
(a) $1.000
1/2
(a) $3,1415
; end
$4,1415 a
zsh% ledger bal --no-pager -f -
1/1
(a) $1,0000
1/2
(a) $3.145
; end
$3.146,0000 a
As you can see, as long as there are 3 digits it will automatically assume that they form a digit group. Some more examples:
zsh% ledger bal --no-pager -f -
1/1
(a) $1,0000
1/2
(a) $1.0000,42
; end
While parsing file "", line 5:
While parsing posting:
(a) $1.0000,42
^^^^^^^^^^
Error: Incorrect use of thousand-mark period
zsh% ledger bal --no-pager -f -
1/1
(a) $1,000
1/2
(a) $1.000,42
; end
$2.000,42 a
It is lucky that I also used 3-digit groups and never saw that when I was using LedgerCLI.
We can add a similar heuristic. But if there is a better way, I'd prefer it. Though I don't mind writing a commodity directive.
P.S. If you use a space as the separator (1 000.42) you may avoid hitting this problem.
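To make the idea of a similar heuristic concrete, here is a minimal Python sketch of the three-digit rule as paraphrased above. It is not Ledger's actual implementation: the function name `parse_lone_comma` is hypothetical, and it only covers a number containing exactly one comma and no period. It is restricted to commas because, in the second Ledger example above, $1.000 is evidently read as one dollar rather than one thousand.

```python
from decimal import Decimal


def parse_lone_comma(number: str) -> Decimal:
    """Interpret a number that contains exactly one comma and no period.

    Hypothetical rule: a comma followed by exactly three digits is a
    digit group separator; any other digit count makes it a decimal mark.
    """
    whole, frac = number.split(",")
    if len(frac) == 3:
        # "1,420" is read as one thousand four hundred and twenty.
        return Decimal(whole + frac)
    # "1,42" and "1,0000" are read with the comma as decimal mark.
    return Decimal(f"{whole}.{frac}")


print(parse_lone_comma("1,420"))
print(parse_lone_comma("1,42"))
```

Under this rule $1,420 becomes 1420 while $1,42 becomes 1.42, which is the kind of length-dependent behaviour that would be awkward to document.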
We (@simonmichael and I) have found that there is currently no logic in hledger to use the encountered styles of a certain commodity to change the way future amounts of that commodity are parsed.
Furthermore, while the ledger and hledger manuals mention that the first encountered commoditized amount is examined in order to infer the display style of that commodity (ledger manual sections 3.4 and 14.2.2.2; hledger manual #amounts and #declaring-commodities), I am unable to find any mention that [h]ledger should infer the way in which the commoditized amount is parsed.
If indeed there are no such mentions, then it would seem that commodity directives are the recommended (and only) way to direct the parsing of commoditized amounts.
Unfortunately, this does not make any less surprising the difference between ledger and hledger in the way they parse the value $1,420.
You're right, from our recent discussions I see that it's a bit more complex than I thought.
- We do one parsing pass, then a finalisation step involving various transformations of the journal.
- An amount display style is chosen and applied to all the amounts in each commodity at the end of parsing. This determines the symbol side and spacing, number of decimal places, decimal point character, digit grouping character, digit group sizes, but it does not alter the number that was parsed.
- Numbers are parsed as we encounter them. Each number is parsed afresh, without knowledge of other numbers in the journal.
- Numbers which have no decimal point or digit grouping character, and numbers which have both, can be parsed unambiguously.
- For numbers which have just one or the other (eg: 1.000 or 1,000 or 1 000), we need to determine if it's a decimal point.
  - If we have already parsed a commodity directive that seems applicable, we get the information from there. (A decimal point in commodity directives is or soon will be mandatory.)
  - Otherwise, we just guess, and assume that it's a decimal point. (We even assume space is a decimal point, which is now or soon to be fixed.)
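The per-number rules in the list above can be sketched roughly as follows. This is an illustration in Python, not hledger's code: `parse_number` and its `decimal_mark` parameter (standing in for whatever an applicable commodity directive declares) are hypothetical names, only `.` and `,` are handled (not the space case), and treating a repeated separator such as 1,000,000 as grouping is an assumption of this sketch.

```python
from decimal import Decimal
from typing import Optional


def parse_number(text: str, decimal_mark: Optional[str] = None) -> Decimal:
    """Parse one number in isolation, without looking at other amounts.

    decimal_mark is the mark declared by a commodity directive, if any
    (hypothetical parameter for this sketch).
    """
    seps = [c for c in text if c in ".,"]
    if len(set(seps)) == 2:
        # Both characters present: the rightmost one is the decimal mark.
        mark = seps[-1]
    elif len(seps) == 1:
        # Ambiguous: use the directive if we have one, otherwise guess
        # that the lone separator is a decimal mark.
        mark = decimal_mark or seps[0]
    else:
        # No separator, or one separator repeated (assumed to be grouping).
        mark = None
    kept = [c for c in text if c.isdigit() or c in "+-" or c == mark]
    return Decimal("".join("." if c == mark else c for c in kept))


print(parse_number("1,000.00"))                 # unambiguous
print(parse_number("1,420"))                    # guessed: comma is decimal mark
print(parse_number("1,420", decimal_mark="."))  # directive says . is decimal mark
```

With no directive, 1,420 comes out as 1.420, matching the misparse in the original report; passing the mark from a directive yields 1420.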
I hope I have that right. Solutions?
- stick to the present course: fix the two parenthesised issues above, and improve the docs, showing the circumstances under which amounts can be misparsed and easy ways to prevent that.
- presumably we could delay even the interpretation of numbers till the end, like the display styles, and then interpret them based on more complete data. You could still be left with ambiguous amounts, but it would be rarer, and at least all amounts will be [mis]parsed in the same way, which is arguably better. The cost of this approach (changes to the balance assertion checking code, more complexity in the implementation) is unclear.
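As a rough illustration of the second option, the following Python sketch defers interpretation until all amounts of a commodity have been seen. Everything here is hypothetical (the function names, the `(commodity, number)` input shape, and the choice to raise on conflicting evidence); it is one possible shape of the idea, not a proposal for hledger's implementation.

```python
from decimal import Decimal
from typing import Dict, List, Optional, Set, Tuple


def decimal_mark_evidence(number: str) -> Optional[str]:
    """Return the decimal mark if this number alone reveals it."""
    seps = [c for c in number if c in ".,"]
    if len(set(seps)) == 2:
        return seps[-1]  # both present: the rightmost is the decimal mark
    return None


def interpret(number: str, mark: Optional[str]) -> Decimal:
    """Interpret a raw number string given the commodity's decimal mark."""
    if mark is None:
        # No evidence anywhere: fall back to guessing a lone separator
        # is a decimal mark, applied uniformly to the whole commodity.
        seps = [c for c in number if c in ".,"]
        mark = seps[0] if len(seps) == 1 else None
    kept = [c for c in number if c.isdigit() or c in "+-" or c == mark]
    return Decimal("".join("." if c == mark else c for c in kept))


def interpret_journal(amounts: List[Tuple[str, str]]) -> List[Decimal]:
    """First pass gathers evidence per commodity, second pass interprets."""
    evidence: Dict[str, Set[str]] = {}
    for commodity, number in amounts:
        mark = decimal_mark_evidence(number)
        if mark is not None:
            evidence.setdefault(commodity, set()).add(mark)
    result = []
    for commodity, number in amounts:
        marks = evidence.get(commodity, set())
        if len(marks) > 1:
            raise ValueError(f"conflicting decimal marks for {commodity}")
        result.append(interpret(number, next(iter(marks), None)))
    return result


# $1,420 appears before $1,000.00, yet is still read as 1420.
print(interpret_journal([("$", "1,420"), ("$", "1,000.00")]))
```

Note that the $-10.00 / $-10,00 pair from earlier in the thread provides no evidence either way under this scheme, so it stays ambiguous and falls back to the guess.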
Your summary of the way we parse amounts is consistent with my understanding.
I think there is a variation on your second suggestion. Instead of delaying the interpretation of numbers until the very end, when we have the complete set of data, we could choose some subset of this data to keep track of in the StateT Journal layer of the parser and use this subset for immediate interpretation. For instance, we could push AmountStyles to the jcommodities or jinferredcommodities field (I forget which is which) as we encounter them.
I am a bigger fan of the first suggestion, however. All else equal, I would prefer hledger to perform checking of specifications rather than inference. It seems like bad behaviour for the insertion of a transaction into an existing journal to change the meaning of the transactions that follow it, which is something that might occur if my above suggestion were implemented.
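The order dependence mentioned here can be seen in a small Python sketch of the stateful variant. The names are hypothetical and a plain dictionary stands in for the parser state; this is not how hledger's StateT Journal layer is actually structured.

```python
from decimal import Decimal
from typing import Dict, List, Tuple


def parse_in_order(amounts: List[Tuple[str, str]]) -> List[Decimal]:
    """Parse amounts one by one, remembering each commodity's decimal mark
    the first time an unambiguous amount reveals it."""
    known: Dict[str, str] = {}  # commodity -> decimal mark seen so far
    result = []
    for commodity, number in amounts:
        seps = [c for c in number if c in ".,"]
        if len(set(seps)) == 2:
            mark = seps[-1]
            known.setdefault(commodity, mark)
        elif len(seps) == 1:
            # Use what earlier amounts taught us, else guess decimal mark.
            mark = known.get(commodity, seps[0])
        else:
            mark = None
        kept = [c for c in number if c.isdigit() or c in "+-" or c == mark]
        result.append(Decimal("".join("." if c == mark else c for c in kept)))
    return result


print(parse_in_order([("$", "1,420")]))
print(parse_in_order([("$", "1,000.00"), ("$", "1,420")]))
```

On its own, $1,420 is read as 1.420; once a $1,000.00 posting is inserted before it, the same text is read as 1420.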
Personally, I would prefer to enforce even more strictness where possible without breaking journals. For example, to me, it would be a good thing to stipulate that the amounts of a certain commodity are interpreted in a consistent manner, using the same format, across an entire journal. This is already what directives sort of do, but it could be taken further. For example, we could stipulate that there be a format specification for every commodity in use, whether user-specified or inherited from some default.
I've started a related mail list thread, https://groups.google.com/d/msg/hledger/GqDwXF1LAJY/atAj60JeBgAJ . I ran out of steam but hope to follow up with some ideas for this issue, unless someone beats me to it.
My epic solo mail list thread continues; agreement/disagreement/perspective on any of it are welcome. I think it's pointing in the same direction as what awjchen is saying.
#793 aims to help with this and similar issues.
This issue has been around for a while. The overall conclusion seems to be that, given the need to support both , and . as decimal separators, the existing behaviour (default to decimal point) is least surprising and most maintainable. Beyond putting a more prominent warning in the docs, is there anything to be done here?
This issue's original example still stands; we parse each amount separately, always treating a single ambiguous comma or period as a decimal mark, and this is documented. #793 is for more advanced behaviour, so we can close this one.