Too much code sees a dedicated partial strings representation
Please consider the most recently corrected mistake in a long list of issues related to partial strings: https://github.com/mthom/scryer-prolog/commit/93cc4d48980d0a6c39575a04557bee086a865e31
Such issues stem from special cases related to a dedicated partial strings representation. Scryer is at the forefront of systems pioneering this representation, and is the first to run into many such issues.
How can we reliably prevent such problems?
A completely safe solution for such issues is to move as much of the code as possible to Prolog. A candidate that has already been identified is write_term/3. A Prolog implementation may also have mistakes, but they will never be mistakes due to confusing two different internal representations of partial strings, because Prolog code cannot tell the representations apart.
In cases where that would be too slow, a simple step that would prevent many such issues is to retain at least the following guiding principle: The internal representation of partial strings must be exposed to as few places in the code as possible.
There is no need for compare/3 to even know that there is a dedicated internal representation for partial strings. It should only ever see a partial list, no matter how it is internally represented.
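As a rough illustration of this guiding principle, here is a minimal Rust sketch. All names in it (`CharList`, `elements`, `compare_lists`) are hypothetical and are not Scryer's actual types or functions. It also only covers proper lists of characters, whereas a real compare/3 must handle variables and arbitrary terms. The point is only that a single accessor inspects the representation, and the comparison is written against the element view.

```rust
// Hypothetical sketch: none of these names exist in Scryer's code base.
use std::cmp::Ordering;

/// Two internal ways of storing the same list of characters.
enum CharList {
    /// One stored element per list cell, like an ordinary list of characters.
    Cells(Vec<char>),
    /// Compact UTF-8 buffer, standing in for a dedicated string representation.
    Packed(String),
}

impl CharList {
    /// The only place that looks at the internal representation.
    fn elements(&self) -> Box<dyn Iterator<Item = char> + '_> {
        match self {
            CharList::Cells(cells) => Box::new(cells.iter().copied()),
            CharList::Packed(text) => Box::new(text.chars()),
        }
    }
}

/// Comparison written purely against the element view:
/// it cannot confuse the two representations because it never sees them.
fn compare_lists(a: &CharList, b: &CharList) -> Ordering {
    a.elements().cmp(b.elements())
}
```

With this shape, `compare_lists` gives the same answer for `Cells(vec!['a', 'b', 'c'])` and `Packed("abc".into())`, and adding a third representation would only require touching `elements`.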
@Skgland: Since you have a particular talent for far-reaching adaptations that improve correctness, I hope this may interest you for example? I.e., to implement the Rust vocabulary that provides access to partial strings as "normal" lists?
I don't have any immediate ideas in mind and I don't think I had much contact or involvement with partial strings till now. Should something come to mind I will definitely share it, but I am not expecting much.
Another candidate where a better building block that papers over the internal representation difference could be useful and help to avoid mistakes: https://github.com/mthom/scryer-prolog/commit/b2cfc8b6f2fbcd26cbe99f98b4a03679dab8671b
One additional motivation to limit the places that see the internal representation difference:
In the future, binary blobs (of fixed, known length) could be an interesting addition to the engine, one key advantage being that each byte value is processed in constant time (whereas with UTF-8 encoding, values greater than 127 take two bytes). This will reduce or even entirely eliminate the risk of side-channel attacks in cryptographic applications. See also https://github.com/mthom/scryer-prolog/issues/24#issuecomment-1881642240
With useful building blocks that let more Rust code ignore the internal representation difference, such binary blobs will be a lot easier to add. Interestingly, and I think hitherto unheard of, binary blobs could then be accessed as Prolog lists and simultaneously also serve as arrays of binary data where each byte can be accessed in O(1)!
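To make the "list and array at the same time" idea concrete, here is a small Rust sketch under stated assumptions: `Blob`, `byte_at`, `as_chars` and `utf8_width_of_byte` are illustrative names, not existing Scryer APIs, and the sketch says nothing about how such a blob would live on the heap. It shows O(1) indexing next to a character view of the same bytes, and why byte values above 127 cost two bytes once they are stored as UTF-8 characters.

```rust
// Hypothetical sketch: `Blob` is an illustrative name, not an existing Scryer type.

/// Binary data of fixed, known length, stored as raw bytes.
struct Blob {
    bytes: Vec<u8>,
}

impl Blob {
    /// Array-style access: indexing a byte is O(1).
    fn byte_at(&self, index: usize) -> Option<u8> {
        self.bytes.get(index).copied()
    }

    /// List-style access: each byte is presented as the character
    /// with the same code point (0..=255).
    fn as_chars(&self) -> impl Iterator<Item = char> + '_ {
        self.bytes.iter().map(|&byte| char::from(byte))
    }
}

/// Number of bytes UTF-8 needs for the character that stands for `byte`.
fn utf8_width_of_byte(byte: u8) -> usize {
    char::from(byte).len_utf8()
}
```

In this sketch every byte occupies exactly one slot of the blob, while `utf8_width_of_byte` returns 1 for values up to 127 and 2 for values from 128 to 255. Whether processing is constant-time in the side-channel sense depends on the surrounding code and is not established by this sketch.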
Especially for APL programmers such as you @pmikkelsen, this could yield interesting combination opportunities leading to array-based logic programming or logic-based array programming. Fast vector-based building blocks could be very valuable to have in Prolog applications.
I agree 100% that having different internal representations for data, which can be used transparently from the Prolog code, is a great idea. At my day job, the APL interpreter we develop has 5 or so representations of numeric arrays (1-bit, 1-, 2- and 4-byte signed integers, and 64-bit floating-point numbers) and the most compact one is always used internally (the user shouldn't care about the representation: it is just numbers, and there is no need to be able to tell them apart in 99.9% of use cases). Modern CPUs can process a lot of data at once using SIMD, and the smaller the array elements, the more elements per instruction :)
Similarly, we have 3 different internal representations for character data (1-, 2- and 4-byte integers).
The transparent conversion that the interpreter does when converting between representations is what makes this a pleasure to work with. I wouldn't want it if, as an APL programmer, I needed to take care of each format myself.
Also, I am not familiar with how scryer handles/stores partial strings, so some of what I mention might be what you already do.
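The "most compact representation is always used" idea from the comment above can be sketched roughly as follows. This is an assumption-laden illustration in Rust, not the APL interpreter's actual logic: `Repr` and `narrowest_repr` are made-up names, and the selection rule (smallest signed integer width that holds every value, otherwise 64-bit floats) is only one plausible reading of the description.

```rust
// Hypothetical sketch: names and selection rule are illustrative assumptions.

/// The numeric array representations mentioned in the comment above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Repr {
    Bit,
    I8,
    I16,
    I32,
    F64,
}

/// Picks the most compact representation that can hold every value.
fn narrowest_repr(values: &[f64]) -> Repr {
    // True if every value is a whole number within the inclusive range [lo, hi].
    let all_in = |lo: f64, hi: f64| {
        values
            .iter()
            .all(|&v| v.fract() == 0.0 && v >= lo && v <= hi)
    };
    if all_in(0.0, 1.0) {
        Repr::Bit
    } else if all_in(i8::MIN as f64, i8::MAX as f64) {
        Repr::I8
    } else if all_in(i16::MIN as f64, i16::MAX as f64) {
        Repr::I16
    } else if all_in(i32::MIN as f64, i32::MAX as f64) {
        Repr::I32
    } else {
        Repr::F64
    }
}
```

User code would only ever see "numbers", and a function like this would run behind the scenes whenever an array is created or modified.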
Partial strings are internally stored in UTF-8 encoding, a very compact representation explained in #24 and #95.
Another key benefit: UTF-8 encoded text files that don't have 0-bytes can be mmapped to memory (#251) and thus be processed with DCGs and list predicates without having to be loaded entirely into memory.
As coincidentally another APLer recently mentioned to me: With an internal "binary blob" representation that also makes the bytes available as a list of characters to Prolog code, we could extend this to binary files, and thus parse and analyze for example WAV files very efficiently with Prolog and its grammar mechanism.
"binary blob" ... makes the bytes available as a list of characters to Prolog code, we could extend this to binary files
Usually, binary blobs are the very data mapped directly into memory (like mmap(2)-ing a jpeg). Handling such a blob then as a list of characters would incur some data overhead (creating some auxiliary terms on the heap), and thus some mapping like in library(pio) is the best one can hope for. And I do not see much use of such blobs in Prolog.
However, partial strings have a representation such that parsing them with a DCG produces no auxiliary data. So it is (now) never more costly (data-wise) than the naive list-of-characters representation but shares the same minimal overhead when being passed around.
It can be called a pessimistic view of a naive user, but I can only hope that before Scryer starts processing binary data, it will be able to process textual data.
Scryer does handle textual data in UTF-8.
?- phrase_to_file("a\xa0\b",file).
true.
?- phrase_from_file(seq(Xs),file).
Xs = "a\xa0\b"
; false.
?- open(file,read,S),get_char(S,A),get_char(S,NBSP),get_char(S,B).
S = '$stream'(0x5d6cf5b81f80), A = a, NBSP = '\xa0\', B = b.
$ od -c file
0000000 a 302 240 b
0000004
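For readers who want to check the od output by hand: the non-breaking space U+00A0 is encoded in UTF-8 as the two bytes 0xC2 0xA0, which od prints in octal as 302 240. A minimal Rust sketch that reproduces the byte values (the function name `utf8_octal` is made up for this illustration):

```rust
/// Renders each UTF-8 byte of `text` in octal,
/// the notation `od -c` uses for non-printable bytes.
fn utf8_octal(text: &str) -> Vec<String> {
    text.bytes().map(|byte| format!("{:o}", byte)).collect()
}
```

For the text "a\u{a0}b" this yields 141, 302, 240 and 142, i.e. four bytes for three characters, matching the final offset 0000004 above (od shows the printable bytes 141 and 142 as a and b).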
What you want is to extend Prolog's syntax.
Here, by processing textual data I mean (that Scryer and Trealla would work in the same way):
Scryer:
?- use_module(library(pio)).
true.
?- phrase_to_file("nbsp('\xa0\').","nbsp-fact.pl").
true.
?- ['nbsp-fact.pl'].
error(syntax_error(invalid_single_quoted_character),read_term/3). % Should I expect this?
Trealla (as I expected):
?- ['nbsp-fact.pl'].
true.
?- nbsp(NBSP).
NBSP = '\xa0\'.
SWI-Prolog:
?- ['nbsp-fact.pl'].
true.
?- nbsp(NBSP).
NBSP = '\u00A0'.
GNU Prolog:
| ?- ['nbsp-fact.pl'].
compiling /../nbsp-fact.pl for byte code...
/.../nbsp-fact.pl compiled, 0 lines read - 296 bytes written, 14 ms
yes
| ?- nbsp(NBSP).
NBSP = '\xc2\\xa0\'
Ciao:
?- ['nbsp-fact.pl'].
yes
?- nbsp(NBSP).
NBSP = ' ' ?
yes
This appears to be working on the playground (who knows what version the playground is using?).
How did you install scryer? What version are you using?
Version: https://github.com/mthom/scryer-prolog/tree/cd501beb0b3fe1fd569080dea56fc6b2dada5dac
By: $ cargo build --release
@Skgland, yes, it works if there is the explicit literal: '\xa0\' in a file; but in that file that literal is not saved with phrase_to_file("nbsp('\xa0\').","nbsp-fact.pl").
Ah. A literal non breaking space appears to reproduce the problem.
And the version of the playground appears to be rather old.
@Skgland Yes! That is what I see as a problem, or I misunderstand the thing...
In my (naive) imagination, at least saving textual data with non-breaking spaces to a file and then consulting that file as Prolog code should work (as it works in Trealla Prolog, for example).
@haijinSk: Writing strings verbatim to a file does not guarantee that they form valid Prolog syntax that can be consulted. For example, we equally have, with Trealla Prolog:
?- phrase_to_file("hello('\n').", "hello.pl").
true.
?- [hello].
Error: syntax error, unterminated quoted atom, hello:1
This is expected.
To ensure that the emitted text is valid Prolog text and can be parsed, we can use for example the ~q format specifier of format_//2, or also portray_clause_//1:
?- phrase_to_file(portray_clause_(nbsp('\xa0\')), "nbsp-fact.pl").
true.
?- ['nbsp-fact'].
true.
Does this work for you?
Markus @triska, thank you very much!!! That solves the thing for me: how to create a file of Prolog facts with non-breaking spaces in texts of the facts... Thank you!!!
And I'm closing (it seems now that I should close it) this: https://github.com/mthom/scryer-prolog/issues/2982
Also, for looking at that thing, thank you @UWN, thank you @Skgland
Another case that could benefit from this approach: #3086.
Another case that can maybe benefit from such an approach: #3089.
This issue is also related to #2594: In the future, yet another term representation may be introduced to support terms with unbounded arities, and the interface described above would make that far easier and less error-prone.
A conceptually related issue is Lis vs. '.'(_, _), unexpectedly yielding https://github.com/mthom/scryer-prolog/issues/3171, and handling in https://github.com/mthom/scryer-prolog/pull/3173/commits/268ba289e4e6740be074a43771e9b13398284e9b.
Ideally, the internal distinction is visible to as few places as possible.
Note: 268ba28 does not fully resolve #3171. Term::from_heapcell is only used in QueryState::next, QueryState being the iterator returned from Machine::run_query.