
Too much code sees a dedicated partial strings representation

Open triska opened this issue 5 months ago • 23 comments

Please consider the most recently corrected mistake in a long list of issues related to partial strings: https://github.com/mthom/scryer-prolog/commit/93cc4d48980d0a6c39575a04557bee086a865e31

Such issues stem from special cases related to a dedicated partial strings representation. Scryer is at the forefront of systems pioneering this representation, and is the first to run into many such issues.

How can we reliably prevent such problems?

A completely safe solution for such issues is to move as much of the code as possible to Prolog. A candidate that has already been identified is write_term/3. A Prolog implementation may also have mistakes, but they will never be mistakes due to confusing two different internal representations of partial strings, because Prolog code cannot tell the representations apart.

In cases where that would be too slow, a simple step that would prevent many such issues is to retain at least the following guiding principle: The internal representation of partial strings must be exposed to as few places in the code as possible.

There is no need for compare/3 to even know that there is a dedicated internal representation for partial strings. It should only ever see a partial list, no matter how it is internally represented.
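The guiding principle above can be illustrated with a minimal Rust sketch. The types `Text` and `Chars` and the functions `cons_list` and `compare_text` are hypothetical names chosen for illustration; they do not mirror Scryer's actual heap layout or API, and the sketch only covers closed lists, not partial lists that end in a variable. Only the iterator inspects the representation, and the comparison is written purely against the list view.

```rust
use std::cmp::Ordering;

/// Hypothetical term type with two internal representations of text.
/// It does not mirror Scryer's actual heap layout.
pub enum Text {
    /// The empty list `[]`.
    Nil,
    /// Naive representation: one cons cell per character.
    Cons(char, Box<Text>),
    /// Compact representation: a UTF-8 buffer followed by a tail.
    Packed(String, Box<Text>),
}

/// Iterator that presents either representation as a plain sequence of
/// characters. Only this code inspects the representation.
pub struct Chars<'a> {
    current: &'a Text,
    offset: usize, // byte offset into the current `Packed` buffer
}

impl Text {
    pub fn chars(&self) -> Chars<'_> {
        Chars { current: self, offset: 0 }
    }
}

impl<'a> Iterator for Chars<'a> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        loop {
            let current: &'a Text = self.current;
            match current {
                Text::Nil => return None,
                Text::Cons(c, tail) => {
                    self.current = &**tail;
                    return Some(*c);
                }
                Text::Packed(buf, tail) => {
                    if let Some(c) = buf[self.offset..].chars().next() {
                        self.offset += c.len_utf8();
                        return Some(c);
                    }
                    // Buffer exhausted: continue with the tail.
                    self.current = &**tail;
                    self.offset = 0;
                }
            }
        }
    }
}

/// Builds the naive cons-cell representation of `s`.
pub fn cons_list(s: &str) -> Text {
    s.chars().rev().fold(Text::Nil, |tail, c| Text::Cons(c, Box::new(tail)))
}

/// Comparison written only against the list view. It never learns
/// which representation either argument uses.
pub fn compare_text(a: &Text, b: &Text) -> Ordering {
    a.chars().cmp(b.chars())
}
```

With this shape, `compare_text(&cons_list("ab"), &Text::Packed("ab".to_string(), Box::new(Text::Nil)))` yields `Ordering::Equal`, and adding a further representation would only require touching `Chars::next`.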

triska avatar Aug 17 '25 19:08 triska

@Skgland: Since you have a particular talent for far-reaching adaptations that improve correctness, I hope this may interest you for example? I.e., to implement the Rust vocabulary that provides access to partial strings as "normal" lists?

triska avatar Aug 20 '25 18:08 triska

I don't have any immediate ideas in mind and I don't think I had much contact or involvement with partial strings till now. Should something come to mind I will definitely share it, but I am not expecting much.

Skgland avatar Aug 20 '25 21:08 Skgland

Another candidate where a better building block that papers over the internal representation difference could be useful and help to avoid mistakes: https://github.com/mthom/scryer-prolog/commit/b2cfc8b6f2fbcd26cbe99f98b4a03679dab8671b

triska avatar Aug 23 '25 16:08 triska

One additional motivation to limit the places that see the internal representation difference:

In the future, binary blobs (of fixed, known length) could be an interesting addition to the engine, one key advantage being that each byte value is processed in constant time (whereas with UTF-8 encoding, values greater than 127 take two bytes). This will reduce or even entirely eliminate the risk of side-channel attacks in cryptographic applications. See also https://github.com/mthom/scryer-prolog/issues/24#issuecomment-1881642240

With useful building blocks that let more Rust code ignore the internal representation difference, such binary blobs will be a lot easier to add. Interestingly, and I think hitherto unheard of, binary blobs could then be accessed as Prolog lists and simultaneously also serve as arrays of binary data where each byte can be accessed in O(1)!
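As a rough sketch of the dual access described above, the following Rust type offers both an array view and a list view of the same bytes. `Blob`, `byte_at` and `chars` are hypothetical names, not part of Scryer; the mapping of each byte to the character with the same code (0 to 255) is an assumption about how such a blob might be presented to Prolog code.

```rust
/// Hypothetical fixed-length binary blob; not an existing Scryer type.
pub struct Blob {
    bytes: Vec<u8>,
}

impl Blob {
    pub fn new(bytes: Vec<u8>) -> Self {
        Blob { bytes }
    }

    /// Array view: O(1) access to the byte at `index`.
    pub fn byte_at(&self, index: usize) -> Option<u8> {
        self.bytes.get(index).copied()
    }

    /// List view: each byte is presented as the character with the same
    /// code, so every element occupies exactly one byte internally.
    pub fn chars(&self) -> impl Iterator<Item = char> + '_ {
        self.bytes.iter().map(|&b| char::from(b))
    }
}
```

For comparison, the character with code 0xA0 occupies two bytes in UTF-8 (`'\u{a0}'.len_utf8()` is 2), whereas in this blob it occupies one, which is what makes constant-time indexing possible.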

Especially for APL programmers such as you @pmikkelsen, this could yield interesting combination opportunities leading to array-based logic programming or logic-based array programming. Fast vector-based building blocks could be very valuable to have in Prolog applications.

triska avatar Aug 24 '25 09:08 triska

I agree 100% that having different internal representations for data, which can be used transparently from the Prolog code, is a great idea. At my day job, the APL interpreter we develop has 5 or so representations of numeric arrays (1-bit, 1-, 2- and 4-byte signed integers, and 64-bit floating-point numbers) and the most compact one is always used internally (the user shouldn’t care about the representation - it is just numbers, and there is no need to be able to tell them apart in 99.9% of use cases). Modern CPUs can process a lot of data at once using SIMD, and the smaller the array elements, the more elements per instruction :)

Similarly, we have 3 different internal representations for character data (1-, 2- and 4-byte integers).

The transparent conversion that the interpreter performs between representations is what makes this a pleasure to work with. As an APL programmer, I wouldn’t want it if I needed to take care of each format myself.
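The idea of always picking the most compact numeric representation can be sketched in Rust as follows. This is only an illustration of the principle; `Repr` and `most_compact` are hypothetical names, and the selection logic is an assumption, not the APL interpreter's actual algorithm.

```rust
/// The five numeric representations listed above, smallest first.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Repr {
    Bit,
    I8,
    I16,
    I32,
    F64,
}

/// Picks the smallest representation that can hold every value exactly.
pub fn most_compact(values: &[f64]) -> Repr {
    // True if all values are whole numbers within `lo..=hi`.
    let fits = |lo: f64, hi: f64| {
        values.iter().all(|&v| v.fract() == 0.0 && lo <= v && v <= hi)
    };
    if fits(0.0, 1.0) {
        Repr::Bit
    } else if fits(f64::from(i8::MIN), f64::from(i8::MAX)) {
        Repr::I8
    } else if fits(f64::from(i16::MIN), f64::from(i16::MAX)) {
        Repr::I16
    } else if fits(f64::from(i32::MIN), f64::from(i32::MAX)) {
        Repr::I32
    } else {
        Repr::F64
    }
}
```

A caller never needs to name the representation: `most_compact(&[0.0, 1.0, 1.0])` selects `Repr::Bit`, while `most_compact(&[0.5])` falls back to `Repr::F64`.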

Also, I am not familiar with how scryer handles/stores partial strings, so some of what I mention might be what you already do.

pmikkelsen avatar Aug 24 '25 10:08 pmikkelsen

Partial strings are internally stored in UTF-8 encoding, a very compact representation explained in #24 and #95.

Another key benefit: UTF-8 encoded text files that don't have 0-bytes can be mmapped to memory (#251) and thus be processed with DCGs and list predicates without having to be loaded entirely into memory.

As coincidentally another APLer recently mentioned to me: With an internal "binary blob" representation that also makes the bytes available as a list of characters to Prolog code, we could extend this to binary files, and thus parse and analyze for example WAV files very efficiently with Prolog and its grammar mechanism.

triska avatar Aug 24 '25 14:08 triska

"binary blob" ... makes the bytes available as a list of characters to Prolog code, we could extend this to binary files

Usually, binary blobs are the very data mapped directly into memory (like mmap(2)-ing a JPEG). Handling such a blob as a list of characters would then incur some data overhead (creating some auxiliary terms on the heap), and thus some mapping like in library(pio) is the best one can hope for. And I do not see much use of such blobs in Prolog.

However, partial strings have a representation such that parsing them with a DCG produces no auxiliary data. So it is (now) never more costly (data-wise) than the naive list-of-characters representation but shares the same minimal overhead when being passed around.

UWN avatar Aug 24 '25 17:08 UWN

It can be called a pessimistic view of a naive user, but I can only hope that before Scryer starts processing binary data, it will be able to process textual data.

haijinSk avatar Aug 25 '25 12:08 haijinSk

It can be called a pessimistic view of a naive user, but I can only hope that before Scryer starts processing binary data, it will be able to process textual data.

Scryer does handle textual data in UTF-8.

?- phrase_to_file("a\xa0\b",file).
   true.
?- phrase_from_file(seq(Xs),file).
   Xs = "a\xa0\b"
;  false.
?- open(file,read,S),get_char(S,A),get_char(S,NBSP),get_char(S,B).
   S = '$stream'(0x5d6cf5b81f80), A = a, NBSP = '\xa0\', B = b.

$ od -c file
0000000   a 302 240   b
0000004
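The `od -c` output can be cross-checked with a small Rust sketch that prints the UTF-8 bytes of the same text in octal. `utf8_octal` is a hypothetical helper written for this illustration.

```rust
/// Returns the UTF-8 bytes of `text` as three-digit octal strings,
/// matching the numeric notation `od -c` uses for non-printable bytes.
pub fn utf8_octal(text: &str) -> Vec<String> {
    text.bytes().map(|b| format!("{:03o}", b)).collect()
}
```

For `"a\u{a0}b"` this gives `141 302 240 142`: the no-break space U+00A0 is stored as the two bytes 302 240, and `od -c` shows 141 and 142 as the printable characters `a` and `b`.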

What you want is to extend Prolog's syntax.

UWN avatar Aug 25 '25 12:08 UWN

Here, by processing textual data I mean (that Scryer and Trealla would work in the same way):

Scryer:

?- use_module(library(pio)).
   true.
?- phrase_to_file("nbsp('\xa0\').","nbsp-fact.pl").
   true.
?- ['nbsp-fact.pl'].
   error(syntax_error(invalid_single_quoted_character),read_term/3). % Should I expect this?

Trealla (as I expected):

?- ['nbsp-fact.pl'].
   true. 
?- nbsp(NBSP).
   NBSP = '\xa0\'.

SWI-Prolog:

?- ['nbsp-fact.pl'].
true.

?- nbsp(NBSP).
NBSP = '\u00A0'.

GNU Prolog:

| ?- ['nbsp-fact.pl'].
compiling /../nbsp-fact.pl for byte code...
/.../nbsp-fact.pl compiled, 0 lines read - 296 bytes written, 14 ms

yes
| ?- nbsp(NBSP).

NBSP = '\xc2\\xa0\'

Ciao:

?- ['nbsp-fact.pl'].

yes
?- nbsp(NBSP).

NBSP = ' ' ? 

yes

haijinSk avatar Aug 25 '25 13:08 haijinSk

This appears to be working on the playground (who knows which version the playground is using?)

Image

How did you install scryer? What version are you using?

Skgland avatar Aug 25 '25 13:08 Skgland

Version: https://github.com/mthom/scryer-prolog/tree/cd501beb0b3fe1fd569080dea56fc6b2dada5dac

By: $ cargo build --release

haijinSk avatar Aug 25 '25 13:08 haijinSk

@Skgland, yes, it works if there is the explicit literal: '\xa0\' in a file; but in that file that literal is not saved with phrase_to_file("nbsp('\xa0\').","nbsp-fact.pl").

haijinSk avatar Aug 25 '25 13:08 haijinSk

Ah. A literal non-breaking space appears to reproduce the problem.

Image

And the version of the playground appears rather old

Image

Skgland avatar Aug 25 '25 14:08 Skgland

@Skgland Yes! That is what I see as a problem, or I misunderstand the thing...

haijinSk avatar Aug 25 '25 14:08 haijinSk

In my (naive) imagination, at least saving textual data with non-breaking spaces to a file and then consulting that file as Prolog code should work (as it works in Trealla Prolog, for example).

haijinSk avatar Aug 25 '25 14:08 haijinSk

@haijinSk: Writing strings verbatim to a file does not guarantee that they form valid Prolog syntax that can be consulted. For example, we equally have, with Trealla Prolog:

?- phrase_to_file("hello('\n').", "hello.pl").
   true.
?- [hello].
Error: syntax error, unterminated quoted atom, hello:1

This is expected.

To ensure that the emitted text is valid Prolog text and can be parsed, we can use for example the ~q format specifier of format_//2, or also portray_clause_//1:

?- phrase_to_file(portray_clause_(nbsp('\xa0\')), "nbsp-fact.pl").
   true.
?- ['nbsp-fact'].
   true.

Does this work for you?

triska avatar Aug 25 '25 17:08 triska

Markus @triska, thank you very much!!! That solves the thing for me: how to create a file of Prolog facts with non-breaking spaces in texts of the facts... Thank you!!!

And I'm closing (it seems now that I should close it) this: https://github.com/mthom/scryer-prolog/issues/2982

Also, for looking at that thing, thank you @UWN, thank you @Skgland

haijinSk avatar Aug 25 '25 19:08 haijinSk

Another case that could benefit from this approach: #3086.

triska avatar Sep 13 '25 18:09 triska

Another case that can maybe benefit from such an approach: #3089.

triska avatar Sep 16 '25 16:09 triska

This issue is also related to #2594: In the future, yet another term representation may be introduced to support terms with unbounded arities, and the interface described above would make that far easier and less error-prone.

triska avatar Oct 16 '25 19:10 triska

A conceptually related issue is the internal distinction between Lis and '.'(_, _), which unexpectedly yields https://github.com/mthom/scryer-prolog/issues/3171, with handling in https://github.com/mthom/scryer-prolog/pull/3173/commits/268ba289e4e6740be074a43771e9b13398284e9b.

Ideally, the internal distinction is visible to as few places as possible.

triska avatar Nov 21 '25 23:11 triska

Note: 268ba28 does not fully resolve #3171. Term::from_heapcell is only used in QueryState::next, QueryState being the iterator returned from Machine::run_query.

Skgland avatar Nov 21 '25 23:11 Skgland