Implement URI Handling Functions
ECMAScript feature
We currently lack the implementation of some built-in global functions used for URI encoding/decoding. You can find these functions in the spec here.
The 4 functions to implement would be these: decodeURI, decodeURIComponent, encodeURI, and encodeURIComponent.
In those links you can find example code that should work after this implementation.
Implementation tips
We should probably create a new module here for these functions. An example of a built-in global function implementation can be seen here.
Hey @Razican , I would like to work on it. Thanks :)
Hi @sidntrivedi012 how is this going?
I will be happy to take a look if @sidntrivedi012 doesn't mind
Go ahead @captain-yossarian
@Razican @tofpie Should/Can I use this percent_encoding crate?
Hi, yes, if it makes sense to use it, feel free :)
Unassigned the issue by @captain-yossarian's request. For anyone interested, feel free to ping us if you want to work on this😁
I'll take a look at this, please assign to me.
@jtara1 I've assigned it to you. Let us know if you need some pointers to get started :)
I got something working with that crate ^ but I'm trying to just follow their algo directly in the ecma spec now
What are the return values for this function? https://github.com/jtara1/boa/blob/8ba500a26afdad8e200c9990b375664b5c04a97a/boa/src/builtins/string/mod.rs#L36
e: looks like @joshwd36 implemented that, what do the return values of (u32, u8, bool) represent for code_point_at func?
code_point_at just represents the operation CodePointAt from the spec. But maybe we could rewrite it to be more idiomatic with descriptive enum returns instead of bools and u8s
the spec answered my question with
Return the Record { [[CodePoint]], [[CodeUnitCount]], [[IsUnpairedSurrogate]] }.
thanks
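To make the record above concrete, here is a minimal Rust sketch of the spec's CodePointAt operation that returns a named struct rather than a `(u32, u8, bool)` tuple. It is only an illustration: `CodePointRecord`, its field names, and the helper functions are hypothetical and are not taken from the Boa codebase, and the input is assumed to be a slice of UTF-16 code units.

```rust
/// Hypothetical stand-in for the spec's
/// Record { [[CodePoint]], [[CodeUnitCount]], [[IsUnpairedSurrogate]] }.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CodePointRecord {
    code_point: u32,
    code_unit_count: u8,
    is_unpaired_surrogate: bool,
}

fn is_leading_surrogate(unit: u16) -> bool {
    (0xD800..=0xDBFF).contains(&unit)
}

fn is_trailing_surrogate(unit: u16) -> bool {
    (0xDC00..=0xDFFF).contains(&unit)
}

/// Sketch of CodePointAt over UTF-16 code units.
/// Panics if `position` is out of bounds (the spec asserts it is in range).
fn code_point_at(units: &[u16], position: usize) -> CodePointRecord {
    let first = units[position];
    // Result for the cases where only one code unit is consumed.
    let single = |unpaired| CodePointRecord {
        code_point: u32::from(first),
        code_unit_count: 1,
        is_unpaired_surrogate: unpaired,
    };

    // Not a surrogate at all: a plain BMP code point.
    if !is_leading_surrogate(first) && !is_trailing_surrogate(first) {
        return single(false);
    }
    // A lone trailing surrogate, or a leading surrogate at the end of the string.
    if is_trailing_surrogate(first) || position + 1 == units.len() {
        return single(true);
    }
    let second = units[position + 1];
    // A leading surrogate that is not followed by a trailing surrogate.
    if !is_trailing_surrogate(second) {
        return single(true);
    }
    // Combine the valid surrogate pair into one supplementary code point.
    let code_point =
        ((u32::from(first) - 0xD800) << 10) + (u32::from(second) - 0xDC00) + 0x1_0000;
    CodePointRecord {
        code_point,
        code_unit_count: 2,
        is_unpaired_surrogate: false,
    }
}
```

Called on the code units of `"a😀"`, position 0 should give a one-unit record for `a`, position 1 a two-unit record for U+1F600, and position 2 (the trailing half on its own) a record flagged as an unpaired surrogate.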
So far the hardest part of this is figuring out the transformation algo (for singles & pairs) from code point to UTF-8 encoded byte(s). https://en.wikipedia.org/wiki/UTF-8 actually seems like the most thorough explanation of this.
I'm also checking the node.js implementation of this https://github.com/nodejs/node/blob/e46c680bf2b211bbd52cf959ca17ee98c7f657f5/deps/v8/src/strings/uri.cc#L279
and
```cpp
void AddEncodedOctetToBuffer(uint8_t octet, std::vector<uint8_t>* buffer) {
  buffer->push_back('%');
  buffer->push_back(HexCharOfValue(octet >> 4));
  buffer->push_back(HexCharOfValue(octet & 0x0F));
}

void EncodeSingle(uc16 c, std::vector<uint8_t>* buffer) {
  char s[4] = {};
  int number_of_bytes;
  number_of_bytes =
      unibrow::Utf8::Encode(s, c, unibrow::Utf16::kNoPreviousCharacter, false);
  for (int k = 0; k < number_of_bytes; k++) {
    AddEncodedOctetToBuffer(s[k], buffer);
  }
}

void EncodePair(uc16 cc1, uc16 cc2, std::vector<uint8_t>* buffer) {
  char s[4] = {};
  int number_of_bytes =
      unibrow::Utf8::Encode(s, unibrow::Utf16::CombineSurrogatePair(cc1, cc2),
                            unibrow::Utf16::kNoPreviousCharacter, false);
  for (int k = 0; k < number_of_bytes; k++) {
    AddEncodedOctetToBuffer(s[k], buffer);
  }
}
```
e: my answer may lie here https://github.com/nodejs/node/blob/e46c680bf2b211bbd52cf959ca17ee98c7f657f5/deps/v8/src/strings/unicode-inl.h#L144 still sorting through things
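For reference, the code point to UTF-8 transformation discussed above can be sketched in Rust without any external crate. This is only an illustration of the standard UTF-8 bit layout followed by percent-encoding of each octet; the names `utf8_octets` and `percent_encode_octets` are hypothetical and do not come from Boa or V8. Surrogate code points and values above U+10FFFF are assumed to be rejected with `None`.

```rust
/// Encode one Unicode code point as UTF-8 octets.
/// Returns `None` for surrogates (0xD800..=0xDFFF) and values above 0x10FFFF.
fn utf8_octets(code_point: u32) -> Option<Vec<u8>> {
    match code_point {
        // 1 byte: 0xxxxxxx
        0x0000..=0x007F => Some(vec![code_point as u8]),
        // 2 bytes: 110xxxxx 10xxxxxx
        0x0080..=0x07FF => Some(vec![
            0xC0 | (code_point >> 6) as u8,
            0x80 | (code_point & 0x3F) as u8,
        ]),
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (surrogate range excluded)
        0x0800..=0xD7FF | 0xE000..=0xFFFF => Some(vec![
            0xE0 | (code_point >> 12) as u8,
            0x80 | ((code_point >> 6) & 0x3F) as u8,
            0x80 | (code_point & 0x3F) as u8,
        ]),
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0x1_0000..=0x10_FFFF => Some(vec![
            0xF0 | (code_point >> 18) as u8,
            0x80 | ((code_point >> 12) & 0x3F) as u8,
            0x80 | ((code_point >> 6) & 0x3F) as u8,
            0x80 | (code_point & 0x3F) as u8,
        ]),
        // Surrogates and out-of-range values have no UTF-8 encoding.
        _ => None,
    }
}

/// Write each octet as `%XX` with uppercase hexadecimal digits.
fn percent_encode_octets(octets: &[u8]) -> String {
    octets.iter().map(|octet| format!("%{:02X}", octet)).collect()
}
```

With this sketch, U+20AC (the euro sign) becomes the octets `E2 82 AC` and therefore `%E2%82%AC`, and a surrogate pair that has already been combined into U+1F600 becomes `%F0%9F%98%80`. In real code, `char::from_u32` together with `char::encode_utf8` from the standard library does the same job; the manual version is shown only to expose the bit layout.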
Would it be worth looking into #736 again? It seems that the implementation will be pretty closely tied to the encoding
Yeah, maybe. We could reevaluate if there's a way to adapt our regex to consume UTF-16, then change the definition of JsString.