boa Implement URI Handling Functions

ECMAScript feature We currently lack the implementation of some built-in global functions used for URI encoding/decoding. You can find these functions in the spec here.

The 4 functions to implement would be these: decodeURI(), decodeURIComponent(), encodeURI() and encodeURIComponent().

In those links you can find example code that should work after this implementation.

Implementation tips We should probably create a new module here for these functions. An example of a built-in global function implementation can be seen here.

Oct 19 '20 12:10 Razican

Hey @Razican , I would like to work on it. Thanks :)

Oct 24 '20 19:10 sidntrivedi012

Hi @sidntrivedi012 how is this going?

Jan 11 '21 14:01 Razican

I will be happy to take a look if @sidntrivedi012 doesn't mind

Jan 22 '21 23:01 captain-yossarian

Go ahead @captain-yossarian

Jan 24 '21 08:01 tofpie

@Razican @tofpie Should/Can I use this percent_encoding crate?

Jan 25 '21 22:01 captain-yossarian

@Razican @tofpie Should/Can I use this percent_encoding crate?

Hi, yes, if it makes sense to use it, feel free :)

Jan 26 '21 10:01 Razican

Unassigned the issue by @captain-yossarian's request. For anyone interested, feel free to ping us if you want to work on this😁

Sep 20 '21 22:09 jedel1043

I'll take a look at this, please assign to me.

Oct 01 '21 23:10 jtara1

@jtara1 I've assigned it to you. Let us know if you need some pointers to get started :)

Oct 01 '21 23:10 jedel1043

I got something working with that crate ^ but I'm trying to just follow their algo directly in the ecma spec now

What are the return values of this function? https://github.com/jtara1/boa/blob/8ba500a26afdad8e200c9990b375664b5c04a97a/boa/src/builtins/string/mod.rs#L36

e: looks like @joshwd36 implemented that, what do the return values of (u32, u8, bool) represent for code_point_at func?

Oct 02 '21 00:10 jtara1

code_point_at just represents the operation CodePointAt from the spec. But maybe we could rewrite it to be more idiomatic with descriptive enum returns instead of bools and u8s
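For illustration only, here is a rough sketch of what a more descriptive return type could look like. It is not the existing boa code: the `CodePointRecord` struct and this standalone `code_point_at` over a `&[u16]` slice are hypothetical names, and the fields just mirror the Record that the spec's CodePointAt operation returns.

```rust
/// Hypothetical struct mirroring the Record returned by the spec's CodePointAt:
/// { [[CodePoint]], [[CodeUnitCount]], [[IsUnpairedSurrogate]] }.
#[derive(Debug, PartialEq, Eq)]
struct CodePointRecord {
    code_point: u32,
    code_unit_count: u8,
    is_unpaired_surrogate: bool,
}

/// Sketch of CodePointAt over UTF-16 code units.
/// Assumes `position < units.len()` (the spec asserts this); panics otherwise.
fn code_point_at(units: &[u16], position: usize) -> CodePointRecord {
    let first = units[position];
    let is_lead = (0xD800..=0xDBFF).contains(&first);
    let is_trail = (0xDC00..=0xDFFF).contains(&first);

    // Not a surrogate: the code unit is the code point.
    if !is_lead && !is_trail {
        return CodePointRecord {
            code_point: u32::from(first),
            code_unit_count: 1,
            is_unpaired_surrogate: false,
        };
    }

    // A trailing surrogate on its own, or a leading surrogate with nothing
    // (or a non-trailing unit) after it, is unpaired.
    let second = units.get(position + 1).copied();
    let second_is_trail = matches!(second, Some(s) if (0xDC00..=0xDFFF).contains(&s));
    if is_trail || !second_is_trail {
        return CodePointRecord {
            code_point: u32::from(first),
            code_unit_count: 1,
            is_unpaired_surrogate: true,
        };
    }

    // Valid pair: combine lead and trail into one code point.
    let trail = u32::from(second.unwrap());
    let cp = 0x10000 + ((u32::from(first) - 0xD800) << 10) + (trail - 0xDC00);
    CodePointRecord {
        code_point: cp,
        code_unit_count: 2,
        is_unpaired_surrogate: false,
    }
}
```

Named fields make call sites self-explanatory compared to a `(u32, u8, bool)` tuple; whether an enum would be nicer is a separate design question.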

Oct 02 '21 01:10 jedel1043

the spec answered my question with

Return the Record { [[CodePoint]], [[CodeUnitCount]], [[IsUnpairedSurrogate]] }.

thanks

Oct 02 '21 01:10 jtara1

So far the hardest part of this is figuring out the transformation algo (for singles & pairs) from a code point to its UTF-8 encoded byte(s). https://en.wikipedia.org/wiki/UTF-8 actually seems like the most thorough explanation for this.
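As a rough sketch of that transformation, following the standard UTF-8 bit layout (1 to 4 bytes depending on the code point range): the function name `utf8_bytes` is made up for illustration and is not part of boa or the spec.

```rust
/// Encodes one code point as UTF-8 bytes using the standard bit layout.
/// Returns None for surrogates (0xD800..=0xDFFF) and values above 0x10FFFF,
/// which have no valid UTF-8 encoding.
fn utf8_bytes(cp: u32) -> Option<Vec<u8>> {
    match cp {
        // 0xxxxxxx
        0x0000..=0x007F => Some(vec![cp as u8]),
        // 110xxxxx 10xxxxxx
        0x0080..=0x07FF => Some(vec![
            0xC0 | (cp >> 6) as u8,
            0x80 | (cp & 0x3F) as u8,
        ]),
        // 1110xxxx 10xxxxxx 10xxxxxx (surrogate range excluded)
        0x0800..=0xD7FF | 0xE000..=0xFFFF => Some(vec![
            0xE0 | (cp >> 12) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ]),
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0x10000..=0x10FFFF => Some(vec![
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ]),
        _ => None,
    }
}
```

In real code `char::encode_utf8` from the standard library already does this for valid scalar values, so the manual version is mostly useful for understanding the layout.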

I'm also checking the node.js implementation of this https://github.com/nodejs/node/blob/e46c680bf2b211bbd52cf959ca17ee98c7f657f5/deps/v8/src/strings/uri.cc#L279

and

void AddEncodedOctetToBuffer(uint8_t octet, std::vector<uint8_t>* buffer) {
  buffer->push_back('%');
  buffer->push_back(HexCharOfValue(octet >> 4));
  buffer->push_back(HexCharOfValue(octet & 0x0F));
}

void EncodeSingle(uc16 c, std::vector<uint8_t>* buffer) {
  char s[4] = {};
  int number_of_bytes;
  number_of_bytes =
      unibrow::Utf8::Encode(s, c, unibrow::Utf16::kNoPreviousCharacter, false);
  for (int k = 0; k < number_of_bytes; k++) {
    AddEncodedOctetToBuffer(s[k], buffer);
  }
}

void EncodePair(uc16 cc1, uc16 cc2, std::vector<uint8_t>* buffer) {
  char s[4] = {};
  int number_of_bytes =
      unibrow::Utf8::Encode(s, unibrow::Utf16::CombineSurrogatePair(cc1, cc2),
                            unibrow::Utf16::kNoPreviousCharacter, false);
  for (int k = 0; k < number_of_bytes; k++) {
    AddEncodedOctetToBuffer(s[k], buffer);
  }
}
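A possible Rust analogue of those V8 helpers, as a sketch only: the names `push_encoded_octet`, `encode_code_point` and `combine_surrogate_pair` are hypothetical, and it leans on the standard library's `char::from_u32` and `char::encode_utf8` instead of a hand-written UTF-8 encoder.

```rust
/// Appends `%XX` (uppercase hex) for one octet, like AddEncodedOctetToBuffer.
fn push_encoded_octet(octet: u8, out: &mut String) {
    const HEX: &[u8; 16] = b"0123456789ABCDEF";
    out.push('%');
    out.push(HEX[(octet >> 4) as usize] as char);
    out.push(HEX[(octet & 0x0F) as usize] as char);
}

/// Combines a UTF-16 surrogate pair into one code point.
/// Assumes `lead` is in 0xD800..=0xDBFF and `trail` in 0xDC00..=0xDFFF;
/// the caller is expected to have checked this.
fn combine_surrogate_pair(lead: u16, trail: u16) -> u32 {
    0x10000 + ((u32::from(lead) - 0xD800) << 10) + (u32::from(trail) - 0xDC00)
}

/// Percent-encodes every UTF-8 byte of one code point into `out`.
/// Returns false (and writes nothing) if `cp` is not a Unicode scalar value,
/// e.g. a lone surrogate; a caller would turn that into a URIError.
fn encode_code_point(cp: u32, out: &mut String) -> bool {
    match char::from_u32(cp) {
        Some(c) => {
            let mut buf = [0u8; 4];
            for &byte in c.encode_utf8(&mut buf).as_bytes() {
                push_encoded_octet(byte, out);
            }
            true
        }
        None => false,
    }
}
```

This covers both the single and the pair case with one function: for a pair, call `combine_surrogate_pair` first and pass the result to `encode_code_point`.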

e: my answer may lie here https://github.com/nodejs/node/blob/e46c680bf2b211bbd52cf959ca17ee98c7f657f5/deps/v8/src/strings/unicode-inl.h#L144 still sorting through things

Oct 03 '21 17:10 jtara1

Would it be worth looking into #736 again? It seems that the implementation will be pretty closely tied to the encoding

Oct 03 '21 19:10 joshwd36

Would it be worth looking into #736 again? It seems that the implementation will be pretty closely tied to the encoding

Yeah, maybe. We could reevaluate if there's a way to adapt our regex to consume UTF-16, then change the definition of JsString.

Oct 05 '21 00:10 jedel1043