Implement URI Handling Functions

Open Razican opened this issue 5 years ago • 15 comments

ECMAScript feature: We currently lack implementations of some built-in global functions used for URI encoding/decoding. You can find these functions in the spec here.

The 4 functions to implement would be these: decodeURI(), decodeURIComponent(), encodeURI() and encodeURIComponent().

In those links you can find example code that should work after this implementation.

Implementation tips: We should probably create a new module here for these functions. An example of a built-in global function implementation can be seen here.

Razican avatar Oct 19 '20 12:10 Razican

Hey @Razican , I would like to work on it. Thanks :)

sidntrivedi012 avatar Oct 24 '20 19:10 sidntrivedi012

Hi @sidntrivedi012 how is this going?

Razican avatar Jan 11 '21 14:01 Razican

I will be happy to take a look if @sidntrivedi012 doesn't mind.

captain-yossarian avatar Jan 22 '21 23:01 captain-yossarian

Go ahead @captain-yossarian

tofpie avatar Jan 24 '21 08:01 tofpie

@Razican @tofpie Should/Can I use this percent_encoding crate?

captain-yossarian avatar Jan 25 '21 22:01 captain-yossarian

@Razican @tofpie Should/Can I use this percent_encoding crate?

Hi, yes, if it makes sense to use it, feel free :)

Razican avatar Jan 26 '21 10:01 Razican

Unassigned the issue by @captain-yossarian's request. For anyone interested, feel free to ping us if you want to work on this😁

jedel1043 avatar Sep 20 '21 22:09 jedel1043

I'll take a look at this, please assign to me.

jtara1 avatar Oct 01 '21 23:10 jtara1

@jtara1 I've assigned it to you. Let us know if you need some pointers to get started :)

jedel1043 avatar Oct 01 '21 23:10 jedel1043

I got something working with that crate ^ but I'm trying to just follow their algo directly in the ecma spec now

What are the return values for this function? https://github.com/jtara1/boa/blob/8ba500a26afdad8e200c9990b375664b5c04a97a/boa/src/builtins/string/mod.rs#L36

e: looks like @joshwd36 implemented that, what do the return values of (u32, u8, bool) represent for code_point_at func?

jtara1 avatar Oct 02 '21 00:10 jtara1

code_point_at just represents the operation CodePointAt from the spec. But maybe we could rewrite it to be more idiomatic with descriptive enum returns instead of bools and u8s
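To make the "descriptive enum" idea concrete, here is a minimal sketch of what such a return type could look like. It is not Boa's actual code: the `CodePoint` enum and this standalone `code_point_at` over a `&[u16]` slice are hypothetical names chosen for illustration, and the logic follows the spec's CodePointAt operation, which returns the code point, the number of code units consumed, and whether the unit was an unpaired surrogate.

```rust
/// Hypothetical replacement for a `(u32, u8, bool)` tuple return value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CodePoint {
    /// A valid Unicode scalar value and the number of UTF-16 code units it used (1 or 2).
    Scalar { value: char, code_units: u8 },
    /// A lone leading or trailing surrogate (always exactly one code unit).
    UnpairedSurrogate(u16),
}

/// Sketch of CodePointAt(string, position) over UTF-16 code units.
/// Panics if `position` is out of bounds, mirroring the spec's assertion.
fn code_point_at(units: &[u16], position: usize) -> CodePoint {
    let first = units[position];
    // Not a surrogate at all: the code unit is the code point.
    if !(0xD800..=0xDFFF).contains(&first) {
        let value = char::from_u32(u32::from(first)).expect("non-surrogate unit is a scalar");
        return CodePoint::Scalar { value, code_units: 1 };
    }
    // A trailing surrogate, or a leading surrogate at the very end of the string.
    if first >= 0xDC00 || position + 1 == units.len() {
        return CodePoint::UnpairedSurrogate(first);
    }
    let second = units[position + 1];
    // A leading surrogate that is not followed by a trailing surrogate.
    if !(0xDC00..=0xDFFF).contains(&second) {
        return CodePoint::UnpairedSurrogate(first);
    }
    // Combine the pair: (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000.
    let cp = (u32::from(first) - 0xD800) * 0x400 + (u32::from(second) - 0xDC00) + 0x10000;
    let value = char::from_u32(cp).expect("combined surrogate pair is a scalar");
    CodePoint::Scalar { value, code_units: 2 }
}
```

With this shape, a caller matches on the variant instead of remembering which tuple slot holds the surrogate flag, and a valid code point can be carried as a `char` rather than a raw `u32`.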

jedel1043 avatar Oct 02 '21 01:10 jedel1043

the spec answered my question with

Return the Record { [[CodePoint]], [[CodeUnitCount]], [[IsUnpairedSurrogate]] }.

thanks

jtara1 avatar Oct 02 '21 01:10 jtara1

So far the hardest part of this is figuring out the transformation algo (for singles & pairs) from code point to UTF-8 encoded byte(s). The Wikipedia article at https://en.wikipedia.org/wiki/UTF-8 actually seems like the most thorough explanation of this.
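For reference, the code point to UTF-8 transformation can be sketched by hand in Rust as below. The function name `encode_utf8_manual` is made up for this example; in real code the standard library's `char::encode_utf8` does the same job, so this is only meant to show the bit layout for the 1, 2, 3 and 4 byte forms.

```rust
/// Encodes one Unicode scalar value as UTF-8.
/// Returns a 4-byte buffer and the number of bytes actually used.
fn encode_utf8_manual(c: char) -> ([u8; 4], usize) {
    let cp = c as u32;
    let mut out = [0u8; 4];
    let len = if cp < 0x80 {
        // 0xxxxxxx
        out[0] = cp as u8;
        1
    } else if cp < 0x800 {
        // 110xxxxx 10xxxxxx
        out[0] = 0xC0 | (cp >> 6) as u8;
        out[1] = 0x80 | (cp & 0x3F) as u8;
        2
    } else if cp < 0x10000 {
        // 1110xxxx 10xxxxxx 10xxxxxx
        out[0] = 0xE0 | (cp >> 12) as u8;
        out[1] = 0x80 | ((cp >> 6) & 0x3F) as u8;
        out[2] = 0x80 | (cp & 0x3F) as u8;
        3
    } else {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out[0] = 0xF0 | (cp >> 18) as u8;
        out[1] = 0x80 | ((cp >> 12) & 0x3F) as u8;
        out[2] = 0x80 | ((cp >> 6) & 0x3F) as u8;
        out[3] = 0x80 | (cp & 0x3F) as u8;
        4
    };
    (out, len)
}
```

A "single" is a code point taken from one UTF-16 code unit and a "pair" is one combined from two surrogates; once the code point is known, the byte layout above is the same for both.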

I'm also checking the node.js implementation of this https://github.com/nodejs/node/blob/e46c680bf2b211bbd52cf959ca17ee98c7f657f5/deps/v8/src/strings/uri.cc#L279

and

void AddEncodedOctetToBuffer(uint8_t octet, std::vector<uint8_t>* buffer) {
  buffer->push_back('%');
  buffer->push_back(HexCharOfValue(octet >> 4));
  buffer->push_back(HexCharOfValue(octet & 0x0F));
}

void EncodeSingle(uc16 c, std::vector<uint8_t>* buffer) {
  char s[4] = {};
  int number_of_bytes;
  number_of_bytes =
      unibrow::Utf8::Encode(s, c, unibrow::Utf16::kNoPreviousCharacter, false);
  for (int k = 0; k < number_of_bytes; k++) {
    AddEncodedOctetToBuffer(s[k], buffer);
  }
}

void EncodePair(uc16 cc1, uc16 cc2, std::vector<uint8_t>* buffer) {
  char s[4] = {};
  int number_of_bytes =
      unibrow::Utf8::Encode(s, unibrow::Utf16::CombineSurrogatePair(cc1, cc2),
                            unibrow::Utf16::kNoPreviousCharacter, false);
  for (int k = 0; k < number_of_bytes; k++) {
    AddEncodedOctetToBuffer(s[k], buffer);
  }
}

e: my answer may lie here https://github.com/nodejs/node/blob/e46c680bf2b211bbd52cf959ca17ee98c7f657f5/deps/v8/src/strings/unicode-inl.h#L144 still sorting through things
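A rough Rust analog of the V8 helpers above, as a sketch only: the names `is_unreserved`, `push_percent_encoded` and `encode_uri_component` are hypothetical, the input is assumed to already be a slice of UTF-16 code units, and `None` stands in for the URIError the spec requires on an unpaired surrogate. It leans on the standard library's `std::char::decode_utf16` to combine surrogate pairs and `char::encode_utf8` to produce the octets.

```rust
/// Characters that encodeURIComponent leaves untouched:
/// ASCII letters, decimal digits, and the marks - _ . ! ~ * ' ( )
fn is_unreserved(c: char) -> bool {
    c.is_ascii_alphanumeric() || "-_.!~*'()".contains(c)
}

/// Appends `%XX` for one octet, using uppercase hex digits.
fn push_percent_encoded(byte: u8, out: &mut String) {
    const HEX: &[u8; 16] = b"0123456789ABCDEF";
    out.push('%');
    out.push(HEX[usize::from(byte >> 4)] as char);
    out.push(HEX[usize::from(byte & 0x0F)] as char);
}

/// Sketch of encodeURIComponent over UTF-16 code units.
/// Returns `None` where the spec would throw a URIError (unpaired surrogate).
fn encode_uri_component(units: &[u16]) -> Option<String> {
    let mut out = String::new();
    // decode_utf16 pairs up surrogates and reports lone ones as errors.
    for decoded in std::char::decode_utf16(units.iter().copied()) {
        let c = decoded.ok()?;
        if is_unreserved(c) {
            out.push(c);
        } else {
            // Percent-encode every UTF-8 octet of the code point.
            let mut buf = [0u8; 4];
            for &byte in c.encode_utf8(&mut buf).as_bytes() {
                push_percent_encoded(byte, &mut out);
            }
        }
    }
    Some(out)
}
```

An encodeURI variant would only differ in the set of characters it leaves unescaped, so the unreserved check could become a parameter.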

jtara1 avatar Oct 03 '21 17:10 jtara1

Would it be worth looking into #736 again? It seems that the implementation will be pretty closely tied to the encoding

joshwd36 avatar Oct 03 '21 19:10 joshwd36

Would it be worth looking into #736 again? It seems that the implementation will be pretty closely tied to the encoding

Yeah, maybe. We could reevaluate if there's a way to adapt our regex to consume UTF-16, then change the definition of JsString.
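To illustrate why the string representation matters here, a small sketch under stated assumptions: `Utf16String` is a hypothetical stand-in for a UTF-16 backed string type, not Boa's actual `JsString`. A JS string may contain a lone surrogate, which a Rust `String` cannot represent, and the URI functions have to detect exactly that case.

```rust
/// Hypothetical UTF-16 backed string, used only for this illustration.
struct Utf16String(Vec<u16>);

impl Utf16String {
    /// Strict conversion: fails if the string contains an unpaired surrogate.
    fn to_std_string(&self) -> Option<String> {
        String::from_utf16(&self.0).ok()
    }

    /// Lossy conversion: unpaired surrogates become U+FFFD, so the
    /// information the URI functions need to raise an error is lost.
    fn to_std_string_lossy(&self) -> String {
        String::from_utf16_lossy(&self.0)
    }
}
```

With UTF-16 storage the unpaired surrogate is still visible when encoding starts, whereas a UTF-8 `String` has to either reject or replace it at construction time.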

jedel1043 avatar Oct 05 '21 00:10 jedel1043