How to pass custom delimiters, dictionary and non-dictionary schemas

Open kavirajk opened this issue 4 years ago • 6 comments

According to the paper, we can pass the following configs to CLP:

  1. delimiters
  2. dictionary_variables
  3. non_dictionary_variables

But AFAIU, there is currently no way to pass these to clg and clp.

Can you tell me if I'm missing anything? Thanks

kavirajk avatar Sep 03 '21 19:09 kavirajk

Hey @kavirajk,

Right, there’s currently no easy way to specify them. Full schema support is in our open-sourcing pipeline. We are finishing up optimization & testing. Stay tuned!

kirkrodrigues avatar Sep 04 '21 03:09 kirkrodrigues

Thanks @kirkrodrigues for the information!

So basically, clp-core currently uses a kind of static delimiter check and variable check, as done here and here respectively. Correct?

I also tested with your hadoop dataset. The compression ratio is very impressive! Great work! Looking forward to the full schema support!

kavirajk avatar Sep 04 '21 03:09 kavirajk

Yup, that's correct.

Nice, thanks for trying it out! We'll post here when full schema support is open-sourced.

kirkrodrigues avatar Sep 05 '21 03:09 kirkrodrigues

Hi @kirkrodrigues, as mentioned in the paper, the variable schema is pre-defined by the user. Are regexes like the ones shown below currently written manually by the developer? I can't find any part of the code that checks for these schemas. I understand schema support is not open-source yet, but can you please point out how this is currently done?

dictionary_variables:
"task_\d+"                             # Task ID
"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"   # IP
"container_\d+"                        # Container ID
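
To make the question concrete, here is a minimal Python sketch of how I would expect a regex-based schema like this to be applied to a token. The names `DICTIONARY_VARIABLES` and `classify_token` are hypothetical and do not come from the CLP code base; the patterns are the three from the paper's example.

```python
import re

# Hypothetical schema table: (label, compiled regex) pairs taken from the
# dictionary_variables example in the paper.
DICTIONARY_VARIABLES = [
    ("task_id", re.compile(r"task_\d+")),
    ("ip", re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")),
    ("container_id", re.compile(r"container_\d+")),
]


def classify_token(token):
    """Return the label of the first schema matching the whole token, else None."""
    for label, pattern in DICTIONARY_VARIABLES:
        if pattern.fullmatch(token):
            return label
    return None


if __name__ == "__main__":
    for token in ["task_12", "10.0.0.1", "container_7", "hello"]:
        print(token, "->", classify_token(token))
```

This only illustrates the lookup I had in mind; I could not find an equivalent of it in the current code.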

BasantaChaulagain avatar Dec 16 '21 15:12 BasantaChaulagain

Hi @BasantaChaulagain,

Currently, we have a few schemas implemented implicitly (i.e., the logic is not implemented as a regular expression but as a set of conditions) in the code. Roughly, the logic works as follows:

  • EncodedVariableInterpreter::encode_and_add_to_dictionary repeatedly calls LogTypeDictionaryEntry::parse_next_var to parse every variable in a message.
  • LogTypeDictionaryEntry::parse_next_var calls get_bounds_of_next_var to parse the next variable in the message.
  • get_bounds_of_next_var iterates over the message and gets the bounds of the next token in the message (a token starts with a non-delimiter (!is_delim) and ends before the next delimiter (is_delim)).
  • get_bounds_of_next_var then runs a few checks to determine if the token should be treated as a variable according to the implicit schemas.
  • Once the control-flow returns to EncodedVariableInterpreter::encode_and_add_to_dictionary, it will try to convert the given variable into a non-dictionary variable (e.g., convert_string_to_representable_integer_var). If it cannot be converted, it will be treated as a dictionary variable.
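
As a rough illustration of the flow above, here is a simplified Python sketch. It is not the actual clp-core code (which is C++): the delimiter set, the "token contains a digit" check, and the `<int>`/`<dict>` placeholders are assumptions made for the example, and only the function names `is_delim` and `get_bounds_of_next_var` mirror the real ones.

```python
import re

# Assumed delimiter set, for illustration only.
DELIMITERS = set(" \t\r\n:,=")
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def is_delim(ch):
    return ch in DELIMITERS


def get_bounds_of_next_var(msg, pos):
    """Return (begin, end) of the next token treated as a variable, or None."""
    n = len(msg)
    while pos < n:
        while pos < n and is_delim(msg[pos]):      # skip delimiters
            pos += 1
        if pos >= n:
            return None
        begin = pos
        while pos < n and not is_delim(msg[pos]):  # consume one token
            pos += 1
        # Assumed stand-in for the implicit schema checks: any digit => variable.
        if any(ch.isdigit() for ch in msg[begin:pos]):
            return begin, pos
    return None


def to_representable_int(token):
    """Return the token as an int if it is a plain 64-bit integer, else None."""
    if re.fullmatch(r"-?[0-9]+", token) is None:
        return None
    value = int(token)
    return value if INT64_MIN <= value <= INT64_MAX else None


def encode_message(msg):
    """Split msg into a logtype, non-dictionary (int) vars, and dictionary vars."""
    logtype, encoded_vars, dict_vars = [], [], []
    pos = 0
    while True:
        bounds = get_bounds_of_next_var(msg, pos)
        if bounds is None:
            logtype.append(msg[pos:])
            break
        begin, end = bounds
        logtype.append(msg[pos:begin])
        value = to_representable_int(msg[begin:end])
        if value is not None:
            logtype.append("<int>")
            encoded_vars.append(value)
        else:
            logtype.append("<dict>")
            dict_vars.append(msg[begin:end])
        pos = end
    return "".join(logtype), encoded_vars, dict_vars


if __name__ == "__main__":
    print(encode_message("Task task_12 assigned to container 42 at 10.0.0.1"))
```

In this sketch, a token that parses as a 64-bit integer is stored directly as a non-dictionary variable, and every other variable token falls back to the dictionary, matching the last bullet above.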

Encoding a query is a little bit more complicated since we need to handle wildcards (admittedly, the logic is a bit messy). To see how it works, I would start from Grep::process_raw_query.

We are working hard to try and get an easy-to-use version of schema support open-sourced so that the logic won't be so complicated. I hoped we would be done by now, but development always takes longer than we expect.

kirkrodrigues avatar Dec 17 '21 09:12 kirkrodrigues

Thank you for your response @kirkrodrigues. Good to hear that development is in progress.

BasantaChaulagain avatar Dec 17 '21 13:12 BasantaChaulagain