
Bag of words - what is the delimiter?

Open Overload119 opened this issue 3 years ago • 3 comments

Consider a table:

target  words
1       This, That, And The Other
0       This
1       And The Other, That

Am I using the commas to infer the bag of words correctly?

Overload119, Dec 20 '22 00:12

The tokenizer will tokenize the string in the following way:

words                        tokens
This, That, And the Other    this , that , and the other

It does not split the text on a comma delimiter; each comma becomes a token of its own.

If you instead want three tokens ("This", "That", and "And The Other"), I suggest preprocessing that column and passing in text that has already been feature engineered.
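To make the difference concrete, here is a minimal Python sketch. `approximate_tokens` is a hypothetical helper that only imitates the output shown in the table above (lowercased words, punctuation as separate tokens); it is not ModelFox's actual tokenizer. `comma_items` is a hypothetical helper showing the comma-delimited split the question was asking about.

```python
import re


def approximate_tokens(text: str) -> list[str]:
    # Rough imitation of the tokens shown in the table above:
    # lowercase the text, then emit each word and each punctuation
    # mark as its own token. This is NOT ModelFox's real tokenizer.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def comma_items(text: str) -> list[str]:
    # The behavior the question expected: split on commas and trim
    # the surrounding whitespace from each item.
    return [item.strip() for item in text.split(",") if item.strip()]


print(approximate_tokens("This, That, And the Other"))
# ['this', ',', 'that', ',', 'and', 'the', 'other']
print(comma_items("This, That, And The Other"))
# ['This', 'That', 'And The Other']
```

The first call yields seven tokens, including the two commas; the second yields the three items the question had in mind.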

isabella, Dec 20 '22 19:12

Do you have an example of how that would work? How can I pass text in any other way in the column?

Overload119, Dec 21 '22 01:12

You would need to preprocess your CSV using another tool. Alternatively, you can use an enum column by using a custom config file as described here: https://www.modelfox.dev/docs/guides/train_with_custom_configuration.
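As one possible form of that preprocessing, the sketch below uses Python's standard `csv` module to replace the comma-delimited `words` column with one indicator column per distinct item. The `has_...` column names, the "yes"/"no" values, and the `expand_words` helper are all assumptions made for illustration; how ModelFox infers the types of the resulting columns is not covered here.

```python
import csv
import io

# The table from the question, written as CSV text.
RAW_CSV = '''target,words
1,"This, That, And The Other"
0,This
1,"And The Other, That"
'''


def expand_words(raw_csv: str, column: str = "words") -> str:
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    # Split each cell on commas and trim whitespace around the items.
    split_rows = [
        {item.strip() for item in row[column].split(",") if item.strip()}
        for row in rows
    ]
    # Every distinct item becomes its own column (hypothetical naming).
    items = sorted(set().union(*split_rows))
    names = {item: "has_" + item.lower().replace(" ", "_") for item in items}

    out = io.StringIO()
    kept = [field for field in rows[0] if field != column]
    writer = csv.DictWriter(
        out, fieldnames=kept + [names[i] for i in items], lineterminator="\n"
    )
    writer.writeheader()
    for row, present in zip(rows, split_rows):
        record = {field: row[field] for field in kept}
        for item in items:
            record[names[item]] = "yes" if item in present else "no"
        writer.writerow(record)
    return out.getvalue()


print(expand_words(RAW_CSV))
```

The result keeps `target` and adds `has_and_the_other`, `has_that`, and `has_this`, so the order of items within a cell no longer matters.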

In the example linked above, the "chest_pain" column is specified as type "enum" with four variants.

{
  "dataset": {
    "columns": [
      {
        "name": "chest_pain",
        "type": "enum",
        "variants": [
          "asymptomatic",
          "atypical angina",
          "non-angina pain",
          "typical angina"
        ]
      },
      ...
    ]
  }
}

For your dataset, you would specify that the words column is an enum with 3 variants: "This", "That", "And The Other".

Then, use the config file by passing --config path/to/config.json on the CLI.
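For the dataset in the question, the config could be generated with a short script like the sketch below. It mirrors only the fields visible in the "chest_pain" example (`name`, `type`, `variants`); `build_config` is a hypothetical helper, and a real config file may need other entries that this thread does not show. The script only builds the JSON text and does not check that the cells in the CSV match the listed variants.

```python
import json


def build_config(column_name: str, variants: list[str]) -> dict:
    # Same shape as the "chest_pain" example: a single enum column
    # with an explicit list of variants.
    return {
        "dataset": {
            "columns": [
                {
                    "name": column_name,
                    "type": "enum",
                    "variants": list(variants),
                }
            ]
        }
    }


config = build_config("words", ["This", "That", "And The Other"])
config_text = json.dumps(config, indent=2)
print(config_text)
```

Saving the printed text to a file gives the path to pass with --config.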

isabella, Dec 21 '22 16:12