Some questions/comments on Categorical Predictors

Open bmreiniger opened this issue 9 months ago • 0 comments

  1. The example with each agent working with a single customer type introduced in 5.2:
    1. I think the row-wise sum comment could use some clarification: is it the sum over the agents with a given customer type, compared against the single column for that customer type?
    2. Later, in 5.4.3, the example is reused, but I think the language is stronger: "agent was aliased with the customer type" to me means there's a one-to-one correspondence, rather than the many-to-one relationship I think the original implied. In a one-to-one relationship, the effect encodings will end up being identical, so the argument fails. Separately: can we add a ref-link?
  2. Figure 5.1 typo: "distirbution" should be "distribution".
  3. In 5.4, I would expect to see some mention of coarsening the categories according to domain knowledge (e.g. states into regions). Maybe also model-based coarsening that uses other predictors?
  4. The Cerda & Varoquaux citation seems to deal more with encodings that take the string nature of the predictor into account, with a hint of natural language processing to it.
  5. In 5.4.2, I'm not sure whether adding a -1 to the hashing values leads to "fewer collisions"; it depends on what exactly you mean by a collision, and I'm not familiar enough with the cryptography literature to say. But in a parametric model, it's still enforcing some arbitrary constraint.
  6. The intro to 5.3.2 says "different" supervised tool, but it's the only supervised tool in the chapter.
  7. In 5.5, I'd like a small note about integer-encoding the values being reasonable for certain models. (Again, "will be discussed more later", but a preview would be nice.)
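
To make point 1.2 concrete, here is a minimal sketch of why the one-to-one reading matters. It uses a naive effect encoding (each level replaced by its mean outcome, with no shrinkage), which is an assumption on my part and may not match the book's estimator. The data, the agent and customer type labels, and the `effect_encode` helper are all hypothetical.

```python
from collections import defaultdict


def effect_encode(categories, outcomes):
    """Naive effect encoding: replace each level with its mean outcome.

    Hypothetical helper for illustration only (no shrinkage or partial pooling).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(categories, outcomes):
        totals[level] += y
        counts[level] += 1
    means = {level: totals[level] / counts[level] for level in totals}
    return [means[level] for level in categories]


# Made-up outcomes for six bookings.
outcomes = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0]

# One-to-one: every agent has exactly one customer type and vice versa.
agent_one_to_one = ["a1", "a1", "a2", "a2", "a3", "a3"]
ctype_one_to_one = ["retail", "retail", "corporate", "corporate", "student", "student"]

# Many-to-one: agents a1 and a2 both work only with "retail" customers.
agent_many_to_one = ["a1", "a1", "a2", "a2", "a3", "a3"]
ctype_many_to_one = ["retail", "retail", "retail", "retail", "corporate", "corporate"]

# Under a one-to-one mapping the two encoded columns coincide.
same = effect_encode(agent_one_to_one, outcomes) == effect_encode(ctype_one_to_one, outcomes)

# Under a many-to-one mapping the agent encoding still varies within a customer type.
differ = effect_encode(agent_many_to_one, outcomes) != effect_encode(ctype_many_to_one, outcomes)

print(same, differ)
```

In the one-to-one case the agent and customer type columns carry the same information after encoding, so only the many-to-one setup leaves the two encodings distinguishable.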

bmreiniger, Apr 30 '24 14:04