Some questions/comments on Categorical Predictors
- The example with each agent working with a single customer type introduced in 5.2:
  - I think the row-wise sum comment could use some clarification; it's the sum among agents with a given customer type, and the single customer type column?
  - Later, in 5.4.3, the example is reused, but I think the language is stronger: "agent was aliased with the customer type" to me means there's a one-to-one correspondence rather than the many-to-one relationship I think the original example implied. And in a one-to-one relationship, the effect encodings will end up being identical, so the argument fails. Separately: can we add a ref-link?
- Figure 5.1 typo "distirbution"
- In 5.4, I would expect to see some mention of coarsening the categories according to domain knowledge (e.g. states into regions). Maybe also model-based coarsening that uses other predictors?
- The Cerda & Varoquaux citation seems to deal more with encodings that take the string nature of the predictor into account, with a hint of natural language processing to it.
- In 5.4.2, I'm not sure whether adding a `-1` to the hashing values leads to "fewer collisions"; it depends on what exactly you mean by a collision, and I'm not familiar enough with the cryptography literature to say. But in a parametric model, it's still enforcing some arbitrary constraint.
- The intro to 5.3.2 says "different" supervised tool, but it's the only supervised tool in the chapter.
- In 5.5, I'd like a small note about integer-encoding the values being reasonable for certain models. (Again, "will be discussed more later", but a preview would be nice.)
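
To make the 5.4.3 point concrete, here is a minimal sketch of what I mean. The data, the names, and the `effect_encode` helper are all made up for illustration (they are not from the book), and I'm using a plain per-category mean rather than whatever shrinkage the chapter's effect encoding applies. The conclusion should not depend on that simplification, since under a one-to-one mapping the two predictors define exactly the same groups.

```python
from collections import defaultdict


def effect_encode(categories, outcomes):
    """Replace each category with the mean outcome observed for it.

    Hypothetical helper: a plain group mean with no shrinkage.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, outcomes):
        totals[cat] += y
        counts[cat] += 1
    return [totals[cat] / counts[cat] for cat in categories]


# Toy outcomes for five rows (made-up numbers).
y = [1.0, 2.0, 5.0, 6.0, 10.0]
agents = ["a1", "a1", "a2", "a2", "a3"]

# Many-to-one: agents a1 and a2 both work with customer type X.
types_many_to_one = ["X", "X", "X", "X", "Y"]
# One-to-one: every agent has their own customer type.
types_one_to_one = ["X", "X", "Y", "Y", "Z"]

agent_enc = effect_encode(agents, y)
type_enc_many = effect_encode(types_many_to_one, y)
type_enc_one = effect_encode(types_one_to_one, y)

print(agent_enc)      # [1.5, 1.5, 5.5, 5.5, 10.0]
print(type_enc_many)  # [3.5, 3.5, 3.5, 3.5, 10.0]
print(type_enc_one)   # [1.5, 1.5, 5.5, 5.5, 10.0]
```

In the many-to-one case the two encoded columns differ, so they can carry different information. In the one-to-one case they are the same column, which is why I think the stronger "aliased" wording undercuts the argument.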