Lecture 8: Categorical features

Munir Hiabu

Some practical considerations Categorical data

  • One point that is often ignored is that in many applications a significant share of the features are categorical.

  • However, all algorithms we have introduced so far implicitly assume that features are numerical.

Some practical considerations Categorical data: One-hot/dummy encoding

  • The most common ways of dealing with categorical features are one-hot encoding and dummy encoding.

Definition (One-hot encoding)

Given a feature \(X\) that can take \(k\) different categorical values, one-hot encoding transforms the feature into \(k\) binary features where \[ X^{(l)}=\mathbb 1(X=l), \quad l=1,\dots,k. \]

Definition (Dummy encoding)

Given a feature \(X\) that can take \(k\) different categorical values, dummy encoding transforms the feature into \(k-1\) binary features where, given a reference category \(l^\ast\), \[ X^{(l)}=\mathbb 1(X=l), \quad l=1,\dots,l^\ast-1,l^\ast+1,\dots,k. \]

  • Dummy encoding avoids the collinearity that one-hot encoding introduces: the \(k\) one-hot columns always sum to one.
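The two definitions can be sketched in plain Python (the category labels and values here are made up for illustration):

```python
def one_hot(values, categories):
    """One-hot encoding: one binary indicator per category (k columns)."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def dummy(values, categories, reference):
    """Dummy encoding: as one-hot, but the reference category is dropped
    (k - 1 columns); the reference is encoded as all zeros."""
    kept = [c for c in categories if c != reference]
    return [[1 if v == c else 0 for c in kept] for v in values]

cats = ["blue", "green", "red"]          # k = 3 categories
x = ["red", "green", "blue", "green"]

oh = one_hot(x, cats)                    # 4 rows, 3 columns
dm = dummy(x, cats, reference="blue")    # 4 rows, 2 columns

# In one-hot encoding every row sums to 1, which is exactly the
# collinearity with an intercept column that dummy encoding removes.
print(all(sum(row) == 1 for row in oh))  # True
```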


  • Both methods can rapidly increase the dimension of the problem if a feature has many categories,
    • e.g. car brand, country, or postcode
  • Grouping variables with similar effect on the response might reduce variance.
    • This could e.g. be done via expert knowledge.
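Such expert-knowledge grouping is typically just a lookup table. A minimal sketch, where the brands and segment labels are hypothetical examples:

```python
# Hypothetical mapping from car brands to coarser segments, e.g. chosen
# by a domain expert; it reduces many categories to a few.
brand_group = {
    "toyota": "volume", "vw": "volume", "ford": "volume",
    "bmw": "premium", "audi": "premium", "mercedes": "premium",
    "ferrari": "luxury", "bentley": "luxury",
}

brands = ["vw", "bmw", "ferrari", "toyota"]
# Unseen brands fall into a catch-all "other" group.
grouped = [brand_group.get(b, "other") for b in brands]
print(grouped)  # ['volume', 'premium', 'luxury', 'volume']
```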

Some practical considerations Categorical data: Partitioning

  • There are \(B_k\) ways to partition a categorical feature with \(k\) values, where \(B_k\) is the \(k\)-th Bell number, given by the recurrence \(B_{k+1} = \sum_{j=0}^{k} \binom{k}{j} B_j\) with \(B_0=1\).
    • \(B_2=2\), \(B_3=5\), \(B_4=15\), \(B_5=52\), \(B_6=203\), \(B_7=877\), \(B_8=4140\).
  • In a CART algorithm, we would only partition a current node into two child nodes.
  • In this case there are \(2^{k - 1}-1\) possible partitions.
  • This is still not feasible for large \(k\).
  • It is well known (Fisher 1958; Wright and König 2019) that if the split is chosen to minimize the Gini index or squared loss, the optimal binary partition can be found by considering only \(k-1\) candidate splits:
    1. Order the feature values by their mean response.
    2. The optimal partition is contiguous in this ordering.
      • Hence one can search for the optimal split treating the categorical feature as numerical, i.e., split into a left and a right partition.
  • Wright and König (2019) propose to order categorical variables only once when fitting a random forest. This is the default in \(\texttt{ranger}\).
  • \(\texttt{lightgbm}\) and, more recently (April 2022), \(\texttt{xgboost}\) offer the option to order variables before every split.
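The ordering trick can be checked on toy data: for squared loss, scanning the \(k-1\) contiguous splits after ordering levels by mean response finds the same optimum as exhaustive search over all \(2^{k-1}-1\) binary partitions. A small sketch with made-up data:

```python
from itertools import combinations
from statistics import mean

# Toy data: categorical feature with k = 5 levels and a numeric response.
data = [("a", 1.0), ("a", 1.2), ("b", 3.1), ("b", 2.9),
        ("c", 0.2), ("c", 0.4), ("d", 5.0), ("d", 5.2), ("e", 3.0)]

def sse(ys):
    """Sum of squared deviations from the mean (squared loss of a node)."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def split_loss(left_levels):
    """Squared loss of the binary split induced by a set of levels."""
    left = [y for x, y in data if x in left_levels]
    right = [y for x, y in data if x not in left_levels]
    return sse(left) + sse(right)

levels = sorted({x for x, _ in data})
k = len(levels)

# Exhaustive search: every nonempty proper subset of levels
# (each split appears twice, as subset and complement; min is unaffected).
best_exhaustive = min(
    split_loss(set(s)) for r in range(1, k) for s in combinations(levels, r)
)

# Fisher's trick: order levels by mean response, then check only
# the k - 1 contiguous splits of that ordering.
by_mean = sorted(levels, key=lambda l: mean(y for x, y in data if x == l))
best_ordered = min(split_loss(set(by_mean[:i])) for i in range(1, k))

print(best_exhaustive == best_ordered)  # True
```

This is exactly why a tree can treat an ordered categorical feature as if it were numerical when searching for a split.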

References

Fisher, Walter D. 1958. “On Grouping for Maximum Homogeneity.” Journal of the American Statistical Association 53 (284): 789–98.
Wright, Marvin N., and Inke R. König. 2019. “Splitting on Categorical Predictors in Random Forests.” PeerJ 7: e6339.