|Abstract:|| Logistic Regression (LR) is a widely used statistical method in empirical studies across many research fields. However, real-life scenarios often share complexities that hinder the application of the as-is model. First and foremost, high-order interactions often need to be included to capture the variability of the data. Moreover, these studies are often conducted in imbalanced settings, with datasets growing wider, sample sizes
ranging from very large to extremely small, and a strong need for interpretable models and results.
In this paper we present a novel algorithm, High-Order Interaction Learning via targeted Pattern search (HOILP), which selects interaction terms of varying order to include in an LR model for
an imbalanced binary classification task with categorical input data. HOILP's rationale is built on the duality between item sets and categorical interactions. The algorithm is composed of
(i) an interaction learning step based on a well-known frequent item set mining algorithm and (ii) a novel dissimilarity-based interaction selection step that lets the user control
the number of interactions included in the LR model. In addition to HOILP, we present two variants (Scores HOILP and Clusters HOILP) that can suit even more specific needs.
Through a set of experiments we validate our algorithm and demonstrate its wide applicability to real-life research scenarios, surpassing the performance of a benchmark state-of-the-art