Bachelor thesis project supervision 2024 Q4

Undergraduate course, , 2024

Supervision of the thesis project related to optimal decision trees of five bachelor students.

Optimal Decision Trees for non-linear metrics: A geometric convex hull approach

By: Bogdan-Andrei Bancuta - Thesis

abstract In the pursuit of employing interpretable and performant Machine Learning models, Decision Trees has become a staple in many industries while being able to produce near-optimal results. With computational power becoming more accessible, there has been increasing progress in constructing Optimal Decision Trees. It guarantees optimal solutions with respect to different metrics within a given size limit on training data while requiring a smaller number of nodes and becoming more viable to compute on real-world data. However, non-linear metrics, which are very effective when evaluating trees on imbalanced datasets, still represent a challenge regarding runtime performance and scalability. Previous approaches generate the Pareto Front of the set of possible solutions, an expensive operation in computing the optimal tree. To address this gap, we introduce a novel merging algorithm of two Pareto Fronts using convex hulls, offering better pruning and leading to an increase in scalability. The experiments show a significant improvement in runtime of almost 10\% on bigger datasets and higher-depth trees using the F1-score metric, with the potential to be applied to other convex metrics.

Optimal Survival Trees With the Iterative Breslow Estimator and the Integrated Brier Score Objective

By: Izzy van der Giessen - Thesis

abstract Survival analysis predicts survival functions that give the probability of survival until a given time. Many applications of survival analysis involve health care, which requires interpretability of the models used to predict the survival function. Provably optimal decision trees have shown to be an interpretable alternative to so-called black box models. However, these algorithms often choose estimators that are fast, yet not necessarily most accurate. Moreover, the objective functions of optimal decision tree algorithms tend to make (possibly incorrect) assumptions about the survival function. In this paper, we tackle both problems. We implement the iterative Breslow estimator in an already existing optimal survival tree algorithm in order to iteratively improve the Nelson-Aalen estimator. This approach has great potential, as we show by using it on artificial datasets, but we do not see an improvement in accuracy on real world data. To eliminate the assumptions made by the objective function, we implement the Integrated Brier Score objective, which causes a significant improvement on training accuracy. However, we see no improvement on out-of-sample accuracy

Optimal Cox Survival Trees

By: Matei Mihai Mirica - Thesis

abstract Survival analysis is a branch of statistics concerned with studying and estimating the expected time duration until some event, such as biological death, occurs. Survival distributions are fitted based on historical data, where some instances are censored, meaning that the actual time of the event is not known precisely. Survival trees extend on the classical statistical methods developed and can capture complex non-linear relations between the variables by recursively splitting the instances by generated rules and fitting a different survival distribution in each leaf. Moreover, decision trees are desirable models due to their interpretable nature. We extend existing optimal survival tree methods by considering Cox Proportional Hazard models in each leaf node, which allows us to find more complex yet interpretable relationships than existing methods. The experiments show that our model outperforms state-of-the-art methods for creating survival trees, SurTree, OST, and CTree, especially in determining the relative risks between out-of-sample observations while generating significantly smaller trees.

Optimal Decision Trees for The Algorithm Selection Problem: Balancing Performance and Interpretability

By: Daniël Poolman - Thesis

abstract The Algorithm Selection Problem (ASP) presents a significant challenge in numerous industries, requiring optimal solutions for complex computational problems. Traditional approaches to solving ASP often rely on complex, black-box models like random forests, which are effective but lack transparency, and they often fail to balance performance with interpretability. This paper investigates the performance-interpretability trade-off for the ASP, specifically focused on Optimal Decision Trees (ODTs) as recent innovations have made the use of ODTs more viable. We compare ODTs against 4 other tree-based models, using 11 different datasets. We show there is no apparent tradeoff between performance and interpretability for ODTs which have been trained using an instance cost-sensitive approach, as they achieve comparable performance to a Random Forest Regressor while maintaining interpretability through multiple orders of magnitude fewer leaf nodes.

P-STreeD: A Multithreaded Approach for DP Optimal Decision Trees

By: Albert Sandu - Thesis

abstract Decision trees are valued for their ability to logically and transparently classify data. While heuristic methods to compute such trees are efficient, they often compromise on accuracy, prompting interest in Optimal Decision Trees (ODTs), which have the best misclassification score for a given tree size limit and training dataset. The dynamic programming (DP) approach for ODTs has shown improvements over alternatives such as mixed-integer or constraint programming. That being said, it requires further improvements to handle exponential runtime scaling with the depth of the tree and the number of features of the dataset. Leveraging modern hardware, such as multiple CPU cores, offers a promising solution to improve efficiency. This paper proposes a multithreading method for DP ODTs which we apply to STreeD specifically. We introduce a shared memory model and determine which components of the original program can be made local to the threads. We investigate whether it is more efficient to start multithreading at the root of the search tree than near its leaf nodes. We find the former to be superior, resulting in faster runtimes and less shared resource access. Empirical evaluations against the state of the art demonstrate better runtimes, particularly beneficial for large datasets. Finally, thread scaling analyses reveal substantial speedups, exceeding 2.5 times with four threads, highlighting our approach’s effectiveness for computationally-intensive tasks.