# Nonparametric estimation of customer segments with panel data

• Nicht-parametrische Schätzung von Kundensegmenten mit Paneldaten

Jörg, Johannes Ferdinand; Wentzel, Daniel (Thesis advisor); Cleophas, Eva Catherine (Thesis advisor)

Aachen : RWTH Aachen University (2021)
Dissertation / PhD Thesis

Dissertation, Rheinisch-Westfälische Technische Hochschule Aachen, 2021

Abstract

The increasing availability of large data sets collected by firms raises the need to automate their evaluation. This thesis presents an approach to handling large sets of panel data, i.e., repeated observations of the same individuals over several time periods, as they are frequently collected in retail or marketing. We develop a nonparametric estimation procedure and evaluate it on artificial as well as real-world data sets. The aim of the estimation is an automated market or customer segmentation algorithm to support optimization procedures and decision making.

To achieve this aim, we split the estimation procedure into two main parts: the estimation of the number of segments and the estimation of the structure of the segments. While the two parts can be applied independently, they are designed to work together: the output of the first part can be used as the input for the second. Both are based on finite mixture models, which represent a general way to model probabilities for populations with sub-populations. In an additional step, we use the results of the estimation procedure to retrospectively assign customers to their respective segments. To demonstrate its practical application, the procedure is used in a simulation calibration to initialize a genetic algorithm.

The thesis comprises three scientific essays: estimating the number of segments, estimating the structure of segments, and a practical application in the area of revenue management. Each essay contains a computational study in which we evaluate the performance of the proposed algorithms. To achieve comparable and meaningful results, we use artificially generated data sets as well as real-world data. The following paragraphs briefly describe the content of each essay.

The estimation of the number of segments can be done on relatively sparse data sets, i.e., it requires panel data over two time periods.
We employ a rank estimation for the corresponding observation matrix, which yields a lower bound for the number of segments. We show that in artificially generated data sets, this lower bound is a suitable estimate of the actual number. The results are compared to validation indices of $k$-means clustering algorithms. Two different data availability settings are compared: uncensored and censored data. While uncensored means that the recorded data reflects the underlying preferences, censored data may be influenced by several external factors and thus does not perfectly reflect customer choices in an unrestricted setting. For censored data, we evaluate several heuristics to impute missing data points or to unconstrain data from availability restrictions.

Following the estimation of the number of segments, we propose an algorithm that estimates the structure within these segments. To do so, the estimation needs a fixed number of segments and panel data over three time periods. We improve upon an existing estimation procedure that has drawbacks regarding discrete attribute settings. Several improvements to the estimation yield a significant increase in estimation performance. Again, we discuss the implications of uncensored and censored observations and further improve the algorithm for censored demand by prorating probabilities. We benchmark the results of the proposed estimation procedure against another nonparametric approach in terms of both quality and computational speed. In addition to a computational study, we also evaluate the hindsight assignment of estimation results to the individual customer observations.

To highlight the general applicability of the proposed estimation framework beyond retail or marketing purposes, we conduct simulation studies in which the combined estimation procedures are applied to calibrate scenarios.
To this end, we use an airline revenue management simulation tool that relies on customer segments to simulate interactions between individuals and an airline revenue management system. As calibrating such scenarios is time-consuming and usually involves manual interaction, a system that uses a target function to automate this process is highly useful. For this study, a genetic algorithm is used to improve the calibration results step by step. Usually, genetic algorithms start the selection process from random initializations. Prefacing the genetic algorithm with the proposed estimation procedure and using its output to initialize the parameters of each customer segment yields significant time savings compared with random initialization.
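
The rank-based lower bound on the number of segments mentioned in the abstract can be illustrated with a small sketch. It is not the thesis's estimator: it assumes that, given the segment, a customer's choices in the two periods are independent draws from that segment's choice distribution, so that the joint frequency matrix of (period-1 choice, period-2 choice) is a weighted sum of one rank-one matrix per segment and its rank cannot exceed the number of segments. The function names, the toy probabilities, and the fixed singular-value threshold are all hypothetical; a formal rank test would replace the threshold in practice.

```python
import numpy as np


def simulate_two_period_choices(weights, choice_probs, n_customers, rng):
    """Simulate each customer's choice in two periods.

    Assumption: given the segment, the two choices are independent draws
    from that segment's choice distribution.
    """
    n_segments, n_options = choice_probs.shape
    segments = rng.choice(n_segments, size=n_customers, p=weights)
    choices = np.empty((n_customers, 2), dtype=int)
    for k in range(n_segments):
        members = segments == k
        choices[members] = rng.choice(
            n_options, size=(int(members.sum()), 2), p=choice_probs[k]
        )
    return choices


def joint_frequency_matrix(choices, n_options):
    """Relative frequency of each (period-1 choice, period-2 choice) pair."""
    counts = np.zeros((n_options, n_options))
    np.add.at(counts, (choices[:, 0], choices[:, 1]), 1.0)
    return counts / counts.sum()


def numerical_rank(matrix, threshold):
    """Count singular values above a hand-picked threshold (hypothetical rule)."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(singular_values > threshold))


# Toy setup: three segments choosing among five options (made-up numbers).
weights = np.array([0.5, 0.3, 0.2])
choice_probs = np.array([
    [0.6, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.6],
])

# Population matrix: sum over segments of weight * outer(p_k, p_k).
population_matrix = (choice_probs.T * weights) @ choice_probs

rng = np.random.default_rng(0)
choices = simulate_two_period_choices(weights, choice_probs, 50_000, rng)
observed_matrix = joint_frequency_matrix(choices, n_options=5)

rank_population = numerical_rank(population_matrix, threshold=1e-10)
rank_observed = numerical_rank(observed_matrix, threshold=0.02)
print(rank_population, rank_observed)
```

In this well-separated toy setup both ranks come out as 3, the true number of segments. With segments whose choice distributions are nearly linearly dependent, or with few observations, the rank found this way can fall below the true number, which is why it is only a lower bound.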