Abstract: At the time of writing, an ever-increasing amount of data is collected every day, with the volume of such generated records estimated to be doubling every two years. Datasets are becoming massive in size and substantially more complex in nature. Nevertheless, this abundance of "raw information" comes at a price: wrong measurements, data-entry errors, breakdowns of automatic collection systems and several other causes may ultimately undermine the overall data quality. The talk will present novel methodologies for performing reliable inference, within the model-based classification and clustering framework, in the presence of contaminated data. First, a discriminant analysis method for anomaly and novelty detection will be introduced, with the final aim of discovering label noise, outliers and unobserved classes in a semi-supervised context. Second, two robust variable selection methods, which effectively perform high-dimensional discrimination in an adulterated scenario, will be discussed.
Joint work with Francesca Greselin and Thomas Brendan Murphy.