Data quality in the context of classification tasks - Université Clermont Auvergne
Conference paper, Year: 2022

Data quality in the context of classification tasks

Abstract

Data cleaning is an important step in a machine learning process for obtaining the best possible results. The literature is rich and many tools are available, which makes choosing a tool difficult. The objective of our work is to answer the question: is it always best to repair data? We focus on numeric data for classification tasks. We decompose the question into five criteria and propose a metric to measure how difficult a repairing tool is to use. We then study the impact of the degree of data degradation, the type of errors present, the effectiveness of repairing tools, and the choice of classification model. We found that error types such as missing values and outliers have more impact on accuracy and F1 score than other types of errors. Moreover, even though complex repairing tools were generally more effective, there is a point beyond which data is so degraded that the tools no longer perform well. For low levels of errors, the tools tend to perform similarly, so the choice among them can be based on how difficult they are to use.
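The kind of experiment the abstract describes (degrade the data, repair it, train a classifier, measure accuracy and F1 score) can be sketched as follows. This is a minimal illustration only, not the authors' protocol: the synthetic two-class data, the missing-value injection, the mean-imputation repair, the nearest-centroid classifier, and every function name here are assumptions chosen to keep the example self-contained.

```python
import random

def make_data(n, rng):
    """Two Gaussian classes in 2-D (synthetic, purely illustrative)."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 if label else -2.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows

def inject_missing(rows, rate, rng):
    """Degrade: replace each feature value with None with probability `rate`."""
    return [([None if rng.random() < rate else v for v in x], y) for x, y in rows]

def impute_mean(rows):
    """Repair: fill each missing value with the mean of the observed column values."""
    dims = len(rows[0][0])
    means = []
    for j in range(dims):
        observed = [x[j] for x, _ in rows if x[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [([means[j] if v is None else v for j, v in enumerate(x)], y)
            for x, y in rows]

def fit_centroids(rows):
    """Nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label in (0, 1):
        points = [x for x, y in rows if y == label]
        centroids[label] = [sum(col) / len(col) for col in zip(*points)]
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 score, with class 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    acc = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1

def run_experiment(rate, seed=0):
    """Degrade the training set at `rate`, repair it, then score on clean test data."""
    rng = random.Random(seed)
    train = make_data(400, rng)
    test = make_data(200, rng)
    repaired = impute_mean(inject_missing(train, rate, rng))
    model = fit_centroids(repaired)
    y_true = [y for _, y in test]
    y_pred = [predict(model, x) for x, _ in test]
    return accuracy_and_f1(y_true, y_pred)

if __name__ == "__main__":
    for rate in (0.0, 0.2, 0.5, 0.8):
        acc, f1 = run_experiment(rate)
        print(f"missing rate {rate:.1f}: accuracy={acc:.3f}, F1={f1:.3f}")
```

Sweeping `rate` stands in for the degree of degradation; swapping `impute_mean` or `fit_centroids` for other functions would correspond to comparing repairing tools or classification models. Whether errors are injected into the training data only, as assumed here, or into the test data as well is a design choice the abstract does not specify.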
Main file
BDA_2022__Data_quality_in_the_context_of_classification_tasks (1).pdf (431.31 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-03903460, version 1 (16-12-2022)

Identifiers

  • HAL Id: hal-03903460, version 1

Cite

Roxane Jouseau, Chafik Samir, Sébastien Salva. Data quality in the context of classification tasks. 38èmes journées de la conférence BDA « Gestion de Données – Principes, Technologies et Applications », Oct 2022, Clermont-Ferrand, France. ⟨hal-03903460⟩