Data quality in the context of classification tasks

Roxane Jouseau; Chafik Samir; Sébastien Salva

Conference paper. Year: 2022

Roxane Jouseau

Role: Author
PersonId: 1166286
IdHAL: roxane-jouseau
ORCID: 0000-0003-0893-9293

Laboratoire d'Informatique, de Modélisation et d'Optimisation des Systèmes

Chafik Samir

Role: Author
PersonId: 15306
IdHAL: chafik-samir
ORCID: 0000-0003-0619-5040
IdRef: 187174601

Laboratoire d'Informatique, de Modélisation et d'Optimisation des Systèmes

Sébastien Salva

Role: Author
PersonId: 1026227

Laboratoire d'Informatique, de Modélisation et d'Optimisation des Systèmes

Abstract

Data cleaning is an important step in a machine learning pipeline for obtaining the best possible results. The literature is rich and many tools are available, which makes choosing a tool difficult. The objective of our work is to answer the question: is it always best to repair data? We focus on numeric data for classification tasks. We decompose the question into five criteria and propose a metric that measures how difficult a repairing tool is to use. We then studied the impact of the degree of data degradation, the types of errors present, the effectiveness of repairing tools, and the choice of classification model. We found that error types such as missing values and outliers have a greater impact on accuracy and F1 score than other error types. Moreover, even though complex repairing tools were generally more effective, there is a point beyond which data is so degraded that the tools no longer perform well. For low error levels, the tools tend to perform similarly, so the choice between them can be made according to how difficult they are to use.
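The kind of experiment the abstract describes (degrade numeric data, repair it, then measure accuracy and F1 score) can be illustrated with a minimal sketch. It is not the authors' protocol: the synthetic two-class data, the mean-imputation repair, the nearest-centroid classifier, and every function name below are assumptions chosen only to keep the example self-contained.

```python
import random


def make_data(n, seed=0):
    """Synthetic two-class numeric data (assumption): two Gaussian blobs in 2-D."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate labels so both classes stay balanced
        center = 3.0 * label
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y


def inject_missing(X, rate, seed=1):
    """Degrade the data: blank each cell (None) with probability `rate`."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in X]


def column_means(X):
    """Mean of the observed values in each column (0.0 if none observed)."""
    means = []
    for j in range(len(X[0])):
        seen = [row[j] for row in X if row[j] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return means


def repair(X, means):
    """Hypothetical repair tool: fill each missing cell with its column mean."""
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in X]


def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, X):
    """Assign each row to the class whose centroid is closest."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda c: dist2(centroids[c], x)) for x in X]


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Accuracy and F1 score for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1


def run_experiment(rate, n=400, n_train=300):
    """Degrade, repair, train, and evaluate at one missing-value rate."""
    X, y = make_data(n)
    X = inject_missing(X, rate)
    X_train, y_train = X[:n_train], y[:n_train]
    X_test, y_test = X[n_train:], y[n_train:]
    means = column_means(X_train)  # repair statistics come from training data only
    model = fit_centroids(repair(X_train, means), y_train)
    return accuracy_and_f1(y_test, predict(model, repair(X_test, means)))


if __name__ == "__main__":
    for rate in (0.0, 0.2, 0.5, 0.8):
        acc, f1 = run_experiment(rate)
        print(f"missing rate {rate:.1f}: accuracy={acc:.3f}, F1={f1:.3f}")
```

Running the script prints accuracy and F1 score for each missing-value rate. Replacing `repair` with another strategy, or `inject_missing` with another error type such as outliers, would allow different tools and degradations to be compared under the same setup.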

Keywords

data quality; data repairing; data errors; classification

Domains

Computer Science [cs]; Machine Learning [cs.LG]

Main file

BDA_2022__Data_quality_in_the_context_of_classification_tasks (1).pdf (431.31 KB)

Origin: Files produced by the author(s)

https://uca.hal.science/hal-03903460

Submitted on: Friday, 16 December 2022, 13:48:34

Last modified on: Tuesday, 18 April 2023, 10:00:06

Dates and versions

hal-03903460, version 1 (16-12-2022)

Identifiers

HAL Id: hal-03903460, version 1

Cite

Roxane Jouseau, Chafik Samir, Sébastien Salva. Data quality in the context of classification tasks. 38èmes journées de la conférence BDA « Gestion de Données – Principes, Technologies et Applications », Oct 2022, Clermont-Ferrand, France. ⟨hal-03903460⟩

Collections

PRES_CLERMONT CNRS LIMOS CLERMONT-AUVERGNE-INP
