Feature Selection Methods for Early Predictive Biomarker Discovery Using Untargeted Metabolomic Data

Dhouha Grissa; Mélanie Pétéra; Marion Brandolini; Amedeo Napoli; Blandine Comte; Estelle Pujos-Guillot

doi:10.3389/fmolb.2016.00030

Journal article, Frontiers in Molecular Biosciences, Year: 2016


Dhouha Grissa

Role: Author
PersonId : 771971
IdRef : 179415352

Unité de Nutrition Humaine

Knowledge representation, reasoning

Mélanie Pétéra

Role: Author
PersonId : 742819
IdHAL : melanie-petera

Unité de Nutrition Humaine

Marion Brandolini

Role: Author
PersonId : 753999
IdHAL : marion-brandolini
ORCID : 0000-0003-0131-8216

Unité de Nutrition Humaine

Amedeo Napoli

Role: Author
PersonId : 743383
IdHAL : amedeo-napoli
IdRef : 034282297

Knowledge representation, reasoning

Blandine Comte

Role: Author
PersonId : 736569
IdHAL : blandine-comte
ORCID : 0000-0002-4662-6581
IdRef : 153377720

Unité de Nutrition Humaine

Estelle Pujos-Guillot

Role: Author
PersonId : 741275
IdHAL : estelle-pujos-guillot
ORCID : 0000-0002-4693-5712
IdRef : 08151557X

Unité de Nutrition Humaine

Abstract

Untargeted metabolomics is a powerful phenotyping tool for better understanding the biological mechanisms involved in the development of human pathologies and for identifying early predictive biomarkers. This approach, based on multiple analytical platforms, such as mass spectrometry (MS), and on chemometrics and bioinformatics, generates massive and complex data that require appropriate analyses to extract biologically meaningful information. Despite the various tools available, handling such large and noisy datasets with a limited number of individuals without risking overfitting remains a challenge. Moreover, when the objective is to identify early predictive markers of a clinical outcome a few years before its occurrence, it becomes essential to use appropriate algorithms and workflows to detect subtle effects within this large amount of data. In this context, this work studies a workflow describing the general feature selection process, using knowledge discovery and data mining methodologies to propose advanced solutions for predictive biomarker discovery. The strategy focused on evaluating a combination of numeric and symbolic approaches for feature selection, with the objective of obtaining the best combination of metabolites producing an effective and accurate predictive model. Relying first on numerical approaches, especially machine learning methods (SVM-RFE, RF, RF-RFE) and univariate statistical analyses (ANOVA), a comparative study was performed on an original metabolomic dataset and on reduced subsets. Leave-one-out cross-validation (LOOCV) was applied as the resampling method to minimize the risk of overfitting. The best k features obtained with the different importance scores from the combination of these approaches were compared, and the stability of the variables was determined using Formal Concept Analysis.
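As a rough illustration of how Formal Concept Analysis can expose variable stability, the sketch below treats each selection method's top-k list as an attribute and each feature as an object, then enumerates the formal concepts. The feature names (`m1` to `m6`), the memberships, and the helper functions are hypothetical and are not taken from the article; this is a minimal sketch, not the authors' implementation.

```python
from itertools import combinations

# Hypothetical top-k selections: method name -> set of selected feature ids.
# The feature names and memberships are invented for illustration only.
selections = {
    "RF-Gini": {"m1", "m2", "m3", "m5"},
    "ANOVA": {"m1", "m2", "m4", "m5"},
    "SVM-RFE": {"m1", "m3", "m5", "m6"},
}


def extent(methods, selections, all_features):
    """Features selected by every method in `methods` (all features if empty)."""
    result = set(all_features)
    for m in methods:
        result &= selections[m]
    return result


def intent(features, selections):
    """Methods whose selection contains every feature in `features`."""
    return {m for m, chosen in selections.items() if features <= chosen}


def formal_concepts(selections):
    """Enumerate all (extent, intent) pairs by closing every subset of methods."""
    all_features = set().union(*selections.values())
    methods = sorted(selections)
    concepts = set()
    for r in range(len(methods) + 1):
        for subset in combinations(methods, r):
            ext = extent(subset, selections, all_features)
            itt = intent(ext, selections)
            concepts.add((frozenset(ext), frozenset(itt)))
    return concepts


concepts = formal_concepts(selections)

# List concepts supported by the most methods first: their extents are the
# features on which the most methods agree.
for ext, itt in sorted(concepts, key=lambda c: (-len(c[1]), sorted(c[0]))):
    print(sorted(itt), "->", sorted(ext))
```

In this toy context, the concept whose intent contains all three methods groups the features that every method selected, which is one simple way to read "stability" from the lattice. Brute-force closure over all subsets of methods is only practical for a handful of methods.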
The results showed the value of combining RF-Gini with ANOVA for feature selection, as these two complementary methods selected the 48 best candidates for prediction. Applying linear logistic regression to this reduced dataset gave the best performance in terms of prediction accuracy and number of false positives, with a model including the 5 top variables. These results therefore highlight the value of feature selection methods and the importance of working on reduced datasets for identifying predictive biomarkers from untargeted metabolomics data.
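To make the idea of univariate ANOVA ranking under leave-one-out resampling concrete, the following sketch ranks features by their one-way ANOVA F statistic inside each leave-one-out fold and counts how often each feature lands in the top k. The toy matrix `X`, the labels `y`, and the function names are assumptions made for illustration; the sketch covers only the ANOVA part and does not reproduce the RF-Gini scoring or the logistic regression model reported in the abstract.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature across the label groups."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    grand = mean(values)
    n, k = len(values), len(groups)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_k_by_anova(X, y, k):
    """Indices of the k features with the largest F statistic."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def loo_selection_counts(X, y, k):
    """Count how many leave-one-out folds place each feature in the top k."""
    counts = [0] * len(X[0])
    for i in range(len(X)):
        # Hold out sample i and rank features on the remaining samples only.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        for j in top_k_by_anova(X_train, y_train, k):
            counts[j] += 1
    return counts


# Toy data (invented): 6 samples, 3 features, 2 classes.
X = [
    [1.0, 5.0, 2.0],
    [1.2, 3.0, 2.5],
    [0.9, 4.0, 1.8],
    [3.0, 4.5, 2.6],
    [3.1, 3.5, 3.0],
    [2.9, 5.2, 2.9],
]
y = [0, 0, 0, 1, 1, 1]

counts = loo_selection_counts(X, y, k=2)
print(counts)
```

Repeating the ranking inside every fold, rather than once on the full dataset, keeps the held-out sample from influencing which features are chosen; the per-feature counts then give a simple stability indicator.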

Keywords

univariate statistics; feature selection; biomarker discovery; metabolomics; visualization; machine learning; prediction; formal concept analysis

Domains

Life Sciences [q-bio]; Computer Science [cs]

Main file

Frontiers-VersionFinale-2016.pdf (4.27 MB)

Origin: Files produced by the author(s)


https://hal.science/hal-01421002

Submitted on: Wednesday, December 21, 2016, 13:30:45

Last modified on: Friday, February 2, 2024, 03:55:02

Long-term archiving on: Monday, March 20, 2017, 17:59:44

Dates and versions

hal-01421002 , version 1 (21-12-2016)

Identifiers

HAL Id : hal-01421002 , version 1
DOI : 10.3389/fmolb.2016.00030
PRODINRA : 369918

Cite

Dhouha Grissa, Mélanie Pétéra, Marion Brandolini, Amedeo Napoli, Blandine Comte, et al. Feature Selection Methods for Early Predictive Biomarker Discovery Using Untargeted Metabolomic Data. Frontiers in Molecular Biosciences, 2016, 3, pp.15. ⟨10.3389/fmolb.2016.00030⟩. ⟨hal-01421002⟩


Collections

PRES_CLERMONT CNRS INRIA INRA UNH UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD INRAE ANR PFEM METABOHUB ALIMH

573 views

474 downloads
