Statistical Feature Selection: With Applications in Life Science

The sequencing of the human genome has changed life science research in many ways. Novel measurement technologies such as microarray expression analysis, genome-wide SNP typing and mass spectrometry are now producing experimental data of extremely high dimensions. While these techniques provide unprecedented opportunities for exploratory data analysis, the increase in dimensionality also introduces many difficulties. A key problem is to discover the most relevant variables, or features, among the tens of thousands of parallel measurements in a particular experiment. This is referred to as feature selection. For feature selection to be principled, one needs to decide exactly what it means for a feature to be “relevant”. This thesis considers relevance from a statistical viewpoint, as a measure of statistical dependence on a given target variable. The target variable might be continuous, such as a patient’s blood glucose level, or categorical, such as “smoker” vs. “non-smoker”…


1 Introduction
1.1 A brief background
1.2 A guide to the thesis
2 Statistical Data Models
2.1 Parametric models
2.1.1 The exponential family
2.1.2 Maximum likelihood estimation
2.2 Graphical models
2.2.1 Markov networks
2.2.2 Bayesian networks
2.2.3 Probability axioms
2.3 Conditional probability models
2.4 Predictors and inducers
2.5 Loss and risk
2.6 Nonparametric methods
2.6.1 Empirical risk minimization
2.6.2 Nearest-neighbor methods
2.6.3 Kernel methods
2.7 Priors, regularization and over-fitting
2.7.1 Over-fitting
2.7.2 Regularization
2.7.3 Priors and Bayesian statistics
2.8 Summary
3 Feature Selection Problems
3.1 Predictive features
3.1.1 The Markov boundary
3.1.2 The Bayes-relevant features
3.2 Small sample-optimal features
3.2.1 The min-features bias
3.2.2 k-optimal feature sets
3.3 All relevant features
3.3.1 The univariate case
3.3.2 The multivariate case
3.4 Feature extraction and gene set testing
3.5 Summary
4 Feature Selection Methods
4.1 Filter methods
4.1.1 Statistical hypothesis tests
4.1.2 The multiple testing problem
4.1.3 Variable ranking
4.1.4 Multivariate filters
4.1.5 Multivariate search methods
4.2 Wrapper methods
4.3 Embedded methods
4.3.1 Sparse linear predictors
4.3.2 Non-linear methods
4.4 Feature extraction and gene set testing methods
4.5 Summary
5 A benchmark study
5.1 Evaluation system
5.2 Feature selection methods tested
5.3 Results
5.3.1 Robustness against irrelevant features
5.3.2 Regularization in high dimensions
5.3.3 Ranking methods are comparable in high dimensions
5.3.4 Number of selected features increases with dimension
5.3.5 No method improves SVM accuracy
5.3.6 Feature set accuracy
5.4 Summary
6 Consistent Feature Selection in Polynomial Time
6.1 Relations between feature sets
6.1.1 The Markov boundary and strong relevance
6.1.2 The Bayes-relevant features
6.1.3 The optimal feature set
6.2 Consistent polynomial-time search algorithms
6.2.1 Experimental data
6.3 Discussion
6.4 Summary
7 Bootstrapping Feature Selection
7.1 Stability and error rates
7.2 Feature selection is ill-posed
7.3 The bootstrap approach
7.3.1 Accuracy of the bootstrap
7.3.2 Simulation studies
7.3.3 Application to cancer data
7.4 Discussion
7.5 Summary
8 Finding All Relevant Features
8.1 Computational complexity
8.2 The Recursive Independence Test algorithm
8.2.1 Outline
8.2.2 Asymptotic correctness
8.2.3 Biological relevance of the PCWT class
8.2.4 Multiplicity and small-sample error control
8.2.5 Simulated data
8.2.6 Microarray data
8.2.7 Discussion
8.3 The Recursive Markov Boundary algorithm
8.3.1 Outline
8.3.2 Asymptotic correctness
8.4 Related work
8.5 Summary
9 Conclusions
9.1 Model-based feature selection
9.2 Recommendations for practitioners
9.3 Future research

Author: Nilsson, Roland

Source: Linköping University
