Posts

Showing posts with the label Machine Learning

Random Forests & SVM

Bagged classifier using decision trees: 1) each split only considers a random subset of features; 2) each tree is grown to maximum size without pruning; 3) final predictions are obtained by aggregating over the B trees.

Out-of-Bag (OOB) samples: for each observation, construct its random forest predictor by averaging only those trees corresponding to bootstrap samples in which the observation does not appear. OOB error estimates can be computed along the way, so the forest can be fit in one sequence; once the OOB error stabilizes, training can be stopped. OOB samples can also be used to measure variable importance.

Ensembles and Multi-Learners. Goal: use multiple learners to solve parts of the same problem. Ensembles: competing learners with multiple looks at the same problem.

SVM: Support Vector Machine. Find a large-margin separator to improve generalization; use optimization to find a solution with few errors; use the kernel trick to make large feature spaces computationally efficient. Choose the linear separator ...
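The OOB mechanics above map directly onto scikit-learn's random forest. Below is a minimal sketch assuming scikit-learn is available; the synthetic dataset and parameter values (500 trees, sqrt features per split) are illustrative choices, not taken from the post.

```python
# Minimal sketch: random forest with OOB error and variable importance.
# Dataset and settings are illustrative, not from the original post.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,      # B trees, each grown to full size on a bootstrap sample
    max_features="sqrt",   # each split considers a random subset of features
    oob_score=True,        # score each observation only with trees that did not see it
    random_state=0,
)
forest.fit(X, y)

print("OOB accuracy:", forest.oob_score_)             # OOB error = 1 - OOB accuracy
print("Variable importance:", forest.feature_importances_)
```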

Logistic Regression, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA)

(1) When the decision boundary is linear, LR and LDA perform better; when the boundary is nonlinear, QDA performs better; when the boundary is even more complex, KNN performs better. (2) LR and LDA both produce linear decision boundaries; the difference is that LR's coefficients are estimated by maximum likelihood, while LDA's coefficients are computed from estimated means and variances of normal distributions. LR is suited to binary classification; for multi-class problems, LDA is more common. LDA and QDA are both built on the assumption that the predictors follow a normal distribution, so both perform well when the predictors are indeed approximately normal. The difference between LDA and QDA is that LDA assumes the predictors share a common covariance across all classes, whereas QDA allows each class its own covariance; the choice between LDA and QDA comes down to the bias-variance tradeoff.
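A small sketch of the LDA/QDA comparison described above, assuming scikit-learn; the Gaussian data is generated so the two classes have different covariances, which is exactly the situation where QDA's extra flexibility helps. All names and numbers are illustrative.

```python
# Minimal sketch: LDA vs. QDA on synthetic Gaussian data with unequal class covariances.
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)
# Class 0: spherical covariance; class 1: stretched covariance
# (violates LDA's shared-covariance assumption).
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=300)
X1 = rng.multivariate_normal([2, 2], [[3.0, 1.5], [1.5, 2.0]], size=300)
X = np.vstack([X0, X1])
y = np.array([0] * 300 + [1] * 300)

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    model.fit(X, y)
    print(type(model).__name__, "training accuracy:", model.score(X, y))
# QDA's quadratic boundary usually fits better here; LDA has lower variance and
# tends to win when the shared-covariance assumption roughly holds.
```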

Machine Learning

Linear regression, logistic regression, and SVM model formulations.

Convex Optimization: achieves the global minimum, with no local traps. Convex sets: for x1, x2 in C and 0 <= t <= 1, t*x1 + (1-t)*x2 is in C.

Gradient Descent: used for unconstrained optimization min_x f(x). Main idea: take a step proportional to the negative of the gradient. Extremely popular; simple and easy to implement. A handful of approaches to selecting the step size: 1) fixed step size; 2) exact line search; 3) backtracking line search.

Limitations of Gradient Descent: the step-size search may be expensive; convergence is slow for ill-conditioned problems; convergence speed depends on the initial starting position; it does not work for non-differentiable or constrained problems.

Linear Regression Formulation: given an input vector X^T = (X1, X2, ..., Xp), we want to predict the quantitative response y. Minimize the sum of squared errors (objective function) RSS(beta) = sum_i (y_i - x_i^T beta)^2; the least-squares solution is beta_hat = (X^T X)^{-1} X^T y. ...
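To make the gradient descent recipe concrete, here is a minimal sketch of gradient descent with backtracking line search on the least-squares objective, compared against the closed-form solution; the data and constants (shrink factor 0.5, sufficient-decrease constant 0.5) are illustrative assumptions.

```python
# Minimal sketch: gradient descent with backtracking line search for least squares,
# compared against the closed-form solution. All data and constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=200)

def f(b):        # objective: sum of squared errors
    return np.sum((y - X @ b) ** 2)

def grad(b):     # gradient of the objective
    return -2 * X.T @ (y - X @ b)

beta = np.zeros(3)
for _ in range(100):
    g = grad(beta)
    t = 1.0
    # Backtracking: shrink the step until a sufficient-decrease (Armijo) condition holds.
    while f(beta - t * g) > f(beta) - 0.5 * t * (g @ g):
        t *= 0.5
    beta = beta - t * g

beta_closed_form = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
print(beta, beta_closed_form)                          # the two estimates nearly coincide
```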

ROC & AUC

ROC: Receiver Operating Characteristic. A ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings, where TPR = TP/(TP + FN) and FPR = FP/(FP + TN). The true positive rate is also known as sensitivity, recall, or probability of detection in machine learning. Acc = (True Positive + True Negative)/Total population. F1 Score = 2*(Precision*Recall)/(Precision + Recall). False positive: type I error. False negative: type II error. AUC: Area Under the Curve. When using normalized units, the area under the curve is equal to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
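A short sketch of computing the ROC curve and AUC with scikit-learn, plus a check of the rank interpretation of AUC stated above; the labels and scores are made-up toy values.

```python
# Minimal sketch: ROC curve and AUC, with the pairwise-ranking interpretation of AUC.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])                 # toy labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.45, 0.6, 0.9, 0.7])  # toy classifier scores

fpr, tpr, thresholds = roc_curve(y_true, scores)   # TPR vs. FPR at each threshold
auc = roc_auc_score(y_true, scores)

# AUC = probability that a random positive is scored higher than a random negative.
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])
print(auc, pairwise)   # the two values agree (this toy data has no tied scores)
```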

Clustering vs Dimensionality Reduction

Clustering and Dimensionality Reduction are both applied to the problem of unsupervised learning. Clustering identifies unknown structure in the data; Dimensionality Reduction uses structural characteristics to simplify the data. A motivating problem is the Curse of Dimensionality: in practice, too many features lead to worse performance, so we need Dimensionality Reduction. It is often possible to represent data with fewer dimensions, which requires us to discover the intrinsic dimensionality of the data. One way to do dimensionality reduction is to perform lower-dimensional projections: the dataset is transformed to have fewer features, and in the new feature space some original features are combined via linear or nonlinear functions. PCA: Principal Component Analysis. Find a sequence of linear combinations of the features that have maximal variance and are mutually uncorrelated. 1st PC: the 1st PC of X is the unit vector that maximizes the s...
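A minimal PCA sketch, assuming scikit-learn; the synthetic 10-dimensional data is constructed to have roughly 2-dimensional intrinsic structure, so two components capture most of the variance. All values are illustrative.

```python
# Minimal sketch: dimensionality reduction with PCA on data whose intrinsic
# dimensionality is roughly 2. The data generation is illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))                      # 2-dimensional latent structure
X = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(200, 10))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                                # project onto the first two PCs

print(pca.explained_variance_ratio_)   # most variance captured by two components
print(pca.components_[0])              # 1st PC: unit-norm linear combination of the features
```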