What is machine learning?
Definition
- Literally, “machine” denotes “programming computer” and “learning” denotes “learn from data”.
- In a general sense, machine learning means that the computer can learn some ability without being explicitly programmed.
- From the perspective of engineering, given some task $T$, corresponding experience (training data) $E$ and performance measurement $P$, machine learning hopes to learn from $E$ so that the performance $P$ on task $T$ can be improved.
Machine learning is an interdisciplinary field that draws on computer science, statistics, mathematics and so on.
Basic elements
- Data. Every instance is called a sample. The sets of data used for training and testing are called the training set and the testing set, respectively. Because some algorithms have parameters that must be tuned, we also hold out a subset of the training set, called the evaluation set (more commonly, the validation set), which is used to judge how good or bad a parameter setting is.
- Model. It can be viewed as a function $f$. Given an input $x$, one can get an output $y$. The model may rely on some changeable parameters $\theta$. The process of learning is to update $\theta$.
- Performance measurement. It is used to evaluate the performance of the model. A utility function or fitness function measures how good a model is, while a cost function measures how bad it is.
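The three elements above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy data, the 6/2/2 split, the one-parameter model `f`, the mean squared error `cost`, and the candidate values are all hypothetical choices, not part of any particular library.

```python
import random

# Hypothetical toy data: every sample is a pair (x, y) with y = 2 * x.
samples = [(float(x), 2.0 * x) for x in range(10)]

# Data: shuffle, then split into training, evaluation, and testing sets.
rng = random.Random(0)  # fixed seed so the split is reproducible
rng.shuffle(samples)
train_set, eval_set, test_set = samples[:6], samples[6:8], samples[8:]

# Model: a function f with a changeable parameter theta.
def f(x, theta):
    return theta * x

# Performance measurement: mean squared error, a cost function (lower is better).
def cost(theta, data):
    return sum((f(x, theta) - y) ** 2 for x, y in data) / len(data)

# Use the evaluation set to judge which candidate parameter value is best.
candidates = [1.0, 2.0, 3.0]
best_theta = min(candidates, key=lambda t: cost(t, eval_set))
```

The testing set is left untouched here; it would only be used once, at the end, to report `cost(best_theta, test_set)`.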
Procedure
- To study data;
- To select a model;
- To train the model on the training set;
- To make predictions (inference) on new data.
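As a rough sketch, the four steps might look as follows for a one-variable linear model. The data values are made up, and the closed-form least-squares fit is just one possible way to train; the names `predict`, `a`, and `b` are illustrative.

```python
# Hypothetical training data: hours studied -> exam score (made-up numbers).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [52.0, 54.0, 56.0, 58.0]

# 1. Study the data: here, just compute simple summary statistics.
mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)

# 2. Select a model: a straight line y = a * x + b.
def predict(x, a, b):
    return a * x + b

# 3. Train the model: closed-form least squares for a single input variable.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
var_x = sum((x - mean_x) ** 2 for x in train_x)
a = cov_xy / var_x
b = mean_y - a * mean_x

# 4. Make a prediction on new data.
new_score = predict(5.0, a, b)
```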
Why use machine learning?
- To do work that requires a lot of hand-tuning or long lists of rules;
- To adapt to changes in the environment or data;
- To solve problems that are difficult for humans;
- To learn unknown rules (data mining).
Types of machine learning
There are many categories for machine learning algorithms. Generally, we can classify them from the following perspectives.
Training data
Supervised learning.
In supervised learning, each training sample $x\in \mathscr{X}$ has a label $y\in\mathscr{Y}$.
- Classification. The label set $\mathscr{Y}$ has finitely many elements, such as $\{0,1\}$ or $\{\text{Yes}, \text{No}\}$. The classification task is to determine which class a given sample belongs to.
- Regression. The label set $\mathscr{Y}$ is continuous, for example an interval such as $[0,1]$ or a more complex set. The regression task is to find a suitable map from $\mathscr{X}$ to $\mathscr{Y}$.
- Ranking. The samples are split into groups, and the label set can be either discrete or continuous. This is a special task commonly used in recommender systems. It aims to rank the samples within each group.
Some common supervised learning algorithms are given below:
- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural networks
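To make the first item in this list concrete, here is a minimal sketch of k-Nearest Neighbors used for classification. The training samples, the squared Euclidean distance, and the default `k=3` are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical labeled training samples: (feature vector, label).
train = [
    ((1.0, 1.0), "No"),
    ((1.5, 2.0), "No"),
    ((2.0, 1.0), "No"),
    ((6.0, 6.0), "Yes"),
    ((7.0, 5.5), "Yes"),
    ((6.5, 7.0), "Yes"),
]

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_classify(x, data, k=3):
    """Return the majority label among the k training samples nearest to x."""
    nearest = sorted(data, key=lambda item: squared_distance(x, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

label_a = knn_classify((1.2, 1.5), train)
label_b = knn_classify((6.2, 6.1), train)
```

Replacing the majority vote with the average of the neighbors' numeric labels would turn the same idea into a regression method.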
Unsupervised learning.
- Clustering
  - K-Means
  - DBSCAN
  - Hierarchical Cluster Analysis (HCA)
- Anomaly detection and novelty detection
  - One-class SVM
  - Isolation Forest
- Visualization and dimensionality reduction
  - Principal Component Analysis (PCA)
  - Kernel PCA
  - Locally-Linear Embedding (LLE)
  - t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association rule learning
  - Apriori
  - Eclat
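As one example from the list above, K-Means can be sketched with Lloyd's algorithm on one-dimensional points. This is a simplified illustration: the fixed number of iterations, the hand-picked starting centers, and the function name `kmeans_1d` are assumptions, and real implementations add smarter initialization and a convergence check.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D points, starting from the given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Unlabeled data: note that no labels are needed to find the two groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
```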
Semisupervised learning.
Reinforcement learning.
Introduction of Reinforcement Learning
Learning
- Offline learning/batch learning.
- Online learning.
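The difference between the two modes can be illustrated with the simplest possible "model", a single number that estimates the mean of the data. This is a toy sketch: `batch_mean` and `OnlineMean` are hypothetical names, and the data stream is made up.

```python
def batch_mean(data):
    """Offline/batch learning: fit using the whole data set at once."""
    return sum(data) / len(data)

class OnlineMean:
    """Online learning: refine the estimate one sample at a time,
    without storing the samples seen so far."""

    def __init__(self):
        self.count = 0
        self.value = 0.0

    def update(self, x):
        self.count += 1
        # Move the current estimate a step of size 1/count toward the new sample.
        self.value += (x - self.value) / self.count
        return self.value

stream = [4.0, 8.0, 6.0, 2.0]
learner = OnlineMean()
for x in stream:
    learner.update(x)
```

After the whole stream has been seen, `learner.value` agrees with `batch_mean(stream)`, but the online learner had a usable estimate after every single sample.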
Generalizing
- Instance-based learning.
- Model-based learning.
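A small sketch can contrast the two ways of generalizing. It assumes made-up data on the line $y = 3x$, a nearest-neighbor rule for the instance-based learner, and a one-parameter least-squares line for the model-based learner; all names are illustrative.

```python
# Hypothetical training data lying on the line y = 3 * x.
train = [(1.0, 3.0), (2.0, 6.0), (4.0, 12.0)]

def instance_based_predict(x):
    """Keep the training samples and answer with the label of the nearest one."""
    nearest_x, nearest_y = min(train, key=lambda s: abs(s[0] - x))
    return nearest_y

# Model-based: compress the data into one parameter, the slope of y = slope * x,
# fitted by least squares. After fitting, the samples are no longer needed.
slope = sum(x * y for x, y in train) / sum(x * x for x, y in train)

def model_based_predict(x):
    return slope * x

far_instance = instance_based_predict(10.0)
far_model = model_based_predict(10.0)
```

For an input far from the training data, the instance-based learner can only repeat the label of the closest stored sample, while the model-based learner extrapolates along the fitted line.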
Main challenges of machine learning
Data
- Lack of training data;
- Nonrepresentative training data;
- Poor quality of training data;
- Irrelevant features.
Model
- Overfitting of models on training data;
- Underfitting of models on training data.
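Both failure modes show up when the error on the training set is compared with the error on held-out data. The sketch below uses made-up data whose underlying rule is $y = x$, with noisy training labels and, for simplicity, noise-free held-out labels. The three models (constant, memorizing, linear) are illustrative assumptions.

```python
def mse(predict, data):
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

# Hypothetical data: the underlying rule is y = x.
# Training labels carry noise of +1 or -1; held-out labels are noise-free.
train = [(0.0, 1.0), (1.0, 0.0), (2.0, 3.0), (3.0, 2.0), (4.0, 5.0), (5.0, 4.0)]
held_out = [(0.2, 0.2), (1.2, 1.2), (2.2, 2.2), (3.2, 3.2), (4.2, 4.2)]

mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)

# Too simple: always predict the mean label (underfits).
def constant_model(x):
    return mean_y

# Too flexible: memorize every training sample, noise included (overfits).
def memorizing_model(x):
    return min(train, key=lambda s: abs(s[0] - x))[1]

# In between: a least-squares line.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

def linear_model(x):
    return slope * x + intercept

# Print training error and held-out error for each model.
for name, model in [("constant", constant_model),
                    ("memorizing", memorizing_model),
                    ("linear", linear_model)]:
    print(name, round(mse(model, train), 3), round(mse(model, held_out), 3))
```

The memorizing model reaches zero training error yet does worse than the line on held-out data, which is the signature of overfitting. The constant model is poor on both sets, which is the signature of underfitting.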