A Stepping Stone into Machine Learning: The kNN Algorithm (Part 2)

智能天下
2019-09-26 12:15:23

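Preface

In "Getting Started with Machine Learning: The kNN Algorithm (Part 1)", we introduced the k-nearest neighbors (kNN) algorithm, which is well suited to beginners. We walked through the algorithm step by step and implemented it by hand in a Jupyter Notebook. We also looked at how to use the kNN implementation in the sklearn library.

In practice, a few questions come up naturally: how well does an algorithm as simple as kNN actually perform? How accurate are its predictions? How do we judge whether a machine learning algorithm is good or bad? And what else do we need to pay attention to?

This article evaluates model performance using a training dataset and a test dataset, and introduces accuracy as an evaluation metric. Finally, we look at how the choice of hyperparameters affects the model.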

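Judging Model Quality

1.1 Training and Test Datasets

We have trained a model, but can it go straight into production? Not yet: we first need to evaluate how well it works. To measure how accurate the model is on data it has not seen, we split the dataset into a training dataset and a test dataset.

A common choice is to use 80% of the data for training and 20% for testing. However, if the samples are stored in some order (for example by time, or by class label), simply cutting the data at the 80% mark can give a biased split. In that case we randomly shuffle the dataset first, so that both the training set and the test set are representative.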

1.2 Iris Dataset Example

The iris dataset is one of the commonly used UCI datasets. We can load it directly through Python's sklearn library and take a first look:

```python
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target
print(X.shape)  # Output: (150, 4)
print(y.shape)  # Output: (150,)
```
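Before splitting, it helps to see why shuffling matters for this particular dataset. The sketch below assumes the copy of iris bundled with sklearn, whose labels are stored sorted by class; the seed `0` and the variable names are illustrative choices, not part of the original article.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
y = iris.target

test_size = int(len(y) * 0.2)

# Without shuffling, the last 20% of the samples all belong to one class.
tail_labels = y[-test_size:]
print(np.unique(tail_labels))  # [2]

# After a random permutation, the same slice contains a mix of classes.
rng = np.random.default_rng(0)  # illustrative seed
shuffled_labels = y[rng.permutation(len(y))][-test_size:]
print(np.unique(shuffled_labels))
```

A test set drawn from the unshuffled tail would only ever ask the model about a single species, so its accuracy would say little about the other two.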

Next, we split the dataset into a training set and a test set:

```python
# sklearn's own splitter; it is not used in the manual split below
from sklearn.model_selection import train_test_split

# Randomly shuffle the sample indices
shuffle_index = np.random.permutation(len(X))
test_ratio = 0.2
test_size = int(len(X) * test_ratio)
test_index = shuffle_index[:test_size]
train_index = shuffle_index[test_size:]

# Split the dataset using the shuffled indices
X_train = X[train_index]
X_test = X[test_index]
y_train = y[train_index]
y_test = y[test_index]

print(X_train.shape)  # Output: (120, 4)
print(X_test.shape)   # Output: (30, 4)
print(y_train.shape)  # Output: (120,)
print(y_test.shape)   # Output: (30,)
```

1.3 A Custom `train_test_split` Function

We can write our own `train_test_split` function so that it can be reused later:

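```python
import numpy as np


def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Randomly split X and y into training and test sets according to test_ratio."""
    assert X.shape[0] == y.shape[0], "X and y must contain the same number of samples"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must be between 0 and 1"

    # Seed NumPy's global generator so the split is reproducible.
    # Comparing against None means that seed=0 is honored as well.
    if seed is not None:
        np.random.seed(seed)

    shuffled_index = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    test_index = shuffled_index[:test_size]
    train_index = shuffled_index[test_size:]

    X_train = X[train_index]
    X_test = X[test_index]
    y_train = y[train_index]
    y_test = y[test_index]

    return X_train, X_test, y_train, y_test
```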
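For comparison, sklearn ships a function with the same name in `sklearn.model_selection`. The sketch below shows the equivalent call; note that its parameters are named `test_size` and `random_state` rather than `test_ratio` and `seed`, and that the value `666` is an arbitrary seed chosen for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Shuffles by default and returns the four arrays in the same order
# as the hand-written function above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666  # arbitrary seed for reproducibility
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(y_train.shape, y_test.shape)  # (120,) (30,)
```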

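Evaluating Model Performance

To evaluate model performance, we usually use accuracy as the metric. Accuracy is the number of correctly classified samples divided by the total number of samples.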
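As a small worked example of that definition, the sketch below computes accuracy for five made-up labels. The helper name `accuracy` and the label values are hypothetical, chosen only to illustrate the ratio.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    assert y_true.shape == y_pred.shape, "label arrays must have the same shape"
    return float(np.mean(y_true == y_pred))


# Made-up labels: four of the five predictions are correct.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 1, 2, 1]
print(accuracy(y_true, y_pred))  # 0.8
```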

2.1 Making Predictions with kNN

We can evaluate the performance of the kNN algorithm with the following steps:

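```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Create a kNN classifier instance with default hyperparameters
knn_clf = KNeighborsClassifier()

# Train the model on the training set
knn_clf.fit(X_train, y_train)

# Predict labels for the test set
y_predict = knn_clf.predict(X_test)

# Accuracy = correctly classified samples / total samples
accuracy = np.sum(y_predict == y_test) / len(y_test)
print(f"Accuracy: {accuracy:.2f}")
```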
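The same number can be obtained without writing the ratio by hand. The sketch below is self-contained and assumes an 80/20 split made with sklearn's `train_test_split` and an arbitrary seed; it compares the manual calculation with `sklearn.metrics.accuracy_score` and with the classifier's own `score` method.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666  # arbitrary seed
)

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)

# Three ways to arrive at the same accuracy value
manual = np.sum(y_predict == y_test) / len(y_test)
via_metric = accuracy_score(y_test, y_predict)
via_score = knn_clf.score(X_test, y_test)  # predicts internally, then scores

print(manual, via_metric, via_score)
```

`score` is convenient when the predictions themselves are not needed, since it skips the intermediate `y_predict` variable.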

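Hyperparameter Optimization

The choice of hyperparameters has a large effect on model performance. We can use grid search to look for the best combination of hyperparameters.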
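To get a feel for that effect, the sketch below varies a single hyperparameter, `n_neighbors`, and records the test accuracy for each value. The split and seed are illustrative assumptions. Choosing k by test accuracy like this is only meant to show the sensitivity, since it lets the test set influence the model.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666  # arbitrary seed
)

# Train one classifier per value of k and record its test accuracy
scores = {}
for k in range(1, 11):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    scores[k] = clf.score(X_test, y_test)

for k, score in scores.items():
    print(f"k={k:2d}  accuracy={score:.3f}")
```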

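3.1 How Grid Search Works

Grid search is an exhaustive search: it tries every combination of hyperparameter values in a grid that we specify and keeps the combination that scores best. We can implement it with the GridSearchCV class from the sklearn library.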

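```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Define the parameter search space
param_search = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
    {
        "weights": ["distance"],
        "n_neighbors": list(range(1, 11)),
        "p": list(range(1, 6)),
    },
]

# Create the classifier instance
knn_clf = KNeighborsClassifier()

# Create the grid search object (5-fold cross-validation)
grid_search = GridSearchCV(knn_clf, param_search, cv=5)

# Run the grid search on the training set
grid_search.fit(X_train, y_train)

# Print the best parameters and the best cross-validated accuracy
print("Best parameters:", grid_search.best_params_)
print("Best cross-validated accuracy:", grid_search.best_score_)
```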
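To make the exhaustive search concrete, the sketch below spells out roughly what such a search does, using sklearn's `ParameterGrid` to enumerate the combinations and `cross_val_score` to score each one. It is a simplified illustration under an assumed split and seed, not a reproduction of GridSearchCV's internals, which also refit the best model and record per-fold results.

```python
from sklearn import datasets
from sklearn.model_selection import ParameterGrid, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666  # arbitrary seed
)

param_search = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
    {"weights": ["distance"], "n_neighbors": list(range(1, 11)), "p": list(range(1, 6))},
]

grid = ParameterGrid(param_search)
print(len(grid))  # 60 combinations: 10 + 10 * 5

best_score, best_params = -1.0, None
for params in grid:
    clf = KNeighborsClassifier(**params)
    # Mean accuracy over 5 cross-validation folds of the training set
    score = cross_val_score(clf, X_train, y_train, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```

Because every combination is scored with cross-validation on the training set only, the test set stays untouched and can still be used for a final check of the chosen model.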

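Summary

In this article, we learned how to evaluate model performance using a training dataset and a test dataset, and introduced accuracy as an evaluation metric. We also looked at how the choice of hyperparameters affects model performance, and used grid search to find the best combination of hyperparameters.

With these steps, we can understand and apply the kNN algorithm more effectively. In future articles, we will continue with important concepts such as data normalization and further optimize the kNN algorithm.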

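图灵汇

Editor in charge: 智能天下

Statement: This article is an original piece by 图灵汇, which holds the copyright. It may not be reproduced without authorization. Media that have been authorized by agreement must credit "Source: 图灵汇" when downloading and using it; violators will be held legally liable.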

杨昌坤
