First Steps in Machine Learning: A Hands-On Introduction to Random Forests

华红兵
2020-02-25 21:41:48 3


Building and Optimizing a Random Forest Model

Author: Alexander Cheng

Compiled by 机器之心 (Synced)

Contributors: 高璇, 思

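As machine learning has matured, many interesting tutorials appeared in 2020. This article starts from the random forest, one of the most popular algorithms, and walks through the whole workflow of building a model.

Data scientists can build classification models in many ways, and the random forest is among the most popular. Tuning a random forest's hyperparameters can improve its performance.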

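Before fitting a model, principal component analysis (PCA) is often applied as well. Since a random forest already reports feature importances, why bother with PCA? Because PCA reduces the number of features the model has to process, which speeds up training.

The trade-off is interpretability. After PCA, every input to the forest is a linear combination of the original features, so the model's "feature importance" no longer maps onto individual features. The speed-up matters most when there are hundreds or even thousands of predictive features. So if the goal is the best-performing model and some interpretability can be sacrificed, PCA may be very useful.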
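To make the interpretability trade-off concrete, here is a minimal sketch (an illustration, not part of the original pipeline). It fits a PCA on the standardized features of Scikit-learn's breast cancer dataset and shows that each principal component carries a weight for every one of the 30 original features. The choice of two components and the names `pca_demo` and `loadings` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

dataset = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(dataset.data)

# Two components are used purely for illustration.
pca_demo = PCA(n_components=2).fit(X_scaled)

# Each row of components_ holds one weight per original feature,
# so a component cannot be attributed to a single feature.
loadings = pd.DataFrame(
    pca_demo.components_,
    columns=dataset.feature_names,
    index=['PC1', 'PC2'],
)
print(loadings.shape)

# Original features with the largest absolute weight in the first component
print(loadings.loc['PC1'].abs().sort_values(ascending=False).head())
```

An importance score assigned to "PC1" by a forest trained on these components would therefore describe a blend of features rather than any one measurement.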

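Let's walk through a concrete example. We use the "breast cancer" dataset that ships with Scikit-learn and build three models to compare:

- Random forest
- Random forest with PCA dimensionality reduction
- Random forest with PCA dimensionality reduction and hyperparameter tuning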

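Importing the data

First, we load the data and create a DataFrame. This is one of Scikit-learn's "toy" datasets, so it is ready for modeling as is. As a best practice, though, we should still:

- Use df.head() to look at the new DataFrame and confirm it looks as expected.
- Use df.info() to see each column's data type and non-null count, and convert data types where needed.
- Use df.isna() to confirm there are no NaN values, and handle missing values or drop rows where needed.
- Use df.describe() to see each column's minimum, maximum, mean, median, standard deviation, and interquartile range.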
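The checks above can be sketched as follows. This is a minimal example under stated assumptions: it builds the DataFrame from the dataset's own `feature_names` rather than a hand-written column list, and the name `missing_per_column` is introduced only for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['cancer'] = dataset.target

# First rows: confirm the columns and values look as expected
print(df.head())

# Data types and non-null counts per column
df.info()

# Number of missing values per column, then the overall total
missing_per_column = df.isna().sum()
print(missing_per_column.sum())

# Summary statistics; '50%' is the median
print(df.describe().T[['min', '50%', 'max', 'mean', 'std']])
```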

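The "cancer" column is the target variable we want to predict. Note that in Scikit-learn's encoding of this dataset, "0" means malignant and "1" means benign, so despite the column name, "1" does not mean "cancer".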
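Because the meaning of the labels is easy to get backwards, it is worth checking the encoding directly. The sketch below reads the class names from the dataset object, where the position in `target_names` is the integer label, and counts the samples per class. The name `class_counts` is introduced only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# Position in target_names corresponds to the integer label (0, 1)
print(dataset.target_names)

# Count how many samples carry each label
labels, counts = np.unique(dataset.target, return_counts=True)
class_counts = {
    str(dataset.target_names[label]): int(count)
    for label, count in zip(labels, counts)
}
print(class_counts)
```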

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Names of the 30 predictive features, in the dataset's column order
columns = [
    'mean radius', 'mean texture', 'mean perimeter',
    'mean area', 'mean smoothness', 'mean compactness',
    'mean concavity', 'mean concave points', 'mean symmetry',
    'mean fractal dimension', 'radius error', 'texture error',
    'perimeter error', 'area error', 'smoothness error',
    'compactness error', 'concavity error', 'concave points error',
    'symmetry error', 'fractal dimension error', 'worst radius',
    'worst texture', 'worst perimeter', 'worst area',
    'worst smoothness', 'worst compactness', 'worst concavity',
    'worst concave points', 'worst symmetry', 'worst fractal dimension',
]

dataset = load_breast_cancer()
data = pd.DataFrame(dataset['data'], columns=columns)
data['cancer'] = dataset['target']  # target column to predict
```

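Train/test split

Next, we split the data with Scikit-learn's train_test_split function. To leave the model enough data to learn from while holding out data for testing, we use 50% of the data for training and 50% for testing. We also set stratify=y so that the class proportions in the training and test sets match those of the full dataset.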

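```python
from sklearn.model_selection import train_test_split

X = data.drop('cancer', axis=1)  # predictive features
y = data['cancer']               # target labels

# 50/50 split, stratified on the target to preserve class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=2020, stratify=y)
```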
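To see what stratify=y does, the following sketch repeats the same split and compares the class proportions in the full data, the training set, and the test set. It is self-contained and builds `X` and `y` directly from the dataset object. The `proportions` table is an illustrative name, not part of the original code.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.Series(dataset.target, name='cancer')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=2020, stratify=y)

# Fraction of each class in the full data and in each split
proportions = pd.DataFrame({
    'full': y.value_counts(normalize=True),
    'train': y_train.value_counts(normalize=True),
    'test': y_test.value_counts(normalize=True),
}).sort_index()
print(proportions)
```

With stratification, the three columns should agree up to the rounding imposed by whole samples. Without it, a random split of an imbalanced dataset can drift away from the original class ratio.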

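Scaling the data

Before modeling, we "center" and "standardize" the data. The scaler is fitted on the training set only and then applied to both the training and test sets. We also convert y_train from a Pandas Series to a NumPy array so that the model can take it as training data later.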

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # learn mean/std from the training set
X_test_scaled = ss.transform(X_test)        # reuse the training-set mean/std
y_train = np.array(y_train)
```
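As a sanity check on the scaling step, the sketch below verifies that the training columns end up with mean 0 and standard deviation 1. The test columns are only close to those values, because they are transformed with the training-set statistics. The variable names `train_means`, `train_stds`, and `test_means` are introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=2020, stratify=y)

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # statistics come from training rows only
X_test_scaled = ss.transform(X_test)        # same statistics applied to test rows

train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

# Largest deviation from the ideal values across the 30 columns
print(np.abs(train_means).max(), np.abs(train_stds - 1).max())
# Test-set means are near zero, but not exactly zero
print(np.abs(test_means).max())
```

Fitting the scaler on the training set alone keeps information about the test set out of the preprocessing step.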

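Fitting a baseline random forest model

Now we create a "baseline" random forest model that uses all the predictive features and the default settings. We instantiate the model, fit it to the standardized data, and then measure its accuracy on the training data.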

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

rfc = RandomForestClassifier()
rfc.fit(X_train_scaled, y_train)
print(rfc.score(X_train_scaled, y_train))  # accuracy on the training data; output reported in the source: 1.0
```
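A training accuracy of 1.0 mainly shows that the forest can reproduce the data it was fitted on. It says little about unseen data. One common way to get a less optimistic estimate without touching the test set is cross-validation on the training set. The sketch below is an addition, not part of the original workflow: the 5-fold setting, the fixed `random_state`, and the name `rfc_cv` are assumptions, and the unscaled features are used because tree-based splits do not depend on feature scale.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=2020, stratify=y)

# random_state is fixed here only to make the sketch reproducible
rfc_cv = RandomForestClassifier(random_state=0)

# Accuracy on 5 held-out folds of the training set
cv_scores = cross_val_score(rfc_cv, X_train, y_train, cv=5, scoring='accuracy')
print(cv_scores.mean(), cv_scores.std())
```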

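Feature importance

To find out which features matter most to the random forest when it predicts breast cancer, we can read the model's feature_importances_ attribute and then quantify and visualize the most important features.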

```python
# Map each feature name to its importance score
feats = {}
for feature, importance in zip(X.columns, rfc.feature_importances_):
    feats[feature] = importance

# Build a table sorted from most to least important
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-Importance'})
importances = importances.sort_values(by='Gini-Importance', ascending=False)
importances = importances.reset_index().rename(columns={'index': 'Features'})

# Visualization code omitted
```
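A more compact way to get the same ranking is a pandas Series indexed by feature name. The sketch below is self-contained and, for brevity, fits a forest on the full dataset rather than on the article's training split. The name `forest` and the fixed `random_state` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

dataset = load_breast_cancer()

# Fitted on the full dataset only to keep the sketch short
forest = RandomForestClassifier(random_state=0).fit(dataset.data, dataset.target)

# One importance score per feature, sorted from most to least important
importances = pd.Series(
    forest.feature_importances_,
    index=dataset.feature_names,
    name='Gini-Importance',
).sort_values(ascending=False)

print(importances.head(10))
```

The scores are normalized to sum to 1, so each value can be read as a share of the total importance. If matplotlib is installed, `importances.head(10).plot.barh()` is one possible way to draw the ranking.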

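Principal component analysis (PCA)

How can we improve on the baseline model? With dimensionality reduction, we can represent the original dataset with fewer variables and lower the model's computational cost. With PCA, we can examine the cumulative explained variance ratio of the components to see how many components are needed to capture most of the variance in the data.

We instantiate PCA and set the number of components to keep. Here we set it to 30 so that we can see the variance explained by every component and decide where to cut. Then we fit the PCA object to the scaled X_train data.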

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

# Keep all 30 components to inspect the variance each one explains
pca_test = PCA(n_components=30)
pca_test.fit(X_train_scaled)

# Plot the cumulative explained variance, with a marker at 10 components
sns.set(style='whitegrid')
plt.plot(np.cumsum(pca_test.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.axvline(linewidth=4, color='r', linestyle='--', x=10, ymin=0, ymax=1)
plt.show()

# Keep the first 10 components
pca = PCA(n_components=10)
pca.fit(X_train_scaled)
X_train_scaled_pca = pca.transform(X_train_scaled)
X_test_scaled_pca = pca.transform(X_test_scaled)
```
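The cut-off can also be read from the numbers instead of the plot. The sketch below finds the smallest number of components whose cumulative explained variance reaches a threshold. The 95% threshold and the names `pca_full`, `cumulative`, and `n_keep` are assumptions for illustration. The article itself simply picks 10 components from the plot.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=2020, stratify=y)
X_train_scaled = StandardScaler().fit_transform(X_train)

pca_full = PCA(n_components=30).fit(X_train_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Assumed threshold: keep enough components to explain 95% of the variance
threshold = 0.95
n_keep = int(np.argmax(cumulative >= threshold)) + 1
print(n_keep, cumulative[n_keep - 1])

# Variance retained by the 10 components chosen in the article
print(cumulative[9])
```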

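Fitting a random forest model with PCA dimensionality reduction

Next, we fit another random forest model to X_train_scaled_pca and y_train to test whether the model's predictive performance improves.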

```python
rfc = RandomForestClassifier()
rfc.fit(X_train_scaled_pca, y_train)
print(rfc.score(X_train_scaled_pca, y_train))  # output reported in the source: 1.0
```

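Round one of hyperparameter tuning: RandomizedSearchCV

After PCA, we can optimize the model further by tuning its hyperparameters. Hyperparameters can be thought of as the model's "settings", and the best settings differ from dataset to dataset. We use RandomizedSearchCV, which tries randomly sampled combinations of hyperparameter values.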

```python
from sklearn.model_selection import RandomizedSearchCV

# Candidate values for each hyperparameter
n_estimators = [int(x) for x in np.linspace(start=100, stop=1000, num=10)]
max_features = ['log2', 'sqrt']
max_depth = [int(x) for x in np.linspace(start=1, stop=15, num=15)]
min_samples_split = [int(x) for x in np.linspace(start=2, stop=50, num=10)]
min_samples_leaf = [int(x) for x in np.linspace(start=2, stop=50, num=10)]
bootstrap = [True, False]

param_dist = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf,
    'bootstrap': bootstrap,
}

# Sample 100 random combinations, each scored with 3-fold cross-validation
rs = RandomizedSearchCV(rfc, param_dist, n_iter=100, cv=3, verbose=1, n_jobs=-1, random_state=0)
rs.fit(X_train_scaled_pca, y_train)
print(rs.best_params_)
```
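Beyond the single best combination, the full search results can guide which value ranges to explore next. The sketch below is a scaled-down illustration: the tiny search space, `n_iter=5`, the use of the raw features, and the names `param_dist_demo`, `rs_demo`, and `results` are all assumptions chosen so that it runs quickly. It loads `cv_results_` into a DataFrame and sorts the candidates by rank.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Deliberately small search space so the sketch finishes quickly
param_dist_demo = {
    'n_estimators': [50, 100],
    'max_depth': [3, 7, 11],
    'min_samples_leaf': [2, 10, 20],
}

rs_demo = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist_demo, n_iter=5, cv=3, random_state=0)
rs_demo.fit(X, y)

# One row per sampled combination, best-ranked first
results = pd.DataFrame(rs_demo.cv_results_).sort_values('rank_test_score')
results = results[['param_n_estimators', 'param_max_depth',
                   'param_min_samples_leaf', 'mean_test_score',
                   'rank_test_score']]
print(results)
```

Values that keep appearing near the top of such a table are natural candidates for a narrower, exhaustive grid.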

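Round two of hyperparameter tuning: GridSearchCV

After round one, we use GridSearchCV to run a finer, exhaustive search around the best hyperparameter values found so far, with the aim of improving model performance further.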

```python
from sklearn.model_selection import GridSearchCV

# Narrower candidate values for the exhaustive search
n_estimators = [300, 500, 700]
max_features = ['sqrt']
max_depth = [2, 3, 7, 11, 15]
min_samples_split = [2, 3, 4, 22, 23, 24]
min_samples_leaf = [2, 3, 4, 5, 6, 7]
bootstrap = [False]

param_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf,
    'bootstrap': bootstrap,
}

# Try every combination in the grid, each scored with 3-fold cross-validation
gs = GridSearchCV(rfc, param_grid, cv=3, verbose=1, n_jobs=-1)
gs.fit(X_train_scaled_pca, y_train)
print(gs.best_params_)
```
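Unlike the randomized search, a grid search fits every combination, so it helps to know the cost before launching it. The sketch below counts the candidates in the grid above with Scikit-learn's ParameterGrid. The names `n_candidates` and `n_fits` are introduced for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [300, 500, 700],
    'max_features': ['sqrt'],
    'max_depth': [2, 3, 7, 11, 15],
    'min_samples_split': [2, 3, 4, 22, 23, 24],
    'min_samples_leaf': [2, 3, 4, 5, 6, 7],
    'bootstrap': [False],
}

# Number of distinct hyperparameter combinations: 3 * 1 * 5 * 6 * 6 * 1
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold (cv=3)
n_fits = n_candidates * 3
print(n_candidates, n_fits)
```

This grid has 540 combinations, which means 1,620 model fits with 3-fold cross-validation. Adding one more value to any list multiplies that total, which is why the grid is kept narrow.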

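Evaluating model performance

Finally, we evaluate the models on the test data. We test three models:

- Baseline random forest
- Random forest with PCA dimensionality reduction
- Random forest with PCA dimensionality reduction and hyperparameter tuning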

```python
from sklearn.metrics import confusion_matrix, recall_score

# `rfc` was re-fitted on the PCA features above, so fit a separate
# default forest on all 30 scaled features to serve as the baseline.
rfc_baseline = RandomForestClassifier().fit(X_train_scaled, y_train)

y_pred = rfc_baseline.predict(X_test_scaled)
y_pred_pca = rfc.predict(X_test_scaled_pca)
y_pred_gs = gs.best_estimator_.predict(X_test_scaled_pca)

# Confusion matrices: rows are actual classes, columns are predicted classes
labels = dict(index=['actual 0', 'actual 1'], columns=['predicted 0', 'predicted 1'])
conf_matrix_baseline = pd.DataFrame(confusion_matrix(y_test, y_pred), **labels)
conf_matrix_baseline_pca = pd.DataFrame(confusion_matrix(y_test, y_pred_pca), **labels)
conf_matrix_tuned_pca = pd.DataFrame(confusion_matrix(y_test, y_pred_gs), **labels)

print(conf_matrix_baseline)
print('Baseline Random Forest recall score:', recall_score(y_test, y_pred))
print(conf_matrix_baseline_pca)
print('Baseline Random Forest With PCA recall score:', recall_score(y_test, y_pred_pca))
print(conf_matrix_tuned_pca)
print('Hyperparameter Tuned Random Forest With PCA Reduced Dimensionality recall score:',
      recall_score(y_test, y_pred_gs))
```
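To make the recall score concrete, the sketch below computes it by hand from a confusion matrix on a small set of made-up labels (the arrays `y_true_demo` and `y_pred_demo` are illustrative, not results from the models above). It also shows that recall_score treats label 1 as the positive class by default, and that pos_label selects the other class.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels for illustration only
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Recall = true positives / all actual positives
recall_manual = tp / (tp + fn)

# Default: label 1 is the positive class
recall_label1 = recall_score(y_true_demo, y_pred_demo)

# Recall for label 0 instead
recall_label0 = recall_score(y_true_demo, y_pred_demo, pos_label=0)

print(recall_manual, recall_label1, recall_label0)
```

On the breast cancer data, which class counts as "positive" determines what a missed case means, so it is worth stating the pos_label explicitly when reporting recall.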

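That is the complete process of building and optimizing a random forest model. Along the way, we covered the basics of random forests and showed how to use PCA and hyperparameter tuning to optimize model performance.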

图灵汇

Editor in charge: 华红兵

