Python Machine Learning API Introduction 8: Regression Decision Tree API

刘凯悦
2019-12-05 14:45:04

Introduction to the Regression Decision Tree API

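scikit-learn provides two kinds of decision tree, a classification tree (DecisionTreeClassifier) and a regression tree (DecisionTreeRegressor), and both use an optimized version of the CART algorithm. This article focuses on the regression tree and on how to use its Python API for common regression tasks.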

Regression decision tree API: DecisionTreeRegressor

DecisionTreeRegressor is the scikit-learn class for regression problems. Its constructor is defined as follows:

    class sklearn.tree.DecisionTreeRegressor(
        criterion='mse',
        splitter='best',
        max_depth=None,
        min_samples_split=2,
        min_samples_leaf=1,
        min_weight_fraction_leaf=0,
        max_features=None,
        random_state=None,
        max_leaf_nodes=None,
        presort=False
    )

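Parameters

criterion: a string specifying the measure of split quality. The default is 'mse' (mean squared error); scikit-learn 1.0 and later name this criterion 'squared_error'.
splitter: a string specifying the split strategy, either 'best' (choose the best split) or 'random' (choose the best random split).
max_features: an integer, float, string, or None giving the number of features to consider when looking for the best split:
- integer: consider exactly that many features at each split.
- float: consider that fraction of the total number of features at each split.
- string 'sqrt': consider the square root of the total number of features.
- string 'auto': for the regressor, consider all features, the same as None (it is only for the classifier that 'auto' meant the square root).
- string 'log2': consider the base-2 logarithm of the total number of features.
- None: consider all features.
max_depth: an integer or None giving the maximum depth of the tree. If None, the depth is not limited and nodes keep being split until the other stopping conditions are met. When max_leaf_nodes is also set, both limits apply.
min_samples_split: an integer giving the minimum number of samples a node must contain before it can be split.
min_samples_leaf: an integer giving the minimum number of samples each leaf node must contain.
min_weight_fraction_leaf: a float giving the minimum fraction of the total sample weight that each leaf node must hold.
max_leaf_nodes: an integer or None giving the maximum number of leaf nodes. If None, the number of leaves is not limited; otherwise the tree is grown best-first until it reaches that many leaves.
random_state: an integer, a RandomState instance, or None, used to seed the random number generator.
presort: a boolean specifying whether to sort the data in advance to speed up the search for the best split. This parameter was deprecated in scikit-learn 0.22 and removed in 0.24.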
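
As a rough illustration of how the size-limiting parameters above shape the fitted tree, the sketch below fits the regressor several times and reports the resulting depth and leaf count. The synthetic data, the `settings` dictionary, and the specific parameter values are assumptions made for this example and do not come from the original article.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine data, assumed purely for illustration.
rng = np.random.RandomState(0)
X = 5 * rng.rand(200, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

# Hypothetical parameter settings to compare.
settings = {
    "default": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
    "max_leaf_nodes=8": {"max_leaf_nodes": 8},
}

shapes = {}
for name, params in settings.items():
    tree = DecisionTreeRegressor(random_state=0, **params).fit(X, y)
    # get_depth() and get_n_leaves() describe the fitted tree.
    shapes[name] = (tree.get_depth(), tree.get_n_leaves())
    print(f"{name}: depth={shapes[name][0]}, leaves={shapes[name][1]}")
```

Each constraint should yield a much smaller tree than the default settings, which is the usual first step when a tree overfits.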

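Attributes

feature_importances_: the importance of each feature; the larger the value, the more important the feature.
max_features_: the inferred value of max_features.
n_features_: the number of features seen during training (exposed as n_features_in_ in recent scikit-learn releases).
n_outputs_: the number of outputs seen during training.
tree_: the underlying decision tree object.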
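
The attributes above can be inspected on any fitted regressor. The following is a minimal sketch on synthetic data in which only the first of three features drives the target; the data and the `max_depth=4` setting are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for this sketch): the target depends only on
# the first of three features; the other two are pure noise.
rng = np.random.RandomState(0)
X = rng.rand(300, 3)
y = 10 * X[:, 0]

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

importances = tree.feature_importances_
print("feature_importances_:", importances)
print("max_features_:", tree.max_features_)
print("n_outputs_:", tree.n_outputs_)
# tree_ is the low-level tree structure; node_count and max_depth are two of
# its fields.
print("tree_.node_count:", tree.tree_.node_count)
print("tree_.max_depth:", tree.tree_.max_depth)
```

The importances sum to 1, and in this setup the first feature should receive nearly all of the weight.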

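Methods

fit(X, y[, sample_weight]): trains the model.
predict(X): predicts target values for X.
score(X, y[, sample_weight]): returns the performance score on the given data, which for a regressor is the coefficient of determination R^2.
predict_proba(X) and predict_log_proba(X), which return class probabilities and their logarithms, belong to DecisionTreeClassifier; the regressor does not provide them.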
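
A minimal sketch of the methods in use is shown below. The synthetic data and the query points in `X_new` are assumptions for illustration; the comparison with `sklearn.metrics.r2_score` is included only to show what `score` measures.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic sine data, assumed for this sketch.
rng = np.random.RandomState(0)
X = 5 * rng.rand(80, 1)
y = np.sin(X).ravel()

model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(X, y)  # train the tree

X_new = np.array([[0.5], [2.5], [4.5]])  # hypothetical query points
y_pred = model.predict(X_new)  # one predicted value per row
print("predictions:", y_pred)

# score() is the R^2 of predict(X) measured against y.
score = model.score(X, y)
same_as_r2 = np.isclose(score, r2_score(y, model.predict(X)))
print("score:", score)
print("matches r2_score:", same_as_r2)

# Unlike the classifier, the regressor exposes no probability methods.
has_proba = hasattr(model, "predict_proba")
print("has predict_proba:", has_proba)
```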

Python example

The following simple example shows how to use DecisionTreeRegressor for a regression task:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def create_data(n):
    """Generate a noisy sine curve and split it into training and test sets."""
    np.random.seed(0)
    X = 5 * np.random.rand(n, 1)
    y = np.sin(X).ravel()
    # Add noise to every fifth sample.
    noise_num = y[::5].size
    y[::5] += 3 * (0.5 - np.random.rand(noise_num))
    return train_test_split(X, y, test_size=0.25, random_state=1)


def test_DecisionTreeRegressor(*data):
    """Train a regression tree with default settings and print its scores."""
    X_train, X_test, y_train, y_test = data
    regressor = DecisionTreeRegressor()
    regressor.fit(X_train, y_train)

    print(f"Training score: {regressor.score(X_train, y_train)}")
    print(f"Testing score: {regressor.score(X_test, y_test)}")
    return regressor


def plot_results(regressor, X_train, X_test, y_train, y_test):
    """Plot the samples together with the predictions of the fitted tree."""
    fig, ax = plt.subplots(figsize=(8, 6))
    x = np.arange(0.0, 5.0, 0.05)[:, np.newaxis]
    y = regressor.predict(x)

    ax.scatter(X_train, y_train, label="Train Sample", color='g')
    ax.scatter(X_test, y_test, label="Test Sample", color='r')
    ax.plot(x, y, label="Predicted Values", linewidth=2, alpha=0.5)

    ax.set_xlabel("Data")
    ax.set_ylabel("Target")
    ax.set_title("Decision Tree Regression")
    ax.legend(framealpha=0.5)
    plt.show()


X_train, X_test, y_train, y_test = create_data(100)
regressor = test_DecisionTreeRegressor(X_train, X_test, y_train, y_test)
plot_results(regressor, X_train, X_test, y_train, y_test)
```

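Running the code above shows a very high training score but a noticeably lower test score, which indicates that the model overfits the training data and generalizes poorly on this dataset.

Next, we examine how the choice between random and best splitting, and the depth of the tree, affect predictive performance.

Effect of random versus best splitting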

```python
# Reuses the imports and create_data() from the first example.
def test_DecisionTreeRegressor_splitter(*data):
    """Compare the 'best' and 'random' split strategies."""
    X_train, X_test, y_train, y_test = data
    splitters = ['best', 'random']

    for splitter in splitters:
        regressor = DecisionTreeRegressor(splitter=splitter)
        regressor.fit(X_train, y_train)

        print(f"Splitter: {splitter}")
        print(f"Training score: {regressor.score(X_train, y_train)}")
        print(f"Testing score: {regressor.score(X_test, y_test)}")


X_train, X_test, y_train, y_test = create_data(100)
test_DecisionTreeRegressor_splitter(X_train, X_test, y_train, y_test)
```

Effect of tree depth

```python
# Reuses the imports and create_data() from the first example.
def test_DecisionTreeRegressor_depth(*data, maxdepth):
    """Plot training and test scores for max_depth from 1 to maxdepth."""
    X_train, X_test, y_train, y_test = data
    depths = np.arange(1, maxdepth + 1)

    training_scores = []
    testing_scores = []

    for depth in depths:
        regressor = DecisionTreeRegressor(max_depth=depth)
        regressor.fit(X_train, y_train)

        training_scores.append(regressor.score(X_train, y_train))
        testing_scores.append(regressor.score(X_test, y_test))

    fig, ax = plt.subplots(figsize=(8, 6))
    ax.plot(depths, training_scores, label="Training Scores")
    ax.plot(depths, testing_scores, label="Testing Scores")

    ax.set_xlabel("Max Depth Value")
    ax.set_ylabel("Score Value")
    ax.set_title("Decision Tree Regression")
    ax.legend(framealpha=0.5)
    plt.show()


X_train, X_test, y_train, y_test = create_data(100)
test_DecisionTreeRegressor_depth(X_train, X_test, y_train, y_test, maxdepth=10)
```

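The code above shows that, on this dataset, the choice between random and best splitting has little effect on predictive performance, whereas the depth of the tree has a significant effect. With only 100 samples in total, the tree can no longer be split further by the time the maximum depth reaches 7.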
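
One way to check how deep the tree actually grows on this data is to fit it without any depth limit and read the depth back from the fitted model. The sketch below rebuilds the same data recipe as the example (inlined here so the block is self-contained) and is meant as a quick check, not as part of the original experiment.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Same data recipe as the article's example: a sine curve with noise added
# to every fifth sample, then a 75/25 train/test split.
np.random.seed(0)
X = 5 * np.random.rand(100, 1)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - np.random.rand(y[::5].size))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# No max_depth: the tree grows until it cannot split any further.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
train_score = full_tree.score(X_train, y_train)

print("training samples:", len(X_train))
print("depth of the unrestricted tree:", full_tree.get_depth())
print("number of leaves:", full_tree.get_n_leaves())
print("training score:", train_score)
print("testing score:", full_tree.score(X_test, y_test))
```

Comparing the reported depth with the curve from the depth experiment shows where raising max_depth stops changing the model.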

图灵汇

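Editor: 刘凯悦

Statement: This article is an original work of 图灵汇, which holds the copyright. It may not be reproduced without authorization. Media that have been authorized by agreement must credit "Source: 图灵汇" when downloading and using it; violators will be held legally liable.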