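scikit-learn provides two kinds of decision trees, a classifier and a regressor, and both are built on an optimized version of the CART algorithm. This section focuses on using the Python API to solve a typical regression task.
`DecisionTreeRegressor` is the API for regression problems. Its constructor is defined as follows: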
```python
class sklearn.tree.DecisionTreeRegressor(
    criterion='mse',            # named 'squared_error' since scikit-learn 1.0
    splitter='best',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features=None,
    random_state=None,
    max_leaf_nodes=None,
    presort=False               # removed in scikit-learn 0.24
)
```
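`max_features`: an integer, float, string, or None that sets how many features are considered when searching for the best split. An integer is used directly as the count; a float is treated as a fraction of the total number of features; the strings `'sqrt'` and `'log2'` select the square root or the base-2 logarithm of the feature count; None uses all features.
`max_depth`: an integer or None that sets the maximum depth of the tree. If None, the depth is not limited. In the older API shown here, this setting is ignored when `max_leaf_nodes` is specified.
`random_state`: an integer, a RandomState instance, or None, used to seed the random number generator.
After fitting, the `max_features_` attribute holds the inferred value of `max_features`.
The following simple example shows how to use `DecisionTreeRegressor` for a regression task: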
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def createdata(n):
    """Return a train/test split of n noisy samples of sin(x) on [0, 5)."""
    np.random.seed(0)
    X = 5 * np.random.rand(n, 1)
    y = np.sin(X).ravel()
    # Add uniform noise to every fifth target value.
    noise_num = len(y[::5])
    y[::5] += 3 * (0.5 - np.random.rand(noise_num))
    return train_test_split(X, y, test_size=0.25, random_state=1)


def testDecisionTreeRegressor(*data):
    """Fit a default tree, print its R^2 scores, and return the fitted model."""
    X_train, X_test, y_train, y_test = data
    regressor = DecisionTreeRegressor()
    regressor.fit(X_train, y_train)
    print(f"Training score: {regressor.score(X_train, y_train)}")
    print(f"Testing score: {regressor.score(X_test, y_test)}")
    return regressor


def plot_results(regressor, X_train, X_test, y_train, y_test):
    """Plot the train and test samples together with the fitted curve."""
    fig, ax = plt.subplots(figsize=(8, 6))
    x = np.arange(0.0, 5.0, 0.05)[:, np.newaxis]
    y = regressor.predict(x)
    ax.scatter(X_train, y_train, label="Train Sample", color='g')
    ax.scatter(X_test, y_test, label="Test Sample", color='r')
    ax.plot(x, y, label="Predicted Values", linewidth=2, alpha=0.5)
    ax.set_xlabel("Data")
    ax.set_ylabel("Target")
    ax.set_title("Decision Tree Regression")
    ax.legend(framealpha=0.5)
    plt.show()


X_train, X_test, y_train, y_test = createdata(100)
regressor = testDecisionTreeRegressor(X_train, X_test, y_train, y_test)
plot_results(regressor, X_train, X_test, y_train, y_test)
```
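Running the code above shows a high training score but a lower test score, which indicates that the model overfits the training data and generalizes poorly on this dataset.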
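One way to probe this gap is to fit the same data once without restrictions and once with a constrained tree, then compare the scores. The sketch below rebuilds the same kind of noisy sine data so that it runs on its own; the helper name `fit_and_score` and the value `max_depth=3` are assumptions chosen for illustration, not settings from the original text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy sine data, assumed to mirror the example above.
rng = np.random.RandomState(0)
X = 5 * rng.rand(100, 1)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(20))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)


def fit_and_score(**params):
    """Fit a tree with the given constructor arguments (hypothetical helper).

    Returns the R^2 score on the training split and on the test split.
    """
    model = DecisionTreeRegressor(random_state=0, **params)
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


full_train, full_test = fit_and_score()               # unrestricted tree
small_train, small_test = fit_and_score(max_depth=3)  # illustrative limit

print(f"unrestricted: train={full_train:.3f}, test={full_test:.3f}")
print(f"max_depth=3:  train={small_train:.3f}, test={small_test:.3f}")
```

The unrestricted tree can memorize the training samples, so its training score says little about how it behaves on unseen data; the constrained tree trades some training fit for a simpler model.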
Next, we examine how random splitting versus best splitting affects predictive performance, and how the depth of the tree affects it.
```python
def testDecisionTreeRegressorsplitter(*data):
    """Compare the 'best' and 'random' split strategies."""
    X_train, X_test, y_train, y_test = data
    splitters = ['best', 'random']
    for splitter in splitters:
        regressor = DecisionTreeRegressor(splitter=splitter)
        regressor.fit(X_train, y_train)
        print(f"Splitter: {splitter}")
        print(f"Training score: {regressor.score(X_train, y_train)}")
        print(f"Testing score: {regressor.score(X_test, y_test)}")


# Reuses the imports and the data helper from the first example.
X_train, X_test, y_train, y_test = createdata(100)
testDecisionTreeRegressorsplitter(X_train, X_test, y_train, y_test)
```
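A single fit with `splitter='random'` depends on the random seed, so one run may not be representative. The sketch below repeats the comparison over several seeds and reports the mean and spread of the test score. It regenerates the noisy sine data so that it runs on its own; the helper name `splitter_test_scores` and the choice of 20 seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy sine data, assumed to mirror the earlier example.
rng = np.random.RandomState(0)
X = 5 * rng.rand(100, 1)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(20))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)


def splitter_test_scores(splitter, seeds=range(20)):
    """Return the test R^2 of one tree per seed (hypothetical helper)."""
    scores = []
    for seed in seeds:
        model = DecisionTreeRegressor(splitter=splitter, random_state=seed)
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return np.array(scores)


for splitter in ("best", "random"):
    scores = splitter_test_scores(splitter)
    print(f"{splitter}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```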
```python
def testDecisionTreeRegressordepth(*data, maxdepth):
    """Plot training and testing scores for max_depth = 1 .. maxdepth."""
    X_train, X_test, y_train, y_test = data
    depths = np.arange(1, maxdepth + 1)
    training_scores = []
    testing_scores = []
    for depth in depths:
        regressor = DecisionTreeRegressor(max_depth=depth)
        regressor.fit(X_train, y_train)
        training_scores.append(regressor.score(X_train, y_train))
        testing_scores.append(regressor.score(X_test, y_test))

    fig, ax = plt.subplots(figsize=(8, 6))
    ax.plot(depths, training_scores, label="Training Scores")
    ax.plot(depths, testing_scores, label="Testing Scores")
    ax.set_xlabel("Max Depth Value")
    ax.set_ylabel("Score Value")
    ax.set_title("Decision Tree Regression")
    ax.legend(framealpha=0.5)
    plt.show()


# Reuses the imports and the data helper from the first example.
X_train, X_test, y_train, y_test = createdata(100)
testDecisionTreeRegressordepth(X_train, X_test, y_train, y_test, maxdepth=10)
```
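The code above shows that the choice between random and best splitting has little effect on predictive performance, whereas the depth of the tree has a significant effect. With only 100 samples, the tree can hardly be split any further once the maximum depth reaches 7: a binary tree of that depth can hold up to 2^7 = 128 leaves, which already exceeds the number of samples.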
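The depth at which an unrestricted tree stops growing can also be read from the fitted model rather than from the plot. The sketch below uses the `get_depth()` and `get_n_leaves()` methods of a fitted tree on regenerated noisy sine data; the data setup is an assumption that mirrors the earlier example, and the depth actually reached may differ from 7 because the tree is not necessarily balanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy sine data, assumed to mirror the earlier example.
rng = np.random.RandomState(0)
X = 5 * rng.rand(100, 1)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(20))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Grow a tree with no depth limit and inspect its final shape.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
depth = tree.get_depth()
n_leaves = tree.get_n_leaves()

print(f"training samples: {len(X_train)}")
print(f"depth reached:    {depth}")
print(f"number of leaves: {n_leaves}")
# A binary tree of a given depth has at most 2**depth leaves.
print(f"leaf capacity:    {2 ** depth}")
```

Setting `max_depth` above the reported depth has no further effect on this dataset, which is why the score curves flatten out.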