
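24.1 The Chain Rule

Suppose z = f(t)
and t = g(y).
Then, by the differentiation rule for composite functions (the chain rule), we obtain:
dz/dy = (dz/dt) · (dt/dy)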

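24.1.1 Backpropagation in a Computational Graph

In backpropagation, the incoming signal E is multiplied by the node's local derivative (∂y/∂x), and the result is passed on to the next node.

24.1.2 The Chain Rule

If a function is a composite function, its derivative can be expressed as the product of the derivatives of the individual functions that make it up.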
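
The chain rule can be checked numerically. The following is a minimal sketch, assuming the illustrative functions f(t) = t**2 and g(y) = 3*y + 1 (these particular functions are not from the text); it compares the product of the local derivatives with a central-difference estimate.

```python
def f(t):
    """Outer function z = f(t) = t**2 (chosen only for illustration)."""
    return t ** 2


def g(y):
    """Inner function t = g(y) = 3*y + 1 (chosen only for illustration)."""
    return 3 * y + 1


def chain_rule_derivative(y):
    """dz/dy = (dz/dt) * (dt/dy) with dz/dt = 2t and dt/dy = 3."""
    t = g(y)
    dz_dt = 2 * t
    dt_dy = 3
    return dz_dt * dt_dy


def numerical_derivative(func, y, h=1e-5):
    """Central-difference approximation of d func / dy."""
    return (func(y + h) - func(y - h)) / (2 * h)


y0 = 2.0
analytic = chain_rule_derivative(y0)                    # 2 * 7 * 3
numeric = numerical_derivative(lambda y: f(g(y)), y0)   # finite-difference estimate
print(analytic, numeric)
```

The two values should agree up to floating-point error, which is the property backpropagation relies on at every node.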

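24.1.3 The Chain Rule and Computational Graphs

In backpropagation, each node multiplies the derivative passed down from upstream by its own local derivative.

24.2 Backpropagation

Backpropagation multiplies the derivative passed down from upstream by the node's derivative with respect to the corresponding input.

24.2.1 Backpropagation through an Addition Node

The code is as follows: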

class AddLayer:
    def __init__(self):
        pass
    def forward(self,x,y):
        return x+y
    def backward(self,dout):
        dx=dout*1
        dy=dout*1
        return dx,dy

24.2.2 Backpropagation through a Multiplication Node

class MultiLayer:
    def __init__(self):
        self.x=None
        self.y=None
    def forward(self,x,y):
        self.x=x
        self.y=y
        return x*y
    def backward(self,dout):
        dx=dout*self.y
        dy=dout*self.x
        return dx,dy
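
To see the two node types working together, here is a small sketch that chains an addition node and a multiplication node to compute (x + y) * z. The layer classes are restated compactly so the block runs on its own, and the input values are arbitrary.

```python
class AddLayer:
    def forward(self, x, y):
        return x + y

    def backward(self, dout):
        # d(x+y)/dx = d(x+y)/dy = 1, so the upstream gradient passes through unchanged
        return dout * 1, dout * 1


class MultiLayer:
    def __init__(self):
        self.x = None
        self.y = None

    def forward(self, x, y):
        self.x, self.y = x, y
        return x * y

    def backward(self, dout):
        # d(x*y)/dx = y and d(x*y)/dy = x, so the stored inputs are swapped
        return dout * self.y, dout * self.x


add_layer = AddLayer()
mul_layer = MultiLayer()

# forward pass: out = (x + y) * z
x, y, z = 2.0, 3.0, 4.0
s = add_layer.forward(x, y)
out = mul_layer.forward(s, z)

# backward pass, starting from d(out)/d(out) = 1 and visiting the layers in reverse
ds, dz = mul_layer.backward(1.0)
dx, dy = add_layer.backward(ds)
print(out, dx, dy, dz)
```

The backward calls are made in the reverse order of the forward calls, which is the same pattern the full network uses later.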

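24.2.3 Backpropagation through Activation Function Layers

24.3 Backpropagation through Activation Function Layers

24.3.1 The ReLU Activation Function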

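class ReLu:
    def __init__(self):
        self.mask=None

    def forward(self,x):
        self.mask=(x<=0)  # True where x is not greater than 0, False elsewhere
        out=x.copy()
        # set the non-positive entries to 0 and leave the rest unchanged
        out[self.mask]=0
        return out
    def backward(self,dout):
        dx=dout.copy()  # copy so the caller's array is not modified in place
        dx[self.mask]=0
        return dx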

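24.3.2 The Sigmoid Activation Function

By the chain rule, the upstream value is multiplied by the derivative of this node's output with respect to its input. For the sigmoid y = 1/(1+exp(-x)), this local derivative is y(1-y).
Code: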

import numpy as np

class sigmoid:
    def __init__(self):
        self.out=None
    def forward(self,x):
        out=1/(1+np.exp(-x))
        self.out=out
        return out
    def backward(self,dout):
        dx=dout*(1.0-self.out)*self.out
        return dx
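
The backward rule dout * (1 - out) * out can be sanity-checked against a finite-difference derivative. This is a rough sketch using only the standard library; the test points are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_backward(out, dout=1.0):
    """Same rule as the backward method: dout * (1 - out) * out."""
    return dout * (1.0 - out) * out


def central_difference(func, x, h=1e-5):
    return (func(x + h) - func(x - h)) / (2 * h)


points = [-2.0, 0.0, 1.5]
analytic = [sigmoid_backward(sigmoid(x)) for x in points]
numeric = [central_difference(sigmoid, x) for x in points]
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_error)
```

Because the derivative is expressed through the forward output, the layer only needs to store out, not the input x.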

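24.4 Backpropagation through the Affine/Softmax Layers

For a batch, the gradient of the bias is the sum of the upstream gradients over the batch axis (axis 0), for example:
DY=np.array([[1,2,3],[4,5,6]])
dB=np.sum(DY,axis=0)  # array([5, 7, 9])
Code for the Affine layer: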

class Affine:
    def __init__(self,W,B):
        self.W=W
        self.B=B
        self.X=None
        self.dW=None
        self.dB=None
    def forward(self,X):
        self.X=X
        out=np.dot(X,self.W)+self.B
        return out
    def backward(self,dout):
        dX=np.dot(dout,self.W.T)
        self.dW=np.dot(self.X.T,dout)
        self.dB=np.sum(dout,axis=0)
        return dX
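
The three gradient formulas in backward can be verified numerically. The sketch below is an assumption-laden check, not part of the original code: it uses random data, a made-up upstream gradient dout, and a hypothetical helper numerical_grad, and compares the analytic gradients with central differences.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))     # batch of 4 samples, 3 features
W = rng.normal(size=(3, 2))     # 3 inputs -> 2 outputs
B = rng.normal(size=2)
dout = rng.normal(size=(4, 2))  # pretend upstream gradient


def weighted_output(X, W, B):
    """Scalar L = sum(dout * (X @ W + B)); its gradient w.r.t. the output is dout."""
    return np.sum(dout * (np.dot(X, W) + B))


# analytic gradients, same formulas as Affine.backward
dX = np.dot(dout, W.T)
dW = np.dot(X.T, dout)
dB = np.sum(dout, axis=0)


def numerical_grad(func, arr, h=1e-5):
    """Central differences over every element of arr (arr is perturbed and restored)."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        original = arr[idx]
        arr[idx] = original + h
        plus = func()
        arr[idx] = original - h
        minus = func()
        arr[idx] = original
        grad[idx] = (plus - minus) / (2 * h)
    return grad


loss = lambda: weighted_output(X, W, B)
err_X = np.max(np.abs(dX - numerical_grad(loss, X)))
err_W = np.max(np.abs(dW - numerical_grad(loss, W)))
err_B = np.max(np.abs(dB - numerical_grad(loss, B)))
print(dX.shape, dW.shape, dB.shape, err_X, err_W, err_B)
```

Each gradient has the same shape as the quantity it belongs to, which is a quick way to catch a misplaced transpose.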

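24.4.1 The Softmax-with-Loss Layer

Suppose an image of a handwritten 5 is fed in. After passing through a multi-layer neural network (two layers are assumed here), the network produces 10 output nodes, each with a different score or probability. The node corresponding to the label 5 (one-hot encoded as [0,0,0,0,0,1,0,0,0,0]) has the highest score or probability.

Next we look at how the gradient is computed when backpropagating through softmax together with the loss. A schematic is shown below.

Code: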

class softmaxwithloss:
    def __init__(self):
        self.loss=None
        self.y=None
        self.t=None
    def forward(self,x,t):
        self.y=softmax(x)
        self.t=t
        self.loss=cross_entropy_error(self.y,self.t)
        return self.loss
    def backward(self,dout=1):
        # for a batch, divide by the batch size
        batch_size=self.t.shape[0]
        dx=(self.y-self.t)/batch_size
        return dx
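
The compact result (y - t) / batch_size can be checked against a numerical gradient. The following sketch assumes one-hot targets and restates simplified versions of softmax and cross entropy (the names cross_entropy and loss_fn are illustrative, not the helpers used elsewhere in this chapter).

```python
import numpy as np


def softmax(x):
    x = x - np.max(x, axis=1, keepdims=True)   # subtract the row maximum for stability
    e = np.exp(x)
    return e / np.sum(e, axis=1, keepdims=True)


def cross_entropy(y, t):
    """Mean cross entropy for one-hot targets t."""
    return -np.sum(t * np.log(y + 1e-12)) / y.shape[0]


def loss_fn(x, t):
    return cross_entropy(softmax(x), t)


rng = np.random.default_rng(0)
x = rng.normal(size=(3, 5))                  # 3 samples, 5 classes
t = np.eye(5)[[0, 2, 4]]                     # one-hot targets

analytic = (softmax(x) - t) / x.shape[0]     # the backward rule from the text

numeric = np.zeros_like(x)
h = 1e-5
for idx in np.ndindex(x.shape):
    original = x[idx]
    x[idx] = original + h
    plus = loss_fn(x, t)
    x[idx] = original - h
    minus = loss_fn(x, t)
    x[idx] = original
    numeric[idx] = (plus - minus) / (2 * h)

max_error = np.max(np.abs(analytic - numeric))
print(max_error)
```

Each row of the gradient sums to zero, because both the softmax output and the one-hot target sum to one.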

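24.5 Implementing Error Backpropagation

24.5.1 Basic Steps of Neural Network Learning

Using stochastic gradient descent, we compute the gradients and update the weight and bias parameters; the whole procedure is a loop.
Step 1
Randomly select a subset (mini-batch) of the training data.
Step 2
Build the network and run forward propagation to obtain the output. Compute the loss from the output and the target values, then use backpropagation to obtain the gradient of the loss with respect to each parameter.
Step 3
Update the weight parameters by a small amount in the direction opposite to the gradient.
Step 4
Repeat steps 1, 2 and 3.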
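
The four steps can be sketched on a toy problem. The example below is only an illustration under simple assumptions: a one-parameter linear model fitted to noise-free data y = 2x + 1 with a mean-squared-error loss, rather than the neural network built later in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: y = 2x + 1 (noise-free, purely illustrative)
x_train = rng.uniform(-1.0, 1.0, size=200)
y_train = 2.0 * x_train + 1.0

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 20

for step in range(500):                       # step 4: repeat
    # step 1: pick a random mini-batch
    idx = rng.choice(x_train.shape[0], batch_size)
    x_batch, y_batch = x_train[idx], y_train[idx]

    # step 2: forward pass, loss, and gradients (mean squared error)
    pred = w * x_batch + b
    error = pred - y_batch
    loss = np.mean(error ** 2)
    grad_w = np.mean(2.0 * error * x_batch)
    grad_b = np.mean(2.0 * error)

    # step 3: small update against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)
```

After training, w and b should be close to 2 and 1. The network version differs only in how the gradients in step 2 are obtained.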

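24.5.2 Implementing Neural Network Learning with Backpropagation

The structure of the neural network is shown in the figure below.

The implementation in code follows.
1) Overview
To define and store the network architecture above, we first define several instance variables:
params, a dictionary that stores the weight parameters.
layers, an ordered dictionary that stores each layer; the order is the order in which the layers are inserted.
lastLayer, the final layer of the network.
Besides these three instance variables, we also need a few methods:
A constructor, which initializes the variables, the weights, and so on.
A prediction method, which produces the final output through the forward pass of every layer.
A loss function, which computes the cross entropy between the output and the target values as a measure of the distance between the two distributions.
An evaluation metric; accuracy is used here to measure model performance.
Finally, the gradient computation, done here by backpropagation: using the chain rule, we go from back to front and pass each layer's gradient on to the preceding layer (toward the input).
Of course, the layer classes must be defined first, each holding its weight parameters, its forward result, its backward gradients, and so on.
2) Defining the layer classes
① The softmax function and the Sigmoid and Relu classes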

def softmax(x):
    if x.ndim == 2:
        x = x.T
        x = x - np.max(x, axis=0)
        y = np.exp(x) / np.sum(np.exp(x), axis=0)
        return y.T 

    x = x - np.max(x) # guard against overflow
    return np.exp(x) / np.sum(np.exp(x))

class Sigmoid:
    def __init__(self):
        self.out=None
        
    def forward(self,x):
        self.out=1 / (1 + np.exp(-x))
        return self.out
    
    def backward(self,dout):
        dx=dout*(1.0 -self.out) * self.out
        return dx

class Relu:
    def __init__(self):
        self.mask = None

    def forward(self, x):
        self.mask = (x <= 0)
        out = x.copy()
        out[self.mask] = 0

        return out

    def backward(self, dout):
        dout[self.mask] = 0
        dx = dout

        return dx

② The Affine class (also called the weighted-sum layer)

class Affine:
    def __init__(self, W, b):
        self.W =W
        self.b = b
        
        self.x = None
        self.original_x_shape = None
        # gradients of the weight and bias parameters
        self.dW = None
        self.db = None

    def forward(self, x):
        # handle tensor (multi-dimensional) input
        self.original_x_shape = x.shape
        x = x.reshape(x.shape[0], -1)
        self.x = x

        out = np.dot(self.x, self.W) + self.b

        return out

    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)
        
        dx = dx.reshape(*self.original_x_shape)  # restore the original input shape (for tensor input)
        return dx

③ The last layer

class SoftmaxWithLoss:
    def __init__(self):
        self.loss = None
        self.y = None # output of softmax
        self.t = None # supervised (target) labels

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)
        
        return self.loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]
        if self.t.size == self.y.size: # labels given as one-hot vectors
            dx = (self.y - self.t) / batch_size
        else:
            dx = self.y.copy()
            dx[np.arange(batch_size), self.t] -= 1
            dx = dx / batch_size
        
        return dx

3) Defining the loss function

def cross_entropy_error(y, t):
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)
        
    # if t is one-hot encoded, convert it to class indices
    if t.size == y.size:
        t = t.argmax(axis=1)
             
    batch_size = y.shape[0]
    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-8)) / batch_size

4) Defining the neural network class

import numpy as np
from collections import OrderedDict


class TwoLayerNet:

    def __init__(self, input_size, hidden_size,output_size, weight_init_std = 0.01):
        # initialize the weights
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size) 
        self.params['b2'] = np.zeros(output_size)

        # create the layers
        self.layers = OrderedDict()
        self.layers['Affine1'] = Affine(self.params['W1'], self.params['b1'])
        #self.layers['Sigmoid1'] = Sigmoid()
        self.layers['Relu1'] = Relu()
        self.layers['Affine2'] = Affine(self.params['W2'], self.params['b2'])

        self.lastLayer = SoftmaxWithLoss()
        
    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)
        
        return x


    
        
    # x: input data, t: supervised labels
    def loss(self, x, t):
        y = self.predict(x)
        return self.lastLayer.forward(y, t)
    
    
    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        #print("predicted", y[0], y.shape)
        
        if t.ndim != 1: 
            t = np.argmax(t, axis=1)
        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy
        
    # x: input data, t: supervised labels
    def numerical_gradient(self, x, t):
        loss_W = lambda W: self.loss(x, t)
        
        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['b1'] = numerical_gradient(loss_W, self.params['b1'])
        grads['W2'] = numerical_gradient(loss_W, self.params['W2'])
        grads['b2'] = numerical_gradient(loss_W, self.params['b2'])
        
        return grads
        
    def gradient(self, x, t):
        # forward
        self.loss(x, t)

        # backward
        dout = 1
        dout = self.lastLayer.backward(dout)
        
        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        # collect the gradients of each parameter (weights and biases) in a dict
        grads = {}
        grads['W1'], grads['b1'] = self.layers['Affine1'].dW, self.layers['Affine1'].db
        grads['W2'], grads['b2'] = self.layers['Affine2'].dW, self.layers['Affine2'].db

        return grads
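
The numerical_gradient method above calls a module-level function numerical_gradient(f, x) that is not defined in this section. A minimal sketch of such a helper, assuming a central-difference scheme that perturbs the array in place, might look like this (the step size h and the test function are illustrative choices):

```python
import numpy as np


def numerical_gradient(f, x, h=1e-4):
    """Central-difference gradient of f with respect to the array x.

    f is called with x as its argument; x is perturbed in place and restored.
    """
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        original = x[idx]
        x[idx] = original + h
        f_plus = f(x)
        x[idx] = original - h
        f_minus = f(x)
        x[idx] = original            # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad


# quick check on f(x) = sum(x**2), whose gradient is 2x
x = np.array([[1.0, -2.0], [0.5, 3.0]])
grad = numerical_gradient(lambda v: np.sum(v ** 2), x)
print(grad)
```

Perturbing the array in place matters for the network class: loss_W ignores its argument and recomputes the loss from the parameter arrays that the layers hold by reference, so the change must be visible through those same arrays. Such a helper is far slower than backpropagation and is mainly useful for checking the output of gradient.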

5) Training the model with error backpropagation

import numpy as np
import matplotlib.pyplot as plt

# Load the data (load_mnist is assumed to be provided by the MNIST helper module used with this code)
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)

network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

iters_num = 20000  # number of iterations; adjust as appropriate
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1

train_loss_list = []
train_acc_list = []
test_acc_list = []

iter_per_epoch = max(train_size // batch_size, 1)

for i in range(iters_num):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # compute the gradients
    #grad = network.numerical_gradient(x_batch, t_batch)
    grad = network.gradient(x_batch, t_batch)

    # update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    if i % iter_per_epoch == 0:
        # decay the learning rate when i is also a multiple of 5000
        if i%5000==0:
            learning_rate*=0.9
        print(learning_rate)

        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
        print("train acc, test acc | " + str(train_acc) + ", " + str(test_acc))

# plot the results
markers = {'train': 'o', 'test': 's'}
x = np.arange(len(train_acc_list))
plt.plot(x, train_acc_list, label='train acc')
plt.plot(x, test_acc_list, label='test acc', linestyle='--')
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()

0.09000000000000001
train acc, test acc | 0.9837833333333333, 0.9724
0.09000000000000001
train acc, test acc | 0.9842666666666666, 0.9722
0.08100000000000002
train acc, test acc | 0.98475, 0.9716
0.08100000000000002
train acc, test acc | 0.9853166666666666, 0.9733
0.08100000000000002
train acc, test acc | 0.9859666666666667, 0.9726
0.08100000000000002
train acc, test acc | 0.9861166666666666, 0.9707
0.08100000000000002
train acc, test acc | 0.9873, 0.9737
0.08100000000000002
train acc, test acc | 0.9873833333333333, 0.9744
0.08100000000000002
train acc, test acc | 0.9881, 0.973
0.08100000000000002
train acc, test acc | 0.9886666666666667, 0.9747
0.08100000000000002
train acc, test acc | 0.9888833333333333, 0.9743

6) Comparing the effect of different optimization algorithms on the MNIST dataset

def smooth_curve(x):
    """Smooth the loss curve for plotting.
    Reference: http://glowingpython.blogspot.jp/2012/02/convolution-with-numpy.html
    """
    window_len = 11
    s = np.r_[x[window_len-1:0:-1], x, x[-1:-window_len:-1]]
    w = np.kaiser(window_len, 2)
    y = np.convolve(w/w.sum(), s, mode='valid')
    return y[5:len(y)-5]



import matplotlib.pyplot as plt
# 0: load the MNIST data==========
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

train_size = x_train.shape[0]
batch_size = 128
max_iterations = 2000


# 1: set up the experiment==========
optimizers = {}
optimizers['SGD'] = SGD()
optimizers['Momentum'] = Momentum()
optimizers['AdaGrad'] = AdaGrad()
optimizers['Adam'] = Adam()
#optimizers['RMSprop'] = RMSprop()

networks = {}
train_loss = {}
for key in optimizers.keys():
    networks[key] = TwoLayerNet(
        input_size=784, hidden_size=100,
        output_size=10)
    train_loss[key] = []    


# 2: start training==========
for i in range(max_iterations):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]
    
    for key in optimizers.keys():
        grads = networks[key].gradient(x_batch, t_batch)
        optimizers[key].update(networks[key].params, grads)
    
        loss = networks[key].loss(x_batch, t_batch)
        train_loss[key].append(loss)
    
    if i % 100 == 0:
        print( "===========" + "iteration:" + str(i) + "===========")
        for key in optimizers.keys():
            loss = networks[key].loss(x_batch, t_batch)
            print(key + ":" + str(loss))


# 3. plot the results==========
markers = {"SGD": "o", "Momentum": "x", "AdaGrad": "s", "Adam": "D"}
x = np.arange(max_iterations)
for key in optimizers.keys():
    plt.plot(x, smooth_curve(train_loss[key]), marker=markers[key], markevery=100, label=key)
plt.xlabel("iterations")
plt.ylabel("loss")
plt.ylim(0, 1)
plt.legend(loc='upper right')
plt.show()
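
The script above relies on optimizer classes (SGD, Momentum, AdaGrad, Adam) that are not defined in this section. As a rough sketch of the interface it expects, an update(params, grads) method that modifies the parameter dictionary in place, here are possible SGD and Momentum classes. The default hyperparameters are assumptions, and AdaGrad and Adam are omitted.

```python
import numpy as np


class SGD:
    """Plain stochastic gradient descent: p <- p - lr * grad."""

    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]


class Momentum:
    """SGD with momentum: v <- momentum * v - lr * grad; p <- p + v."""

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            # one velocity array per parameter, created on the first call
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]


# two update steps on a single parameter with a constant gradient of 1
params_sgd = {'W': np.array([1.0, 2.0])}
params_mom = {'W': np.array([1.0, 2.0])}
grads = {'W': np.array([1.0, 1.0])}

sgd, mom = SGD(lr=0.1), Momentum(lr=0.1, momentum=0.9)
for _ in range(2):
    sgd.update(params_sgd, grads)
    mom.update(params_mom, grads)
print(params_sgd['W'], params_mom['W'])
```

The in-place operators (-= and +=) are a deliberate choice: the Affine layers hold references to the same arrays as network.params, so rebinding the dictionary entries to new arrays would leave the layers using stale weights.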

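Download address for this chapter's dataset (extraction code: 7kct)

Neural style transfer applies the style of a reference image to a target image while preserving the content of the target image, as shown in the figure below:

The core idea behind style transfer is to define a loss function and then minimize it. The loss here consists of a style loss and a content loss.
Expressed as a formula: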

loss = distance(style(reference_image) - style(generated_image)) +
       distance(content(original_image) - content(generated_image))

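The details are shown in the figure below.

As the figure shows, suppose the initial image x (Input image) is a random image. We pass it through the network fw (Image Transform Net) to generate an image y.
We then compare the features of y with those of the style image ys to obtain loss_style, and with those of the content image yc to obtain loss_content. Taking loss = loss_style + loss_content, we can train the parameters of the network fw.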

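23.1 Content Loss

The content loss is usually the squared difference (or L2 norm) between the activations of a chosen layer, typically one near the top of the network, for the content image and for the generated image.
In code: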

content_loss = F.mse_loss(features[2], content_features[2]) * content_weight

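23.2 Style Loss

The Gram matrix is the inner product of the feature maps of a given layer. This inner product can be understood as a map of the correlations between the features of that layer. The loss function is defined with the following considerations in mind:
① Keep the higher-layer activations of the target content image and the generated image similar, so that content is preserved. The convolutional network should see the target image and the generated image as containing the same content.
② Keep the correlations within the activations similar in both lower and higher layers, so that style is preserved. Feature correlations capture texture, so the generated image and the style reference image should share the same textures at different spatial scales.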

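Computing the Gram matrix
Suppose the input image, after convolution, yields a feature map of shape [ch, h, w]. After flattening and transposing, it can be reshaped into matrices of shape [ch, h*w] and [h*w, ch]. Multiplying these two matrices gives a matrix of size [ch, ch], which is the Gram matrix, as shown in the figure below:

For example, suppose the convolution yields a feature map of shape [b, ch, h*w], where fm denotes the feature map of the m-th channel and fn that of the n-th channel. The element fm∗fn of the Gram matrix is then the inner product of the flattened features of channels m and n (element-wise product, summed).
Implementation: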

def gram_matrix(y):
    (b, ch, h, w) = y.size()
    features = y.view(b, ch, w * h)
    features_t = features.transpose(1, 2)
    gram = features.bmm(features_t) / (ch * h * w)
    return gram


style_grams = [gram_matrix(x) for x in style_features]

style_loss = 0
grams = [gram_matrix(x) for x in features]
for a, b in zip(grams, style_grams):
    style_loss += F.mse_loss(a, b) * style_weight

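Three further points about the Gram matrix are worth noting:
1 The Gram matrix is computed as a sum over spatial positions, so spatial information is discarded. If the spatial positions of a feature map are shuffled (with the same permutation for every channel), the resulting Gram matrix is identical to the original one. This is why the Gram matrix is said to discard the spatial relationships between elements.
2 The shape of the Gram matrix does not depend on the spatial size of the feature maps F, only on the number of channels: whatever H and W are, the Gram matrix has shape C x C.
3 For feature maps of shape C x H x W, the Gram matrix can be computed quickly with a reshape and a matrix multiplication: reshape F into a 2-D matrix of shape C x (H x W), then multiply F by its transpose. The result is the Gram matrix.

Characteristics of the Gram matrix:
Through the multiplication, differences between features are amplified or attenuated, so the matrix reflects, to some extent, properties of each feature vector and the relationships between them. It emphasizes style and texture and ignores spatial information.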
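
The first two properties can be demonstrated on random data. The sketch below uses NumPy rather than the PyTorch code above and omits the normalization by ch * h * w; the array sizes are arbitrary.

```python
import numpy as np


def gram_matrix(feature_map):
    """Gram matrix of a [ch, h, w] feature map: F @ F.T with F of shape [ch, h*w]."""
    ch = feature_map.shape[0]
    flat = feature_map.reshape(ch, -1)
    return flat @ flat.T


rng = np.random.default_rng(0)
fmap = rng.normal(size=(3, 4, 5))            # 3 channels, 4 x 5 spatial grid

# shuffle the spatial positions, using the same permutation for every channel
perm = rng.permutation(4 * 5)
shuffled = fmap.reshape(3, -1)[:, perm].reshape(3, 4, 5)

g_original = gram_matrix(fmap)
g_shuffled = gram_matrix(shuffled)
g_larger = gram_matrix(rng.normal(size=(3, 8, 8)))   # different H and W, same C

print(g_original.shape, g_larger.shape, np.allclose(g_original, g_shuffled))
```

Both matrices are 3 x 3 regardless of the spatial size, and shuffling the positions leaves the Gram matrix unchanged.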

23.3 Implementing Neural Style Transfer with Keras

https://ypw.io/style-transfer/ (neural style transfer, PyTorch 0.4)
1) Loading the target and style images

from keras.preprocessing.image import load_img, img_to_array

# This is the path to the image you want to transform.
#target_image_path = '/home/wumg/data/data/portrait.png'
target_image_path = '/home/wumg/data/data/shanghai_buildings.jpg'
# This is the path to the style image.
#style_reference_image_path = '/home/wumg/data/data/popova.png'
style_reference_image_path = '/home/wumg/data/data/starry-sky.jpg'

# Dimensions of the generated picture.
width, height = load_img(target_image_path).size
img_height = 400
img_width = int(width * img_height / height)

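2) Defining image-processing helper functions
These load, preprocess, and postprocess the images going into and coming out of the VGG19 network.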

import numpy as np
from keras.applications import vgg19

def preprocess_image(image_path):
    img = load_img(image_path, target_size=(img_height, img_width))
    img = img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = vgg19.preprocess_input(img)
    return img

def deprocess_image(x):
    # Remove zero-center by mean pixel
    x[:, :, 0] += 103.939
    x[:, :, 1] += 116.779
    x[:, :, 2] += 123.68
    # 'BGR'->'RGB'
    x = x[:, :, ::-1]
    x = np.clip(x, 0, 255).astype('uint8')
    return x

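[Note]
For VGG19, Keras's preprocess_input() converts the image from RGB to BGR and zero-centers each color channel by subtracting the ImageNet per-channel mean (no scaling is applied). For data in channels_last order, the per-channel operation in Keras is: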

x[..., 0] -= 103.939
x[..., 1] -= 116.779
x[..., 2] -= 123.68
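
To make the relationship between preprocessing and deprocess_image concrete, here is a NumPy-only sketch of the two directions. The function names preprocess_sketch and deprocess_sketch are hypothetical stand-ins, not the Keras functions, and the rounding step before the uint8 cast is an addition that the original deprocess_image does not have.

```python
import numpy as np

# per-channel means in BGR order, the same constants used in deprocess_image
BGR_MEAN = np.array([103.939, 116.779, 123.68])


def preprocess_sketch(rgb):
    """RGB -> BGR, then subtract the per-channel mean (channels_last layout)."""
    bgr = rgb[..., ::-1].astype('float64')
    return bgr - BGR_MEAN


def deprocess_sketch(x):
    """Inverse of preprocess_sketch: add the mean back, BGR -> RGB, clip to uint8."""
    bgr = x + BGR_MEAN
    rgb = bgr[..., ::-1]
    # round before casting so tiny floating-point errors do not truncate a value
    return np.clip(np.rint(rgb), 0, 255).astype('uint8')


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)   # random RGB image

processed = preprocess_sketch(image)
restored = deprocess_sketch(processed)
print(processed.shape, np.array_equal(image, restored))
```

Channel 0 of the processed array holds the blue values minus 103.939, which is why deprocess_image adds the constants back before reversing the channel order.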

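3) Loading the VGG19 network and applying it to the three images
The three images are the target image, the style image, and the generated image, combined into a single batch. The generated image changes during optimization, so it is stored as a placeholder. The target and style images stay fixed throughout, so they are stored as constants.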

from keras import backend as K

target_image = K.constant(preprocess_image(target_image_path))
style_reference_image = K.constant(preprocess_image(style_reference_image_path))

# This placeholder will contain our generated image
combination_image = K.placeholder((1, img_height, img_width, 3))

# We combine the 3 images into a single batch
input_tensor = K.concatenate([target_image,
                              style_reference_image,
                              combination_image], axis=0)

# We build the VGG19 network with our batch of 3 images as input.
# The model will be loaded with pre-trained ImageNet weights.
model = vgg19.VGG19(input_tensor=input_tensor,
                    weights='imagenet',
                    include_top=False)
print('Model loaded.')

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg19_weights_tf_dim_ordering_tf_kernels_notop.h5
80142336/80134624 [==============================] - 155s 2us/step
Model loaded.

4) Inspecting the VGG19 network structure

print(model.summary())

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, None, None, 3) 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, None, None, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, None, None, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, None, None, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, None, None, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, None, None, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, None, None, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, None, None, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, None, None, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, None, None, 256) 590080
_________________________________________________________________
block3_conv4 (Conv2D) (None, None, None, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, None, None, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, None, None, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, None, None, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, None, None, 512) 2359808
_________________________________________________________________
block4_conv4 (Conv2D) (None, None, None, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, None, None, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, None, None, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, None, None, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, None, None, 512) 2359808
_________________________________________________________________
block5_conv4 (Conv2D) (None, None, None, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, None, None, 512) 0
=================================================================
Total params: 20,024,384
Trainable params: 20,024,384
Non-trainable params: 0
_________________________________________________________________
None

Structure of the VGG19 network

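5) Defining the content loss

Minimizing the content loss ensures that the target image and the generated image produce similar activations in a top layer of the VGG19 network (block5_conv2).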

def content_loss(base, combination):
    return K.sum(K.square(combination - base))

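6) Defining the style loss function
A helper function computes the Gram matrix of the input matrix, i.e. a map of the correlations within the original feature matrix.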

def gram_matrix(x):
    features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    gram = K.dot(features, K.transpose(features))
    return gram


def style_loss(style, combination):
    S = gram_matrix(style)
    C = gram_matrix(combination)
    channels = 3
    size = img_height * img_width
    return K.sum(K.square(S - C)) / (4. * (channels ** 2) * (size ** 2))

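7) Defining the total variation loss function
In addition to the two losses above, a total variation loss is needed. It acts as a regularizer on the pixels of the generated image, encouraging spatial continuity and preventing an overly pixelated result.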

def total_variation_loss(x):
    a = K.square(
        x[:, :img_height - 1, :img_width - 1, :] - x[:, 1:, :img_width - 1, :])
    b = K.square(
        x[:, :img_height - 1, :img_width - 1, :] - x[:, :img_height - 1, 1:, :])
    return K.sum(K.pow(a + b, 1.25))
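
The effect of this loss is easier to see outside the Keras graph. The following NumPy sketch mirrors the function above under the assumption that the array already has the full image size, so the explicit img_height and img_width slices reduce to :-1; the name total_variation_loss_np and the test images are illustrative.

```python
import numpy as np


def total_variation_loss_np(x):
    """NumPy counterpart of total_variation_loss for a [1, height, width, channels] array."""
    a = np.square(x[:, :-1, :-1, :] - x[:, 1:, :-1, :])   # vertical neighbour differences
    b = np.square(x[:, :-1, :-1, :] - x[:, :-1, 1:, :])   # horizontal neighbour differences
    return np.sum(np.power(a + b, 1.25))


rng = np.random.default_rng(0)
flat_image = np.full((1, 8, 8, 3), 0.5)            # perfectly uniform image
noisy_image = rng.uniform(size=(1, 8, 8, 3))       # pixel-level noise

tv_flat = total_variation_loss_np(flat_image)
tv_noisy = total_variation_loss_np(noisy_image)
print(tv_flat, tv_noisy)
```

A uniform image has zero total variation, while pixel-level noise is penalized, which is why adding this term pushes the generated image toward smoother regions.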

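8) Define the total loss
The total loss is a weighted sum of the content loss, the style loss, and the total variation loss. The upper layers of the network hold more global and more abstract information, so the content loss uses only one upper layer, block5_conv2. Each layer captures style at a different level, so the style loss uses a series of layers (block1_conv1, block2_conv1, block3_conv1, block4_conv1, block5_conv1).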

# Dict mapping layer names to activation tensors
outputs_dict = dict([(layer.name, layer.output) for layer in model.layers])
# Name of layer used for content loss
content_layer = 'block5_conv2'
# Name of layers used for style loss
style_layers = ['block1_conv1',
                'block2_conv1',
                'block3_conv1',
                'block4_conv1',
                'block5_conv1']
# Weights in the weighted average of the loss components
total_variation_weight = 1e-4
style_weight = 1.
content_weight = 0.025

# Define the loss by adding all components to a `loss` variable
loss = K.variable(0.)
layer_features = outputs_dict[content_layer]
target_image_features = layer_features[0, :, :, :]
combination_features = layer_features[2, :, :, :]
loss = loss + content_weight * content_loss(target_image_features,
                                            combination_features)
for layer_name in style_layers:
    layer_features = outputs_dict[layer_name]
    style_reference_features = layer_features[1, :, :, :]
    combination_features = layer_features[2, :, :, :]
    sl = style_loss(style_reference_features, combination_features)
    #loss += (style_weight / len(style_layers)) * sl
    loss =loss + (style_weight / len(style_layers)) * sl
#loss+=total_variation_weight * total_variation_loss(combination_image)
loss = loss + total_variation_weight * total_variation_loss(combination_image)


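9) A brief introduction to the L-BFGS algorithm
The optimization here uses the L-BFGS algorithm from SciPy. To make the optimizer easier to understand, we first look at the signature of the corresponding function and a small example.
Signature of fmin_l_bfgs_b: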

scipy.optimize.fmin_l_bfgs_b(func, x0, fprime=None, args=(), approx_grad=0, bounds=None, m=10, factr=10000000.0, pgtol=1e-05, epsilon=1e-08, iprint=-1, maxfun=15000, maxiter=15000, disp=None, callback=None, maxls=20)

Usage example:

import numpy as np
from scipy.optimize import fmin_l_bfgs_b

x_true = np.arange(0,10,0.1)
m_true = 2.5
b_true = 1.0
y_true = m_true*x_true + b_true

def func(params, *args):
    x = args[0]
    y = args[1]
    m, b = params
    y_model = m*x+b
    error = y-y_model
    return sum(error**2)

initial_values = np.array([1.0, 0.0])
mybounds = [(None,2), (None,None)]

fmin_l_bfgs_b(func, x0=initial_values, args=(x_true,y_true), approx_grad=True)
fmin_l_bfgs_b(func, x0=initial_values, args=(x_true, y_true), bounds=mybounds, approx_grad=True)


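10) Define the optimizer interface for the generated image
Obtain the loss and its gradient with respect to the generated image, and wrap both in a class that the optimizer can call.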

# Get the gradients of the generated image wrt the loss
grads = K.gradients(loss, combination_image)[0]

# Function to fetch the values of the current loss and the current gradients
fetch_loss_and_grads = K.function([combination_image], [loss, grads])

class Evaluator(object):

    def __init__(self):
        self.loss_value = None
        self.grad_values = None

    def loss(self, x):
        assert self.loss_value is None
        x = x.reshape((1, img_height, img_width, 3))
        outs = fetch_loss_and_grads([x])
        loss_value = outs[0]
        grad_values = outs[1].flatten().astype('float64')
        self.loss_value = loss_value
        self.grad_values = grad_values
        return self.loss_value

    def grads(self, x):
        assert self.loss_value is not None
        grad_values = np.copy(self.grad_values)
        self.loss_value = None
        self.grad_values = None
        return grad_values

evaluator = Evaluator()
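The Evaluator class exists because the optimizer asks for the loss and for the gradient through two separate callbacks, while the backend function returns both in a single pass. The sketch below shows the same caching pattern on a toy problem. Everything in it is an assumption made for illustration: loss_and_grad is a hypothetical stand-in for fetch_loss_and_grads, and plain gradient descent stands in for L-BFGS.

```python
import numpy as np

calls = {"n": 0}   # counts how often the expensive function runs

def loss_and_grad(x):
    """Hypothetical stand-in for fetch_loss_and_grads: f(x) = sum((x - 3)^2)."""
    calls["n"] += 1
    diff = x - 3.0
    return float(np.sum(diff ** 2)), 2.0 * diff

class CachingEvaluator:
    def __init__(self):
        self.loss_value = None
        self.grad_values = None

    def loss(self, x):
        # Compute loss and gradient together, keep the gradient for the next call
        assert self.loss_value is None
        self.loss_value, self.grad_values = loss_and_grad(x)
        return self.loss_value

    def grads(self, x):
        # Return the cached gradient and reset the cache
        assert self.loss_value is not None
        grad_values = np.copy(self.grad_values)
        self.loss_value = None
        self.grad_values = None
        return grad_values

ev = CachingEvaluator()
x = np.zeros(4)
for _ in range(50):
    value = ev.loss(x)
    x = x - 0.1 * ev.grads(x)   # gradient descent step, used here instead of L-BFGS
print(value, calls["n"])
```

The expensive function runs once per iteration rather than twice, which is the point of caching the gradient inside loss() and handing it out in grads().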


11) Train the model

from scipy.optimize import fmin_l_bfgs_b
# scipy.misc.imsave was removed from recent SciPy releases, so imageio.imwrite is used below
import imageio
import time

result_prefix = 'style_transfer_result'
iterations = 20

# Run scipy-based optimization (L-BFGS) over the pixels of the generated image
# so as to minimize the neural style loss.
# This is our initial state: the target image.
# Note that `scipy.optimize.fmin_l_bfgs_b` can only process flat vectors.
x = preprocess_image(target_image_path)
x = x.flatten()
for i in range(iterations):
    print('Start of iteration', i)
    start_time = time.time()
    x, min_val, info = fmin_l_bfgs_b(evaluator.loss, x,
                                     fprime=evaluator.grads, maxfun=20)
    print('Current loss value:', min_val)
    # Save current generated image
    img = x.copy().reshape((img_height, img_width, 3))
    img = deprocess_image(img)
    fname = result_prefix + '_at_iteration_%d.png' % i
    #imsave(fname, img)
    imageio.imwrite(fname, img)
    end_time = time.time()
    print('Image saved as', fname)
    print('Iteration %d completed in %ds' % (i, end_time - start_time))


12) Visualize the target image, the style reference image, and the generated image.

from matplotlib import pyplot as plt

# Content image
plt.imshow(load_img(target_image_path, target_size=(img_height, img_width)))
plt.figure()

# Style image
plt.imshow(load_img(style_reference_image_path, target_size=(img_height, img_width)))
plt.figure()

# Generate image
plt.imshow(img)
plt.show()


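People often say that a neural network is a "black box". That is true of some models, but the visualization of convolutional neural networks has made great progress: we can inspect the intermediate outputs of the convolutions, visualize individual convolution filters, and draw heatmaps of class activation over an image (the regions that decide the classification).
The three methods are:
1. Visualizing intermediate convnet outputs (intermediate activations): display the output of each filter after the activation function. This shows what an image looks like after convolution and helps explain what each filter does.
2. Visualizing convnet filters: helps us understand what visual pattern each filter responds to.
3. Visualizing heatmaps of class activation in an image: the heatmap shows which parts of the image were decisive for the classification, and it can also locate objects in the image.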

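22.1 Visualizing intermediate results

Visualizing intermediate results means displaying, for a given input, the feature maps output by the convolution and pooling layers of the network (a layer's output is usually the output of its activation function, so it is also called the layer's activation).
The feature maps of different channels are relatively independent, so the right way to visualize them is to plot the contents of each channel as a separate 2D image.
1) Visualize the intermediate outputs of the model below
This chapter uses the model cats_and_dogs_small_2.h5 and the images cat.1700.jpg and creative_commons_elephant.jpg.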

from keras.models import load_model

model = load_model('cats_and_dogs_small_2.h5')
model.summary()  # As a reminder.


Output:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_13 (Conv2D) (None, 148, 148, 32) 896
_________________________________________________________________
max_pooling2d_13 (MaxPooling (None, 74, 74, 32) 0
_________________________________________________________________
conv2d_14 (Conv2D) (None, 72, 72, 64) 18496
_________________________________________________________________
max_pooling2d_14 (MaxPooling (None, 36, 36, 64) 0
_________________________________________________________________
conv2d_15 (Conv2D) (None, 34, 34, 128) 73856
_________________________________________________________________
max_pooling2d_15 (MaxPooling (None, 17, 17, 128) 0
_________________________________________________________________
conv2d_16 (Conv2D) (None, 15, 15, 128) 147584
_________________________________________________________________
max_pooling2d_16 (MaxPooling (None, 7, 7, 128) 0
_________________________________________________________________
flatten_5 (Flatten) (None, 6272) 0
_________________________________________________________________
dropout_5 (Dropout) (None, 6272) 0
_________________________________________________________________
dense_13 (Dense) (None, 512) 3211776
_________________________________________________________________
dense_14 (Dense) (None, 1) 513
=================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0

2) Load one image from the test data

img_path = './cats_and_dogs_small/test/cats/cat.1700.jpg'

# We preprocess the image into a 4D tensor
from keras.preprocessing import image
import numpy as np

img = image.load_img(img_path, target_size=(150, 150))
img_tensor = image.img_to_array(img)
img_tensor = np.expand_dims(img_tensor, axis=0)
# Remember that the model was trained on inputs
# that were preprocessed in the following way:
img_tensor /= 255.

# Its shape is (1, 150, 150, 3)
print(img_tensor.shape)


Output:
(1, 150, 150, 3)
Drop the batch dimension to get the 3D image tensor:

img_tensor[0].shape


(150, 150, 3)

3) Display the image

import matplotlib.pyplot as plt

plt.imshow(img_tensor[0])
plt.show()


4) Extract the feature maps (activations) of the first 8 layers

from keras import models

# Extracts the outputs of the top 8 layers:
layer_outputs = [layer.output for layer in model.layers[:8]]
# Creates a model that will return these outputs, given the model input:
activation_model = models.Model(inputs=model.input, outputs=layer_outputs)


5) Run the model; it returns a list of 8 NumPy arrays, one per layer activation

# This will return a list of 8 Numpy arrays:
# one array per layer activation
activations = activation_model.predict(img_tensor)

first_layer_activation = activations[0]
print(first_layer_activation.shape)


(1, 148, 148, 32)
6) View the activation of channel 4 (zero-based index) of the first layer

import matplotlib.pyplot as plt

plt.matshow(first_layer_activation[0, :, :, 4], cmap='viridis')
plt.show()


Channel 4 appears to be a diagonal edge detector.
Now view the output of channel 7:

plt.matshow(first_layer_activation[0, :, :, 7], cmap='viridis')
plt.show()


Channel 7 looks like a dot detector, which is very useful for finding the cat's eyes.

7) Tile all channels of each layer into one image

import keras

# These are the names of the layers, so can have them as part of our plot
layer_names = []
for layer in model.layers[:8]:
    layer_names.append(layer.name)

images_per_row = 16

# Now let's display our feature maps
for layer_name, layer_activation in zip(layer_names, activations):
    # This is the number of features in the feature map
    n_features = layer_activation.shape[-1]

    # The feature map has shape (1, size, size, n_features)
    size = layer_activation.shape[1]

    # We will tile the activation channels in this matrix
    n_cols = n_features // images_per_row
    display_grid = np.zeros((size * n_cols, images_per_row * size))

    # We'll tile each filter into this big horizontal grid
    for col in range(n_cols):
        for row in range(images_per_row):
            channel_image = layer_activation[0,
                                             :, :,
                                             col * images_per_row + row]
            # Post-process the feature to make it visually palatable
            channel_image -= channel_image.mean()
            channel_image /= (channel_image.std() + 1e-5)  # epsilon guards against constant channels
            channel_image *= 64
            channel_image += 128
            channel_image = np.clip(channel_image, 0, 255).astype('uint8')
            display_grid[col * size : (col + 1) * size,
                         row * size : (row + 1) * size] = channel_image

    # Display the grid
    scale = 1. / size
    plt.figure(figsize=(scale * display_grid.shape[1],
                        scale * display_grid.shape[0]))
    plt.title(layer_name)
    plt.grid(False)
    plt.imshow(display_grid, aspect='auto', cmap='viridis')
    
plt.show()


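The figures above show the tiled channels of layers 1 through 8. They show that:
① The first layer is a collection of various edge detectors. At this stage the activations retain almost all the information in the original image.
② As the layers get deeper, the activations become more abstract and harder to interpret visually. Deeper layers carry less information about the visual content of the image and more information about its class.
③ The sparsity of the activations increases with depth.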
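The third observation can be turned into a number. A minimal sketch, assuming sparsity is defined as the fraction of entries that are exactly zero after ReLU; the function name activation_sparsity and the two synthetic arrays are made up for the example and are not measurements from the model.

```python
import numpy as np

def activation_sparsity(activation):
    """Fraction of entries that are exactly zero (inactive after ReLU)."""
    return float(np.mean(activation == 0))

rng = np.random.default_rng(0)
# Synthetic ReLU outputs: the "late" array is built from a more negative pre-activation
early = np.maximum(rng.standard_normal((1, 16, 16, 8)) + 1.0, 0)
late = np.maximum(rng.standard_normal((1, 4, 4, 32)) - 1.0, 0)
early_sparsity = activation_sparsity(early)
late_sparsity = activation_sparsity(late)
print(early_sparsity, late_sparsity)
```

With the real model, the same function could be applied to every entry of the activations list, for example [activation_sparsity(a) for a in activations], to check how sparsity changes from layer to layer.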

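22.2 Visualizing convnet filters

References: https://blog.csdn.net/weiwei9363/article/details/79112872
https://www.jianshu.com/p/fb3add126da1
How does a convolution filter actually recognize objects? One way to answer this is to find out what kind of image a filter responds to most strongly. Convolution is a feature extraction process, and each filter represents one feature. The larger a filter's response to a region of the image, the more that region "resembles" the filter.
Following this reasoning, if we find an image that maximizes the output of a given filter, we have found the image that filter is most interested in.
The idea: start from an image I with random content, compute the gradient of the response of filter F with respect to the image, G=∂F/∂I, and update the image iteratively by gradient ascent, I=I+η∗G, where η is the learning rate.
In other words, starting from a blank or noisy input image, we apply gradient ascent to the pixel values of the network's input so as to maximize the response of one filter. The resulting input image is the pattern to which the chosen filter responds most strongly.
The procedure is:
1) Build a loss function that measures the activation of one filter of one convolution layer on the input image;
2) Use gradient ascent to adjust the values of the input image so that this activation is maximized.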
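The update rule I=I+η∗G can be tried on a toy case before touching a real network. The sketch below is only an illustration under assumptions: the "filter" is a single random 5x5 array F whose response is the mean of F * I, so its gradient is known in closed form, and the gradient is rescaled by its root mean square so that every step has a comparable size.

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.standard_normal((5, 5))   # hypothetical filter

def response(I):
    """Response of the toy filter to image I."""
    return float(np.mean(F * I))

def gradient(I):
    # d/dI of mean(F * I) is F / F.size; for a linear filter it does not depend on I
    return F / F.size

I = rng.random((5, 5))            # random start image
eta = 1.0                         # step size (learning rate)
before = response(I)
for _ in range(40):
    G = gradient(I)
    G = G / (np.sqrt(np.mean(G ** 2)) + 1e-5)   # rescale the gradient
    I = I + eta * G                             # gradient ascent step
after = response(I)
print(before, after)
```

After the loop the image has drifted toward the pattern of F itself, which is the toy analogue of finding the image a filter is most interested in.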
Below we use filter 0 of the block3_conv1 layer of the VGG16 network as an example.
(1) First build the loss function for the filter's activation.

from keras.applications import VGG16
from keras import backend as K

model = VGG16(weights='imagenet',
              include_top=False)

layer_name = 'block3_conv1'
filter_index = 0

layer_output = model.get_layer(layer_name).output
loss = K.mean(layer_output[:, :, :, filter_index])


(2) To obtain the gradient of the loss with respect to the model input, use the gradients function built into the Keras backend module.

grads = K.gradients(loss, model.input)[0]


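To keep the updates well scaled, normalize the gradient tensor by dividing it by its L2 norm (computed here as the square root of the mean of the squared entries).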

# We add 1e-5 before dividing so as to avoid accidentally dividing by 0.
grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)
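The effect of this normalization is easy to check in NumPy. A minimal sketch, assuming the same formula as the Keras line; the name normalize_gradient and the two random gradient tensors are made up for the example.

```python
import numpy as np

def normalize_gradient(grads, eps=1e-5):
    """Divide by the root mean square of the entries; eps avoids division by zero."""
    return grads / (np.sqrt(np.mean(np.square(grads))) + eps)

def rms(a):
    return float(np.sqrt(np.mean(np.square(a))))

rng = np.random.default_rng(0)
small = rng.standard_normal((1, 8, 8, 3)) * 1e-3   # very small gradient
large = rng.standard_normal((1, 8, 8, 3)) * 1e+3   # very large gradient
rms_small = rms(normalize_gradient(small))
rms_large = rms(normalize_gradient(large))
print(rms_small, rms_large)
```

Both results come out close to 1, so the size of each update to the input image is governed by the step value rather than by the raw gradient magnitude.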


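(3) Compute the loss and the gradient with a Keras backend function
A Keras backend function can compute the values of the loss tensor and the gradient tensor for a given input image.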

output = K.function([model.input], [loss, grads])

# Let's test it:
import numpy as np
loss_value, grads_value = output([np.zeros((1, 150, 150, 3))])


Run gradient ascent in a loop to update the input image:

# We start from a gray image with some noise
input_img_data = np.random.random((1, 150, 150, 3)) * 20 + 128.

# Run gradient ascent for 40 steps
step = 1.  # this is the magnitude of each gradient update
for i in range(40):
    # Compute the loss value and gradient value
    loss_value, grads_value = output([input_img_data])
    # Here we adjust the input image in the direction that maximizes the loss
    input_img_data += grads_value * step


To make the result displayable, post-process the resulting image tensor.

def deprocess_image(x):
    # normalize tensor: center on 0., ensure std is 0.1
    x -= x.mean()
    x /= (x.std() + 1e-5)
    x *= 0.1

    # clip to [0, 1]
    x += 0.5
    x = np.clip(x, 0, 1)

    # convert to RGB array
    x *= 255
    x = np.clip(x, 0, 255).astype('uint8')
    return x


(4) Combine the code above into one function

def generate_pattern(layer_name, filter_index, size=150):
    # Build a loss function that maximizes the activation
    # of the nth filter of the layer considered.
    layer_output = model.get_layer(layer_name).output
    loss = K.mean(layer_output[:, :, :, filter_index])

    # Compute the gradient of the input picture wrt this loss
    grads = K.gradients(loss, model.input)[0]

    # Normalization trick: we normalize the gradient
    grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)

    # This function returns the loss and grads given the input picture
    iterate = K.function([model.input], [loss, grads])
    
    # We start from a gray image with some noise
    input_img_data = np.random.random((1, size, size, 3)) * 20 + 128.

    # Run gradient ascent for 40 steps
    step = 1.
    for i in range(40):
        loss_value, grads_value = iterate([input_img_data])
        input_img_data += grads_value * step
        
    img = input_img_data[0]
    return deprocess_image(img)


(5) Visualize the first 64 filters of each layer

for layer_name in ['block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1']:
    size = 64
    margin = 5

    # This is an empty (black) image where we will store our results.
    # uint8 so that plt.imshow treats the values as 0..255 pixel intensities.
    results = np.zeros((8 * size + 7 * margin, 8 * size + 7 * margin, 3), dtype='uint8')

    for i in range(8):  # iterate over the rows of our results grid
        for j in range(8):  # iterate over the columns of our results grid
            # Generate the pattern for filter `i + (j * 8)` in `layer_name`
            filter_img = generate_pattern(layer_name, i + (j * 8), size=size)

            # Put the result in the square `(i, j)` of the results grid
            horizontal_start = i * size + i * margin
            horizontal_end = horizontal_start + size
            vertical_start = j * size + j * margin
            vertical_end = vertical_start + size
            results[horizontal_start: horizontal_end, vertical_start: vertical_end, :] = filter_img

    # Display the results grid
    plt.figure(figsize=(20, 20))
    plt.imshow(results)
    plt.show()


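Patterns of the first 64 filters of block1_conv1

Patterns of the first 64 filters of block2_conv1

Patterns of the first 64 filters of block3_conv1

Patterns of the first 64 filters of block4_conv1
Conclusions:
Filters in the lower layers appear to respond to color and edge information.
The higher the layer, the more abstract and the more complex the patterns its filters respond to.
The patterns of higher-layer filters become harder to obtain by gradient ascent (for block5_conv1, many results are still random-noise images).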

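22.3 Visualizing heatmaps of class activation

1) CAM
Before introducing Grad-CAM (Gradient-weighted Class Activation Mapping), we first introduce CAM (Class Activation Mapping).
The heatmaps of everyday life are images formed from the heat that animals give off. In Figure 1, animals or people are clearly visible because they emit heat.

Figure 1
The class activation heatmaps of deep learning discussed here are similar.
In a deep convolutional neural network, after many convolution and pooling steps, the last convolution layer holds the richest spatial and semantic information. After it come the fully connected layers and the softmax layer (see Figure 2), whose contents are hard to interpret and hard to present visually. So to make a convolutional network give a reasonable explanation of its classification result, the key is to make good use of the last convolution layer.

Figure 2
CAM borrows an idea from the well-known paper Network in Network and replaces the fully connected layers with GAP (Global Average Pooling). GAP can be seen as a special average pooling layer whose pool size equals the size of the whole feature map, so it simply computes the mean of all pixels of each feature map. See Figure 3.

Figure 3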
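GAP as described above takes one line of NumPy. The sketch assumes feature maps stored with shape (h, w, k), channels last; the function name global_average_pooling is made up for the example.

```python
import numpy as np

def global_average_pooling(feature_maps):
    """feature_maps has shape (h, w, k); returns one mean per feature map, shape (k,)."""
    return feature_maps.mean(axis=(0, 1))

# Three 2x2 feature maps with easy-to-check values
fmaps = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
pooled = global_average_pooling(fmaps)
print(pooled)

# The output length depends only on the number of feature maps, not on h or w
other_shape = global_average_pooling(np.ones((7, 5, 3))).shape
print(other_shape)
```

The second call illustrates why a network ending in GAP can accept inputs of different sizes: whatever the spatial size, the pooled vector has one entry per feature map.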

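Figure 4
The NIN paper states the advantages of GAP (see Figure 4) clearly. Without fully connected layers the input no longer needs a fixed size, so inputs of any size are supported. GAP also makes fuller use of spatial information, and removing the parameters of the fully connected layers makes the model more robust and less prone to overfitting. Another important point: the last mlpconv layer (that is, the last convolution layer) is forced to produce as many feature maps as there are target classes, and the GAP result then goes through a softmax layer, which gives each feature map a clear meaning.
Let us focus on the connection between the GAP output and the output layer (ignoring the softmax layer for now). It is in essence a fully connected layer without a bias term, as shown in Figure 5:

Figure 5
Figure 5 shows that GAP gives the mean of each feature map of the last convolution layer, and the output (in practice the input of the softmax layer) is a weighted sum of these means. Note that for each class c, the mean of each feature map k has its own weight, written ω_k^c. That is the basic structure of CAM; it is then trained like an ordinary CNN. The interesting part comes after training: how do we obtain a heatmap that explains the classification result? It is simple. Say we want to explain why the result is "alpaca": we take all the weights ω_k^c of the alpaca class and compute the weighted sum of the corresponding feature maps. The result has the same size as a feature map, so we upsample it and overlay it on the original image, as shown in Figure 6.

Figure 6
In this way CAM tells us, in the form of a heatmap, which pixels the model used to decide that the picture shows an alpaca.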
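The CAM computation can be sketched in NumPy as follows. This is an illustration under assumptions, not code from a trained model: the feature maps A and the weight matrix W are random placeholders, the function names are made up, and nearest-neighbour repetition stands in for proper image resizing.

```python
import numpy as np

def class_activation_map(feature_maps, weights, class_index):
    """feature_maps: (h, w, k); weights: (k, num_classes), the GAP-to-output weights."""
    # Weighted sum over the k feature maps for the chosen class -> (h, w)
    return feature_maps @ weights[:, class_index]

def upsample_nearest(cam, factor):
    """Nearest-neighbour upsampling, a simple stand-in for image resizing."""
    return np.repeat(np.repeat(cam, factor, axis=0), factor, axis=1)

rng = np.random.default_rng(0)
A = rng.random((7, 7, 16))            # hypothetical last-conv feature maps
W = rng.standard_normal((16, 10))     # hypothetical weights for 10 classes
cam = class_activation_map(A, W, class_index=3)
cam_big = upsample_nearest(cam, 32)   # 7x7 -> 224x224, ready to overlay on the image

# The class score (GAP followed by the weighted sum) equals the spatial mean of the CAM
score = A.mean(axis=(0, 1)) @ W[:, 3]
print(cam.shape, cam_big.shape, np.isclose(score, cam.mean()))
```

The last line shows why the map is a faithful explanation in this architecture: averaging the heatmap over all positions gives back exactly the score of the class.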
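2) Grad-CAM
CAM explains results well, but it has one drawback: it requires changing the structure of the original model, which means the model must be retrained, and this greatly limits where it can be used. If the model is already deployed, or training is very expensive, retraining it for this purpose is practically impossible. Grad-CAM was proposed to solve this problem.
The basic idea of Grad-CAM is the same as that of CAM: obtain a weight for each feature map and then compute a weighted sum. The main difference lies in how the weights are obtained. CAM replaces the fully connected layers with a GAP layer and retrains to get the weights, whereas Grad-CAM takes a different route and computes the weights as the global average of the gradients. In fact, it can be shown mathematically that for a network with the CAM structure the two sets of weights are equivalent. To distinguish them from the CAM weights ω_k^c, we write the Grad-CAM weight of the k-th feature map for class c as α_k^c, computed by the following formula:

$$\alpha_k^c = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^c}{\partial A_{ij}^k}$$

Here Z is the number of pixels in a feature map, y^c is the score of class c (usually called the logits in code, the value fed into the softmax layer), and A_ij^k is the pixel value at position (i,j) of the k-th feature map. Once the weights of the class for all feature maps are known, their weighted sum gives the heatmap.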

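The overall structure of Grad-CAM is shown in the figure below:

Figure 7
Another difference from CAM is that Grad-CAM applies a ReLU to the final weighted sum. The reason is that we only care about pixels that have a positive influence on class c. Without the ReLU, pixels belonging to other classes could be brought in and degrade the explanation. The figure below shows Grad-CAM explaining classification results:

Figure 8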
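Put together, the Grad-CAM recipe is: average the gradients of the class score over each feature map to get one weight per map, take the weighted sum of the feature maps, and apply a ReLU. A NumPy sketch under assumptions: the gradients are taken as given (here random placeholders rather than values from a real network), and the function name grad_cam is made up for the example.

```python
import numpy as np

def grad_cam(feature_maps, grads):
    """feature_maps, grads: arrays of shape (h, w, k).

    grads[i, j, k] stands for the derivative of the class score y^c with respect
    to A_ij^k and is assumed to have been computed elsewhere.
    """
    Z = feature_maps.shape[0] * feature_maps.shape[1]   # pixels per feature map
    weights = grads.sum(axis=(0, 1)) / Z                # global average of the gradients, shape (k,)
    heatmap = feature_maps @ weights                    # weighted sum of the feature maps, shape (h, w)
    return np.maximum(heatmap, 0)                       # ReLU: keep only positive influence

rng = np.random.default_rng(0)
A = rng.random((14, 14, 512))               # hypothetical feature maps
dYdA = rng.standard_normal((14, 14, 512))   # hypothetical gradients
heatmap = grad_cam(A, dYdA)
print(heatmap.shape, float(heatmap.min()) >= 0.0)
```

No retraining or change of architecture is involved: only the feature maps of one convolution layer and the gradients of the class score with respect to them are needed.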
3) How to implement Grad-CAM with Keras
① Load the VGG16 network with pretrained weights

from keras.applications.vgg16 import VGG16

K.clear_session()

# Note that we are including the densely-connected classifier on top;
# all previous times, we were discarding it.
model = VGG16(weights='imagenet')


Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels.h5
553467904/553467096 [==============================] - 2601s 5us/step
② Preprocess an image

from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input, decode_predictions
import numpy as np

# The local path to our target image
img_path = '/home/wumg/data/elephants/creative_commons_elephant.jpg'

# `img` is a PIL image of size 224x224
img = image.load_img(img_path, target_size=(224, 224))

# `x` is a float32 Numpy array of shape (224, 224, 3)
x = image.img_to_array(img)

# We add a dimension to transform our array into a "batch"
# of size (1, 224, 224, 3)
x = np.expand_dims(x, axis=0)

# Finally we preprocess the batch
# (this does channel-wise color normalization)
x = preprocess_input(x)


③ Run the pretrained VGG16 network on the image and decode its prediction vector into a human-readable format.

preds = model.predict(x)
print('Predicted:', decode_predictions(preds, top=3)[0])


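Predicted: [('n02504458', 'African_elephant', 0.90942115), ('n01871265', 'tusker', 0.08618273), ('n02504013', 'Indian_elephant', 0.004354583)]

The output lists the top three predicted classes:
African elephant (African_elephant), probability about 90.9%
Tusker (tusker), probability about 8.6%
Indian elephant (Indian_elephant), probability about 0.4%
The network recognizes that the image contains an unspecified number of African elephants. The most strongly activated entry of the prediction vector is the one for the "African elephant" class (the entry with the highest class probability), at index 386: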

np.argmax(preds[0])


386
④ Implement the Grad-CAM algorithm

# This is the "african elephant" entry in the prediction vector
african_elephant_output = model.output[:, 386]

# This is the output feature map of the `block5_conv3` layer,
# the last convolutional layer in VGG16
last_conv_layer = model.get_layer('block5_conv3')

# This is the gradient of the "african elephant" class with regard to
# the output feature map of `block5_conv3`
grads = K.gradients(african_elephant_output, last_conv_layer.output)[0]

# This is a vector of shape (512,), where each entry
# is the mean intensity of the gradient over a specific feature map channel
pooled_grads = K.mean(grads, axis=(0, 1, 2))

# This function allows us to access the values of the quantities we just defined:
# `pooled_grads` and the output feature map of `block5_conv3`,
# given a sample image
iterate = K.function([model.input], [pooled_grads, last_conv_layer.output[0]])

# These are the values of these two quantities, as Numpy arrays,
# given our sample image of two elephants
pooled_grads_value, conv_layer_output_value = iterate([x])

# We multiply each channel in the feature map array
# by "how important this channel is" with regard to the elephant class
for i in range(512):
    conv_layer_output_value[:, :, i] *= pooled_grads_value[i]

# The channel-wise mean of the resulting feature map
# is our heatmap of class activation
heatmap = np.mean(conv_layer_output_value, axis=-1)


⑥ Visualize the class activation map

heatmap = np.maximum(heatmap, 0)
heatmap /= np.max(heatmap)
plt.matshow(heatmap)
plt.show()

Figure 9
⑦ Overlay the heatmap just generated on the original image

import cv2

# We use cv2 to load the original image
img = cv2.imread(img_path)

# We resize the heatmap to have the same size as the original image
heatmap = cv2.resize(heatmap, (img.shape[1], img.shape[0]))

# We rescale the heatmap to 8-bit integers in the range 0-255
heatmap = np.uint8(255 * heatmap)

# We apply the heatmap to the original image
heatmap = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)

# 0.4 here is a heatmap intensity factor
superimposed_img = heatmap * 0.4 + img

# Save the image to disk
cv2.imwrite('/home/wumg/data/elephants/elephant_cam.jpg', superimposed_img)

Figure 10

References:
Deep Learning with Python, by François Chollet
https://bindog.github.io/blog/2018/02/10/model-explanation/
http://spytensor.com/index.php/archives/20/ (Keras and PyTorch implementations of the Grad-CAM algorithm and class activation maps)
http://spytensor.com/index.php/archives/19/ (an introduction to GAP)

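21.1 Main Features of TensorFlow.js

TensorFlow.js is a JavaScript library that adds machine learning capabilities to any web application. With TensorFlow.js you can develop machine learning scripts from scratch: you can use its API to build and train models in the browser or in a Node.js server application, and you can run existing models in a JavaScript environment.
You can even retrain pre-existing machine learning models with your own data, including data that is available client-side in the browser, such as image data from a webcam. If you are a machine learning or deep learning enthusiast, TensorFlow.js is a good way to learn.
TensorFlow.js is a machine learning library that uses WebGL for acceleration. It runs in the browser and provides a high-level JavaScript API. It puts high-performance machine learning building blocks at your fingertips, so that you can train neural networks in the browser or run pre-trained models in inference mode.
The main features of TensorFlow.js are developing machine learning models in JavaScript, running existing models, and retraining existing models, as shown in the figure below:

For a guide to installing and configuring TensorFlow.js, see: https://js.tensorflow.org/index.html#getting-started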

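21.2 Installing Node.js and NPM

Node.js is a JavaScript runtime built on Chrome's V8 engine. It uses an event-driven, non-blocking I/O model, which makes it lightweight and efficient. Node.js uses the package manager npm to install, configure, and remove modules, which is very convenient. Setting up the npm environment is still slightly involved, so the guide below walks through configuring Node.js and NPM on Windows.
For installation details, see:
https://jingyan.baidu.com/article/48b37f8dd141b41a646488bc.html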

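21.3 Introduction to JavaScript (JS)

1. History of JavaScript
To understand JavaScript, we first look back at how it was born.
In 1995, Netscape was becoming the best-known first-generation Internet company of the dawning Web era on the strength of its Navigator browser.
Because Netscape wanted to add dynamic effects to static HTML pages, it asked Brendan Eich to design the JavaScript language within two weeks. You read that correctly: he took only 10 days.
Why the name JavaScript? The Java language was very popular at the time, so Netscape hoped to borrow Java's fame for promotion. In fact, apart from some syntactic resemblance to Java, JavaScript has essentially nothing to do with it.
2. ECMAScript
Netscape developed JavaScript, and a year later Microsoft imitated it with JScript. To make JavaScript a global standard, several companies worked with ECMA (European Computer Manufacturers Association) to define a standard for the language, known as the ECMAScript standard.
Put simply, ECMAScript is a language standard, and JavaScript is Netscape's implementation of that standard.
Why not make JavaScript itself the standard? Because the name JavaScript is a trademark.
Most of the time, though, we still use the word JavaScript. If you come across the word ECMAScript, you can simply read it as JavaScript.
3. JavaScript versions
The JavaScript standard, ECMAScript, keeps evolving. The ECMAScript 6 standard (ES6), the latest version at the time of writing, was officially released in June 2015. Talking about a JavaScript version therefore really means saying which version of the ECMAScript standard it implements. For a more detailed introduction to JavaScript, see:
http://www.runoob.com/js/js-howto.html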

21.4 TensorFlow.js Basics

21.4.1 Main Concepts

The main concepts of TensorFlow.js include tensors, variables, operations, and models and layers, as shown in the figure below.

21.4.2 Tensors

A tensor is the central unit of data in TensorFlow. A tensor holds a set of numeric values and can have any shape, with one or more dimensions. When you create a new tensor you can also define its shape by passing a second argument to the tensor function:

const t1 = tf.tensor([1,2,3,4,2,4,6,8], [2,4]);

This defines a tensor with two rows and four columns. The resulting tensor looks like this:

[[1,2,3,4],
[2,4,6,8]]

You can also let TensorFlow infer the shape of the tensor, as in the following code.

const t2 = tf.tensor([[1,2,3,4],[2,4,6,8]]);

We can read the shape property to retrieve the size of a tensor.

const tensor_s = tf.tensor([2,2]).shape;

Here the shape is [2]. We can also create tensors of a specific size. For example, below we create a tensor of zeros with shape [2,3].

const t_zeros = tf.zeros([2,3]);

This line of code creates the following tensor:
[[0,0,0],
[0,0,0]]
In addition, you can use the following functions to make the code more readable:

tf.scalar: a tensor with a single value
tf.tensor1d: a tensor with one dimension
tf.tensor2d: a tensor with two dimensions
tf.tensor3d: a tensor with three dimensions
tf.tensor4d: a tensor with four dimensions
The figure below illustrates several common kinds of tensors:

For example, a scalar:

const tensor = tf.scalar(2);

For example, a two-dimensional tensor:

const c = tf.tensor2d([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]);
c.print();
// Output: [[1 , 2 , 3 ],
//          [10, 20, 30]]

In TensorFlow.js, all tensors are immutable: once a tensor is created it cannot be changed. An operation that changes values always creates a new tensor and returns it as the result.
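To make the relationship between a flat list of values and a shape more concrete, here is a minimal pure-Python sketch. The helper names `infer_shape` and `reshape` are hypothetical and only mimic, on nested lists, what tf.tensor does when it is given a shape or left to infer one; they are not part of TensorFlow.js.

```python
def infer_shape(nested):
    """Return the shape of a regular, non-empty nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return shape


def reshape(flat, shape):
    """Arrange a flat list into nested lists of the given shape (row-major order)."""
    total = 1
    for dim in shape:
        total *= dim
    if total != len(flat):
        raise ValueError("shape does not match the number of values")
    # Group values from the innermost dimension outward.
    for dim in reversed(shape[1:]):
        flat = [flat[i:i + dim] for i in range(0, len(flat), dim)]
    return flat


t1 = reshape([1, 2, 3, 4, 2, 4, 6, 8], [2, 4])
print(t1)                   # [[1, 2, 3, 4], [2, 4, 6, 8]]
print(infer_shape(t1))      # [2, 4]
print(infer_shape([2, 2]))  # [2]
```

The last line mirrors the tf.tensor([2,2]).shape example: a flat list of two values is a one-dimensional tensor of length 2, not a 2x2 tensor.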

21.4.3 Variables

Tensors are immutable: once created, their values cannot change. Variables, in contrast, can change their values dynamically and are mainly used to store and update values during model training. You can use the assign method to give an existing variable a new value:

const initialValues = tf.zeros([5]);
const biases = tf.variable(initialValues); // initialize biases
biases.print(); // output: [0, 0, 0, 0, 0]

const updatedValues = tf.tensor1d([0, 1, 0, 1, 0]);
biases.assign(updatedValues); // update values of biases
biases.print(); // output: [0, 1, 0, 1, 0]

Variables are mainly used to store and then update parameter values during model training.

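21.4.4 Operations

TensorFlow operations let you manipulate the data in tensors. Because tensors are immutable, an operation always returns its result as a new tensor.
TensorFlow.js provides many useful operations, such as square, add, sub and mul. You can apply an operation directly, as shown below:

const t3 = tf.tensor2d([[1, 2], [3, 4]]);
const t3_squared = t3.square();

After this code runs, the new tensor contains the following values:
[[1, 4 ],
[9, 16]]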

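21.4.5 Memory Management

Because TensorFlow.js uses the GPU to accelerate math operations, GPU memory has to be managed when working with tensors and variables.
TensorFlow.js provides two functions to help with this: dispose and tf.tidy.
- dispose
You can call dispose on a tensor or variable to clear it and free its GPU memory:

const x = tf.tensor2d([[0.0, 2.0], [4.0, 6.0]]);
const x_squared = x.square();

x.dispose();
x_squared.dispose();

Calling dispose becomes cumbersome when there are many tensor operations. TensorFlow.js provides another function, tf.tidy, which plays a role similar to a regular scope in JavaScript, but for GPU-backed tensors.
tf.tidy executes a function and cleans up any intermediate tensors it created, freeing their GPU memory. It does not clean up the return value of the inner function.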

// tf.tidy takes a function to tidy up after
const average = tf.tidy(() => {
  // tf.tidy will clean up all the GPU memory used by tensors inside
  // this function, other than the tensor that is returned.
  //
  // Even in a short sequence of operations like the one below, a number
  // of intermediate tensors get created. So it is a good practice to
  // put your math ops in a tidy!
  const y = tf.tensor1d([1.0, 2.0, 3.0, 4.0]);
  const z = tf.ones([4]);

  return y.sub(z).square().mean();
});

average.print() // Output: 3.5

Using tf.tidy helps prevent memory leaks in your application. It can also be used to control more carefully when memory is reclaimed.
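The scoping idea behind tf.tidy can be sketched in a few lines of Python. This is only a conceptual model under stated assumptions: `FakeTensor` and `tidy` are hypothetical stand-ins (a `disposed` flag plays the role of freed GPU memory), the sketch handles a single level of scoping only, and it says nothing about how TensorFlow.js implements tf.tidy internally.

```python
class FakeTensor:
    """Stand-in for a GPU-backed tensor; 'disposed' models freed memory."""
    live = []  # stack of scopes that are currently recording new tensors

    def __init__(self, values):
        self.values = list(values)
        self.disposed = False
        if FakeTensor.live:
            FakeTensor.live[-1].append(self)

    def dispose(self):
        self.disposed = True

    def sub(self, other):
        return FakeTensor(a - b for a, b in zip(self.values, other.values))

    def square(self):
        return FakeTensor(v * v for v in self.values)

    def mean(self):
        return FakeTensor([sum(self.values) / len(self.values)])


def tidy(fn):
    """Run fn, then dispose every tensor it created except the returned one."""
    created = []
    FakeTensor.live.append(created)
    try:
        result = fn()
    finally:
        FakeTensor.live.pop()
    for tensor in created:
        if tensor is not result:
            tensor.dispose()
    return result


inputs = []

def compute():
    y = FakeTensor([1.0, 2.0, 3.0, 4.0])
    z = FakeTensor([1.0, 1.0, 1.0, 1.0])
    inputs.extend([y, z])  # keep references only so we can inspect them afterwards
    return y.sub(z).square().mean()

average = tidy(compute)
print(average.values)                # [3.5]
print(average.disposed)              # False
print([t.disposed for t in inputs])  # [True, True]
```

The returned tensor survives, while y, z and the two intermediate results of sub and square are all disposed, which is the behavior the paragraph above describes.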
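[Note]
- The function passed to tf.tidy should be synchronous and must not return a Promise. We recommend keeping code that updates the UI or makes remote requests outside of tf.tidy.
- tf.tidy does not clean up variables. Variables usually live for the entire lifetime of a machine learning model, so TensorFlow.js does not clean them up even if they were created inside tf.tidy. You can still call dispose on them manually.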

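21.4.6 Models and Layers

Conceptually, a model is a function that, given some input, produces some desired output. In TensorFlow.js there are two ways to create a model. You can use ops directly to express the work the model does. For example:

// Define function
function predict(input) {
  // y = a * x ^ 2 + b * x + c
  // tf.tidy is described in the memory management section above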
  return tf.tidy(() => {
    const x = tf.scalar(input);

    const ax2 = a.mul(x.square());
    const bx = b.mul(x);
    const y = ax2.add(bx).add(c);

    return y;
  });
}

// Define constants: y = 2x^2 + 4x + 8
const a = tf.scalar(2);
const b = tf.scalar(4);
const c = tf.scalar(8);

// Predict output for input of 2
const result = predict(2);
result.print() // Output: 24

We can also use the high-level API tf.model to build a model out of layers, which is a popular approach in deep learning. The following code constructs a tf.sequential model:

const model = tf.sequential();
model.add(
  tf.layers.simpleRNN({
    units: 20,
    recurrentInitializer: 'GlorotNormal',
    inputShape: [80, 4]
  })
);

// LEARNING_RATE, data and labels are placeholders that must be defined elsewhere
const optimizer = tf.train.sgd(LEARNING_RATE);
model.compile({optimizer, loss: 'categoricalCrossentropy'});
model.fit(data, labels);

Here units is the number of output activations, that is, the number of nodes in the layer. Because this is the last layer, 20 units correspond to a 20-class classification task.
TensorFlow.js has many different types of layers, such as tf.layers.simpleRNN, tf.layers.gru and tf.layers.lstm.
[Note]
When you build a network with TensorFlow.js, the first layer must specify its input shape explicitly; the remaining layers take theirs from the preceding layer by default, as the following example shows:
(1) First, instantiate the model with tf.sequential()
We use a Sequential model (the simplest type of model), in which tensors are passed consecutively from one layer to the next.

const model = tf.sequential();

(2) Add a convolutional layer

model.add(tf.layers.conv2d({
  inputShape: [28, 28, 1],
  kernelSize: 5,
  filters: 8,
  strides: 1,
  activation: 'relu',
  kernelInitializer: 'VarianceScaling'
}));

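- inputShape. The shape of the data that flows into the first layer of the model. Here, our MNIST samples are 28x28-pixel black-and-white images. The canonical format for image data is [row, column, depth], so the shape we configure is [28, 28, 1]: 28 rows by 28 columns of pixels, with a depth of 1 because our images have only one color channel.
- kernelSize. The size of the sliding convolutional filter window applied to the input data. Here we set kernelSize to 5, which specifies a square 5x5 convolutional window.
- filters. The number of filter windows of size kernelSize applied to the input data. Here we apply 8 filters to the data.
- strides. The step size of the sliding window, that is, how many pixels the filter shifts each time it moves over the image. Here we specify a stride of 1, which means the filter slides over the image in steps of 1 pixel.
- activation. The activation function applied to the data after the convolution is complete. Here we use the Rectified Linear Unit (ReLU) function, a very common activation function in ML models.
- kernelInitializer. The method used to randomly initialize the model weights, which is very important for training dynamics. We will not go into the details of initialization; VarianceScaling is a good choice of initializer here.

(2) Add a pooling layer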

model.add(tf.layers.maxPooling2d({
  poolSize: [2, 2],
  strides: [2, 2]
}));

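- poolSize. The size of the sliding window applied to the input data. Here we set poolSize to [2,2], which means the pooling layer applies 2x2 windows to the input data.
- strides. The step size of the sliding window, that is, how many pixels the window shifts each time it moves over the input data. Here we specify strides of [2,2], which means the window slides over the image in steps of 2 pixels in both the horizontal and vertical directions.
[Note]
Because poolSize and strides are both 2x2, the pooling windows do not overlap at all. This means the pooling layer halves the width and height of the activation map from the previous layer.
(3) Add another convolutional layer
Repeating layer structures is a common pattern in neural networks. We add a second convolutional layer to the model and follow it with a pooling layer. Note that in the second convolutional layer we increase the number of filters from 8 to 16. Also note that we do not specify inputShape, because it can be inferred from the output shape of the previous layer.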

model.add(tf.layers.conv2d({
  kernelSize: 5,
  filters: 16,
  strides: 1,
  activation: 'relu',
  kernelInitializer: 'VarianceScaling'
}));

model.add(tf.layers.maxPooling2d({
  poolSize: [2, 2],
  strides: [2, 2]
}));

(4) Add a flatten layer
We add a flatten layer to flatten the output of the previous layer into a vector.

model.add(tf.layers.flatten());

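[Note]
The flatten layer states neither an input shape nor an output shape; both are derived automatically from the output of the previous layer.
(5) Output layer
Finally, we add a dense layer (also called a fully connected layer), which performs the final classification. Flattening the output of the convolution and pooling layers before a dense layer is another common pattern in neural networks.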

model.add(tf.layers.dense({
  units: 10,
  kernelInitializer: 'VarianceScaling',
  activation: 'softmax'
}));

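- units. The number of output activations. Because this is the last layer and we are doing a 10-class classification task (digits 0-9), we use 10 units here. (units is sometimes referred to as the number of neurons.)
- kernelInitializer. We use the same VarianceScaling initialization strategy for the dense layer as for the convolutional layers.
- activation. The activation function of the last layer in a classification task is usually softmax. Softmax normalizes our 10-dimensional output vector into a probability distribution, so that each of our 10 classes has a probability.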

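To see how the shapes evolve through the layers built above, the arithmetic can be sketched as follows. This is a rough sketch under one assumption that the text does not state: that the convolution and pooling layers use no padding ('valid' padding), so a window of size k with stride s maps a length n to (n - k) / s + 1. The helper name `window_output_size` is hypothetical.

```python
def window_output_size(size, window, stride=1):
    """Output length of a sliding window without padding ('valid' padding assumed)."""
    return (size - window) // stride + 1


# (kind, window size, stride, number of filters) for the layers built above
layers = [
    ("conv2d", 5, 1, 8),
    ("maxPooling2d", 2, 2, None),
    ("conv2d", 5, 1, 16),
    ("maxPooling2d", 2, 2, None),
]

height, width, depth = 28, 28, 1  # MNIST input shape [28, 28, 1]
shapes = [(height, width, depth)]
for kind, window, stride, filters in layers:
    height = window_output_size(height, window, stride)
    width = window_output_size(width, window, stride)
    if filters is not None:  # pooling keeps the depth unchanged
        depth = filters
    shapes.append((height, width, depth))

flattened = height * width * depth
print(shapes)     # [(28, 28, 1), (24, 24, 8), (12, 12, 8), (8, 8, 16), (4, 4, 16)]
print(flattened)  # 256
```

Under that assumption, each pooling layer halves the spatial size, the flatten layer yields a vector of 256 values, and the dense layer maps those 256 values to the 10 class scores.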
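21.4.7 Optimization Problems

In this part we learn how to solve optimization problems. Given a function f(x), we want to find the value x=a that minimizes f(x). For this we need an optimizer. An optimizer is an algorithm that follows the gradient to minimize a function. The literature describes many optimizers, such as SGD and Adam, which differ in speed and accuracy. TensorFlow.js supports most of the important optimizers.
We take a simple example: f(x)=x⁶+2x⁴+3x²+x+1. The curve of the function is shown below. You can see that the minimum of the function lies in the interval [-0.5,0]. We will use an optimizer to find the exact value.

First, we define the function to be minimized: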

function f(x) 
{
  const f1 = x.pow(tf.scalar(6, 'int32')) //x^6
  const f2 = x.pow(tf.scalar(4, 'int32')).mul(tf.scalar(2)) //2x^4
  const f3 = x.pow(tf.scalar(2, 'int32')).mul(tf.scalar(3)) //3x^2
  const f4 = tf.scalar(1) //1
  return f1.add(f2).add(f3).add(x).add(f4)
}

Now we can minimize the function iteratively to find the minimum. We start from an initial value of 2; the learning rate controls how large a step is taken toward the minimum at each iteration. We use the Adam optimizer:

function minimize(epochs, lr)
{
  const y = tf.variable(tf.scalar(2)) // initialize the variable with the value 2
  const optim = tf.train.adam(lr);    // use the adaptive Adam optimizer
  for (let i = 0; i < epochs; i++) {
    optim.minimize(() => f(y));       // run one optimization step
  }
  return y
}

With a learning rate of 0.9, we find that after 200 iterations the minimum is reached at y = -0.16092407703399658.
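The reported minimizer can be cross-checked without TensorFlow.js. The sketch below is not the Adam optimizer used above: it swaps in plain gradient descent with a hand-derived derivative f'(x) = 6x⁵ + 8x³ + 6x + 1, and the step size 0.002 and the 5000 steps are assumptions chosen only so that plain gradient descent stays stable from the starting value 2.

```python
def f(x):
    """The function to minimize: x^6 + 2x^4 + 3x^2 + x + 1."""
    return x**6 + 2 * x**4 + 3 * x**2 + x + 1


def df(x):
    """Analytic derivative of f."""
    return 6 * x**5 + 8 * x**3 + 6 * x + 1


def minimize(x0, lr, steps):
    """Plain gradient descent: repeatedly step against the derivative."""
    x = x0
    for _ in range(steps):
        x -= lr * df(x)
    return x


x_min = minimize(x0=2.0, lr=0.002, steps=5000)
print(x_min, f(x_min), df(x_min))
```

The result lands at roughly x = -0.161, where the derivative vanishes. That agrees with the value reported above to about four decimal places, with the remaining difference reflecting that 200 Adam steps had not fully converged.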

For more information, see:
https://js.tensorflow.org/tutorials/core-concepts.html

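21.5 Setting Up the Project (with npm)

(1) First, we need to set up the project. Create a new empty directory.

$ mkdir tfjs01

(2) Change into the newly created project folder

$ cd tfjs01

All of the following commands are run in this folder.
(3) Create a package.json file
We now create a package.json file in the folder so that we can manage dependencies with the Node.js package manager:

$ npm init -y

(4) Install the Parcel bundler
Because we will install dependencies (such as the TensorFlow.js library) locally in the project folder, we need a module bundler for the web application. To keep things as simple as possible we use the Parcel web application bundler, because Parcel requires no configuration. Install it by running the following command (the -g flag installs Parcel globally rather than into the project):

$ npm install -g parcel-bundler

(5) Create two empty files
Next, create two new empty files for the implementation:

$ touch index.html index.js

(6) Install the Bootstrap library
We add the Bootstrap library as a dependency because we will use some Bootstrap CSS classes for our user interface elements:

$ npm install bootstrap

(7) Edit the two empty files
In index.html, insert the following code for a basic HTML page: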

<html>
<body>
    <div class="container">
        <b>Welcome to TensorFlow.js</b>
        <div id="output"></div>
    </div>

    <script src="./index.js"></script>
</body>
</html>

Also add the following code to index.js

import 'bootstrap/dist/css/bootstrap.css';

document.getElementById('output').innerText = "Hello World";

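We write the text Hello World into the element with the ID output, to see a first result on screen and to confirm that the JS code is processed correctly.

(8) Start the build and the web server
Finally, start the build process and the development web server with the parcel command:

$ parcel index.html

You should now be able to open the site in the browser at the URL http://localhost:1234. The result should correspond to the following screenshot:

[Note]
The steps above can also be carried out with yarn. For the correspondence between npm and yarn commands and their pros and cons, see:
https://juejin.im/entry/5a73ca7d6fb9a063435ea9ad
http://www.fly63.com/article/detial/554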

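21.6 Setting Up the Project (with yarn)

(1) First, we need to set up the project. Create a new empty directory.

$ mkdir tfjs02

(2) Change into the newly created project folder

$ cd tfjs02

(3) Create a package.json file

$ yarn init -y

(4) Install the Parcel bundler
Parcel is a web application bundler that differs from other tools in its developer experience. It uses multi-core processing for very fast builds and requires no configuration.
Because we will install dependencies (such as the TensorFlow.js library) locally in the project folder, we need a module bundler for the web application. To keep things as simple as possible we use Parcel. Install it by running the following command (it installs Parcel globally):

$ yarn global add parcel-bundler

(5) Create two empty files
Next, create two new empty files for the implementation:

$ touch index.html index.js

(6) Install the Bootstrap library
We add the Bootstrap library as a dependency because we will use some Bootstrap CSS classes for our user interface elements:

$ yarn add bootstrap

This generates a file (yarn.lock) and a folder (node_modules).
(7) Edit the two empty files
In index.html, insert the following code for a basic HTML page: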

<html>
<body>
    <div class="container">
        Welcome to TensorFlow.js
        <div id="output"></div>
    </div>

    <script src="./index.js"></script>
</body>
</html>

Also add the following code to index.js

import 'bootstrap/dist/css/bootstrap.css';
document.getElementById('output').innerText = "Hello World";

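We write the text Hello World into the element with the ID output, to see a first result on screen and to confirm that the JS code is processed correctly.

(8) Start the build and the web server
Finally, start the build process and the development web server with the parcel command:

$ parcel index.html

Running this command generates a dist folder and updates the related files.
You should now be able to open the site in the browser at the URL http://localhost:1234. When a file changes, Parcel rebuilds automatically and supports hot module replacement. The result should correspond to the following screenshot: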

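21.7 A Worked Example

This example uses TensorFlow.js to define a model that fits a straight line (y=2x-1). With the trained model, you can then enter a value in the browser and get a prediction in real time.
(1) Add TensorFlow.js
To add TensorFlow.js to the project, we again use NPM and run the following command in the project directory:

$ npm install @tensorflow/tfjs

This downloads the library and installs it into the node_modules folder. After the command succeeds, we can import the TensorFlow.js library in index.js by adding the following import statement at the top of the file:

import * as tf from '@tensorflow/tfjs';

Once TensorFlow.js is imported as tf, we can access the TensorFlow.js API through the tf object in our code.
(2) Define the model
Now that TensorFlow.js is available, let's start with a simple machine learning exercise. The example application below learns the formula Y = 2X-1, which is a linear regression.
This function returns the Y value for a given X. If you plot the points (X, Y), you get a straight line, as shown below:

Next we take input data (X, Y) from this function and train a model on those numbers. We then use the trained model to predict Y values for new X values. We expect the Y values returned by the model to be close to the exact values returned by the function.

Let's create a very simple neural network for this. The model only has to handle one input value and one output value: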

// Define a machine learning model for linear regression
const model = tf.sequential();
model.add(tf.layers.dense({units: 1, inputShape: [1]}));

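First, we create a new model instance by calling the tf.sequential method. This returns a new sequential model, in which the output of one layer is the input of the next; the model is a simple "stack" of layers with no branching or skipping.
With the model created, we add the first layer by calling model.add and passing it a new layer created with tf.layers.dense. This creates a dense, or fully connected, layer. In a dense layer, every node in the layer is connected to every node in the previous layer. For our example, it is enough to add a single dense layer with one input and one output to the network.

In the next step, we need to specify the loss function and the optimizer for the model.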

// Specify loss and optimizer for model
model.compile({loss: 'meanSquaredError', optimizer: 'sgd'});

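This is done by passing a configuration object to the compile method of the model instance. The configuration object contains two properties:
- loss: Here we use the meanSquaredError loss function. In general, a loss function maps the values of one or more variables onto a real number that represents the "loss" associated with those values. When a model is trained, it tries to minimize the result of the loss function. The mean squared error of an estimator is the average of the squared errors, that is, the average squared difference between the estimated values and the true values.
- optimizer: The optimizer function to use. Our linear regression task uses the sgd function. SGD stands for Stochastic Gradient Descent, an optimizer that is well suited to linear regression tasks.
The model is now configured, so next we train it.
(3) Train the model
To train the model with values of the function Y=2X-1, we define two tensors of shape [6, 1]. The first tensor, xs, contains the x values and the second tensor, ys, contains the corresponding y values: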

// Prepare training data
const xs = tf.tensor2d([-1, 0, 1, 2, 3, 4], [6, 1]);
const ys = tf.tensor2d([-3, -1, 1, 3, 5, 7], [6, 1])

We pass these two tensors to the model.fit method to train the model:

// Train the model
model.fit(xs, ys, {epochs: 500}).then(() => {});

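As the third argument we pass an object with a property named epochs, set to the value 500. This number specifies how many times TensorFlow.js passes over the training set.
The result of the fit method is a Promise, so we register a callback function that is invoked when training finishes.
(4) Predict
Now let's perform the last step inside this callback function and predict a y value for a given x value: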

// Train the model
model.fit(xs, ys, {epochs: 500}).then(() => {
    // Use model to predict values
    model.predict(tf.tensor2d([5], [1,1])).print();
});

The prediction is made with the model.predict method, which takes the input value as a tensor argument. In this particular case we create a tensor with a single value (5) inline and pass it to predict. Calling print makes sure the resulting value is printed to the console, as shown below:

The output shows a predicted value of 8.9962864, which is very close to 9 (for x = 5, the function Y=2X-1 gives Y = 9).
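What the dense layer and the sgd optimizer do on this data can be imitated by hand. The sketch below is a minimal pure-Python version of gradient descent on the mean squared error, not TensorFlow.js itself. The zero initial values for the weight w and the bias b, the learning rate of 0.01, and full-batch updates are assumptions made for illustration, so the exact numbers will differ from the TensorFlow.js run.

```python
xs = [-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
ys = [-3.0, -1.0, 1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0  # assumed initial weight and bias
lr = 0.01        # assumed learning rate
n = len(xs)

for epoch in range(500):
    # Prediction errors of the current line y = w * x + b
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
    grad_b = 2.0 / n * sum(errors)
    # Gradient descent update
    w -= lr * grad_w
    b -= lr * grad_b

prediction = w * 5.0 + b
print(w, b, prediction)
```

After 500 passes, w is close to 2 and b is close to -1, so the prediction for x = 5 is close to, but not exactly, 9. This is the same kind of small residual error as in the value reported above.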
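(5) Improve the user interface
The example implemented above makes a prediction for a fixed input value (5) and prints the result to the browser console. Let's introduce a richer user interface that lets the user enter the value to predict for. Add the following code to index.html: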

<html>
<body>
    <div class="container" style="padding-top: 20px">
        <div class="card">
            <div class="card-header">
                <strong>A simple TensorFlow.js example - linear regression</strong>
            </div>
            <div class="card-body">
                <label>Input value:</label> <input type="text" id="inputValue" class="form-control"><br>
                <button type="button" class="btn btn-primary" id="predictButton" disabled>Training the model, please wait ...</button><br><br>
                <span>Prediction: </span>
                <span class="badge badge-secondary" id="output"></span>
            </div>
        </div>
    </div>

    <script src="./index.js"></script>
</body>
</html>

Here we use various Bootstrap CSS classes, add input and button elements to the page, and define an area for displaying the result.

We also need to make some changes in index.js:

import * as tf from '@tensorflow/tfjs';
import 'bootstrap/dist/css/bootstrap.css';

// Define a machine learning model for linear regression
const model = tf.sequential();
model.add(tf.layers.dense({units: 1, inputShape: [1]}));

// Specify loss and optimizer for model
model.compile({loss: 'meanSquaredError', optimizer: 'sgd'});

// Prepare training data
const xs = tf.tensor2d([-1, 0, 1, 2, 3, 4], [6, 1]);
const ys = tf.tensor2d([-3, -1, 1, 3, 5, 7], [6,1]);

// Train the model and set predict button to active
model.fit(xs, ys, {epochs: 500}).then(() => {
    // Use model to predict values
    document.getElementById('predictButton').disabled = false;
    document.getElementById('predictButton').innerText = "Predict";
});

// Register click event handler for predict button
document.getElementById('predictButton').addEventListener('click', () => {
    const val = document.getElementById('inputValue').value;
    const val_num = Number(val);
    // dataSync() returns the prediction as a typed array; take its single value
    const prediction = model.predict(tf.tensor2d([val_num], [1, 1])).dataSync()[0];
    document.getElementById('output').innerText = prediction;
});

We registered an event handler for the click event of the predict button. Inside this function, the value of the input element is read (and converted to a number with the Number function) and the model.predict method is called. The result returned by this method is inserted into the element with the id output.

The result should now look like this:

With the trained model we can now enter a value (x) and predict the Y value in real time. Click the "Predict" button to run the prediction; the result is displayed directly on the page.

(6) Check how it runs
Use the browser console to check how the code runs.

References:
https://js.tensorflow.org/
https://cloud.tencent.com/developer/article/1346798
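20.1 Introduction to PyTorch

PyTorch is a Python descendant of Torch, a neural network library written for the Lua language. Because PyTorch uses a dynamic computational graph, it has a distinctive way of building neural networks: using and replaying a tape recorder. This contrasts with the static computational graphs used by most open-source frameworks, such as TensorFlow, Caffe, CNTK and Theano. With PyTorch, through a technique called reverse-mode auto-differentiation, you can change the behavior of your network arbitrarily with zero lag or overhead.
Tensors produced by torch can be placed on the GPU to accelerate computation (provided you have a suitable GPU), just as NumPy keeps its arrays on the CPU.
torch is a Tensor library with GPU support. If you have used NumPy, you have already used tensors (that is, ndarray). PyTorch provides Tensors that work on both the CPU and the GPU.
PyTorch version history

Since Google open-sourced TensorFlow in 2015, the competition among deep learning frameworks has grown ever fiercer, and the major technology companies that value AI research and applications have all been increasing their investment. Since its release in early 2017, PyTorch has risen quickly, achieved a series of results in a short time, and become one of the star frameworks.
A preview of PyTorch 1.0 has been released and the stable version is imminent. The new version ports modular, production-oriented features from Caffe2 and ONNX and combines them with PyTorch's existing flexible, research-focused design, providing a fast, seamless path from research prototyping to production deployment for many kinds of AI projects. With PyTorch 1.0, AI developers can experiment and optimize performance quickly through a hybrid front end that switches seamlessly between imperative and declarative execution. The technology in PyTorch 1.0 already powers many Facebook products and services, including performing 6 billion text translations per day.
Components of PyTorch:
PyTorch consists of four main packages:
①. torch: a general-purpose array library similar to NumPy; tensors can be converted to a CUDA tensor type (torch.cuda.FloatTensor) and computed on the GPU.
②. torch.autograd: a package for building computational graphs and obtaining gradients automatically.
③. torch.nn: a neural network library with common layers and cost functions.
④. torch.optim: an optimization package with common optimization algorithms (such as SGD and Adam).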

20.2 Installing and Configuring PyTorch

Environment for this chapter: Python 3.6, PyTorch 0.4.1, Windows
To install PyTorch 0.4.1 on Windows:

Go to the PyTorch website (https://pytorch.org/) and select the installation options, as follows:

Enter the following installation command on the command line:

conda install pytorch-cpu -c pytorch

Verify that PyTorch was installed successfully and check its version:

import torch
print(torch.__version__)

Output
0.4.1

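20.3 PyTorch Examples

I find that the fastest and most effective way to get started is hands-on practice: working through real examples usually pays off more than starting from bare concepts. This section is organized roughly as follows:
We start from the familiar NumPy and see how to implement forward and backward propagation with it. We then introduce the PyTorch Tensor, which resembles a NumPy array, and implement the same functionality with it. After that we show, in turn, how to implement forward and backward propagation with the autograd, nn and optim modules. Finally we present two complete examples: a regression analysis and a convolutional neural network.

20.3.1 Forward and Backward Propagation with NumPy (a Simple Example)

To implement forward and backward propagation with NumPy, we first work through a simple example, then generalize it, and finally use PyTorch to perform backward propagation automatically.
The main facts of the simple example are:
The number of samples (N) is 1, the input dimension or number of input nodes (D_in) is 2, the number of hidden nodes (H) is 2, and the output dimension or number of output nodes (D_out) is 2. The weight matrix from the input to the hidden layer is w1, the weight matrix from the hidden layer to the output layer is w2, and the activation function of the hidden layer is ReLU (np.maximum(h, 0)). The network structure and the matrices and vectors are as follows:
The network structure of the simple example is:

1. Forward propagation
Diagram of forward propagation (see the thin blue lines):

The implementation steps are as follows:
(1) Import the required libraries or modules
This simple example mainly uses NumPy and its operations.

import numpy as np

(2) Generate the input data and the initial weights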

# N is the batch size of the input data; D_in is the input dimension;
# H is the hidden dimension; D_out is the output dimension.
N, D_in, H, D_out = 1, 2, 2, 2

# Generate the input data x and the target data y
# (fixed values rather than random numbers, so the results can be checked by hand)
x = np.arange(1,3).reshape(N, D_in)
y = np.arange(1,3).reshape(N, D_out)

# Initialize the weights
w1 =np.arange(0,0.4,0.1).reshape(D_in, H)
w2 =np.arange(0,0.4,0.1).reshape(H, D_out)

print('x:{0},\ny:{1},\ninitial w1:{2},\ninitial w2:{3}'.format(x,y,w1,w2))

Output
x:[[1 2]],
y:[[1 2]],
initial w1:[[0. 0.1]
[0.2 0.3]],
initial w2:[[0. 0.1]
[0.2 0.3]]
(3) Run the forward pass and compute the prediction

# Forward pass: compute predicted y
h = x.dot(w1)
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)
print('h:{0},\nh_relu:{1}, \ninitial y_pred:{2}'.format(h,h_relu,y_pred))

Output
h:[[0.4 0.7]],
h_relu:[[0.4 0.7]],
initial y_pred:[[0.14 0.25]]

Here dot is NumPy's matrix (inner) product; the rule for this operation is illustrated in the figure below:

(4) Compute the loss

# Compute and print loss
loss = np.square(y_pred - y).sum()
print(loss)

3.8021

2. Backward propagation
Diagram of backward propagation (see the thick green lines):

(1) The differentiation steps are as follows

# Backpropagate from the loss to the parameters w1 and w2.
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h_relu = grad_y_pred.dot(w2.T)
grad_h = grad_h_relu.copy()  # copy so that grad_h_relu itself is not modified
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)
print('grad_y_pred:{0},\ngrad_w2:{1}, \ngrad_h_relu:{2}'.format(grad_y_pred,grad_w2,grad_h_relu))
print('h:{0},\ngrad_h:{1}, \ngrad_w1:{2}'.format(h,grad_h,grad_w1))

Output
grad_y_pred:[[-1.72 -3.5 ]],
grad_w2:[[-0.688 -1.4 ]
[-1.204 -2.45 ]],
grad_h_relu:[[-0.35 -1.394]]
h:[[0.4 0.7]],
grad_h:[[-0.35 -1.394]],
grad_w1:[[-0.35 -1.394]
[-0.7 -2.788]]
(2) Update the weights using the gradients

# Update the weights using the gradients
learning_rate = 1e-6
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2
print('updated w1:{0},\nupdated w2:{1}'.format(w1,w2))

Output

updated w1:[[3.50000000e-07 1.00001394e-01]
[2.00000700e-01 3.00002788e-01]],
updated w2:[[6.88000000e-07 1.00001400e-01]
[2.00001204e-01 3.00002450e-01]]
For the derivation of the formulas for differentiating with respect to vectors and matrices, see:
https://blog.csdn.net/DawnRanger/article/details/78600506
This completes forward and backward propagation with NumPy for the simple example. Next we look at the general case, with batched data, general dimensions, and many iterations of weight updates.
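A common way to gain confidence in hand-derived gradients like the ones above is to compare them with finite differences. The sketch below repeats the simple example in plain Python lists and estimates each gradient entry with a central difference. The helper names `matmul`, `loss_fn` and `numeric_grad` and the step size `eps` are assumptions introduced for this illustration only.

```python
def matmul(a, b):
    """Matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def loss_fn(x, y, w1, w2):
    """Forward pass of the two-layer network followed by the squared-error loss."""
    h = matmul(x, w1)
    h_relu = [[max(v, 0.0) for v in row] for row in h]
    y_pred = matmul(h_relu, w2)
    return sum((p - t) ** 2 for pr, tr in zip(y_pred, y) for p, t in zip(pr, tr))


def numeric_grad(x, y, w1, w2, target, eps=1e-6):
    """Central-difference estimate of d(loss)/d(target); target must be w1 or w2."""
    grad = [[0.0] * len(target[0]) for _ in target]
    for i in range(len(target)):
        for j in range(len(target[0])):
            old = target[i][j]
            target[i][j] = old + eps
            plus = loss_fn(x, y, w1, w2)
            target[i][j] = old - eps
            minus = loss_fn(x, y, w1, w2)
            target[i][j] = old  # restore the original weight
            grad[i][j] = (plus - minus) / (2 * eps)
    return grad


x = [[1.0, 2.0]]
y = [[1.0, 2.0]]
w1 = [[0.0, 0.1], [0.2, 0.3]]
w2 = [[0.0, 0.1], [0.2, 0.3]]

print(loss_fn(x, y, w1, w2))
print(numeric_grad(x, y, w1, w2, w1))
print(numeric_grad(x, y, w1, w2, w2))
```

Up to rounding, the printed values should agree with the loss of 3.8021 and with the analytic gradients for w1 and w2 computed above.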

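20.3.2 Forward and Backward Propagation with NumPy (the General Case)

The previous section used a simple example to show how to implement forward and backward propagation with NumPy. Building on that, we now implement the general case with NumPy. The code is as follows:

# -*- coding: utf-8 -*-
import numpy as np

# N is the batch size of the input data; D_in is the input dimension;
# H is the hidden dimension; D_out is the output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Generate random input data x and target data y.
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize the weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute the prediction y_pred
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print the loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backward pass: compute the gradients of the loss with respect to the parameters
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update the parameters using the gradients
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2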

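20.3.3 Forward and Backward Propagation with PyTorch Tensors

NumPy is a great framework, but it cannot use GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or more, so unfortunately NumPy alone is not enough for modern deep learning.
Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually the same as a NumPy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on Tensors. Tensors can keep track of a computational graph and gradients, and they are also useful as a general tool for scientific computing.
Unlike NumPy, PyTorch Tensors can use GPUs to accelerate their numerical computations. To run a PyTorch Tensor on the GPU, you only need to create it on (or move it to) a GPU device, as the commented-out device line in the code below shows.
Here we use PyTorch Tensors to fit a two-layer network to random data. As in the NumPy example above, we have to implement the forward and backward passes through the network by hand:

# -*- coding: utf-8 -*-

import torch
# Define the tensor data type and device
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is the batch size of the input data; D_in is the input dimension;
# H is the hidden dimension; D_out is the output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Generate random input data x and target data y
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Initialize the weights with random values
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute the prediction y_pred
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print the loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backward pass: compute the gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update the weights with gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2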

20.3.4 Automatic backpropagation with Tensors and autograd

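In the example above, we had to implement both the forward and backward passes of the neural network by hand. Implementing the backward pass manually is not a big deal for a small two-layer network, but it quickly becomes tedious for large, complex networks.
Is there a more efficient way? We can use automatic differentiation to compute the backward pass automatically. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of the network defines a computational graph; nodes in the graph are Tensors, and edges are the functions that produce output Tensors from input Tensors. Backpropagating through this graph then lets you compute gradients easily.
This sounds complicated, but it is quite simple to use in practice. Each Tensor represents a node in the computational graph. If x is a Tensor with x.requires_grad = True, then x.grad is another Tensor that holds the gradient of some scalar value with respect to x.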
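
To make the idea of "nodes that remember how they were produced" concrete, here is a minimal pure-Python sketch of reverse-mode differentiation on scalars. The Value class and its attributes are hypothetical names invented for this illustration; this is not how PyTorch implements autograd, only a toy model of the same bookkeeping.

```python
# A hypothetical, minimal scalar autograd; not PyTorch's implementation.
class Value:
    """A graph node holding a number, its gradient, and how it was produced."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents      # input nodes (incoming edges)
        self.backward_fn = None     # pushes self.grad back to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad

        out.backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Visit nodes in reverse topological order, starting from the output.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node.backward_fn is not None:
                node.backward_fn()


w = Value(3.0)
x = Value(2.0)
b = Value(1.0)
loss = w * x + b      # the forward pass builds the graph
loss.backward()       # the backward pass fills in .grad
print(w.grad, x.grad, b.grad)  # 2.0 3.0 1.0
```

Note that gradients are accumulated with +=, which mirrors why the PyTorch training loops in this chapter zero the gradients after every update.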
Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to implement the backward pass through the network by hand!

# -*- coding: utf-8 -*-
import torch

# Define the tensor data type and device
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is the batch size; D_in is the input dimension;
# H is the hidden dimension; D_out is the output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors for the input data x and the target data y.
# No gradients are needed for these two Tensors during backpropagation,
# so requires_grad=False (the default).
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for the weights w1 and w2.
# Gradients are needed for these two Tensors during backpropagation,
# so we set requires_grad=True.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute the predicted y_pred.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute the loss.
    # loss is a scalar (zero-dimensional) Tensor;
    # loss.item() returns its value as a Python number.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This computes the gradient of
    # loss with respect to every Tensor with requires_grad=True; afterwards
    # w1.grad and w2.grad hold the gradients of loss with respect to w1 and w2.
    loss.backward()

    # Update the weights by hand inside torch.no_grad(): the weights have
    # requires_grad=True, but we do not want autograd to track the update.
    # torch.no_grad() disables gradient tracking within its context; results
    # computed there have requires_grad=False even if the inputs have
    # requires_grad=True.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

Instead of updating the weights by hand inside with torch.no_grad(), we can use an optimizer; see the section on optimizers (optim).

20.3.5 Extending autograd

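autograd gives us automatic backward differentiation. But what if we want to write an operation ourselves and supply its derivative by hand instead of relying on automatic differentiation?
The answer is to define our own function and implement both its forward computation and its backward (gradient) computation.
In PyTorch, we can easily define our own autograd operator by defining a subclass of torch.autograd.Function and implementing the forward and backward functions as static methods. We can then use the new operator by calling its apply method like a function, passing in Tensors that contain the input data.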
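
The forward/backward contract can be sketched without PyTorch at all. The snippet below is a rough pure-Python illustration on lists of floats; Context, relu_forward and relu_backward are hypothetical names standing in for ctx and the two static methods, and the rule for the gradient at exactly 0 is chosen to match the masking used in this chapter (only strictly negative inputs block the gradient).

```python
class Context:
    """Hypothetical stand-in for ctx: a place to stash values from the forward pass."""

    def __init__(self):
        self.saved = None


def relu_forward(ctx, inputs):
    ctx.saved = list(inputs)  # keep the inputs for the backward pass
    return [max(0.0, v) for v in inputs]


def relu_backward(ctx, grad_outputs):
    # Pass the upstream gradient through unless the input was negative.
    return [0.0 if v < 0 else g for v, g in zip(ctx.saved, grad_outputs)]


ctx = Context()
outputs = relu_forward(ctx, [-1.5, 0.0, 2.0])
grads = relu_backward(ctx, [10.0, 20.0, 30.0])
print(outputs)  # [0.0, 0.0, 2.0]
print(grads)    # [0.0, 20.0, 30.0]
```

The backward function receives the gradient with respect to the output and returns the gradient with respect to the input, which is the same division of labor a torch.autograd.Function subclass follows.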
In the example below, we define our own custom autograd function to perform ReLU and use it to implement our two-layer network:

# -*- coding: utf-8 -*-
import torch

# Define the class MyReLU, which inherits from torch.autograd.Function
class MyReLU(torch.autograd.Function):
    """
    By subclassing torch.autograd.Function we can implement a custom autograd
    function, supplying both its forward and its backward pass.
    """
    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return a
        Tensor containing the output. ctx is a context object that can be used to
        stash information for the backward computation. Use
        ctx.save_for_backward to cache the Tensors needed in the backward pass.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the
        loss with respect to the output, and we need to compute the gradient of
        the loss with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is the batch size; D_in is the input dimension;
# H is the hidden dimension; D_out is the output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input data x and target data y.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Initialize the weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # A custom Function is invoked through Function.apply;
    # alias it as relu so it can be called like an ordinary function.
    relu = MyReLU.apply

    # Forward pass: compute the predicted y_pred, using our custom autograd ReLU.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print the loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update the weights using gradient descent.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Zero the gradients after updating the weights
        w1.grad.zero_()
        w2.grad.zero_()

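[Tip]
Both Function and Module can be used to extend PyTorch with custom behavior to meet a network's needs, but there are important differences between the two:

(1) A Function generally defines a single operation and cannot hold parameters, so it suits operations such as activation functions and pooling. A Module holds parameters, so it suits defining a layer, such as a linear or convolutional layer, or an entire network.
(2) A Function needs forward and backward, written as static methods as in the example above, and you supply the derivative yourself (older PyTorch versions also used __init__). A Module only needs __init__ and forward; its backward computation is handled by the automatic differentiation mechanism.
(3) Loosely speaking, a Module is composed of a series of Functions. During forward, the Functions and Tensors form the computational graph; during backward, calling each Function's backward yields the result, so a Module does not need to define backward itself.
(4) A Module contains not only Functions but also the corresponding parameters, plus other functions and variables, which a Function does not have.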

20.3.6 Comparison with TensorFlow

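PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow's computational graphs (in the 1.x graph API used in this section) are static, whereas PyTorch uses dynamic computational graphs.
In TensorFlow, we define the computational graph once and then execute the same graph over and over, possibly feeding different input data to it. In PyTorch, each forward pass defines a new computational graph.
Static graphs are nice because you can optimize the graph up front; for example, a framework might decide to fuse some graph operations for efficiency, or come up with a strategy for distributing the graph across many GPUs or many machines. If you reuse the same graph over and over, this potentially costly up-front optimization is amortized, because the same graph is rerun many times.
One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform a different computation for each data point; for example, a recurrent network might be unrolled for a different number of time steps for each data point, and this unrolling can be implemented as a loop. With a static graph, the loop construct needs to be part of the graph; for this reason TensorFlow provides operators such as tf.scan for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build the graph on the fly for each example, we can use normal imperative control flow to perform a computation that differs for each input.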
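
The contrast can be caricatured in a few lines of plain Python. This is only an analogy under simplifying assumptions: the "static graph" here is a fixed list of functions built once, and run_static, run_dynamic, double and add_one are hypothetical names, not part of either framework.

```python
def double(v):
    return 2 * v


def add_one(v):
    return v + 1


def run_static(graph, value):
    """Execute a fixed list of operations that was built once, ahead of time."""
    for op in graph:
        value = op(value)
    return value


def run_dynamic(sequence):
    """Ordinary Python control flow: the number of steps depends on the input."""
    state = 0.0
    for item in sequence:  # the loop length differs per example
        state = 0.5 * state + item
    return state


static_graph = [double, add_one]  # defined once ...
static_results = [run_static(static_graph, v) for v in (1, 2, 3)]  # ... run many times
print(static_results)                 # [3, 5, 7]
print(run_dynamic([1.0, 2.0]))        # 2.5
print(run_dynamic([1.0, 2.0, 3.0]))   # 4.25
```

In the static case every input flows through the same pre-built sequence; in the dynamic case the structure of the computation is decided while it runs.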
To contrast with the PyTorch autograd example above, here we use TensorFlow to fit a simple two-layer network:

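# -*- coding: utf-8 -*-
# Note: this example uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session);
# in TensorFlow 2.x these functions are available under tf.compat.v1.
import tensorflow as tf
import numpy as np

# First we set up the computational graph (the default graph):

# N is the batch size; D_in is the input dimension;
# H is the hidden dimension; D_out is the output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input data x and the target values y;
# they are filled with real data when the graph is executed.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Create Variables for the weights w1 and w2 and initialize them with random values.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass: compute the predicted y.
# Note that this code performs no numeric computation; it only builds the graph.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute the loss
loss = tf.reduce_sum((y - y_pred) ** 2.0)

# Compute the gradients of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Update the weights using gradient descent.
# new_w1 and new_w2 are assign ops; the weights are actually updated only
# when the graph is executed.
# In TensorFlow the weight update is part of the computational graph,
# whereas in PyTorch it happens outside the graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# The graph is built; now create a TensorFlow session to execute it.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())

    # Create NumPy arrays holding the actual input data x_value and target data y_value
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        # Execute the graph repeatedly, each time binding x_value to x and y_value to y.
        # Each run evaluates the loss and the update ops new_w1 and new_w2;
        # the values are returned as NumPy arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)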

20.3.7 Higher-level wrappers (the nn module)

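Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks raw autograd is rather low-level and requires writing a lot of code.
When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters that are optimized during learning.
In TensorFlow, packages such as Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.
PyTorch has a higher-level wrapper as well: the nn package serves the same purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of loss functions that are commonly used when training neural networks.
For a further introduction to torch.nn, see:
http://blog.leanote.com/post/1556905690@qq.com/torch.nn
In this example we use the nn package to implement our two-layer network: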

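# -*- coding: utf-8 -*-
import torch

# N is the batch size; D_in is the input dimension;
# H is the hidden dimension; D_out is the output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create Tensors for the input and target data
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define the model as a sequence of layers.
# nn.Sequential is a Module that contains other Modules (such as Linear and ReLU)
# and applies them in order to produce its output; gradients come from autograd.
# Because the Modules run in sequence, there is no need to write a forward method,
# which is one difference from subclassing nn.Module directly.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# Define the loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute the predicted y.
    y_pred = model(x)

    # Compute and print the loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute the gradient of the loss with respect to the model parameters.
    # The model's parameters are stored in Tensors with requires_grad=True.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor,
    # so we can access its gradient.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad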

20.3.8 Optimizers (optim)

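Up to this point we have updated the weights of our models by manually mutating the Tensors that hold the learnable parameters (using torch.no_grad() or .data to avoid tracking history in autograd). This is not a huge burden for a simple optimization algorithm such as stochastic gradient descent, but in practice we often train neural networks with more sophisticated optimizers such as AdaGrad, RMSProp, and Adam.
The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used ones.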
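
To give a feel for what such an optimizer does beyond plain gradient descent, the following is a small pure-Python sketch of the standard Adam update rule applied to a single scalar. The function name adam_minimize and the toy objective (x - 3)^2 are assumptions made for this illustration; the sketch omits everything a real optimizer adds, such as parameter groups and weight decay.

```python
import math


def adam_minimize(grad_fn, x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a scalar function with the Adam update rule, given its gradient."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # running mean of the gradient
        v = beta2 * v + (1 - beta2) * g * g    # running mean of the squared gradient
        m_hat = m / (1 - beta1 ** t)           # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x=0.0)
print(x_min)
```

The per-parameter state (m and v here) is exactly the kind of bookkeeping that is tedious to manage by hand and that an optimizer object keeps for you.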
In this example we define our model with the nn package as before, but we optimize it with the Adam algorithm provided by the optim package:

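# -*- coding: utf-8 -*-
import torch

# N is the batch size; D_in is the input dimension;
# H is the hidden dimension; D_out is the output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input data x and target data y
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define the model and the loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an optimizer that updates the model's parameters.
# Here we choose Adam; the optim package contains many other optimization algorithms.
# The first argument to Adam tells the optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute the predicted y.
    y_pred = model(x)

    # Compute and print the loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer to zero the gradients of all
    # the parameters it updates. This is needed because, by default, gradients
    # are accumulated in buffers rather than overwritten when backward() is called.
    optimizer.zero_grad()

    # Backward pass
    loss.backward()

    # Update the parameters
    optimizer.step()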

20.3.9 Custom network layers

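Sometimes you need to specify models that are more complex than a sequence of existing Modules. In these cases you can define your own Module by subclassing nn.Module and defining a forward method that receives input Tensors and produces output Tensors using other Modules or other autograd operations on Tensors.
In this example we implement our two-layer network as a custom Module subclass: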

# -*- coding: utf-8 -*-
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        Layers that hold learnable parameters are usually created in __init__; operations without learnable parameters are applied in forward.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

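# N is the batch size; D_in is the input dimension;
# H is the hidden dimension; D_out is the output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create the input data x and the target data y
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct the model by instantiating the class
model = TwoLayerNet(D_in, H, D_out)

# Construct the loss function and the optimizer; model.parameters() returns the learnable parameters
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: pass x to the model to compute the predicted y
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero the gradients, run the backward pass, and update the parameters.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()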

20.3.10 Control flow and weight sharing

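As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully connected ReLU network that, on each forward pass, applies the same middle layer a random number of times (0 to 3 in the code, giving 1 to 4 hidden layers in total), reusing the same weights for each of these innermost hidden layers.
For this model we can use ordinary Python flow control to implement the loop, and we can implement weight sharing among the innermost layers simply by reusing the same Module multiple times when defining the forward pass.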
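
A consequence of reusing the same weights is that their gradient is the sum of one contribution per use. The sketch below checks this on a deliberately simplified scalar model with no bias and no ReLU (h = w * w * ... * x); the function names are hypothetical and the finite-difference comparison is only a sanity check.

```python
def forward(w, x, repeats):
    """Apply the same scalar weight `repeats` times: h = w * (w * (... * x))."""
    h = x
    for _ in range(repeats):
        h = w * h
    return h


def grad_w(w, x, repeats):
    """Analytic dh/dw: each of the `repeats` uses of w contributes w**(repeats-1) * x."""
    return sum(w ** (repeats - 1) * x for _ in range(repeats))


w, x, eps = 1.5, 2.0, 1e-6
for repeats in (1, 2, 3):
    numeric = (forward(w + eps, x, repeats) - forward(w - eps, x, repeats)) / (2 * eps)
    print(repeats, grad_w(w, x, repeats), round(numeric, 4))
```

With autograd this summation happens automatically, because every use of the shared parameter accumulates into the same .grad.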
We can easily implement this model as a Module subclass:

# -*- coding: utf-8 -*-
import random
import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        Construct the three nn.Linear layers used in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
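        In PyTorch we can use an ordinary for loop to choose the number of middle layers at random,
        so each call to forward may use a different number of them. These middle layers all come
        from the same Module instance and therefore share their weights.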
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is the batch size; D_in is the input dimension;
# H is the hidden dimension; D_out is the output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

20.3.11 Hands-on: linear regression with Tensors

This section shows how to implement linear regression with autograd and Tensors, to get a feel for how convenient autograd is.
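
For comparison, here is a rough pure-Python sketch of the same kind of regression with the gradients derived by hand, which is the work autograd takes over. It assumes the loss 0.5 * sum((w*x + b - y)^2) and uses evenly spaced, noise-free data invented for this illustration, so it is not the dataset generated in the steps of this section.

```python
xs = [float(i) for i in range(0, 20, 2)]   # 0, 2, ..., 18
ys = [2.0 * x + 3.0 for x in xs]           # noise-free targets from y = 2x + 3

w, b, lr = 0.0, 0.0, 0.001
for _ in range(5000):
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = sum(e * x for e, x in zip(errors, xs))  # d(loss)/dw, derived by hand
    grad_b = sum(errors)                             # d(loss)/db, derived by hand
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 3), round(b, 3))
```

The intercept converges much more slowly than the slope at this learning rate, which is why the sketch runs for several thousand steps.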
(1) Import the required libraries

import torch as t
%matplotlib inline
from matplotlib import pyplot as plt
from IPython import display

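(2) Generate the input data x and the target data y
Set the random seed so that the output below is the same when the code is run on different machines.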

t.manual_seed(100) 
dtype = t.float
def get_fake_data(batch_size=10):
    ''' Generate random data: y = x*2 + 3, plus some noise'''
    x = t.rand(batch_size,1,dtype=dtype) * 20
    y = x * 2 + 3 + t.randn(batch_size, 1)
    return x, y

(3) Inspect the distribution of x and y

x, y = get_fake_data()
# squeeze() removes dimensions of size 1; the result is then converted to a NumPy array
plt.scatter(x.squeeze().numpy(), y.squeeze().numpy())

(4) Initialize the weight parameters

# Initialize the parameters: w randomly, b with zeros
w = t.randn(1,1, dtype=dtype,requires_grad=True)
b = t.zeros(1,1, dtype=dtype, requires_grad=True)

lr =0.001 # learning rate

(5) Train the model

for ii in range(800):
    # x, y = get_fake_data()  # uncomment to draw a fresh batch at every step

    # Forward pass: compute the loss
    y_pred = x.mm(w) + b.expand_as(y)
    loss = 0.5 * (y_pred - y) ** 2
    loss = loss.sum()

    # Backward pass: autograd computes the gradients
    loss.backward()

    # Update the parameters by hand inside torch.no_grad()
    with t.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad

        # Zero the gradients
        w.grad.zero_()
        b.grad.zero_()

    if ii % 100 == 0:
        # Plot the current fit
        display.clear_output(wait=True)

        plt.plot(x.detach().numpy(), y_pred.detach().numpy())  # predicted
        plt.scatter(x.numpy(), y.numpy())  # true data

        plt.xlim(0, 20)
        plt.ylim(0, 41)
        plt.show()
        plt.pause(0.5)

print(w, b)

Output:
tensor([[1.9769]], requires_grad=True) tensor([[3.3058]], requires_grad=True)

20.3.12 Hands-on: training on CIFAR10 with nn

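This section uses nn to build a convolutional neural network, relying on nn.Module and nn.functional. Because nn is a fairly high-level PyTorch wrapper, the whole program is concise and leaves few details to worry about.
(1) About the CIFAR10 dataset
The CIFAR-10 dataset consists of 32x32 color images in 10 classes, 60000 images in total with 6000 images per class. 50000 images form the training set and 10000 images form the test set.
The dataset is divided into 5 training batches and 1 test batch, each containing 10000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining 50000 images in random order, so an individual training batch may contain more images from one class than from another; taken together, the training batches contain exactly 5000 images from each class. The figure below shows the classes in the dataset, along with 10 randomly selected images from each class.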

Download: https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
(2) Use the LeNet convolutional neural network
The structure of the LeNet network is shown in the figure below:

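Notes on the network structure:
① input: the original LeNet takes a 32x32 grayscale image as input. CIFAR-10 images are 32x32 with 3 color channels, so the first convolution in the code of this section takes 3 input channels.
② conv1: the first layer is a convolutional layer with a 5x5 kernel, stride 1, and no padding, so the input image yields 6 feature maps of size 28x28. The size 28 follows from (n+2p-f)/s+1 with n (input size) = 32, p (padding) = 0, f (kernel size) = 5, s (stride) = 1, which gives 32-5+1=28.
③ maxpooling2: next comes a downsampling layer using max pooling with stride 2 and a 2x2 kernel; after subsampling, the output is 6 feature maps of size 14x14.
④ conv3: the third layer is another convolutional layer with the same kernel size and stride as the first, but it outputs 16 feature maps. Their size is 10, again from (n+2p-f)/s+1 with n = 14, p = 0, f = 5, s = 1, which gives 14-5+1=10.
⑤ maxpooling4: the fourth layer is another max pooling layer, which reduces the 16 feature maps to 5x5.
⑥ fc5: from the fifth layer on, the layers are fully connected. The feature maps from the fourth layer are flattened (16x5x5 = 400 values) and multiplied by a weight matrix; the output has 120 nodes.
⑦ fc6: the output has 84 nodes.
⑧ output: in the original LeNet the last step is Gaussian connections, which use RBF (radial basis function) units to compute the Euclidean distance between the input vector and a parameter vector. Today Softmax is generally used instead.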
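
The size arithmetic in the notes above can be checked with a few lines of Python. The helper conv_output_size is a hypothetical name introduced for this sketch; it simply evaluates (n+2p-f)/s+1 with integer division, which applies to both the convolution and the pooling layers described here.

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial size after a convolution or pooling layer: (n + 2p - f) // s + 1."""
    return (n + 2 * p - f) // s + 1


size = 32
size = conv_output_size(size, f=5)        # conv1: 32 -> 28
size = conv_output_size(size, f=2, s=2)   # pool:  28 -> 14
size = conv_output_size(size, f=5)        # conv3: 14 -> 10
size = conv_output_size(size, f=2, s=2)   # pool:  10 -> 5
flattened = 16 * size * size              # 16 feature maps of 5x5
print(size, flattened)                    # 5 400
```

The value 400 is the input width of the first fully connected layer, written as 16 * 5 * 5 in the network definition of this section.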
(3) Import the required modules

import torch
import torchvision
import torchvision.transforms as transforms

(4) Load the data
torchvision makes it easy to load the data and normalize it at the same time.
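
As a quick check of what a normalization with mean 0.5 and standard deviation 0.5 does to pixel values in [0, 1], here is a plain-Python sketch of the per-channel arithmetic. The functions normalize and unnormalize are hypothetical helpers written for this illustration; they mirror the formula (value - mean) / std and its inverse, not the torchvision implementation itself.

```python
def normalize(value, mean=0.5, std=0.5):
    """Map a pixel value in [0, 1] to [-1, 1]."""
    return (value - mean) / std


def unnormalize(value, mean=0.5, std=0.5):
    """Undo the normalization, as needed before displaying an image."""
    return value * std + mean


pixels = [0.0, 0.25, 0.5, 1.0]
scaled = [normalize(p) for p in pixels]
restored = [unnormalize(s) for s in scaled]
print(scaled)    # [-1.0, -0.5, 0.0, 1.0]
print(restored)  # [0.0, 0.25, 0.5, 1.0]
```

The inverse step is the same computation as img / 2 + 0.5, which is what the image display helper in this section applies before plotting.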

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

Output:
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data\cifar-10-python.tar.gz
Files already downloaded and verified
(5) Visualize some of the images

import matplotlib.pyplot as plt
import numpy as np

def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))


# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(4)))

Output:
cat car car horse

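(6) Build the network
Layers that hold learnable parameters are created in the constructor __init__; operations without learnable parameters are typically applied in the forward function.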

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()

(7) Define the loss function and the optimizer

import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

(8) Train the model

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

Output:
[1, 2000] loss: 2.192
[1, 4000] loss: 1.840
[1, 6000] loss: 1.672
[1, 8000] loss: 1.565
[1, 10000] loss: 1.511
[1, 12000] loss: 1.473
[2, 2000] loss: 1.387
[2, 4000] loss: 1.345
[2, 6000] loss: 1.330
[2, 8000] loss: 1.299
[2, 10000] loss: 1.304
[2, 12000] loss: 1.266
Finished Training
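(9) Test the model
We have looped over the training dataset twice. Let us first check whether the network has learned anything.
We check this by predicting the class label that the neural network outputs and comparing it against the ground truth. If the prediction is correct, we add the sample to the list of correct predictions.
First we display a few images from the test set.

dataiter = iter(testloader)
images, labels = next(dataiter)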

# print images
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in range(4)))

Output:
GroundTruth: cat ship ship plane

(10) Look at the predictions

outputs = net(images)
_, predicted = torch.max(outputs, 1)

print('Predicted: ', ' '.join('%5s' % classes[predicted[j]]
                              for j in range(4)))

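Output:
Predicted: cat ship ship ship
Judging from this result, although we only looped over the dataset twice, the network already recognizes 3 of these 4 images correctly.
Next, let us look at how it performs on the full test set.
(11) Performance of the network on the whole test set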

correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))

Output:
Accuracy of the network on the 10000 test images: 55 %

(12) Per-class performance

class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        c = (predicted == labels).squeeze()
        for i in range(4):
            label = labels[i]
            class_correct[label] += c[i].item()
            class_total[label] += 1

for i in range(10):
    print('Accuracy of %5s : %2d %%' % (
        classes[i], 100 * class_correct[i] / class_total[i]))

Output:
Accuracy of plane : 52 %
Accuracy of car : 63 %
Accuracy of bird : 35 %
Accuracy of cat : 26 %
Accuracy of deer : 30 %
Accuracy of dog : 50 %
Accuracy of frog : 75 %
Accuracy of horse : 70 %
Accuracy of ship : 77 %
Accuracy of truck : 70 %
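
The loop in step (12) indexes exactly four samples per batch, which matches a batch size of 4. As a minimal sketch of the same bookkeeping for label sequences of any length, assuming plain Python lists rather than tensors (the function name per_class_accuracy and the toy labels below are illustrative and not part of the original code):

```python
from collections import defaultdict


def per_class_accuracy(true_labels, predicted_labels, class_names):
    """Return {class name: fraction of that class predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        total[t] += 1          # one more sample of class t seen
        if t == p:
            correct[t] += 1    # and it was classified correctly
    return {class_names[c]: correct[c] / total[c] for c in sorted(total)}


# Toy data: class indices refer to positions in `names`.
names = ["cat", "ship", "plane"]
y_true = [0, 1, 1, 2, 0, 1]
y_pred = [0, 1, 1, 1, 2, 1]
acc = per_class_accuracy(y_true, y_pred, names)
for name, value in acc.items():
    print("Accuracy of %5s : %2d %%" % (name, 100 * value))
```

Because the counts are keyed by the true label, the function does not depend on the batch size; with a DataLoader, the labels and predictions of all batches could be collected into two lists first.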

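(13) Running on a GPU
To run the network above on a GPU, proceed as follows. Note that the inputs and labels of every batch must be moved to the device at each step of the training loop, not only the network.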

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Assume that we are on a CUDA machine, then this should print a CUDA device:
print(device)

net.to(device)
inputs, labels = inputs.to(device), labels.to(device)

For more PyTorch examples, see:
https://pytorch.org/tutorials/beginner/pytorch_with_examples.html

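19.1 Jupyter Notebook Overview

As the saying goes, "sharpening the axe does not delay the cutting of firewood," but sometimes the choice of tool matters more than the sharpening. A tree can be felled with an axe, a machete, or a chainsaw; choose the chainsaw and there is nothing to sharpen, and the work goes several times faster than with the other tools.
Jupyter Notebook is just such a handy, efficient, and safe "chainsaw". It has become the main way of submitting code in Kaggle competitions, is widely used in data science and machine learning, and is the preferred way to run deep-learning experiments (according to François Chollet, the creator of Keras and a Kaggle competition coach).
Jupyter Notebook (formerly known as IPython notebook) opens as a web page in which code can be written and run directly, with the results displayed right below each code cell. Documentation can be written on the same page, which makes it easy to add timely notes and explanations. Jupyter Notebook has the following features:
(1) Syntax highlighting, indentation, and tab completion while editing.
(2) Code runs directly from the browser, and the results appear below the code cell.
(3) Results are displayed in rich media formats, including HTML, LaTeX, PNG, and SVG.
(4) Markdown syntax is supported for writing documentation and notes alongside the code.
(5) LaTeX is supported for writing mathematical notation.
(6) Notebooks can be exported as Python scripts, can run Python code and scripts, and can be saved in formats such as PDF and XML.
(7) Long code can be split into short cells that run independently, and short cells can be merged into a longer one.
(8) Code runs interactively: if a later cell fails, the earlier cells do not need to be run again.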

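19.2 Installing and Configuring Python

(1) Go to the Anaconda website (https://www.anaconda.com/download/) and download the package that matches your operating system and word size. The Python 3 series is recommended, as shown in the figure below:

(2) Save the downloaded file (on Linux a .sh script of roughly 640 MB; on Windows an .exe file) on the target machine and run it.
Installing on Linux:
① First download the installer Anaconda2-4.0.0-Linux-x86_64.sh.
② Run the following on the Linux command line: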

bash Anaconda2-4.0.0-Linux-x86_64.sh

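③ Accept the defaults at each step. At the end the installer asks whether to add the installation path to the .bashrc configuration file; answer yes.
④ After installation, to install or update another library, run conda install <library name> or conda update <library name> on the command line. For example, to install tensorflow, run conda install tensorflow.
pip can also be used.
⑤ Use conda list to see the installed packages and libraries.
Installing on Windows:
① Download Anaconda3-4.4.0-Windows-x86_64.exe or a later version from the Anaconda website, taking care to choose between the 64-bit and 32-bit builds (machines running Windows 8 or later are generally 64-bit).
② Double-click the file to start the installer; clicking Next through the steps is generally enough.
③ After installation, to install or update another library, run conda install <library name> or conda update <library name> in cmd.
For example, to install tensorflow, run conda install tensorflow. pip can also be used.
④ In cmd, use conda list to see the installed packages and libraries.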

Common conda commands:

# List installed modules or libraries and their versions
conda list
# Install a package (to pin a version: conda install package_name=version)
conda install package_name
# Update a package
conda update package_name
# Remove a package
conda remove package_name
# Search for the available versions of a package
conda search package_name
# Update conda itself to the latest version
conda update conda
# Show the configuration
conda config --show
# Show only the channels
conda config --get channels
# Add the conda-forge channel (example only; specifying the channel with -c at install time is recommended)
conda config --add channels conda-forge
# Install tensorflow 1.10.0 from the conda-forge channel
conda install -c conda-forge tensorflow=1.10.0

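19.3 Configuring Jupyter

The main steps for configuring Jupyter Notebook follow. They apply to Linux. On Windows no configuration is needed: after Jupyter Notebook starts, a web page opens automatically (at localhost:8888); click the New drop-down menu and choose Python 3 to start writing and running code.
1) Generate the configuration file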

jupyter notebook --generate-config

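This creates the file .jupyter/jupyter_notebook_config.py in the current user's home directory.
2) Generate a login password for the current Jupyter user
Open ipython and create a hashed password: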

In [1]: from notebook.auth import passwd
In [2]: passwd()
Enter password: 
Verify password:

3) Edit the configuration file

vim ~/.jupyter/jupyter_notebook_config.py

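Make the following changes:

c.NotebookApp.ip='*' # allow access from any IP address
c.NotebookApp.password = u'sha:ce...the hash you just copied'
c.NotebookApp.open_browser = False # do not open a browser automatically
c.NotebookApp.port =8888 # the default port; another port can be specified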

4) Start Jupyter Notebook

# Start Jupyter in the background without logging
nohup jupyter notebook >/dev/null 2>&1 &

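Then enter IP:port in a browser to see an interface like the following.

Then click the New drop-down menu and choose Python 3. A Python editing page opens, where Python, Keras, TensorFlow, or PyTorch programs can be developed and debugged in the browser.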

19.4 Jupyter Usage Examples

The examples below use Linux; Windows is essentially the same.
(1) Running the code in a cell
Press Shift and Enter together. A code example follows.

a="python"
b=10.2829
print("Language: {0:s}, b to two decimal places: {1:.2f}".format(a,b))

Output:
Language: python, b to two decimal places: 10.28
(2) Running simple shell commands

# Show the current directory
!pwd
# Show the first 10 lines of a file in the current directory
! head -10 test1028.py

Output:

# Define a function
'''This is a test script'''
def fun01():
    name="Welcome to Beijing!"
    print(name)

# Run the function
if __name__=='__main__':
    fun01()

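(3) Import the script (or module) and view its description

(4) Run a Python script

[Note]
To make the script more portable, add #!/usr/bin/python as its first line.
When a .py file is imported as a module, Python automatically creates a corresponding .pyc file (see the figure below). The .pyc file holds the compiled bytecode, a Python-specific representation of the source that the interpreter can execute efficiently. It is not human-readable, so it can safely be ignored.
(5) Add comments or documentation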

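(6) Add a table of contents, as in a Word document
To make a notebook easier to navigate, divide its content into sections and build a table of contents for them; clicking an entry jumps to the corresponding section (see the figure below). This requires the extension package jupyter_contrib_nbextensions, which can be installed with conda or pip: conda install -c conda-forge jupyter_contrib_nbextensions
or pip install jupyter_contrib_nbextensions
For installation and configuration details, see the following blog post:
https://www.jianshu.com/p/f314e9868cae
The figure below shows the result: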

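(7) Rename a file
Jupyter saves files automatically and generates their names automatically. An automatically generated name can be changed as follows.
Click the current file name:

Then type the new name and click Rename, as in the figure below:

(8) Plotting is convenient too
The following code plots a parabola. To display figures inside Jupyter, add the line %matplotlib inline. The code is as follows: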

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

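# Generate the non-negative and the negative x values
x=np.arange(0,1.1,0.1)
x1=np.arange(-1,0,0.1)
# Merge the two arrays
x=np.append(x1,x)
# Compute y
y=x**2+1
# Create a figure 6 inches wide and 4 inches high
plt.figure(figsize=(6,4))   
# Text between $ signs in the legend label is rendered as math
plt.plot(x,y,label="$y=x^2+1$",color="blue",linewidth=2)
plt.xlabel("X")       # x-axis label
plt.ylabel("Y")       # y-axis label
plt.legend(loc='upper center')          # show the legend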
plt.show()

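(9) Viewing help for functions and modules is also convenient in Jupyter.
To list members and view help:
Type a name followed by a dot and press Tab to list all of its functions.
Type a name followed by ? to view its help.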

import numpy as np
a1=np.array([1,2,6,4,1])

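List the functions available on the array a1:

To see how the argmax function is used, append a question mark (?) to the function name and run the cell; a detailed help pane pops up, as in the figure below:

(10) Writing formulas is convenient too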
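
The screenshot of the original cell is not available here. As a purely hypothetical illustration (the formulas below are chosen for this sketch and are not the ones in the original figure), a cell of type Markdown containing the following LaTeX is rendered as typeset mathematics:

```latex
$$ y = x^2 + 1, \qquad \sigma(x) = \frac{1}{1 + e^{-x}} $$
```

Inline formulas use single dollar signs, for example $y = x^2 + 1$, inside a sentence of the Markdown cell.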

Then run the cell to see a result like the following:

For more on using LaTeX, see:
https://www.jianshu.com/p/93ccc63e5a1b

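18.1 Overview of Ensemble Learning

The idea behind ensemble learning resembles the ancient parable of the blind men and the elephant. A group of blind men meet an elephant for the first time and try to understand it by touch. Each touches a different part of its body, such as the trunk or a leg, and describes the elephant accordingly: "it is like a snake", "it is like a pillar or a tree", and so on. The blind men are like machine-learning models: each starts from its own assumptions and understands the many facets of the training data from its own angle. Each holds part of the truth, but not all of it. Pool their views and you get a more accurate description of the data. The elephant is a combination of many parts; no single description is fully accurate, but together they form a fairly accurate picture.
Ensemble learning is currently a very popular machine-learning approach. Ensemble methods have achieved high rankings in many well-known machine-learning competitions, such as Netflix, KDD 2009, and Kaggle competitions.
Ensemble learning is not a single machine-learning algorithm; it completes a learning task by building and combining multiple learners, "drawing on the strengths of many". It can be applied to classification, regression, feature selection, anomaly detection, and more, so it appears in nearly every area of machine learning.
The main idea: for a fairly complex task, a decision that combines many opinions is often better than one made by a single voice. The process is as follows:

Ensemble learning is generally divided into three categories:
① bagging, used to reduce variance (variance describes how widely the predictions, viewed as a random variable, are spread);
② boosting, used to reduce bias (bias describes the gap between the predictions and the true values, so reducing it improves the fit);
③ stacking, used to improve predictive performance.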

18.1.1 Bagging

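Bagging stands for bootstrap aggregating. One way to reduce the variance of an estimate is to average several estimates.
Bagging uses bootstrap sampling to obtain the data subsets on which the base learners are trained. The base learners are usually combined by voting for classification and by averaging for regression.
Given a training set D of size n, bagging draws, uniformly and with replacement, m subsets D_i of size n' to use as new training sets. Running a classification or regression algorithm on these m training sets yields m models, whose outputs are combined by averaging, majority vote, or a similar rule to give the bagging result. See the figure below:

Note that with bagging each training run may use all the features or a random subset of them; random forests, for example, select a random subset of features each time.
Common bagging-based models include random forests and random trees.
In a random forest, every tree is trained on a bootstrap sample, and the features it considers are chosen at random as well.
As a result, the bias of a random forest increases only slightly, while averaging weakly correlated trees lowers the variance, giving a model with both low variance and low bias.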
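
The bagging procedure can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the base learner is a one-dimensional threshold rule ("predict 1 if x >= t"), the toy data set and the helper names bootstrap_sample, fit_stump, bagging_fit, and bagging_predict are all made up for this sketch, and each bootstrap sample has the same size as the training set (n' = n).

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample):
    """Pick the threshold t (among the sampled x values) with the fewest errors."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in sample}):
        err = sum((x >= t) != bool(y) for x, y in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagging_fit(data, m, seed=0):
    """Train m stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(data, rng)) for _ in range(m)]


def bagging_predict(thresholds, x):
    """Combine the m stumps by majority vote."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Toy data: the label is 1 exactly when x >= 5.
data = [(x, int(x >= 5)) for x in range(10)]
model = bagging_fit(data, m=25)
print(bagging_predict(model, 2), bagging_predict(model, 7))
```

For a regression task the vote in bagging_predict would be replaced by an average of the m predictions.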

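18.1.2 Boosting

Boosting refers to a family of algorithms that turn weak learners into a strong learner. Its main principle is to train a sequence of weak learners, models only slightly better than random guessing (small decision trees, for example), on weighted data, giving larger weights to the samples misclassified in earlier rounds.
The trained weak learners are combined by weighted voting for classification and by a weighted sum for regression. Boosting differs from bagging in that the weak learners are trained sequentially on reweighted data.
Boosting algorithms share a similar mechanism:
(1) train a base learner on the initial training set;
(2) adjust the distribution of the training samples according to the base learner's performance, so that samples it got wrong receive more attention later;
(3) train the next base learner on the adjusted distribution;
(4) repeat until the number of base learners reaches a preset value T, then combine the T base learners with weights. The steps are shown in the figure below.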

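If the figure above is not intuitive enough, consider the following simple example:
1. Suppose we have the following samples:

Figure 1
2. First classification

Figure 2
Second classification

Figure 3
The points classified correctly in Figure 2 now have smaller weights (drawn smaller), and the misclassified points (+) have larger weights (drawn larger).
Third classification

Figure 4
The points classified correctly in Figure 3 now have smaller weights (drawn smaller), and the misclassified points (-) have larger weights (drawn larger).

Fourth step: combine the classifiers above

The algorithm described below is the most commonly used boosting algorithm, AdaBoost, short for adaptive boosting.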

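In every round AdaBoost checks whether the current base learner meets a basic condition (for example, whether it is better than random guessing); if not, that learner is discarded and learning stops.
The individual learners in AdaBoost depend strongly on one another and are generated sequentially. Each base learner is trained to minimize the loss function, so AdaBoost focuses on reducing bias.
As a member of the boosting family, AdaBoost uses an additive model: the outputs of the base learners are weighted and summed into a single prediction, so standard AdaBoost applies only to binary classification. Besides AdaBoost, other algorithms based on the boosting idea include GBDT and XGBoost.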
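
The reweighting loop described above can be sketched as follows. This is a minimal illustration of binary AdaBoost with labels in {+1, -1}, assuming one-dimensional decision stumps as weak learners; the toy data and the helper names best_stump, adaboost_fit, and adaboost_predict are chosen for this sketch and do not come from the original text.

```python
import math


def stump_predict(x, t, polarity):
    """Predict `polarity` when x >= t and `-polarity` otherwise."""
    return polarity if x >= t else -polarity


def best_stump(xs, ys, w):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for t in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def adaboost_fit(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n                      # start from uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, t, polarity = best_stump(xs, ys, w)
        if err >= 0.5:                     # no better than chance: discard and stop
            break
        err = max(err, 1e-10)              # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, polarity))
        # Raise the weights of misclassified samples, lower the others.
        w = [wi * math.exp(-alpha * y * stump_predict(x, t, polarity))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted sum of the stump outputs."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost_fit(xs, ys, rounds=3)
print([adaboost_predict(model, x) for x in xs])
```

No single stump can separate this toy data, but the weighted combination of three stumps classifies every training point correctly, which is the sense in which boosting reduces bias.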

18.1.3 Stacking

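Every trained base model makes predictions on the training set; the prediction (a probability or a label) of the j-th base model for the i-th training sample becomes the j-th feature of the i-th sample in a new training set, on which the final model is trained. Prediction works the same way: the test samples first pass through all base models to form a new test set, on which the final model then predicts. See the figure below.

The figure above can be simplified to:

In practice the Meta-Classifier is usually a single-layer logistic regression model.
The algorithm is as follows: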
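
The construction of the new training set can be sketched in plain Python. This is only an illustration under stated assumptions: the three "base models" are hand-written rules standing in for trained classifiers, and the name build_meta_features is made up for this sketch.

```python
def build_meta_features(base_models, X):
    """Row i, column j holds the prediction of base model j for sample i."""
    return [[model(x) for model in base_models] for x in X]


# Stand-ins for already-trained base models (hypothetical rules).
base_models = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.0),
]

X_train = [(0.2, 0.9), (0.8, 0.1), (0.7, 0.6)]
Z_train = build_meta_features(base_models, X_train)
print(Z_train)  # [[0, 1, 1], [1, 0, 0], [1, 1, 1]]
```

The meta-classifier is then trained on Z_train together with the original labels. In practice the base-model predictions used for Z_train are often produced out-of-fold (by cross-validation) so that the meta-classifier does not simply learn from predictions the base models made on their own training samples.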

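18.1.3.1 The Meta-Classifier Layer in Stacking

Why is a Meta-Classifier layer needed? What is its purpose, and how does it work? If this is not yet clear, the following story may help.
Suppose three students named LR, SVM, and KNN are arguing about a physics question, and each has a different opinion about the right answer:

Seeing no way to convince one another, they do the democratic thing and average their estimates, which in this case gives 14. They have used one of the simplest forms of ensembling, also known as model averaging.

Their teacher, Miss DL, a maths teacher, witnesses the argument and decides to help. She asks "What was the question?", but the students refuse to tell her (they know that giving away all the information is not to their advantage, and they think she might find it silly that they are arguing about such a trivial matter). They do tell her, however, that the argument is about physics.

In this scenario the teacher has no access to the original data, because she does not know what the question was. She does, however, know the students very well, with their strengths and weaknesses, and decides she can still help settle the matter. Using historical information about how well the students have done in the past, plus her knowledge that SVM loves physics and has always done well in the subject (and that SVM's father worked at a physics institute for young scientists), she thinks the most appropriate answer is more like 17.

In this scenario the teacher (DL) is the meta-learner. She uses the outputs of the other models (the students) as her input data and complements them with historical information about the students' past performance to arrive at a better estimate (and help resolve the dispute).

However, Mr. RF, the physics teacher, has a slightly different opinion. He has been there all along, waiting until this moment to act. Mr. RF has recently been giving LR private physics lessons to improve his grades (something Miss DL does not know), and he thinks LR should contribute more to the final estimate. He therefore claims that the right answer is more like 16.

In this scenario Mr. RF is also a meta-learner, one that processes the historical data with a different logic: he has access to more sources (or different historical information) than Miss DL.

The dispute can only be settled once the headmaster, GBM, makes a decision. GBM does not know what the children said, but he knows his teachers very well, and he is more inclined to trust his physics teacher (RF). He concludes that the answer is more like 16.2.

In this scenario the headmaster is the second-level meta-learner, the meta-learner of the meta-learners; by drawing on historical information about his teachers he can still provide a better estimate than a simple average of their results.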
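
The difference between plain averaging and a meta-learner can be made concrete with a small sketch. All numbers below are hypothetical: the story only states that the simple average is 14 and that Miss DL arrives at about 17, so the individual answers and the weights are assumptions chosen to reproduce those two values, and a real meta-learner would learn its weights from historical data rather than have them written down.

```python
def combine(answers, weights=None):
    """Weighted sum of the students' answers; uniform weights if none are given."""
    names = sorted(answers)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    return sum(weights[name] * answers[name] for name in names)


# Hypothetical student answers whose plain average is 14.
answers = {"LR": 10.0, "SVM": 18.0, "KNN": 14.0}

# Model averaging: every student counts equally.
simple_average = combine(answers)

# A meta-learner that trusts SVM most (hypothetical weights summing to 1).
teacher_weights = {"LR": 0.0625, "SVM": 0.8125, "KNN": 0.125}
teacher_estimate = combine(answers, teacher_weights)

print("%.1f %.1f" % (simple_average, teacher_estimate))
```

The teacher never sees the question itself, only the students' answers, which mirrors how the meta-classifier in stacking sees only the base models' outputs.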
Reference:
http://blog.kaggle.com/2017/06/15/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova/

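18.1.3.2 Several Ways to Use Stacking

1) Using the outputs of the base classifiers as input to the meta-classifier
In the basic usage, the outputs produced by the first-level classifiers (by default in mlxtend, their predicted class labels) serve as the input data for the final meta-classifier. The following example shows this basic use of stacking.
(1) Load the data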

from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

(2) Import the required libraries

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
import numpy as np

(3) Train the base models and the stacked model

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 
                          meta_classifier=lr)

print('3-fold cross validation:\n')

for clf, label in zip([clf1, clf2, clf3, sclf], 
                      ['KNN', 
                       'Random Forest', 
                       'Naive Bayes',
                       'StackingClassifier']):

    scores = model_selection.cross_val_score(clf, X, y, 
                                              cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" 
          % (scores.mean(), scores.std(), label))

Output:
3-fold cross validation:

Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.91 (+/- 0.06) [Random Forest]
Accuracy: 0.92 (+/- 0.03) [Naive Bayes]
Accuracy: 0.95 (+/- 0.03) [StackingClassifier]
(4) Visualize the results

%matplotlib inline
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
import matplotlib.gridspec as gridspec
import itertools

gs = gridspec.GridSpec(2, 2)

fig = plt.figure(figsize=(10,8))

for clf, lab, grd in zip([clf1, clf2, clf3, sclf], 
                         ['KNN', 
                          'Random Forest', 
                          'Naive Bayes',
                          'StackingClassifier'],
                          itertools.product([0, 1], repeat=2)):

    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf)
    plt.title(lab)

Output:

Selecting hyperparameters with grid search

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from mlxtend.classifier import StackingClassifier

# Initializing models

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 
                          meta_classifier=lr)

params = {'kneighborsclassifier__n_neighbors': [1, 5],
          'randomforestclassifier__n_estimators': [10, 50],
          'meta-logisticregression__C': [0.1, 10.0]}

grid = GridSearchCV(estimator=sclf, 
                    param_grid=params, 
                    cv=5,
                    refit=True)
grid.fit(X, y)

cv_keys = ('mean_test_score', 'std_test_score', 'params')

for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))

print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)

Output:
0.667 +/- 0.00 {'kneighborsclassifier__n_neighbors': 1, 'meta-logisticregression__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.667 +/- 0.00 {'kneighborsclassifier__n_neighbors': 1, 'meta-logisticregression__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.927 +/- 0.02 {'kneighborsclassifier__n_neighbors': 1, 'meta-logisticregression__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.913 +/- 0.03 {'kneighborsclassifier__n_neighbors': 1, 'meta-logisticregression__C': 10.0, 'randomforestclassifier__n_estimators': 50}
0.667 +/- 0.00 {'kneighborsclassifier__n_neighbors': 5, 'meta-logisticregression__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.667 +/- 0.00 {'kneighborsclassifier__n_neighbors': 5, 'meta-logisticregression__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.933 +/- 0.02 {'kneighborsclassifier__n_neighbors': 5, 'meta-logisticregression__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.940 +/- 0.02 {'kneighborsclassifier__n_neighbors': 5, 'meta-logisticregression__C': 10.0, 'randomforestclassifier__n_estimators': 50}
Best parameters: {'kneighborsclassifier__n_neighbors': 5, 'meta-logisticregression__C': 10.0, 'randomforestclassifier__n_estimators': 50}
Accuracy: 0.94
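
The eight result rows above correspond to the Cartesian product of the three two-valued parameter lists. As a minimal sketch of how such a grid is enumerated (plain Python with itertools, not the scikit-learn implementation; iterating over the keys in sorted order is an assumption made here so that the order matches the rows printed above):

```python
import itertools

params = {'kneighborsclassifier__n_neighbors': [1, 5],
          'randomforestclassifier__n_estimators': [10, 50],
          'meta-logisticregression__C': [0.1, 10.0]}

# Enumerate every combination of one value per parameter.
keys = sorted(params)
combos = [dict(zip(keys, values))
          for values in itertools.product(*(params[k] for k in keys))]

print(len(combos))  # 2 * 2 * 2 = 8 combinations
for combo in combos:
    print(combo)
```

Grid search fits and cross-validates the estimator once per combination and per fold, so with cv=5 this grid costs 8 * 5 = 40 fits, plus one final refit on the full data when refit=True.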

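2) Using class probabilities as input to the meta-classifier
Alternatively, the class probabilities produced by the first-level classifiers can serve as the input to the meta-classifier, by setting use_probas=True on StackingClassifier. With average_probas=True the probabilities the base classifiers produce for each class are averaged; otherwise they are concatenated.
For example, suppose two base classifiers output the probabilities:
classifier 1: [0.2, 0.5, 0.3]
classifier 2: [0.3, 0.4, 0.4]
1) average_probas=True:
the resulting meta-feature is [0.25, 0.45, 0.35]
2) average_probas=False:
the resulting meta-feature is [0.2, 0.5, 0.3, 0.3, 0.4, 0.4]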
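
The two ways of combining the probability vectors can be reproduced with a few lines of plain Python. This is a sketch of the arithmetic only, not of mlxtend's internals; the function name combine_probas is made up for the illustration.

```python
def combine_probas(probas, average):
    """Average the per-class probabilities column-wise, or concatenate them."""
    if average:
        n = len(probas)
        return [sum(column) / n for column in zip(*probas)]
    return [p for row in probas for p in row]


probas = [[0.2, 0.5, 0.3],   # classifier 1
          [0.3, 0.4, 0.4]]   # classifier 2

averaged = combine_probas(probas, average=True)
stacked = combine_probas(probas, average=False)
print([round(p, 2) for p in averaged])
print(stacked)
```

Averaging keeps the meta-feature vector at one entry per class, whereas concatenation grows it to (number of classifiers) x (number of classes) entries and lets the meta-classifier weight each base classifier separately.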
A concrete example follows.

from sklearn import datasets
 
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
 
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
import numpy as np
 
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          use_probas=True,
                          average_probas=False,
                          meta_classifier=lr)
 
print('3-fold cross validation:\n')
 
for clf, label in zip([clf1, clf2, clf3, sclf], 
                      ['KNN', 
                       'Random Forest', 
                       'Naive Bayes',
                       'StackingClassifier']):
 
    scores = model_selection.cross_val_score(clf, X, y, 
                                              cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" 
          % (scores.mean(), scores.std(), label))

Output:
3-fold cross validation:

Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.91 (+/- 0.06) [Random Forest]
Accuracy: 0.92 (+/- 0.03) [Naive Bayes]
Accuracy: 0.94 (+/- 0.03) [StackingClassifier]

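Stacking reaches an accuracy of 0.94, somewhat higher than any single model (0.91 to 0.92), although the differences are within about one standard deviation.
3) Stacked classification and grid search
To set up a parameter grid for scikit-learn's grid search with stacked classification, simply use the estimator names as prefixes in the parameter grid; for the meta-classifier, add the 'meta-' prefix. A code example follows.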

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from mlxtend.classifier import StackingClassifier

# Initializing models

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], 
                          meta_classifier=lr)

params = {'kneighborsclassifier__n_neighbors': [1, 5],
          'randomforestclassifier__n_estimators': [10, 50],
          'meta-logisticregression__C': [0.1, 10.0]}

grid = GridSearchCV(estimator=sclf, 
                    param_grid=params, 
                    cv=5,
                    refit=True)
grid.fit(X, y)

cv_keys = ('mean_test_score', 'std_test_score', 'params')

for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))

print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)

Output:
0.667 +/- 0.00 {'kneighborsclassifier__n_neighbors': 1, 'meta-logisticregression__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.667 +/- 0.00 {'kneighborsclassifier__n_neighbors': 1, 'meta-logisticregression__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.967 +/- 0.01 {'kneighborsclassifier__n_neighbors': 1, 'meta-logisticregression__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.967 +/- 0.01 {'kneighborsclassifier__n_neighbors': 1, 'meta-logisticregression__C': 10.0, 'randomforestclassifier__n_estimators': 50}
0.667 +/- 0.00 {'kneighborsclassifier__n_neighbors': 5, 'meta-logisticregression__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.667 +/- 0.00 {'kneighborsclassifier__n_neighbors': 5, 'meta-logisticregression__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.967 +/- 0.01 {'kneighborsclassifier__n_neighbors': 5, 'meta-logisticregression__C': 10.0, 'randomforestclassifier__n_estimators': 10}
0.967 +/- 0.01 {'kneighborsclassifier__n_neighbors': 5, 'meta-logisticregression__C': 10.0, 'randomforestclassifier__n_estimators': 50}
Best parameters: {'kneighborsclassifier__n_neighbors': 1, 'meta-logisticregression__C': 10.0, 'randomforestclassifier__n_estimators': 10}
Accuracy: 0.97
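The last two lines give the best parameter combination and its accuracy, which is at least as high as that of every other combination.
To use the same algorithm more than once, add a numeric suffix to its name in the parameter grid, as shown below: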

from sklearn.model_selection import GridSearchCV

# Initializing models

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf1, clf2, clf3], 
                          meta_classifier=lr)

params = {'kneighborsclassifier-1__n_neighbors': [1, 5],
          'kneighborsclassifier-2__n_neighbors': [1, 5],
          'randomforestclassifier__n_estimators': [10, 50],
          'meta-logisticregression__C': [0.1, 10.0]}

grid = GridSearchCV(estimator=sclf, 
                    param_grid=params, 
                    cv=5,
                    refit=True)
grid.fit(X, y)

cv_keys = ('mean_test_score', 'std_test_score', 'params')

for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))

print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)

Output:
0.667 +/- 0.00 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 1, 'meta-logisticregression__C': 0.1, 'randomforestclassifier__n_estimators': 10}
0.667 +/- 0.00 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 1, 'meta-logisticregression__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.967 +/- 0.01 {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-....
0.667 +/- 0.00 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 1, 'meta-logisticregression__C': 0.1, 'randomforestclassifier__n_estimators': 50}
0.967 +/- 0.01 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 1, 'meta-logisticregression__C': 10.0, 'randomforestclassifier__n_estimators': 10}
.......................................
0.967 +/- 0.01 {'kneighborsclassifier-1__n_neighbors': 5, 'kneighborsclassifier-2__n_neighbors': 5, 'meta-logisticregression__C': 10.0, 'randomforestclassifier__n_estimators': 50}
Best parameters: {'kneighborsclassifier-1__n_neighbors': 1, 'kneighborsclassifier-2__n_neighbors': 1, 'meta-logisticregression__C': 10.0, 'randomforestclassifier__n_estimators': 10}
Accuracy: 0.97

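StackingClassifier also supports grid search over the classifiers argument itself. However, because of the current implementation of GridSearchCV in scikit-learn, it is not possible to search over different classifiers and classifier parameters at the same time. For example, the following parameter dictionary is valid.
4) Giving different base classifiers different features
This approach works on the feature dimension of the training set: instead of giving every base classifier all the features, each base classifier gets a different subset. For example, base classifier 1 trains on the first half of the features and base classifier 2 on the second half (this can be done with sklearn pipelines). The classifiers are finally combined with StackingClassifier. A code example follows.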

from sklearn.datasets import load_iris
from mlxtend.classifier import StackingClassifier
from mlxtend.feature_selection import ColumnSelector
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data
y = iris.target

pipe1 = make_pipeline(ColumnSelector(cols=(0, 2)),
                      LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),
                      LogisticRegression())

sclf = StackingClassifier(classifiers=[pipe1, pipe2], 
                          meta_classifier=LogisticRegression())

sclf.fit(X, y)

Reference:
https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/

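18.2 Voting Classifier (VotingClassifier)

A voting classifier combines several different machine-learning classifiers and predicts the class label by majority vote or by the average predicted probability (soft vote). Such a classifier is useful for a set of models that perform equally well, since it balances out their individual weaknesses. Voting comes in two forms: majority voting on class labels (Majority Class Labels, hard voting) and weighted average probabilities (soft voting).

18.2.1 Majority Voting (Majority Class Labels)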

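Under majority voting, the predicted label is the one predicted most often by the individual classifiers; in a tie, the label that comes first in ascending sort order is chosen. For example:
The predicted label is the most frequent one among the classifiers. Given the predictions
classifier 1 -> label 1
classifier 2 -> label 1
classifier 3 -> label 2
the voting classifier (voting='hard') predicts 'label 1'.
When every label receives the same number of votes, the ascending sort order of the labels decides:
classifier 1 -> label 2
classifier 2 -> label 1
The final prediction is 'label 1'.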
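
A minimal sketch of hard voting in plain Python follows. It is an illustration, not scikit-learn's implementation: the function name hard_vote is made up, and the tie-break rule (choose the smallest label) follows the behaviour the scikit-learn documentation describes for VotingClassifier, where ties are resolved by the ascending sort order of the class labels.

```python
from collections import defaultdict


def hard_vote(labels, weights=None):
    """Return the label with the largest (weighted) vote count."""
    if weights is None:
        weights = [1] * len(labels)
    tally = defaultdict(float)
    for label, weight in zip(labels, weights):
        tally[label] += weight
    best = max(tally.values())
    # Tie-break: among the top-scoring labels, take the smallest one.
    return min(label for label, total in tally.items() if total == best)


print(hard_vote([1, 1, 2]))                     # majority: label 1
print(hard_vote([2, 1]))                        # tie: smallest label wins, label 1
print(hard_vote([1, 2, 2], weights=[3, 1, 1]))  # weighted: 3 votes for 1 vs 2 for 2
```

With weights, each classifier's vote counts as many times as its weight, which is how a weights argument such as [2, 1, 2] changes the outcome of a hard vote.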

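18.2.1.1 Overview of the Iris Dataset

First, we obtain the data. The following link describes the dataset in detail and offers it for download: https://archive.ics.uci.edu/ml/datasets/Iris
According to the description, Iris has 4 features and 3 classes. To make the data easy to visualize, we keep only 2 features (sepal length and petal length). The visualization code is as follows: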

%matplotlib inline
import pandas as pd
import matplotlib.pylab as plt
import numpy as np

# Load the Iris dataset as a DataFrame
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
X = df.iloc[:, [0, 2]].values # take 2 features (sepal length, petal length) as a NumPy array

plt.scatter(X[:50, 0], X[:50, 1],color='red', marker='o', label='setosa') # first 50 samples
plt.scatter(X[50:100, 0], X[50:100, 1],color='blue', marker='x', label='versicolor') # middle 50 samples
plt.scatter(X[100:, 0], X[100:, 1],color='green', marker='+', label='Virginica') # last 50 samples
plt.xlabel('sepal length')
plt.ylabel('petal length')
plt.legend(loc=2) # place the legend in the upper left; see the official documentation
plt.show()

Example code:

from sklearn import datasets
# sklearn.cross_validation was removed in scikit-learn 0.20; cross_val_score now lives in model_selection
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard', weights=[2,1,2])

for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Output:
Accuracy: 0.90 (+/- 0.05) [Logistic Regression]
Accuracy: 0.93 (+/- 0.05) [Random Forest]
Accuracy: 0.91 (+/- 0.04) [naive Bayes]
Accuracy: 0.95 (+/- 0.05) [Ensemble]

18.2.2 Weighted average probabilities (soft voting)

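In contrast to majority (hard) voting, soft voting returns the class label whose summed predicted probability is the largest. The weights parameter assigns a weight to each classifier; when weights are given, each classifier's predicted class probabilities are multiplied by its weight and then averaged, and the label with the highest average probability is the prediction.
As an example, suppose there are 3 classifiers and 3 classes, and the classifiers have weights w1=1, w2=1, w3=1.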
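A small worked computation may make the weighted average concrete. The probabilities below are made-up illustrative values, not results from any model, and soft_vote is a hypothetical helper name.

```python
import numpy as np

def soft_vote(probas, weights):
    """Weighted average of per-classifier class probabilities.

    probas  : array of shape (n_classifiers, n_classes)
    weights : one weight per classifier
    Returns the averaged probabilities and the index of the winning class.
    """
    avg = np.average(np.asarray(probas, dtype=float), axis=0, weights=weights)
    return avg, int(np.argmax(avg))

# Illustrative predicted probabilities for one sample (rows: classifiers, columns: classes)
probas = [[0.2, 0.5, 0.3],
          [0.6, 0.3, 0.1],
          [0.3, 0.4, 0.3]]

avg, winner = soft_vote(probas, weights=[1, 1, 1])
print(np.round(avg, 2), winner)
```

With equal weights the second class (index 1) has the highest average probability, even though the second classifier on its own prefers the first class.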

The example below combines a decision tree, a k-nearest-neighbors classifier, and a kernel (RBF) SVM:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from itertools import product
from sklearn.ensemble import VotingClassifier

#Loading some example data
iris = datasets.load_iris()
X = iris.data[:, [0,2]]
y = iris.target

#Training classifiers
clf1 = DecisionTreeClassifier(max_depth=4)
clf2 = KNeighborsClassifier(n_neighbors=7)
clf3 = SVC(kernel='rbf', probability=True)
eclf = VotingClassifier(estimators=[('dt', clf1), ('knn', clf2), ('svc', clf3)], voting='soft', weights=[2,1,2])

clf1 = clf1.fit(X,y)
clf2 = clf2.fit(X,y)
clf3 = clf3.fit(X,y)
eclf = eclf.fit(X,y)


#Plot the decision regions of each classifier
x_min,x_max = X[:,0].min()-1,X[:,0].max()+1
y_min,y_max = X[:,1].min()-1,X[:,1].max()+1
xx,yy = np.meshgrid(np.arange(x_min,x_max,0.1),
                    np.arange(y_min,y_max,0.1))
f, axarr = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(10, 8))
for idx, clf, tt in zip(product([0, 1], [0, 1]),
                        [clf1, clf2, clf3, eclf],
                        ['Decision Tree (depth=4)', 'KNN (k=7)',
                         'Kernel SVM', 'Soft Voting']):

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.4)
    axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
    axarr[idx[0], idx[1]].set_title(tt)
plt.show()

18.3 Adaptive boosting (AdaBoost)

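AdaBoost is an iterative algorithm. Its core idea is to train different classifiers (weak classifiers) on the same training set and then combine them into a stronger final classifier (strong classifier). The algorithm works by changing the data distribution: it sets the weight of each sample according to whether that sample was classified correctly in the current round and according to the overall accuracy of the previous round. The reweighted data set is passed to the next classifier for training, and the classifiers obtained in all rounds are finally combined into the decision classifier. In this way AdaBoost can disregard some unnecessary training data features and concentrate on the key training data.
The following example fits an AdaBoost classifier with 100 weak learners: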

from sklearn.model_selection import cross_val_score  
from sklearn.datasets import load_iris  
from sklearn.ensemble import AdaBoostClassifier  

iris = load_iris()  
clf = AdaBoostClassifier(n_estimators=100)  
scores = cross_val_score(clf, iris.data, iris.target)  
scores.mean()

Output:
0.95996732026143794
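To make the reweighting idea concrete, here is a sketch of one round of the classic binary AdaBoost update (labels in {-1, +1}). It is an illustration under that assumption, not the exact variant implemented by AdaBoostClassifier; the function name adaboost_round and the toy labels are hypothetical.

```python
import numpy as np

def adaboost_round(sample_weights, y_true, y_pred):
    """One reweighting step of classic binary AdaBoost.

    Returns the weak learner's vote weight alpha and the renormalized sample weights.
    Assumes the weighted error is strictly between 0 and 1.
    """
    w = np.asarray(sample_weights, dtype=float)
    miss = np.asarray(y_true) != np.asarray(y_pred)
    err = w[miss].sum() / w.sum()              # weighted error of this weak learner
    alpha = 0.5 * np.log((1.0 - err) / err)    # larger when the learner is more accurate
    # Misclassified samples are scaled up, correctly classified ones are scaled down
    w_new = w * np.exp(np.where(miss, alpha, -alpha))
    return alpha, w_new / w_new.sum()

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])          # the second sample is misclassified
alpha, w1 = adaboost_round(np.full(5, 0.2), y_true, y_pred)
print(round(float(alpha), 3), np.round(w1, 3))
```

After one round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.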

18.4 Introduction to XGBoost

18.4.1 Overview

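XGBoost is an ensemble of many CART regression trees. A CART classification tree uses the Gini index as its splitting criterion. A regression tree, however, outputs a numeric value for each sample; for example, the amount of a home loan granted to a person is a specific number, which can be any value between 0 and 1.2 million yuan. In that case information gain, information gain ratio, and the Gini index cannot be used to decide how to split a node. A prediction error is used instead, commonly the mean squared error or the logarithmic error. The leaves are also no longer classes but numeric values (predictions). How are these determined? Some methods use the mean of the samples in the node, while others, such as XGBoost, obtain them by optimization.
Features of XGBoost:
1. The leaf weights w are obtained by optimization.
2. It uses regularization to prevent overfitting.
3. It supports distributed and parallel training. The parallelism lies mainly in the split search within each tree; the trees themselves are still added one after another.
4. It supports GPUs.
The figure below shows an example of a CART. The CART assigns each input to a leaf node according to its attributes, and each leaf node carries a score. The example predicts whether a person likes computer games. Once the leaves are expressed as scores, they can be used for many things, such as probability prediction and ranking.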

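A single CART is often too simple to predict effectively, so it is more effective to combine several CARTs and use an ensemble to improve prediction performance:

Suppose there are two regression trees; the prediction of the combined trees is as shown in the figure above.
XGBoost has many parameters; for details on their use see: https://cloud.tencent.com/developer/article/1111048
https://blog.csdn.net/han_xiaoyang/article/details/52665396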
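Since the figures are not reproduced here, the following sketch shows how the scores of two trees can be combined by simple addition. The two trees, their split rules, and their leaf scores are hypothetical values chosen for illustration; they are not output of XGBoost.

```python
def tree_age_gender(person):
    """Hypothetical tree 1: splits on age, then on gender."""
    if person["age"] < 15:
        return 2.0 if person["is_male"] else 0.1
    return -1.0

def tree_computer_use(person):
    """Hypothetical tree 2: splits on daily computer use."""
    return 0.9 if person["uses_computer_daily"] else -0.9

def ensemble_score(person, trees):
    """The ensemble prediction is the sum of the leaf scores of all trees."""
    return sum(tree(person) for tree in trees)

trees = [tree_age_gender, tree_computer_use]
boy = {"age": 10, "is_male": True, "uses_computer_daily": True}
grandpa = {"age": 70, "is_male": True, "uses_computer_daily": False}

print(round(ensemble_score(boy, trees), 2))      # high score: likely to enjoy computer games
print(round(ensemble_score(grandpa, trees), 2))  # low score: unlikely to
```

Each tree contributes one leaf score per sample, and boosting adds trees one at a time so that each new tree corrects the remaining error of the sum so far.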

18.4.2 XGBoost example

Main goal: predict house prices using several prediction methods
Data structure:
Data exploration and preprocessing:
Model creation and tuning:

18.4.2.1 Data exploration and preprocessing

(1) Import the required libraries, load the data, and view the first five sample rows. Reference:

#import some necessary libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt  # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)


from scipy import stats
from scipy.stats import norm, skew #for some statistics


pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points


from subprocess import check_output
print(check_output(["ls", "../data"]).decode("utf8")) #check the files available in the directory


#Now let's import and put the train and test datasets in  pandas dataframe

train = pd.read_csv('../data/house_train.csv')
test = pd.read_csv('../data/house_test.csv')
##display the first five rows of the train dataset.
train.head(5)
 
#check the numbers of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))

#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']

#Drop the Id column; inplace=True modifies the original DataFrame directly.
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

#check again the data size after dropping the 'Id' variable
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape)) 
print("The test data size after dropping Id feature is : {} ".format(test.shape))

The train data size before dropping Id feature is : (1460, 81)
The test data size before dropping Id feature is : (1459, 80)

The train data size after dropping Id feature is : (1460, 80)
The test data size after dropping Id feature is : (1459, 79)

(2) Explore outliers

fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

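(3) Remove some outliers
Remove the records whose sale price (SalePrice) is below 300000 (US dollars) and whose living area (GrLivArea) is larger than 4000 square feet. These are mainly the few points in the lower right of the figure above.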

#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

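(4) Explore the distribution of house prices
We analyze the target variable (house price) by plotting its distribution and its Q-Q plot. A Q-Q plot (quantile-quantile plot) is mainly used to check whether two distributions are similar. To test data for normality with a Q-Q plot, put the quantiles of the normal distribution on the x-axis and the sample quantiles on the y-axis; if the resulting points lie close to a straight line, the sample data are approximately normally distributed.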

sns.distplot(train['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

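The figure above shows that the target variable is right-skewed. Since (linear) models work better with normally distributed data, we need to transform the house price so that it is closer to a normal distribution.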

(5) Log-transform the house price

#We use the numpy function log1p which applies log(1+x) to all elements of the column
train["SalePrice"] = np.log1p(train["SalePrice"])

#Check the new distribution 
sns.distplot(train['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

The skew is now corrected and the data look much closer to a normal distribution.
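One practical consequence of training on a log1p-transformed target is that model predictions are on the log scale and have to be mapped back with the inverse function, expm1. A small sketch with made-up prices (not taken from the dataset):

```python
import numpy as np

# Hypothetical sale prices in dollars, with one very expensive house
prices = np.array([34900.0, 129500.0, 163000.0, 755000.0])

log_prices = np.log1p(prices)    # log(1 + x), the transform applied to the target
restored = np.expm1(log_prices)  # exp(x) - 1, the exact inverse of log1p

# The transform compresses the long right tail: the max/min ratio shrinks sharply
print(round(float(prices.max() / prices.min()), 2))
print(round(float(log_prices.max() / log_prices.min()), 2))
print(np.allclose(restored, prices))
```

The same expm1 call would be applied to the predictions of any model fitted on the transformed target before reporting prices in dollars.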
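(6) Concatenate the training and test data
To process everything uniformly, we combine the training data and the test data. The house price values are saved separately and are not processed further.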

ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
#reset_index keeps the index of the concatenated data contiguous
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))

Output:
all_data size is : (2917, 79)
(7) Check for missing data
Below we look at the missing ratio of each feature.

all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)

Output:
              Missing Ratio
PoolQC        99.691
MiscFeature   96.4
Alley         93.212
Fence         80.425
FireplaceQu   48.68
LotFrontage   16.661
GarageQual    5.451
GarageCond    5.451
GarageFinish  5.451
GarageYrBlt   5.451
GarageType    5.382
BsmtExposure  2.811
BsmtCond      2.811
BsmtQual      2.777
BsmtFinType2  2.743
BsmtFinType1  2.708
MasVnrType    0.823
MasVnrArea    0.788
MSZoning      0.137
BsmtFullBath  0.069
Visualize these data:

f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)

(8) Inspect the correlations in the data

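#Correlation map to see how features are correlated with SalePrice
#Only numeric columns are used, because recent pandas versions raise an error on text columns
corrmat = train.select_dtypes(include=[np.number]).corr()
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=0.9, square=True)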

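The darker the color, the stronger the correlation.
(9) Fill in missing values
Below we handle each feature that has missing values.
PoolQC: the data description says NA means "no pool". That makes sense, given the very large ratio of missing values (over 99%) and the fact that most houses have no pool. Here we replace the missing values with None.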

all_data["PoolQC"] = all_data["PoolQC"].fillna("None")

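MiscFeature: the data description says NA means "no misc feature", so the missing values are replaced with None.
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
The features Alley, Fence, FireplaceQu, GarageType, GarageFinish, GarageQual and GarageCond are handled the same way; these features all have relatively high missing ratios.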

for col in ('Alley','Fence','FireplaceQu','GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')

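LotFrontage (linear feet of street connected to the property): the street frontage of a house is most likely similar to that of the other houses in its neighborhood, so we fill the missing values with the median LotFrontage of the neighborhood.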

#Group by neighborhood and fill in missing value by the median LotFrontage of all the neighborhood
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
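The groupby/transform pattern above can be checked on a tiny hand-made table. The neighborhood names and frontage values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two neighborhoods, one missing value in each
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 30.0, 50.0, np.nan],
})

# transform returns a Series aligned with the original rows, so it can be assigned back directly
toy["LotFrontage"] = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

print(toy)
```

The missing value in neighborhood A becomes 70.0 (the median of 60 and 80) and the one in B becomes 40.0, so each gap is filled from its own group rather than from the global median.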

GarageYrBlt, GarageArea and GarageCars: replace missing data with 0 (no garage means no cars in such a garage).

for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)

BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath, BsmtHalfBath and MasVnrArea are handled the same way.

for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath','MasVnrArea'):
    all_data[col] = all_data[col].fillna(0)

For the features BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2 and MasVnrType, missing values are replaced with None.

for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2','MasVnrType'):
    all_data[col] = all_data[col].fillna('None')

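MSZoning (the general zoning classification): 'RL' is by far the most common value, so we fill the missing values with 'RL'. The most frequent value is obtained with mode()[0].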

all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])

The features Electrical, KitchenQual, Exterior1st, Exterior2nd and SaleType are handled the same way.

for col in ('Electrical','KitchenQual','Exterior1st','Exterior2nd','SaleType'):
    all_data[col] = all_data[col].fillna(all_data[col].mode()[0])

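Utilities: for this categorical feature all records are "AllPub", except for one "NoSeWa" and 2 NA. Since the house with "NoSeWa" is in the training set, this feature does not help predictive modelling, and we can safely remove it.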

all_data = all_data.drop(['Utilities'], axis=1)
Functional: the data description says NA means typical.
all_data["Functional"] = all_data["Functional"].fillna("Typ")

MSSubClass: NA most likely means no building class. We can replace the missing values with None.

all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")

After filling in the missing values as above, we check again for any that remain.

#Check remaining missing values if any 
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head()

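Output:
No features with missing values remain!
(10) Convert some numeric features that are really categorical
Some numeric features actually encode categories, so they are converted to strings here.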

#MSSubClass=The building class
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)


#Changing OverallCond into a categorical variable
all_data['OverallCond'] = all_data['OverallCond'].astype(str)


#Year and month sold are transformed into categorical features.
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)

[Note]
apply(str) and astype(str) have the same effect here; the former is usually the faster of the two.

(11) Encode the categorical features
LabelEncoder is used here to assign integer codes to discrete numeric or text values.

from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(all_data[c].values)) 
    all_data[c] = lbl.transform(list(all_data[c].values))

# shape        
print('Shape all_data: {}'.format(all_data.shape))
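LabelEncoder numbers the classes in sorted order. The sketch below reproduces that mapping with NumPy alone on a few made-up quality grades, to show what the integer codes look like; it is an illustration of the encoding, not a replacement for the loop above.

```python
import numpy as np

# Hypothetical values of a quality feature (Ex = excellent, Gd = good, TA = typical)
values = np.array(["TA", "Gd", "None", "Ex", "Gd", "TA"])

# classes are the sorted unique values; codes are the positions of each value in classes
classes, codes = np.unique(values, return_inverse=True)

print(classes.tolist())
print(codes.tolist())
```

Note that the codes follow alphabetical order (Ex=0, Gd=1, None=2, TA=3), not the quality order, so tree-based models cope with them more naturally than linear models do.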

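(12) Add a feature
Because area-related features are very important for determining house prices, we add one feature: the total area of each house, which is the sum of the basement area, the first-floor area and the second-floor area.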

#Adding total sqfootage feature 
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']

(13) Check the skewness of the features
The analysis here covers the numeric features.

numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)

Output:

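(14) Apply the Box-Cox transformation to the skewed features
The Box-Cox transformation is a data transformation commonly used in statistical modelling when a continuous response variable is not normally distributed. It helps a linear regression model satisfy linearity, independence, homoscedasticity and normality without losing information.
The Box-Cox family can usually transform data to normality successfully. It does not work for binary variables or ordinal variables with few levels; in those cases we can consider a generalized linear model such as the logistic model, or the Johnson transformation.
After a Box-Cox transformation the residuals better satisfy assumptions such as normality and independence, which lowers the probability of spurious regression.
$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases}$$
Choosing λ is the key step. How is it determined? It is usually estimated by maximum likelihood, and the value generally lies in [-5, 5].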
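As a rule of thumb, if the distribution of y is steep on the left (with a long tail to the right), take λ=0, which is the log transform; if the distribution leans the other way, take λ>0.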
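The transformation can be written directly in NumPy. The sketch below implements the shifted form applied to 1 + x, which is what scipy.special.boxcox1p computes; the function name boxcox1p_np and the sample values are illustrative assumptions.

```python
import numpy as np

def boxcox1p_np(x, lam):
    """Box-Cox transform of 1 + x: ((1 + x)**lam - 1) / lam, or log(1 + x) when lam == 0."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log1p(x)
    return ((1.0 + x) ** lam - 1.0) / lam

x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])   # hypothetical right-skewed values

print(np.round(boxcox1p_np(x, 0.15), 3))   # lam = 0.15: strong compression of large values
print(np.round(boxcox1p_np(x, 0.0), 3))    # lam = 0: exactly log1p
print(np.round(boxcox1p_np(x, 1.0), 3))    # lam = 1: the data are unchanged
```

Smaller values of λ compress the right tail more strongly, and as λ approaches 0 the result approaches log(1 + x), so the fixed λ acts as a dial between no transform and the log transform.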

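# Note: boolean-masking a DataFrame keeps every row (values that fail the test become NaN),
# so all numeric features are transformed below; to keep only the strongly skewed ones,
# filter on the column instead: skewness[abs(skewness['Skew']) > 0.75]
skewness = skewness[abs(skewness) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15  # fixed lambda applied to every feature
for feat in skewed_features:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam)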

(15) Create dummy features by converting the nominal categories to one-hot encoding.

all_data = pd.get_dummies(all_data)
print(all_data.shape)

Output:
(2917, 220)
(16) Build the new training and test sets

train = all_data[:ntrain]
test = all_data[ntrain:]

Check their dimensions:

print(train.shape)
print(test.shape)

Output:
(1458, 220)
(1459, 220)
Data exploration and preprocessing are now essentially complete. Next we create the models.

18.4.2.2 Creating the models

(1) Import the required libraries

from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

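(2) Use K-fold cross-validation with k=5
A brief introduction to K-fold cross-validation:
1. Split the data set into K equal parts.
2. Use 1 part as test data and the rest as training data.
3. Compute the test accuracy.
4. Repeat steps 2 and 3 with a different part as the test set.
5. Average the test accuracies as an estimate of the prediction accuracy on unseen data.
See the figure below.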
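The five steps can be spelled out by hand. The sketch below uses NumPy only, a toy data set, and a deliberately simple stand-in "model" (a least-squares slope through the origin); the helper name k_fold_indices and all data are hypothetical, and RMSE is used as the score instead of accuracy because the task is regression.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)   # step 1: K roughly equal parts
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]                            # step 2: one part is the test set

# Toy regression data (hypothetical): y = 2x plus a little noise
rng = np.random.default_rng(1)
X = np.arange(20, dtype=float)
y = 2.0 * X + rng.normal(scale=0.1, size=20)

rmse_per_fold = []
for train_idx, test_idx in k_fold_indices(len(X), k=5):     # step 4: repeat for every fold
    # Fit on the training part only
    slope = np.sum(X[train_idx] * y[train_idx]) / np.sum(X[train_idx] ** 2)
    residual = y[test_idx] - slope * X[test_idx]
    rmse_per_fold.append(np.sqrt(np.mean(residual ** 2)))   # step 3: score on the held-out part

print(len(rmse_per_fold), float(np.mean(rmse_per_fold)))    # step 5: average over the folds
```

Every sample is used for testing exactly once, which is why the averaged score is a less optimistic estimate than the error measured on the training data.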

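#Validation function
n_folds = 5

def rmsle_cv(model):
    # Note: get_n_splits() returns the integer n_folds, so cv receives 5 and the folds
    # are not shuffled; pass the KFold object itself as cv if shuffling is wanted.
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse= np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)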

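[Notes]
# cross_val_score ties the whole cross-validation procedure together, so the data do not have to be split by hand.
# The cv parameter specifies how many folds the original data are split into.
# scoring: this parameter controls which metric is applied to the estimators being evaluated.
For classification models the value can be 'accuracy' or 'f1'; for regression models it can be 'explained_variance', 'neg_mean_squared_error', and so on. For error metrics, smaller is better and the best value is 0.
# Several evaluation metrics for regression models:
Mean squared error (MSE):
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
Mean squared logarithmic error (MSLE):
$$\mathrm{MSLE} = \frac{1}{n}\sum_{i=1}^{n}\left(\ln(1 + y_i) - \ln(1 + \hat{y}_i)\right)^2$$
Mean absolute error (MAE):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
For more model evaluation metrics, see: http://sklearn.apachecn.org/cn/0.19.0/modules/model_evaluation.html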
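The three regression metrics are easy to compute by hand, which helps when checking what a scoring string such as 'neg_mean_squared_error' actually measures. The true values and predictions below are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def msle(y_true, y_pred):
    """Mean squared logarithmic error (inputs must be greater than -1)."""
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical targets
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # hypothetical predictions

print(mse(y_true, y_pred), mae(y_true, y_pred), round(msle(y_true, y_pred), 4))
```

Because the target in this chapter is already log1p-transformed, taking the square root of the MSE on that scale gives a root mean squared logarithmic error with respect to the original prices.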
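(3) Use the models
In model selection there is "no free lunch". To compare performance across models, several models are used here, starting with three regression models (Lasso, Elastic Net, Kernel Ridge).
LASSO Regression:
This model can be very sensitive to outliers, so it needs to be made more robust. To do that we use sklearn's RobustScaler() in a pipeline.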

lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))

Elastic Net Regression:
This model is again made robust to outliers, so RobustScaler() is used here as well.

ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))

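[About ElasticNet]
ElasticNet is a linear regression model trained with both L1 and L2 regularization of the coefficients. This combination allows learning a sparse model in which few of the weights are non-zero, like Lasso, while still keeping the regularization properties of Ridge. The parameter ρ (see its position in the expression below) controls the convex combination (a special kind of linear combination) of L1 and L2.
Elastic net is very useful when several features are correlated with one another. Lasso tends to pick one of them at random, while elastic net tends to pick both.
A practical advantage of trading off between Lasso and Ridge is that it lets elastic net inherit some of Ridge's stability under rotation.
The objective function of elastic net is to minimize:
$$\min_{w}\ \frac{1}{2n}\lVert Xw - y\rVert_2^2 + \alpha\rho\lVert w\rVert_1 + \frac{\alpha(1-\rho)}{2}\lVert w\rVert_2^2$$
ElasticNetCV can set the following parameters by cross-validation:
alpha (α), l1_ratio (ρ)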

Kernel Ridge Regression:

KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)

Gradient Boosting Regression:
With the huber loss it is robust to outliers.

GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)

XGBoost:

model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =7, nthread = -1)

LightGBM:

model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

18.4.2.3 Performance of each model

Let's see how these base models perform on the data by evaluating their cross-validated RMSLE.

score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

score = rmsle_cv(model_lgb)
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))

Output:
Lasso score: 0.0032 (0.0002)
ElasticNet score: 0.0031 (0.0002)
Kernel Ridge score: 0.0034 (0.0004)
Gradient Boosting score: 0.0026 (0.0002)
Xgboost score: 0.0086 (0.0003)
LGBM score: 0.0026 (0.0002)

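18.4.2.4 Stacking (ensembling) the models

We start with the simple approach of averaging the base models. We build a new class that extends scikit-learn with our own model, which also gives us encapsulation and code reuse (inheritance).
(1) Averaging the models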

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)

(2) Performance of the averaged model
Here we average four base models: ENet, GBoost, KRR and lasso.

averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))

score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

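Output:
 Averaged base models score: 0.0027 (0.0002)
Not bad! Even this simplest ensembling approach scores 0.0027, which is better than Lasso, ElasticNet and Kernel Ridge on their own and close to the best single models (0.0026). This encourages us to explore less simple stacking approaches.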

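18.4.2.5 Adding a meta-model

The basic idea of this ensembling approach is as follows:
We add a meta-model on top of the averaged base models and train it on the out-of-fold predictions of these base models.
The training procedure can be described as follows:
1) Split the whole training set into two disjoint sets (here, train and holdout).
2) Train several base models on the first part (train).
3) Test these base models on the second part (holdout).
4) Use the predictions from 3) (called out-of-fold predictions) as inputs and the correct responses (the target variable) as outputs to train a higher-level meta-model.
The first three steps are done iteratively. With 5-fold stacking, for example, we first split the training data into 5 folds and then run 5 iterations. In each iteration, we train every base model on 4 folds and predict on the remaining fold.
After 5 iterations, every sample of the training data has an out-of-fold prediction, and we use these predictions as new features to train the meta-model in step 4.
For prediction, we average the predictions of all base models on the test data and use them as meta-features; the final prediction is made by the meta-model.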

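[Figure: the steps described above]

[Figure: animation of how the out-of-fold data is generated]

For an introduction to stacking, see:
http://blog.kaggle.com/2017/06/15/stacking-made-easy-an-introduction-to-stacknet-by-competitions-grandmaster-marios-michailidis-kazanova/
The following figure explains the principle of stacking from another angle.
[Figure: stacking with 5-fold predictions; the column headers of the upper part are all Model 1]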

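How should this figure be read? Let's walk through a simple example:
The Train Data has 890 rows (see the upper part of the figure).
In each fold we get a small training split of 712 rows and a small held-out split of 178 rows. We train Model 1 on the 712-row split and then predict the 178-row split, which gives 178 predicted values.
This is done 5 times, so 178 predictions × 5 = 890 predictions, exactly the length of the Train Data. These 890 predictions are produced by Model 1; we keep them because they will become the training input of the second-layer model.
Key point: the predictions from this step can be reshaped to 890 × 1 (890 rows, 1 column), denoted P1 (uppercase P).
Next, the Test Data has 418 rows (see the lower part of the figure, the green boxes).
In each fold, the Model 1 trained on the 712-row split also predicts the entire Test Data (all of it, because the Test Data takes no part in the 5-fold split). Each time, Model 1 produces 418 predicted values.
Doing this 5 times yields a 5 × 418 matrix of predictions. Averaging over the 5 rows gives a 1 × 418 vector of mean predictions.
Key point: the predictions from this step can be reshaped to 418 × 1 (418 rows, 1 column), denoted p1 (lowercase p).
At this point, Model 1 of the first layer has done its job.
The first layer contains other models as well, for example Model 2. Running it the same way gives another 890 × 1 column (P2) and another 418 × 1 column (p2) of predictions.
Suppose the first layer has 3 models. You then get:
an 890 × 3 matrix of out-of-fold predictions (P1, P2, P3) and a 418 × 3 matrix of Test Data predictions (p1, p2, p3).
Now for the second layer:
The 890 × 3 out-of-fold prediction matrix serves as the training data for the second-layer model.
The 418 × 3 Test Data prediction matrix serves as the test data, which the trained second-layer model then predicts.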
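The shapes in this walkthrough can be checked with a minimal sketch. The data here is synthetic and the three Ridge models are only stand-ins for "Model 1" to "Model 3"; none of these names come from the original notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-ins for the 890-row Train Data and the 418-row Test Data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(890, 4))
y_train = X_train @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=890)
X_test = rng.normal(size=(418, 4))

# Three hypothetical first-layer models
first_layer = [Ridge(alpha=a) for a in (0.1, 1.0, 10.0)]
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

P = np.zeros((890, len(first_layer)))  # out-of-fold predictions (P1, P2, P3)
p = np.zeros((418, len(first_layer)))  # averaged test predictions (p1, p2, p3)

for j, model in enumerate(first_layer):
    test_preds = np.zeros((5, 418))  # one row of test predictions per fold
    for k, (train_idx, holdout_idx) in enumerate(kfold.split(X_train)):
        model.fit(X_train[train_idx], y_train[train_idx])       # 712 rows
        P[holdout_idx, j] = model.predict(X_train[holdout_idx])  # 178 rows
        test_preds[k] = model.predict(X_test)                    # all 418 rows
    p[:, j] = test_preds.mean(axis=0)  # average over the 5 folds

# Second layer: train on P, predict on p
meta_model = Ridge(alpha=1.0).fit(P, y_train)
final_pred = meta_model.predict(p)
print(P.shape, p.shape, final_pred.shape)
```

Each training row receives exactly one out-of-fold prediction per model, which is why P has the same number of rows as the Train Data.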

(1) The stacked averaged models class

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
   
    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
                
        # Now train the cloned  meta-model using the out-of-fold predictions as new feature
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self
   
    #Do the predictions of all base models on the test data and use the averaged predictions as 
    #meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)

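(2) Score of the stacked averaged models
To make the two approaches comparable (by using the same number of models), we only average ENet, KRR and GBoost, and add lasso as the meta-model.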

stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR),
                                                 meta_model = lasso)

score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))

Output:
Stacking Averaged models score: 0.0026 (0.0002)
Adding a meta-learner improves the score again: 0.0026, compared with 0.0027 (0.0002) for the simple average of the base models.
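For comparison, scikit-learn (0.22 and later) ships a built-in `StackingRegressor` that follows the same idea. The sketch below runs it on synthetic data with arbitrarily chosen hyperparameters, so the numbers are not comparable to the scores above. It also differs from the class defined earlier in one respect: for prediction it refits each base model on the full training data instead of averaging the per-fold clones.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption made for this illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

stack = StackingRegressor(
    estimators=[
        ("enet", ElasticNet(alpha=0.01)),
        ("gboost", GradientBoostingRegressor(n_estimators=50, random_state=0)),
        ("krr", KernelRidge(alpha=1.0)),
    ],
    final_estimator=Lasso(alpha=0.001),  # meta-model trained on out-of-fold predictions
    cv=5,                                # folds used to build those predictions
)

scores = cross_val_score(stack, X, y, scoring="neg_mean_squared_error", cv=3)
rmse = np.sqrt(-scores)
print("RMSE per fold:", rmse)
```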

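(3) Ensembling StackedRegressor, XGBoost and LightGBM
We add XGBoost and LightGBM to the previously defined StackedRegressor.
We first define an rmsle evaluation function. It computes a plain RMSE, which corresponds to an RMSLE only if the target has already been log-transformed (the predictions are mapped back with np.expm1).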

def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

(4) Final training and prediction

stacked_averaged_models.fit(train.values, y_train)
stacked_train_pred = stacked_averaged_models.predict(train.values)
stacked_pred = np.expm1(stacked_averaged_models.predict(test.values))
print(rmsle(y_train, stacked_train_pred))

Output:
0.00165651517208

(5) XGBoost

model_xgb.fit(train, y_train)
xgb_train_pred = model_xgb.predict(train)
xgb_pred = np.expm1(model_xgb.predict(test))
print(rmsle(y_train, xgb_train_pred))

Output:
0.00860202560688

(6) LightGBM

model_lgb.fit(train, y_train)
lgb_train_pred = model_lgb.predict(train)

lgb_pred = np.expm1(model_lgb.predict(test.values))
print(rmsle(y_train, lgb_train_pred))

Output:
0.0016023894136

'''RMSE on the entire Train data when averaging'''

print('RMSLE score on train data:')
print(rmsle(y_train,stacked_train_pred*0.70 +
               xgb_train_pred*0.15 + lgb_train_pred*0.15 ))

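Output:
RMSLE score on train data:
0.002334422609
This is the score of the weighted blend (0.70 for the stacked model, 0.15 each for XGBoost and LightGBM). The original notebook reads it as a sign that even this simplest ensembling approach helps; keep in mind, however, that it is measured on the training data and is therefore not a reliable estimate of generalization performance.
For details, see: https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
Ensemble learning has been very successful: it frequently breaks performance records on challenging data sets and is one of the methods most commonly used by winners of Kaggle data science competitions.
For more on ensemble learning methods, see:
https://zhuanlan.zhihu.com/p/25836678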
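The weighted blend used in the last step can be written as a small helper. This is only a sketch: `blend`, `rmse` and the toy arrays are hypothetical names and values introduced for illustration; only the weights 0.70 / 0.15 / 0.15 come from the text.

```python
import numpy as np

def blend(predictions, weights):
    """Weighted average of several prediction vectors; weights must sum to 1."""
    predictions = np.asarray(predictions, dtype=float)  # shape (n_models, n_samples)
    weights = np.asarray(weights, dtype=float)          # shape (n_models,)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    return weights @ predictions

def rmse(y, y_pred):
    """Root mean squared error between two vectors."""
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_pred)) ** 2)))

# Toy predictions from three hypothetical models
y_true = np.array([1.0, 2.0, 3.0])
stacked = np.array([1.1, 2.0, 2.9])
xgb_like = np.array([0.8, 2.2, 3.3])
lgb_like = np.array([1.0, 1.9, 3.1])

blended = blend([stacked, xgb_like, lgb_like], [0.70, 0.15, 0.15])
print(blended, rmse(y_true, blended))
```

In practice the weights would be chosen on held-out data rather than fixed by hand.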

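17 Introduction to dimensionality reduction

Once feature selection is complete, the model can be trained directly. However, the feature matrix may be so large that computation becomes expensive and training slow, so reducing the dimensionality of the feature matrix is often necessary as well. Besides the L1-penalty-based models mentioned above, common dimensionality reduction methods include principal component analysis (PCA) and linear discriminant analysis (LDA); LDA is itself also a classification model. PCA and LDA have much in common: both map the original samples into a lower-dimensional sample space. Their mapping objectives differ, however: PCA seeks the mapping after which the samples have maximum spread (variance), whereas LDA seeks the mapping after which the samples are best separated by class. PCA is therefore an unsupervised dimensionality reduction method and LDA a supervised one.
PCA and LDA are both linear transformations and generally assume that the data set is linearly separable; applied to linearly inseparable data sets, the results are often unsatisfactory. Such data sets are common, so how should they be reduced? Here we introduce kernel PCA, which combines the kernel trick with the idea of PCA and works well on nonlinear data sets.
We also introduce SVD, another very effective dimensionality reduction method.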

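17.1 Introduction to PCA

Principal component analysis (PCA) is a dimensionality reduction technique used in data preprocessing. Raw data is often high-dimensional, say 1000 features, many of which may carry useless information or noise, while only 50 or fewer are really useful. In that case PCA can be used to reduce the 1000 features to 50. This not only removes useless noise but also greatly reduces the amount of computation.
How does PCA work?
Put simply, it transforms the data from the original feature space into a new one. For example, if the original space is three-dimensional (x, y, z), with x, y and z as its three basis vectors, we can represent the data in a new coordinate system (a, b, c), where a, b and c form the new basis and span the new feature space. In the new feature space, the projections of all data points onto c may be close to 0 and can be neglected, so the data can be represented by (a, b) alone, reducing it from three dimensions (x, y, z) to two (a, b).
The question is how to find the new basis (a, b, c).
The usual steps are:
1) Standardize the original data set.
2) Compute the covariance matrix.
3) Compute the eigenvalues and eigenvectors of the covariance matrix.
4) Select the eigenvectors that correspond to the k largest eigenvalues, where k is smaller than the dimensionality of the original data set.
5) Build the projection matrix W from these k eigenvectors.
6) Use W to transform the original data into the new k-dimensional feature subspace.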
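The six steps can be condensed into one short function. This is a minimal sketch on synthetic data, not the implementation used in the rest of this section: `pca_project` and `X_demo` are hypothetical names, and `np.linalg.eigh` is used here because a covariance matrix is symmetric.

```python
import numpy as np

def pca_project(X, k):
    """Project X onto its first k principal components (steps 1 to 6)."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # 1) standardize
    cov = np.cov(X_std.T)                          # 2) covariance matrix
    vals, vecs = np.linalg.eigh(cov)               # 3) eigen-decomposition
    order = np.argsort(vals)[::-1][:k]             # 4) indices of the k largest eigenvalues
    W = vecs[:, order]                             # 5) d x k projection matrix
    return X_std @ W, vals[order]                  # 6) project the data

# Three strongly correlated features: one component should carry almost everything
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
X_demo = np.hstack([latent, 2 * latent, -latent]) + 0.05 * rng.normal(size=(300, 3))

Z, top_vals = pca_project(X_demo, k=2)
print(Z.shape, top_vals)
```

Because the three synthetic features are nearly copies of one latent variable, the first eigenvalue is close to 3 (the total variance of three standardized features) and the second is close to 0.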

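17.2 Implementing PCA

We use the wine data set as an example. Its features are as follows:
[Figure: features of the wine data set]
Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
1) Standardize the original data set
Import the required libraries and load the data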

import pandas as pd

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 
'Alcalinity of ash', 'Magnesium', 'Total phenols', 
'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 
'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']

df_wine.head()

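[Output: the first rows of the data set]

To simplify later processing, split the data set into a training set and a test set at a ratio of 7:3.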

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.3, random_state=0)

Standardize the raw data

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
# Reuse the mean and standard deviation learned from the training set
X_test_std = sc.transform(X_test)
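A scaler is normally fitted on the training set only and then applied unchanged to the test set, so that both sets are expressed on the same scale. The toy numbers below (an assumption for illustration) show how the result differs when the scaler is refitted on the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[0.0], [2.0], [4.0]])  # mean 2, population variance 8/3
X_test_toy = np.array([[2.0], [6.0]])

sc_toy = StandardScaler().fit(X_train_toy)

# Test data expressed in training-set units
consistent = sc_toy.transform(X_test_toy)

# Refitting on the test set rescales it with its own mean and variance
refit = StandardScaler().fit_transform(X_test_toy)

print(consistent.ravel(), refit.ravel())
```

After refitting, the test value 2.0 no longer maps to 0 even though it equals the training mean, so a model trained on the standardized training data would see shifted inputs.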

2) Compute the covariance matrix
Here we use the numpy.cov function to compute the covariance matrix of the standardized data.

import numpy as np
cov_mat = np.cov(X_train_std.T)

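3) Compute the eigenvalues and eigenvectors of the covariance matrix
Use the np.linalg.eig function to compute the eigenvalues and eigenvectors of the covariance matrix.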

eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)
print('\nEigenvalues \n%s' % eigen_vals)

This yields 13 eigenvalues:
Eigenvalues
[ 4.8923083 2.46635032 1.42809973 1.01233462 0.84906459 0.60181514 0.52251546 0.08414846 0.33051429 0.29595018 0.16831254 0.21432212 0.2399553 ]
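To reduce the dimensionality, we keep the k eigenvectors that carry the most information (the largest variance). Because the magnitude of an eigenvalue determines the importance of its eigenvector, we sort the eigenvalues and take the k largest. The explained variance ratio of an eigenvalue λ_i is its share of the sum of all eigenvalues:

$\frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}$

The cumulative explained variance can be computed with the numpy.cumsum function.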

tot = sum(eigen_vals)
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

# Then plot the explained variance ratio of each principal component with matplotlib.

import matplotlib.pyplot as plt
%matplotlib inline

plt.bar(range(1, 14), var_exp, alpha=0.5, align='center',
        label='individual explained variance')
plt.step(range(1, 14), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('Explained variance ratio', size=12)
plt.xlabel('Principal components', size=12)
plt.legend(loc='best')
plt.tight_layout()
plt.show()

The plot shows that the first principal component accounts for roughly 37% of the total variance, and the first two together for about 56%.
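These shares can be recomputed directly from the 13 eigenvalues printed earlier. The sketch below simply copies those printed values into an array, so it runs without the data set:

```python
import numpy as np

# Eigenvalues as printed in the output above
eigen_vals = np.array([4.8923083, 2.46635032, 1.42809973, 1.01233462,
                       0.84906459, 0.60181514, 0.52251546, 0.08414846,
                       0.33051429, 0.29595018, 0.16831254, 0.21432212,
                       0.2399553])

# Share of each eigenvalue in the total, largest first
var_exp = np.sort(eigen_vals)[::-1] / eigen_vals.sum()
cum_var_exp = np.cumsum(var_exp)

print(np.round(var_exp[:2], 4))
print(np.round(cum_var_exp[:2], 4))
```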
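4) Select the eigenvectors that correspond to the k largest eigenvalues, where k is smaller than the dimensionality of the original data set
First, sort the eigenvalues in descending order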

# Build a list of (eigenvalue, eigenvector) tuples
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:,i]) for i in range(len(eigen_vals))]

# Sort the tuples from high to low by eigenvalue only, so that ties
# never fall back to comparing the eigenvector arrays
eigen_pairs.sort(key=lambda k: k[0], reverse=True)

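5) Build the projection matrix W from the first k eigenvectors.
To make the data easy to visualize, we choose k=2 here; the first two eigenvalues already account for about 56% of the total variance.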

w = np.hstack((eigen_pairs[0][1][:, np.newaxis],
               eigen_pairs[1][1][:, np.newaxis]))
print('Matrix W:\n', w)

This gives the 13 × 2 matrix W built from these two eigenvectors:
Matrix W:
[[ 0.14669811 0.50417079]
[-0.24224554 0.24216889]
[-0.02993442 0.28698484]
[-0.25519002 -0.06468718]
[ 0.12079772 0.22995385]
[ 0.38934455 0.09363991]
[ 0.42326486 0.01088622]
[-0.30634956 0.01870216]
[ 0.30572219 0.03040352]
[-0.09869191 0.54527081]
[ 0.30032535 -0.27924322]
[ 0.36821154 -0.174365 ]
[ 0.29259713 0.36315461]]

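6) Use the matrix W to transform the original data into the new k-dimensional feature subspace
Using the projection matrix W, a sample x (a 1 × 13 row vector) is mapped onto the PCA subspace, giving a new sample x':
x' = xW
Taking the dot product of the whole training set with W maps it onto the subspace spanned by the two principal components. We then visualize the data in this subspace.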

X_train_pca = X_train_std.dot(w)
colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']

for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_pca[y_train==l, 0], 
                X_train_pca[y_train==l, 1], 
                c=c, label=l, marker=m)

plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()

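The plot shows that most of the data is spread along PC 1 and that the classes can be separated linearly. The class labels y_train are used only to color and mark the points in the plot; PCA itself does not use them.

We implemented PCA in six steps, which is fairly tedious. Is there a simpler way?
Yes: next we use the PCA class in scikit-learn to perform the dimensionality reduction.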

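17.3 Principal component analysis with scikit-learn

We preprocess the data set with scikit-learn's PCA, then classify the transformed data with logistic regression, and finally visualize the result.
1) Data preprocessing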

from sklearn.decomposition import PCA

pca = PCA()
X_train_pca = pca.fit_transform(X_train_std)
pca.explained_variance_ratio_

The explained variance ratios of the principal components are:
array([ 0.37329648, 0.18818926, 0.10896791, 0.07724389, 0.06478595, 0.04592014, 0.03986936, 0.02521914, 0.02258181, 0.01830924, 0.01635336, 0.01284271, 0.00642076])
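As a cross-check, a manual eigen-decomposition and scikit-learn's PCA should give the same explained variance ratios and the same projection, up to the sign of each component. The following is a minimal sketch on synthetic data (the data and variable names are assumptions made for illustration; `np.linalg.eigh` is used because the covariance matrix is symmetric):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data whose four features have clearly different variances
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4)) * np.array([5.0, 3.0, 2.0, 1.0])
X_centered = X_demo - X_demo.mean(axis=0)

# Manual route: eigen-decomposition of the covariance matrix
vals, vecs = np.linalg.eigh(np.cov(X_centered.T))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
manual_ratio = vals / vals.sum()
manual_proj = X_centered @ vecs[:, :2]

# scikit-learn route (PCA centers the data internally)
pca_demo = PCA(n_components=2).fit(X_demo)
sk_proj = pca_demo.transform(X_demo)

print(manual_ratio[:2], pca_demo.explained_variance_ratio_)
```

The sign of an eigenvector is arbitrary, which is why the two projections are compared by absolute value rather than directly.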

2) Plot the explained variance ratios of the principal components

plt.bar(range(1, 14), pca.explained_variance_ratio_, alpha=0.5, align='center')
plt.step(range(1, 14), np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.show()

3) Keep the first two principal components

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

4) Project the training set onto the principal component space and visualize it.

plt.scatter(X_train_pca[:,0], X_train_pca[:,1])
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()

5) Classify the data with a logistic regression model.

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr = lr.fit(X_train_pca, y_train)

6) To see the classification result more clearly, we define a function plot_decision_regions that visualizes the decision regions.

from matplotlib.colors import ListedColormap
def plot_decision_regions(X, y, classifier, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                         np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

7) Plot the decision regions for the training data projected onto the first two principal component axes

plot_decision_regions(X_train_pca, y_train, classifier=lr)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.tight_layout()
# plt.savefig('./figures/pca3.png', dpi=300)
plt.show()

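Besides PCA, other methods for reducing the dimensionality of high-dimensional data sets include linear discriminant analysis (LDA), decision trees, kernel principal component analysis, SVD, and others.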

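17.4 Dimensionality reduction with LDA

The basic idea of LDA is similar to that of PCA. PCA finds the orthogonal component axes of maximum variance in a data set, whereas LDA aims to find the feature subspace that optimizes class separability. Both are linear transformation methods that can be used for dimensionality reduction; PCA is unsupervised and LDA is supervised. For classification tasks, LDA is therefore often regarded as the better feature extraction technique.
The main steps of LDA are:
(1) Standardize the d-dimensional data set (d is the number of features).
(2) For each class, compute the d-dimensional mean vector.
(3) Construct the between-class scatter matrix S_B and the within-class scatter matrix S_W.
(4) Compute the eigenvalues and corresponding eigenvectors of the matrix S_W^(-1) S_B.
(5) Select the eigenvectors corresponding to the k largest eigenvalues to construct a d × k transformation matrix W, with the eigenvectors as its columns.
(6) Use the transformation matrix W to project the samples onto the new feature subspace.
We again use the wine data set and implement these steps in code:
(1) Standardize the d-dimensional data set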

from sklearn.preprocessing import StandardScaler

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Standardize the features
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

(2) For each class, compute the d-dimensional mean vector

# Set the print precision
np.set_printoptions(precision=4)

# Compute the mean vector of each class
mean_vecs = []
for label in range(1,4):
    mean_vecs.append(np.mean(X_train_std[y_train==label], axis=0))
    print('MV %s: %s\n' %(label, mean_vecs[label-1]))

Output:
MV 1: [ 0.9259 -0.3091 0.2592 -0.7989 0.3039 0.9608 1.0515 -0.6306 0.5354 0.2209 0.4855 0.798 1.2017]

MV 2: [-0.8727 -0.3854 -0.4437 0.2481 -0.2409 -0.1059 0.0187 -0.0164 0.1095 -0.8796 0.4392 0.2776 -0.7016]

MV 3: [ 0.1637 0.8929 0.3249 0.5658 -0.01 -0.9499 -1.228 0.7436 -0.7652 0.979 -1.1698 -1.3007 -0.3912]
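(3) Construct the between-class scatter matrix S_B and the within-class scatter matrix S_W
Using the mean vectors, compute the within-class scatter matrix S_W:

$S_W = \sum_{i=1}^{c} S_i$

It is obtained by summing the scatter matrices S_i of the individual classes i, where D_i is the set of samples in class i and m_i is its mean vector:

$S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T$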

d = 13 # number of features
S_W = np.zeros((d, d))
for label,mv in zip(range(1, 4), mean_vecs):
    class_scatter = np.zeros((d, d)) # scatter matrix for each class
    for row in X_train_std[y_train == label]:
        row, mv = row.reshape(d, 1), mv.reshape(d, 1) # make column vectors
        class_scatter += (row-mv).dot((row-mv).T)
    S_W += class_scatter                             # sum class scatter matrices

print('Within-class scatter matrix: %sx%s' % (S_W.shape[0], S_W.shape[1]))

Output:
Within-class scatter matrix: 13x13

Count the number of samples per class label

print('Class label distribution: %s' % np.bincount(y_train)[1:])

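Output:
Class label distribution: [40 49 35]
The classes are not evenly represented, so the individual scatter matrices S_i should be scaled before they are summed into S_W. Dividing a scatter matrix by the class size turns it into the covariance matrix of that class, which is what np.cov computes (it divides by n_i − 1):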

d = 13 # number of features
S_W = np.zeros((d, d))
for label,mv in zip(range(1, 4), mean_vecs):
    class_scatter = np.cov(X_train_std[y_train==label].T)
    S_W += class_scatter
print('Scaled within-class scatter matrix: %sx%s' % (S_W.shape[0], S_W.shape[1]))

Output:
Scaled within-class scatter matrix: 13x13
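The relation between a class scatter matrix and the class covariance matrix can be verified on synthetic data. In this sketch `X_class` is a made-up class of 40 samples with 3 features:

```python
import numpy as np

rng = np.random.default_rng(0)
X_class = rng.normal(size=(40, 3))  # hypothetical samples of a single class

# Scatter matrix: sum of outer products of the centered samples
centered = X_class - X_class.mean(axis=0)
scatter = centered.T @ centered

# np.cov expects variables in rows, hence the transpose; it divides by n - 1
cov = np.cov(X_class.T)

same = np.allclose(cov, scatter / (len(X_class) - 1))
print(same)
```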

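Compute the between-class scatter matrix, where m is the overall mean vector over the samples of all classes and n_i is the number of samples in class i:

$S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T$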

mean_overall = np.mean(X_train_std, axis=0)
d = 13 # number of features
S_B = np.zeros((d, d))
for i,mean_vec in enumerate(mean_vecs):
    n = X_train[y_train==i+1, :].shape[0]
    mean_vec = mean_vec.reshape(d, 1) # make column vector
    mean_overall = mean_overall.reshape(d, 1) # make column vector
    S_B += n * (mean_vec - mean_overall).dot((mean_vec - mean_overall).T)

print('Between-class scatter matrix: %sx%s' % (S_B.shape[0], S_B.shape[1]))

Output:
Between-class scatter matrix: 13x13

(4) Compute the eigenvalues and corresponding eigenvectors of the matrix S_W^(-1) S_B

eigen_vals, eigen_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))

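(5) Select the eigenvectors corresponding to the k largest eigenvalues to construct a d × k transformation matrix W, with the eigenvectors as its columns

After computing the generalized eigenvalues, sort them in descending order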

# Build a list of (eigenvalue, eigenvector) tuples
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:,i]) for i in range(len(eigen_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eigen_pairs = sorted(eigen_pairs, key=lambda k: k[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues

print('Eigenvalues in decreasing order:\n')
for eigen_val in eigen_pairs:
    print(eigen_val[0])

Output:
Eigenvalues in decreasing order:

452.721581245
156.43636122
7.05575044266e-14
5.68434188608e-14
3.41129233161e-14
3.40797229523e-14
3.40797229523e-14
1.16775565372e-14
1.16775565372e-14
8.59477909861e-15
8.59477909861e-15
4.24523361436e-15
2.6858909629e-15
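S_B is the sum of c matrices of rank one or less, where c is the number of classes, so LDA yields at most c − 1 linear discriminants. With three classes we therefore get only two nonzero eigenvalues; the remaining values are zero up to floating-point error.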
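The rank bound can be checked numerically. The sketch below uses synthetic data with 3 classes and 13 features (an assumption that mirrors the shape of the wine data set, not the data itself) and passes an explicit tolerance to `np.linalg.matrix_rank` so that rounding noise is not counted:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_per_class = 13, 30

# Three synthetic classes with different mean vectors
class_means = rng.normal(scale=3.0, size=(3, d))
X_demo = np.vstack([m + rng.normal(size=(n_per_class, d)) for m in class_means])
y_demo = np.repeat([1, 2, 3], n_per_class)

# Between-class scatter matrix: sum of n_i * (m_i - m)(m_i - m)^T
overall_mean = X_demo.mean(axis=0)
S_B_demo = np.zeros((d, d))
for label in np.unique(y_demo):
    X_label = X_demo[y_demo == label]
    diff = (X_label.mean(axis=0) - overall_mean).reshape(d, 1)
    S_B_demo += len(X_label) * (diff @ diff.T)

rank = np.linalg.matrix_rank(S_B_demo, tol=1e-8)
print(rank)
```

The class-size-weighted mean differences sum to zero, so only two of the three rank-one terms are linearly independent.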
As with PCA, we plot the contribution of each linear discriminant

tot = sum(eigen_vals.real)
discr = [(i / tot) for i in sorted(eigen_vals.real, reverse=True)]
cum_discr = np.cumsum(discr)

plt.bar(range(1, 14), discr, alpha=0.5, align='center',
        label='individual "discriminability"')
plt.step(range(1, 14), cum_discr, where='mid',
         label='cumulative "discriminability"')
plt.ylabel('"discriminability" ratio')
plt.xlabel('Linear Discriminants')
plt.ylim([-0.1, 1.1])
plt.legend(loc='best')
plt.tight_layout()
# plt.savefig('./figures/lda1.png', dpi=300)
plt.show()

Output:
[Figure: individual and cumulative "discriminability" ratio of the linear discriminants]

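(6) Use the transformation matrix W to project the samples onto the new feature subspace
Stack the two most discriminative eigenvectors obtained above as columns to form the matrix W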

w = np.hstack((eigen_pairs[0][1][:, np.newaxis].real,
                      eigen_pairs[1][1][:, np.newaxis].real))
print('Matrix W:\n', w)

Output:
Matrix W:
[[-0.0662 -0.3797]
[ 0.0386 -0.2206]
[-0.0217 -0.3816]
[ 0.184 0.3018]
[-0.0034 0.0141]
[ 0.2326 0.0234]
[-0.7747 0.1869]
[-0.0811 0.0696]
[ 0.0875 0.1796]
[ 0.185 -0.284 ]
[-0.066 0.2349]
[-0.3805 0.073 ]
[-0.3285 -0.5971]]
Project the samples onto the new feature space

X_train_lda = X_train_std.dot(w)
colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']

for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_lda[y_train==l, 0] * (-1), 
                X_train_lda[y_train==l, 1] * (-1), 
                c=c, label=l, marker=m)

plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.legend(loc='lower right')
plt.tight_layout()
# plt.savefig('./figures/lda2.png', dpi=300)
plt.show()

Output:
[Figure: training samples projected onto LD 1 and LD 2]

17.5 LDA with scikit-learn

Next we use the LDA class implemented in scikit-learn.
We first define a helper function that is needed below.

from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                         np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

Apply LDA to the data first, then classify it with logistic regression.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

 
lda = LDA(n_components=2)
X_train_lda = lda.fit_transform(X_train_std, y_train)
# Logistic regression on the lower-dimensional data
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr = lr.fit(X_train_lda, y_train)

plot_decision_regions(X_train_lda, y_train, classifier=lr)
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.legend(loc='lower left')
plt.tight_layout()
# plt.savefig('./images/lda3.png', dpi=300)
plt.show()

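Output:

A few points are still misclassified; regularization should improve the result. The code below shows how the model behaves on the test set.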

X_test_lda = lda.transform(X_test_std)

plot_decision_regions(X_test_lda, y_test, classifier=lr)
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.legend(loc='lower left')
plt.tight_layout()
# plt.savefig('./images/lda4.png', dpi=300)
plt.show()

Output:

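17.6 Dimensionality Reduction with Kernel PCA

So far we have introduced two dimensionality-reduction methods, PCA and LDA. When they are used for classification on datasets that are not linearly separable, the results are often poor, because neither method can turn a linearly inseparable dataset into a linearly separable one. Such datasets are common in practice, so is there a method that both reduces dimensionality and makes the data linearly separable?
With SVMs we saw the power of kernel functions: mapping a linearly inseparable dataset into a higher-dimensional space can make it linearly separable. Can the same kernel technique be used for dimensionality reduction? It can, and this is what we introduce next: kernel PCA.
Kernel PCA = kernel technique + PCA. The steps are as follows:
(1) Compute the kernel matrix, that is, evaluate the kernel function for every pair of training samples. Here we use the radial basis function (RBF) kernel as an example.
The RBF kernel (also called the Gaussian kernel) is:

$$k(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

This gives the following matrix:

$$K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix}$$

(2) Center the kernel matrix K:

$$K' = K - 1_n K - K 1_n + 1_n K 1_n$$

where K' is an n*n matrix, n is the number of training samples, and 1_n is an n*n matrix in which every element equals 1/n.
(3) Compute the eigenvectors of the centered kernel matrix, sort them by eigenvalue in descending order, and keep the first k.
Unlike in standard PCA, these eigenvectors are not the principal component axes; they are the samples already projected onto those axes.
Below we implement kernel PCA following these three steps. With SciPy and NumPy the implementation is quite simple.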

from scipy.spatial.distance import pdist, squareform
from numpy import exp  # scipy.exp is no longer available in recent SciPy releases
from scipy.linalg import eigh
import numpy as np

def rbf_kernel_pca(X, gamma, n_components):
    """
    RBF kernel PCA implementation.

    Parameters
    ------------
    X: {NumPy ndarray}, shape = [n_samples, n_features]
        
    gamma: float
      Tuning parameter of the RBF kernel
        
    n_components: int
      Number of principal components to return

    Returns
    ------------
     X_pc: {NumPy ndarray}, shape = [n_samples, k_features]
       Projected dataset   

    """
    # Calculate pairwise squared Euclidean distances
    # in the MxN dimensional dataset.
    sq_dists = pdist(X, 'sqeuclidean')

    # Convert pairwise distances into a square matrix.
    mat_sq_dists = squareform(sq_dists)

    # Compute the symmetric kernel matrix.
    K = exp(-gamma * mat_sq_dists)

    # Center the kernel matrix.
    N = K.shape[0]
    one_n = np.ones((N,N)) / N
    K = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)

    # Obtaining eigenpairs from the centered kernel matrix
    # scipy.linalg.eigh returns them in ascending order of eigenvalue
    eigvals, eigvecs = eigh(K)

    # Collect the top k eigenvectors (projected samples)
    X_pc = np.column_stack([eigvecs[:, -i]
                            for i in range(1, n_components + 1)])

    return X_pc
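
The centering step is the least obvious part of the function above. As a small sanity check (a sketch only; the helper name `center_kernel`, the random test data and gamma = 1 are assumptions made for illustration), the formula K - 1_n K - K 1_n + 1_n K 1_n should equal H K H with H = I - 1_n, and every column of the centered matrix should sum to zero:

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix: K - 1_n K - K 1_n + 1_n K 1_n."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)  # every element equals 1/n
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Hypothetical toy data: 6 samples with 2 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 2))

# RBF kernel matrix with gamma = 1 (squared Euclidean distances via broadcasting)
sq_dists = ((X_demo[:, None, :] - X_demo[None, :, :]) ** 2).sum(axis=2)
K_demo = np.exp(-1.0 * sq_dists)

K_centered = center_kernel(K_demo)

# Equivalent form using the centering matrix H = I - 1_n
H = np.eye(6) - np.full((6, 6), 1.0 / 6)
same_as_HKH = np.allclose(K_centered, H @ K_demo @ H)
columns_sum_to_zero = np.allclose(K_centered.sum(axis=0), 0.0)
print(same_as_HKH, columns_sum_to_zero)
```

Zero column sums are what "centered in feature space" means here: the implicit feature vectors have zero mean, even though they are never computed explicitly.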

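Next we take a dataset of two separated concentric circles as an example, process it with PCA and with kernel PCA, and compare the results. See the code and the resulting figures below: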

from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)

plt.scatter(X[y==0, 0], X[y==0, 1], color='red', marker='^', alpha=0.5)
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', marker='o', alpha=0.5)

plt.tight_layout()
plt.show()

This is a typical dataset that is not linearly separable. We now process it with PCA and with kernel PCA.
(1) Process with PCA, then classify

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

scikit_pca = PCA(n_components=2)
X_spca = scikit_pca.fit_transform(X)

plt.figure( figsize=(5,3))

plt.scatter(X_spca[y==0, 0], X_spca[y==0, 1], 
            color='red', marker='^', alpha=0.5)
plt.scatter(X_spca[y==1, 0], X_spca[y==1, 1],
            color='blue', marker='o', alpha=0.5)


plt.xlabel('PC1')
plt.ylabel('PC2')

plt.show()

(2) Process with kernel PCA, then classify

X_kpca = rbf_kernel_pca(X, gamma=15, n_components=2)

plt.figure( figsize=(5,3))
plt.scatter(X_kpca[y==0, 0], X_kpca[y==0, 1], 
            color='red', marker='^', alpha=0.5)
plt.scatter(X_kpca[y==1, 0], X_kpca[y==1, 1],
            color='blue', marker='o', alpha=0.5)

plt.xlabel('PC1')
plt.ylabel('PC2')

plt.show()

(3) Kernel PCA with sklearn
The source data looks like this:

Here kernel PCA turns the data into a linearly separable dataset. The code is as follows:

from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, random_state=123)
scikit_kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_skernpca = scikit_kpca.fit_transform(X)

plt.scatter(X_skernpca[y==0, 0], X_skernpca[y==0, 1], 
            color='red', marker='^', alpha=0.5)
plt.scatter(X_skernpca[y==1, 0], X_skernpca[y==1, 1], 
            color='blue', marker='o', alpha=0.5)

plt.xlabel('PC1')
plt.ylabel('PC2')
plt.tight_layout()
plt.show()

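17.7 SVD Matrix Decomposition

(1) Definition of singular value decomposition (SVD)
Suppose M is an m x n matrix and there exists a factorization

$$M = U \Sigma V^T$$

where U is an m x m unitary matrix and Σ is an m x n positive semidefinite diagonal matrix: all entries off the diagonal are 0, and the diagonal entries are sorted from largest to smallest, so the leading entries are relatively large and many of the later ones are close to 0. V^T is the conjugate transpose of V and is an n x n unitary matrix. Such a factorization is called the singular value decomposition of M; the diagonal entries of Σ are the singular values, U is the left singular matrix, and V^T is the right singular matrix.
SVD is used in information retrieval (latent semantic indexing), image compression, recommender systems, finance, and other fields.
(2) Relationship between SVD and eigendecomposition
Both eigendecomposition and SVD aim to extract the most important features of a matrix. However, eigendecomposition applies only to square matrices, while SVD applies to any matrix, square or not.

$$M^T M = V \Sigma^T U^T U \Sigma V^T = V (\Sigma^T \Sigma) V^T$$
$$M M^T = U \Sigma V^T V \Sigma^T U^T = U (\Sigma \Sigma^T) U^T$$

Here M^T M and MM^T are both square, U^T U and V^T V are identity matrices, the columns of V are eigenvectors of M^T M, and the columns of U are eigenvectors of MM^T.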
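
This relationship can be checked numerically. The following is a minimal sketch on a random 5 x 3 matrix (the matrix and the variable names are assumptions for illustration only): the squared singular values should match the eigenvalues of M^T M, and each column of V should be an eigenvector of M^T M.

```python
import numpy as np

rng = np.random.default_rng(0)
M_demo = rng.normal(size=(5, 3))  # hypothetical non-square matrix

# Note: np.linalg.svd returns V^T as its third value, not V
U, s, Vt = np.linalg.svd(M_demo, full_matrices=False)

# Eigenvalues of the symmetric matrix M^T M, reordered to descending
eigvals = np.linalg.eigvalsh(M_demo.T @ M_demo)[::-1]
squares_match = np.allclose(s ** 2, eigvals)

# (M^T M) V = V diag(s^2): the columns of V are eigenvectors of M^T M
V = Vt.T
columns_are_eigvecs = np.allclose(M_demo.T @ M_demo @ V, V * s ** 2)
print(squares_match, columns_are_eigvecs)
```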
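(3) Role and significance of SVD
The most important use of SVD is dimensionality reduction. It has many other uses, but here we focus on dimensionality reduction. For an m x n matrix M, compute the singular value decomposition

$$M_{m \times n} = U_{m \times m} \Sigma_{m \times n} V^T_{n \times n}$$

Keeping only the k largest nonzero singular values is enough to approximately reconstruct the original matrix; in other words, the singular vectors that belong to these k singular values capture the main features of the matrix. This can be written as

$$M_{m \times n} \approx U_{m \times k} \Sigma_{k \times k} V^T_{k \times n}$$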
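
To make the approximation concrete, here is a small sketch of a rank-k reconstruction (the function name `rank_k_approx` and the random test matrix are assumptions, not part of the original text). It also checks a standard property of truncated SVD: the Frobenius-norm error equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def rank_k_approx(M, k):
    """Rebuild M from its k largest singular values and vectors."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return M_k, s

rng = np.random.default_rng(1)
M_demo = rng.normal(size=(8, 6))  # hypothetical test matrix

k = 2
M_k, s = rank_k_approx(M_demo, k)

# np.linalg.norm on a 2-D array defaults to the Frobenius norm
err = np.linalg.norm(M_demo - M_k)
expected_err = np.sqrt((s[k:] ** 2).sum())
print(np.isclose(err, expected_err), np.linalg.matrix_rank(M_k))
```

The faster the singular values decay, the smaller this error is for a given k, which is why the method works well on images.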

17.8 Implementing SVD in Python for Image Compression
(1) First, read an image (128*128*3):

#!python
#  -*- coding:utf-8 -*-
from PIL import Image
import os
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
 
if __name__ == '__main__':
    mpl.rcParams['font.sans-serif'] = [u'simHei']
    mpl.rcParams['axes.unicode_minus'] = False
    A = Image.open('02.jpg')
    a = np.array(A)  # convert the image to a NumPy array

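(2) Then use Python's NumPy library to perform SVD on each of the 3 channels of the color image

# NumPy provides an SVD function: np.linalg.svd
# The image is in color, so it has 3 channels. The innermost array of a holds three numbers (R, G, B) that describe one pixel.
# Note: the third value returned by np.linalg.svd is V^T (already transposed), not V.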
u_r, sigma_r, v_r = np.linalg.svd(a[:, :, 0])
u_g, sigma_g, v_g = np.linalg.svd(a[:, :, 1])
u_b, sigma_b, v_b = np.linalg.svd(a[:, :, 2])

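(3) We can then compress the image as needed by discarding part of the data in the three factor matrices. The fewer singular values are used, the stronger the compression. Let us look at the sharpness of the reconstructed image at different compression levels:

plt.figure(facecolor = 'w', figsize = (10, 10))
# Use 1, 2, ..., 16 singular values in turn and look at the effect
K = 16
for k in range(1, K + 1):
    R = restore(u_r, sigma_r, v_r, k)
    G = restore(u_g, sigma_g, v_g, k)
    B = restore(u_b, sigma_b, v_b, k)
    I = np.stack((R, G, B), axis = 2)
    # Display the reconstructed image
    plt.subplot(4, 4, k)
    plt.imshow(I)
    plt.axis('off')
    plt.title(u'奇异值个数：%d' %  k)

plt.suptitle(u'SVD与图像分解', fontsize = 20)
# pad is passed by keyword; newer matplotlib versions no longer accept it positionally
plt.tight_layout(pad = 0.1, rect = (0, 0, 1, 0.92))
plt.show()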

(4) The restore function is defined as:

def restore(u, sigma, v, k):
    m = len(u)
    n = len(v)
    a = np.zeros((m, n))
    # Reconstruct the image from the first k singular values
    a = np.dot(u[:, :k], np.diag(sigma[:k])).dot(v[:k, :])
    # The statement above is equivalent to:
    # for i in range(k):
    #     ui = u[:, i].reshape(m, 1)
    #     vi = v[i].reshape(1, n)
    #     a += sigma[i] * np.dot(ui, vi)
    # Clamp to the valid pixel range before converting to uint8
    a[a < 0] = 0
    a[a > 255] = 255
    return np.rint(a).astype('uint8')
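
How much is actually saved? A rough count, as a sketch (the helper `svd_storage` is a hypothetical name, and the count ignores data types and file-format overhead): keeping k singular values for one m x n channel means storing k columns of U, k singular values, and k rows of V^T, i.e. k*(m + n + 1) numbers instead of m*n.

```python
def svd_storage(m, n, k):
    """Numbers stored for a rank-k SVD of one m x n channel."""
    return k * (m + n + 1)

m, n = 128, 128  # image size used in this section
full = m * n     # numbers stored for the uncompressed channel

for k in (1, 4, 16):
    stored = svd_storage(m, n, k)
    print(k, stored, round(stored / full, 3))
```

For the 128 x 128 image used here, k = 16 keeps 4112 numbers per channel, about a quarter of the original 16384.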

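16.1 Purpose

Regression analysis is a statistical method for determining the quantitative relationship between two or more interdependent variables, and it is very widely used. By the number of variables involved, it is divided into simple and multiple regression; by the number of dependent variables, into simple regression analysis and multiple regression analysis; by the type of relationship between the independent and dependent variables, into linear and nonlinear regression. If the analysis includes only one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called simple linear regression. If it includes two or more independent variables, and the dependent variable is linearly related to them, it is called multiple linear regression.

16.2 Model

16.3 Problem

Suppose we have real data on the relationship between a company's advertising spend and its sales. Visualized, the data looks like this:

The visualization code is as follows: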

import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
myfont = fm.FontProperties(fname='/home/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/simhei.ttf')


x=[4,8,9,8,7,12,6,10,6,9]
y=[9,20,22,15,17,23,18,25,10,20]
#y1=[10,18,20,18,16,26,14,22,14,20]  #y=2x+2

plt.figure(figsize=(6,3))
plt.scatter(x,y)
#plt.plot(x,y1,color='blue',linestyle='dashed')
plt.xlabel("广告费(万)",fontproperties=myfont,size=12)
plt.ylabel("销售额（万）",fontproperties=myfont,size=12)
plt.show()

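If the company now increases its advertising spend, say to 20 (in units of 10,000), roughly what sales can it expect?
The key to this problem is to fit a straight line to the overall trend of the existing data, like the dashed line in Figure 1.1; a new advertising spend can then be used to predict the corresponding sales. How do we fit this line?

16.4 A Simple Start

Suppose the line we want to fit is a first-degree polynomial in one variable:
y=ax+b (1.1)
To fit or determine this line, we only need to find a and b. How do we get them?
What condition must the line satisfy to be the best one, that is, to best reflect the trend of the samples?
A line that minimizes the gap between predicted and actual values should be a good choice.
This leads to the method of least squares: use the least-squares error as the loss function, then find a and b by minimizing it.

$$L(a, b) = \frac{1}{2} \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2$$

We want the minimum of the loss function. Since the loss function is convex, a and b can be found by setting the gradient to 0.
The derivation is as follows:

$$\frac{\partial L}{\partial a} = -\sum_{i=1}^{n} x_i \left( y_i - a x_i - b \right) = 0, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{n} \left( y_i - a x_i - b \right) = 0$$

$$a = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}, \qquad b = \frac{\sum y_i - a \sum x_i}{n}$$

Once a and b are determined, so is the line y=ax+b, and we can predict y for any new value of x.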

16.5 Solving for a and b in Python

# Sample data that roughly follows y=ax+b
X=[4,8,9,8,7,12,6,10,6,9]
Y=[9,20,22,15,17,23,18,25,10,20]
# Initialize the running sums
Xsum=0.0
X2sum=0.0
Ysum=0.0
XY=0.0
# Number of samples
n=len(X)
for i in range(n):
    Xsum+=X[i]
    Ysum+=Y[i]
    XY+=X[i]*Y[i]
    X2sum+=X[i]**2
# Closed-form least-squares solution
a=(n*XY-Xsum*Ysum)/( n*X2sum-Xsum**2)
b=(Ysum-a*Xsum)/n
print('所求直线为: y=%f*x+%f' % (a,b) )

Printed output:
所求直线为: y=1.980810*x+2.251599
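
As a cross-check that is not part of the original derivation, NumPy's `np.polyfit` fits the same least-squares line, so it should reproduce these coefficients. The sketch below also answers the question from section 16.3 for an advertising spend of 20; the variable names are illustrative.

```python
import numpy as np

X = [4, 8, 9, 8, 7, 12, 6, 10, 6, 9]
Y = [9, 20, 22, 15, 17, 23, 18, 25, 10, 20]

# Degree-1 polynomial fit; coefficients come back highest degree first
a_fit, b_fit = np.polyfit(X, Y, deg=1)
print('a = %f, b = %f' % (a_fit, b_fit))

# Predicted sales for an advertising spend of 20 (same units as the data)
pred_20 = a_fit * 20 + b_fit
print('prediction at x=20: %.2f' % pred_20)
```

With these data the prediction comes out at roughly 41.87. Since 20 lies well outside the observed range of 4 to 12, it is an extrapolation and should be read with caution.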

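16.6 Finding the Parameters Iteratively

Below we use an iterative method to find a and b.
Solving the equations directly becomes very expensive, and impractical, when there are many parameters and many samples. We therefore need another approach.
Take finding the minimum of y=x^2 as an example again. We can find the x that minimizes y by setting the derivative to 0. We can also find it iteratively: start from some point, say x_0, and move step by step against the gradient, getting closer and closer to the minimum, as shown in the figure below:

Why does every update move toward the minimum of the function? The trick is that each update goes in the direction opposite to the gradient.

What is the gradient? Any college calculus textbook will tell us that the gradient is a vector pointing in the direction in which the function increases fastest. The opposite direction is therefore the direction in which the function decreases fastest. By repeatedly updating x against the gradient, we reach the neighborhood of the minimum.

We say the neighborhood of the minimum rather than the minimum itself because the step size is never exactly right, and the last iteration may overshoot the minimum. Choosing the step size is something of a craft: if it is too small, many iterations are needed to get near the minimum; if it is too large, the iterate may jump far past the minimum and fail to converge to a good point.

Based on this discussion, we can write down the gradient descent update:

$$x_{new} = x_{old} - \alpha \nabla f(x_{old})$$

where α is the step size (learning rate). For the line y=ax+b with prediction h(x_i) = a x_i + b, the updates are:

$$a \leftarrow a - \alpha \sum_{i=1}^{n} \left( h(x_i) - y_i \right) x_i \qquad (1.13)$$

$$b \leftarrow b - \alpha \sum_{i=1}^{n} \left( h(x_i) - y_i \right) \qquad (1.14)$$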
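
Before applying this to the regression problem, the effect of the step size can be seen on y = x^2 itself. The sketch below is only an illustration; the starting point 5.0 and the two step sizes are arbitrary choices. Since the derivative is 2x, each update multiplies x by (1 - 2*lr), so a small step shrinks x toward 0 while a step larger than 1 makes |x| grow.

```python
def gradient_descent(x0, lr, steps):
    """Minimize y = x**2 by repeatedly stepping against the derivative 2*x."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

x_small = gradient_descent(5.0, lr=0.1, steps=50)  # converges toward 0
x_large = gradient_descent(5.0, lr=1.1, steps=50)  # overshoots and diverges
print(x_small, x_large)
```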

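X=[4,8,9,8,7,12,6,10,6,9]
Y=[9,20,22,15,17,23,18,25,10,20]

# Initialize parameters a and b
b=0.1
a=0.2
# alpha is the learning rate, i.e. the step size of each update
alpha=0.001
# m is the number of samples
m=len(X)


# Define the linear model
def h(i):
    return b+a*X[i]

# Prediction error for sample i
def diff(i):
    return h(i)-Y[i]

# Iterate 1000 times, which is roughly enough to get close to the minimum
# Each iteration applies equations (1.13) and (1.14) to keep updating a and b
# Note: as written, a and b are updated inside the inner loop from the running sums
for times in range(1000):
    sum1=0
    sum2=0
    for i in range(m):
        sum1=sum1+diff(i)
        sum2=sum2+diff(i)*X[i]
        b=b-alpha*sum1
        a=a-alpha*sum2


print ("参数b的值:%.2f"%b)
print ("参数a的值:%.2f"%a)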

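Printed output:
参数b的值:2.44
参数a的值:1.95

Comparing the a and b obtained iteratively with those obtained by setting the gradient to 0, we find that they are very close: the two approaches reach the same result by different routes.
This method can also be viewed as a neural network with a single neuron and no activation function.

16.7 Replacing Loops with Matrices in the Iteration

The gradient-descent update above uses a loop, and loops are generally expensive. With ten million samples the loop would run ten million times, which is unacceptable. Can we avoid the loop?
Yes. Using matrices requires the following:
- Turn the inputs X and Y into matrices;
- Turn the model into a matrix or vector;
- Turn equation (1.13) into a matrix-vector dot product;
- Turn equation (1.14) into a sum over a matrix.
The implementation is as follows:
(1) Define a linear regression class that initializes two parameters and defines several methods.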

class LinearRegressGD(object):
    def __init__(self,eta=0.001,n_iter=4000):
        self.eta=eta
        self.n_iter=n_iter
    def fit(self,X,Y):
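        # Initialize the parameters
        self.w=np.zeros(1+X.shape[1])
        self.w[0]=0.1
        self.w[1]=0.2
        self.cost=[]
        # Start the iteration loop
        for i in range(self.n_iter):
            output=self.net_input(X)
            errors=(Y-output)
            # X.T.dot(errors) has shape (1, 1); .item() turns it into a plain scalar
            self.w[1] +=self.eta*X.T.dot(errors).item()
            self.w[0] +=self.eta*(errors).sum()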
            cost1=(errors**2).sum()/2.0
            self.cost.append(cost1)
        return self
    def net_input(self,X):
        return np.dot(X,self.w[1])+self.w[0]
    def predict(self,X):
        return self.net_input(X)

(2) Reshape the inputs into 10x1 matrices and run the class above

import numpy as np
X=np.array([4,8,9,8,7,12,6,10,6,9]).reshape(10,1)
Y=np.array([9,20,22,15,17,23,18,25,10,20]).reshape(10,1)
lr=LinearRegressGD()
lr.fit(X,Y)
print("参数a的值:%.3f"%lr.w[1])
print("参数b的值:%.3f"%lr.w[0])

Printed output:
参数a的值:1.995
参数b的值:2.130
(3) Visualize the iteration process

import matplotlib.pyplot as plt
plt.plot(range(1,lr.n_iter+1),lr.cost)
plt.ylabel('SSE')
plt.xlabel('Epoch')
plt.show()

16.8 Automatic Differentiation with TensorFlow to Find the Parameters

import numpy as np 
import tensorflow as tf  # import TensorFlow (this example uses the TensorFlow 1.x graph/session API)
import matplotlib.pyplot as plt # import matplotlib
%matplotlib inline

## Build the training data ##
x_data=np.array([4,8,9,8,7,12,6,10,6,9])
y_data=np.array([9,20,22,15,17,23,18,25,10,20])
# Visualize
plt.figure() # Create a new figure
plt.scatter(x_data,y_data) # Scatter plot of the data points
plt.plot (x_data, 2.0 + 2.0 * x_data) # Draw the reference line y = 2x + 2


## Initialize the weight variables ##
w=tf.Variable(tf.random_uniform([1],-1.0,1.0)) 
b=tf.Variable(tf.zeros([1]))     
y=w*x_data+b


## Define the loss function
loss=tf.reduce_mean(tf.square(y-y_data))  # mean squared error between predictions and targets
optimizer=tf.train.GradientDescentOptimizer(0.01) # gradient descent with learning rate 0.01
train=optimizer.minimize(loss) # training op

init=tf.global_variables_initializer() # op that initializes the variables
sess=tf.Session()  # create a TensorFlow session
#sess = tf.InteractiveSession()  
sess.run(init)     # run the initializer in the session
print('初始权重值w:',w.eval(session=sess))
for step in range(1000): # train for 1000 steps
    sess.run(train)  # run one training step
    if step%100==0:  # print the progress every 100 steps
        print(step,sess.run(w),sess.run(b)) # step, value of w, value of b
        
print(sess.run(loss))        
print('最后权重值w:',w.eval(session=sess))
print('最后偏移量b:',b.eval(session=sess))
# Visualize the fitted line
plt.figure() # Create a new figure
plt.scatter(x_data,y_data)
plt.plot (x_data, sess.run(b) + x_data * sess.run(w))

Output:
初始权重值w: [-0.99174666]
0 [ 3.35317731] [ 0.51469594]
100 [ 2.17306757] [ 0.62030703]
200 [ 2.14828968] [ 0.83054471]
300 [ 2.12670541] [ 1.01368761]
400 [ 2.10790277] [ 1.17322707]
500 [ 2.09152317] [ 1.31220567]
600 [ 2.07725477] [ 1.43327284]
700 [ 2.06482506] [ 1.53873765]
800 [ 2.05399752] [ 1.63060951]
900 [ 2.0445652] [ 1.71064162]
6.90384
最后权重值w: [ 2.03642535]
最后偏移量b: [ 1.77970874]

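16.9 Extensions

15.1 Overview of Clustering

"Birds of a feather flock together": many things in real life show this kind of pattern. When such patterns show up in data, we need algorithms to find the classes or clusters, and clustering algorithms were proposed for exactly this kind of problem.
Clustering partitions a dataset into classes or clusters according to a specific criterion (such as distance), so that objects within the same cluster are as similar as possible and objects in different clusters are as different as possible. In other words, after clustering, data of the same class are gathered together and data of different classes are kept apart as far as possible.
There are many clustering algorithms. For a given application, the choice depends on the type of data and the purpose of the clustering. If cluster analysis is used as a descriptive or exploratory tool, several algorithms can be tried on the same data to see what the data may reveal.
The main clustering algorithms fall into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
Each category has widely used algorithms, for example k-means among the partitioning methods, hierarchical clustering among the hierarchical methods, and Gaussian mixture clustering among the model-based methods.
Research on clustering is not limited to the hard clustering described above, in which each data point belongs to exactly one class; fuzzy clustering is also a widely studied branch. Fuzzy clustering uses a membership function to determine the degree to which each data point belongs to each cluster, rather than rigidly assigning it to a single cluster. Many fuzzy clustering algorithms have been proposed, such as the well-known Gaussian mixture clustering.
The figure below shows the iterative process of K-Means clustering:

The figure below shows the iterative process of Gaussian mixture clustering:

15.2 The k-means Model

Algorithm steps:
(1) First choose the number of classes/groups and randomly initialize their center points.
(2) Compute the distance from each data point to each center, and assign each point to the class whose center is nearest.
(3) Recompute the center of each class as the mean of the points assigned to it, and use it as the new center.
(4) Repeat the steps above until the class centers change little between iterations. It is also possible to run several random initializations and keep the best result.

Advantages:
Simple to compute.
Disadvantages:
The number of classes/groups must be known in advance.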

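15.3 A Simple Example

Suppose we have the following 8 points:
A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5) A6(6, 4) A7(1, 2) A8(4, 9)
We want to divide them into 3 clusters (k=3).
Initially we choose A1(2, 10), A4(5, 8), and A7(1, 2) as the cluster centers, and define the distance between two points as ρ(a, b) = |x2 - x1| + |y2 - y1|. (Other definitions, such as the Euclidean distance, are of course possible.)
Step 1: choose the 3 cluster centers, namely A1, A4, and A7.

The points are distributed as shown below:

Figure 1
Step 2: compute the distance from each point to the 3 cluster centers, and assign each sample point to the cluster whose center is nearest. With the 3 centers selected (A1, A4, A7), the situation is as shown below:
Figure 2
For point A1, compute its distance to each cluster:
A1->class1 = |2-2|+|10-10|=0
A1->class2 = |2-5|+|10-8|=5
A1->class3 = |2-1|+|10-2|=9
So A1 belongs to cluster1, as shown in the table below:

In the same way, compute the distance from every point to the cluster centers and assign each sample point to its nearest cluster, as shown in the table below: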
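
The distances and assignments for all 8 points can be recomputed with a few lines of NumPy. This is a sketch using the Manhattan distance defined above; the variable names are illustrative, and cluster indices 0, 1, 2 correspond to cluster1, cluster2, cluster3.

```python
import numpy as np

names = ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']
points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]])
centers = points[[0, 3, 6]]  # initial centers A1, A4, A7

# Manhattan distance from every point to every center, shape (8, 3)
dist = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
labels = dist.argmin(axis=1)  # index of the nearest center

for name, row, lab in zip(names, dist, labels):
    print(name, row, 'cluster%d' % (lab + 1))

# New centers: the mean of the points assigned to each cluster
new_centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])
print(new_centers)
```

The assignments agree with the text: cluster1 contains only A1, cluster2 contains A3, A4, A5, A6, A8, and cluster3 contains A2 and A7, giving the new centers (2, 10), (6, 6), and (1.5, 3.5).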

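Following the nearest-distance rule, the first partition of the sample points is shown below:
Figure 3
Step 3: compute the mean of the sample points in each cluster and use it as the new cluster center.
cluster1 has only 1 point, so A1 is its center.
The center of cluster2 is ( (8+5+7+6+4)/5,(4+8+5+4+9)/5 )=(6,6). Note that this is not one of the sample points.
The center of cluster3 is ( (2+1)/2, (5+2)/2 )= (1.5, 3.5).
The new cluster centers are marked with x in the figure below:

Figure 4
Step 4: compute the distance from each sample point to each cluster center and repeat steps 2 and 3 to assign the samples to the new clusters, as shown below:
Figure 5
Keep iterating until two consecutive iterations produce no change, as shown below:
Figure 6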

15.4 The Simple Example in Python

(1) Create the data

from numpy import *
dataSet=array([[2,10],[2,5],[8,4],[5,8],[7,5],[6,4],[1,2],[4,9]])

(2) Create the distance function

def distabs(vecA, vecB):
    '''
    # Distance between two vectors, computed as the sum of absolute differences
    :param vecA:
    :param vecB: 
    :return: the distance
    '''
    return sum(abs(vecA - vecB))

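(3) Manually select 3 cluster centers

def selectCent(dataSet, k):
    '''
    :param dataSet: dataset to be clustered
    :param k: number of clusters (unused here, since the centers are fixed by hand)
    :return: array with the k initial cluster centers
    '''
    centroids = []
    # Use samples A1, A4 and A7 (indices 0, 3, 6) as the initial centers
    for j in [0,3,6]:
        centroids.append(dataSet[j])
    # Note: dataSet holds integers, so this array is integer-typed and the
    # means written into it later are truncated to integers
    return array(centroids)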

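(4) Create the clustering function

def kMeans(dataSet, k,max_times=6, distMeas=distabs, createCent=selectCent):
    '''
    # K-means implementation
    :param dataSet: dataset to be clustered
    :param k: number of clusters
    :param max_times: maximum number of iterations
    :param distMeas: distance measure; defaults to distabs (sum of absolute differences)
    :param createCent: function that produces the initial centroids
    :return: the K cluster centers and the assignment of each sample
    '''
    m = shape(dataSet)[0]

    # Assignment table: column 0 holds the cluster index, column 1 the squared distance
    clusterAssment = mat(zeros((m, 2)))
    # Get the initial cluster centers
    centroids = createCent(dataSet, k)
    clusterChanged = True
    # Stop when no assignment changes or the iteration limit is reached
    while clusterChanged and max_times > 0:
        max_times -= 1
        clusterChanged = False
        for i in range(m):
            minDist = inf
            minIndex = -1
            for j in range(k):
                # Distance from point i to center j, using distMeas
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                # Keep the smallest distance
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            # Check whether the assignment of this point changed
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2

        # Recompute the centers
        for cent in range(k):
            # All points currently assigned to this cluster
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            # Set the centroid to their mean
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment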

(5) Run

kMeans(dataSet,3)

Output:
(array([[3, 9],
[7, 4],
[1, 3]]), matrix([[ 0., 4.],
[ 2., 9.],
[ 1., 1.],
[ 0., 9.],
[ 1., 1.],
[ 1., 1.],
[ 2., 1.],
[ 0., 1.]]))
This result agrees with Figure 6.

15.5 The Simple Example in sklearn

(1) Import the required libraries and modules

%matplotlib inline
import numpy as np      # scientific computing package
import matplotlib.pyplot as plt      # plotting package
from matplotlib.font_manager import FontProperties
  
from sklearn.cluster import KMeans       # K-means implementation

import matplotlib.font_manager as fm    # used to display Chinese text
#myfont = fm.FontProperties(fname='/home/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/simhei.ttf')
myfont = FontProperties(fname=r"c:\windows\fonts\simkai.ttf", size=14)

(2) Create the data

X=np.array([[2,10],[2,5],[8,4],[5.0,8],[7,5],[6,4.0],[1,2],[4,9]])
y=np.array([0,1,1,1,1,2,2,2])

(3) Cluster with k-means and visualize the result

plt.figure(figsize=(12, 12))  
  
# n_clusters: number of clusters to form
# init: initialization method, 'random' or 'k-means++' (which spreads the initial centers as far apart as possible)
# random_state: seed of the random number generator
random_state = 10  
  
# Cluster with k-means (random initialization)

plt.subplot(221)  # first panel of a 2x2 grid
y_pred = KMeans(n_clusters=3, init='random',random_state=random_state).fit_predict(X) 
plt.scatter(X[:, 0], X[:, 1], c=y_pred)  
plt.title("使用kmeans聚类",fontproperties=myfont,size=12)   # add a title
  
# Cluster with k-means++ initialization
y_pred = KMeans(n_clusters=3, init='k-means++',random_state=random_state).fit_predict(X) 
  
plt.subplot(222)  # second panel of the 2x2 grid
plt.scatter(X[:, 0], X[:, 1], c=y_pred)  
plt.title("使用kmeans++聚类",fontproperties=myfont,size=12)  

plt.show()

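15.6 The Simple Example in TensorFlow

This part uses many TensorFlow functions; for reference see:
https://www.cnblogs.com/wuzhitj/p/6648563.html
(1) Import the required libraries and initialize the parameters

import tensorflow as tf
import numpy as np
import time
import matplotlib
import matplotlib.pyplot as plt
# make_blobs lives in sklearn.datasets; the samples_generator module was removed in newer scikit-learn releases
from sklearn.datasets import make_blobs
# Number of samples
N=8
# Number of clusters
K=3
# Maximum number of iterations
MAX_ITERS = 20
changed = True
iters = 0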

(2) Create the dataset and visualize the three cluster centers

centers = [(2, 10.0), (5, 8), (1, 2)]
data=np.array([[2,10],[2,5],[8,4],[5.0,8],[7,5],[6,4.0],[1,2],[4,9]])
features=np.array([0,1,1,1,1,2,2,2])
fig, ax = plt.subplots()
# s sets the marker size
ax.scatter(np.asarray(centers).transpose()[0], np.asarray(centers).transpose()[1], marker = 'o', s = 250)
plt.show()

(3) Visualize the samples

fig, ax = plt.subplots()
ax.scatter(np.asarray(centers).transpose()[0], np.asarray(centers).transpose()[1], marker = 'o', s = 250)
ax.scatter(data.transpose()[0], data.transpose()[1], marker = 'o', s = 100, c = features, cmap=plt.cm.coolwarm )
plt.show()

(4) Compute the distance from each sample to each cluster center

points=tf.Variable(data)
cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64))
centroids = tf.Variable(tf.slice(points.initialized_value(), [0,0], [K,2]))
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(centroids)

# To compute the distance from every point to every cluster center, reshape rep_centroids and rep_points into NxKx2 tensors
rep_centroids = tf.reshape(tf.tile(centroids, [N, 1]), [N, K, 2])
rep_points = tf.reshape(tf.tile(points, [1, K]), [N, K, 2])
sum_squares = tf.reduce_sum(tf.square(rep_points - rep_centroids),2)
# Index of the nearest center for each point
best_centroids = tf.argmin(sum_squares, 1)
# Check whether the cluster assignment of any point has changed
did_assignments_change = tf.reduce_any(tf.not_equal(best_centroids, cluster_assignments))

(5) Define a function that updates the coordinates of each cluster center

# Mean of the points in each bucket (cluster), used as the new cluster center
def bucket_mean(data, bucket_ids, num_buckets):
    total = tf.unsorted_segment_sum(data, bucket_ids, num_buckets)
    count = tf.unsorted_segment_sum(tf.ones_like(data), bucket_ids, num_buckets)
    return total / count
means = bucket_mean(points, best_centroids, K)
# Set the execution dependency: evaluate did_assignments_change first, then run the updates
with tf.control_dependencies([did_assignments_change]):
    do_updates = tf.group(centroids.assign(means),cluster_assignments.assign(best_centroids))

(6) Visualize the iteration process

fig, ax = plt.subplots()
# One color index per cluster center (three entries, so K is expected to be 3)
colourindexes=[2,1,3]
# Keep iterating while assignments are still changing and the iteration count is below MAX_ITERS
while changed and iters < MAX_ITERS:
    fig, ax = plt.subplots()
    iters += 1
    # Run one update step and record whether any assignment changed
    [changed, _] = sess.run([did_assignments_change, do_updates])
    [centers, assignments] = sess.run([centroids, cluster_assignments])
    # Plot the points colored by cluster, then the cluster centers as triangles
    ax.scatter(sess.run(points).transpose()[0], sess.run(points).transpose()[1], marker = 'o', s = 200, c = assignments, cmap=plt.cm.coolwarm )
    ax.scatter(centers[:,0],centers[:,1], marker = '^', s = 550, c = colourindexes, cmap=plt.cm.plasma)
    ax.set_title('Iteration ' + str(iters))
    plt.savefig("kmeans" + str(iters) +".png")
ax.scatter(sess.run(points).transpose()[0], sess.run(points).transpose()[1], marker = 'o', s = 200, c = assignments, cmap=plt.cm.coolwarm )
plt.show()
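
The stopping rule of the loop above can be sketched without TensorFlow or plotting. The following NumPy version is a minimal illustration under assumptions: the name `kmeans_loop`, the `max_iters` default, and the toy data are hypothetical, and keeping the old center for an empty cluster is a choice made here, not something the code above does.

```python
import numpy as np

def kmeans_loop(points, centroids, max_iters=100):
    """Alternate assignment and update steps until assignments stop changing."""
    # Start from all-zero assignments, as cluster_assignments does above
    assignments = np.zeros(len(points), dtype=np.int64)
    changed, iters = True, 0
    while changed and iters < max_iters:
        iters += 1
        # Squared distance from every point to every center, shape (N, K)
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        best = d2.argmin(axis=1)
        changed = bool(np.any(best != assignments))
        # New center = mean of its points; keep the old center if the cluster is empty
        centroids = np.array([
            points[best == j].mean(axis=0) if np.any(best == j) else centroids[j]
            for j in range(len(centroids))
        ])
        assignments = best
    return centroids, assignments, iters

# Hypothetical toy data; the first K points serve as initial centers
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, assignments, iters = kmeans_loop(points, points[:2].copy())
print(centroids, assignments, iters)
```

As in the TensorFlow loop, the last pass only confirms that no assignment changed, so the reported iteration count includes one pass that leaves the result unchanged.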

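The clustering reaches its best result after only one iteration, so the TensorFlow implementation appears to work well on this data!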

15.7 Improvements

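Because the clustering result of the K-means algorithm depends on the choice of initial points, an improved variant has been proposed: K-means++. It changes only how the initial points are chosen; all other steps are the same. The basic idea of the initial centroid selection is that the initial cluster centers should be as far from one another as possible. The overall procedure is as follows:
The following simple example illustrates how K-means++ selects the initial cluster centers. The data set contains 8 samples, whose distribution and index numbers are shown in the figure below:
Figure 7
Suppose that after step 1 of Figure 7, point 6 is chosen as the first initial cluster center. Then, in step 2, the D(x) of each sample and its probability of being chosen as the second cluster center are as shown in the following table: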

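Here P(x) is the probability that each sample is chosen as the next cluster center. The last row, Sum, is the cumulative sum of the probabilities P(x) and is used to pick the second cluster center by roulette-wheel selection: generate a random number between 0 and 1, determine which interval it falls into, and take the index corresponding to that interval as the second cluster center. For example, the interval for point 1 is [0, 0.2) and the interval for point 2 is [0.2, 0.525).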
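
Roulette-wheel selection as described can be sketched in a few lines of Python. This is an illustration under assumptions: the squared distances below are hypothetical values chosen so that the first two intervals match the ones quoted ([0, 0.2) and [0.2, 0.525)), and weighting by D(x)^2 is the usual K-means++ convention assumed here rather than something stated in the text. The function name `roulette_select` is also hypothetical.

```python
import bisect
import itertools
import random

def roulette_select(probabilities, r):
    """Return the 0-based index whose cumulative-probability interval contains r."""
    cumulative = list(itertools.accumulate(probabilities))
    # bisect_right returns the position of the first cumulative sum greater than r,
    # which is the interval [previous_sum, this_sum) containing r
    index = bisect.bisect_right(cumulative, r)
    # Guard against floating-point rounding when r is very close to 1
    return min(index, len(probabilities) - 1)

# Hypothetical D(x)^2 values for 8 samples; sample 6 is the existing center, so D(x) = 0
d2 = [8, 13, 5, 10, 1, 0, 2, 1]
total = sum(d2)
p = [d / total for d in d2]

r = random.random()                 # random number in [0, 1)
chosen = roulette_select(p, r) + 1  # convert to the 1-based sample number
print(r, chosen)
```

A sample with probability 0 has an empty interval, so the point already chosen as a center cannot be selected again.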
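The table above shows directly that the probability that the second initial cluster center is one of points 1, 2, 3, and 4 is 0.9. These four points are exactly the four that lie farthest from the first initial cluster center, point 6. This confirms the idea behind the K-means++ improvement: points farther from the existing cluster centers are more likely to be chosen as the next cluster center. In this example, K = 2 is a suitable choice. When K is greater than 2, each sample has a distance to each of the centers already chosen, and the smallest of these distances is taken as D(x).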
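
For K greater than 2, the seeding procedure can be sketched as follows. This is a minimal NumPy illustration under assumptions: the function name `kmeans_pp_init`, the random seed, and the sample coordinates are hypothetical, and the sampling weight is taken to be proportional to D(x)^2, the usual K-means++ convention.

```python
import numpy as np

def kmeans_pp_init(points, k, rng):
    """Pick k initial centers from points by D(x)^2-weighted sampling."""
    n = points.shape[0]
    # Step 1: choose the first center uniformly at random
    centers = [points[rng.integers(n)]]
    for _ in range(1, k):
        # D(x)^2: squared distance from each point to its nearest chosen center
        diff = points[:, np.newaxis, :] - np.array(centers)[np.newaxis, :, :]
        d2 = np.min(np.sum(diff ** 2, axis=2), axis=1)
        # Step 2: sample the next center with probability proportional to D(x)^2
        # (assumes the points are not all identical, so d2.sum() > 0)
        probs = d2 / d2.sum()
        centers.append(points[rng.choice(n, p=probs)])
    return np.array(centers)

# Hypothetical data set of 8 distinct points
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0],
                   [6.0, 5.0], [5.0, 6.0], [10.0, 0.0], [11.0, 1.0]])
rng = np.random.default_rng(0)
centers = kmeans_pp_init(points, 3, rng)
print(centers)
```

Because a point that is already a center has D(x) = 0, its sampling probability is 0, so with distinct points the chosen centers never repeat.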