Monthly archive: February 2018

Chapter 14: TensorFlowOnSpark in Detail

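The previous chapters covered many of the machine learning algorithms in Spark MLlib, including classification, regression, clustering, and recommendation. Spark still lacks adequate support for neural networks and deep learning, even though interest in neural networks, and deep learning in particular, has surged in recent years and the field has become a research focus at many companies. This gap is arguably one of Spark MLlib's weaknesses.
The good news is that the deep learning framework TensorFlow now has a Spark interface, TensorFlowOnSpark. TensorFlow is one of the most popular deep learning frameworks today. Google open-sourced it on November 9, 2015 as its second-generation deep learning system, and it is also the framework underlying AlphaGo.
This chapter introduces TensorFlow, one of the leading deep learning frameworks, and TensorFlowOnSpark. It covers:
Introduction to TensorFlow
Implementing a convolutional neural network with TensorFlow
Distributed TensorFlow
TensorFlowOnSpark architecture
TensorFlowOnSpark examples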

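14.1 Introduction to TensorFlow

14.1.1 Installing TensorFlow

Python 2.7 in this environment was installed with Anaconda, so we use the conda package manager to install TensorFlow. At the time of writing, the default version installed by conda is TensorFlow 1.1.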

conda  install  tensorflow

To verify the installation, try importing tensorflow.
Start ipython (or python):

import tensorflow as tf

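14.1.2 The Development of TensorFlow

Google open-sourced its artificial intelligence system TensorFlow on November 9, 2015, and it became one of the most watched open source projects of that year. Open-sourcing TensorFlow greatly lowered the barrier to applying deep learning across industries. Recent milestones include:
April 2016: version 0.8 released with distributed TensorFlow; DeepMind models migrated to TensorFlow;
June 2016: TensorFlow v0.9 released, with improved support for mobile devices;
November 2016: first anniversary of TensorFlow as an open source project;
February 2017: TensorFlow v1.0 released, adding Java and Go APIs as well as a dedicated compiler and debugging tools. TensorFlow 1.0 also introduced a high-level API containing the tf.layers, tf.metrics, and tf.losses modules, and announced a new tf.keras module that is fully compatible with Keras, another popular high-level neural network library.
April 2017: TensorFlow v1.1 released, adding Java API support for Windows, the tf.spectral module, the Keras 2 API, and more;
June 2017: TensorFlow v1.2 released, including significant API changes, contrib API changes, bug fixes, and other changes.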

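14.1.3 Features of TensorFlow

14.1.4 The TensorFlow Programming Model

How does TensorFlow work? A simple example illustrates it. To compute x+y, you build the data flow graph shown in Figure 14-1:

Figure 14-1 Data flow graph for computing x+y

The steps for building the data flow graph in Figure 14-1 are as follows:
1) Define x = [1,3,5] and y = [2,4,7]. The graph works with tf.Tensor objects, which represent units of data, so create constant tensors: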

import tensorflow as tf
x = tf.constant([1,3,5]) 
y = tf.constant([2,4,7])

2) Define the operation:

op = tf.add(x,y)

3) With the tensors and the operation defined, the next step is to create the graph:

my_graph = tf.Graph()

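Note: this step is optional. If no graph is specified when the session is created, the system uses a default graph.
4) To run the graph, create a session (tf.Session). A tf.Session object encapsulates the environment in which operations are executed. We tell the session which graph to use, and the nodes created inside the with block are added to that graph: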

with tf.Session(graph=my_graph) as sess:
    x = tf.constant([1,3,5]) 
    y = tf.constant([2,4,7])
    op = tf.add(x,y)

5) To execute the operation, call the tf.Session.run() method:

import tensorflow as tf
my_graph = tf.Graph()
with tf.Session(graph=my_graph) as sess:
    x = tf.constant([1,3,5])
    y = tf.constant([2,4,7])
    op = tf.add(x,y)
    result = sess.run(fetches=op)
    print(result)

6) Result:
[ 3 7 12]
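
The six steps above separate describing a computation from running it. The following is a minimal plain-Python sketch of that idea, not TensorFlow code: Node, constant, add, and run are hypothetical names made up for this illustration, and run plays the role that tf.Session.run() plays in the steps above.

```python
# Toy data-flow graph in plain Python (illustration only, not TensorFlow).

class Node(object):
    """A graph node: an operation plus the nodes that feed it."""
    def __init__(self, fn, inputs=()):
        self.fn = fn
        self.inputs = inputs


def constant(value):
    # A constant node has no inputs and always yields the same value.
    return Node(lambda: value)


def add(a, b):
    # Element-wise addition of two equally long lists.
    return Node(lambda u, v: [p + q for p, q in zip(u, v)], (a, b))


def run(node):
    # Evaluate the input nodes first, then apply this node's operation.
    return node.fn(*[run(n) for n in node.inputs])


x = constant([1, 3, 5])
y = constant([2, 4, 7])
op = add(x, y)       # nothing is computed yet; op only describes the work
result = run(op)     # evaluation happens here
print(result)        # [3, 7, 12]
```

Building op costs nothing until run is called, which is why a real framework can inspect or distribute the whole graph before executing it.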

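14.1.5 Common TensorFlow Functions

14.1.6 How TensorFlow Runs

An important TensorFlow component is the client. There are also a master and workers, an arrangement somewhat similar to Spark's. The client connects to the master and multiple workers through the Session interface. Each worker can be attached to several hardware devices, such as CPUs or GPUs, and is responsible for managing them. The master directs all workers to execute the computation graph in order.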

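14.2 Implementing a Convolutional Neural Network with TensorFlow

Neural networks are one of the most active areas of machine learning, and the architectures most associated with deep learning, convolutional neural networks (CNN) and recurrent neural networks (RNN), are especially popular.

14.2.1 Introduction to Convolutional Neural Networks

A convolutional neural network is a kind of artificial neural network that has become a research focus in image recognition, video processing, speech analysis, and related fields. Its weight-sharing structure makes it more similar to biological neural networks, reduces the number of weights, lowers the complexity of the model, and helps prevent the overfitting caused by too many parameters.

14.2.3 Network Structure of a Convolutional Neural Network

Next, we train a convolutional neural network on the training set and then validate the model on the test set.
The network uses the following parameters:
Convolutional layer 1: kernel_size [5, 5], stride = 1, 32 filters
Pooling layer 1: pool_size [2, 2], stride = 2
Convolutional layer 2: kernel_size [5, 5], stride = 1, 64 filters
Pooling layer 2: pool_size [2, 2], stride = 2
Fully connected layer: 1024 features, with dropout to reduce overfitting
Output layer: softmax classification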
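
As a quick sanity check on the layer sizes listed above, the sketch below tracks the feature-map size through the network and counts the trainable parameters. It assumes 28x28 single-channel inputs, 10 output classes, and 'SAME' padding (where the output size is ceil(input / stride)), which is what the code later in this section uses; same_out is a helper name introduced only for this illustration.

```python
import math


def same_out(size, stride):
    # With 'SAME' padding, output size = ceil(input size / stride).
    return int(math.ceil(size / float(stride)))


size = 28                    # input image is 28x28
size = same_out(size, 1)     # convolutional layer 1, stride 1
size = same_out(size, 2)     # pooling layer 1, stride 2
size = same_out(size, 1)     # convolutional layer 2, stride 1
size = same_out(size, 2)     # pooling layer 2, stride 2
flat = size * size * 64      # inputs to the fully connected layer

# Weights plus biases for each layer.
params = [
    5 * 5 * 1 * 32 + 32,     # convolutional layer 1
    5 * 5 * 32 * 64 + 64,    # convolutional layer 2
    flat * 1024 + 1024,      # fully connected layer
    1024 * 10 + 10,          # output layer
]
print("final size %d, flattened %d, parameters %d" % (size, flat, sum(params)))
```

Under these assumptions the feature maps end at 7x7, and nearly all of the parameters sit in the fully connected layer, which is the layer where the listing applies dropout.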

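14.2.4.1 Importing the Data

First start ipython to enter an interactive environment (plain python also works), then read the image data with the helper function that ships with TensorFlow.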

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
# read_data_sets is defined in ~/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py and sets four local_file values

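If the data cannot be downloaded directly through input_data, download the MNIST files first and then change the four local_file assignments in the read_data_sets function in
python/learn/datasets/mnist.py.
Specifically, comment out each original local_file line and add a new local_file line in its place (four in total):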

#local_file = base.maybe_download(TRAIN_IMAGES, train_dir,SOURCE_URL + TRAIN_IMAGES)
local_file = train_dir + "/" + TRAIN_IMAGES
with open(local_file, 'rb') as f:
  train_images = extract_images(f)

#local_file = base.maybe_download(TRAIN_LABELS, train_dir,SOURCE_URL + TRAIN_LABELS)
local_file = train_dir + "/" + TRAIN_LABELS
with open(local_file, 'rb') as f:
  train_labels = extract_labels(f, one_hot=one_hot)

#local_file = base.maybe_download(TEST_IMAGES, train_dir,SOURCE_URL + TEST_IMAGES)
local_file = train_dir + "/" + TEST_IMAGES
with open(local_file, 'rb') as f:
  test_images = extract_images(f)

#local_file = base.maybe_download(TEST_LABELS, train_dir,SOURCE_URL + TEST_LABELS)
local_file = train_dir + "/" + TEST_LABELS

Change the path passed to read_data_sets to match where the data is actually stored.

mnist = input_data.read_data_sets("./TensorFlowOnSpark/mnist", one_hot=True)
# Create an interactive session
sess = tf.InteractiveSession()

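14.2.4.2 Weight Initialization

# Truncated normal distribution with mean 0 and standard deviation 0.1;
# values more than two standard deviations from the mean are re-drawn

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)
# Create a tensor with the given shape and initialize every value to 0.1
def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)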
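
tf.truncated_normal draws from a normal distribution and re-draws any value that falls more than two standard deviations from the mean. The following is a plain-Python sketch of that rule using rejection sampling; the function is written for this text as an illustration and is not TensorFlow's implementation.

```python
import random


def truncated_normal(n, mean=0.0, stddev=0.1):
    # Draw n values; discard and re-draw anything beyond 2 standard deviations.
    values = []
    while len(values) < n:
        v = random.gauss(mean, stddev)
        if abs(v - mean) <= 2 * stddev:
            values.append(v)
    return values


random.seed(0)  # arbitrary seed so the run is repeatable
sample = truncated_normal(1000)
print("min %.4f, max %.4f" % (min(sample), max(sample)))
```

With stddev=0.1 every initial weight therefore lies in [-0.2, 0.2], which keeps the starting weights small and away from extreme values.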

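14.2.4.3 Building the Convolutional Neural Network

# Convolution with stride 1 in every direction; SAME pads the borders with zeros
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
# Max pooling over the convolution output with a 2x2 window and stride 2 (SAME padding); this reduces the data to a quarter of its size
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')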

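# Define the input and output structure

# Placeholder for the input; None means the number of images is not fixed, and each image has 28*28 pixels
xs = tf.placeholder(tf.float32, [None, 28*28])
# Ten classes, the digits 0-9, matching the classification output
ys = tf.placeholder(tf.float32, [None, 10])
keep_prob = tf.placeholder(tf.float32)
# Reshape xs to 28*28*1; the images are grayscale, so there is 1 channel. -1 means the number of images is not fixed
x_image = tf.reshape(xs, [-1, 28, 28, 1])
# Build the network, i.e. define the forward computation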

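## Convolutional layer 1 ##
# The first two values are the kernel (patch) size, the third is the number of input channels, and the fourth is the number of kernels, i.e. the number of feature maps produced
W_conv1 = weight_variable([5, 5, 1, 32])
# Each kernel has its own bias
b_conv1 = bias_variable([32])
# Convolve the image with the kernels and add the bias; the output is 28x28x32
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
# Max-pool the convolution output; the result is 14x14x32
h_pool1 = max_pool_2x2(h_conv1)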

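## Convolutional layer 2 ##
# Convolve over 32 input channels to produce 64 feature maps
w_conv2 = weight_variable([5,5,32,64])
# 64 biases
b_conv2  = bias_variable([64])
# h_pool1 is the pooled output of the previous layer; the convolution output is 14x14x64
h_conv2 = tf.nn.relu(conv2d(h_pool1,w_conv2)+b_conv2)
# The pooled output is 7x7x64
h_pool2 = max_pool_2x2(h_conv2)
# The original 28*28 image shrinks to 14*14 with 32 feature maps after the first stage, and to 7*7 with 64 feature maps after the second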

## Layer 3: fully connected ##
# 2-D weight tensor: 7*7*64 inputs (the flattened feature maps) and 1024 output units
W_fc1 = weight_variable([7*7*64, 1024])
# 1024 biases
b_fc1 = bias_variable([1024])
# Flatten the pooled output of the second stage: [n_samples, 7, 7, 64] ->> [n_samples, 7*7*64]
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
# Fully connected layer implemented as a matrix multiplication (matmul); the output has shape [n_samples, 1024]
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

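# Dropout to reduce overfitting, applied to the output of the fully connected layer
# (keep_prob is the placeholder already defined together with xs and ys)
h_fc1_drop = tf.nn.dropout(h_fc1,keep_prob)
## Layer 4: output ##
# 2-D weight tensor mapping the 1024 features to 10 outputs, matching the length of ys
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])
# Final classification: softmax over the 10 classes, shape [n_samples, 10]
y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
# Use cross-entropy as the loss function and minimize it with the Adam optimizer
cross_entropy = -tf.reduce_sum(ys * tf.log(y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)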
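
The loss above takes the log of the softmax output directly, which breaks down if a predicted probability underflows to 0. The sketch below, in plain Python with helper names chosen for this illustration, shows the common remedy of computing the loss from the logits with the log-sum-exp trick. TensorFlow's tf.nn.softmax_cross_entropy_with_logits, used in Section 14.3, likewise takes logits rather than probabilities.

```python
import math


def softmax(logits):
    # Subtract the maximum before exponentiating to avoid overflow.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, logits):
    # -sum(y * log(softmax(z))), computed as -sum(y * (z - logsumexp(z))).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(y * (z - log_sum) for y, z in zip(one_hot, logits))


logits = [2.0, 1.0, 0.1]
label = [1.0, 0.0, 0.0]
probs = softmax(logits)
loss = cross_entropy(label, logits)
print(loss)

# Extreme logits: softmax gives exactly 0.0 for the true class, so taking
# log(probability) would fail, while the logit-based form stays finite.
print(softmax([1000.0, 0.0])[1])                   # 0.0
print(cross_entropy([0.0, 1.0], [1000.0, 0.0]))    # 1000.0
```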

14.2.4.4 Training and Evaluating the Model

# Train and evaluate the model
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(ys,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
tf.global_variables_initializer().run()

for i in range(2000):
    batch = mnist.train.next_batch(50)
    if i%100 == 0:
        train_accuracy = accuracy.eval(feed_dict={xs:batch[0], ys: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g"%(i, train_accuracy))
    train_step.run(feed_dict={xs: batch[0], ys: batch[1], keep_prob: 0.5})

Only 2000 iterations were run here, and the output is shown below. With 20000 iterations, accuracy on the test set can reach about 99.2%.
step 0, training accuracy 0.14
step 100, training accuracy 0.86
step 200, training accuracy 0.94
step 300, training accuracy 0.94
step 400, training accuracy 0.94
step 500, training accuracy 0.98
step 600, training accuracy 0.94
step 700, training accuracy 0.94
step 800, training accuracy 0.98
step 900, training accuracy 0.98
step 1000, training accuracy 1
step 1100, training accuracy 0.94
step 1200, training accuracy 0.98
step 1300, training accuracy 0.96
step 1400, training accuracy 0.92
step 1500, training accuracy 0.96
step 1600, training accuracy 0.96
step 1700, training accuracy 1
step 1800, training accuracy 1
step 1900, training accuracy 0.96

Measure the model's accuracy on the test set:

print("test accuracy %g"%accuracy.eval(feed_dict={xs: mnist.test.images, ys: mnist.test.labels, keep_prob: 1.0})) 
test accuracy 0.9778
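
The training loop feeds keep_prob: 0.5 while training and keep_prob: 1.0 when measuring accuracy. The sketch below shows in plain Python what that switch does, assuming the scaling used by tf.nn.dropout, which keeps each element with probability keep_prob and multiplies the kept elements by 1/keep_prob. The dropout function here is written for illustration only.

```python
import random


def dropout(values, keep_prob, rng):
    # Keep each value with probability keep_prob and scale survivors by
    # 1/keep_prob so the expected sum is unchanged; dropped values become 0.
    out = []
    for v in values:
        if rng.random() < keep_prob:
            out.append(v / keep_prob)
        else:
            out.append(0.0)
    return out


rng = random.Random(0)                         # arbitrary seed
activations = [1.0] * 10000
train_out = dropout(activations, 0.5, rng)     # training: about half are zeroed
eval_out = dropout(activations, 1.0, rng)      # evaluation: nothing is dropped
print(sum(train_out) / len(train_out))         # close to 1.0
print(eval_out == activations)                 # True
```

Because of the rescaling, the layer's average output is about the same in both modes, so the weights learned with dropout can be used unchanged at evaluation time.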

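14.3 Implementing a Recurrent Neural Network with TensorFlow

14.3.1 Introduction to Recurrent Neural Networks

In a traditional neural network, data flows from the input layer through the hidden layers to the output layer. Adjacent layers are fully connected, while the nodes within a layer are not connected to each other. Such ordinary networks are powerless against many problems.

14.3.2 Introduction to LSTM Recurrent Neural Networks

LSTM is a special kind of RNN that handles long-term dependencies well.

14.3.4 Implementing a Recurrent Neural Network with TensorFlow

Earlier we used a convolutional neural network to recognize the handwritten digits in MNIST; with 20000 iterations the accuracy reaches about 99.2%, which is fairly high. Can a recurrent neural network do the same job? If so, how well does it perform?
To make the data suitable for an RNN, note that each image is 28x28 pixels. We treat each row of an image (28 elements) as one input vector of length n_input, and treat the 28 rows of the image as n_steps time steps. This uses all the information in the image and fits the sequential setting an RNN expects.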
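
To make the row-by-row reading concrete, the following plain-Python sketch splits one flattened 784-value image into n_steps rows of n_input values each. The flat_image list is a stand-in for real pixel data.

```python
n_steps, n_input = 28, 28

# Stand-in for one flattened MNIST image: 784 values in row-major order.
flat_image = list(range(n_steps * n_input))

# Time step t receives row t of the image as its input vector.
sequence = [flat_image[t * n_input:(t + 1) * n_input] for t in range(n_steps)]

print(len(sequence))        # 28
print(len(sequence[0]))     # 28
print(sequence[1][0])       # 28
```

The RNN then consumes sequence[0], sequence[1], and so on in order, carrying its state from one row to the next.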
Start ipython, import the required libraries in its interactive shell, and start an interactive session.

import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()

Load the data. See Section 14.2.4.1 for the details, which are not repeated here.

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("./TensorFlowOnSpark/mnist", one_hot=True)

1. Build the model
Set the training hyperparameters, such as the learning rate and batch size.

learning_rate = 0.01
batch_size = 128

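Set the parameters of the recurrent network: the length of each input vector, the number of time steps, the number of hidden units, and the number of classes.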

n_input = 28
n_steps = 28
n_hidden = 256
n_classes = 10

Define the placeholders for the inputs and labels:

x = tf.placeholder(tf.float32, [None, n_steps, n_input])
y = tf.placeholder(tf.float32, [None, n_classes])

Define the classifier weights and initialize the bias:

# Classifier weights and biases
w = tf.Variable(tf.truncated_normal([n_hidden, n_classes]))
b = tf.Variable(tf.zeros([n_classes]))

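Define and initialize the weights and biases of the input gate, forget gate, output gate, and memory cell, using TensorFlow's truncated_normal function for the initial values. Note that the positional arguments -0.1 and 0.1 are the mean and the standard deviation of the distribution, not a lower and an upper bound.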

# Input gate: input, previous output, and bias
ix = tf.Variable(tf.truncated_normal([n_input, n_hidden], -0.1, 0.1))
im = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], -0.1, 0.1))
ib = tf.Variable(tf.zeros([1, n_hidden]))
# Forget gate: input, previous output, and bias
fx = tf.Variable(tf.truncated_normal([n_input, n_hidden], -0.1, 0.1))
fm = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], -0.1, 0.1))
fb = tf.Variable(tf.zeros([1, n_hidden]))
# Memory cell: input, state, and bias
cx = tf.Variable(tf.truncated_normal([n_input, n_hidden], -0.1, 0.1))
cm = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], -0.1, 0.1))
cb = tf.Variable(tf.zeros([1, n_hidden]))
# Output gate: input, previous output, and bias
ox = tf.Variable(tf.truncated_normal([n_input, n_hidden], -0.1, 0.1))
om = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], -0.1, 0.1))
ob = tf.Variable(tf.zeros([1, n_hidden]))

Build the recurrent neural network:

def LSTMRNN(x, n_steps, n_input, n_hidden, n_classes):
    # Define the LSTM cell
    def lstm_cell(i, o, state):
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        update = tf.tanh(tf.matmul(i, cx) + tf.matmul(o, cm) + cb)
        state = forget_gate * state + input_gate * update
        output_gate = tf.sigmoid(tf.matmul(i, ox) +  tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state

    # Collect the output of every time step
    outputs = list()
    # Initial state and output; their first dimension is fixed to batch_size,
    # so every batch fed to the model must contain exactly batch_size samples
    state = tf.Variable(tf.zeros([batch_size, n_hidden]))
    output = tf.Variable(tf.zeros([batch_size, n_hidden]))

    # Use transpose to swap the first and second dimensions of x, reshape it to
    # (n_steps*batch_size, n_input), then split it into a list of n_steps tensors,
    # which is the input format the LSTM loop below expects.
    x = tf.transpose(x, [1, 0, 2])
    x = tf.reshape(x, [-1, n_input])
    x = tf.split(x, n_steps, 0)
    for i in x:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)
    logits =tf.matmul(outputs[-1], w) + b
    return logits
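
To see the gate equations of lstm_cell in isolation, here is a scalar version in plain Python: one input value, one hidden unit, and ordinary floats instead of tensors. The parameter names mirror the variables above (ix, im, ib, and so on), but lstm_step and the parameter values are made up for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, prev_output, prev_state, p):
    """One LSTM step for a single input value and a single hidden unit."""
    input_gate = sigmoid(x * p["ix"] + prev_output * p["im"] + p["ib"])
    forget_gate = sigmoid(x * p["fx"] + prev_output * p["fm"] + p["fb"])
    update = math.tanh(x * p["cx"] + prev_output * p["cm"] + p["cb"])
    # Keep part of the old state and add part of the candidate update.
    state = forget_gate * prev_state + input_gate * update
    output_gate = sigmoid(x * p["ox"] + prev_output * p["om"] + p["ob"])
    return output_gate * math.tanh(state), state


names = ["ix", "im", "ib", "fx", "fm", "fb", "cx", "cm", "cb", "ox", "om", "ob"]

# With all parameters at zero every gate is sigmoid(0) = 0.5 and the
# candidate update is tanh(0) = 0, so the state is simply halved.
zero_params = dict((name, 0.0) for name in names)
output, state = lstm_step(1.0, 0.0, 2.0, zero_params)
print(state)    # 1.0

# Run a short sequence, feeding each step's output and state into the next.
params = dict((name, 0.1) for name in names)
out, st = 0.0, 0.0
for value in [0.5, -1.0, 0.25]:
    out, st = lstm_step(value, out, st, params)
print(out, st)
```

The loop at the end has the same shape as the for loop in LSTMRNN: only the last output is used for classification, but it depends on every earlier row through the carried state.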

2. Define the loss function and optimizer

pred = LSTMRNN(x, n_steps, n_input, n_hidden, n_classes)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
sess.run(init)

3. Train and evaluate the model

for step in range(10000):
    batch_x, batch_y = mnist.train.next_batch(batch_size)
    batch_x = batch_x.reshape((batch_size, n_steps, n_input))
    sess.run(optimizer, feed_dict={x: batch_x, y: batch_y})

    if step % 100 == 0:
        acc = sess.run(accuracy, feed_dict={x: batch_x, y: batch_y})
        loss = sess.run(cost, feed_dict={x: batch_x, y: batch_y})
        print("Iter " + str(step) + ", Minibatch Loss= " + "{:.6f}".format(loss) + ", Training Accuracy= " + "{:.5f}".format(acc))
print("Optimization Finished!")

The output for the final batches is shown below.
Iter 8000, Minibatch Loss= 0.085752, Training Accuracy= 0.97656
Iter 8100, Minibatch Loss= 0.065435, Training Accuracy= 0.96875
Iter 8200, Minibatch Loss= 0.088926, Training Accuracy= 0.97656
Iter 8300, Minibatch Loss= 0.039572, Training Accuracy= 1.00000
Iter 8400, Minibatch Loss= 0.050593, Training Accuracy= 0.98438
Iter 8500, Minibatch Loss= 0.030424, Training Accuracy= 0.99219
Iter 8600, Minibatch Loss= 0.026174, Training Accuracy= 0.99219
Iter 8700, Minibatch Loss= 0.045043, Training Accuracy= 0.98438
Iter 8800, Minibatch Loss= 0.031143, Training Accuracy= 0.98438
Iter 8900, Minibatch Loss= 0.055115, Training Accuracy= 0.99219
Iter 9000, Minibatch Loss= 0.061676, Training Accuracy= 0.98438
Iter 9100, Minibatch Loss= 0.123581, Training Accuracy= 0.97656
Iter 9200, Minibatch Loss= 0.057620, Training Accuracy= 0.98438
Iter 9300, Minibatch Loss= 0.043013, Training Accuracy= 0.99219
Iter 9400, Minibatch Loss= 0.067405, Training Accuracy= 0.98438
Iter 9500, Minibatch Loss= 0.020679, Training Accuracy= 1.00000
Iter 9600, Minibatch Loss= 0.079038, Training Accuracy= 0.98438
Iter 9700, Minibatch Loss= 0.080076, Training Accuracy= 0.97656
Iter 9800, Minibatch Loss= 0.010582, Training Accuracy= 1.00000
Iter 9900, Minibatch Loss= 0.019426, Training Accuracy= 1.00000
Optimization Finished!

Validate the model on the test set:

# Calculate accuracy for 128 mnist test images
test_len = batch_size
test_data = mnist.test.images[:test_len].reshape((-1, n_steps, n_input))
test_label = mnist.test.labels[:test_len]
print("Testing Accuracy: %g" % sess.run(accuracy, feed_dict={x: test_data, y: test_label}))

The result is shown below. It is slightly lower than the CNN result but still good.
Testing Accuracy: 0.976562

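14.4 Distributed TensorFlow

On April 14, 2016, Google released distributed TensorFlow, which supports parallel training across hundreds of machines. Distributed TensorFlow is built on the high-performance gRPC library.

14.4.1 Relationship Between the Client, Master, and Workers

14.4.2 Distributed Modes

The most common approach to distributed deep learning training is data parallelism: TensorFlow tasks train the same model on different mini-batches of data and then update the model's shared parameters on a parameter server. TensorFlow supports two training modes, synchronous and asynchronous.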
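
The difference between the two modes can be sketched in a few lines of plain Python. This is a single-process simulation under simplifying assumptions, not distributed code: the model is one number w, each worker's loss is the mean of 0.5 * (w - t)^2 over its mini-batch, and the function names are invented for this illustration. In a real asynchronous cluster a worker may also compute its gradient from parameters that are already out of date, which this sequential version does not model.

```python
def gradient(w, batch):
    # Gradient of the mean of 0.5 * (w - t)**2 over the mini-batch.
    return sum(w - t for t in batch) / float(len(batch))


def sync_step(w, batches, lr):
    # Synchronous: wait for every worker, average the gradients, update once.
    grads = [gradient(w, b) for b in batches]
    return w - lr * sum(grads) / len(grads)


def async_step(w, batches, lr):
    # Asynchronous: the shared parameter is updated as each gradient arrives,
    # so later workers see the effect of earlier updates.
    for b in batches:
        w = w - lr * gradient(w, b)
    return w


batches = [[1.0, 3.0], [5.0, 7.0]]      # one mini-batch per worker
w_sync = sync_step(0.0, batches, 0.5)
w_async = async_step(0.0, batches, 0.5)
print(w_sync)     # 2.0
print(w_async)    # 3.5
```

Both variants move w toward the data, but they take different paths: the synchronous step applies one averaged update, while the asynchronous one applies two updates in arrival order.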

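14.4.3 Running TensorFlow in a PySpark Cluster Environment

This section uses a neural network to fit a quadratic function of one variable, y=x^2-0.5.
For the detailed TensorFlowOnSpark setup, see Section 14.6. Start pyspark in cluster mode: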

pyspark --master spark://master:7077 --driver-memory 1G --total-executor-cores 2

In the pyspark interactive shell, enter:

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Construct data that follows the quadratic function

x_data = np.linspace(-1, 1, 300)[:, np.newaxis]
# Add some noise
noise = np.random.normal(0, 0.05, x_data.shape)
y_data = np.square(x_data) - 0.5 + noise
# Draw a scatter plot
fig=plt.figure()
ax=fig.add_subplot(1,1,1)
ax.scatter(x_data,y_data)

Build a neural network:

xs = tf.placeholder(tf.float32, [None, 1])
ys = tf.placeholder(tf.float32, [None, 1])
# Define a function that adds a layer

def add_layer(inputs, in_size, out_size, activation_function=None):
    weights = tf.Variable(tf.random_normal([in_size, out_size]))
    biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)
    Wx_plus_b = tf.matmul(inputs, weights) + biases
    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b)
    return outputs
# Build a network with 1 input unit, 20 hidden units, and 1 output unit

h1 = add_layer(xs, 1, 20, activation_function=tf.nn.relu)

# Build the output layer; the hidden layer's output is its input

prediction = add_layer(h1, 20, 1, activation_function=None)
# Compute the loss
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys - prediction),reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
# Initialize all variables
init = tf.global_variables_initializer()

sess = tf.Session()
sess.run(init)
# Train for 1000 steps

for i in range(1000):
    sess.run(train_step, feed_dict={xs: x_data, ys: y_data})
    if i % 50 == 0:
        print(sess.run(loss, feed_dict={xs: x_data, ys: y_data}))
        prediction_value=sess.run(prediction,feed_dict={xs:x_data})
        lines=ax.plot(x_data,prediction_value,'r',lw=5)

Output:
1.62758
0.00996406
0.00634915
0.00483868
0.0043179
0.00399014
0.00368176
0.00337165
0.00309145
0.00284696
0.00267657
0.00255845
0.0024702
0.00240239
0.00235583
0.00232014
0.00229183
0.00226797
0.00224843
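
One way to read these numbers: the data was generated with Gaussian noise of standard deviation 0.05, so even a perfect fit of y=x^2-0.5 would leave a mean squared error of roughly 0.05^2 = 0.0025. The sketch below estimates that noise floor in plain Python. It uses Python's random module with an arbitrary seed rather than the NumPy call in the listing, so its value will not match the original run exactly.

```python
import random

random.seed(0)  # arbitrary seed, for repeatability only

# 300 noise samples with the same standard deviation as in the listing.
noise = [random.gauss(0, 0.05) for _ in range(300)]

# Mean squared error of a model that predicts x^2 - 0.5 exactly:
# the only remaining error is the noise itself.
mse_floor = sum(n * n for n in noise) / len(noise)
print(mse_floor)
```

The loss printed above levels off close to this value, which suggests the network has fitted the curve about as well as the noise allows.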

14.5 TensorFlowOnSpark Architecture

TensorFlowOnSpark (TFoS) supports distributed TensorFlow on Spark and Hadoop clusters.

14.6 Installing TensorFlowOnSpark

Install TensorFlowOnSpark with the pip package manager. The default version installed is 1.0.

pip  install  tensorflowonspark

After the command above completes, a new TensorFlowOnSpark directory appears under the user's current directory.
Then define this path in .bashrc:

export TFoS_HOME=/home/hadoop/TensorFlowOnSpark

You can verify in the pyspark shell that both installations succeeded:

pyspark
>>> import tensorflow as tf
>>> from tensorflowonspark import TFCluster

If these imports raise no exceptions, the installation succeeded. Next, prepare for training.
Package the scripts directory so that it can be shipped to each worker:

cd TensorFlowOnSpark/scripts
zip  -r ../tfspark.zip *

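14.7 TensorFlowOnSpark Example

We use TensorFlowOnSpark to classify the MNIST data. MNIST is a database of handwritten digits with 60000 training samples and 10000 test samples, distributed as four files such as train-images-idx3-ubyte.gz and train-labels-idx1-ubyte.gz. The image data is stored in binary files, and each sample image is 28*28 pixels.
Download the MNIST data: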

mkdir ${TFoS_HOME}/mnist
pushd ${TFoS_HOME}/mnist
curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"
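
The idx3-ubyte image files begin with a 16-byte header of four big-endian 32-bit integers: a magic number (2051 for image files), the number of images, the number of rows, and the number of columns. The sketch below parses such a header from a synthetic byte string built to match the training set described above. parse_idx3_header is a name chosen for this illustration, and a real downloaded file would first need to be decompressed, for example with Python's gzip module.

```python
import struct


def parse_idx3_header(data):
    # Four big-endian unsigned 32-bit integers: magic, count, rows, columns.
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError("not an IDX3 image file: magic=%d" % magic)
    return count, rows, cols


# Synthetic header with the values expected for the MNIST training images.
header = struct.pack(">IIII", 2051, 60000, 28, 28)
print(parse_idx3_header(header))    # (60000, 28, 28)
```

After the header, the pixel data follows as one unsigned byte per pixel, image after image, which is why each image contributes 28*28 = 784 values.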

14.7.1 TensorFlowOnSpark Single-Machine Example

Set the local parameters and start one master node and two worker nodes on a single machine.

export MASTER=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=1 
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES})) 
${SPARK_HOME}/sbin/start-master.sh; ${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 2G ${MASTER}

Once started, jps shows the following processes: one Master and two Workers. NameNode belongs to the Hadoop instance started earlier.

[hadoop@master ~]$ jps
25157 Master
25258 Worker
27229 RunJar
26893 NameNode
27087 SecondaryNameNode
13071 Jps
25215 Worker

With the services up, upload the MNIST data to HDFS and convert it to CSV format.

${SPARK_HOME}/bin/spark-submit --master spark://master:7077 ${TFoS_HOME}/examples/mnist/mnist_data_setup.py --output examples/mnist/csv --format csv

When the job finishes, hadoop fs shows the following on HDFS:

hadoop fs -ls /user/hadoop/examples/mnist/csv/train/images
Found 11 items
-rw-r--r--   1 hadoop supergroup          0 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/_SUCCESS
-rw-r--r--   1 hadoop supergroup    9338236 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00000
-rw-r--r--   1 hadoop supergroup   11231804 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00001
-rw-r--r--   1 hadoop supergroup   11214784 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00002
-rw-r--r--   1 hadoop supergroup   11226100 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00003
-rw-r--r--   1 hadoop supergroup   11212767 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00004
-rw-r--r--   1 hadoop supergroup   11173834 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00005
-rw-r--r--   1 hadoop supergroup   11214285 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00006
-rw-r--r--   1 hadoop supergroup   11201024 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00007
-rw-r--r--   1 hadoop supergroup   11194141 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00008
-rw-r--r--   1 hadoop supergroup   10449019 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00009

After the data has been loaded and converted, start training.

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model

When the run finishes, output like the following appears:

2017-06-18 05:30:50,072 INFO (MainThread-25741) Feeding training data
2017-06-18 05:32:07,655 INFO (MainThread-25741) Stopping TensorFlow nodes       
2017-06-18 05:32:07,883 INFO (MainThread-25741) Shutting down cluster
2017-06-18T05:32:13.346161 ===== Stop

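If the job hangs while running, edit mnist_dist.py and change logdir=logdir to logdir=None in two places (at lines 115 and 125).
Once training is complete, the next step is to validate the model on the test set and produce predictions.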

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/test/images \
--labels examples/mnist/csv/test/labels \
--mode inference \
--format csv \
--model mnist_model \
--output predictions

When the run finishes, the predictions directory and its contents are visible on HDFS.

[hadoop@master spark]$ hadoop fs -ls /user/hadoop/predictions
17/06/20 02:45:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 11 items
-rw-r--r--   1 hadoop supergroup          0 2017-06-18 14:04 /user/hadoop/predictions/_SUCCESS
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00000
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00001
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00002
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00003
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00004
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00005
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00006
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00007
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00008
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00009

Open one of the files to see the prediction results.
2017-06-18T05:51:42.397905 Label: 5, Prediction: 5
2017-06-18T05:51:42.397923 Label: 9, Prediction: 8
2017-06-18T05:51:42.397941 Label: 7, Prediction: 5
2017-06-18T05:51:42.397958 Label: 3, Prediction: 5
2017-06-18T05:51:42.397976 Label: 4, Prediction: 8
2017-06-18T05:51:42.397993 Label: 9, Prediction: 8
2017-06-18T05:51:42.398012 Label: 6, Prediction: 5
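To judge the model rather than just eyeball the output, the label and the prediction on each line can be compared. The following is a minimal sketch in plain Python, assuming the line format shown above; the names parse_predictions and accuracy are illustrative helpers, not part of TensorFlowOnSpark.

```python
import re

# Sample lines copied from the prediction output shown above.
SAMPLE = """\
2017-06-18T05:51:42.397905 Label: 5, Prediction: 5
2017-06-18T05:51:42.397923 Label: 9, Prediction: 8
2017-06-18T05:51:42.397941 Label: 7, Prediction: 5
2017-06-18T05:51:42.397958 Label: 3, Prediction: 5
2017-06-18T05:51:42.397976 Label: 4, Prediction: 8
2017-06-18T05:51:42.397993 Label: 9, Prediction: 8
2017-06-18T05:51:42.398012 Label: 6, Prediction: 5
"""

LINE_RE = re.compile(r"Label:\s*(\d+),\s*Prediction:\s*(\d+)")


def parse_predictions(text):
    """Return a list of (label, prediction) integer pairs found in text."""
    return [(int(label), int(pred)) for label, pred in LINE_RE.findall(text)]


def accuracy(pairs):
    """Fraction of pairs whose prediction equals the label (0.0 if empty)."""
    if not pairs:
        return 0.0
    return sum(1 for label, pred in pairs if label == pred) / len(pairs)


pairs = parse_predictions(SAMPLE)
print(len(pairs), accuracy(pairs))
```

On these seven sample lines only one prediction matches its label, so a check like this over all the part files would be worth running before trusting the model.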

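14.7.2 TensorFlowOnSpark Cluster Mode Example

Set the relevant parameters on the local machine, then start Spark in cluster mode with one master node and slave1 and slave2 as the two worker nodes, each with its own resource configuration.
Train the model: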

${SPARK_HOME}/bin/spark-submit \
--master spark://master:7077 \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=4 \
--conf spark.task.cpus=2 \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${JAVA_HOME}/jre/lib/amd64/server" \
--conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size 2 \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model
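The values of spark.cores.max, spark.task.cpus and --cluster_size in the command above are related: the total cores divided by the cores per task gives the number of tasks Spark can run at once, and that number should be at least the cluster size. The sketch below only illustrates this arithmetic under that assumption; executor_slots and check_cluster_size are hypothetical helper names, not Spark or TensorFlowOnSpark functions.

```python
def executor_slots(total_cores, cores_per_task):
    """Number of Spark tasks that can run at the same time."""
    if cores_per_task <= 0:
        raise ValueError("cores_per_task must be positive")
    return total_cores // cores_per_task


def check_cluster_size(total_cores, cores_per_task, cluster_size):
    """True when there is a concurrent task slot for every requested node."""
    return executor_slots(total_cores, cores_per_task) >= cluster_size


# Values taken from the command above:
# spark.cores.max=4, spark.task.cpus=2, --cluster_size 2
print(executor_slots(4, 2))
print(check_cluster_size(4, 2, 2))
```

With 4 cores and 2 cores per task there are exactly two slots, one per worker node in this setup.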

Test the model:

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=4 \
--conf spark.task.cpus=2 \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size 2 \
--images examples/mnist/csv/test/images \
--labels examples/mnist/csv/test/labels \
--mode inference \
--format csv \
--model mnist_model \
--output predictions

Check the results:

$ hadoop fs -ls /user/hadoop/predictions
Found 11 items
-rw-r--r--   1 hadoop supergroup          0 2017-06-20 08:55 /user/hadoop/predictions/_SUCCESS
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00000
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00001
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00002
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00003
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00004
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00005
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00006
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00007
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00008
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00009

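Error messages raised on each node at run time can be found under spark/work/app-20170620085449-0003/1.

14.8 Summary

To make up for the lack of neural network and deep learning support in Spark machine learning, this chapter introduced TensorFlow, the deep learning framework behind AlphaGo. We concentrated on the fundamentals, built several TensorFlow examples on that foundation, and finally described TensorFlow's distributed architecture and TensorFlowOnSpark, the architecture that integrates it with Spark.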

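Chapter 13 Building Online Learning Models with Spark Streaming

The algorithms covered so far generally work on one or a few relatively fixed files. The source data for the model is fixed, even though it may be large and plentiful. Training and testing are both built on this fixed data, although testing may of course draw on data from outside the source, such as new or different data. Training on relatively fixed data is a very common machine learning scenario.
In practice there are other scenarios as well, in which the source data changes constantly, like flowing water. Much online data and log data behaves this way. How should we learn from data like this?
This is a stream computing problem. Current tools for it include Spark Streaming, Storm, and Samza. This chapter focuses on Spark Streaming.
This chapter covers the following topics:
An overview of Spark Streaming
An introductory Spark Streaming example
An online learning example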

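13.1 Introduction to Spark Streaming

Spark Streaming is an extension of the Spark core API that provides high-throughput, fault-tolerant processing of live data streams. It can ingest data from many sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets. Once ingested, the data can be processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window. The results can then be written to file systems, databases, and live dashboards. In keeping with the "one stack to rule them all" idea, Spark's other sub-frameworks, such as machine learning and graph computation, can also be applied to the stream.

13.1.1 Common Spark Streaming Terms

Before introducing Spark Streaming itself, we briefly go over some common streaming terms.

13.1.2 Spark Streaming Processing Flow

Figure 13-1 shows the data flow that Spark Streaming processes.

Figure 13-1 The Spark Streaming computation process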

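13.2 DStream Operations

RDDs offer many actions and transformations. Like RDDs, DStreams provide their own set of operations. This section describes how to work with DStreams, including input, transformation, state updates, and output.

13.2.1 DStream Input

All operations in Spark Streaming are stream-based, and the input source is where every operation starts.
Spark Streaming provides two kinds of streaming input sources:
Basic input sources: sources available directly through the StreamingContext API, such as file systems, socket connections, and Akka actors;
Advanced input sources: sources available through dedicated utility classes, such as Kafka, Flume, Kinesis, and Twitter. Using them requires some additional dependency packages.

13.2.2 DStream Transformations

A DStream transformation creates a new DStream from one or more existing DStreams.

13.2.3 DStream State Updates

Besides the basic operations, Spark Streaming also provides stateful operations.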
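A stateful operation carries information from one batch to the next, whereas a basic transformation only sees the current batch. The sketch below simulates that idea in plain Python with a running word count; it does not call Spark, the batches are made-up data, and update_state is an illustrative name. In Spark Streaming itself the corresponding operator is updateStateByKey.

```python
from collections import Counter


def update_state(state, batch_words):
    """Merge one micro-batch of words into the running counts.

    Returns a new Counter so the previous state is left untouched.
    """
    new_state = Counter(state)
    new_state.update(batch_words)
    return new_state


# Three hypothetical micro-batches; the last one is empty.
batches = [["ok", "ok", "py"], ["py", "m"], []]

state = Counter()
for batch in batches:
    state = update_state(state, batch)
    print(dict(state))
```

The empty third batch leaves the counts unchanged, which is the point of keeping state: earlier batches are still reflected in the result.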

13.2.4 DStream Output

Spark Streaming can write the data in a DStream to external systems such as databases and file systems.

13.3 Spark Streaming Example

First start nc on port 9999:

nc -lk 9999

Then start the Spark shell in local mode:

// Import the required classes and packages
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import spark.implicits._

// Create a context with a 3-second batch interval
val ssc = new StreamingContext(sc, Seconds(3))
// Create a socket stream on master:9999
val lines = ssc.socketTextStream("master", 9999)
val words = lines.flatMap(_.split(" "))
// To run the statistics in SQL, convert each RDD of the DStream to a DataFrame:
// RDD[String] -> RDD[Record] (a case class) -> DataFrame
case class Record(word: String)

words.foreachRDD { (rdd: RDD[String], time: Time) =>
  val wordsDataFrame = rdd.map(w => Record(w)).toDF()
  // Register a temporary view
  wordsDataFrame.createOrReplaceTempView("words")
  // Count the words with SQL
  val wordCountsDataFrame = spark.sql("select word, count(*) as total from words group by word")
  println(s"========= $time =========")
  wordCountsDataFrame.show()
}
ssc.start()
ssc.awaitTermination()

In the terminal where nc is running, type:

ok ok m p py py py

The Spark shell then prints the following output:

========= 1494714360000 ms =========
+----+-----+
|word|total|
+----+-----+
|   m|    1|
|  ok|    2|
|   p|    1|
|  py|    3|
+----+-----+
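The counts in this table can be checked by hand. The sketch below reproduces the per-batch word count in plain Python for the line typed into nc; it only mirrors the split-and-group logic and does not use Spark. The name count_words is illustrative.

```python
from collections import Counter


def count_words(line):
    """Split a line on single spaces and count each non-empty word."""
    return Counter(word for word in line.split(" ") if word)


counts = count_words("ok ok m p py py py")
for word in sorted(counts):
    print(word, counts[word])
```

Sorted by word, this gives m 1, ok 2, p 1, py 3, which agrees with the table printed by the Spark shell.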

13.4 Spark Streaming Online Learning Example

The previous example used nc to generate text and Spark Streaming to count word frequencies in real time. It gives a rough picture of streaming: the source data can be produced and changed in real time, and Spark Streaming computes word frequencies on that stream as it arrives and prints them to the screen.
Beyond counting words, Spark Streaming can also perform online machine learning. It currently supports Streaming Linear Regression, Streaming KMeans, and others. In this section we simulate online learning with linear regression. The source data consists of several files: the model is first trained on one file, then adjusted on new data, and used to make predictions on that new data.

// Import the required classes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.feature.StandardScaler
import breeze.linalg.DenseVector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// Run interactively in the Spark shell
val ssc = new StreamingContext(sc, Seconds(10))
val stream = ssc.textFileStream("file:///home/hadoop/data/streaming/traindir")

val NumFeatures = 11
val zeroVector = DenseVector.zeros[Double](NumFeatures)
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(zeroVector.data))
  .setNumIterations(20)
  .setRegParam(0.8)
  .setStepSize(0.01)

// Build a stream of labeled points: fields 0-10 are the features, field 11 is the label
val labeledStream = stream.map { line =>
  val split = line.split(";")
  val y = split(11).toDouble
  val features = split.slice(0, 11).map(_.toDouble)
  LabeledPoint(label = y, features = Vectors.dense(features))
}
// Train the model on the stream and print each label next to its prediction
model.trainOn(labeledStream)
model.predictOnValues(labeledStream.map(lp => (lp.label, lp.features))).print()
// Start Spark Streaming
ssc.start()
ssc.awaitTermination()
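To make the idea of updating a model batch by batch concrete, here is a simplified sketch in plain Python. It parses a ';'-separated record the same way as the Scala code above (11 features followed by the label) and applies one gradient step of least-squares regression. It is an illustration only: it ignores regularization and is not how Spark implements StreamingLinearRegressionWithSGD internally. The record and the names parse_line and sgd_step are made up for the example.

```python
def parse_line(line, num_features=11):
    """Split a ';'-separated record into (label, features)."""
    fields = line.split(";")
    features = [float(v) for v in fields[:num_features]]
    label = float(fields[num_features])
    return label, features


def sgd_step(weights, batch, step_size):
    """One gradient step of least-squares regression on a batch.

    batch is a list of (label, features) pairs; the gradient is
    averaged over the batch before the weights are updated.
    """
    grad = [0.0] * len(weights)
    for label, features in batch:
        error = sum(w * x for w, x in zip(weights, features)) - label
        for i, x in enumerate(features):
            grad[i] += error * x
    n = max(len(batch), 1)
    return [w - step_size * g / n for w, g in zip(weights, grad)]


# Hypothetical record: 11 feature values followed by the label.
line = ";".join(["1.0"] * 11 + ["5.0"])
label, features = parse_line(line)

# Start from zero weights, as the Scala example does.
weights = sgd_step([0.0] * 11, [(label, features)], step_size=0.01)
print(label, weights[0])
```

Each new batch would call sgd_step again with the weights returned by the previous call, which is what lets the model follow data that keeps arriving.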

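13.5 Summary

The preceding chapters dealt with how Spark ML analyzes and processes batch or offline data. This chapter covered how Spark Streaming processes and analyzes online or streaming data. It first gave a brief introduction to Spark Streaming concepts, input sources, and DStream transformations, state updates, and outputs, and then tied these together in two examples that show how Spark Streaming is used for online statistics and online learning.

Download this chapter's dataset

Chapter 12 Spark R Naive Bayes Model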

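The previous chapter introduced PySpark, which uses Python to run jobs on the Spark big data computing framework and so naturally combines Python's strengths with Spark's. Besides the Python API, Spark also provides an API for the R language, in a component called Spark R. Figure 12-1 shows how Spark R works and how it is structured.

Figure 12-1 Spark R architecture

The architecture of Spark R resembles that of PySpark. Besides a JVM process on the driver (which contains a SparkContext; in Spark 2.x SparkSession has taken over from SparkContext as the entry point), an R process is also started. The two processes communicate over a socket. Users submit R code, the R process executes it, and when that code calls Spark functions, the R process triggers the corresponding job in the JVM through the socket.
When the R process submits a job to the JVM process, R packages up the environment that the subtasks need and sends it to the JVM on the driver side. Every RDD generated from R is of type RRDD. When an action is triggered on an RRDD, the Spark executor starts an R process and communicates with it over a socket. The executor sends the task and its environment to the R process, which loads the required packages, runs the task, and returns the result.
This chapter uses an example to show how to work with Spark R. The topics are:
Introduction to Spark R
Uploading data to HDFS, importing it into Hive, and reading it back from Hive
Using the naive Bayes classifier
Exploring the data
Preprocessing the data
Training the model
Evaluating the model

12.1 Introduction to Spark R

The latest SparkR version at the time of writing is 2.0.1. See the API reference (http://spark.apache.org/docs/latest/api/R/index.html).

12.2 Getting the Data

12.2.1 The SparkDataFrame Data Structure

SparkDataFrame is the distributed data format (DataFrame) provided by Spark. It is similar to a table in a relational database or a data frame in R. A SparkDataFrame can be constructed from a wide range of sources, such as structured data files, Hive tables, external databases, or existing local data.

12.2.2 Creating a SparkDataFrame

1. Load data from a local file to create a SparkDataFrame
SparkR can operate on a variety of data sources through the SparkDataFrame interface. Example: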

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib") # load SparkR
## 
## Attaching package: 'SparkR'
## The following objects are masked from 'package:stats':
## 
##     cov, filter, lag, na.omit, predict, sd, var, window
## The following objects are masked from 'package:base':
## 
##     as.data.frame, colnames, colnames<-, drop, endsWith,
##     intersect, rank, rbind, sample, startsWith, subset, summary,
##     transform, union
sparkR.session(sparkHome ='/u01/bigdata/spark' ) # start the Spark session
## Spark package found in SPARK_HOME: /u01/bigdata/spark
## Launching java with spark-submit command /u01/bigdata/spark/bin/spark-submit   sparkr-shell /tmp/RtmpA30Gvz/backend_port2ad11c0705d8
## Java ref type org.apache.spark.sql.SparkSession id 1
# Read a local csv file
Sparkdf <-read.df("/u01/bigdata/data/df2.csv",source='csv',header='TRUE',inferSchema ="true")
# Inspect the SparkDataFrame
head(Sparkdf)
##   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

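2. Create a SparkDataFrame from a data frame in the R environment
The simplest way to create a SparkDataFrame is to convert a data frame that already exists in the local R environment, using either the as.DataFrame or the createDataFrame function. As an example, we create a SparkDataFrame from R's built-in iris dataset.

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib") # load SparkR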
## 
## Attaching package: 'SparkR'
## The following objects are masked from 'package:stats':
## 
##     cov, filter, lag, na.omit, predict, sd, var, window
## The following objects are masked from 'package:base':
## 
##     as.data.frame, colnames, colnames<-, drop, endsWith,
##     intersect, rank, rbind, sample, startsWith, subset, summary,
##     transform, union
sparkR.session(sparkHome ='/u01/bigdata/spark' ) # start the Spark session
## Java ref type org.apache.spark.sql.SparkSession id 1
# Create the SparkDataFrame Sparkdf from the iris dataset
Sparkdf <-as.DataFrame(iris)
# Inspect the newly created SparkDataFrame
head(Sparkdf)
##   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

3. Load data from HDFS to create a SparkDataFrame
Example:

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib") # load SparkR
## 
## Attaching package: 'SparkR'
## The following objects are masked from 'package:stats':
## 
##     cov, filter, lag, na.omit, predict, sd, var, window
## The following objects are masked from 'package:base':
## 
##     as.data.frame, colnames, colnames<-, drop, endsWith,
##     intersect, rank, rbind, sample, startsWith, subset, summary,
##     transform, union
## Java ref type org.apache.spark.sql.SparkSession id 1
# Read a file from HDFS
Sparkdf <-read.df("hdfs://192.168.1.112:9000/u01/bigdata/data/df2.csv",source='csv',header='TRUE',inferSchema ="true")
# Inspect the SparkDataFrame
head(Sparkdf)
##   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

4. Read a table from the Hive data warehouse to create a SparkDataFrame
A SparkDataFrame can also be created from a Hive table.

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib") # load SparkR
# List the databases
sql('show databases')
## SparkDataFrame[databaseName:string]
# Switch to the hive database
sql('use hive')
## SparkDataFrame[]
# List the tables in the hive database
sql('show tables')
## SparkDataFrame[database:string, tableName:string, isTemporary:boolean]
# Describe the table df2
sql('desc df2')
## SparkDataFrame[col_name:string, data_type:string, comment:string]
# Read the Hive table df2 into a SparkDataFrame
Sparkdf<-sql('select * from df2')
# Inspect the SparkDataFrame
head(Sparkdf)
##       height     weight
## 1  0.3307575 -1.4197984
## 2  0.4970992 -1.4364733
## 3  1.4477968 -0.7579736
## 4  0.6815300 -1.7573564
## 5  0.8915567  1.1815332
## 6 -2.2494993 -1.6438995

12.2.3 Common SparkDataFrame Operations

1. Selecting rows or columns

df <-as.DataFrame(iris)
str(df)
## 'SparkDataFrame': 5 variables:
##  $ Sepal_Length: num 5.1 4.9 4.7 4.6 5 5.4
##  $ Sepal_Width : num 3.5 3 3.2 3.1 3.6 3.9
##  $ Petal_Length: num 1.4 1.4 1.3 1.5 1.4 1.7
##  $ Petal_Width : num 0.2 0.2 0.2 0.2 0.2 0.4
##  $ Species     : chr "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"
head(df)
##   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Select the Sepal_Length column
head(select(df, df$Sepal_Length))
##   Sepal_Length
## 1          5.1
## 2          4.9
## 3          4.7
## 4          4.6
## 5          5.0
## 6          5.4
# Or, equivalently
head(select(df, "Sepal_Length"))
##   Sepal_Length
## 1          5.1
## 2          4.9
## 3          4.7
## 4          4.6
## 5          5.0
## 6          5.4
# Keep only the rows where Sepal_Length is less than 5
head(filter(df, df$Sepal_Length <5))
##   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1          4.9         3.0          1.4         0.2  setosa
## 2          4.7         3.2          1.3         0.2  setosa
## 3          4.6         3.1          1.5         0.2  setosa
## 4          4.6         3.4          1.4         0.3  setosa
## 5          4.4         2.9          1.4         0.2  setosa
## 6          4.9         3.1          1.5         0.1  setosa

2. Grouping and aggregation

df <-as.DataFrame(faithful)
# Group the data and count the rows in each group
head(summarize(groupBy(df, df$waiting), count =n(df$waiting)))
##   waiting count
## 1      70     4
## 2      67     1
## 3      69     2
## 4      88     6
## 5      49     5
## 6      64     4
# Sort the result by count in descending order
waiting_counts <-summarize(groupBy(df, df$waiting), count =n(df$waiting))
head(arrange(waiting_counts, desc(waiting_counts$count)))
##   waiting count
## 1      78    15
## 2      83    14
## 3      81    13
## 4      77    12
## 5      82    12
## 6      79    10

3. Arithmetic on SparkDataFrame columns

df$waiting_secs <-df$waiting *60
head(df)
##   eruptions waiting waiting_secs
## 1     3.600      79         4740
## 2     1.800      54         3240
## 3     3.333      74         4440
## 4     2.283      62         3720
## 5     4.533      85         5100
## 6     2.883      55         3300

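4. Using the apply family of functions
• The dapply function, which applies a function to each partition of a SparkDataFrame, is similar in spirit to R's apply function. Here is an example.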

df <-as.DataFrame(iris)
df1 <-dapply(df, function(x) { x[x[,1]>6,]},schema =schema(df))
head(collect(df1))
##   Sepal_Length Sepal_Width Petal_Length Petal_Width    Species
## 1          7.0         3.2          4.7         1.4 versicolor
## 2          6.4         3.2          4.5         1.5 versicolor
## 3          6.9         3.1          4.9         1.5 versicolor
## 4          6.5         2.8          4.6         1.5 versicolor
## 5          6.3         3.3          4.7         1.6 versicolor
## 6          6.6         2.9          4.6         1.3 versicolor
str(df1)
## 'SparkDataFrame': 5 variables:
##  $ Sepal_Length: num 7 6.4 6.9 6.5 6.3 6.6
##  $ Sepal_Width : num 3.2 3.2 3.1 2.8 3.3 2.9
##  $ Petal_Length: num 4.7 4.5 4.9 4.6 4.7 4.6
##  $ Petal_Width : num 1.4 1.5 1.5 1.5 1.6 1.3
##  $ Species     : chr "versicolor" "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
dim(df1)
## [1] 61  5
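The way dapply works, running the supplied function once per partition and combining the results, can be simulated without Spark. The sketch below does this in plain Python on a few iris rows split into two made-up partitions; dapply_sim and keep_long are illustrative names, and the partitioning is an assumption for the example, not what Spark would necessarily choose.

```python
def dapply_sim(partitions, func):
    """Apply func to every partition (a list of rows) and concatenate the results."""
    out = []
    for part in partitions:
        out.extend(func(part))
    return out


# A few iris rows as (Sepal_Length, Species), split into two hypothetical partitions.
partitions = [
    [(5.1, "setosa"), (4.9, "setosa"), (7.0, "versicolor")],
    [(6.4, "versicolor"), (5.0, "setosa"), (6.9, "versicolor")],
]


def keep_long(rows):
    """Same filter as the R function above: keep rows whose first column exceeds 6."""
    return [row for row in rows if row[0] > 6]


result = dapply_sim(partitions, keep_long)
print(result)
```

Because the function sees one partition at a time, it must not depend on rows that live in another partition. A row filter like this one satisfies that condition.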

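12.3 Naive Bayes Classifier

The data for this case study describes who survived the sinking of the Titanic. The response variable is Survived, which has two classes (yes, no). The feature variables are Sex, Age, and Class (cabin class), coded as follows:
Class: 0 = crew, 1 = first, 2 = second, 3 = third
Age: 1 = adult, 0 = child
Sex: 1 = male, 0 = female
Survived: 1 = yes, 0 = no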
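Because every variable is stored as an integer code, a small lookup table helps when reading rows of the dataset. The sketch below decodes one row using the coding listed above; the CODES table and the decode function are illustrative and not part of SparkR.

```python
# Coding scheme for the Titanic variables, as listed above.
CODES = {
    "Class": {0: "crew", 1: "first", 2: "second", 3: "third"},
    "Age": {1: "adult", 0: "child"},
    "Sex": {1: "male", 0: "female"},
    "Survived": {1: "yes", 0: "no"},
}


def decode(row):
    """Translate a dict of integer codes into readable labels."""
    return {name: CODES[name][value] for name, value in row.items()}


# A row with every field coded 1.
print(decode({"Class": 1, "Age": 1, "Sex": 1, "Survived": 1}))
```

A row coded 1, 1, 1, 1 therefore describes a first-class adult male who survived.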

12.3.1 Data exploration

Let us take a look at the data.

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib")  # load SparkR
## 
## Attaching package: 'SparkR'
## The following objects are masked from 'package:stats':
## 
##     cov, filter, lag, na.omit, predict, sd, var, window
## The following objects are masked from 'package:base':
## 
##     as.data.frame, colnames, colnames<-, drop, endsWith,
##     intersect, rank, rbind, sample, startsWith, subset, summary,
##     transform, union
## Java ref type org.apache.spark.sql.SparkSession id 1
## SparkDataFrame[]
# Load the data from the Hive warehouse
titanic <-sql('select * from titanic')
# Inspect the SparkDataFrame
head(titanic)
##   class age sex survived
## 1     1   1   1        1
## 2     1   1   1        1
## 3     1   1   1        1
## 4     1   1   1        1
## 5     1   1   1        1
## 6     1   1   1        1
dim(titanic)  # number of records and number of columns
## [1] 2201    4

12.3.2 Transforming the raw dataset

titanic_df=as.data.frame(titanic)
titanic <-as.data.frame(table(titanic_df))
colnames(titanic)<-paste0(toupper(substring(colnames(titanic),1,1)),substring(colnames(titanic),2))
titanic_temp<-titanic[titanic$Freq >0, -5]

head(titanic_temp)
##    Class Age Sex Survived
## 4      3   0   0        0
## 5      0   1   0        0
## 6      1   1   0        0
## 7      2   1   0        0
## 8      3   1   0        0
## 12     3   0   1        0

12.3.3 Comparing survival across cabin classes

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:SparkR':
## 
##     arrange, between, collect, contains, count, cume_dist,
##     dense_rank, desc, distinct, explain, filter, first, group_by,
##     intersect, lag, last, lead, mutate, n, n_distinct, ntile,
##     percent_rank, rename, row_number, sample_frac, select, sql,
##     summarize, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
tempdata<-aggregate(Freq~Class+Survived,data = titanic,FUN = sum)
ggplot(data = tempdata,mapping =aes(x = Class,y=Freq,fill=Survived))+geom_bar(position ='dodge',stat ='identity')+ylab("number")+xlim(c("1","2","3","0"))+theme(text=element_text(family ="Italic",size=18))

Next, compare survival between the sexes:

tempdata<-aggregate(Freq~Sex+Survived,data = titanic,FUN = sum)
ggplot(data = tempdata,mapping =aes(x = Sex,y=Freq,fill=Survived))+geom_bar(position ='dodge',stat ='identity')

Finally, look at survival by age group:

tempdata<-aggregate(Freq~Age+Survived,data = titanic,FUN = sum)
ggplot(data = tempdata,mapping =aes(x = Age,y=Freq,fill=Survived))+geom_bar(position ='dodge',stat ='identity')

12.3.4 Converting the data to a SparkDataFrame

titanicDF <-createDataFrame(titanic[titanic$Freq >0, -5])
nbDF

12.3.5 Model summary

summary(nbModel)
## $apriori
##              1         0
## [1,] 0.5769231 0.4230769
## 
## $tables
##   Class_3   Class_2 Class_1 Sex_0 Age_1 
## 1 0.3125    0.3125  0.3125  0.5   0.5625
## 0 0.4166667 0.25    0.25    0.5   0.75

12.3.6 Prediction

nbPredictions <-predict(nbModel, nbTestDF)
showDF(nbPredictions)
## +-----+---+---+--------+--------------------+--------------------+----------+
## |Class|Age|Sex|Survived|       rawPrediction|         probability|prediction|
## +-----+---+---+--------+--------------------+--------------------+----------+
## |    3|  0|  0|       0|[-3.9824097993521...|[0.60062402496099...|         1|
## |    0|  1|  0|       0|[-2.9426380107070...|[0.50318824507901...|         1|
## |    1|  1|  0|       0|[-3.7310953710712...|[0.58003280993672...|         1|
## |    2|  1|  0|       0|[-3.7310953710712...|[0.58003280993672...|         1|
## |    3|  1|  0|       0|[-3.7310953710712...|[0.39192399049881...|         0|
## |    3|  0|  1|       0|[-3.9824097993521...|[0.60062402496099...|         1|
## |    0|  1|  1|       0|[-2.9426380107070...|[0.50318824507901...|         1|
## |    1|  1|  1|       0|[-3.7310953710712...|[0.58003280993672...|         1|
## |    2|  1|  1|       0|[-3.7310953710712...|[0.58003280993672...|         1|
## |    3|  1|  1|       0|[-3.7310953710712...|[0.39192399049881...|         0|
## |    1|  0|  0|       1|[-3.9824097993521...|[0.76318223866790...|         1|
## |    2|  0|  0|       1|[-3.9824097993521...|[0.76318223866790...|         1|
## |    3|  0|  0|       1|[-3.9824097993521...|[0.60062402496099...|         1|
## |    0|  1|  0|       1|[-2.9426380107070...|[0.50318824507901...|         1|
## |    1|  1|  0|       1|[-3.7310953710712...|[0.58003280993672...|         1|
## |    2|  1|  0|       1|[-3.7310953710712...|[0.58003280993672...|         1|
## |    3|  1|  0|       1|[-3.7310953710712...|[0.39192399049881...|         0|
## |    1|  0|  1|       1|[-3.9824097993521...|[0.76318223866790...|         1|
## |    2|  0|  1|       1|[-3.9824097993521...|[0.76318223866790...|         1|
## |    3|  0|  1|       1|[-3.9824097993521...|[0.60062402496099...|         1|
## +-----+---+---+--------+--------------------+--------------------+----------+
## only showing top 20 rows

12.3.7 Evaluating the model

nbPredictions<-as.data.frame(nbPredictions)

# Compute the confusion matrix (rows: actual Survived, columns: predicted)
ct<-table(titanic_temp$Survived,nbPredictions$prediction)
ct
##    
##      0  1
##   0  2  8
##   1  2 12

Compute the accuracy:

(ct[1,1]+ct[2,2])/sum(ct)
## [1] 0.5833333

Compute the recall:

ct[2,2]/(ct[2,2]+ct[2,1])
## [1] 0.8571429

Compute the precision:

ct[2,2]/(ct[2,2]+ct[1,2])
## [1] 0.6
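
For readers who prefer to check the three metrics outside R, here is a minimal Python sketch that recomputes them from the confusion matrix shown above. The dictionary layout and the variable names are assumptions made for illustration only; the counts themselves come from the ct table.

```python
# Confusion matrix from the ct output: keys are (actual, predicted).
ct = {
    (0, 0): 2, (0, 1): 8,
    (1, 0): 2, (1, 1): 12,
}

total = sum(ct.values())
tp = ct[(1, 1)]  # survived, predicted survived
fn = ct[(1, 0)]  # survived, predicted not survived
fp = ct[(0, 1)]  # not survived, predicted survived
tn = ct[(0, 0)]  # not survived, predicted not survived

accuracy = (tp + tn) / float(total)   # share of all predictions that are correct
recall = tp / float(tp + fn)          # share of actual positives that were found
precision = tp / float(tp + fp)       # share of predicted positives that are correct

print("accuracy=%.4f recall=%.4f precision=%.4f" % (accuracy, recall, precision))
```

The formulas are the same ones used in the R expressions: accuracy sums the diagonal, recall divides by the actual-positive row, and precision divides by the predicted-positive column.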

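12.4 Summary

This chapter showed how to use the SparkR component. SparkR gives R developers a rich set of APIs through which they can drive Spark from R and run R code on the Spark big data platform. R can thereby not only work with data in HDFS or Hive but also take advantage of Spark's distributed, in-memory architecture.

Download the datasets for this chapter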

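Chapter 11 PySpark Decision Tree Model

Spark is not only capable but also easy to use and general purpose. It provides APIs for several programming languages: besides Scala there are Java, Python, and R, which covers the most representative languages in use today. This widens Spark's audience by orders of magnitude and lowers the barrier to learning and using it, so programmers and data analysts who know Java, Python, or R can readily use the Spark computing framework for their big data processing and machine learning tasks.
Python is a powerful tool for machine learning and has long been favored by developers and learners. Besides being open source, easy to learn, and concise in style, Python has many excellent third-party libraries that make data processing, exploration, and model building much easier, such as Pandas, Numpy, Scipy, Matplotlib, StatsModels, Scikit-Learn, and Keras. Python's strength also shows in how it keeps pace with the times: combined with the Spark big data computing platform, the two complement each other's strengths, which has given rise to an important branch of Spark today, PySpark. Its internal architecture is shown in Figure 11-1 (taken from https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals?spm=5176.100239.0.0.eI85ij).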

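Figure 11-1 PySpark architecture

When PySpark's Python interpreter starts, it also launches a JVM, and the Python interpreter communicates with the JVM process over a socket. On the Python driver side, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. Py4J is used only on the driver, for communication between the local Python process and the Java SparkContext objects; bulk data transfer uses a different mechanism. RDD transformations in Python are mapped to PythonRDD objects in the Java environment. On remote worker machines, the PythonRDD objects launch Python subprocesses and communicate with them through pipes, sending them the user's code and data.
This chapter demonstrates the decision tree model from machine learning, using the ML library in PySpark and the IPython interactive environment. It covers:
• Introduction to decision trees
• Data loading
• Data exploration
• Creating a decision tree model
• Training the model and making predictions
• Model tuning with cross-validation, parameter grids, and so on
• Finally, producing an executable Python script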

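11.1 Introduction to PySpark

The Spark website describes PySpark as follows: "PySpark is the Python API for Spark". In other words, PySpark is the programming interface that Spark provides for Python. Spark also provides programming interfaces for Scala, Java, and R; the interface for R (SparkR) is covered in Chapter 12.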

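11.2 Introduction to decision trees

Decision trees are a common and frequently used model in machine learning. They are a powerful non-probabilistic model that can express complex nonlinear patterns and interactions between features.

Figure 11-2 Structure of a decision tree

The theory behind decision trees is not repeated here. This chapter focuses on applying the decision tree classification model in PySpark.

11.3 Data loading

11.3.1 A first look at the raw dataset

The data comes from a competition dataset and is used to predict whether a recommended page is ephemeral (short-lived) or evergreen (popular over the long term). The raw dataset is train.tsv, stored at /home/hadoop/data/train.tsv.
First use shell commands to take an exploratory look at the data and do some simple processing.
1) View the first 2 lines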

$ head -2 train.tsv

"url" "urlid" "boilerplate"	"alchemy_category"	"alchemy_category_score"	"avglinksize"	"commonlinkratio_1"	"commonlinkratio_2"	"commonlinkratio_3"	"commonlinkratio_4"	"compression_ratio"	"embed_ratio"	"framebased"	"frameTagRatio"	"hasDomainLink"	"html_ratio"	"image_ratio"	"is_news"	"lengthyLinkDomain"	"linkwordscore"	"news_front_page"   "non_markup_alphanum_characters"	"numberOfLinks"	"numwords_in_url"	"parametrizedLinkRatio"	"spelling_errors_ratio"	"label"
"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"	"4042" "{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose Cali	"8"	.............. "0.152941176" "0.079129575"	"0"

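The first line of the dataset is the header (field names) row; some of the fields are described below.
Count the total number of lines in the file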

$ cat train.tsv |wc -l
7396

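The file contains 7396 lines in total, including the header row.
2) textFile has no convenient way to skip the header row, so to make the data easier for Spark to handle, delete the header first.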

$ sed  1d train.tsv >train_noheader.tsv

3) Upload the data file to HDFS

$ hdfs dfs -put train_noheader.tsv /data

4) Check that the upload succeeded

hadoop@master:~/data$ hdfs dfs -ls /data
17/05/24 00:46:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 hadoop supergroup   21972457 2017-05-24 00:46 /data/train_noheader.tsv

11.3.2 Starting PySpark

Start PySpark against a Spark cluster running in Standalone mode, and make sure enough memory is allocated.

$ pyspark --master spark://master:7077 --driver-memory 1G --total-executor-cores 4

[Note]: Run pyspark --help to see detailed help for the command.

# Default to standard python interpreter unless told otherwise
if [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
  PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"ipython"}"
fi

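11.3.3 Basic functions

The functions and methods used in this chapter are briefly described in Table 11-4.
Table 11-4 Functions and methods used in this chapter

11.4 Data exploration

1) Use the textFile method of the sc object to load the data file from HDFS and create an RDD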

In [1]: raw_data = sc.textFile("hdfs://master:9000/data/train_noheader.tsv")

2) View the first 2 rows

In [2]: raw_data.take(2)
Out[2]:[u'"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"\t"4042"\t"{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose Cali..."\t...\t"8"\t"0.152941176"\t"0.079129575"\t"0"',
u'"http://www.popsci.com/technology/article/2012-07/electronic-futuristic-starting-gun-eliminates-advantages-races"\t"8471"\t"{""title"":""The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races the fully electronic, futuristic starting gun that eliminates advantages in races the fully electronic, futuristic starting gun that eliminates advantages in races"",""body"":""And that can be carried on a plane without the hassle too The Omega..."\t...\t"9"\t"0.181818182"\t"0.125448029"\t"1"']

3) Count the total number of rows in the data file

In [3]: numRaws = raw_data.count()
In [4]: numRaws
Out[4]: 7395

4) Count by key

In [5]: raw_data.countByKey()
Out[5]: defaultdict(int, {u'"': 7395})

The original data file has 7396 lines; because the first line (the header) was removed during data loading, the result here is 7395.
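
The single key '"' in the countByKey output may look odd at first. A plausible reading, assuming countByKey uses element 0 of each item as the key, is that each item here is a plain string, so its "key" is simply its first character, the opening double quote. The pure-Python sketch below mimics that behavior; count_by_key and the sample data are hypothetical names and values used only for illustration.

```python
from collections import Counter

def count_by_key(items):
    """Count items by their first element, the way a key-based count treats item[0] as the key."""
    return dict(Counter(item[0] for item in items))

# Plain strings: element 0 is the first character of each line.
lines = ['"http://a.example/1"\t"4042"', '"http://b.example/2"\t"8471"']
line_counts = count_by_key(lines)

# (key, value) pairs: element 0 is the real key.
pairs = [("business", 1), ("sports", 1), ("business", 1)]
pair_counts = count_by_key(pairs)

print(line_counts)
print(pair_counts)
```

Counting by key therefore becomes informative only once the RDD holds (key, value) pairs, for example after mapping each record to (category, 1).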

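11.5 Data preprocessing

1) The algorithm used later does not need the timestamp or the page content, so these will be filtered out. As a first step, split each line into fields on the tab character.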

In [6]: records = raw_data.map(lambda line: line.split('\t'))

2) Inspect the structure of records

In [7]: records.first()
Out[7]:
[u'"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"',
  u'"4042"',
  u'"{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar...""}"',
  u'"business"',
  u'"0.789131"',
  u'"2.055555556"',
  u'"0.676470588"',
  u'"0.205882353"',
  u'"0.047058824"',
  u'"0.023529412"',
  u'"0.443783175"',
  u'"0"',
  u'"0"',
  u'"0.09077381"',
  u'"0"',
  u'"0.245831182"',
  u'"0.003883495"',
  u'"1"',
  u'"1"',
  u'"24"',
  u'"0"',
  u'"5424"',
  u'"170"',
  u'"8"',
  u'"0.152941176"',
  u'"0.079129575"',
  u'"0"']

3) Check the number of columns in each row

In [8]: len(records.first())
Out[8]: 27

Import the Vectors class

In [9]: from pyspark.ml.linalg import Vectors

Import the decision tree classifier

In [10]: from pyspark.ml.classification import DecisionTreeClassifier

4) Return all elements of the RDD as a list

In [11]: data = records.collect()

5) Check how many columns a row of data has

In [12]: numColumns = len(data[0])
In [13]: numColumns
Out[13]: 27

6) Define a list data1 to hold the cleaned data, in the form [(label_1, features_1), (label_2, features_2), …]

In [14]: data1 = []

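Clean the data in three steps: strip the quotation marks, take the last column as the label, and replace missing values ("?") in the feature columns with 0.0:

In [15]:
for i in range(numRaws):
    # strip the double quotes around every field
    trimmed = [ each.replace('"', "") for each in data[i] ]
    # the last column is the label
    label = int(trimmed[-1])
    # columns 4 .. numColumns-2 are numeric features; "?" marks a missing value
    features = [0.0 if x == "?" else float(x) for x in trimmed[4:numColumns-1]]
    c = (label, Vectors.dense(features))
    data1.append(c)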
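
To see the cleaning logic in isolation, the following sketch applies the same three steps to one hand-made record without Spark. The function name clean_row, the first_feature parameter, and the shortened sample record are assumptions for illustration; a real record in this dataset has 27 fields.

```python
def clean_row(fields, first_feature=4):
    """Return (label, features) for one tab-split record.

    Quotes are stripped from every field, the last field is the label,
    and "?" in a feature column is treated as a missing value (0.0).
    """
    trimmed = [f.replace('"', "") for f in fields]
    label = int(trimmed[-1])
    features = [0.0 if x == "?" else float(x) for x in trimmed[first_feature:-1]]
    return label, features

# A shortened, made-up record: url, urlid, boilerplate, category, three features, label.
sample = ['"http://example.com"', '"4042"', '"{...}"', '"business"',
          '"0.789131"', '"?"', '"24"', '"0"']
label, features = clean_row(sample)
print(label, features)
```

Replacing "?" with 0.0 is the simple choice made in this chapter; it keeps every row but treats a missing value as a real zero.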

11.6 Creating the decision tree model

1) Convert data1 to a DataFrame; label is the label column and features is the feature column

In [16]: df= spark.createDataFrame(data1, ["label","features"])
In [17]: df.show(10)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|    0|[0.789131,2.05555...|
|    1|[0.574147,3.67796...|
|    1|[0.996526,2.38288...|
|    1|[0.801248,1.54310...|
|    0|[0.719157,2.67647...|
|    0|[0.0,119.0,0.7454...|
|    1|[0.22111,0.773809...|
|    0|[0.0,1.883333333,...|
|    1|[0.0,0.471502591,...|
|    1|[0.0,2.41011236,0...|
+-----+--------------------+
only showing top 10 rows
# Show the schema of df
In [18]: df.printSchema()
root
 |-- label: long (nullable = true)
 |-- features: vector (nullable = true)

2) df is used often below, so cache it in memory

In [19]: df.cache()
Out[19]: DataFrame[label: double, features: vector]

3) Build the feature indexer

In [20]: from pyspark.ml.feature import VectorIndexer
In [20]: featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(df)
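
As a rough guide to what maxCategories controls: VectorIndexer treats a feature as categorical when it has at most maxCategories distinct values, and as continuous otherwise. The pure-Python sketch below illustrates only that decision rule on a toy matrix; it is not Spark's implementation, and categorical_feature_indices is a hypothetical helper.

```python
def categorical_feature_indices(rows, max_categories):
    """Return the column indices whose number of distinct values is <= max_categories."""
    num_features = len(rows[0])
    result = []
    for j in range(num_features):
        distinct = set(row[j] for row in rows)
        if len(distinct) <= max_categories:
            result.append(j)
    return result

# Toy feature matrix: column 0 and column 2 are 0/1 flags, column 1 is continuous.
rows = [
    [0.0, 2.05, 1.0],
    [1.0, 3.67, 1.0],
    [0.0, 2.38, 0.0],
    [1.0, 1.54, 1.0],
]
cat_idx = categorical_feature_indices(rows, max_categories=2)
print(cat_idx)
```

With maxCategories=24, as in this chapter, binary flags such as is_news would be indexed as categorical, while ratios with thousands of distinct values stay continuous.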

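4) Split the data into an 80% training set and a 20% test set

# seed=1234L fixes the random seed, so every run produces the same split and therefore the same row counts in the training and test sets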
In [21]: (trainingData, testData) = df.randomSplit([0.8, 0.2],seed=1234L)

In [22]: trainingData.count()
Out[22]: 5912

In [23]: testData.count()
Out[23]: 1483
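
The effect of the seed can be illustrated without Spark. The sketch below is a simplified stand-in for a weighted random split, not Spark's actual sampling algorithm; random_split and its parameters are hypothetical. It shows two things: the same seed reproduces the same split, and the split sizes are only approximately 80/20 because each row is assigned at random.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to the first or second part with probability given by weights."""
    rng = random.Random(seed)          # a private generator, fully determined by the seed
    threshold = weights[0] / float(sum(weights))
    first, second = [], []
    for row in rows:
        if rng.random() < threshold:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))
train_a, test_a = random_split(rows, [0.8, 0.2], seed=1234)
train_b, test_b = random_split(rows, [0.8, 0.2], seed=1234)

print(len(train_a), len(test_a))
print(train_a == train_b and test_a == test_b)
```

This matches the counts above: 5912 of 7395 rows is close to, but not exactly, 80%.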

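5) Specify the tree depth, label column, and feature column of the decision tree model, and use information entropy (entropy) as the impurity measure. The model is trained in the next step, when the pipeline is fitted.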

In [24]: dt = DecisionTreeClassifier(maxDepth=5, labelCol="label", featuresCol="indexedFeatures", impurity="entropy")
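
As a reminder of what the entropy impurity measures, here is a small sketch that computes the entropy of a node from its class counts. The function and the example counts are illustrative assumptions; the tree learner prefers splits whose child nodes have lower entropy than the parent.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a node, given the number of samples per class."""
    total = float(sum(counts))
    result = 0.0
    for c in counts:
        if c > 0:                      # a class with no samples contributes nothing
            p = c / total
            result -= p * math.log(p, 2)
    return result

pure = entropy([10, 0])        # all samples in one class
balanced = entropy([5, 5])     # two classes, evenly mixed
mixed = entropy([274, 310])    # nearly even mix, so close to the maximum

print(pure, balanced, mixed)
```

For two classes the entropy ranges from 0 (a pure node) to 1 (an even mix), so a leaf whose counts are almost balanced carries little information about the label.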

6) Build the pipeline workflow

In [25]: from pyspark.ml import Pipeline

In [26]: pipeline = Pipeline(stages=[featureIndexer, dt])

In [27]: model = pipeline.fit(trainingData)      ## train the model

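Next, we run predictions on one known record and on one new record:

11.7 Training the model and making predictions

1) Predict with the first row of data and check whether the result matches the label. First, look at the first row of the original dataset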

In [28]: data1[0]
Out[28]: 
(0.0,
 DenseVector([0.7891, 2.0556, 0.6765, 0.2059, 0.0471, 0.0235, 0.4438, 0.0, 0.0, 0.0908, 0.0, 0.2458, 0.0039, 1.0, 1.0, 24.0, 0.0, 5424.0, 170.0, 8.0, 0.1529, 0.0791]))

2) Predict using the feature values of the first row of the dataset

In [29]: test0 = spark.createDataFrame([(data1[0][1],)], ["features"])
In [30]: result = model.transform(test0)
# View the prediction result
In [31]: result.show()
+--------------------+--------------------+-------------+--------------------+----------+
|            features|     indexedFeatures|rawPrediction|         probability|prediction|
+--------------------+--------------------+-------------+--------------------+----------+
|[0.789131,2.05555...|[0.789131,2.05555...|[274.0,310.0]|[0.46917808219178...|       1.0|
+--------------------+--------------------+-------------+--------------------+----------+
In [32]: result.select(['prediction']).show() 	# show only the predicted value
+----------+
|prediction|
+----------+
|       1.0|
+----------+

3) Change 2 of the feature values in the first row (here the first and second values) and predict with the modified features.

# Modify the first row of data
In [33]: firstRaw = list(data1[0][1])
In [34]: firstRaw[0] = 2.7891
In [35]: firstRaw[1] = 0.0556

In [36]: predictedData = Vectors.dense(firstRaw)
In [37]: predictedData
Out[37]: DenseVector([2.7891, 0.0556, 0.6765, 0.2059, 0.0471, 0.0235, 0.4438, 0.0, 0.0, 0.0908, 0.0, 0.2458, 0.0039, 1.0, 1.0, 24.0, 0.0, 5424.0, 170.0, 8.0, 0.1529, 0.0791])

4) Predict on the new data

In [38]: predictedRaw = spark.createDataFrame([(predictedData,)], ["features"])
In [39]: predictedResult = model.transform(predictedRaw)
In [40]: predictedResult.show()
+--------------------+--------------------+-------------+--------------------+----------+
|            features|     indexedFeatures|rawPrediction|         probability|prediction|
+--------------------+--------------------+-------------+--------------------+----------+
|[2.7891,0.0556,0....|[2.7891,0.0556,0....|[274.0,310.0]|[0.46917808219178...|       1.0|
+--------------------+--------------------+-------------+--------------------+----------+
In [41]: predictedResult.select(['prediction']).show()
+----------+
|prediction|
+----------+
|       1.0|
+----------+

5) Next, use the test data to measure the accuracy of the decision tree

# Predict on the test set with the model
In [42]: predictedResultAll = model.transform(testData)

# View the predicted values
In [43]: predictedResultAll.select("prediction").show()
+----------+
|prediction|
+----------+
|       0.0|
|       0.0|
|       1.0|
|       1.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       1.0|
+----------+
only showing top 10 rows

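# The predictions are a DataFrame whose rows are Row objects, which cannot be modified,
# so convert the predictions to pandas and then to a list
In [44]:df_prediction = predictedResultAll.select("prediction").toPandas()
In [45]: dtPredictions = list(df_prediction.prediction)

# View the first 10 predictions
In [46]: dtPredictions[:10]
Out[46]: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]

# Count how many predictions are correct
In [47]: dtTotalCorrect = 0
# Get the total number of rows in the test set
In [48]: testRaw = testData.count()
# collect() returns Row objects, so extract the label value from each Row before comparing
In [49]: testLabel = [row.label for row in testData.select("label").collect()]
In [50]: 
for i in range(testRaw):
    if dtPredictions[i] == testLabel[i]:
        dtTotalCorrect += 1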

In [51]: dtTotalCorrect
Out[51]: 940

In [52]: 1.0 * dtTotalCorrect / testRaw
Out[52]: 0.6338503034389751
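
The counting loop above boils down to comparing two equally long lists element by element. A compact way to express this in plain Python, shown here on made-up values rather than the real test set, is sketched below; the accuracy helper is a hypothetical name.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / float(len(labels))

# Made-up values: predictions are floats, labels are ints, as in the chapter's data.
predictions = [0.0, 0.0, 1.0, 1.0, 0.0]
labels = [0, 1, 1, 1, 0]
acc = accuracy(predictions, labels)
print(acc)
```

Python compares 1.0 and 1 as equal, so float predictions can be checked against integer labels directly, provided both lists hold plain numbers and are in the same row order.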

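11.8 Model optimization
In the previous section we found that the accuracy of the decision tree was not very high, only about 63.39%. In this section we explore ways to improve the prediction accuracy.

11.8.1 Optimizing the features

1) First, reload some of the code used earlier.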

In [1]: from pyspark.ml.linalg import Vectors
In [2]: from pyspark.ml.classification import DecisionTreeClassifier
In [3]: from pyspark.ml.feature import VectorIndexer
In [4]: from pyspark.ml import Pipeline
In [5]: raw_data = sc.textFile("hdfs://master:9000/data/train_noheader.tsv")
In [6]: numRaws = raw_data.count()
In [7]: records = raw_data.map(lambda line: line.split('\t'))
In [8]: data = records.collect()
In [9]: numColumns = len(data[0])
In [10]: data1 = []

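2) The page category takes many different values, so it needs to be picked out and handled separately.

# Strip the quotes from the page category column (index 3, i.e. the fourth column)
In [11]: category = records.map(lambda x: x[3].replace("\"",""))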

Select the distinct page categories and sort them.

In [12]: categories = sorted(category.distinct().collect())
In [13]: categories
Out[13]: 
[u'?',
 u'arts_entertainment',
 u'business',
 u'computer_internet',
 u'culture_politics',
 u'gaming',
 u'health',
 u'law_crime',
 u'recreation',
 u'religion',
 u'science_technology',
 u'sports',
 u'unknown',
 u'weather']

3) Check the number of page categories.

In [14]: numCategories = len(categories)
In [15]: numCategories
Out[15]: 14

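4) Next, define a function that returns the one-hot list for a given page category.

In [16]: 
def transform_category(x):
    # one slot per category, all zero except the slot of category x
    markCategory = [0] * numCategories
    index = categories.index(x)
    markCategory[index] = 1
    return markCategory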
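
To make the encoding concrete, the following standalone sketch runs the same idea on a shortened category list. The four categories and the one_hot name are chosen only for illustration; the chapter's real list has 14 entries.

```python
# A shortened category list; sorting puts "?" first because it precedes letters in ASCII.
categories = sorted(["business", "sports", "?", "health"])

def one_hot(value, categories):
    """Return a list with a 1 at the position of value and 0 everywhere else."""
    vec = [0] * len(categories)
    vec[categories.index(value)] = 1
    return vec

encoded = one_hot("health", categories)
print(categories)
print(encoded)
```

Note that categories.index raises a ValueError for a value that is not in the list, so a category unseen when the list was built would need separate handling.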

5) With this processing, the single page-category feature is converted into 14 features, so the feature vector gains 14 entries overall. Next, add these features when building the records.

In [17]: 
for i in range(numRaws):
    trimmed = [ each.replace('"', "") for each in data[i] ]
    label = float(trimmed[-1])
    cate = transform_category(trimmed[3])  # call the function; returns the one-hot list for the category
    features = cate + [0.0 if x == "?" else float(x) for x in trimmed[4:numColumns - 1]]
    c = (label, Vectors.dense(features))
    data1.append(c)


6) Create a DataFrame.

In [18]: df= spark.createDataFrame(data1, ["label","features"])

# df is used repeatedly below, so cache it in memory
In [19]: df.cache()
Out[19]: DataFrame[label: double, features: vector]


7) Build the feature indexer.

In [21]: from pyspark.ml.feature import VectorIndexer
In [22]: featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(df)


8) Split the data into an 80% training set and a 20% test set.

In [23]: (trainingData, testData) = df.randomSplit([0.8, 0.2],seed=1234L)

In [24]: trainingData.count()
Out[24]: 5912

In [25]: testData.count()
Out[25]: 1483


9) Create the decision tree model.

In [26]: dt = DecisionTreeClassifier(maxDepth=5, labelCol="label", featuresCol="indexedFeatures", impurity="entropy")


10) Build the pipeline workflow.

In [27]: pipeline = Pipeline(stages=[featureIndexer, dt])
In [28]: model = pipeline.fit(trainingData)      ## train the model

11) Measure the decision tree's accuracy again on the test data.

In [29]: predictedResultAll = model.transform(testData)
In [30]: df_prediction = predictedResultAll.select("prediction").toPandas()
In [31]: dtPredictions = list(df_prediction.prediction)

# Count the correct predictions
In [32]: dtTotalCorrect = 0
# Total number of rows in the test set
In [33]: testRaw = testData.count()
In [49]: testLabel = testData.select("label").collect()
In [34]: 
for i in range(testRaw):
    if dtPredictions[i] == testLabel[i].label:  # collect() returns Row objects
        dtTotalCorrect += 1

In [35]: dtTotalCorrect
Out[35]: 967

In [36]: 1.0 * dtTotalCorrect / testRaw
Out[36]: 0.6520566419420094


As shown, accuracy rose to 65.2057%, up from 63.3850% before the optimization: a gain of about 1.82 percentage points, which is a noticeable improvement.

11.8.2 Cross-Validation and Parameter Grids

In [1]: from pyspark.ml.linalg import Vectors
In [2]: from pyspark.ml.classification import DecisionTreeClassifier
In [3]: from pyspark.ml.feature import VectorIndexer
In [4]: from pyspark.ml import Pipeline
In [5]: raw_data = sc.textFile("hdfs://master:9000/data/train_noheader.tsv")
In [6]: numRaws = raw_data.count()
In [7]: records = raw_data.map(lambda line: line.split('\t'))
In [8]: data = records.collect()
In [9]: numColumns = len(data[0])
In [10]: data1 = []
In [11]: category = records.map(lambda x: x[3].replace("\"",""))
In [12]: categories = sorted(category.distinct().collect())
In [13]: numCategories = len(categories)
In [14]: 
def transform_category(x):
    markCategory = [0] * numCategories
    index = categories.index(x)
    markCategory[index] = 1
    return markCategory
In [15]: 
for i in range(numRaws):
    trimmed = [ each.replace('"', "") for each in data[i] ]
    label = float(trimmed[-1])
    cate = transform_category(trimmed[3]) # call the function; returns a one-hot category list
    features = cate + map(lambda x: 0.0 if x == "?" else (x), trimmed[4:numColumns - 1])
    c = (label, Vectors.dense(map(float, features)))
    data1.append(c)
In [16]: df= spark.createDataFrame(data1, ["label","features"])
In [17]: df.cache()
In [18]: featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(df)
In [19]: dt = DecisionTreeClassifier(maxDepth=5, labelCol="label", featuresCol="indexedFeatures", impurity="entropy")
In [20]: pipeline = Pipeline(stages=[featureIndexer, dt])
In [21]: (trainingData, testData) = df.randomSplit([0.8, 0.2],seed=1234L)


Create the cross-validator and the parameter grid.

# Import the cross-validator and the parameter grid builder
In [22]: from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Import the binary classification evaluator
In [23]: from pyspark.ml.evaluation import BinaryClassificationEvaluator
In [24]: evaluator = BinaryClassificationEvaluator()  # create an evaluator
# Set up the parameter grid
In [25]: paramGrid = ParamGridBuilder().addGrid(dt.maxDepth, [4,5,6]).build()
# Configure cross-validation
In [26]: cv = CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)


Train the model with cross-validation.

In [27]: cvModel = cv.fit(trainingData)
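
To make concrete what a fit over a 3-value grid with 2 folds involves, here is a minimal plain-Python sketch of grid search with k-fold cross-validation. It is an illustration only, not Spark's implementation: the round-robin fold assignment is an assumption (Spark assigns rows to folds randomly), and `toy_score` is a hypothetical stand-in for "train on the training fold, evaluate on the validation fold".

```python
def k_fold_indices(n_rows, num_folds):
    """Split row indices 0..n_rows-1 into num_folds (train, validation) pairs."""
    folds = [list(range(f, n_rows, num_folds)) for f in range(num_folds)]
    pairs = []
    for f in range(num_folds):
        validation = folds[f]
        train = [i for g in range(num_folds) if g != f for i in folds[g]]
        pairs.append((train, validation))
    return pairs


def grid_search(param_values, n_rows, num_folds, score_fn):
    """Return (best_value, mean_scores); mean_scores maps each value to its mean fold score."""
    pairs = k_fold_indices(n_rows, num_folds)
    mean_scores = {}
    for value in param_values:
        scores = [score_fn(value, train, validation) for train, validation in pairs]
        mean_scores[value] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


def toy_score(max_depth, train, validation):
    # Hypothetical score that peaks at depth 5; a real run would fit and evaluate a model.
    return 1.0 - abs(max_depth - 5) * 0.1


best_depth, scores = grid_search([4, 5, 6], n_rows=10, num_folds=2, score_fn=toy_score)
print(best_depth, scores)
```

With 3 candidate depths and 2 folds, the scoring function runs 6 times; the candidate with the best mean score is then selected.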


Test the model.

In [28]: Predictions=cvModel.transform(testData)


Compute the accuracy.

In [29]: df_prediction = Predictions.select("prediction").toPandas()
In [30]: dtPredictions = list(df_prediction.prediction)

# Count the correct predictions
In [31]: dtTotalCorrect = 0
# Total number of rows in the test set
In [32]: testRaw = testData.count()
In [34]: testLabel = testData.select("label").collect()
In [33]: 
for i in range(testRaw):
    if dtPredictions[i] == testLabel[i].label:  # collect() returns Row objects
        dtTotalCorrect += 1

In [34]: dtTotalCorrect
Out[34]: 960

In [35]: 1.0 * dtTotalCorrect / testRaw
Out[35]: 0.6473364801078895

We can also inspect the parameters of the best model.

In [36]: bestmodel = cvModel.bestModel.stages[1]

In [37]: bestmodel.numFeatures 	# the tree uses 36 features
Out[37]: 36

In [38]: bestmodel.depth  # depth of the best tree is 6
Out[38]: 6

In [39]: bestmodel.numNodes  # number of nodes in the decision tree


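11.9 Running as a Script

11.9.1 Adding Configuration to the Script

Create a file named decisionTree.py and add the code below to configure and start PySpark. Then copy the code run above in the PySpark IPython session into this file.
The example program here is saved as /home/hadoop/projects/spark/pyspark/decisionTree.py.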

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Run Spark locally
conf = SparkConf().setMaster("local[*]")
sc = SparkContext(conf=conf)
spark = SparkSession.builder.master('local') \
                .appName('DecisionTree') \
                .config("spark.some.config.option", "some-value") \
                .getOrCreate()


11.9.2 Running the Script

Before Spark 2.0, the file was executed with pyspark decisionTree.py; from Spark 2.0 onward, use spark-submit decisionTree.py instead. Run spark-submit --help to see the available options.

$ spark-submit decisionTree.py


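11.10 Summary

This chapter introduced PySpark, the Python API for Spark. We applied the decision tree, one of the most common classification models, in PySpark; used a worked example to show how to clean and transform the data; analyzed why the classifier's accuracy still leaves room for improvement; and used visualization to compare accuracy across tree depths.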

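Chapter 1: Keras Basics

1.1 Introduction to Keras

TensorFlow and Theano are foundational frameworks for neural networks and machine learning. Building a neural network with them, especially a deep one, means symbolic programming: you define variables, graphs, layers, sessions, initialization, and the training algorithms yourself, which can be tedious, particularly for newcomers. Is there a simpler way? Keras is a good answer. Keras is a high-level neural network API written in pure Python that runs on top of the TensorFlow, Theano, or CNTK backends. It was built for fast experimentation, so that an idea can be turned into a result quickly.
Its main features:

Easy prototyping
Support for CNNs and RNNs, and combinations of the two
Seamless switching between CPU and GPU

Keras resources:
Official Keras site:
https://keras.io/
Keras Chinese documentation:
https://keras-cn.readthedocs.io/en/latest/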

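1.2 Installing Keras

1) Install Python 3.6
Installing through Anaconda is recommended. First download the latest Anaconda release, which supports Linux and Windows.
2) Install numpy and scipy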

conda install numpy  scipy


3) Install Theano

conda install theano


4) Install TensorFlow

pip install tensorflow        # CPU version
pip install tensorflow-gpu    # GPU version


(The GPU version requires a GPU card, with the GPU driver, CUDA, and related libraries installed.)
5) Install Keras

pip  install keras


6) Test
[Note] Ways to change the Keras backend

(1) Edit ~/.keras/keras.json

hadoop@master:~/.keras$ cat keras.json 
{
    "epsilon": 1e-07, 
    "floatx": "float32", 
    "image_data_format": "channels_last", 
    "backend": "tensorflow"
}


(2) Change it directly in client code
To switch the backend to Theano:

import os
os.environ['KERAS_BACKEND']='theano'


To switch the backend to TensorFlow (the default):

import os
os.environ['KERAS_BACKEND']='tensorflow'


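A change made in client code affects only the current script or session, and the environment variable must be set before keras is imported.

1.3 Common Keras Concepts

François Chollet, a pioneer of the AI era, created the open-source deep learning framework Keras used by countless developers. He currently works at Google, where he promotes tf.keras.
Before starting with Keras, we want to convey some basic concepts about Keras and deep learning. Newcomers are advised to skim this section before using Keras; it should reduce confusion later.

Symbolic computation

Keras uses Theano or TensorFlow as its underlying library; these are called Keras backends. Both Theano and TensorFlow are "symbolic" libraries.
As a result, programming with Keras differs somewhat from ordinary Python code. Broadly, symbolic computation first defines the variables and then builds a computation graph that specifies how those variables relate to one another. The finished graph must be compiled to fix its internal details, but at that point it is still an empty shell with no actual data in it. Only when you feed in the inputs does data flow through the model and produce output values.
It is like building a water supply from pipes: while you are fitting the pipes together there is no water in them, and only once every pipe is connected can water be sent through.
A symbolic computation is also described as a data flow graph. The figure below is a classic visualization of data flow computation.
[Figure: saddle_point_evaluation_optimizers]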

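Tensors

A tensor is a natural generalization of vectors and matrices and can represent a wide range of data. The order (rank) of a tensor is also called its number of dimensions.
An order-0 tensor is a scalar: a single number.
An order-1 tensor is a vector: an ordered list of numbers.
An order-2 tensor is a matrix: an ordered list of vectors.
An order-3 tensor is a cube: a stack of matrices.
An order-4 tensor ...
and so on.
Key point: understanding "dimension"
Take a list of length 10. Viewed horizontally it shows 10 numbers, which can be called 10 dimensions (the length of the vector); viewed vertically it shows only 1 number, which is called 1 dimension (the number of axes). Keeping the two meanings apart helps with the dimension issues that come up in Keras and in neural network computation.
The order of a tensor is sometimes also called its dimensionality or its number of axes. For example, the matrix [[1,2],[3,4]] is an order-2 tensor with two dimensions, or axes. Along axis 0 (to match Python's counting, dimensions and axes are numbered from 0 in this document) you see the two vectors [1,2] and [3,4]; along axis 1 you see the two vectors [1,3] and [2,4].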
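
The axis description above can be checked with nested Python lists. This is only a sketch: the helper `shape` is a hypothetical name introduced here, and it assumes regular (non-ragged) nested lists.

```python
def shape(t):
    """Return the shape of a regular nested list as a tuple."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)


matrix = [[1, 2], [3, 4]]  # an order-2 tensor

along_axis0 = [list(row) for row in matrix]        # the vectors seen along axis 0
along_axis1 = [list(col) for col in zip(*matrix)]  # the vectors seen along axis 1

print(shape(matrix), along_axis0, along_axis1)
```

The length of `shape(matrix)` is the order of the tensor, while each entry is the size along one axis.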
Data format (data_format)
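There are currently two main conventions for representing image tensors:
a) th mode, or channels_first, used by Theano and Caffe.
b) tf mode, or channels_last, used by TensorFlow.
The mode can be changed by editing image_data_format in the configuration file ~/.keras/keras.json.

An example shows the difference between the two modes.
For 100 RGB (3-channel) color images of size 16×32 (height 16, width 32):
th representation: (100,3,16,32)
tf representation: (100,16,32,3)
The only difference is the position of the channel count 3.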
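
As a small illustration, the two layouts differ only by a permutation of the shape tuple. The function names below are hypothetical helpers, not part of Keras.

```python
def to_channels_last(shape):
    """(batch, channels, height, width) -> (batch, height, width, channels)."""
    n, c, h, w = shape
    return (n, h, w, c)


def to_channels_first(shape):
    """(batch, height, width, channels) -> (batch, channels, height, width)."""
    n, h, w, c = shape
    return (n, c, h, w)


th_shape = (100, 3, 16, 32)
tf_shape = to_channels_last(th_shape)
print(tf_shape)
```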

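Models

Keras has two types of model: the sequential model (Sequential) and the functional model (Model). The functional model is the more general of the two, and the sequential model is a special case of it.
a) Sequential model (Sequential): single input, single output, one straight path from start to finish. Layers connect only to their neighbors, with no cross-layer connections. Such models compile quickly and are simple to work with.
b) Functional model (Model): multiple inputs and outputs, with arbitrary connections between layers. Such models compile more slowly.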

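batch

Strictly speaking this concept is independent of Keras and does not belong here, but it comes up so often, and the function documentation is so hard to read without it, that a brief explanation is worthwhile.
The optimization algorithms of deep learning are, put simply, gradient descent. There are two basic ways to perform each parameter update.
The first is to go through the whole dataset to compute the loss function once, then compute the gradient of the loss with respect to each parameter and update the parameters. Every update requires a pass over all samples, so it is computationally expensive, slow, and unsuitable for online learning. This is called batch gradient descent.
The other is to compute the loss after each individual sample and update the parameters from its gradient. This is called stochastic gradient descent. It is fast, but convergence is poor: it may wander around the optimum without reaching it, and two successive updates may cancel each other out, making the objective function oscillate sharply.
To overcome the drawbacks of both, the usual compromise today is mini-batch gradient descent. The data is divided into batches and the parameters are updated once per batch. Because the samples in a batch jointly determine the direction of the gradient, the descent is less likely to drift and is less random. At the same time, a batch is much smaller than the full dataset, so each update is cheap to compute.
Practically all gradient descent today is mini-batch based, which is why batch_size appears so often in Keras modules.

epochs

epochs is the number of full passes made over the training data during training.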
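
The relationship between batch_size, epochs, and the number of parameter updates can be sketched as follows. The helper names are hypothetical, and the sketch ignores shuffling, which real training loops usually add.

```python
def iterate_minibatches(n_samples, batch_size):
    """Yield lists of sample indices, one list per mini-batch."""
    for start in range(0, n_samples, batch_size):
        yield list(range(start, min(start + batch_size, n_samples)))


def count_updates(n_samples, batch_size, epochs):
    """One update per mini-batch, repeated for every epoch."""
    batches_per_epoch = (n_samples + batch_size - 1) // batch_size
    return batches_per_epoch * epochs


batches = list(iterate_minibatches(10, 4))
print(batches, count_updates(10, 4, epochs=3))
```

Setting batch_size to the dataset size gives batch gradient descent, and setting it to 1 gives stochastic gradient descent.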

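1.4 Keras and TensorFlow

1.5 Main Keras Modules

[Note]
Only some commonly used modules are covered here. For more, or for more detail, see the Keras Chinese documentation:
https://keras-cn.readthedocs.io/en/latest/
We first take an overall look at the main Keras modules and common layers, as shown in the figure below, and then describe each module and layer in detail.

Figure source: http://blog.csdn.net/zdy0_2004/article/details/74736656

1.5.1 Optimizers (optimizers)

An optimizer is the method used to adjust the weights of each node. Here is a code example: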

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD
model = Sequential()
model.add(Dense(64, init='uniform', input_dim=10))
model.add(Activation('tanh'))
model.add(Activation('softmax'))
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)


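The optimizer is defined before the model is compiled and is passed to compile as one of its two arguments.
sgd in the code is the stochastic gradient descent algorithm.
lr is the learning rate.
momentum is the momentum term.
decay is the learning-rate decay coefficient (in Keras it is applied at every parameter update).
nesterov is True or False and controls whether Nesterov momentum is used.
Besides SGD, the available optimizers include RMSprop (well suited to recurrent neural networks), Adagrad, Adadelta, Adam, Adamax, and Nadam.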
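
To show how lr, decay, momentum, and nesterov interact, here is a plain-Python sketch of one SGD update on a single scalar parameter. It follows the update rule as the Keras SGD optimizer is commonly described, but it is an illustration under that assumption rather than the library's code; `sgd_step` is a hypothetical name.

```python
def sgd_step(param, grad, velocity, iteration,
             lr=0.01, decay=1e-6, momentum=0.9, nesterov=True):
    """Return the updated (param, velocity) after one SGD step."""
    lr_t = lr / (1.0 + decay * iteration)          # decayed learning rate
    velocity = momentum * velocity - lr_t * grad   # momentum accumulates past gradients
    if nesterov:
        param = param + momentum * velocity - lr_t * grad
    else:
        param = param + velocity
    return param, velocity


# Minimize f(w) = w^2, whose gradient is 2w, starting from w = 1.
w, v = 1.0, 0.0
for t in range(200):
    w, v = sgd_step(w, 2.0 * w, v, t)
print(w)
```

With momentum set to 0 and decay set to 0, the step reduces to the plain rule param = param - lr * grad.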

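1.5.2 Objective Functions (objectives)

The objective function, also called the loss function (loss), measures the difference between the network's output and the sample labels. Code example: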

model = Sequential()
model.add(Dense(64, init='uniform', input_dim=10))
model.add(Activation('tanh'))
model.add(Activation('softmax'))
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)


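mean_squared_error is the name of the loss function.
The available loss functions include:
mean_squared_error, mean_absolute_error, squared_hinge, hinge, binary_crossentropy, categorical_crossentropy
binary_crossentropy and categorical_crossentropy are cross-entropy losses, also known as log loss, and are generally used for classification models.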
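
For reference, two of these losses can be written out in plain Python. This is a sketch of the standard formulas, not the Keras backend code; the clipping constant `eps` is an assumption added to avoid log(0).

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Average log loss for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


print(mean_squared_error([1.0, 0.0], [0.5, 0.5]))
print(binary_crossentropy([1, 0], [0.5, 0.5]))
```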

1.5.3 Activation Functions (activations)

Each neural network layer is normally followed by an activation function. Code example:

from keras.layers.core import Activation, Dense

model.add(Dense(64))
model.add(Activation('tanh'))

# or merge the two lines above into one:
model.add(Dense(64, activation='tanh'))


The available activation functions include:
linear, sigmoid, hard_sigmoid, tanh, softplus, relu, softmax, softsign
There are also advanced activations such as PReLU and LeakyReLU.
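
A few of these activations are simple enough to write out directly. The sketch below uses the standard mathematical definitions in plain Python; it is meant as a reference for what the names compute, not as the Keras implementation.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    return math.log(1.0 + math.exp(x))


def softmax(xs):
    """Normalize a list of scores into probabilities that sum to 1."""
    m = max(xs)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


print(relu(-2.0), sigmoid(0.0), softmax([1.0, 2.0, 3.0]))
```

relu, sigmoid, and softplus act element-wise, whereas softmax operates on a whole vector of scores.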

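1.5.4 Parameter Initialization (Initializations)

This module is used when a layer is added: init is called to initialize that layer's weights. There are two ways to specify the initialization.

1.5.4.1 By naming an initialization method

Example code: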

model.add(Dense(64, init='uniform'))


The available initialization methods include:
uniform、lecun_uniform、normal、orthogonal、zero、glorot_normal、he_normal等。

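1.5.4.2 By passing a callable

The callable must take two parameters, shape (the shape of the variable to initialize) and name (the name of the variable), and must return a Keras variable, such as the one returned by K.variable(). Example code: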

from keras import backend as K
import numpy as np

def my_init(shape, name=None):
    value = np.random.random(shape)
    return K.variable(value, name=name)
model.add(Dense(64, init=my_init))


Or:

from keras import initializations
def my_init(shape, name=None):
    return initializations.normal(shape, scale=0.01, name=name)
model.add(Dense(64, init=my_init))


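So the initial weights of each layer can be set with the library's built-in methods,
or you can initialize the weights yourself, in which case you can control the weight of every individual node.

1.5.5 Common Layers (layer)

Keras layers mainly include:
core layers (Core), convolutional layers (Convolutional), pooling layers (Pooling), locally connected layers, recurrent layers (Recurrent), embedding layers (Embedding), advanced activation layers, normalization layers, noise layers, and wrapper layers. You can also write your own layers.

1.5.5.1 Dense layer (fully connected layer)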

keras.layers.core.Dense(units, activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
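Dense is the standard fully connected layer. It computes output = activation(dot(input, kernel)+bias), where activation is an element-wise activation function, kernel is the layer's weight matrix, and bias is the bias vector, added only when use_bias=True. If the layer's input has more than 2 dimensions, it is first flattened to a size compatible with kernel.
Parameters:
units: a positive integer, the output dimension of the layer.
activation: the activation function, given as the name of a predefined activation (see Activation Functions) or as an element-wise Theano function. If not specified, no activation is applied (that is, the linear activation a(x)=x).
use_bias: Boolean, whether to use a bias term.
kernel_initializer: initialization method for the weights, given as the name of a predefined initializer or as an initializer object. See initializers.
bias_initializer: initialization method for the bias vector, given as the name of a predefined initializer or as an initializer object. See initializers.
kernel_regularizer: regularizer applied to the weights, a Regularizer object.
bias_regularizer: regularizer applied to the bias vector, a Regularizer object.
activity_regularizer: regularizer applied to the output, a Regularizer object.
kernel_constraint: constraint applied to the weights, a Constraints object.
bias_constraint: constraint applied to the bias, a Constraints object.
Input
An nD tensor of shape (batch_size, ..., input_dim); the most common case is a 2D tensor of shape (batch_size, input_dim).
Output
An nD tensor of shape (batch_size, ..., units); the most common case is a 2D tensor of shape (batch_size, units).
Example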

# as first layer in a sequential model:
model = Sequential()
model.add(Dense(32, input_shape=(16,)))
# model.add(Dense(32, input_dim=16))
# now the model will take as input arrays of shape (*, 16)
# and output arrays of shape (*, 32)

# after the first layer, you don't need to specify the size of the input anymore
model.add(Dense(32))
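
The formula output = activation(dot(input, kernel)+bias) can be traced by hand with nested lists. The sketch below is a plain-Python illustration for 2D input only; `dense_forward` and the example weights are hypothetical.

```python
def dense_forward(inputs, kernel, bias=None, activation=None):
    """inputs: batch_size x input_dim, kernel: input_dim x units, bias: units."""
    units = len(kernel[0])
    outputs = []
    for row in inputs:
        out = []
        for j in range(units):
            z = sum(row[i] * kernel[i][j] for i in range(len(row)))
            if bias is not None:
                z += bias[j]
            out.append(activation(z) if activation else z)
        outputs.append(out)
    return outputs


def relu(x):
    return max(0.0, x)


kernel = [[1.0, 0.0, -1.0],
          [0.5, 1.0, 0.0]]      # input_dim = 2, units = 3
bias = [0.0, 1.0, 0.0]
result = dense_forward([[1.0, 2.0]], kernel, bias, activation=relu)
print(result)
```

A batch of shape (1, 2) produces an output of shape (1, 3), matching the (batch_size, units) rule above.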


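1.5.5.2 Flatten layer

The Flatten layer flattens its input, turning a multi-dimensional input into a one-dimensional one. It is commonly used in the transition from convolutional layers to fully connected layers. Flatten does not affect the batch size.
keras.layers.core.Flatten()
Example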

model = Sequential()
model.add(Convolution2D(64, 3, 3,
            border_mode='same',
            input_shape=(3, 32, 32)))
# now: model.output_shape == (None, 64, 32, 32)

model.add(Flatten())
# now: model.output_shape == (None, 65536)


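1.5.5.3 Dropout layer

Applies Dropout to the input. At each parameter update during training, Dropout randomly disconnects a given fraction (p) of the input units. The Dropout layer is used to prevent overfitting.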
keras.layers.core.Dropout(p)
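
The effect of the layer on one vector of activations can be sketched in plain Python. This assumes the common "inverted dropout" formulation, in which surviving units are scaled by 1/(1-p) during training so that nothing needs to change at inference time; the function name and the fixed seeds are illustrative choices.

```python
import random


def dropout(values, p, training=True, rng=None):
    """Zero each value with probability p during training; rescale the rest."""
    if not training or p == 0.0:
        return list(values)
    rng = rng or random.Random(0)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, p=0.5, rng=random.Random(42))
print(dropped)
print(dropout(activations, p=0.5, training=False))
```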

1.5.5.4 Convolutional Layers (Convolutional)

1.5.5.4.1 Conv1D layer

keras.layers.convolutional.Conv1D(filters, kernel_size, strides=1, padding='valid', dilation_rate=1, activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
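A 1D convolution layer (temporal convolution) that filters a one-dimensional input signal over neighboring steps. When this layer is the first layer of a model, the keyword argument input_shape must be provided. For example, (10,128) represents a sequence of length 10 in which each step is a 128-dimensional vector, and (None, 128) represents a variable-length sequence of 128-dimensional vectors.
The layer convolves the input with its kernels along a single spatial (or temporal) dimension. If use_bias=True, a bias term is added; if activation is not None, the output is passed through the activation function.
Parameters
filters: the number of convolution kernels (that is, the output dimension).
kernel_size: an integer or a list/tuple of a single integer, the length of the spatial or temporal convolution window.
strides: an integer or a list/tuple of a single integer, the stride of the convolution. Any strides value other than 1 is incompatible with any dilation_rate value other than 1.
padding: the zero-padding strategy, one of "valid", "same", or "causal". "causal" produces a causal (dilated) convolution, in which output[t] does not depend on input[t+1:]; this is useful for modeling temporal signals whose time order must not be violated. See WaveNet: A Generative Model for Raw Audio, section 2.1. "valid" performs only valid convolutions, that is, the boundary is not padded. "same" keeps the convolution results at the boundary, so the output shape usually equals the input shape.
activation: the activation function, given as the name of a predefined activation (see Activation Functions) or as an element-wise Theano function. If not specified, no activation is applied (that is, the linear activation a(x)=x).
dilation_rate: an integer or a list/tuple of a single integer, the dilation rate for dilated convolution. Any dilation_rate value other than 1 is incompatible with any strides value other than 1.
use_bias: Boolean, whether to use a bias term.
kernel_initializer: initialization method for the weights, given as the name of a predefined initializer or as an initializer object. See initializers.
bias_initializer: initialization method for the bias vector, given as the name of a predefined initializer or as an initializer object. See initializers.
kernel_regularizer: regularizer applied to the weights, a Regularizer object.
bias_regularizer: regularizer applied to the bias vector, a Regularizer object.
activity_regularizer: regularizer applied to the output, a Regularizer object.
kernel_constraint: constraint applied to the weights, a Constraints object.
bias_constraint: constraint applied to the bias, a Constraints object.
Input shape
A 3D tensor of shape (samples, steps, input_dim).
Output shape
A 3D tensor of shape (samples, new_steps, filters). The value of steps may change because of padding or strides.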
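
How new_steps follows from steps can be sketched with the usual output-length formula for 1D convolution. The function below is a hypothetical helper written for illustration, assuming the standard definitions of "valid", "same", and "causal" padding.

```python
def conv1d_output_length(steps, kernel_size, padding='valid', strides=1, dilation_rate=1):
    """Number of output steps produced by a 1D convolution."""
    # A dilated kernel covers kernel_size taps spread dilation_rate apart.
    dilated_size = kernel_size + (kernel_size - 1) * (dilation_rate - 1)
    if padding in ('same', 'causal'):
        length = steps
    elif padding == 'valid':
        length = steps - dilated_size + 1
    else:
        raise ValueError('unknown padding: %s' % padding)
    return (length + strides - 1) // strides


print(conv1d_output_length(10, 3))                   # valid padding
print(conv1d_output_length(10, 3, padding='same'))   # same padding
```

For input_shape (10,128) and filters=64, a "valid" convolution with kernel_size=3 would therefore produce an output of shape (samples, 8, 64).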

1.5.5.4.2 Conv2D layer

keras.layers.convolutional.Conv2D(filters, kernel_size, strides=(1, 1), padding='valid', data_format=None, dilation_rate=(1, 1), activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
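A 2D convolution layer, that is, spatial convolution over images. The layer applies a sliding-window convolution to a two-dimensional input. When it is used as the first layer, provide the input_shape argument. For example, input_shape = (128,128,3) represents a 128*128 RGB color image (data_format='channels_last').

1.5.5.4.3 Conv3D layer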

keras.layers.convolutional.Conv3D(filters, kernel_size, strides=(1, 1, 1), padding='valid', data_format=None, dilation_rate=(1, 1, 1), activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
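A 3D convolution layer applies a sliding-window convolution to a three-dimensional input. When it is used as the first layer, provide the input_shape argument. For example, input_shape = (3,10,128,128) represents 10 frames of 128*128 RGB color images. The position of the channel axis is again determined by the data_format argument.

1.5.5.5 Pooling Layers

1.5.5.5.1 MaxPooling1D layer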

keras.layers.pooling.MaxPooling1D(pool_size=2, strides=None, padding='valid')
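Max pooling over a temporal (1D) signal.
Parameters
pool_size: integer, the size of the pooling window.
strides: integer or None, the downsampling factor. For example, 2 halves the length of the input. If None, it defaults to pool_size.
padding: 'valid' or 'same'.
Input shape
A 3D tensor of shape (samples, steps, features).
Output shape
A 3D tensor of shape (samples, downsampled_steps, features).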
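
For a single sample and a single feature, max pooling reduces to taking the maximum over sliding windows. The sketch below covers only 'valid' padding and uses a hypothetical helper name.

```python
def max_pooling_1d(sequence, pool_size=2, strides=None):
    """'valid' max pooling over a list of numbers (one feature of one sample)."""
    if strides is None:
        strides = pool_size  # same default as described above
    last_start = len(sequence) - pool_size
    return [max(sequence[s:s + pool_size]) for s in range(0, last_start + 1, strides)]


pooled = max_pooling_1d([1, 3, 2, 5, 4, 0])
print(pooled)
```

With the default strides, 6 input steps become 3 downsampled steps, which is the halving mentioned for a factor of 2.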

1.5.5.5.2 MaxPooling2D layer

keras.layers.pooling.MaxPooling2D(pool_size=(2, 2), strides=None, padding='valid', data_format=None)
Max pooling over a spatial signal.

1.5.5.5.3 AveragePooling1D layer

keras.layers.pooling.AveragePooling1D(pool_size=2, strides=None, padding='valid')
Average pooling over a temporal (1D) signal.

1.5.5.5.4 AveragePooling2D layer

keras.layers.pooling.AveragePooling2D(pool_size=(2, 2), strides=None, padding='valid', data_format=None)
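Average pooling over a spatial signal.

1.5.5.6 Recurrent Layers (Recurrent)

There are three kinds of recurrent layer: LSTM, GRU, and SimpleRNN.
All of them inherit from the abstract Recurrent layer, so the parameters below can be used with any recurrent layer.

1.5.5.6.1 Recurrent (abstract layer, cannot be used directly)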

keras.layers.recurrent.Recurrent(return_sequences=False, go_backwards=False, stateful=False, unroll=False, implementation=0)
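weights: a list of numpy arrays used to initialize the weights. The list has shapes [(input_dim, output_dim),(output_dim, output_dim),(output_dim,)].
return_sequences: Boolean, default False; controls the return type. If True, the whole output sequence is returned; otherwise only the last output of the sequence.
go_backwards: Boolean, default False. If True, the input sequence is processed in reverse and the reversed sequence is returned.
stateful: Boolean, default False. If True, the final state of the sample at index i in one batch is used as the initial state of the sample at index i in the next batch.
unroll: Boolean, default False. If True, the recurrent layer is unrolled; otherwise a symbolic loop is used. With the TensorFlow backend the recurrent network is always unrolled, so this option has no effect. Unrolling uses more memory but speeds up the RNN, and is only suitable for short sequences.
implementation: 0, 1, or 2. With 0, the RNN uses fewer but larger matrix multiplications, which runs faster on CPU but uses more memory. With 1, it uses more but smaller matrix multiplications, which is slower on CPU, faster on GPU, and uses less memory. With 2 (LSTM and GRU only), the input, forget, and output gates are merged into a single matrix for a more efficient GPU implementation. Note that RNN dropout must then be shared across all gates, which slightly weakens the regularization effect.
input_dim: the input dimension. Specify it (or, equivalently, input_shape) when this layer is the first layer of the model.
input_length: the length of the input sequences when that length is fixed. It must be specified if a Flatten layer and then a Dense layer follow this layer; otherwise the output size of the fully connected layer cannot be computed. Note that if the recurrent layer is not the first layer of the network, the sequence length must be specified in the first layer (through input_shape).

1.5.5.6.2 SimpleRNN (fully connected RNN)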

keras.layers.SimpleRNN(units, activation='tanh', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, return_sequences=False, return_state=False, go_backwards=False, stateful=False, unroll=False)
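A fully connected RNN in which the output is fed back to the input.
Parameters
• units: the output dimension
• activation: the activation function, given as the name of a predefined activation (see Activation Functions)
• use_bias: Boolean, whether to use a bias term
• kernel_initializer: initialization method for the weights, given as the name of a predefined initializer or as an initializer object. See initializers
• recurrent_initializer: initialization method for the recurrent kernel, given as the name of a predefined initializer or as an initializer object. See initializers
• bias_initializer: initialization method for the bias vector, given as the name of a predefined initializer or as an initializer object. See initializers
• kernel_regularizer: regularizer applied to the weights, a Regularizer object
• bias_regularizer: regularizer applied to the bias vector, a Regularizer object
• recurrent_regularizer: regularizer applied to the recurrent kernel, a Regularizer object
• activity_regularizer: regularizer applied to the output, a Regularizer object
• kernel_constraint: constraint applied to the weights, a Constraints object
• recurrent_constraint: constraint applied to the recurrent kernel, a Constraints object
• bias_constraint: constraint applied to the bias, a Constraints object
• dropout: a float between 0 and 1, the fraction of units dropped for the linear transformation of the inputs
• recurrent_dropout: a float between 0 and 1, the fraction of units dropped for the linear transformation of the recurrent state
• For the remaining parameters, see the description of Recurrent

Input shape
A 3D tensor of shape (samples, timesteps, input_dim).
Output shape
If return_sequences=True: a 3D tensor of shape (samples, timesteps, output_dim).
Otherwise: a 2D tensor of shape (samples, output_dim).
Example (shown with LSTM; the same arguments apply to every recurrent layer):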

# as the first layer in a Sequential model
model = Sequential()
model.add(LSTM(32, input_shape=(10, 64)))
# now model.output_shape == (None, 32)
# note: `None` is the batch dimension.

# the following is equivalent to the above:
model = Sequential()
model.add(LSTM(32, input_dim=64, input_length=10))

# for subsequent layers, no need to specify the input size:
model.add(LSTM(16))

# to stack recurrent layers, you must use return_sequences=True
# on any recurrent layer that feeds into another recurrent layer.
# note that you only need to specify the input size on the first layer.
model = Sequential()
model.add(LSTM(64, input_dim=64, input_length=10, return_sequences=True))
model.add(LSTM(32, return_sequences=True))
model.add(LSTM(10))


1.5.5.6.3 LSTM layer

keras.layers.recurrent.LSTM(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0)
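The Keras long short-term memory (LSTM) layer.
unit_forget_bias: Boolean. If True, the forget-gate bias is initialized to 1, which is the recommended setting (Keras 1 exposed this as forget_bias_init).
recurrent_activation: the activation function of the inner cells (inner_activation in Keras 1).

1.5.5.6.4 GRU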

keras.layers.recurrent.GRU(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0)
Gated recurrent unit (GRU).

1.5.5.6.5 Embedding layer

keras.layers.embeddings.Embedding(input_dim, output_dim, init='uniform', input_length=None, W_regularizer=None, activity_regularizer=None, W_constraint=None, mask_zero=False, weights=None, dropout=0.0)
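Can only be used as the first layer of a model.
mask_zero: Boolean, whether the input value 0 should be treated as padding to be ignored. This is useful when recurrent layers process variable-length input. If set to True, all subsequent layers in the model must support masking; otherwise an exception is raised.

1.5.6 Models

The model is the central module: it combines the basic components defined above.
Model methods:
model.summary(): prints a summary of the model.
model.get_config(): returns a Python dictionary containing the model configuration.
model.get_weights(): returns the model's weights as a list of numpy arrays.
model.set_weights(): loads weights into the model from numpy arrays.
model.to_json(): returns a JSON string representing the model. It contains only the network structure, not the weights. The original model can be rebuilt from the JSON string: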

from keras.models import model_from_json

json_string = model.to_json()
model = model_from_json(json_string)

model.to_yaml(): similar to model.to_json(); the model can likewise be rebuilt from the resulting YAML string:

from keras.models import model_from_yaml

yaml_string = model.to_yaml()
model = model_from_yaml(yaml_string)


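model.save_weights(filepath): saves the model weights to the given path as an HDF5 file (extension .h5).
model.load_weights(filepath, by_name=False): loads weights from an HDF5 file into the current model. By default the model architecture is assumed to be unchanged. To load weights into a different model that shares some layers, set by_name=True; only layers with matching names will then receive weights.
Keras has two kinds of model: the Sequential model and the generic (functional) model.

1.5.6.1 Sequential model

A Sequential model is a linear stack of layers.
You can construct one by passing a list of layers to Sequential: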

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential([Dense(32, input_dim=784),
Activation('relu'),
Dense(10),
Activation('softmax'),])


You can also add layers to the model one at a time with the .add() method:

model = Sequential()
model.add(Dense(32, input_dim=784))
model.add(Activation('relu'))


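Two Sequential models can also be combined in some way through merge.
The Merge layers provide a set of layer objects and functions for merging two layers or two tensors. Names that begin with an uppercase letter are Layer classes; names that begin with a lowercase letter are tensor functions, which internally call the corresponding uppercase layer.
keras.engine.topology.Merge(layers=None, mode='sum', concat_axis=-1, dot_axes=-1, output_shape=None, node_indices=None, tensor_indices=None, name=None)
layers: a list of Keras tensors or a list of Keras layer objects. The list must contain more than one element.
mode: the merge mode. If a string, one of {"sum", "mul", "concat", "ave", "cos", "dot"}.
sum and mul simply add or multiply the outputs of the merged layers element-wise, so those outputs must have the same shape. concat joins the outputs along the last dimension, so they may differ only in the last dimension.
Merge is a layer object. In a network built from several Sequential models, the training data must be passed as follows:
x: the input data. If the model has one input, x is a numpy array; if the model has several inputs, x should be a list whose elements are the numpy arrays for the respective inputs.
y: the labels, a numpy array.
Otherwise a runtime error will most likely report that the input dimensions do not match the expected ones.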
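
The sum, mul, ave, and concat modes can be illustrated on plain lists, treating each list as the output of one branch for a single sample. This is a sketch of the arithmetic only; the function `merge` below is a hypothetical stand-in and does not reproduce the Keras layer.

```python
def merge(outputs, mode='sum'):
    """Combine the outputs of several branches (lists of numbers) for one sample."""
    if mode == 'concat':
        return [v for out in outputs for v in out]
    if len(set(len(out) for out in outputs)) != 1:
        raise ValueError('sum, mul and ave need outputs of identical shape')
    if mode == 'sum':
        return [sum(vals) for vals in zip(*outputs)]
    if mode == 'ave':
        return [sum(vals) / float(len(vals)) for vals in zip(*outputs)]
    if mode == 'mul':
        merged = []
        for vals in zip(*outputs):
            product = 1
            for v in vals:
                product *= v
            merged.append(product)
        return merged
    raise ValueError('unsupported mode: %s' % mode)


left, right = [1, 2], [3, 4]
print(merge([left, right], 'sum'), merge([left, right], 'mul'), merge([left, right], 'concat'))
```

Only concat changes the size of the result, which is why the other modes require identical shapes.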
Add
keras.layers.Add()
A layer that adds a list of inputs.
It takes a list of tensors, all of the same shape, and returns their sum, with the shape unchanged.

import keras

input1 = keras.layers.Input(shape=(16,))
x1 = keras.layers.Dense(8, activation='relu')(input1)
input2 = keras.layers.Input(shape=(32,))
x2 = keras.layers.Dense(8, activation='relu')(input2)
added = keras.layers.Add()([x1, x2])  # equivalent to added = keras.layers.add([x1, x2])

out = keras.layers.Dense(4)(added)
model = keras.models.Model(inputs=[input1, input2], outputs=out)


Concatenate
keras.layers.Concatenate(axis=-1)
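It takes a list of tensors, all of the same shape except along the concatenation axis, and returns a single tensor formed by concatenating them along the given axis.

1.5.6.2 Functional model

In the Keras 2 documentation this term is rendered as "functional" (earlier versions called it the generic model); readers familiar with functional programming should quickly grasp what this kind of model is meant to express. The functional model is called Functional, but its class name is Model, so we sometimes use Model to refer to it.
The Keras functional API is the way to define complex models such as multi-output models, directed acyclic graphs, or models with shared layers. In short, whenever your model is not a single straight path like VGG, or it needs more than one output, you should choose the functional model. The functional model is the most general kind of model; the sequential model (Sequential) is just a special case of it.
This part of the documentation assumes you are already fairly familiar with the Sequential model.
Let us start with a simple model.
First model: a fully connected network
Sequential is of course the best way to implement a fully connected network, but starting from a simple fully connected network helps in learning this material. Before we begin, a few concepts need clarifying:
A layer object takes a tensor as its argument and returns a tensor.
A framework whose input is a tensor and whose output is also a tensor is a model, defined with Model.
Such a model can be trained just like a Keras Sequential model.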

from keras.layers import Input, Dense
from keras.models import Model

# This returns a tensor
inputs = Input(shape=(784,))

# a layer instance is callable on a tensor, and returns a tensor
x = Dense(64, activation='relu')(inputs)
x = Dense(64, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)

# This creates a model that includes
# the Input layer and three Dense layers
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
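model.fit(data, labels)  # starts training; data and labels are assumed to be NumPy arrays prepared elsewhere, with shapes (n, 784) and (n, 10)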

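All models are callable, just like layers
With the functional API it is easy to reuse a trained model: you can treat the model as if it were a layer and call it on a tensor. Note that calling a model reuses not only its architecture but also its weights.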

x = Input(shape=(784,))
# This works, and returns the 10-way softmax we defined above.
y = model(x)

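A typical use case for the functional model is building models with multiple inputs and multiple outputs, as in the following example: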

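import keras
from keras.layers import Input, Dense

# lstm_out is assumed to be the output tensor of an LSTM branch built earlier (not shown in this excerpt)
auxiliary_input = Input(shape=(5,), name='aux_input')
x = keras.layers.concatenate([lstm_out, auxiliary_input])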

# We stack a deep densely-connected network on top
x = Dense(64, activation='relu')(x)
x = Dense(64, activation='relu')(x)
x = Dense(64, activation='relu')(x)

# And finally we add the main logistic regression layer
main_output = Dense(1, activation='sigmoid', name='main_output')(x)

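Chapter 2 The Keras Workflow

2.1 Workflow overview

Step 1: Construct the data: define the input data
Step 2: Build the model: define how the variables are computed from one another
Step 3: Compile the model: compilation fixes the model's internal details
Step 4: Train the model: feed in the data and train
Step 5: Test the model
Step 6: Save the model
These steps can be pictured as:

2.2 Example: the workflow in detail

Step 1: Construct the data
The data must be shaped the way the model's fit (training) call expects. Here we use numpy to build two matrices: a data matrix and a label matrix.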

import numpy as np
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adam

x_train = np.random.random((1000, 784))
y_train = np.random.randint(2, size=(1000, 1))
x_test = np.random.random((200, 784))
y_test = np.random.randint(2, size=(200, 1))

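The random matrices are generated with numpy's random module. The data matrix has 1000 rows and 784 columns and the label matrix has 1000 rows and 1 column, so each row of the data matrix is one 784-dimensional sample.
Step 2: Build the model
Now we build a neural network model. Keras can build deep learning models either as a sequential model (based on the Sequential class) or as a functional model, also called the general model (based on the Model class). The two differ in the network topologies they can express. Here we use the sequential model.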

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=784))
model.add(Dense(1, activation='sigmoid'))

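In this step you can add several layers, or use merge to combine two models.
Step 3: Compile the model
We compile the model built in the previous step and specify some of its settings: optimizer, loss (the objective or loss function), metrics (the measures used to evaluate the model), and so on. The loss function and the optimizer are required when compiling a model.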

model.compile(optimizer='Adam',loss='binary_crossentropy',metrics=['accuracy'])
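
To make the binary_crossentropy loss less of a black box, here is a minimal NumPy sketch of the quantity it measures. It is not the Keras implementation; the function name, the clipping constant eps, and the sample predictions are assumptions made for illustration.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; eps is an assumed clipping constant that avoids log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident = np.array([0.9, 0.1, 0.8, 0.2])   # predictions close to the labels
uninformed = np.full(4, 0.5)                 # a model that always outputs 0.5

loss_confident = binary_crossentropy(y_true, confident)
loss_uninformed = binary_crossentropy(y_true, uninformed)
print(loss_confident, loss_uninformed)
```

A model that always outputs 0.5 scores ln(2), about 0.693, so a training loss near that value means the model has not yet learned anything beyond chance.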

Step 4: Train the model
Pass in the training data and labels, specify a few training parameters, and train the model.

model.fit(x_train, y_train, epochs=10,verbose=2, batch_size=32,)

Epoch 1/10
- 1s - loss: 0.7063 - acc: 0.5010
Epoch 2/10
- 0s - loss: 0.6955 - acc: 0.5110
.................................
Epoch 10/10
- 0s - loss: 0.5973 - acc: 0.6980

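epochs: integer, the number of training epochs.
verbose: controls what is printed during training. 0 prints nothing, 1 shows a progress bar, and 2 prints one line per epoch.
batch_size: integer, the number of samples in each batch during gradient descent. During training, each batch produces one gradient descent step, moving the objective function one step toward its optimum.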
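How epochs and batch_size relate to the number of gradient updates can be worked out directly. The sketch below uses the sizes from this example (1000 samples, batch_size=32, epochs=10); the helper batch_slices is a hypothetical name, and real training would normally also shuffle the samples.

```python
import math

def batch_slices(num_samples, batch_size):
    """Yield (start, stop) index pairs, one per mini-batch; the last batch may be smaller."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)

num_samples, batch_size, epochs = 1000, 32, 10
slices = list(batch_slices(num_samples, batch_size))

steps_per_epoch = math.ceil(num_samples / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)   # 32 gradient updates per epoch
print(slices[-1])        # (992, 1000): the final batch holds only 8 samples
print(total_updates)     # 320 updates over the whole run
```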
Step 5: Test the model
Run the trained model on the test data to obtain test results and evaluate the model.

score = model.evaluate(x_test, y_test, batch_size=32)

200/200 [==============================] - 0s 146us/step
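This function returns a scalar test loss if the model has no other metrics, or a list of scalars if it does. Because this model was compiled with metrics=['accuracy'], score is a list holding the loss and the accuracy.
Step 6: Save the model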

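# Serialize the model architecture to a JSON string (the weights are not included)
json_string = model.to_json()
# Rebuild a model with the same architecture from the JSON string
from keras.models import model_from_json
model_re = model_from_json(json_string)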

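[Project extension]
The example above uses a fully connected network with an input layer, one hidden layer, and an output layer. Could a convolutional neural network be used instead? For example: a convolutional layer + pooling layer + flatten + fully connected layer + output layer.

Chapter 3 Implementing a Single-Layer Neural Network with Keras

3.1 Implementing a single-layer neural network with Keras

This chapter uses Keras to implement a traditional machine learning algorithm: linear regression.
Given input data and target data, we fit a linear function y=kx+b.
A single neuron is used, with ReLU as its activation function. Because every target value in this example is positive, the ReLU does not stop the neuron from fitting the line. The neuron is shown in the figure below:

Step 1: Construct the data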

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
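# IPython/Jupyter magic; remove the next line when running as a plain Python script
%matplotlib inline

# Build the data
X = np.linspace(-2, 2, 200)
np.random.shuffle(X)    # randomize the data
# Add some noise
Y = 0.5 * X + 2 + np.random.normal(0, 0.05, (200, ))

# Plot the input data
plt.scatter(X, Y)
plt.show()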

Split the 200 data points into training data and test data.

X_train, Y_train = X[:160], Y[:160]     # first 160 data points
X_test, Y_test = X[160:], Y[160:]       # last 40 data points

Step 2: Build the model

# build a neural network from the 1st layer to the last layer
model = Sequential()
model.add(Dense(units=1,activation='relu', input_dim=1))

Step 3: Compile the model

# choose loss function and optimizing method
model.compile(loss='mse', optimizer='sgd')
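
What loss='mse' with optimizer='sgd' does for a single neuron can be sketched in plain NumPy. This is an illustration under stated assumptions, not what Keras runs internally: the neuron is treated as purely linear (the ReLU is left out, which is harmless here because all targets are positive), the parameters start at zero, and the learning rate of 0.01 is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same kind of synthetic data as above: y = 0.5 * x + 2 plus small Gaussian noise.
X = np.linspace(-2, 2, 200)
rng.shuffle(X)
Y = 0.5 * X + 2 + rng.normal(0, 0.05, 200)
X_train, Y_train = X[:160], Y[:160]

w, b = 0.0, 0.0      # weight and bias of the single neuron
lr = 0.01            # assumed learning rate
batch_size = 64

for epoch in range(100):
    for start in range(0, len(X_train), batch_size):
        xb = X_train[start:start + batch_size]
        yb = Y_train[start:start + batch_size]
        err = (w * xb + b) - yb              # prediction error on this batch
        # Gradients of mean((w * x + b - y) ** 2) with respect to w and b.
        grad_w = 2.0 * np.mean(err * xb)
        grad_b = 2.0 * np.mean(err)
        w -= lr * grad_w
        b -= lr * grad_b

train_mse = float(np.mean((w * X_train + b - Y_train) ** 2))
print("w =", w, "b =", b, "train MSE =", train_mse)
```

After 100 epochs the learned w and b should land close to the true slope 0.5 and intercept 2.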

Step 4: Train the model

model.fit(X_train, Y_train, epochs=100,verbose=0, batch_size=64,)

Step 5: Test the model

# test
print('\nTesting ------------')
cost = model.evaluate(X_test, Y_test, batch_size=40)
print('test cost:', cost)
W, b = model.layers[0].get_weights()
print('Weights=', W, '\nbiases=', b)

Testing ------------
40/40 [==============================] - 0s 996us/step
test cost: 0.00395184289664
Weights= [[ 0.48489931]]
biases= [ 1.95838749]

Visualizing the result:

# plotting the prediction
Y_pred = model.predict(X_test)
plt.scatter(X_test, Y_test)
plt.plot(X_test, Y_pred)
plt.show()

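Chapter 4 Implementing a Multi-Layer Neural Network with Keras

We use Keras to build a multi-layer neural network that recognizes handwritten digits. Last time we implemented it in plain Python; this time we build the multi-layer network with Keras.
Network structure diagram:

Across the whole network design, the dimensions of the input data, the optimization method, and the loss function deserve the most attention. The activation function also matters, especially when there are many layers.
To illustrate how Keras builds a multi-layer network we use the MNIST dataset. MNIST is a dataset of handwritten digits 0-9 with 60000 training samples and 10000 test samples, and it is a subset of the NIST database. Keras provides a ready-made API for loading it (mnist.load_data()).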
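Data preprocessing:
(1) Flatten the matrices:
Each original image is 28*28. Before it can be fed to a fully connected layer, the matrix has to be flattened into a one-dimensional array of size 784.
(2) Normalize the training data
Pixel values all lie between 0 and 255. To improve the model's ability to generalize, the data is normalized by dividing by 255, so that every value falls in [0,1].
(3) Encode the labels
The labels are converted to one-hot format: vectors of dimension 10 in which every row is all zeros except for a single 1. For example, label 2 becomes [0,0,1,0,0,0,0,0,0,0].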
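The three preprocessing steps can be tried on a tiny stand-in array before touching the real dataset. The sketch below uses only NumPy; flatten_and_scale and one_hot are hypothetical helper names, and the images are random pixels rather than real MNIST digits.

```python
import numpy as np

def flatten_and_scale(images):
    """Reshape (n, 28, 28) uint8 images to (n, 784) floats in [0, 1]."""
    return images.reshape(images.shape[0], -1) / 255.0

def one_hot(labels, num_classes=10):
    """Turn integer labels of shape (n,) into one-hot rows of shape (n, num_classes)."""
    encoded = np.zeros((labels.shape[0], num_classes))
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

# Stand-in data with the same layout as MNIST (random pixels, not real digits).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
labels = np.array([2, 0, 9, 2, 7])

x = flatten_and_scale(images)
y = one_hot(labels)
print(x.shape, y.shape)
print(y[0])   # label 2 -> a 1 in position 2, zeros elsewhere
```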
The detailed steps are as follows:
Step 1: Construct the data

import numpy as np
from keras.datasets import mnist
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adam

# download the mnist to the path '~/.keras/datasets/' 
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Check the dataset dimensions
print("Before flattening")
print(X_train.shape,y_train.shape)

# data pre-processing
X_train = X_train.reshape(X_train.shape[0], -1) / 255.   # normalize
X_test = X_test.reshape(X_test.shape[0], -1) / 255.      # normalize
y_train = np_utils.to_categorical(y_train, num_classes=10)
y_test = np_utils.to_categorical(y_test, num_classes=10)
print("After flattening")
print(X_train.shape,y_train.shape)

Output:
Using TensorFlow backend.
Before flattening
(60000, 28, 28) (60000,)
After flattening
(60000, 784) (60000, 10)

Step 2: Build the network

model = Sequential([
    Dense(32, input_dim=784),
    Activation('relu'),
    Dense(10),
    Activation('softmax'),
])
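
As a quick sanity check on the size of this network, the number of trainable parameters can be counted by hand. The helper dense_params below is a hypothetical name; the only assumption is that each Dense layer holds one weight per input-unit pair plus one bias per unit, and that Activation layers add no parameters.

```python
def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units

hidden = dense_params(784, 32)   # 784*32 weights + 32 biases
output = dense_params(32, 10)    # 32*10 weights + 10 biases
total = hidden + output

print(hidden, output, total)     # 25120 330 25450
```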

Step 3: Compile the model

model.compile(optimizer='Adam',loss='categorical_crossentropy',metrics=['accuracy'])
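
The softmax output layer and the categorical_crossentropy loss work as a pair. The following NumPy sketch shows the computation on a toy batch with three classes instead of ten; it is an illustration rather than the Keras implementation, and the clipping constant eps is an assumption.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -sum(y_true * log(y_prob)) over the batch; eps is an assumed clipping constant."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs.sum(axis=1))   # each row sums to 1
print(loss)
```

The second sample has equal logits, so its loss is ln(3): with one-hot labels, only the probability assigned to the true class enters the loss.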

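[Note]
Customizing the optimizer is also straightforward: build an optimizer object as below and pass it to compile through the optimizer argument in place of the string 'Adam'.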

Adam1 =Adam(lr=0.001, beta_1=0.5, epsilon=1e-08, decay=0.0)
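
To see what lr, beta_1, and epsilon control, here is a sketch of a single Adam update written in NumPy. It follows the commonly published form of the algorithm and may differ in small details from the Keras implementation; beta_2=0.999 is an assumed value, and the decay argument is not modeled.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta_1=0.5, beta_2=0.999, epsilon=1e-08):
    """One Adam update; returns the new parameter and the updated moment estimates."""
    m = beta_1 * m + (1.0 - beta_1) * grad          # running mean of gradients
    v = beta_2 * v + (1.0 - beta_2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1.0 - beta_1 ** t)                 # bias correction for step t (1-based)
    v_hat = v / (1.0 - beta_2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return param, m, v

# Minimize f(p) = p^2 starting from p = 1.0; the gradient is 2 * p.
p, m, v = 1.0, 0.0, 0.0
history = []
for t in range(1, 4):
    p, m, v = adam_step(p, 2.0 * p, m, v, t)
    history.append(p)
    print(t, p)
```

On the first step the update size is almost exactly lr, whatever the gradient's magnitude, because the bias-corrected ratio m_hat / sqrt(v_hat) is close to 1 in absolute value.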

Step 4: Train the model

print('Training ------------')
model.fit(X_train, y_train, epochs=2,verbose=2, batch_size=32)

Output:
Training ------------
Epoch 1/2
- 7s - loss: 0.1584 - acc: 0.9539
Epoch 2/2
- 6s - loss: 0.1340 - acc: 0.9607
Step 5: Test the model

print('\nTesting ------------')
# Evaluate the model with the metrics we defined earlier
loss, accuracy = model.evaluate(X_test, y_test)

print('test loss: ', loss)
print('test accuracy: ', accuracy)

Output:
test loss: 0.132445694927
test accuracy: 0.9614
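
The accuracy figure reported by evaluate is the fraction of test samples whose most probable predicted class matches the label. The sketch below reproduces that calculation in NumPy on made-up predictions; the function name and the sample arrays are assumptions for illustration.

```python
import numpy as np

def accuracy(y_true_one_hot, y_prob):
    """Fraction of samples whose highest-probability class matches the one-hot label."""
    predicted = np.argmax(y_prob, axis=1)
    actual = np.argmax(y_true_one_hot, axis=1)
    return float(np.mean(predicted == actual))

# Hypothetical predictions for four samples and three classes.
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.3, 0.4, 0.3],
                   [0.2, 0.2, 0.6]])
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

acc = accuracy(y_true, y_prob)
print(acc)   # 0.75: three of the four predictions are correct
```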

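Chapter 14 TensorFlowOnSpark in Detail

14.1 Introduction to TensorFlow

14.1.1 Installing TensorFlow

14.1.2 The development of TensorFlow

14.1.3 Features of TensorFlow

14.1.4 The TensorFlow programming model

14.1.5 Common TensorFlow functions

14.1.6 How TensorFlow runs

14.2 Implementing a convolutional neural network with TensorFlow

14.2.1 Introduction to convolutional neural networks

14.2.3 Network structure of convolutional neural networks

14.2.4.1 Importing the data

14.2.4.2 Weight initialization

14.2.4.3 Building the convolutional neural network structure

14.2.4.4 Training and evaluating the model

14.3 Implementing a recurrent neural network with TensorFlow

14.3.1 Introduction to recurrent neural networks

14.3.2 Introduction to LSTM recurrent neural networks

14.3.4 Implementing a recurrent neural network with TensorFlow

14.4 Distributed TensorFlow

14.4.1 The relationship between the client, master node, and worker nodes

14.4.2 Distributed modes

14.4.3 Running TensorFlow in a PySpark cluster environment

14.5 TensorFlowOnSpark architecture

14.6 Installing TensorFlowOnSpark

14.7 TensorFlowOnSpark examples

14.7.1 TensorFlowOnSpark single-machine example

14.7.2 TensorFlowOnSpark cluster example

14.8 Summary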

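Chapter 13 Building an Online Learning Model with Spark Streaming

13.1 Introduction to Spark Streaming

13.1.1 Common Spark Streaming terms

13.2 DStream operations

13.2.1 DStream input

13.2.2 DStream transformations

13.2.3 DStream modifications

13.2.4 DStream output

13.3 A Spark Streaming application example

13.4 A Spark Streaming online learning example

13.5 Summary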

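12.1 Introduction to Spark R

12.2 Getting the data

12.2.1 The SparkDataFrame data structure

12.2.2 Creating a SparkDataFrame

12.2.3 Common SparkDataFrame operations

12.3 Naive Bayes classifier

12.3.1 Data exploration

12.3.2 Transforming the raw dataset

12.3.3 Comparing survival rates across cabin classes

12.3.4 Converting the data to SparkDataFrame format

12.3.5 Model summary

12.3.6 Prediction

12.3.7 Evaluating the model

12.4 Summary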

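Chapter 11 PySpark Decision Tree Model

11.1 Introduction to PySpark

11.2 Introduction to decision trees

11.3 Loading the data

11.3.1 A first look at the raw dataset

11.3.2 Starting PySpark

11.3.3 Basic functions

11.4 Data exploration

11.5 Data preprocessing

11.6 Creating the decision tree model

11.7 Training the model and making predictions

11.8.1 Optimizing feature values

11.8.2 Cross-validation and grid parameters

11.9 Running as a script

11.9.1 Adding configuration to the script

11.9.2 Running the script

11.10 Summary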

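Chapter 1: Keras Basics

1.1 Introduction to Keras

1.2 Installing Keras

1.3 Common Keras concepts

1.4 Keras and TensorFlow

1.5 Main Keras modules

1.5.1 Optimizers

1.5.2 Objective functions (objectives)

1.5.3 Activation functions (activations)

Chapter 13 Building an Online Learning Model with Spark Streaming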