分类目录归档：人工智能

第24章语音识别基础

语音作为最自然便捷的交流方式，一直是人机通信和交互最重要的研究领域之一。自动语音识别（Automatic Speech Recognition，ASR）是实现人机交互尤为关键的技术，其所要解决的问题是让计算机能够“听懂”人类的语音，将语音中传化为文本，或其它语言的语音(如同声翻译)等。
自动语音识别技术起源于上世纪五十年代，经过几十年的发展已经取得了显著的成效。尤其是近些年，把深度学习融入其中，使ASR取得突飞猛进的发展。越来越多的语音识别智能软件和应用走入了人们的日常生活，比较典型的有苹果的Siri、亚马逊的Alexa、微软的小娜、科大讯飞的语音输入法、叮咚智能音箱等。

本章主要介绍语音识别的基本概念、发展轨迹及主要技术和原理等。包括：
 语音识别系统架构
 语音识别基础
 语音识别发展轨迹
 语音识别未来方向

24.1 语言识别系统的架构

图24.1为语言识别系统的典型结构，其中关键技术涉及信号处理、特征提取、解码器及声学模型和语言模型的建立等。

图24.1 语音识别系统的架构
 信号处理及特征提取
该模块的主要任务是从输入信号中提取特征，供声学模型处理。同时，它一般也包括了一些信号处理技术，以尽可能降低环境噪声、信道、说话人等因素对特征造成的影响。
 声学模型
声学模型充分利用了声学、语音学、环境特性以及说话人性别口音等信息，对语音进行建模。目前的语音识别系统往往采用隐含马尔科夫模型（Hidden Markov Model，HMM）建模，表示某一语音特征矢量序列对某一状态序列的后验概率。隐含马尔科夫模型是一种概率图模型，可以用来表示序列之间的相关关系，常常被用来对时序数据建模。
 词典
词典包含系统所能处理的词汇集及其发音。词典实际提供了声学模型建模单元与语言模型建模单元间的映射。
 语言模型
语言模型对系统所针对的语言进行建模。理论上，包括正则语言，上下文无关文法在内的各种语言模型都可以作为语言模型，但目前各种系统常用的还是基于统计的N 元文法（N-Gram），即统计前后 N 个字出现的概率。
 解码器
解码器是语音识别系统的核心之一，其任务是对输入的信号，根据声学、语言模型及词典，寻找能够以最大概率输出该信号的词序列。
这节我们介绍了语音识别系统的主要架构，接下了我们从基本概念入手，进一步分析语音识别系统如果把语音或声音一步步变成文字或语句的。

24.2 语音识别的原理

上节介绍了语音识别系统的架构，这节我们对上节内容进行细化或具体化，为了大家对语音识别系统有直观认识，主要介绍系统涉及的一般基本概念及其原理。
语音识别系统的输入是何种形式的文件？
语音识别系统的输入自然是语音或声音，我们知道声音实际上是一种波。常见的MP3等格式都是压缩格式，必须转成非压缩的纯波形文件来处理，比如通常看到wav文件。wav文件里存储的除了一个文件头以外，就是声音波形的一个个点了。下图24.2是一个波形的示例

图24.2 声音波形示意图
所以输入文件一般是wav文件。有了输入文件后，接下就需要文件进行预处理，一般需要做哪些预处理呢？
（1）去首尾静音
利用信号处理的一些技术，切除首尾端的静音，这种操作一般称为VAD,以降低对后续步骤造成的干扰。
（2）分帧
对声音分帧，也就是把声音切成一小段一小段，每小段称为一帧。分帧操作一般不是简单的切开，而是使用移动窗口函数来实现。帧与帧之间一般有交叠，如图24.3

图24.3 分帧示意图
图24.3中，每帧的长度为25毫秒，每两帧之间有25-10=15毫秒的交叠。我们称为以帧长25ms、帧移动10ms分帧。
（3）特征提取
通过分帧后，语音变成了很多小段。但这些小段波形计算机还是无法处理，因此需要将波形作变换，常用的一般变换方法是提取MFCC特征，依据人耳的生理特性，把每一帧波形变成一个多维向量，这个向量包含了这帧语音的内容信息，这个过程称为声学特征提取。
这样声音就成了12行（假设声学特征为12维）、N列的一个矩阵（N是总帧数），称为观察序列。观察序列如图24.4，每一帧都用一个12维的向量表示，色块的颜色深浅表示向量值的大小。至此，准备工作基本完成，接下就是训练模型了。

图24.4 观察序列
（4）训练模型
在开始训练模型前，我们先介绍两个基本概念：
音素：单词的发音由音素构成。对英语，一种常用的音素集是卡内基梅隆大学的一套由39个音素构成的音素集。汉语一般直接用全部声母和韵母作为音素集，另外汉语识别还分有调无调。
状态：这里理解成比音素更细致的语音单位就可。通常把一个音素划分成3个状态。
具体训练过程可分为三步来实现：
第一步，把帧识别成状态(难点)；
第二步，把状态组合成音素；
第三步，把音素组合成单词。
首先看第一步，如何把帧识别成状态？如图24.5所示

图24.5 音素的合作过程
在图24.5中，每个小竖条代表一帧，若干帧语音对应一个状态，每三个状态组合成一个音素，若干个音素组合成一个单词。也就是说，只要知道每帧语音对应哪个状态，语音识别的结果也就出来了。如何由帧音素来确定对应的状态呢？有个容易想到的办法，看某帧对应哪个状态的概率最大，那这帧就属于哪个状态。如图24.6所示，这帧在状态S3上的条件概率最大，因此就猜这帧属于状态S3。

图24.6 由音素确定其对应状态的过程
图24.6中的那些用到的概率从哪里读取呢?有个叫“声学模型”的东西，里面存了一大堆参数，通过这些参数，就可以知道帧和状态对应的概率。获取这一大堆参数的方法叫做“训练”，需要使用巨大数量的语音数据，训练的方法比较繁琐，这里不讲。
但这样做有一个问题：每一帧都会得到一个状态号，最后整个语音就会得到一堆乱七八糟的状态号，相邻两帧间的状态号基本都不相同。假设语音有1000帧，每帧对应1个状态，每3个状态组合成一个音素，那么大概会组合成300个音素，但这段语音其实根本没有这么多音素。如果真这么做，得到的状态号可能根本无法组合成音素。实际上，相邻帧的状态应该大多数都是相同的才合理，因为每帧很短。这里就涉及第二步，如何由状态合成音素的问题。
解决这个问题的常用方法就是使用隐马尔可夫模型(Hidden Markov Model，HMM)。这东西听起来好像很高深的样子，实际上用起来很简单，主要由以下两步来实现：
第一步，构建一个状态网络；
第二步，从状态网络中寻找与声音最匹配的路径。
这样就把结果限制在预先设定的网络中，避免了刚才说到的问题。当然也带来一个局限，比如你设定的网络里只包含了“今天晴天”和“今天下雨”两个句子的状态路径，那么不管说些什么，识别出的结果必然是这两个句子中的一句。
如果想识别任意文本呢?只要把这个网络搭得足够大，包含任意文本的路径就可以了。但这个网络越大，想要达到比较好的识别准确率就越难。所以要根据实际任务的需求，合理选择网络大小和结构。
如何由音素合成单词呢？首先搭建状态网络，它是由单词级网络展开成音素网络，再展开成状态网络，其中涉及到HMM、词典以及语言模型等。具体如图24.7所示。语音识别过程其实就是在状态网络中搜索一条最佳路径，语音对应这条路径的概率最大，这称之为“解码”。路径搜索的算法是一种动态规划剪枝的算法，称之为Viterbi算法，用于寻找全局最优路径。

图24.7 状态网络示意图
这里所说的概率，由三部分构成，分别是：
观察概率：每帧和每个状态对应的概率
转移概率：每个状态转移到自身或转移到下个状态的概率
语言概率：根据语言统计规律得到的概率
其中，前两种概率从声学模型中获取，最后一种概率从语言模型中获取。语言模型是使用大量的文本训练出来的，可以利用某门语言本身的统计规律来帮助提升识别正确率。语言模型很重要，如果不使用语言模型，当状态网络较大时，识别出的结果基本是一团乱麻。
至此，语音识别主要过程就完成了,这就是语音识别技术的原理。
【备注】本节内容主要参考知乎，作者：张俊博，链接：https://www.zhihu.com/question/20398418/answer/18080841

24.3 语音识别发展历程

语音识别与上世纪50年代就开始了，当时主要尝试对单个的孤立词进行识别；
60-70年代人们开始开始探索连续语言识别工作，但进展比较缓慢；
80年代得到开始发展，得益于两个关键技术的应用：隐马尔科夫模型（HMM）的理论和N-gram语言模型；
90年代是语音识别基本成熟的时期，主要进展是语音识别声学模型的区分性训练准则和模型自适应方法的提出。这个时期剑桥语音识别组推出的HTK工具包对于促进语音识别的发展起到了很大的推动作用。此后语音识别发展很缓慢，主流的框架GMM-HMM趋于稳定，但是识别效果离实用化还相差甚远，语音识别的研究陷入了瓶颈；
2006年。这一年辛顿（Hinton）提出深度置信网络（DBN），促使了深度神经网络（Deep Neural Network，DNN）研究的复苏，掀起了深度学习的热潮，从此语音识别开始了新篇章，基于GMM-HMM的语音识别框架被打破，大量研究人员开始转向基于DNN-HMM的语音识别系统的研究。用DNN替换GMM,主要优势有：
（1）使用DNN估计HMM的状态的后验概率分布不需要对语音数据分布进行假设；
（2）DNN的输入特征可以是多种特征的融合，包括离散或者连续的；
（3） DNN可以利用相邻的语音帧所包含的结构信息。
图24.8为基于深度神经网络的语音识别系统框架。

图24.8 基于深度神经网络的语音识别系统框架
不过这种框架中DNN对语音信号的长时相关性解决不很理想，于是，循环神经网络（Recurrent Neural Network，RNN）近年来逐渐替代传统的DNN成为主流的语音识别建模方案。如图24.9。

图24.9 基于RNN和CTC的主流语音识别系统框架
循环神经网络在隐层上增加了一个反馈连接，这就很好解决长时相关性问题，再加上序列短时分类（Connectionist Temporal Classification，CTC）输出层，使整个训练更加高效，实现有效的“端对端”训练。
近几年，随着大数据、深度学习及云计算的进一步发展和融合，语音识别性能得到进一步提升，现在已有很多商业化应用，并且正在不断拓展其应用的广度和深度。

第14章TensorFlowOnSpark详解

前面我们介绍了Spark MLlib的多种机器学习算法，如分类、回归、聚类、推荐等，Spark目前还缺乏对神经网络、深度学习的足够支持，但近几年市场对神经网络，尤其对深度学习热情高涨，成了当下很多企业的研究热点，缺失神经网络的支持，这或许也算是Spark MLlib尚欠不足之处吧。
不过好消息是TensorFlow这个深度学习框架，已经有了Spark接口，即TensorFlowOnSpark。TensorFlow是目前很热门的深度学习框架，是Google于2015年11月9日开源的第二代深度学习系统，也是AlphaGo的基础程序。
本章我们将介绍深度学习最好框架TensorFlow及TensorFlowOnSpark，具体包括：
TensorFlow简介
TensorFlow实现卷积神经网络
分布式TensorFlow
TensorFlowOnSpark架构
TensorFlowOnSpark实例

14.1TensorFlow简介

14.1.1TensorFlow的安装

安装TensorFlow，因本环境的python2.7采用anaconda来安装，故这里采用conda管理工具来安装TensorFlow，目前conda缺省安装版本为TensorFlow 1.1。

conda  install  tensorflow

1	conda install tensorflow

验证安装是否成功，可以通过导入tensorflow来检验。
启动ipython（或python）

import tensorflow as tf

1	import tensorflow as tf

14.1.2TensorFlow的发展

2015年11月9日谷歌开源了人工智能系统TensorFlow，同时成为2015年最受关注的开源项目之一。TensorFlow的开源大大降低了深度学习在各个行业中的应用难度。TensorFlow的近期里程碑事件主要有：
2016年04月：发布了分布式TensorFlow的0.8版本，把DeepMind模型迁移到TensorFlow；
2016年06月：TensorFlow v0.9发布，改进了移动设备的支持；
2016年11月：TensorFlow开源一周年；
2017年2月：TensorFlow v1.0发布，增加了Java、Go的API,以及专用的编译器和调试工具，同时TensorFlow 1.0引入了一个高级API，包含tf.layers，tf.metrics和tf.losses模块。还宣布增了一个新的tf.keras模块，它与另一个流行的高级神经网络库Keras完全兼容。
2017年4月：TensorFlow v1.1发布，为 Windows 添加 Java API 支，添加 tf.spectral 模块， Keras 2 API等；
2017年6月：TensorFlow v1.2发布，包括 API 的重要变化、contrib API的变化和Bug 修复及其他改变等。

14.1.3TensorFlow的特点

14.1.4TensorFlow编程模型

TensorFlow如何工作？我们通过一个简单的实例进行说明，为计算x+y，你需要创建下图（图14-1）这张数据流图：

图14-1计算x+y的数据流图

以下构成上数据流图（图14-1）的详细步骤：
1）定义x= [1,3,5]，y =[2,4,7]，这个图和tf.Tensor一起工作来代表数据的单位，你需要创建恒定的张量：

import tensorflow as tf
x = tf.constant([1,3,5]) 
y = tf.constant([2,4,7])

import tensorflow as tf

x = tf.constant([1,3,5])

y = tf.constant([2,4,7])

2）定义操作

op = tf.add(x,y)

1	op = tf.add(x,y)

3）张量和操作都有了，接下来就是创建图

my_graph = tf.Graph()

1	my_graph = tf.Graph()

注意：这一步不是必须的，在创建回话时，系统将自动创建一个默认图。
4）为了运行这图你将需要创建一个回话(tf.Session),一个tf.Session对象封装了操作对象执行的环境，为了做到这一点，我们需要定义在会话中将要用到哪一张图：

with tf.Session(graph=my_graph) as sess:
    x = tf.constant([1,3,5]) 
    y = tf.constant([2,4,7])
    op = tf.add(x,y)

with tf.Session(graph=my_graph) as sess:

x = tf.constant([1,3,5])

y = tf.constant([2,4,7])

op = tf.add(x,y)

5）想要执行这个操作，要用到tf.Session.run()这个方法：

import tensorflow as tf
	my_graph = tf.Graph()
	with tf.Session(graph=my_graph) as sess:
	x = tf.constant([1,3,5]) 
	y = tf.constant([2,4,7])
	op = tf.add(x,y)
	result = sess.run(fetches=op)
	print(result)

import tensorflow as tf

my_graph = tf.Graph()

with tf.Session(graph=my_graph) as sess:

x = tf.constant([1,3,5])

y = tf.constant([2,4,7])

op = tf.add(x,y)

result = sess.run(fetches=op)

print(result)

6）运行结果：
[ 3 7 12]

14.1.5TensorFlow常用函数

14.1.6TensorFlow的运行原理

TensorFlow有一个重要组件client，即客户端，此外，还有master、worker，这些有点类似Spark的结构。它通过Session的接口与master及多个worker相连，其中每一个worker可以与多个硬件设备（device）相连，比如CPU或GPU，并负责管理这些硬件。而master则负责管理所有worker按流程执行计算图。

14.2TensorFlow实现卷积神经网络

神经网络可为机器学习中最活跃的领域之一，尤其代表深度学习的卷积神经（Convolutional Neural Network,CNN）、循环神经网络（Recurrent Neural Network，RNN）更是炙手可热。

14.2.1卷积神经网络简介

卷积神经网络是人工神经网络的一种，已成为图像识别、视频处理、语音分析等领域的研究热点。它的权值共享网络结构使之更类似于生物神经网络，减少了权值的数量，降低了网络模型的复杂度，防止因参数太多导致过拟合。

14.2.3卷积神经网络的网络结构

接下来，我们利用训练集训练卷积神经网络模型，然后在测试集上验证该模型。
搭建的卷积神经网络使用的一些参数是：
卷积层1：kernel_size [5, 5], stride=1，32个卷积窗口
池化层1： pool_size [2, 2], stride = 2
卷积层2：kernel_size [5, 5], stride=1，64个卷积窗口
池化层2： pool_size [2, 2], stride = 2
全连接层: 1024个特征，使用dropout减少过拟合
输出层: 使用softmax进行分类

14.2.4.1 导入数据

首先启动ipython，进入交互计算环境，当然直接启动python也可，然后通过TensorFlow自带的函数读取图片数据。

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
~/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py 中函数read_data_sets四个local_file

import tensorflow as tf

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

~/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py 中函数read_data_sets四个local_file

如果无法直接通过input_data下载，可以先把MNIST数据下载，然后，修改
python/learn/datasets/mnist.py文件中read_data_sets函数中4个local_file的值
具体如下，注释原来的local_file，新增4行local_file

#local_file = base.maybe_download(TRAIN_IMAGES, train_dir,SOURCE_URL + TRAIN_IMAGES)
local_file = train_dir + "/" + TRAIN_IMAGES
with open(local_file, 'rb') as f:
  train_images = extract_images(f)

#local_file = base.maybe_download(TRAIN_LABELS, train_dir,SOURCE_URL + TRAIN_LABELS)
local_file = train_dir + "/" + TRAIN_LABELS
with open(local_file, 'rb') as f:
  train_labels = extract_labels(f, one_hot=one_hot)

#local_file = base.maybe_download(TEST_IMAGES, train_dir,SOURCE_URL + TEST_IMAGES)
local_file = train_dir + "/" + TEST_IMAGES
with open(local_file, 'rb') as f:
  test_images = extract_images(f)

#local_file = base.maybe_download(TEST_LABELS, train_dir,SOURCE_URL + TEST_LABELS)
local_file = train_dir + "/" + TEST_LABELS

#local_file = base.maybe_download(TRAIN_IMAGES, train_dir,SOURCE_URL + TRAIN_IMAGES)

local_file = train_dir + "/" + TRAIN_IMAGES

with open(local_file, 'rb') as f:

train_images = extract_images(f)

#local_file = base.maybe_download(TRAIN_LABELS, train_dir,SOURCE_URL + TRAIN_LABELS)

local_file = train_dir + "/" + TRAIN_LABELS

with open(local_file, 'rb') as f:

train_labels = extract_labels(f, one_hot=one_hot)

#local_file = base.maybe_download(TEST_IMAGES, train_dir,SOURCE_URL + TEST_IMAGES)

local_file = train_dir + "/" + TEST_IMAGES

with open(local_file, 'rb') as f:

test_images = extract_images(f)

#local_file = base.maybe_download(TEST_LABELS, train_dir,SOURCE_URL + TEST_LABELS)

local_file = train_dir + "/" + TEST_LABELS

更加数据实际存放路径，修改read_data_sets中读取文件路径。

mnist = input_data.read_data_sets("./TensorFlowOnSpark/mnist", one_hot=True)
# 创建交互式session 
sess = tf.InteractiveSession()

mnist = input_data.read_data_sets("./TensorFlowOnSpark/mnist", one_hot=True)

# 创建交互式session

sess = tf.InteractiveSession()

14.2.4.2 权重初始化

# 正态分布，标准差为0.1，默认最大为1，最小为-1，均值为0

def weight_variable(shape):  
        initial = tf.truncated_normal(shape, stddev=0.1)  
        return tf.Variable(initial) 
# 创建一个结构为shape矩阵也可以说是数组shape声明其行列，初始化所有值为0.1  		
def bias_variable(shape): 
        initial = tf.constant(0.1, shape=shape)  
        return tf.Variable(initial)

def weight_variable(shape):

initial = tf.truncated_normal(shape, stddev=0.1)

return tf.Variable(initial)

# 创建一个结构为shape矩阵也可以说是数组shape声明其行列，初始化所有值为0.1

def bias_variable(shape):

initial = tf.constant(0.1, shape=shape)

return tf.Variable(initial)

14.2.4.3 构建卷积神经网络结构

# 卷积遍历各方向步数为1，SAME：边缘外自动补0，遍历相乘 
def conv2d(x, W): 
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')  
# 池化卷积结果（conv2d）池化层采用kernel大小为2*2，步数也为2，周围补0，取最大值。数据量缩小了4倍  	
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],strides=[1, 2, 2, 1], padding='SAME')    

#定义输入输出结构  

# 声明一个占位符，None表示输入图片的数量不定，28*28图片分辨率  
xs = tf.placeholder(tf.float32, [None, 28*28])   
# 类别是0-9总共10个类别，对应输出分类结果  
ys = tf.placeholder(tf.float32, [None, 10])   
keep_prob = tf.placeholder(tf.float32)  
# x_image又把xs reshape成了28*28*1的形状，因为是灰色图片，所以通道是1.作为训练时的input，-1代表图片数量不定  
x_image = tf.reshape(xs, [-1, 28, 28, 1])   
#搭建网络,定义算法公式，也就是forward时的计算  

## 第一层卷积操作 ##  
# 第一二参数值得卷积核尺寸大小，即patch，第三个参数是图像通道数，第四个参数是卷积核的数目，代表会出现多少个卷积特征图像;  
W_conv1 = weight_variable([5, 5, 1, 32])   
# 对于每一个卷积核都有一个对应的偏置量。  
b_conv1 = bias_variable([32])    
# 图片乘以卷积核，并加上偏执量，卷积结果28x28x32  
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)    
# 池化结果14x14x32 卷积结果乘以池化卷积核  
h_pool1 = max_pool_2x2(h_conv1)   

## 第二层卷积操作 ##     
# 32通道卷积，卷积出64个特征    
w_conv2 = weight_variable([5,5,32,64])   
# 64个偏执数据  
b_conv2  = bias_variable([64])   
# 注意h_pool1是上一层的池化结果，#卷积结果14x14x64  
h_conv2 = tf.nn.relu(conv2d(h_pool1,w_conv2)+b_conv2)    
# 池化结果7x7x64  
h_pool2 = max_pool_2x2(h_conv2)    
# 原图像尺寸28*28，第一轮图像缩小为14*14，共有32张，第二轮后图像缩小为7*7，共有64张    

## 第三层全连接操作
# 二维张量，第一个参数7*7*64的patch，第二个参数代表卷积个数共1024个  
W_fc1 = weight_variable([7*7*64, 1024])   
# 1024个偏执数据  
b_fc1 = bias_variable([1024])   
# 将第二层卷积池化结果reshape成只有一行7*7*64个数据# [n_samples, 7, 7, 64] ->> [n_samples, 7*7*64]  
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])   
# 卷积操作，结果是1*1*1024，matmul实现最基本的矩阵相乘。
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)   

# dropout操作，减少过拟合。对卷积结果执行dropout操作。
keep_prob = tf.placeholder(tf.float32)   
h_fc1_drop = tf.nn.dropout(h_fc1,keep_prob) 
## 第四层输出操作 ##  
# 二维张量，1*1024矩阵卷积，共10个卷积，对应我们开始的ys长度为10  
W_fc2 = weight_variable([1024, 10])    
b_fc2 = bias_variable([10])    
# 最后的分类，结果为1*1*10 softmax  
y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
#定义交叉熵为loss函数，采用Adam方法优化loss。
cross_entropy = -tf.reduce_sum(ys * tf.log(y_conv))    
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

# 卷积遍历各方向步数为1，SAME：边缘外自动补0，遍历相乘

def conv2d(x, W):

return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

# 池化卷积结果（conv2d）池化层采用kernel大小为2*2，步数也为2，周围补0，取最大值。数据量缩小了4倍

def max_pool_2x2(x):

return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],strides=[1, 2, 2, 1], padding='SAME')

#定义输入输出结构

# 声明一个占位符，None表示输入图片的数量不定，28*28图片分辨率

xs = tf.placeholder(tf.float32, [None, 28*28])

# 类别是0-9总共10个类别，对应输出分类结果

ys = tf.placeholder(tf.float32, [None, 10])

keep_prob = tf.placeholder(tf.float32)

# x_image又把xs reshape成了28*28*1的形状，因为是灰色图片，所以通道是1.作为训练时的input，-1代表图片数量不定

x_image = tf.reshape(xs, [-1, 28, 28, 1])

#搭建网络,定义算法公式，也就是forward时的计算

## 第一层卷积操作 ##

# 第一二参数值得卷积核尺寸大小，即patch，第三个参数是图像通道数，第四个参数是卷积核的数目，代表会出现多少个卷积特征图像;

W_conv1 = weight_variable([5, 5, 1, 32])

# 对于每一个卷积核都有一个对应的偏置量。

b_conv1 = bias_variable([32])

# 图片乘以卷积核，并加上偏执量，卷积结果28x28x32

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

# 池化结果14x14x32 卷积结果乘以池化卷积核

h_pool1 = max_pool_2x2(h_conv1)

## 第二层卷积操作 ##

# 32通道卷积，卷积出64个特征

w_conv2 = weight_variable([5,5,32,64])

# 64个偏执数据

b_conv2 = bias_variable([64])

# 注意h_pool1是上一层的池化结果，#卷积结果14x14x64

h_conv2 = tf.nn.relu(conv2d(h_pool1,w_conv2)+b_conv2)

# 池化结果7x7x64

h_pool2 = max_pool_2x2(h_conv2)

# 原图像尺寸28*28，第一轮图像缩小为14*14，共有32张，第二轮后图像缩小为7*7，共有64张

## 第三层全连接操作

# 二维张量，第一个参数7*7*64的patch，第二个参数代表卷积个数共1024个

W_fc1 = weight_variable([7*7*64, 1024])

# 1024个偏执数据

b_fc1 = bias_variable([1024])

# 将第二层卷积池化结果reshape成只有一行7*7*64个数据# [n_samples, 7, 7, 64] ->> [n_samples, 7*7*64]

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])

# 卷积操作，结果是1*1*1024，matmul实现最基本的矩阵相乘。

h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

# dropout操作，减少过拟合。对卷积结果执行dropout操作。

keep_prob = tf.placeholder(tf.float32)

h_fc1_drop = tf.nn.dropout(h_fc1,keep_prob)

## 第四层输出操作 ##

# 二维张量，1*1024矩阵卷积，共10个卷积，对应我们开始的ys长度为10

W_fc2 = weight_variable([1024, 10])

b_fc2 = bias_variable([10])

# 最后的分类，结果为1*1*10 softmax

y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

#定义交叉熵为loss函数，采用Adam方法优化loss。

cross_entropy = -tf.reduce_sum(ys * tf.log(y_conv))

train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)

14.2.4.4 训练评估模型

#模型训练及评测  
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(ys,1))  
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))  
tf.global_variables_initializer().run()  

for i in range(2000):  
    batch = mnist.train.next_batch(50)  
    if i%100 == 0:  
            train_accuracy = accuracy.eval(feed_dict={xs:batch[0], ys: batch[1], keep_prob: 1.0})  
            print("step %d, training accuracy %g"%(i, train_accuracy))  
    train_step.run(feed_dict={xs: batch[0], ys: batch[1], keep_prob: 0.5})

#模型训练及评测

correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(ys,1))

accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

tf.global_variables_initializer().run()

for i in range(2000):

batch = mnist.train.next_batch(50)

if i%100 == 0:

train_accuracy = accuracy.eval(feed_dict={xs:batch[0], ys: batch[1], keep_prob: 1.0})

print("step %d, training accuracy %g"%(i, train_accuracy))

train_step.run(feed_dict={xs: batch[0], ys: batch[1], keep_prob: 0.5})

这里只迭代了2000次，运行结果；如果迭代20000次，在测试集上的精度可到99.2%左右
## -- End pasted text --
step 0, training accuracy 0.14
step 100, training accuracy 0.86
step 200, training accuracy 0.94
step 300, training accuracy 0.94
step 400, training accuracy 0.94
step 500, training accuracy 0.98
step 600, training accuracy 0.94
step 700, training accuracy 0.94
step 800, training accuracy 0.98
step 900, training accuracy 0.98
step 1000, training accuracy 1
step 1100, training accuracy 0.94
step 1200, training accuracy 0.98
step 1300, training accuracy 0.96
step 1400, training accuracy 0.92
step 1500, training accuracy 0.96
step 1600, training accuracy 0.96
step 1700, training accuracy 1
step 1800, training accuracy 1
step 1900, training accuracy 0.96

在测试集上，测试模型精度

print("test accuracy %g"%accuracy.eval(feed_dict={xs: mnist.test.images, ys: mnist.test.labels, keep_prob: 1.0})) 
test accuracy 0.9778

1 2	print("test accuracy %g"%accuracy.eval(feed_dict={xs: mnist.test.images, ys: mnist.test.labels, keep_prob: 1.0})) test accuracy 0.9778

14.3TensorFlow实现循环神经网络

14.3.1循环神经网络简介

在传统的神经网络模型中，是从输入层到隐含层再到输出层，层与层之间是全连接的，
每层之间的节点是无连接的。但是这种普通的神经网络对于很多问题却无能无力。

14.3.2LSTM循环神经网络简介

LSTM是一种特殊的RNNs，可以很好地解决长时依赖问题。

14.3.4TensorFlow实现循环神经网络

前面我们用卷积神经网络，对MNIST中的手写数进行设别，如果迭代20000次，精度可达到99.2左右，这个精度应该比较高；如果我们用循环神经网络来识别，是否可行？如果可以，效果如何？
为了适合使用RNN来识别，每张图片大小为28x28像素，我们把每张图片的每一行(元素个数为28)作为输入数据n_inputs，把每一行（一张图片共28行）看成是与时间序列有关的步数n_steps，这样图片的所有信息都用上了，而且适合使用RNN的应用场景。
启动ipython，进入ipython的交互式界面，导入需要的库，并启动交互式会话。

import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()

import tensorflow as tf

import numpy as np

sess = tf.InteractiveSession()

加载数据，具体实现细节可参考14.2.4.1小节，这里就不详细说明了。

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("./TensorFlowOnSpark/mnist", one_hot=True)

1 2	from tensorflow.examples.tutorials.mnist import input_data mnist = input_data.read_data_sets("./TensorFlowOnSpark/mnist", one_hot=True)

1. 构建模型
设置训练模型的超参数，学习速率，批量大小等。

learning_rate = 0.01
batch_size = 128

1 2	learning_rate = 0.01 batch_size = 128

设置循环神经网络的参数，包括输入数长度，输入的步数，隐藏节点数，类别数等。

n_input = 28
n_steps = 28
n_hidden = 256
n_classes = 10

n_input = 28

n_steps = 28

n_hidden = 256

n_classes = 10

定义输入数据及权重等

x = tf.placeholder(tf.float32, [None, n_steps, n_input])
y = tf.placeholder(tf.float32, [None, n_classes])

1 2	x = tf.placeholder(tf.float32, [None, n_steps, n_input]) y = tf.placeholder(tf.float32, [None, n_classes])

定义权重及初始化偏移量

# Classifier weights and biases
w = tf.Variable(tf.truncated_normal([n_hidden, n_classes]))
b = tf.Variable(tf.zeros([n_classes]))

# Classifier weights and biases

w = tf.Variable(tf.truncated_normal([n_hidden, n_classes]))

b = tf.Variable(tf.zeros([n_classes]))

定义并初始化Input gate、Forget gate、Output gate、Memory cell等的输入数据、权重、偏移量，这里采用tensorflow中truncated_normal函数初始化相关参数值。

# Input gate: input, previous output, and bias
ix = tf.Variable(tf.truncated_normal([n_input, n_hidden], -0.1, 0.1))
im = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], -0.1, 0.1))
ib = tf.Variable(tf.zeros([1, n_hidden]))
# Forget gate: input, previous output, and bias
fx = tf.Variable(tf.truncated_normal([n_input, n_hidden], -0.1, 0.1))
fm = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], -0.1, 0.1))
fb = tf.Variable(tf.zeros([1, n_hidden]))
# Memory cell: input, state, and bias
cx = tf.Variable(tf.truncated_normal([n_input, n_hidden], -0.1, 0.1))
cm = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], -0.1, 0.1))
cb = tf.Variable(tf.zeros([1, n_hidden]))
# Output gate: input, previous output, and bias
ox = tf.Variable(tf.truncated_normal([n_input, n_hidden], -0.1, 0.1))
om = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], -0.1, 0.1))
ob = tf.Variable(tf.zeros([1, n_hidden]))

# Input gate: input, previous output, and bias

ix = tf.Variable(tf.truncated_normal([n_input, n_hidden], -0.1, 0.1))

im = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], -0.1, 0.1))

ib = tf.Variable(tf.zeros([1, n_hidden]))

# Forget gate: input, previous output, and bias

fx = tf.Variable(tf.truncated_normal([n_input, n_hidden], -0.1, 0.1))

fm = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], -0.1, 0.1))

fb = tf.Variable(tf.zeros([1, n_hidden]))

# Memory cell: input, state, and bias

cx = tf.Variable(tf.truncated_normal([n_input, n_hidden], -0.1, 0.1))

cm = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], -0.1, 0.1))

cb = tf.Variable(tf.zeros([1, n_hidden]))

# Output gate: input, previous output, and bias

ox = tf.Variable(tf.truncated_normal([n_input, n_hidden], -0.1, 0.1))

om = tf.Variable(tf.truncated_normal([n_hidden, n_hidden], -0.1, 0.1))

ob = tf.Variable(tf.zeros([1, n_hidden]))

创建循环神经网络结构

def LSTMRNN(x, n_steps, n_input, n_hidden, n_classes): 
    # 定义LSTM单元
    def lstm_cell(i, o, state):
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        update = tf.tanh(tf.matmul(i, cx) + tf.matmul(o, cm) + cb)
        state = forget_gate * state + input_gate * update
        output_gate = tf.sigmoid(tf.matmul(i, ox) +  tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state

    # 把状态线上的多个值串联起来
    outputs = list()
    state = tf.Variable(tf.zeros([batch_size, n_hidden]))
    output = tf.Variable(tf.zeros([batch_size, n_hidden]))

    # 输入数据x用函数transpose把第一个维度与第二个维度互换，使用reshape把x
   #变为(n_steps*batch_size,n_input)的形状，然后利用split把x拆成长度为n_steps
   #的列表，这样适合LMTM的输入格式。
    x = tf.transpose(x, [1, 0, 2])
    x = tf.reshape(x, [-1, n_input])
    x = tf.split(x, n_steps, 0)
    for i in x:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)
    logits =tf.matmul(outputs[-1], w) + b
    return logits

def LSTMRNN(x, n_steps, n_input, n_hidden, n_classes):

# 定义LSTM单元

def lstm_cell(i, o, state):

input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)

forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)

update = tf.tanh(tf.matmul(i, cx) + tf.matmul(o, cm) + cb)

state = forget_gate * state + input_gate * update

output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)

return output_gate * tf.tanh(state), state

# 把状态线上的多个值串联起来

outputs = list()

state = tf.Variable(tf.zeros([batch_size, n_hidden]))

output = tf.Variable(tf.zeros([batch_size, n_hidden]))

# 输入数据x用函数transpose把第一个维度与第二个维度互换，使用reshape把x

#变为(n_steps*batch_size,n_input)的形状，然后利用split把x拆成长度为n_steps

#的列表，这样适合LMTM的输入格式。

x = tf.transpose(x, [1, 0, 2])

x = tf.reshape(x, [-1, n_input])

x = tf.split(x, n_steps, 0)

for i in x:

output, state = lstm_cell(i, output, state)

outputs.append(output)

logits =tf.matmul(outputs[-1], w) + b

return logits

2. 定义损失函数及优化器

pred = LSTMRNN(x, n_steps, n_input, n_hidden, n_classes)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph
sess.run(init)

pred = LSTMRNN(x, n_steps, n_input, n_hidden, n_classes)

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))

accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Initializing the variables

init = tf.global_variables_initializer()

# Launch the graph

sess.run(init)

3. 训练数据及评估模型

for step in range(10000):
    batch_x, batch_y = mnist.train.next_batch(batch_size)
    batch_x = batch_x.reshape((batch_size, n_steps, n_input))
    sess.run(optimizer, feed_dict={x: batch_x, y: batch_y})

    if step % 100 == 0:
        acc = sess.run(accuracy, feed_dict={x: batch_x, y: batch_y})
        loss = sess.run(cost, feed_dict={x: batch_x, y: batch_y})
        print "Iter " + str(step) + ", Minibatch Loss= " + "{:.6f}".format(loss) + ", Training Accuracy= " + "{:.5f}".format(acc)
print "Optimization Finished!"

for step in range(10000):

batch_x, batch_y = mnist.train.next_batch(batch_size)

batch_x = batch_x.reshape((batch_size, n_steps, n_input))

sess.run(optimizer, feed_dict={x: batch_x, y: batch_y})

if step % 100 == 0:

acc = sess.run(accuracy, feed_dict={x: batch_x, y: batch_y})

loss = sess.run(cost, feed_dict={x: batch_x, y: batch_y})

print "Iter " + str(step) + ", Minibatch Loss= " + "{:.6f}".format(loss) + ", Training Accuracy= " + "{:.5f}".format(acc)

print "Optimization Finished!"

运行结果，以下是最后批次的运行结果。
Iter 8000, Minibatch Loss= 0.085752, Training Accuracy= 0.97656
Iter 8100, Minibatch Loss= 0.065435, Training Accuracy= 0.96875
Iter 8200, Minibatch Loss= 0.088926, Training Accuracy= 0.97656
Iter 8300, Minibatch Loss= 0.039572, Training Accuracy= 1.00000
Iter 8400, Minibatch Loss= 0.050593, Training Accuracy= 0.98438
Iter 8500, Minibatch Loss= 0.030424, Training Accuracy= 0.99219
Iter 8600, Minibatch Loss= 0.026174, Training Accuracy= 0.99219
Iter 8700, Minibatch Loss= 0.045043, Training Accuracy= 0.98438
Iter 8800, Minibatch Loss= 0.031143, Training Accuracy= 0.98438
Iter 8900, Minibatch Loss= 0.055115, Training Accuracy= 0.99219
Iter 9000, Minibatch Loss= 0.061676, Training Accuracy= 0.98438
Iter 9100, Minibatch Loss= 0.123581, Training Accuracy= 0.97656
Iter 9200, Minibatch Loss= 0.057620, Training Accuracy= 0.98438
Iter 9300, Minibatch Loss= 0.043013, Training Accuracy= 0.99219
Iter 9400, Minibatch Loss= 0.067405, Training Accuracy= 0.98438
Iter 9500, Minibatch Loss= 0.020679, Training Accuracy= 1.00000
Iter 9600, Minibatch Loss= 0.079038, Training Accuracy= 0.98438
Iter 9700, Minibatch Loss= 0.080076, Training Accuracy= 0.97656
Iter 9800, Minibatch Loss= 0.010582, Training Accuracy= 1.00000
Iter 9900, Minibatch Loss= 0.019426, Training Accuracy= 1.00000
Optimization Finished!

在测试集上验证模型

# Calculate accuracy for 128 mnist test images
test_len = batch_size
test_data = mnist.test.images[:test_len].reshape((-1, n_steps, n_input))
test_label = mnist.test.labels[:test_len]
print "Testing Accuracy:", sess.run(accuracy, feed_dict={x: test_data, y: test_label})

# Calculate accuracy for 128 mnist test images

test_len = batch_size

test_data = mnist.test.images[:test_len].reshape((-1, n_steps, n_input))

test_label = mnist.test.labels[:test_len]

print "Testing Accuracy:", sess.run(accuracy, feed_dict={x: test_data, y: test_label})

运行结果如下，这个结果虽然比CNN结果低些，但也是不错的一个结果。
Testing Accuracy: 0.976562

14.4分布式TensorFlow

2016年4月14日，Google发布了分布式TensorFlow，能够支持在几百台机器上并行训练。分布式的TensorFlow由高性能的gRPC库作为底层技术支持。

14.4.1客户端、主节点和工作节点间的关系

14.4.2分布式模式

常用的深度学习训练模型为数据并行化，即TensorFlow任务采用相同的训练模型在不同的小批量数据集上进行训练，然后在参数服务器上更新模型的共享参数。TensorFlow支持同步训练和异步训练两种模型训练方式。

14.4.3在Pyspark集群环境运行TensorFlow

这节将通过神经网络来模拟一个一元二次方程：y=x^2-0.5，
TensorFlowOnSpark的详细配置，请参考14.4节。已集群方式启动pyspark：

pyspark --master spark://master:7077 --driver-memory 1G --total-executor-cores 2

1	pyspark --master spark://master:7077 --driver-memory 1G --total-executor-cores 2

进入pyspark的交换界面

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

#构造满足一元二次方程的函数

x_data = np.linspace(-1, 1, 300)[:, np.newaxis]
#加入一些噪声
noise = np.random.normal(0, 0.05, x_data.shape)
y_data = np.square(x_data) - 0.5 + noise
#画出散点图
fig=plt.figure()
ax=fig.add_subplot(1,1,1)
ax.scatter(x_data,y_data)

import tensorflow as tf

import numpy as np

import matplotlib.pyplot as plt

#构造满足一元二次方程的函数

x_data = np.linspace(-1, 1, 300)[:, np.newaxis]

#加入一些噪声

noise = np.random.normal(0, 0.05, x_data.shape)

y_data = np.square(x_data) - 0.5 + noise

#画出散点图

fig=plt.figure()

ax=fig.add_subplot(1,1,1)

ax.scatter(x_data,y_data)

构造一个神经网络

xs = tf.placeholder(tf.float32, [None, 1])
ys = tf.placeholder(tf.float32, [None, 1])
#定义添加层的函数

def add_layer(inputs, in_size, out_size, activation_function=None):
    weights = tf.Variable(tf.random_normal([in_size, out_size]))
    biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)
    Wx_plus_b = tf.matmul(inputs, weights) + biases
    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b)
    return outputs
#构造输入层为1，隐藏层20个，输出层为1的神经网络

h1 = add_layer(xs, 1, 20, activation_function=tf.nn.relu)

#构造输出层，隐含层的输出为输出层的输入

prediction = add_layer(h1, 20, 1, activation_function=None)
#计算损失值
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys - prediction),reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
#初始化所以变量
init = tf.global_variables_initializer()

sess = tf.Session()
sess.run(init)
#训练1000次

for i in range(1000):
    sess.run(train_step, feed_dict={xs: x_data, ys: y_data})
    if i % 50 == 0:
        print(sess.run(loss, feed_dict={xs: x_data, ys: y_data}))
        prediction_value=sess.run(prediction,feed_dict={xs:x_data})
        lines=ax.plot(x_data,prediction_value,'r',lw=5)

xs = tf.placeholder(tf.float32, [None, 1])

ys = tf.placeholder(tf.float32, [None, 1])

#定义添加层的函数

def add_layer(inputs, in_size, out_size, activation_function=None):

weights = tf.Variable(tf.random_normal([in_size, out_size]))

biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)

Wx_plus_b = tf.matmul(inputs, weights) + biases

if activation_function is None:

outputs = Wx_plus_b

else:

outputs = activation_function(Wx_plus_b)

return outputs

#构造输入层为1，隐藏层20个，输出层为1的神经网络

h1 = add_layer(xs, 1, 20, activation_function=tf.nn.relu)

#构造输出层，隐含层的输出为输出层的输入

prediction = add_layer(h1, 20, 1, activation_function=None)

#计算损失值

loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys - prediction),reduction_indices=[1]))

train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

#初始化所以变量

init = tf.global_variables_initializer()

sess = tf.Session()

sess.run(init)

#训练1000次

for i in range(1000):

sess.run(train_step, feed_dict={xs: x_data, ys: y_data})

if i % 50 == 0:

print(sess.run(loss, feed_dict={xs: x_data, ys: y_data}))

prediction_value=sess.run(prediction,feed_dict={xs:x_data})

lines=ax.plot(x_data,prediction_value,'r',lw=5)

输出结果：
1.62758
0.00996406
0.00634915
0.00483868
0.0043179
0.00399014
0.00368176
0.00337165
0.00309145
0.00284696
0.00267657
0.00255845
0.0024702
0.00240239
0.00235583
0.00232014
0.00229183
0.00226797
0.00224843

14.5TensorFlowOnSpark架构

TensorFlowOnSpark(TFoS)，支持 TensorFlow 在 Spark 和 Hadoop 上的分布式运行。

14.6TensorFlowOnSpark安装

安装TensorFlowOnSpark，采用pip管理工具进行安装，缺省安装是1.0版本。

pip  install  tensorflowonspark

1	pip install tensorflowonspark

执行以上命令后，在用户当前目录下，将新增一个TensorFlowOnSpark目录。
然后，在.bashrc定义该路径。

export TFoS_HOME=/home/hadoop/TensorFlowOnSpark

1	export TFoS_HOME=/home/hadoop/TensorFlowOnSpark

可以通过pyspark环境来验证，以上2个安装是否成功。

pyspark
>>> import tensorflow as tf
>>> from tensorflowonspark import TFCluster

pyspark

>>> import tensorflow as tf

>>> from tensorflowonspark import TFCluster

导入这些库，如果没有异常，说明安装成功。接下来开始为训练数据做一些准备工作。
对scripts目录进行打包，便于把该包发布到各worker上

cd TensorFlowOnSpark/scripts
zip  -r ../tfspark.zip *

1 2	cd TensorFlowOnSpark/scripts zip -r ../tfspark.zip *

14.7TensorFlowOnSpark实例

使用TensorFlowOnSpark对MNIST数据进行预测，MNIST是一个手写数字数据库，它有60000个训练样本集和10000个测试样本集，train-images-idx3-ubyte.gz、train-labels-idx1-ubyte.gz等四个文件。这些图像数据都保存在二进制文件中。每个样本图像的宽高为28*28。
下载MNIST数据

mkdir ${TFoS_HOME}/mnist
pushd ${TFoS_HOME}/mnist
curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"

mkdir ${TFoS_HOME}/mnist

pushd ${TFoS_HOME}/mnist

curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"

curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"

curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"

curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"

14.7.1TensorFlowOnSpark单机模式实例

设置本机相关参数，在单机上启动一个master节点，两个worker节点。

export MASTER=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=1 
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES})) 
${SPARK_HOME}/sbin/start-master.sh; ${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 2G ${MASTER}

export MASTER=spark://$(hostname):7077

export SPARK_WORKER_INSTANCES=2

export CORES_PER_WORKER=1

export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))

${SPARK_HOME}/sbin/start-master.sh; ${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 2G ${MASTER}

启动以后，通过jps可以看到如下一些进程。一个master，两个worker，namenode是之前启动hadoop的进程。

[hadoop@master ~]$ jps
25157 Master
25258 Worker
27229 RunJar
26893 NameNode
27087 SecondaryNameNode
13071 Jps
25215 Worker

[hadoop@master ~]$ jps

25157 Master

25258 Worker

27229 RunJar

26893 NameNode

27087 SecondaryNameNode

13071 Jps

25215 Worker

相关服务起来后，接下来把MNIST数据上传到HDFS上，并把数据转换cvs格式。

${SPARK_HOME}/bin/spark-submit --master spark://master:7077 ${TFoS_HOME}/examples/mnist/mnist_data_setup.py --output /examples/mnist/csv --format csv

1	${SPARK_HOME}/bin/spark-submit --master spark://master:7077 ${TFoS_HOME}/examples/mnist/mnist_data_setup.py --output /examples/mnist/csv --format csv

运行完成后，通过hadoop fs命令可以在HDFS上看到如下信息：

hadoop fs -ls /user/hadoop/examples/mnist/csv/train/images
Found 11 items
-rw-r--r--   1 hadoop supergroup          0 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/_SUCCESS
-rw-r--r--   1 hadoop supergroup    9338236 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00000
-rw-r--r--   1 hadoop supergroup   11231804 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00001
-rw-r--r--   1 hadoop supergroup   11214784 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00002
-rw-r--r--   1 hadoop supergroup   11226100 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00003
-rw-r--r--   1 hadoop supergroup   11212767 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00004
-rw-r--r--   1 hadoop supergroup   11173834 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00005
-rw-r--r--   1 hadoop supergroup   11214285 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00006
-rw-r--r--   1 hadoop supergroup   11201024 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00007
-rw-r--r--   1 hadoop supergroup   11194141 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00008
-rw-r--r--   1 hadoop supergroup   10449019 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00009

hadoop fs -ls /user/hadoop/examples/mnist/csv/train/images

Found 11 items

-rw-r--r-- 1 hadoop supergroup 0 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/_SUCCESS

-rw-r--r-- 1 hadoop supergroup 9338236 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00000

-rw-r--r-- 1 hadoop supergroup 11231804 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00001

-rw-r--r-- 1 hadoop supergroup 11214784 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00002

-rw-r--r-- 1 hadoop supergroup 11226100 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00003

-rw-r--r-- 1 hadoop supergroup 11212767 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00004

-rw-r--r-- 1 hadoop supergroup 11173834 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00005

-rw-r--r-- 1 hadoop supergroup 11214285 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00006

-rw-r--r-- 1 hadoop supergroup 11201024 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00007

-rw-r--r-- 1 hadoop supergroup 11194141 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00008

-rw-r--r-- 1 hadoop supergroup 10449019 2017-06-15 23:21 /user/hadoop/examples/mnist/csv/train/images/part-00009

数据加载转换成功后，开始训练数据。

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model

${SPARK_HOME}/bin/spark-submit \

--master ${MASTER} \

--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \

--conf spark.cores.max=${TOTAL_CORES} \

--conf spark.task.cpus=${CORES_PER_WORKER} \

--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \

${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \

--cluster_size ${SPARK_WORKER_INSTANCES} \

--images examples/mnist/csv/train/images \

--labels examples/mnist/csv/train/labels \

--format csv \

--mode train \

--model mnist_model

运行完成后，可以看到如下内容：

2017-06-18 05:30:50,072 INFO (MainThread-25741) Feeding training data
2017-06-18 05:32:07,655 INFO (MainThread-25741) Stopping TensorFlow nodes       
2017-06-18 05:32:07,883 INFO (MainThread-25741) Shutting down cluster
2017-06-18T05:32:13.346161 ===== Stop

2017-06-18 05:30:50,072 INFO (MainThread-25741) Feeding training data

2017-06-18 05:32:07,655 INFO (MainThread-25741) Stopping TensorFlow nodes

2017-06-18 05:32:07,883 INFO (MainThread-25741) Shutting down cluster

2017-06-18T05:32:13.346161 ===== Stop

如果运行过程中，过程被卡，可以调整mnist_dist.py文件中两处（在115,125行）logdir=logdir改为logdir=None。
训练完成后，接下来就是用测试集验证模型，并对结果进行预测。

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/test/images \
--labels examples/mnist/csv/test/labels \
--mode inference \
--format csv \
--model mnist_model \
--output predictions

${SPARK_HOME}/bin/spark-submit \

--master ${MASTER} \

--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \

--conf spark.cores.max=${TOTAL_CORES} \

--conf spark.task.cpus=${CORES_PER_WORKER} \

--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \

${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \

--cluster_size ${SPARK_WORKER_INSTANCES} \

--images examples/mnist/csv/test/images \

--labels examples/mnist/csv/test/labels \

--mode inference \

--format csv \

--model mnist_model \

--output predictions

运行完成以后，在HDFS上，就可看到predictions目录及相关内容。

[hadoop@master spark]$ hadoop fs -ls /user/hadoop/predictions
17/06/20 02:45:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 11 items
-rw-r--r--   1 hadoop supergroup          0 2017-06-18 14:04 /user/hadoop/predictions/_SUCCESS
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00000
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00001
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00002
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00003
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00004
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00005
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00006
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00007
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00008
-rw-r--r--   1 hadoop supergroup      51000 2017-06-18 14:04 /user/hadoop/predictions/part-00009

[hadoop@master spark]$ hadoop fs -ls /user/hadoop/predictions

17/06/20 02:45:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Found 11 items

-rw-r--r-- 1 hadoop supergroup 0 2017-06-18 14:04 /user/hadoop/predictions/_SUCCESS

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-18 14:04 /user/hadoop/predictions/part-00000

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-18 14:04 /user/hadoop/predictions/part-00001

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-18 14:04 /user/hadoop/predictions/part-00002

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-18 14:04 /user/hadoop/predictions/part-00003

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-18 14:04 /user/hadoop/predictions/part-00004

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-18 14:04 /user/hadoop/predictions/part-00005

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-18 14:04 /user/hadoop/predictions/part-00006

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-18 14:04 /user/hadoop/predictions/part-00007

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-18 14:04 /user/hadoop/predictions/part-00008

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-18 14:04 /user/hadoop/predictions/part-00009

打开其中一个文件，可以看到预测结果信息。
2017-06-18T05:51:42.397905 Label: 5, Prediction: 5
2017-06-18T05:51:42.397923 Label: 9, Prediction: 8
2017-06-18T05:51:42.397941 Label: 7, Prediction: 5
2017-06-18T05:51:42.397958 Label: 3, Prediction: 5
2017-06-18T05:51:42.397976 Label: 4, Prediction: 8
2017-06-18T05:51:42.397993 Label: 9, Prediction: 8
2017-06-18T05:51:42.398012 Label: 6, Prediction: 5

14.7.2TensorFlowOnSpark集群模式实例

设置本机相关参数，在以集群方式启动spark，一个master节点，slave1、slave2作为
两个worker节点，各节点资源配置信息。
训练模型

${SPARK_HOME}/bin/spark-submit \
--master spark://master:7077 \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=4 \
--conf spark.task.cpus=2 \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${JAVA_HOME}/jre/lib/amd64/server" \
--conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size 2 \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model

${SPARK_HOME}/bin/spark-submit \

--master spark://master:7077 \

--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \

--conf spark.cores.max=4 \

--conf spark.task.cpus=2 \

--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \

--conf spark.executorEnv.LD_LIBRARY_PATH="${JAVA_HOME}/jre/lib/amd64/server" \

--conf spark.executorEnv.CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath --glob):${CLASSPATH}" \

${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \

--cluster_size 2 \

--images examples/mnist/csv/train/images \

--labels examples/mnist/csv/train/labels \

--format csv \

--mode train \

--model mnist_model

测试模型

${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=4 \
--conf spark.task.cpus=2 \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size 2 \
--images examples/mnist/csv/test/images \
--labels examples/mnist/csv/test/labels \
--mode inference \
--format csv \
--model mnist_model \
--output predictions

${SPARK_HOME}/bin/spark-submit \

--master ${MASTER} \

--py-files ${TFoS_HOME}/tfspark.zip,${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \

--conf spark.cores.max=4 \

--conf spark.task.cpus=2 \

--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \

${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \

--cluster_size 2 \

--images examples/mnist/csv/test/images \

--labels examples/mnist/csv/test/labels \

--mode inference \

--format csv \

--model mnist_model \

--output predictions

查看运行结果

$ hadoop fs -ls /user/hadoop/predictions
Found 11 items
-rw-r--r--   1 hadoop supergroup          0 2017-06-20 08:55 /user/hadoop/predictions/_SUCCESS
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00000
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00001
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00002
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00003
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00004
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00005
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00006
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00007
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00008
-rw-r--r--   1 hadoop supergroup      51000 2017-06-20 08:55 /user/hadoop/predictions/part-00009

$ hadoop fs -ls /user/hadoop/predictions

Found 11 items

-rw-r--r-- 1 hadoop supergroup 0 2017-06-20 08:55 /user/hadoop/predictions/_SUCCESS

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-20 08:55 /user/hadoop/predictions/part-00000

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-20 08:55 /user/hadoop/predictions/part-00001

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-20 08:55 /user/hadoop/predictions/part-00002

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-20 08:55 /user/hadoop/predictions/part-00003

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-20 08:55 /user/hadoop/predictions/part-00004

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-20 08:55 /user/hadoop/predictions/part-00005

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-20 08:55 /user/hadoop/predictions/part-00006

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-20 08:55 /user/hadoop/predictions/part-00007

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-20 08:55 /user/hadoop/predictions/part-00008

-rw-r--r-- 1 hadoop supergroup 51000 2017-06-20 08:55 /user/hadoop/predictions/part-00009

运行时各节点报错信息，可以查看spark/work/app-20170620085449-0003/1下的

14.8小结

为了弥补Spark机器学习中，缺乏神经网络、深度学习等的不足，这章我们介绍脱胎于AlphaGo的深度学习框架TensorFlow，以基础知识为主，在这个基础上介绍了使用TensorFlow的几个实例，最后介绍TensorFlow的分布式架构及与Spark整合的架构TensorFlowOnSpark。

第13章使用Spark Streaming构建在线学习模型

前面我们介绍的这些算法，一般基于一个或几个相对固定的文件，以这样的数据为模型处理的源数据是固定的，这样的数据或许很大，很多。训练或测试都是建立在这些固定数据之上，当然，测试时，可能取这个数据源之外的数据，如新数据或其他数据等。训练模型的数据一般是相对固定的，这样的机器学习的场景是很普遍的。
但实际环境中，还有其他一些场景，如源数据是经常变换，就像流水一样，时刻在变换着，如很多在线数据、很多日志数据等等。面对这些数据的学习我们该如何处理呢？
这个问题实际上属于流水计算问题，目前解决这类问题有Spark Streaming、Storm、Samza等。这章我们主要介绍Spark Streaming。
本章主要包括以下内容：
 介绍Spark Streaming主要内容
 Spark Streaming入门实例
 在线学习实例

13.1 Spark Streaming简介

Spark Streaming 是Spark核心API的一个扩展，可以实现高吞吐量的、具备容错机制的实时流数据的处理。支持从多种数据源获取数据，包括Kafk、Flume、Twitter、ZeroMQ、Kinesis 以及TCP sockets，从数据源获取数据之后，可以使用诸如map、reduce、join和window等高级函数进行复杂算法的处理。最后还可以将处理结果存储到文件系统，数据库和现场仪表盘。在“One Stack rule them all”的基础上，还可以使用Spark的其他子框架，如集群学习、图计算等，对流数据进行处理。

13.1.1Spark Streaming常用术语

在简介Spark Streaming前，我们先简单介绍Streaming的一些常用术语。

13.1.2Spark Streaming处理流程

Spark Streaming处理的数据流图如图13-1所示。

图13-1 Spark Streaming计算过程

13.2 Dstream操作

RDD有很多操作和转换，与RDD类似，DStream也提供了自己的一系列操作方法，本节主要介绍如何操作DStream，包括输入、转换、修改状态及输出等。

13.2.1 Dstream输入

在Spark Streaming中所有的操作都是基于流的，而输入源是一切操作的起点。
Spark Streaming 提供两种类型的流式输入数据源：
 基础输入源：能直接应用于StreamingContext API输入源。例如：文件系统、Socket（套接字）连接和 Akka actors；
 高级输入源：能应用于特定工具类的输入源，如 Kafka、Flume、Kinesis、Twitter 等，使用这些输入源需要导入一些额外依赖包。

13.2.2 Dstream转换

DStream转换操作是在一个或多个DStream上创建新的DStream。

13.2.3 Dstream修改

Spark Streaming除提供一些基本操作，还提供一些状态操作。

13.2 .4Dstream输出

Spark Streaming允许DStream的数据输出到外部系统，如数据库、文件系统等。

13.3 Spark Streaming应用实例

先启动nc，端口为9999

nc -lk 9999

1	nc -lk 9999

然后,以本地方式启动spark shell

//导入类或包
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}
import spark.implicits._


// 创建一个间隔时间为3秒的context
val ssc = new StreamingContext(sc, Seconds(3))
// 创建一个socket stream ，基于master:9999
val lines = ssc.socketTextStream("master",9999)
val words = lines.flatMap(_.split(" "))
//为便于使用SQL进行统计，把DStream的RDD转换为DataFrame。
// 把RDD[String] 转换为RDD[case class] ，最后转换为DataFrame
case class Record(word: String)

words.foreachRDD { (rdd:RDD[String], time:Time) =>
val wordsDataFrame = rdd.map(w => Record(w)).toDF()

// 创建一个临时视图
wordsDataFrame.createOrReplaceTempView("words")
//使用SQL进行统计
val wordCountsDataFrame = spark.sql("select word, count(*) as total from words group by word")
println(s"========= $time =========")
wordCountsDataFrame.show()
    }
ssc.start()
ssc.awaitTermination()

//导入类或包

import org.apache.spark.SparkConf

import org.apache.spark.rdd.RDD

import org.apache.spark.sql.SparkSession

import org.apache.spark.storage.StorageLevel

import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

import spark.implicits._

// 创建一个间隔时间为3秒的context

val ssc = new StreamingContext(sc, Seconds(3))

// 创建一个socket stream ，基于master:9999

val lines = ssc.socketTextStream("master",9999)

val words = lines.flatMap(_.split(" "))

//为便于使用SQL进行统计，把DStream的RDD转换为DataFrame。

// 把RDD[String] 转换为RDD[case class] ，最后转换为DataFrame

case class Record(word: String)

words.foreachRDD { (rdd:RDD[String], time:Time) =>

val wordsDataFrame = rdd.map(w => Record(w)).toDF()

// 创建一个临时视图

wordsDataFrame.createOrReplaceTempView("words")

//使用SQL进行统计

val wordCountsDataFrame = spark.sql("select word, count(*) as total from words group by word")

println(s"========= $time =========")

wordCountsDataFrame.show()

}

ssc.start()

ssc.awaitTermination()

在启动了nc的界面输入：

ok ok m p py py py

1	ok ok m p py py py

在spark shell界面，可以看到如下输出：

========= 1494714360000 ms =========
+----+-----+
|word|total|
+----+-----+
|   m|    1|
|  ok|    2|
|   p|    1|
|  py|    3|
+----+-----+

========= 1494714360000 ms =========

+----+-----+

|word|total|

+----+-----+

| m| 1|

| ok| 2|

| p| 1|

| py| 3|

+----+-----+

13.4 Spark Streaming在线学习实例

前面我们简单介绍一个利用nc产生文本数据，Spark Streaming实时统计词频的一个实例，通过这个例子，我们对Streaming有个大致了解，它的源数据可以是实时产生、实时变化的，基于这个数据流，Spark Streaming能实时进行统计词频信息，并输出到界面。
除了统计词频，实际上Spark Streaming 还可以做在线机器学习工作，目前Spark Streaming支持Streaming Linear Regression, Streaming KMeans等，这节我们模拟一个在线学习线性回归的算法，源数据为多个文件，首先在一个文件中训练模型，然后在新数据上进行调整模型，对新数据进行预测等。

//导入需要的类
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.feature.StandardScaler
import breeze.linalg.DenseVector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._


//交互式编程
val ssc = new StreamingContext(sc, Seconds(10))
val stream = ssc.textFileStream("file:///home/hadoop/data/streaming/traindir")

val NumFeatures = 11
val zeroVector = DenseVector.zeros[Double](NumFeatures)
val model = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.dense(zeroVector.data))
.setNumIterations(20)
.setRegParam(0.8)
.setStepSize(0.01) 

		
//创建一个含标签的数据流
val labeledStream = stream.map { line =>
val split = line.split(";")
val y = split(11).toDouble
val features=split.slice(0,11).map(_.toDouble)
    LabeledPoint(label = y, features = Vectors.dense(features))
    }	
//在数据流上训练测试模型。    
model.trainOn(labeledStream)
model.predictOnValues(labeledStream.map(lp => (lp.label, lp.features))) .print()
//启动Spark Streaming
ssc.start()
ssc.awaitTermination()

//导入需要的类

import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.mllib.feature.StandardScaler

import breeze.linalg.DenseVector

import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

import org.apache.spark.streaming._

import org.apache.spark.streaming.StreamingContext._

//交互式编程

val ssc = new StreamingContext(sc, Seconds(10))

val stream = ssc.textFileStream("file:///home/hadoop/data/streaming/traindir")

val NumFeatures = 11

val zeroVector = DenseVector.zeros[Double](NumFeatures)

val model = new StreamingLinearRegressionWithSGD()

.setInitialWeights(Vectors.dense(zeroVector.data))

.setNumIterations(20)

.setRegParam(0.8)

.setStepSize(0.01)

//创建一个含标签的数据流

val labeledStream = stream.map { line =>

val split = line.split(";")

val y = split(11).toDouble

val features=split.slice(0,11).map(_.toDouble)

LabeledPoint(label = y, features = Vectors.dense(features))

}

//在数据流上训练测试模型。

model.trainOn(labeledStream)

model.predictOnValues(labeledStream.map(lp => (lp.label, lp.features))) .print()

//启动Spark Streaming

ssc.start()

ssc.awaitTermination()

13.5小结

前几章主要介绍了Spark ML对批量数据或离线数据的分析和处理，本章主要介绍Spark Streamin对在线数据或流式数据的处理及分析，首先对Spark Streaming的一些概念、输入源、Dstream的一些转换、修改、输出作了简单介绍，然后，通过两个实例把这些内容结合在一起，进一步说明Spark Streaming在线统计、在线学习的具体使用。

本章数据集下载
第12章 Spark R 朴素贝叶斯模型

前一章我们介绍了PySpark，就是用Python语言操作Spark大数据计算框架上的任务，这样把自然把Python的优点与Spark的优势进行叠加。Spark提供了Python的API，也提供了R语言的API，其组件名称为Spark R。Spark R的运行原理或架构，具体请看图12-1。

图12-1 Spark R 架构图

Spark R的架构类似于PySpark，Driver端除了一个JVM进程（包含一个SparkContext,在Spark2.X中SparkContext已经被SparkSession所代替）外，还有起一个R的进程，这两个进程通过Socket进行通信，用户可以提交R语言代码，R的进程会执行这些R代码，
当R代码调用Spark相关函数时，R进程会通过Socket触发JVM中的对应任务。
当R进程向JVM进程提交任务的时候，R会把子任务需要的环境进行打包，并发送到JVM的driver端。通过R生成的RDD都会是RRDD类型，当触发RRDD的action时，Spark的执行器会开启一个R进程，执行器和R进程通过Socket进行通信。执行器会把任务和所需的环境发送给R进程，R进程会加载对应的package，执行任务，并返回结果。
本章通过一个实例来说明如何使用Spark R，具体内容如下：
 Spark R简介
 把数据上传到HDFS,然后导入Hive，最后从Hive读取数据
 使用朴素贝叶斯分类器
 探索数据
 预处理数据
 训练模型
 评估模型

12.1. Spark R简介

目前SparkR的最新版本为2.0.1，API参考文档（http://spark.apache.org/docs/latest/api/R/index.html）。

12.2获取数据

12.2.1 SparkDataFrame数据结构说明

SparkDataFrame是Spark提供的分布式数据格式（DataFrame）。类似于关系数据库中的表或R语言中的DataFrame。SparkDataFrames可以从各种各样的源构造，例如：结构化数据文件，Hive中的表，外部数据库或现有的本地数据。

12.2.2创建SparkDataFrame

1.从本地文件加载数据，生成SparkDataFrame
SparkR支持通过SparkDataFrame接口对各种数据源进行操作。示例：

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib")#加载sparkR
## 
## Attaching package: 'SparkR'
## The following objects are masked from 'package:stats':
## 
##     cov, filter, lag, na.omit, predict, sd, var, window
## The following objects are masked from 'package:base':
## 
##     as.data.frame, colnames, colnames<-, drop, endsWith,
##     intersect, rank, rbind, sample, startsWith, subset, summary,
##     transform, union
sparkR.session(sparkHome ='/u01/bigdata/spark' )#启动Spark环境
## Spark package found in SPARK_HOME: /u01/bigdata/spark
## Launching java with spark-submit command /u01/bigdata/spark/bin/spark-submit   sparkr-shell /tmp/RtmpA30Gvz/backend_port2ad11c0705d8
## Java ref type org.apache.spark.sql.SparkSession id 1
# 读取本地csv文件
Sparkdf <-read.df("/u01/bigdata/data/df2.csv",source='csv',header='TRUE',inferSchema ="true")
# 查看SparkDataFrame
head(Sparkdf)
##   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib")#加载sparkR

## Attaching package: 'SparkR'

## The following objects are masked from 'package:stats':

## cov, filter, lag, na.omit, predict, sd, var, window

## The following objects are masked from 'package:base':

## as.data.frame, colnames, colnames<-, drop, endsWith,

## intersect, rank, rbind, sample, startsWith, subset, summary,

## transform, union

sparkR.session(sparkHome ='/u01/bigdata/spark' )#启动Spark环境

## Spark package found in SPARK_HOME: /u01/bigdata/spark

## Launching java with spark-submit command /u01/bigdata/spark/bin/spark-submit sparkr-shell /tmp/RtmpA30Gvz/backend_port2ad11c0705d8

## Java ref type org.apache.spark.sql.SparkSession id 1

# 读取本地csv文件

Sparkdf <-read.df("/u01/bigdata/data/df2.csv",source='csv',header='TRUE',inferSchema ="true")

# 查看SparkDataFrame

head(Sparkdf)

## Sepal_Length Sepal_Width Petal_Length Petal_Width Species

## 1 5.1 3.5 1.4 0.2 setosa

## 2 4.9 3.0 1.4 0.2 setosa

## 3 4.7 3.2 1.3 0.2 setosa

## 4 4.6 3.1 1.5 0.2 setosa

## 5 5.0 3.6 1.4 0.2 setosa

## 6 5.4 3.9 1.7 0.4 setosa

2.利用R环境的data frames创建SparkDataFrame
创建SparkDataFrame的最简单的方法是将本地R环境变量中的data frames转换为SparkDataFrame。我们可以使用as.DataFrame或createDataFrame函数来创建SparkDataFrame。作为示例，我们使用R自带的iris数据集来创建SparkDataFrame。

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib")#加载sparkR
## 
## Attaching package: 'SparkR'
## The following objects are masked from 'package:stats':
## 
##     cov, filter, lag, na.omit, predict, sd, var, window
## The following objects are masked from 'package:base':
## 
##     as.data.frame, colnames, colnames<-, drop, endsWith,
##     intersect, rank, rbind, sample, startsWith, subset, summary,
##     transform, union
sparkR.session(sparkHome ='/u01/bigdata/spark' )#启动Spark环境
## Java ref type org.apache.spark.sql.SparkSession id 1
# 创建SparkDataFrame Sparkdf，数据来自iris数据集
Sparkdf <-as.DataFrame(iris)
# 查看刚创建好的SparkDataFrame
head(Sparkdf)
##   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib")#加载sparkR

## Attaching package: 'SparkR'

## The following objects are masked from 'package:stats':

## cov, filter, lag, na.omit, predict, sd, var, window

## The following objects are masked from 'package:base':

## as.data.frame, colnames, colnames<-, drop, endsWith,

## intersect, rank, rbind, sample, startsWith, subset, summary,

## transform, union

sparkR.session(sparkHome ='/u01/bigdata/spark' )#启动Spark环境

## Java ref type org.apache.spark.sql.SparkSession id 1

# 创建SparkDataFrame Sparkdf，数据来自iris数据集

Sparkdf <-as.DataFrame(iris)

# 查看刚创建好的SparkDataFrame

head(Sparkdf)

## Sepal_Length Sepal_Width Petal_Length Petal_Width Species

## 1 5.1 3.5 1.4 0.2 setosa

## 2 4.9 3.0 1.4 0.2 setosa

## 3 4.7 3.2 1.3 0.2 setosa

## 4 4.6 3.1 1.5 0.2 setosa

## 5 5.0 3.6 1.4 0.2 setosa

## 6 5.4 3.9 1.7 0.4 setosa

3.从HDFS文件系统加载数据，生成SparkDataFrame
示例：

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib")#加载sparkR
## 
## Attaching package: 'SparkR'
## The following objects are masked from 'package:stats':
## 
##     cov, filter, lag, na.omit, predict, sd, var, window
## The following objects are masked from 'package:base':
## 
##     as.data.frame, colnames, colnames<-, drop, endsWith,
##     intersect, rank, rbind, sample, startsWith, subset, summary,
##     transform, union
## Java ref type org.apache.spark.sql.SparkSession id 1
# 读取HDFS文件
Sparkdf <-read.df("hdfs://192.168.1.112:9000/u01/bigdata/data/df2.csv",source='csv',header='TRUE',inferSchema ="true")
# 查看SparkDataFrame
head(Sparkdf)
##   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib")#加载sparkR

## Attaching package: 'SparkR'

## The following objects are masked from 'package:stats':

## cov, filter, lag, na.omit, predict, sd, var, window

## The following objects are masked from 'package:base':

## as.data.frame, colnames, colnames<-, drop, endsWith,

## intersect, rank, rbind, sample, startsWith, subset, summary,

## transform, union

## Java ref type org.apache.spark.sql.SparkSession id 1

# 读取HDFS文件

Sparkdf <-read.df("hdfs://192.168.1.112:9000/u01/bigdata/data/df2.csv",source='csv',header='TRUE',inferSchema ="true")

# 查看SparkDataFrame

head(Sparkdf)

## Sepal_Length Sepal_Width Petal_Length Petal_Width Species

## 1 5.1 3.5 1.4 0.2 setosa

## 2 4.9 3.0 1.4 0.2 setosa

## 3 4.7 3.2 1.3 0.2 setosa

## 4 4.6 3.1 1.5 0.2 setosa

## 5 5.0 3.6 1.4 0.2 setosa

## 6 5.4 3.9 1.7 0.4 setosa

4.读取Hive数据仓库中的表，生成SparkDataFrame
我们还可以从Hive表创建SparkDataFrame。

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib")#加载sparkR
# 查看数据库
sql('show databases')
## SparkDataFrame[databaseName:string]
# 选择hive库
sql('use hive')
## SparkDataFrame[]
# 查看hive数据库的表
sql('show tables')
## SparkDataFrame[database:string, tableName:string, isTemporary:boolean]
# 查看表df2的信息
sql('desc df2')
## SparkDataFrame[col_name:string, data_type:string, comment:string]
# 读取hive表df2，生成SparkDataFrame
Sparkdf<-sql('select * from df2')
# 查看SparkDataFrame
head(Sparkdf)
##       height     weight
## 1  0.3307575 -1.4197984
## 2  0.4970992 -1.4364733
## 3  1.4477968 -0.7579736
## 4  0.6815300 -1.7573564
## 5  0.8915567  1.1815332
## 6 -2.2494993 -1.6438995

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib")#加载sparkR

# 查看数据库

sql('show databases')

## SparkDataFrame[databaseName:string]

# 选择hive库

sql('use hive')

## SparkDataFrame[]

# 查看hive数据库的表

sql('show tables')

## SparkDataFrame[database:string, tableName:string, isTemporary:boolean]

# 查看表df2的信息

sql('desc df2')

## SparkDataFrame[col_name:string, data_type:string, comment:string]

# 读取hive表df2，生成SparkDataFrame

Sparkdf<-sql('select * from df2')

# 查看SparkDataFrame

head(Sparkdf)

## height weight

## 1 0.3307575 -1.4197984

## 2 0.4970992 -1.4364733

## 3 1.4477968 -0.7579736

## 4 0.6815300 -1.7573564

## 5 0.8915567 1.1815332

## 6 -2.2494993 -1.6438995

12.2.3 SparkDataFrame的常用操作

1.选择行，或者列

df <-as.DataFrame(iris)
str(df)
## 'SparkDataFrame': 5 variables:
##  $ Sepal_Length: num 5.1 4.9 4.7 4.6 5 5.4
##  $ Sepal_Width : num 3.5 3 3.2 3.1 3.6 3.9
##  $ Petal_Length: num 1.4 1.4 1.3 1.5 1.4 1.7
##  $ Petal_Width : num 0.2 0.2 0.2 0.2 0.2 0.4
##  $ Species     : chr "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"
head(df)
##   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# 选择Sepal_Length列
head(select(df, df$Sepal_Length))
##   Sepal_Length
## 1          5.1
## 2          4.9
## 3          4.7
## 4          4.6
## 5          5.0
## 6          5.4
# 或者
head(select(df, "Sepal_Length"))
##   Sepal_Length
## 1          5.1
## 2          4.9
## 3          4.7
## 4          4.6
## 5          5.0
## 6          5.4
# 过滤出Sepal_Length小于5的行
head(filter(df, df$Sepal_Length <5))
##   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1          4.9         3.0          1.4         0.2  setosa
## 2          4.7         3.2          1.3         0.2  setosa
## 3          4.6         3.1          1.5         0.2  setosa
## 4          4.6         3.4          1.4         0.3  setosa
## 5          4.4         2.9          1.4         0.2  setosa
## 6          4.9         3.1          1.5         0.1  setosa

df <-as.DataFrame(iris)

str(df)

## 'SparkDataFrame': 5 variables:

## $ Sepal_Length: num 5.1 4.9 4.7 4.6 5 5.4

## $ Sepal_Width : num 3.5 3 3.2 3.1 3.6 3.9

## $ Petal_Length: num 1.4 1.4 1.3 1.5 1.4 1.7

## $ Petal_Width : num 0.2 0.2 0.2 0.2 0.2 0.4

## $ Species : chr "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"

head(df)

## Sepal_Length Sepal_Width Petal_Length Petal_Width Species

## 1 5.1 3.5 1.4 0.2 setosa

## 2 4.9 3.0 1.4 0.2 setosa

## 3 4.7 3.2 1.3 0.2 setosa

## 4 4.6 3.1 1.5 0.2 setosa

## 5 5.0 3.6 1.4 0.2 setosa

## 6 5.4 3.9 1.7 0.4 setosa

# 选择Sepal_Length列

head(select(df, df$Sepal_Length))

## Sepal_Length

## 1 5.1

## 2 4.9

## 3 4.7

## 4 4.6

## 5 5.0

## 6 5.4

# 或者

head(select(df, "Sepal_Length"))

## Sepal_Length

## 1 5.1

## 2 4.9

## 3 4.7

## 4 4.6

## 5 5.0

## 6 5.4

# 过滤出Sepal_Length小于5的行

head(filter(df, df$Sepal_Length <5))

## Sepal_Length Sepal_Width Petal_Length Petal_Width Species

## 1 4.9 3.0 1.4 0.2 setosa

## 2 4.7 3.2 1.3 0.2 setosa

## 3 4.6 3.1 1.5 0.2 setosa

## 4 4.6 3.4 1.4 0.3 setosa

## 5 4.4 2.9 1.4 0.2 setosa

## 6 4.9 3.1 1.5 0.1 setosa

2.数据分组，聚合

df <-as.DataFrame(faithful)
#数据分组,并统计每组出现的个数
head(summarize(groupBy(df, df$waiting), count =n(df$waiting)))
##   waiting count
## 1      70     4
## 2      67     1
## 3      69     2
## 4      88     6
## 5      49     5
## 6      64     4
# 对结果进行排序
waiting_counts <-summarize(groupBy(df, df$waiting), count =n(df$waiting))
head(arrange(waiting_counts, desc(waiting_counts$count)))
##   waiting count
## 1      78    15
## 2      83    14
## 3      81    13
## 4      77    12
## 5      82    12
## 6      79    10

df <-as.DataFrame(faithful)

#数据分组,并统计每组出现的个数

head(summarize(groupBy(df, df$waiting), count =n(df$waiting)))

## waiting count

## 1 70 4

## 2 67 1

## 3 69 2

## 4 88 6

## 5 49 5

## 6 64 4

# 对结果进行排序

waiting_counts <-summarize(groupBy(df, df$waiting), count =n(df$waiting))

head(arrange(waiting_counts, desc(waiting_counts$count)))

## waiting count

## 1 78 15

## 2 83 14

## 3 81 13

## 4 77 12

## 5 82 12

## 6 79 10

3.对SparkDataFrame的列进行运算操作

df$waiting_secs <-df$waiting *60
head(df)
##   eruptions waiting waiting_secs
## 1     3.600      79         4740
## 2     1.800      54         3240
## 3     3.333      74         4440
## 4     2.283      62         3720
## 5     4.533      85         5100
## 6     2.883      55         3300

df$waiting_secs <-df$waiting *60

head(df)

## eruptions waiting waiting_secs

## 1 3.600 79 4740

## 2 1.800 54 3240

## 3 3.333 74 4440

## 4 2.283 62 3720

## 5 4.533 85 5100

## 6 2.883 55 3300

4.apply系列函数的应用
• dapply函数类似于R语言的apply函数，看一个示例。

df <-as.DataFrame(iris)
df1 <-dapply(df, function(x) { x[x[,1]>6,]},schema =schema(df))
head(collect(df1))
##   Sepal_Length Sepal_Width Petal_Length Petal_Width    Species
## 1          7.0         3.2          4.7         1.4 versicolor
## 2          6.4         3.2          4.5         1.5 versicolor
## 3          6.9         3.1          4.9         1.5 versicolor
## 4          6.5         2.8          4.6         1.5 versicolor
## 5          6.3         3.3          4.7         1.6 versicolor
## 6          6.6         2.9          4.6         1.3 versicolor
str(df1)
## 'SparkDataFrame': 5 variables:
##  $ Sepal_Length: num 7 6.4 6.9 6.5 6.3 6.6
##  $ Sepal_Width : num 3.2 3.2 3.1 2.8 3.3 2.9
##  $ Petal_Length: num 4.7 4.5 4.9 4.6 4.7 4.6
##  $ Petal_Width : num 1.4 1.5 1.5 1.5 1.6 1.3
##  $ Species     : chr "versicolor" "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"
dim(df1)
## [1] 61  5

df <-as.DataFrame(iris)

df1 <-dapply(df, function(x) { x[x[,1]>6,]},schema =schema(df))

head(collect(df1))

## Sepal_Length Sepal_Width Petal_Length Petal_Width Species

## 1 7.0 3.2 4.7 1.4 versicolor

## 2 6.4 3.2 4.5 1.5 versicolor

## 3 6.9 3.1 4.9 1.5 versicolor

## 4 6.5 2.8 4.6 1.5 versicolor

## 5 6.3 3.3 4.7 1.6 versicolor

## 6 6.6 2.9 4.6 1.3 versicolor

str(df1)

## 'SparkDataFrame': 5 variables:

## $ Sepal_Length: num 7 6.4 6.9 6.5 6.3 6.6

## $ Sepal_Width : num 3.2 3.2 3.1 2.8 3.3 2.9

## $ Petal_Length: num 4.7 4.5 4.9 4.6 4.7 4.6

## $ Petal_Width : num 1.4 1.5 1.5 1.5 1.6 1.3

## $ Species : chr "versicolor" "versicolor" "versicolor" "versicolor" "versicolor" "versicolor"

dim(df1)

## [1] 61 5

12.3朴素贝叶斯分类器

该案例数据来自泰坦尼克号人员存活情况，响应变量为Survived，包含2个分类（yes，no），特征变量有Sex 、 Age 、Class（船舱等级），说明如下：
Class :0 = crew, 1 = first, 2 = second, 3 = third Age :1 = adult, 0 = child Sex :1 = male, 0 = female Survived :1 = yes, 0 = no

12.3.1数据探查

让我们来观察一下数据，

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib")#加载sparkR
## 
## Attaching package: 'SparkR'
## The following objects are masked from 'package:stats':
## 
##     cov, filter, lag, na.omit, predict, sd, var, window
## The following objects are masked from 'package:base':
## 
##     as.data.frame, colnames, colnames<-, drop, endsWith,
##     intersect, rank, rbind, sample, startsWith, subset, summary,
##     transform, union
## Java ref type org.apache.spark.sql.SparkSession id 1
## SparkDataFrame[]
#从hive仓库加载数据
titanic <-sql('select * from titanic')
# 查看SparkDataFrame
head(titanic)
##   class age sex survived
## 1     1   1   1        1
## 2     1   1   1        1
## 3     1   1   1        1
## 4     1   1   1        1
## 5     1   1   1        1
## 6     1   1   1        1
dim(titanic)
## [1] 2201    4
# 查看SparkDataFrame
dim(titanic)#查看数据的记录数以及维度数量
## [1] 2201    4

library("SparkR", lib.loc="/u01/bigdata/spark/R/lib")#加载sparkR

## Attaching package: 'SparkR'

## The following objects are masked from 'package:stats':

## cov, filter, lag, na.omit, predict, sd, var, window

## The following objects are masked from 'package:base':

## as.data.frame, colnames, colnames<-, drop, endsWith,

## intersect, rank, rbind, sample, startsWith, subset, summary,

## transform, union

## Java ref type org.apache.spark.sql.SparkSession id 1

## SparkDataFrame[]

#从hive仓库加载数据

titanic <-sql('select * from titanic')

# 查看SparkDataFrame

head(titanic)

## class age sex survived

## 1 1 1 1 1

## 2 1 1 1 1

## 3 1 1 1 1

## 4 1 1 1 1

## 5 1 1 1 1

## 6 1 1 1 1

dim(titanic)

## [1] 2201 4

# 查看SparkDataFrame

dim(titanic)#查看数据的记录数以及维度数量

## [1] 2201 4

12.3.2对原始数据集进行转换

titanic_df=as.data.frame(titanic)
titanic <-as.data.frame(table(titanic_df))
colnames(titanic)<-paste0(toupper(substring(colnames(titanic),1,1)),substring(colnames(titanic),2))
titanic_temp<-titanic[titanic$Freq >0, -5]

head(titanic_temp)
##    Class Age Sex Survived
## 4      3   0   0        0
## 5      0   1   0        0
## 6      1   1   0        0
## 7      2   1   0        0
## 8      3   1   0        0
## 12     3   0   1        0

titanic_df=as.data.frame(titanic)

titanic <-as.data.frame(table(titanic_df))

colnames(titanic)<-paste0(toupper(substring(colnames(titanic),1,1)),substring(colnames(titanic),2))

titanic_temp<-titanic[titanic$Freq >0, -5]

head(titanic_temp)

## Class Age Sex Survived

## 4 3 0 0 0

## 5 0 1 0 0

## 6 1 1 0 0

## 7 2 1 0 0

## 8 3 1 0 0

## 12 3 0 1 0

12.3.3查看不同船舱的生还率差异

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:SparkR':
## 
##     arrange, between, collect, contains, count, cume_dist,
##     dense_rank, desc, distinct, explain, filter, first, group_by,
##     intersect, lag, last, lead, mutate, n, n_distinct, ntile,
##     percent_rank, rename, row_number, sample_frac, select, sql,
##     summarize, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
tempdata<-aggregate(Freq~Class+Survived,data = titanic,FUN = sum)
ggplot(data = tempdata,mapping =aes(x = Class,y=Freq,fill=Survived))+geom_bar(position ='dodge',stat ='identity')+ylab("number")+xlim(c("1","2","3","0"))+theme(text=element_text(family ="Italic",size=18))

library(ggplot2)

library(dplyr)

## Attaching package: 'dplyr'

## The following objects are masked from 'package:SparkR':

## arrange, between, collect, contains, count, cume_dist,

## dense_rank, desc, distinct, explain, filter, first, group_by,

## intersect, lag, last, lead, mutate, n, n_distinct, ntile,

## percent_rank, rename, row_number, sample_frac, select, sql,

## summarize, union

## The following objects are masked from 'package:stats':

## filter, lag

## The following objects are masked from 'package:base':

## intersect, setdiff, setequal, union

tempdata<-aggregate(Freq~Class+Survived,data = titanic,FUN = sum)

ggplot(data = tempdata,mapping =aes(x = Class,y=Freq,fill=Survived))+geom_bar(position ='dodge',stat ='identity')+ylab("number")+xlim(c("1","2","3","0"))+theme(text=element_text(family ="Italic",size=18))

然后，对比一下不同性别之间的生还率：

tempdata<-aggregate(Freq~Sex+Survived,data = titanic,FUN = sum)
ggplot(data = tempdata,mapping =aes(x = Sex,y=Freq,fill=Survived))+geom_bar(position ='dodge',stat ='identity')

1 2	tempdata<-aggregate(Freq~Sex+Survived,data = titanic,FUN = sum) ggplot(data = tempdata,mapping =aes(x = Sex,y=Freq,fill=Survived))+geom_bar(position ='dodge',stat ='identity')

最后再看看不同年龄段的生还情况：

tempdata<-aggregate(Freq~Age+Survived,data = titanic,FUN = sum)
ggplot(data = tempdata,mapping =aes(x = Age,y=Freq,fill=Survived))+geom_bar(position ='dodge',stat ='identity')

1 2	tempdata<-aggregate(Freq~Age+Survived,data = titanic,FUN = sum) ggplot(data = tempdata,mapping =aes(x = Age,y=Freq,fill=Survived))+geom_bar(position ='dodge',stat ='identity')

12.3.4转换成SparkDataFrame格式的数据

titanicDF <-createDataFrame(titanic[titanic$Freq >0, -5])
nbDF

1 2	titanicDF <-createDataFrame(titanic[titanic$Freq >0, -5]) nbDF

12.3.5模型概要

summary(nbModel)
## $apriori
##              1         0
## [1,] 0.5769231 0.4230769
## 
## $tables
##   Class_3   Class_2 Class_1 Sex_0 Age_1 
## 1 0.3125    0.3125  0.3125  0.5   0.5625
## 0 0.4166667 0.25    0.25    0.5   0.75

summary(nbModel)

## $apriori

## 1 0

## [1,] 0.5769231 0.4230769

## $tables

## Class_3 Class_2 Class_1 Sex_0 Age_1

## 1 0.3125 0.3125 0.3125 0.5 0.5625

## 0 0.4166667 0.25 0.25 0.5 0.75

12.3.6预测

nbPredictions <-predict(nbModel, nbTestDF)
showDF(nbPredictions)
## +-----+---+---+--------+--------------------+--------------------+----------+
## |Class|Age|Sex|Survived|       rawPrediction|         probability|prediction|
## +-----+---+---+--------+--------------------+--------------------+----------+
## |    3|  0|  0|       0|[-3.9824097993521...|[0.60062402496099...|         1|
## |    0|  1|  0|       0|[-2.9426380107070...|[0.50318824507901...|         1|
## |    1|  1|  0|       0|[-3.7310953710712...|[0.58003280993672...|         1|
## |    2|  1|  0|       0|[-3.7310953710712...|[0.58003280993672...|         1|
## |    3|  1|  0|       0|[-3.7310953710712...|[0.39192399049881...|         0|
## |    3|  0|  1|       0|[-3.9824097993521...|[0.60062402496099...|         1|
## |    0|  1|  1|       0|[-2.9426380107070...|[0.50318824507901...|         1|
## |    1|  1|  1|       0|[-3.7310953710712...|[0.58003280993672...|         1|
## |    2|  1|  1|       0|[-3.7310953710712...|[0.58003280993672...|         1|
## |    3|  1|  1|       0|[-3.7310953710712...|[0.39192399049881...|         0|
## |    1|  0|  0|       1|[-3.9824097993521...|[0.76318223866790...|         1|
## |    2|  0|  0|       1|[-3.9824097993521...|[0.76318223866790...|         1|
## |    3|  0|  0|       1|[-3.9824097993521...|[0.60062402496099...|         1|
## |    0|  1|  0|       1|[-2.9426380107070...|[0.50318824507901...|         1|
## |    1|  1|  0|       1|[-3.7310953710712...|[0.58003280993672...|         1|
## |    2|  1|  0|       1|[-3.7310953710712...|[0.58003280993672...|         1|
## |    3|  1|  0|       1|[-3.7310953710712...|[0.39192399049881...|         0|
## |    1|  0|  1|       1|[-3.9824097993521...|[0.76318223866790...|         1|
## |    2|  0|  1|       1|[-3.9824097993521...|[0.76318223866790...|         1|
## |    3|  0|  1|       1|[-3.9824097993521...|[0.60062402496099...|         1|
## +-----+---+---+--------+--------------------+--------------------+----------+
## only showing top 20 rows

nbPredictions <-predict(nbModel, nbTestDF)

showDF(nbPredictions)

## +-----+---+---+--------+--------------------+--------------------+----------+

## +-----+---+---+--------+--------------------+--------------------+----------+

## | 3| 0| 0| 0|[-3.9824097993521...|[0.60062402496099...| 1|

## | 0| 1| 0| 0|[-2.9426380107070...|[0.50318824507901...| 1|

## | 1| 1| 0| 0|[-3.7310953710712...|[0.58003280993672...| 1|

## | 2| 1| 0| 0|[-3.7310953710712...|[0.58003280993672...| 1|

## | 3| 1| 0| 0|[-3.7310953710712...|[0.39192399049881...| 0|

## | 3| 0| 1| 0|[-3.9824097993521...|[0.60062402496099...| 1|

## | 0| 1| 1| 0|[-2.9426380107070...|[0.50318824507901...| 1|

## | 1| 1| 1| 0|[-3.7310953710712...|[0.58003280993672...| 1|

## | 2| 1| 1| 0|[-3.7310953710712...|[0.58003280993672...| 1|

## | 3| 1| 1| 0|[-3.7310953710712...|[0.39192399049881...| 0|

## | 1| 0| 0| 1|[-3.9824097993521...|[0.76318223866790...| 1|

## | 2| 0| 0| 1|[-3.9824097993521...|[0.76318223866790...| 1|

## | 3| 0| 0| 1|[-3.9824097993521...|[0.60062402496099...| 1|

## | 0| 1| 0| 1|[-2.9426380107070...|[0.50318824507901...| 1|

## | 1| 1| 0| 1|[-3.7310953710712...|[0.58003280993672...| 1|

## | 2| 1| 0| 1|[-3.7310953710712...|[0.58003280993672...| 1|

## | 3| 1| 0| 1|[-3.7310953710712...|[0.39192399049881...| 0|

## | 1| 0| 1| 1|[-3.9824097993521...|[0.76318223866790...| 1|

## | 2| 0| 1| 1|[-3.9824097993521...|[0.76318223866790...| 1|

## | 3| 0| 1| 1|[-3.9824097993521...|[0.60062402496099...| 1|

## +-----+---+---+--------+--------------------+--------------------+----------+

## only showing top 20 rows

12.3.7评估模型

nbPredictions<-as.data.frame(nbPredictions)

# 计算混淆矩阵
ct<-table(titanic_temp$Survived,nbPredictions$prediction)
ct
##    
##      0  1
##   0  2  8
##   1  2 12

nbPredictions<-as.data.frame(nbPredictions)

# 计算混淆矩阵

ct<-table(titanic_temp$Survived,nbPredictions$prediction)

## 0 1

## 0 2 8

## 1 2 12

计算准确率

(ct[1,1]+ct[2,2])/sum(ct)
## [1] 0.5833333

1 2	(ct[1,1]+ct[2,2])/sum(ct) ## [1] 0.5833333

计算召回率

ct[2,2]/(ct[2,2]+ct[2,1])
## [1] 0.8571429

1 2	ct[2,2]/(ct[2,2]+ct[2,1]) ## [1] 0.8571429

计算精准率

ct[2,2]/(ct[2,2]+ct[1,2])
## [1] 0.6

1 2	ct[2,2]/(ct[2,2]+ct[1,2]) ## [1] 0.6

12.4 小结

本章主要介绍了如何使用Spark R组件的问题，Spark R 给R开发人员提供很多API,利用这些API，开发人员就可以通过R语言操作Spark，把用R编写的代码放在Spark这个大数据技术平台运行，这样可以使R不但可以操作HDFS或Hive中数据，也自然使用Spark分布式基于内存的架构。

本章数据集下载

第11章 PySpark 决策树模型

Spark不但好用、而且还易用、通用，它提供多种的开发语言的API，除了Scala外，还有Java、Python、R等，可以说集成目前市场最有代表性的开发语言，使得Spark受众上升几个数据量级，同时也无形中降低了学习和使用它的门槛，使得很多熟悉Java、Python、R的编程人员、数据分析师，也可方便地利用Spark大数据计算框架来实现他们的大数据处理、机器学习等任务。
Python作为机器学习中的利器，一直被很多开发者和学习者所推崇的一种语言。除了开源、易学以及简洁的代码风格的特性之外，Python当中还有很多优秀的第三方的库，为我们对数据进行处理、探索和模型的构建提供很大的便利，如Pandas、Numpy、Scipy、Matplotlib、StatsModels、Scikit-Learn、Keras等。Python的强大还体现在它的与时俱进，它与大数据计算平台Spark的结合，可为是强强联合、优势互补、相得益彰，这就有了现如今Spark当中一个重要分支--PySpark。其内部架构可参考图11-1（该图取自https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals?spm=5176.100239.0.0.eI85ij）。

图11-1 PySpark架构图

PySpark的Python解释器在启动时会同时启动一个JVM,Python解释器与JVM进程通过socket进行通信，在python driver端，SparkContext利用Py4J启动一个JVM并产生一个JavaSparkContext。Py4J只使用在driver端，用于本地python与Java SparkContext objects的通信。大量数据的传输使用的是另一个机制。RDD在python下的转换会被映射成java环境下PythonRDD。在远端worker机器上，PythonRDD对象启动一些子进程并通过pipes与这些子进程通信，以此send用户代码和数据。
本章节就机器学习中的决策树模型，使用PySpark中的ML库以及IPython交互式环境进行示例。具体内容如下：
 决策树简介
 数据加载
 数据探索
 创建决策树模型
 训练模型并进行预测
 利用交叉验证、网格参数等进行模型调优
 最后生成一个可执行python脚本

11.1 PySpark 简介

在Spark的官网上这么介绍PySpark：“PySpark is the Python API for Spark”，也就是说PySpark其实是Spark为Python提供的编程接口。此外，Spark还提供了关于Scala、Java和R的编程接口，关于Spark为R提供的编程接口（Spark R）将在第12章进行介绍。

11.2 决策树简介

决策树在机器学习中是很常见且经常使用的模型，它是一个强大的非概率模型，可以用来表达复杂的非线性模式和特征相互关系。

图11-2决策树结构

关于决策树的原理，这里不再赘述。本章着重讨的是，决策树的分类模型在PySpark中的应用。

11.3数据加载

11.3.1 原数据集初探

这里的数据选择为某比赛的数据集，用来预测推荐的一些页面是短暂（昙花一现）还是长久（长时流行）。原数据集为train.tsv，存放路径在 /home/hadoop/data/train.tsv。
先使用shell命令对数据进行试探性的查看，并做一些简单的数据处理。
1) 查看前2行数据

$ head -2 train.tsv

"url" "urlid" "boilerplate"	"alchemy_category"	"alchemy_category_score"	"avglinksize"	"commonlinkratio_1"	"commonlinkratio_2"	"commonlinkratio_3"	"commonlinkratio_4"	"compression_ratio"	"embed_ratio"	"framebased"	"frameTagRatio"	"hasDomainLink"	"html_ratio"	"image_ratio"	"is_news"	"lengthyLinkDomain"	"linkwordscore"	"news_front_page"   "non_markup_alphanum_characters"	"numberOfLinks"	"numwords_in_url"	"parametrizedLinkRatio"	"spelling_errors_ratio"	"label"
"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"	"4042" "{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose Cali	"8"	.............. "0.152941176" "0.079129575"	"0"

$ head -2 train.tsv

"url" "urlid" "boilerplate" "alchemy_category" "alchemy_category_score" "avglinksize" "commonlinkratio_1" "commonlinkratio_2" "commonlinkratio_3" "commonlinkratio_4" "compression_ratio" "embed_ratio" "framebased" "frameTagRatio" "hasDomainLink" "html_ratio" "image_ratio" "is_news" "lengthyLinkDomain" "linkwordscore" "news_front_page" "non_markup_alphanum_characters" "numberOfLinks" "numwords_in_url" "parametrizedLinkRatio" "spelling_errors_ratio" "label"

"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html" "4042" "{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose Cali "8" .............. "0.152941176" "0.079129575" "0"

数据集中的第1行为标题（字段名）行，下面是一些的字段说明。
查看文件记录总数

$ cat train.tsv |wc -l
7396

1 2	$ cat train.tsv \|wc -l 7396

结果显示共有：数据集一共有7396条数据
2) 由于textFile目前不好过滤标题行数据，为便于spark操作数据，需要先删除标题。

$ sed  1d train.tsv >train_noheader.tsv

1	$ sed 1d train.tsv >train_noheader.tsv

3) 将数据文件上传到 hdfs

$ hdfs dfs -put train_noheader.tsv /data

1	$ hdfs dfs -put train_noheader.tsv /data

4) 查看是否成功

hadoop@master:~/data$ hdfs dfs -ls /data
17/05/24 00:46:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 hadoop supergroup   21972457 2017-05-24 00:46 /data/train_noheader.tsv

hadoop@master:~/data$ hdfs dfs -ls /data

17/05/24 00:46:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Found 1 items

-rw-r--r-- 1 hadoop supergroup 21972457 2017-05-24 00:46 /data/train_noheader.tsv

11.3.2 PySpark 的启动

以spark Standalone模式启动spark集群，保证内存分配充足。

$ pyspark --master spark://master:7077 --driver-memory 1G --total-executor-cores 4

1	$ pyspark --master spark://master:7077 --driver-memory 1G --total-executor-cores 4

[注]：使用pyspark --help 可以查看指令的详细帮助信息。

# Default to standard python interpreter unless told otherwise
if [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
  PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"ipython"}"
fi

# Default to standard python interpreter unless told otherwise

if [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then

PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"ipython"}"

11.3.3 基本函数

这里将本章节中需要用到函数和方法做一个简单的说明，如表11-4所示。
表11-4 本章使用的一些函数或方法简介

11.4数据探索

1) 通过sc对象的textFile方法，载入本地数据文件，创建RDD

In [1]: raw_data = sc.textFile("hdfs://master:9000/data/train_noheader.tsv")

1	In [1]: raw_data = sc.textFile("hdfs://master:9000/data/train_noheader.tsv")

2) 查看第1行数据

In [2]: raw_data.take(2)
Out[2]:[u'"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"\t"4042"\t"{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose Cali..."\t...\t"8"\t"0.152941176"\t"0.079129575"\t"0"',
u'"http://www.popsci.com/technology/article/2012-07/electronic-futuristic-starting-gun-eliminates-advantages-races"\t"8471"\t"{""title"":""The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races the fully electronic, futuristic starting gun that eliminates advantages in races the fully electronic, futuristic starting gun that eliminates advantages in races"",""body"":""And that can be carried on a plane without the hassle too The Omega..."\t...\t"9"\t"0.181818182"\t"0.125448029"\t"1"']

In [2]: raw_data.take(2)

Out[2]:[u'"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"\t"4042"\t"{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose Cali..."\t...\t"8"\t"0.152941176"\t"0.079129575"\t"0"',

u'"http://www.popsci.com/technology/article/2012-07/electronic-futuristic-starting-gun-eliminates-advantages-races"\t"8471"\t"{""title"":""The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races the fully electronic, futuristic starting gun that eliminates advantages in races the fully electronic, futuristic starting gun that eliminates advantages in races"",""body"":""And that can be carried on a plane without the hassle too The Omega..."\t...\t"9"\t"0.181818182"\t"0.125448029"\t"1"']

3) 查看数据文件的总行数

In [3]: numRaws = raw_data.count()
In [4]: numRaws
Out[4]: 7395

In [3]: numRaws = raw_data.count()

In [4]: numRaws

Out[4]: 7395

4) 按键进行统计

In [5]: raw_data.countByKey()
Out[5]: defaultdict(int, {u'"': 7395})

1 2	In [5]: raw_data.countByKey() Out[5]: defaultdict(int, {u'"': 7395})

原数据文件总的行数为7396，由于我们在数据加载中将数据集的第一行数据已经去除掉，所以这里结果为7395。

11.5数据预处理

1) 由于后续的算法我们不需要时间戳以及网页的内容，所以这里先将其过滤掉。

In [6]: records = raw_data.map(lambda line: line.split('\t'))

1	In [6]: records = raw_data.map(lambda line: line.split('\t'))

2) 查看records 数据结构

In [7]: records.first()
Out[7]:
[u'"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"',
  u'"4042"',
  u'"{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar...""}"',
  u'"business"',
  u'"0.789131"',
  u'"2.055555556"',
  u'"0.676470588"',
  u'"0.205882353"',
  u'"0.047058824"',
  u'"0.023529412"',
  u'"0.443783175"',
  u'"0"',
  u'"0"',
  u'"0.09077381"',
  u'"0"',
  u'"0.245831182"',
  u'"0.003883495"',
  u'"1"',
  u'"1"',
  u'"24"',
  u'"0"',
  u'"5424"',
  u'"170"',
  u'"8"',
  u'"0.152941176"',
  u'"0.079129575"',
  u'"0"']

In [7]: records.first()

Out[7]:

[u'"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"',

u'"4042"',

u'"{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar...""}"',

u'"business"',

u'"0.789131"',

u'"2.055555556"',

u'"0.676470588"',

u'"0.205882353"',

u'"0.047058824"',

u'"0.023529412"',

u'"0.443783175"',

u'"0"',

u'"0.09077381"',

u'"0"',

u'"0.245831182"',

u'"0.003883495"',

u'"1"',

u'"24"',

u'"0"',

u'"5424"',

u'"170"',

u'"8"',

u'"0.152941176"',

u'"0.079129575"',

u'"0"']

3) 查看每一行的列数

In [8]: len(records.first())
Out[8]: 27

1 2	In [8]: len(records.first()) Out[8]: 27

导入Vectors 矢量方法

In [9]: from pyspark.ml.linalg import Vectors

1	In [9]: from pyspark.ml.linalg import Vectors

导入决策树分类器

In [10]: from pyspark.ml.classification import DecisionTreeClassifier

1	In [10]: from pyspark.ml.classification import DecisionTreeClassifier

4) 将RDD中的所有元素以列表的形式返回

In [11]: data = records.collect()

1	In [11]: data = records.collect()

5) 查看data数据一行有多少列

In [12]: numColumns = len(data[0])
In [13]: numColumns
Out[13]: 27

In [12]: numColumns = len(data[0])

In [13]: numColumns

Out[13]: 27

6) 定义一个列表data1，存放清理过的数据，格式为[(label_1, features_1), (label_2, features_2),…]

In [14]: data1 = []

1	In [14]: data1 = []

对数据进行清理工作中的1,2,3步

In [15]:
for i in range(numRaws):
    trimmed = [ each.replace('"', "") for each in data[i] ]
    label = int(trimmed[-1])
    features = map(lambda x: 0.0 if x == "?" else x, trimmed[4:numColumns-1])
c = (label, Vectors.dense(map(float, features)))
data1.append(c)

In [15]:

for i in range(numRaws):

trimmed = [ each.replace('"', "") for each in data[i] ]

label = int(trimmed[-1])

features = map(lambda x: 0.0 if x == "?" else x, trimmed[4:numColumns-1])

c = (label, Vectors.dense(map(float, features)))

data1.append(c)

11.6创建决策树模型

1) 将data1 转换为DataFrame对象，label表示标签列，features 表示特征值列

In [16]: df= spark.createDataFrame(data1, ["label","features"])
In [17]: df.show(10)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|    0|[0.789131,2.05555...|
|    1|[0.574147,3.67796...|
|    1|[0.996526,2.38288...|
|    1|[0.801248,1.54310...|
|    0|[0.719157,2.67647...|
|    0|[0.0,119.0,0.7454...|
|    1|[0.22111,0.773809...|
|    0|[0.0,1.883333333,...|
|    1|[0.0,0.471502591,...|
|    1|[0.0,2.41011236,0...|
+-----+--------------------+
only showing top 10 rows
# 显示 df 的Schema
In [18]: df.printSchema()
root
 |-- label: long (nullable = true)
 |-- features: vector (nullable = true)

In [16]: df= spark.createDataFrame(data1, ["label","features"])

In [17]: df.show(10)

+-----+--------------------+

|label| features|

+-----+--------------------+

| 0|[0.789131,2.05555...|

| 1|[0.574147,3.67796...|

| 1|[0.996526,2.38288...|

| 1|[0.801248,1.54310...|

| 0|[0.719157,2.67647...|

| 0|[0.0,119.0,0.7454...|

| 1|[0.22111,0.773809...|

| 0|[0.0,1.883333333,...|

| 1|[0.0,0.471502591,...|

| 1|[0.0,2.41011236,0...|

+-----+--------------------+

only showing top 10 rows

# 显示 df 的Schema

In [18]: df.printSchema()

root

|-- label: long (nullable = true)

|-- features: vector (nullable = true)

2) 由于后面会经常使用，所以将df载入内存

In [19]: df.cache()
Out[19]: DataFrame[label: double, features: vector]

1 2	In [19]: df.cache() Out[19]: DataFrame[label: double, features: vector]

3) 建立特征索引

In [20]: from pyspark.ml.feature import VectorIndexer
In [20]: featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(df)

1 2	In [20]: from pyspark.ml.feature import VectorIndexer In [20]: featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(df)

4) 将数据切分成80%训练集和20%测试集

#seed=1234L，表示每次随机生成的训练集和测试集的总行数不变
In [21]: (trainingData, testData) = df.randomSplit([0.8, 0.2],seed=1234L)

In [22]: trainingData.count()
Out[22]: 5912

In [23]: testData.count()
Out[23]: 1483

#seed=1234L，表示每次随机生成的训练集和测试集的总行数不变

In [21]: (trainingData, testData) = df.randomSplit([0.8, 0.2],seed=1234L)

In [22]: trainingData.count()

Out[22]: 5912

In [23]: testData.count()

Out[23]: 1483

5) 指定决策树模型的深度、标签列，特征值列，使用信息熵(entropy)作为评估方法，并训练数据。

In [24]: dt = DecisionTreeClassifier(maxDepth=5, labelCol="label", featuresCol="indexedFeatures", impurity="entropy")

1	In [24]: dt = DecisionTreeClassifier(maxDepth=5, labelCol="label", featuresCol="indexedFeatures", impurity="entropy")

6) 构建流水线工作流

In [25]: from pyspark.ml import Pipeline

In [26]: pipeline = Pipeline(stages=[featureIndexer, dt])

In [27]: model = pipeline.fit(trainingData)      ## 训练模型

In [25]: from pyspark.ml import Pipeline

In [26]: pipeline = Pipeline(stages=[featureIndexer, dt])

In [27]: model = pipeline.fit(trainingData) ## 训练模型

下面我们用一组已知数据和一组新数据重新预测下结果：

11.7训练模型进行预测

1) 使用第一行数据进行预测结果，看看是否相符合，这里先来看一下原数据集第一行数据

In [28]: data1[0]
Out[28]: 
(0.0,
 DenseVector([0.7891, 2.0556, 0.6765, 0.2059, 0.0471, 0.0235, 0.4438, 0.0, 0.0, 0.0908, 0.0, 0.2458, 0.0039, 1.0, 1.0, 24.0, 0.0, 5424.0, 170.0, 8.0, 0.1529, 0.0791]))

In [28]: data1[0]

Out[28]:

(0.0,

DenseVector([0.7891, 2.0556, 0.6765, 0.2059, 0.0471, 0.0235, 0.4438, 0.0, 0.0, 0.0908, 0.0, 0.2458, 0.0039, 1.0, 1.0, 24.0, 0.0, 5424.0, 170.0, 8.0, 0.1529, 0.0791]))

2) 使用数据集中第一行的特征值数据进行预测

In [29]: test0 = spark.createDataFrame([(data1[0][1],)], ["features"])
In [30]: result = model.transform(test0)
# 查看预测结果
In [31]: result.show()
+--------------------+--------------------+-------------+--------------------+----------+
|            features|     indexedFeatures|rawPrediction|         probability|prediction|
+--------------------+--------------------+-------------+--------------------+----------+
|[0.789131,2.05555...|[0.789131,2.05555...|[274.0,310.0]|[0.46917808219178...|       1.0|
+--------------------+--------------------+-------------+--------------------+----------+
In [32]: predictedResult.select(['prediction']).show() 	#只获取预测值
+----------+
|prediction|
+----------+
|       1.0|
+----------+

In [29]: test0 = spark.createDataFrame([(data1[0][1],)], ["features"])

In [30]: result = model.transform(test0)

# 查看预测结果

In [31]: result.show()

+--------------------+--------------------+-------------+--------------------+----------+

+--------------------+--------------------+-------------+--------------------+----------+

|[0.789131,2.05555...|[0.789131,2.05555...|[274.0,310.0]|[0.46917808219178...| 1.0|

+--------------------+--------------------+-------------+--------------------+----------+

In [32]: predictedResult.select(['prediction']).show() #只获取预测值

+----------+

|prediction|

+----------+

| 1.0|

+----------+

3) 将第一行的特征值数据修改掉2个（这里换掉第一个和第二个值），进行该特征值下的预测.

# 将第一行的数据进行修改
In [33]: firstRaw = list(data1[0][1])
In [34]: firstRaw[0] = 2.7891
In [35]: firstRaw[1] = 0.0556

In [36]: predictedData = Vectors.dense(firstRaw)
In [37]: predictedData
Out[37]: DenseVector([2.7891, 0.0556, 0.6765, 0.2059, 0.0471, 0.0235, 0.4438, 0.0, 0.0, 0.0908, 0.0, 0.2458, 0.0039, 1.0, 1.0, 24.0, 0.0, 5424.0, 170.0, 8.0, 0.1529, 0.0791])

# 将第一行的数据进行修改

In [33]: firstRaw = list(data1[0][1])

In [34]: firstRaw[0] = 2.7891

In [35]: firstRaw[1] = 0.0556

In [36]: predictedData = Vectors.dense(firstRaw)

In [37]: predictedData

Out[37]: DenseVector([2.7891, 0.0556, 0.6765, 0.2059, 0.0471, 0.0235, 0.4438, 0.0, 0.0, 0.0908, 0.0, 0.2458, 0.0039, 1.0, 1.0, 24.0, 0.0, 5424.0, 170.0, 8.0, 0.1529, 0.0791])

4) 进行新数据的预测

In [38]: predictedRaw = spark.createDataFrame([(predictedData,)], ["features"])
In [39]: predictedResult = model.transform(predictedRaw)
In [40]: predictedResult.show()
+--------------------+--------------------+-------------+--------------------+----------+
|            features|     indexedFeatures|rawPrediction|         probability|prediction|
+--------------------+--------------------+-------------+--------------------+----------+
|[2.7891,0.0556,0....|[2.7891,0.0556,0....|[274.0,310.0]|[0.46917808219178...|       1.0|
+--------------------+--------------------+-------------+--------------------+----------+
In [41]: predictedResult.select(['prediction']).show()
+----------+
|prediction|
+----------+
|       1.0|
+----------+

In [38]: predictedRaw = spark.createDataFrame([(predictedData,)], ["features"])

In [39]: predictedResult = model.transform(predictedRaw)

In [40]: predictedResult.show()

+--------------------+--------------------+-------------+--------------------+----------+

+--------------------+--------------------+-------------+--------------------+----------+

|[2.7891,0.0556,0....|[2.7891,0.0556,0....|[274.0,310.0]|[0.46917808219178...| 1.0|

+--------------------+--------------------+-------------+--------------------+----------+

In [41]: predictedResult.select(['prediction']).show()

+----------+

|prediction|

+----------+

| 1.0|

+----------+

5) 下面我们用测试数据做决策树准确度测试

# 通过模型，预测测试集
In [42]: predictedResultAll = model.transform(testData)

#查看预测值
In [43]: predictedResultAll.select("prediction").show()
+----------+
|prediction|
+----------+
|       0.0|
|       0.0|
|       1.0|
|       1.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       0.0|
|       1.0|
+----------+
only showing top 10 rows

#由于预测值是DataFrame对象，每一行是Raw型，不可做修改
#需将预测值转换为pandas，然后转换为列表
In [44]:df_prediction = predictedResultAll.select("prediction").toPandas()
In [45]: dtPredictions = list(df_prediction.prediction)

#查看前10个预测值
In [46]: dtPredictions[:10]
Out[46]: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]

#对预测值做准确性统计
In [47]: dtTotalCorrect = 0
#获取测试集的总行数
In [48]: testRaw = testData.count()
In [49]: testLabel = testData.select("label").collect()
In [50]: 
for i in range(testRaw):
    if dtPredictions[i] == testLabel[i]:
        dtTotalCorrect += 1

In [51]: dtTotalCorrect
Out[51]: 940

In [52]: 1.0 * dtTotalCorrect / testRaw
Out[52]: 0.6338503034389751

# 通过模型，预测测试集

In [42]: predictedResultAll = model.transform(testData)

#查看预测值

In [43]: predictedResultAll.select("prediction").show()

+----------+

|prediction|

+----------+

| 0.0|

| 1.0|

| 0.0|

| 1.0|

+----------+

only showing top 10 rows

#由于预测值是DataFrame对象，每一行是Raw型，不可做修改

#需将预测值转换为pandas，然后转换为列表

In [44]:df_prediction = predictedResultAll.select("prediction").toPandas()

In [45]: dtPredictions = list(df_prediction.prediction)

#查看前10个预测值

In [46]: dtPredictions[:10]

Out[46]: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]

#对预测值做准确性统计

In [47]: dtTotalCorrect = 0

#获取测试集的总行数

In [48]: testRaw = testData.count()

In [49]: testLabel = testData.select("label").collect()

In [50]:

for i in range(testRaw):

if dtPredictions[i] == testLabel[i]:

dtTotalCorrect += 1

In [51]: dtTotalCorrect

Out[51]: 940

In [52]: 1.0 * dtTotalCorrect / testRaw

Out[52]: 0.6338503034389751

11.8模型优化
在上一个小节中，我们发现使用决策树的正确率不算高，只有63.3850%。在这一小节，我们探究一下改进预测准确率的方法。

11.8.1特征值的优化

1) 先将之前用到的一些代码加载进来。

In [1]: from pyspark.ml.linalg import Vectors
In [2]: from pyspark.ml.classification import DecisionTreeClassifier
In [3]: from pyspark.ml.feature import VectorIndexer
In [4]: from pyspark.ml import Pipeline
In [5]: raw_data = sc.textFile("hdfs://master:9000/data/train_noheader.tsv")
In [6]: numRaws = raw_data.count()
In [7]: records = raw_data.map(lambda line: line.split('\t'))
In [8]: data = records.collect()
In [9]: numColumns = len(data[0])
In [10]: data1 = []

In [1]: from pyspark.ml.linalg import Vectors

In [2]: from pyspark.ml.classification import DecisionTreeClassifier

In [3]: from pyspark.ml.feature import VectorIndexer

In [4]: from pyspark.ml import Pipeline

In [5]: raw_data = sc.textFile("hdfs://master:9000/data/train_noheader.tsv")

In [6]: numRaws = raw_data.count()

In [7]: records = raw_data.map(lambda line: line.split('\t'))

In [8]: data = records.collect()

In [9]: numColumns = len(data[0])

In [10]: data1 = []

2) 由于这里对网页类型的标识有很多，需要单独挑选出来进行处理。

#将第三列网页类型的引号去除掉
In [11]: category = records.map(lambda x: x[3].replace("\"",""))

1 2	#将第三列网页类型的引号去除掉 In [11]: category = records.map(lambda x: x[3].replace("\"",""))

将网页的唯一类型删选出来，并进行排序。

In [12]: categories = sorted(category.distinct().collect())
In [13]: categories
Out[13]: 
[u'?',
 u'arts_entertainment',
 u'business',
 u'computer_internet',
 u'culture_politics',
 u'gaming',
 u'health',
 u'law_crime',
 u'recreation',
 u'religion',
 u'science_technology',
 u'sports',
 u'unknown',
 u'weather']

In [12]: categories = sorted(category.distinct().collect())

In [13]: categories

Out[13]:

[u'?',

u'arts_entertainment',

u'business',

u'computer_internet',

u'culture_politics',

u'gaming',

u'health',

u'law_crime',

u'recreation',

u'religion',

u'science_technology',

u'sports',

u'unknown',

u'weather']

3) 查看网页类型的个数。

In [14]: numCategories = len(categories)
In [15]: numCategories
Out[15]: 14

In [14]: numCategories = len(categories)

In [15]: numCategories

Out[15]: 14

4) 紧接着，我们定义一个函数，用于返回当前网页类型的列表。

In [16]: 
def transform_category(x):
    markCategory = [0] * numCategories
    index = categories.index(x)
markCategory[index] = 1
return markCategory

In [16]:

def transform_category(x):

markCategory = [0] * numCategories

index = categories.index(x)

markCategory[index] = 1

return markCategory

5) 通过这样的处理，我们将网页类型这一个特征值转化14个特征值，整体的特征值其实就增加了14个。接下来，我们在处理的时候将这个些特征值加入进去。

In [17]: 
for i in range(numRaws):
    trimmed = [ each.replace('"', "") for each in data[i] ]
    label = float(trimmed[-1])
    cate = transform_category(trimmed[3]) #调用函数，返回一个类型列表
features = cate + map(lambda x: 0.0 if x == "?" else (x), trimmed[4:numColumns - 1])
c = (label, Vectors.dense(map(float, features)))
data1.append(c)

In [17]:

for i in range(numRaws):

trimmed = [ each.replace('"', "") for each in data[i] ]

label = float(trimmed[-1])

cate = transform_category(trimmed[3]) #调用函数，返回一个类型列表

features = cate + map(lambda x: 0.0 if x == "?" else (x), trimmed[4:numColumns - 1])

c = (label, Vectors.dense(map(float, features)))

data1.append(c)

6) 创建DataFrame对象。

In [18]: df= spark.createDataFrame(data1, ["label","features"])

#由于后面经常使用df，所以载入内存
In [19]: df.cache()
Out[20]: DataFrame[label: double, features: vector]

In [18]: df= spark.createDataFrame(data1, ["label","features"])

#由于后面经常使用df，所以载入内存

In [19]: df.cache()

Out[20]: DataFrame[label: double, features: vector]

7) 建立特征索引。

In [21]: from pyspark.ml.feature import VectorIndexer
In [22]: featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(df)

1 2	In [21]: from pyspark.ml.feature import VectorIndexer In [22]: featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(df)

8) 将数据切分成80%训练集和20%测试集。

In [23]: (trainingData, testData) = df.randomSplit([0.8, 0.2],seed=1234L)

In [24]: trainingData.count()
Out[24]: 5912

In [25]: testData.count()
Out[25]: 1483

In [23]: (trainingData, testData) = df.randomSplit([0.8, 0.2],seed=1234L)

In [24]: trainingData.count()

Out[24]: 5912

In [25]: testData.count()

Out[25]: 1483

9) 创建决策树模型。

In [26]: dt = DecisionTreeClassifier(maxDepth=5, labelCol="label", featuresCol="indexedFeatures", impurity="entropy")

1	In [26]: dt = DecisionTreeClassifier(maxDepth=5, labelCol="label", featuresCol="indexedFeatures", impurity="entropy")

10) 构建流水线工作流。

In [27]: pipeline = Pipeline(stages=[featureIndexer, dt])
In [28]: model = pipeline.fit(trainingData)      ## 训练模型
11)	用测试数据再一次做下决策树准确度测试。
In [29]: predictedResultAll = model.transform(testData)
In [30]:df_prediction = predictedResultAll.select("prediction").toPandas()
In [31]: dtPredictions = list(df_prediction.prediction)

#对预测值做准确性统计
In [32]: dtTotalCorrect = 0
#测试集的总行数
In [33]: testRaw = testData.count()
In [49]: testLabel = testData.select("label").collect()
In [34]: 
for i in range(testRaw):
    if dtPredictions[i] == testLabel[i]:
        dtTotalCorrect += 1

In [35]: dtTotalCorrect
Out[35]: 967

In [36]: 1.0 * dtTotalCorrect / testRaw
Out[36]: 0.6520566419420094

In [27]: pipeline = Pipeline(stages=[featureIndexer, dt])

In [28]: model = pipeline.fit(trainingData) ## 训练模型

11) 用测试数据再一次做下决策树准确度测试。

In [29]: predictedResultAll = model.transform(testData)

In [30]:df_prediction = predictedResultAll.select("prediction").toPandas()

In [31]: dtPredictions = list(df_prediction.prediction)

#对预测值做准确性统计

In [32]: dtTotalCorrect = 0

#测试集的总行数

In [33]: testRaw = testData.count()

In [49]: testLabel = testData.select("label").collect()

In [34]:

for i in range(testRaw):

if dtPredictions[i] == testLabel[i]:

dtTotalCorrect += 1

In [35]: dtTotalCorrect

Out[35]: 967

In [36]: 1.0 * dtTotalCorrect / testRaw

Out[36]: 0.6520566419420094

可以看到，准确率增大到了63.3850%，而未做优化前的准确率是65.2057%。增长了1.88%。效果还是比较显著的。

11.8.2交叉验证和网格参数

In [1]: from pyspark.ml.linalg import Vectors
In [2]: from pyspark.ml.classification import DecisionTreeClassifier
In [3]: from pyspark.ml.feature import VectorIndexer
In [4]: from pyspark.ml import Pipeline
In [5]: raw_data = sc.textFile("hdfs://master:9000/data/train_noheader.tsv")
In [6]: numRaws = raw_data.count()
In [7]: records = raw_data.map(lambda line: line.split('\t'))
In [8]: data = records.collect()
In [9]: numColumns = len(data[0])
In [10]: data1 = []
In [11]: category = records.map(lambda x: x[3].replace("\"",""))
In [12]: categories = sorted(category.distinct().collect())
In [13]: numCategories = len(categories)
In [14]: 
def transform_category(x):
    markCategory = [0] * numCategories
    index = categories.index(x)
markCategory[index] = 1
return markCategory
In [15]: 
for i in range(numRaws):
    trimmed = [ each.replace('"', "") for each in data[i] ]
    label = float(trimmed[-1])
    cate = transform_category(trimmed[3]) #调用函数，返回一个类型列表
features = cate + map(lambda x: 0.0 if x == "?" else (x), trimmed[4:numColumns - 1])
c = (label, Vectors.dense(map(float, features)))
data1.append(c)
In [16]: df= spark.createDataFrame(data1, ["label","features"])
In [17]: df.cache()
In [18]: featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(df)
In [19]: dt = DecisionTreeClassifier(maxDepth=5, labelCol="label", featuresCol="indexedFeatures", impurity="entropy")
In [20]: pipeline = Pipeline(stages=[featureIndexer, dt])
In [21]: (trainingData, testData) = df.randomSplit([0.8, 0.2],seed=1234L)

In [1]: from pyspark.ml.linalg import Vectors

In [2]: from pyspark.ml.classification import DecisionTreeClassifier

In [3]: from pyspark.ml.feature import VectorIndexer

In [4]: from pyspark.ml import Pipeline

In [5]: raw_data = sc.textFile("hdfs://master:9000/data/train_noheader.tsv")

In [6]: numRaws = raw_data.count()

In [7]: records = raw_data.map(lambda line: line.split('\t'))

In [8]: data = records.collect()

In [9]: numColumns = len(data[0])

In [10]: data1 = []

In [11]: category = records.map(lambda x: x[3].replace("\"",""))

In [12]: categories = sorted(category.distinct().collect())

In [13]: numCategories = len(categories)

In [14]:

def transform_category(x):

markCategory = [0] * numCategories

index = categories.index(x)

markCategory[index] = 1

return markCategory

In [15]:

for i in range(numRaws):

trimmed = [ each.replace('"', "") for each in data[i] ]

label = float(trimmed[-1])

cate = transform_category(trimmed[3]) #调用函数，返回一个类型列表

features = cate + map(lambda x: 0.0 if x == "?" else (x), trimmed[4:numColumns - 1])

c = (label, Vectors.dense(map(float, features)))

data1.append(c)

In [16]: df= spark.createDataFrame(data1, ["label","features"])

In [17]: df.cache()

In [18]: featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(df)

In [19]: dt = DecisionTreeClassifier(maxDepth=5, labelCol="label", featuresCol="indexedFeatures", impurity="entropy")

In [20]: pipeline = Pipeline(stages=[featureIndexer, dt])

In [21]: (trainingData, testData) = df.randomSplit([0.8, 0.2],seed=1234L)

创建交叉验证和网格参数

# 导入交叉验证和参数网格
In [22]: from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
#导入二分类评估器
In [23]: from pyspark.ml.evaluation import BinaryClassificationEvaluator
In [24]: evaluator = BinaryClassificationEvaluator()  # 初始化一个评估器
#设置参数网格
In [25]: paramGrid = ParamGridBuilder().addGrid(dt.maxDepth, [4,5,6]).build()
#设置交叉验证的参数
In [26]: cv = CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)

# 导入交叉验证和参数网格

In [22]: from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

#导入二分类评估器

In [23]: from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [24]: evaluator = BinaryClassificationEvaluator() # 初始化一个评估器

#设置参数网格

In [25]: paramGrid = ParamGridBuilder().addGrid(dt.maxDepth, [4,5,6]).build()

#设置交叉验证的参数

In [26]: cv = CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)

通过交叉验证来训练模型

In [27]: cvModel = cv.fit(trainingData)

1	In [27]: cvModel = cv.fit(trainingData)

测试模型

In [28]: Predictions=cvModel.transform(testData)

1	In [28]: Predictions=cvModel.transform(testData)

准确率统计

In [29]: df_prediction = Predictions.select("prediction").toPandas()
In [30]: dtPredictions = list(df_prediction.prediction)

#对预测值做准确性统计
In [31]: dtTotalCorrect = 0
#测试集的总行数
In [32]: testRaw = testData.count()
In [34]: testLabel = testData.select("label").collect()
In [33]: 
for i in range(testRaw):
    if dtPredictions[i] == testLabel[i]:
        dtTotalCorrect += 1

In [34]: dtTotalCorrect
Out[34]: 960

In [35]: 1.0 * dtTotalCorrect / testRaw
Out[35]: 0.6473364801078895
我们还可以查看最匹配模型的具体参数
In [36]: bestmodel = cvModel.bestModel.stages[1]

In [37]: bestmodel.numFeatures 	#决策树有36个特征值
Out[37]: 36

In [38]: bestmodel.depth  #最大深度为10
Out[38]: 6

In [39]: bestmodel.numNodes  #决策树中点有457个

In [29]: df_prediction = Predictions.select("prediction").toPandas()

In [30]: dtPredictions = list(df_prediction.prediction)

#对预测值做准确性统计

In [31]: dtTotalCorrect = 0

#测试集的总行数

In [32]: testRaw = testData.count()

In [34]: testLabel = testData.select("label").collect()

In [33]:

for i in range(testRaw):

if dtPredictions[i] == testLabel[i]:

dtTotalCorrect += 1

In [34]: dtTotalCorrect

Out[34]: 960

In [35]: 1.0 * dtTotalCorrect / testRaw

Out[35]: 0.6473364801078895

我们还可以查看最匹配模型的具体参数

In [36]: bestmodel = cvModel.bestModel.stages[1]

In [37]: bestmodel.numFeatures #决策树有36个特征值

Out[37]: 36

In [38]: bestmodel.depth #最大深度为10

Out[38]: 6

In [39]: bestmodel.numNodes #决策树中点有457个

11.9脚本方式运行

11.9.1 在脚本中添加配置信息

创建一个decisionTree.py文件，添加如下代码来配置启动pyspark。将上述在pyspark的IPython中的代码添加到该文件中来。
本文的示例程序存为 /home/hadoop/projects/spark/pyspark/decisionTree.py。

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

#指定本地运行spark
conf = SparkConf().setMaster("local[*]")
sc = SparkContext(conf=conf)
spark = SparkSession.builder.master('local') \
                .appName('DecisionTree') \
                .config("spark.some.config.option", "some-value") \
                .getOrCreate()

from pyspark import SparkConf, SparkContext

from pyspark.sql import SparkSession

#指定本地运行spark

conf = SparkConf().setMaster("local[*]")

sc = SparkContext(conf=conf)

spark = SparkSession.builder.master('local') \

.appName('DecisionTree') \

.config("spark.some.config.option", "some-value") \

.getOrCreate()

11.9.2运行脚本程序

spark 2.0 之前使用pyspark decisionTree.py 来执行文件，spark 2.0之后统一用spark-submit decisionTree.py 执行文件。读者可以使用spark-submit --help来查看相关命令的帮助信息。

$ spark-submit decisionTree.py

1	$ spark-submit decisionTree.py

11.10小结

本章我们了Spark中的 PySpark使用方法，对于PySpark做了简单的介绍。讨论了分类模型中最常见的决策树模型在PySpark 的应用，用实例讲解了如何对数据进行清理、转换，分析了分类模型的准确性有待提高的的原因，通过可视化对决策树的不同深度下的准确度进行讨论。

第1章：Keras基础

1.1Keras简介

Tensorflow、theano是神经网络、机器学习的基础框架，但使用它们大家神经网络，尤其深度学习网络，像tensorflow或theano属于符号编程，需要涉及如何定义变量、图形、各层、session、初始化、各种算法等等，有时显得比较繁琐，尤其对新手而言，更是如此，是否有更简单的方法呢？keras就是一个很好工具！Keras是一个高层神经网络API，Keras由纯Python编写而成并基Tensorflow、Theano以及CNTK后端。Keras 为支持快速实验而生，能够把你的想法迅速转换为结果。
其主要特点：

简便的原型设计
支持CNN和RNN，或二者的结合
在CPU和GPU间无缝切换

keras资料：
keras官网：
https://keras.io/
keras中文网
https://keras-cn.readthedocs.io/en/latest/

1.2keras安装

1）安装Python3.6
建议用anaconda安装，先下载最新版anaconda 支持linux或windows
2）安装numpy、scipy

conda install numpy  scipy

1	conda install numpy scipy

3）安装theano

conda install theano

1	conda install theano

4）安装tensorflow

pip  install tensorflow     （cpu）
pip  install tensorfow-gpu    (gpu)

1 2	pip install tensorflow （cpu） pip install tensorfow-gpu (gpu)

(gpu 需要有GPU卡，并安装GPU驱动及cuda等)
5）安装keras

pip  install keras

1	pip install keras

6）测试
【说明】变更keras后台支持的几种方法

（1）修改~/.keras/keras.json

hadoop@master:~/.keras$ cat keras.json 
{
    "epsilon": 1e-07, 
    "floatx": "float32", 
    "image_data_format": "channels_last", 
    "backend": "tensorflow"
}

hadoop@master:~/.keras$ cat keras.json

{

"epsilon": 1e-07,

"floatx": "float32",

"image_data_format": "channels_last",

"backend": "tensorflow"

}

(2)在客户端直接修改
改为theano为支持后台

import os
os.environ['KERAS_BACKEND']='theano'

1 2	import os os.environ['KERAS_BACKEND']='theano'

改为tensorflow为支持后台（缺省）

import os
os.environ['KERAS_BACKEND']='tensorflow'

1 2	import os os.environ['KERAS_BACKEND']='tensorflow'

在客户端修改，影响范围是当前脚本或session。

1.3 keras常用概念

François Chollet作为人工智能时代的先行者，为无数的开发者提供了开源深度学习框架Keras，目前就职于Google公司，主推tf.keras。
在开始学习Keras之前，我们希望传递一些关于Keras，关于深度学习的基本概念和技术，我们建议新手在使用Keras之前浏览一下本页面提到的内容，这将减少你学习中的困惑。

符号计算

Keras的底层库使用Theano或TensorFlow，这两个库也称为Keras的后端。无论是Theano还是TensorFlow，都是一个“符号式”的库。
因此，这也使得Keras的编程与传统的Python代码有所差别。笼统的说，符号主义的计算首先定义各种变量，然后建立一个“计算图”，计算图规定了各个变量之间的计算关系。建立好的计算图需要编译以确定其内部细节，然而，此时的计算图还是一个“空壳子”，里面没有任何实际的数据，只有当你把需要运算的输入放进去后，才能在整个模型中形成数据流，从而形成输出值。
就像用管道搭建供水系统，当你在拼水管的时候，里面是没有水的。只有所有的管子都接完了，才能送水。
符号计算也叫数据流图，如下图是一个经典的数据流计算可视化图形。
saddle_point_evaluation_optimizers

张量

张量，或tensor，可以看作是向量、矩阵的自然推广，用来表示广泛的数据类型。张量的阶数也叫维度。
0阶张量,即标量,是一个数。
1阶张量,即向量,一组有序排列的数
2阶张量,即矩阵,一组向量有序的排列起来
3阶张量，即立方体，一组矩阵上下排列起来
4阶张量......
依次类推
重点：关于维度的理解
假如有一个10长度的列表，那么我们横向看有10个数字，也可以叫做10维度，纵向看只能看到1个数字，那么就叫1维度。注意这个区别有助于理解Keras或者神经网络中计算时出现的维度问题
张量的阶数有时候也称为维度，或者轴，轴这个词翻译自英文axis。譬如一个矩阵[[1,2],[3,4]]，是一个2阶张量，有两个维度或轴，沿着第0个轴（为了与python的计数方式一致，本文档维度和轴从0算起）你看到的是[1,2]，[3,4]两个向量，沿着第1个轴你看到的是[1,3]，[2,4]两个向量。
 数据格式(data_format)
目前主要有两种方式来表示张量：
a) th模式或channels_first模式，Theano和caffe使用此模式。
b）tf模式或channels_last模式，TensorFlow使用此模式。
模式的修改，可以通修改配置文件~/.keras/keras.json中的image_data_format。

下面举例说明两种模式的区别：
对于100张RGB3通道的16×32（高为16宽为32）彩色图，
th表示方式：（100,3,16,32）
tf表示方式：（100,16,32,3）
唯一的区别就是表示通道个数3的位置不一样。

模型

Keras有两种类型的模型，序贯（或序列）模型（Sequential）和函数式模型（Model），函数式模型应用更为广泛，序贯模型是函数式模型的一种特殊情况。
a）序贯模型（Sequential):单输入单输出，一条路通到底，层与层之间只有相邻关系，没有跨层连接。这种模型编译速度快，操作也比较简单
b）函数式模型（Model）：多输入多输出，层与层之间任意连接。这种模型编译速度慢。

batch

这个概念与Keras无关，老实讲不应该出现在这里的，但是因为它频繁出现，而且不了解这个技术的话看函数说明会很头痛，这里还是简单说一下。
深度学习的优化算法，说白了就是梯度下降。每次的参数更新有两种方式。
第一种，遍历全部数据集算一次损失函数，然后算函数对各个参数的梯度，更新梯度。这种方法每更新一次参数都要把数据集里的所有样本都看一遍，计算量开销大，计算速度慢，不支持在线学习，这称为Batch gradient descent，批梯度下降。
另一种，每看一个数据就算一下损失函数，然后求梯度更新参数，这个称为随机梯度下降，stochastic gradient descent。这个方法速度比较快，但是收敛性能不太好，可能在最优点附近晃来晃去，hit不到最优点。两次参数的更新也有可能互相抵消掉，造成目标函数震荡的比较剧烈。
为了克服两种方法的缺点，现在一般采用的是一种折中手段，mini-batch gradient decent，小批的梯度下降，这种方法把数据分为若干个批，按批来更新参数，这样，一个批中的一组数据共同决定了本次梯度的方向，下降起来就不容易跑偏，减少了随机性。另一方面因为批的样本数与整个数据集相比小了很多，计算量也不是很大。
基本上现在的梯度下降都是基于mini-batch的，所以Keras的模块中经常会出现batch_size，就是指这个。

epochs

epochs指的就是训练过程中数据将被“轮”多少次。

1.4 keras与Tensorflow

1.5 keras的主要模块

【说明】
这里选择了一些常用模块，更多或更详细的说明请参考keras中文网站：
https://keras-cn.readthedocs.io/en/latest/
我们先从总体上了解一下Keras的主要模块及常用层，可参考下图，然后我们对各模块和常用层展开详细说明。

该图取自：http://blog.csdn.net/zdy0_2004/article/details/74736656

1.5.1优化器（optimizers）

优化器是调整每个节点权重的方法，看一个代码示例：

model = Sequential() 
model.add(Dense(64, init='uniform', input_dim=10)) model.add(Activation('tanh')) 
model.add(Activation('softmax')) 
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True) model.compile(loss='mean_squared_error', optimizer=sgd)

model = Sequential()

model.add(Dense(64, init='uniform', input_dim=10)) model.add(Activation('tanh'))

model.add(Activation('softmax'))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True) model.compile(loss='mean_squared_error', optimizer=sgd)

可以看到优化器在模型编译前定义，作为编译时的两个参数之一。
代码中的sgd是随机梯度下降算法
lr表示学习速率
momentum表示动量项
decay是学习速率的衰减系数(每个epoch衰减一次)
Nesterov的值是False或者True，表示使不使用Nesterov momentum
除了sgd，还可以选择的优化器有RMSprop（适合递归神经网络）、Adagrad、Adadelta、Adam、Adamax、Nadam等。

1.5.2目标函数（objectives）

目标函数又称损失函数（loss），目的是计算神经网络的输出与样本标记的差的一种方法，代码示例：

model = Sequential() 
model.add(Dense(64, init='uniform', input_dim=10)) model.add(Activation('tanh')) 
model.add(Activation('softmax')) 
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True) model.compile(loss='mean_squared_error', optimizer=sgd)

model = Sequential()

model.add(Dense(64, init='uniform', input_dim=10)) model.add(Activation('tanh'))

model.add(Activation('softmax'))

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True) model.compile(loss='mean_squared_error', optimizer=sgd)

mean_squared_error就是损失函数的名称。
可以选择的损失函数有：
mean_squared_error，mean_absolute_error，squared_hinge，hinge，binary_crossentropy，categorical_crossentropy
其中binary_crossentropy 和 categorical_crossentropy也就是交叉熵为logloss，一般用于分类模型。

1.5.3激活函数（activations）

每一个神经网络层都需要一个激活函数，代码示例：

from keras.layers.core import Activation, Dense

model.add(Dense(64))
model.add(Activation('tanh'))

#或把上面两行合并为：
model.add(Dense(64, activation='tanh'))

from keras.layers.core import Activation, Dense

model.add(Dense(64))

model.add(Activation('tanh'))

#或把上面两行合并为：

model.add(Dense(64, activation='tanh'))

可以选择的激活函数有：
linear、sigmoid、hard_sigmoid、tanh、softplus、relu、 softplus，softmax、softsign
还有一些高级激活函数，比如如PReLU，LeakyReLU等。

1.5.4 参数初始化（Initializations）

这个模块的作用是在添加layer时调用init进行这一层的权重初始化，有两种初始化方法

1.5.4.1 通过制定初始化方法的名称

示例代码：

model.add(Dense(64, init='uniform'))

1	model.add(Dense(64, init='uniform'))

可以选择的初始化方法有：
uniform、lecun_uniform、normal、orthogonal、zero、glorot_normal、he_normal等。

1.5.4.2 通过调用对象

该对象必须包含两个参数:shape(待初始化的变量的shape)和name(该变量的名字),该可调用对象必须返回一个(Keras)变量,例如K.variable()返回的就是这种变量，示例代码：

from keras import backend as K
import numpy as np

def my_init(shape, name=None):
    value = np.random.random(shape)
    return K.variable(value, name=name)
model.add(Dense(64, init=my_init))

from keras import backend as K

import numpy as np

def my_init(shape, name=None):

value = np.random.random(shape)

return K.variable(value, name=name)

model.add(Dense(64, init=my_init))

或者

from keras import initializations
def my_init(shape, name=None):
    return initializations.normal(shape, scale=0.01, name=name)
model.add(Dense(64, init=my_init))

from keras import initializations

def my_init(shape, name=None):

return initializations.normal(shape, scale=0.01, name=name)

model.add(Dense(64, init=my_init))

所以说可以通过库中的方法设定每一层的初始化权重，
也可以自己初始化权重，自己设定的话可以精确到每个节点的权重。

1.5.5 常用层（layer）

keras的层主要包括：
常用层（Core）、卷积层（Convolutional）、池化层（Pooling）、局部连接层、递归层（Recurrent）、嵌入层（ Embedding）、高级激活层、规范层、噪声层、包装层，当然也可以编写自己的层

1.5.5.1 Dense层(全连接层）

keras.layers.core.Dense(units, activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
Dense就是常用的全连接层，所实现的运算是output = activation(dot(input, kernel)+bias)。其中activation是逐元素计算的激活函数，kernel是本层的权值矩阵，bias为偏置向量，只有当use_bias=True才会添加。如果本层的输入数据的维度大于2，则会先被压为与kernel相匹配的大小。
参数：
units：大于0的整数，代表该层的输出维度。
activation：激活函数，为预定义的激活函数名（参考激活函数），或逐元素（element-wise）的Theano函数。如果不指定该参数，将不会使用任何激活函数（即使用线性激活函数：a(x)=x）
use_bias: 布尔值，是否使用偏置项
kernel_initializer：权值初始化方法，为预定义初始化方法名的字符串，或用于初始化权重的初始化器。参考initializers
bias_initializer：偏置向量初始化方法，为预定义初始化方法名的字符串，或用于初始化偏置向量的初始化器。参考initializers
kernel_regularizer：施加在权重上的正则项，为Regularizer对象
bias_regularizer：施加在偏置向量上的正则项，为Regularizer对象
activity_regularizer：施加在输出上的正则项，为Regularizer对象
kernel_constraints：施加在权重上的约束项，为Constraints对象
bias_constraints：施加在偏置上的约束项，为Constraints对象
输入
形如(batch_size, ..., input_dim)的nD张量，最常见的情况为(batch_size, input_dim)的2D张量。
输出
形如(batch_size, ..., units)的nD张量，最常见的情况为(batch_size, units)的2D张量。
示例

# as first layer in a sequential model:
model = Sequential()
model.add(Dense(32, input_shape=(16,)))
# model.add(Dense(32, input_dim=16))
# now the model will take as input arrays of shape (*, 16)
# and output arrays of shape (*, 32)

# after the first layer, you don't need to specify the size of the input anymore
model.add(Dense(32))

# as first layer in a sequential model:

model = Sequential()

model.add(Dense(32, input_shape=(16,)))

# model.add(Dense(32, input_dim=16))

# now the model will take as input arrays of shape (*, 16)

# and output arrays of shape (*, 32)

# after the first layer, you don't need to specify the size of the input anymore

model.add(Dense(32))

1.5.5.2 Flatten层

Flatten层用来将输入“压平”，即把多维的输入一维化，常用在从卷积层到全连接层的过渡。Flatten不影响batch的大小。
keras.layers.core.Flatten()
示例

model = Sequential()
model.add(Convolution2D(64, 3, 3,
            border_mode='same',
            input_shape=(3, 32, 32)))
# now: model.output_shape == (None, 64, 32, 32)

model.add(Flatten())
# now: model.output_shape == (None, 65536)

model = Sequential()

model.add(Convolution2D(64, 3, 3,

border_mode='same',

input_shape=(3, 32, 32)))

# now: model.output_shape == (None, 64, 32, 32)

model.add(Flatten())

# now: model.output_shape == (None, 65536)

1.5.5.3 dropout层

为输入数据施加Dropout。Dropout将在训练过程中每次更新参数时随机断开一定百分比（p）的输入神经元连接，Dropout层用于防止过拟合。
keras.layers.core.Dropout(p)

1.5.5.4 卷积层（Convolutional）

1.5.5.4.1 Conv1D层

keras.layers.convolutional.Conv1D(filters, kernel_size, strides=1, padding='valid', dilation_rate=1, activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
一维卷积层（即时域卷积），用以在一维输入信号上进行邻域滤波。当使用该层作为首层时，需要提供关键字参数input_shape。例如(10,128)代表一个长为10的序列，序列中每个信号为128向量。而(None, 128)代表变长的128维向量序列。
该层生成将输入信号与卷积核按照单一的空域（或时域）方向进行卷积。如果use_bias=True，则还会加上一个偏置项，若activation不为None，则输出为经过激活函数的输出。
参数
filters：卷积核的数目（即输出的维度）
kernel_size：整数或由单个整数构成的list/tuple，卷积核的空域或时域窗长度
strides：整数或由单个整数构成的list/tuple，为卷积的步长。任何不为1的strides均与任何不为1的dilation_rate均不兼容
padding：补0策略，为“valid”, “same” 或“causal”，“causal”将产生因果（膨胀的）卷积，即output[t]不依赖于input[t+1：]。当对不能违反时间顺序的时序信号建模时有用。参考WaveNet: A Generative Model for Raw Audio, section 2.1.。“valid”代表只进行有效的卷积，即对边界数据不处理。“same”代表保留边界处的卷积结果，通常会导致输出shape与输入shape相同。
activation：激活函数，为预定义的激活函数名（参考激活函数），或逐元素（element-wise）的Theano函数。如果不指定该参数，将不会使用任何激活函数（即使用线性激活函数：a(x)=x）
dilation_rate：整数或由单个整数构成的list/tuple，指定dilated convolution中的膨胀比例。任何不为1的dilation_rate均与任何不为1的strides均不兼容。
use_bias:布尔值，是否使用偏置项
kernel_initializer：权值初始化方法，为预定义初始化方法名的字符串，或用于初始化权重的初始化器。参考initializers
bias_initializer：权值初始化方法，为预定义初始化方法名的字符串，或用于初始化权重的初始化器。参考initializers
kernel_regularizer：施加在权重上的正则项，为Regularizer对象
bias_regularizer：施加在偏置向量上的正则项，为Regularizer对象
activity_regularizer：施加在输出上的正则项，为Regularizer对象

kernel_constraints：施加在权重上的约束项，为Constraints对象
bias_constraints：施加在偏置上的约束项，为Constraints对象
输入shape
形如（samples，steps，input_dim）的3D张量。
输出shape
形如（samples，new_steps，nb_filter）的3D张量，因为有向量填充的原因，steps的值会改变。

1.5.5.4.2 Conv2D层

keras.layers.convolutional.Conv2D(filters, kernel_size, strides=(1, 1), padding='valid', data_format=None, dilation_rate=(1, 1), activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
二维卷积层，即对图像的空域卷积。该层对二维输入进行滑动窗卷积，当使用该层作为第一层时，应提供input_shape参数。例如input_shape = (128,128,3)代表128*128的彩色RGB图像（data_format='channels_last'）

1.5.5.4.3 Conv3D层

keras.layers.convolutional.Conv3D(filters, kernel_size, strides=(1, 1, 1), padding='valid', data_format=None, dilation_rate=(1, 1, 1), activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
三维卷积对三维的输入进行滑动窗卷积，当使用该层作为第一层时，应提供input_shape参数。例如input_shape = (3,10,128,128)代表对10帧128*128的彩色RGB图像进行卷积。数据的通道位置仍然有data_format参数指定。

1.5.5.5池化层

1.5.5.5.1 MaxPooling1D层

keras.layers.pooling.MaxPooling1D(pool_size=2, strides=None, padding='valid')
对时域1D信号进行最大值池化
参数
pool_size：整数，池化窗口大小
strides：整数或None，下采样因子，例如设2将会使得输出shape为输入的一半，若为None则默认值为pool_size。
padding：‘valid’或者‘same’
输入shape
形如（samples，steps，features）的3D张量
输出shape
形如（samples，downsampled_steps，features）的3D张量

1.5.5.5.2 MaxPooling2D层

keras.layers.pooling.MaxPooling2D(pool_size=(2, 2), strides=None, padding='valid', data_format=None)
为空域信号施加最大值池化

1.5.5.5.3 AveragePooling1D层

keras.layers.pooling.AveragePooling1D(pool_size=2, strides=None, padding='valid')
对时域1D信号进行平均值池化

1.5.5.5.4 AveragePooling2D层

keras.layers.pooling.AveragePooling2D(pool_size=(2, 2), strides=None, padding='valid', data_format=None)
为空域信号施加平均值池化

1.5.5.6 循环层（Recurrent）

循环层包含三种模型：LSTM、GRU和SimpleRNN。
所有的循环层（LSTM,GRU,SimpleRNN）都继承本层，因此下面的参数可以在任何循环层中使用。

1.5.5.6.1抽象层，不能直接使用

keras.layers.recurrent.Recurrent(return_sequences=False, go_backwards=False, stateful=False, unroll=False, implementation=0)
weights：numpy array的list，用以初始化权重。该list形如[(input_dim, output_dim),(output_dim, output_dim),(output_dim,)]
 return_sequences：布尔值，默认False，控制返回类型。若为True则返回整个序列，否则仅返回输出序列的最后一个输出
 go_backwards：布尔值，默认为False，若为True，则逆向处理输入序列并返回逆序后的序列
 stateful：布尔值，默认为False，若为True，则一个batch中下标为i的样本的最终状态将会用作下一个batch同样下标的样本的初始状态。
 unroll：布尔值，默认为False，若为True，则循环层将被展开，否则就使用符号化的循环。当使用TensorFlow为后端时，循环网络本来就是展开的，因此该层不做任何事情。层展开会占用更多的内存，但会加速RNN的运算。层展开只适用于短序列。
 implementation：0，1或2，若为0，则RNN将以更少但是更大的矩阵乘法实现，因此在CPU上运行更快，但消耗更多的内存。如果设为1，则RNN将以更多但更小的矩阵乘法实现，因此在CPU上运行更慢，在GPU上运行更快，并且消耗更少的内存。如果设为2（仅LSTM和GRU可以设为2），则RNN将把输入门、遗忘门和输出门合并为单个矩阵，以获得更加在GPU上更加高效的实现。注意，RNN dropout必须在所有门上共享，并导致正则效果性能微弱降低。
 input_dim：输入维度，当使用该层为模型首层时，应指定该值（或等价的指定input_shape)
 input_length：当输入序列的长度固定时，该参数为输入序列的长度。当需要在该层后连接Flatten层，然后又要连接Dense层时，需要指定该参数，否则全连接的输出无法计算出来。注意，如果循环层不是网络的第一层，你需要在网络的第一层中指定序列的长度（通过input_shape指定）。

1.5.5.6.2全连接RNN网络

keras.layers.SimpleRNN(units, activation='tanh', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, return_sequences=False, return_state=False, go_backwards=False, stateful=False, unroll=False)
全连接RNN网络，RNN的输出会被回馈到输入
参数说明
• units：输出维度
• activation：激活函数，为预定义的激活函数名（参考激活函数）
• use_bias: 布尔值，是否使用偏置项
• kernel_initializer：权值初始化方法，为预定义初始化方法名的字符串，或用于初始化权重的初始化器。参考initializers
• recurrent_initializer：循环核的初始化方法，为预定义初始化方法名的字符串，或用于初始化权重的初始化器。参考initializers
• bias_initializer：权值初始化方法，为预定义初始化方法名的字符串，或用于初始化权重的初始化器。参考initializers
• kernel_regularizer：施加在权重上的正则项，为Regularizer对象
• bias_regularizer：施加在偏置向量上的正则项，为Regularizer对象
• recurrent_regularizer：施加在循环核上的正则项，为Regularizer对象
• activity_regularizer：施加在输出上的正则项，为Regularizer对象
• kernel_constraints：施加在权重上的约束项，为Constraints对象
• recurrent_constraints：施加在循环核上的约束项，为Constraints对象
• bias_constraints：施加在偏置上的约束项，为Constraints对象
• dropout：0~1之间的浮点数，控制输入线性变换的神经元断开比例
• recurrent_dropout：0~1之间的浮点数，控制循环状态的线性变换的神经元断开比例
• 其他参数参考Recurrent的说明

输入shape
形如（samples，timesteps，input_dim）的3D张量
输出shape
如果return_sequences=True：返回形如（samples，timesteps，output_dim）的3D张量
否则，返回形如（samples，output_dim）的2D张量
示例：

# as the first layer in a Sequential model
model = Sequential()
model.add(LSTM(32, input_shape=(10, 64)))
# now model.output_shape == (None, 32)
# note: `None` is the batch dimension.

# 以下与上面相同:
model = Sequential()
model.add(LSTM(32, input_dim=64, input_length=10))

# for subsequent layers, no need to specify the input size:
         model.add(LSTM(16))

# to stack recurrent layers, you must use return_sequences=True
# on any recurrent layer that feeds into another recurrent layer.
# note that you only need to specify the input size on the first layer.
model = Sequential()
model.add(LSTM(64, input_dim=64, input_length=10, return_sequences=True))
model.add(LSTM(32, return_sequences=True))
model.add(LSTM(10))

# as the first layer in a Sequential model

model = Sequential()

model.add(LSTM(32, input_shape=(10, 64)))

# now model.output_shape == (None, 32)

# note: `None` is the batch dimension.

# 以下与上面相同:

model = Sequential()

model.add(LSTM(32, input_dim=64, input_length=10))

# for subsequent layers, no need to specify the input size:

model.add(LSTM(16))

# to stack recurrent layers, you must use return_sequences=True

# on any recurrent layer that feeds into another recurrent layer.

# note that you only need to specify the input size on the first layer.

model = Sequential()

model.add(LSTM(64, input_dim=64, input_length=10, return_sequences=True))

model.add(LSTM(32, return_sequences=True))

model.add(LSTM(10))

1.5.5.6.3 LSTM层

keras.layers.recurrent.LSTM(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0)
Keras长短期记忆模型
forget_bias_init：遗忘门偏置的初始化函数，建议初始化为全1元素。
inner_activation：内部单元激活函数

1.5.5.6.4 GRU

keras.layers.recurrent.GRU(units, activation='tanh', recurrent_activation='hard_sigmoid', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0)
门限循环单元

1.5.5.6.5Embedding层

keras.layers.embeddings.Embedding(input_dim, output_dim, init='uniform', input_length=None, W_regularizer=None, activity_regularizer=None, W_constraint=None, mask_zero=False, weights=None, dropout=0.0)
只能作为模型第一层
mask_zero：布尔值，确定是否将输入中的‘0’看作是应该被忽略的‘填充’（padding）值，该参数在使用递归层处理变长输入时有用。设置为True的话，模型中后续的层必须都支持masking，否则会抛出异常

1.5.6 model层

model层是最主要的模块，model层可以将上面定义了各种基本组件组合起来。
model的方法：
model.summary() ：打印出模型概况
model.get_config() ：返回包含模型配置信息的Python字典
model.get_weights()：返回模型权重张量的列表，类型为numpy array
model.set_weights()：从numpy array里将权重载入给模型
model.to_json：返回代表模型的JSON字符串，仅包含网络结构，不包含权值。可以从JSON字符串中重构原模型：

from models import model_from_json

json_string = model.to_json()
model = model_from_json(json_string)
model.to_yaml：与model.to_json类似，同样可以从产生的YAML字符串中重构模型
from models import model_from_yaml

yaml_string = model.to_yaml()
model = model_from_yaml(yaml_string)

from models import model_from_json

json_string = model.to_json()

model = model_from_json(json_string)

model.to_yaml：与model.to_json类似，同样可以从产生的YAML字符串中重构模型

from models import model_from_yaml

yaml_string = model.to_yaml()

model = model_from_yaml(yaml_string)

model.save_weights(filepath)：将模型权重保存到指定路径，文件类型是HDF5（后缀是.h5）。
model.load_weights(filepath, by_name=False)：从HDF5文件中加载权重到当前模型中, 默认情况下模型的结构将保持不变。如果想将权重载入不同的模型（有些层相同）中，则设置by_name=True，只有名字匹配的层才会载入权重。
keras有两种model，分别是Sequential模型和泛型模型。

1.5.6.1 Sequential模型

Sequential是多个网络层的线性堆叠
可以通过向Sequential模型传递一个layer的list来构造该模型：

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential([Dense(32, input_dim=784),
Activation('relu'),
Dense(10),
Activation('softmax'),])

from keras.models import Sequential

from keras.layers import Dense, Activation

model = Sequential([Dense(32, input_dim=784),

Activation('relu'),

Dense(10),

Activation('softmax'),])

也可以通过.add()方法一个个的将layer加入模型中：

model = Sequential()
model.add(Dense(32, input_dim=784))
model.add(Activation('relu'))

model = Sequential()

model.add(Dense(32, input_dim=784))

model.add(Activation('relu'))

还可以通过merge将两个Sequential模型通过某种方式合并
Merge层提供了一系列用于融合两个层或两个张量的层对象和方法。以大写首字母开头的是Layer类，以小写字母开头的是张量的函数。小写字母开头的张量函数在内部实际上是调用了大写字母开头的层。
keras.engine.topology.Merge(layers=None, mode='sum', concat_axis=-1, dot_axes=-1, output_shape=None, node_indices=None, tensor_indices=None, name=None)
layers：该参数为Keras张量的列表，或Keras层对象的列表。该列表的元素数目必须大于1。
mode：合并模式，如果为字符串，则为下列值之一{“sum”，“mul”，“concat”，“ave”，“cos”，“dot”}
其中sum和mul是对待合并层输出做一个简单的求和、乘积运算，因此要求待合并层输出shape要一致。concat是将待合并层输出沿着最后一个维度进行拼接，因此要求待合并层输出只有最后一个维度不同。
Merge是一个层对象，在多个sequential组成的网络模型中，如果
x：输入数据。如果模型只有一个输入，那么x的类型是numpy array，如果模型有多个输入，那么x的类型应当为list，list的元素是对应于各个输入的numpy array
y：标签，numpy array
否则运行时很可能会提示意思就是你输入的维度与实际不符
 Add
keras.layers.Add()
添加输入列表的图层。
该层接收一个相同shape列表张量，并返回它们的和，shape不变。

import keras

input1 = keras.layers.Input(shape=(16,))
x1 = keras.layers.Dense(8, activation='relu')(input1)
input2 = keras.layers.Input(shape=(32,))
x2 = keras.layers.Dense(8, activation='relu')(input2)
added = keras.layers.Add()([x1, x2])  # equivalent to added = keras.layers.add([x1, x2])

out = keras.layers.Dense(4)(added)
model = keras.models.Model(inputs=[input1, input2], outputs=out)

import keras

input1 = keras.layers.Input(shape=(16,))

x1 = keras.layers.Dense(8, activation='relu')(input1)

input2 = keras.layers.Input(shape=(32,))

x2 = keras.layers.Dense(8, activation='relu')(input2)

added = keras.layers.Add()([x1, x2]) # equivalent to added = keras.layers.add([x1, x2])

out = keras.layers.Dense(4)(added)

model = keras.models.Model(inputs=[input1, input2], outputs=out)

 Concatenate
keras.layers.Concatenate(axis=-1)
该层接收一个列表的同shape张量，并返回它们的按照给定轴相接构成的向量。

1.5.6.2 函数式（Functional）模型

在Keras 2里我们将这个词改译为“函数式”，对函数式编程有所了解的同学应能够快速get到该类模型想要表达的含义。函数式模型称作Functional，但它的类名是Model，因此我们有时候也用Model来代表函数式模型。
Keras函数式模型接口是用户定义多输出模型、非循环有向模型或具有共享层的模型等复杂模型的途径。一句话，只要你的模型不是类似VGG一样一条路走到黑的模型，或者你的模型需要多于一个的输出，那么你总应该选择函数式模型。函数式模型是最广泛的一类模型，序贯模型（Sequential）只是它的一种特殊情况。
这部分的文档假设你已经对Sequential模型已经比较熟悉
让我们从简单一点的模型开始
第一个模型：全连接网络
Sequential当然是实现全连接网络的最好方式，但我们从简单的全连接网络开始，有助于我们学习这部分的内容。在开始前，有几个概念需要澄清：
层对象接受张量为参数，返回一个张量。
输入是张量，输出也是张量的一个框架就是一个模型，通过Model定义。
这样的模型可以被像Keras的Sequential一样被训练

from keras.layers import Input, Dense
from keras.models import Model

# This returns a tensor
inputs = Input(shape=(784,))

# a layer instance is callable on a tensor, and returns a tensor
x = Dense(64, activation='relu')(inputs)
x = Dense(64, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)

# This creates a model that includes
# the Input layer and three Dense layers
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(data, labels)  # starts training

from keras.layers import Input, Dense

from keras.models import Model

# This returns a tensor

inputs = Input(shape=(784,))

# a layer instance is callable on a tensor, and returns a tensor

x = Dense(64, activation='relu')(inputs)

x = Dense(64, activation='relu')(x)

predictions = Dense(10, activation='softmax')(x)

# This creates a model that includes

# the Input layer and three Dense layers

model = Model(inputs=inputs, outputs=predictions)

model.compile(optimizer='rmsprop',

loss='categorical_crossentropy',

metrics=['accuracy'])

model.fit(data, labels) # starts training

所有的模型都是可调用的，就像层一样
利用函数式模型的接口，我们可以很容易的重用已经训练好的模型：你可以把模型当作一个层一样，通过提供一个tensor来调用它。注意当你调用一个模型时，你不仅仅重用了它的结构，也重用了它的权重。

x = Input(shape=(784,))
# This works, and returns the 10-way softmax we defined above.
y = model(x)

x = Input(shape=(784,))

# This works, and returns the 10-way softmax we defined above.

y = model(x)

使用函数式模型的一个典型场景是搭建多输入、多输出的模型，如下图：

auxiliary_input = Input(shape=(5,), name='aux_input')
x = keras.layers.concatenate([lstm_out, auxiliary_input])

# We stack a deep densely-connected network on top
x = Dense(64, activation='relu')(x)
x = Dense(64, activation='relu')(x)
x = Dense(64, activation='relu')(x)

# And finally we add the main logistic regression layer
main_output = Dense(1, activation='sigmoid', name='main_output')(x)

auxiliary_input = Input(shape=(5,), name='aux_input')

x = keras.layers.concatenate([lstm_out, auxiliary_input])

# We stack a deep densely-connected network on top

x = Dense(64, activation='relu')(x)

# And finally we add the main logistic regression layer

main_output = Dense(1, activation='sigmoid', name='main_output')(x)

第2章 keras的使用流程

2.1 流程说明

第1步：构造数据：定义输入数据
第2步：构造模型：确定各个变量之间的计算关系
第3步：编译模型：编译已确定其内部细节
第4步：训练模型：导入数据，训练模型
第5步：测试模型
第6步：保存模型
把这些步骤进一步图形化为：

2.2 实例-详细说明使用流程

第1步：构造数据
我们需要根据模型fit（训练）时需要的数据格式来构造数据的shape，这里我们用numpy构造两个矩阵：一个是数据矩阵，一个是标签矩阵。

import numpy as np
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adam

x_train = np.random.random((1000, 784))
y_train = np.random.randint(2, size=(1000, 1))
x_test = np.random.random((200, 784))
y_test = np.random.randint(2, size=(200, 1))

import numpy as np

from keras.utils import np_utils

from keras.models import Sequential

from keras.layers import Dense, Activation

from keras.optimizers import Adam

x_train = np.random.random((1000, 784))

y_train = np.random.randint(2, size=(1000, 1))

x_test = np.random.random((200, 784))

y_test = np.random.randint(2, size=(200, 1))

通过numpy的random生成随机矩阵，数据矩阵是1000行784列的矩阵，标签矩阵是1000行1列的句子，所以数据矩阵的一行就是一个样本，这个样本是784维的。
第2步构造模型
我们来构造一个神经网络模型，keras构造深度学习模型可以采用序列模型（基于Sequential类）或函数模型（又称为通用模型）（基于Model类）。两种间差异是拓扑结构不一样。这里我们采用序列模型。

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=784))
model.add(Dense(1, activation='sigmoid'))

model = Sequential()

model.add(Dense(32, activation='relu', input_dim=784))

model.add(Dense(1, activation='sigmoid'))

在这一步中可以add多个层，也可以merge合并两个模型。
第3步：编译模型
我们编译上一步构造好的模型，并指定一些模型的参数，optimizer（优化器），loss（目标函数或损失函数），metrics（评估模型的指标）等。编译模型时损失函数和优化器这两项是必须的。

model.compile(optimizer='Adam',loss='binary_crossentropy',metrics=['accuracy'])

1	model.compile(optimizer='Adam',loss='binary_crossentropy',metrics=['accuracy'])

第4步：训练模型
传入要训练的数据和标签，并指定训练的一些参数，然后进行模型训练。

model.fit(x_train, y_train, epochs=10,verbose=2, batch_size=32,)

1	model.fit(x_train, y_train, epochs=10,verbose=2, batch_size=32,)

Epoch 1/10
- 1s - loss: 0.7063 - acc: 0.5010
Epoch 2/10
- 0s - loss: 0.6955 - acc: 0.5110
.................................
Epoch 10/10
- 0s - loss: 0.5973 - acc: 0.6980

epochs：整数，训练的轮数。
verbose：训练时显示实时信息，0表示不显示数据，1表示显示进度条，2表示用只显示一个数据。
batch_size：整数，指定进行梯度下降时每个batch包含的样本数。训练时一个batch的样本会被计算一次梯度下降，使目标函数优化一步。
第5步：测试模型
用测试数据测试已经训练好的模型，并可以获得测试结果，从而对模型进行评估

score = model.evaluate(x_test, y_test, batch_size=32)

1	score = model.evaluate(x_test, y_test, batch_size=32)

200/200 [==============================] - 0s 146us/step
本函数返回一个测试误差的标量值（如果模型没有其他评价指标），或一个标量的list（如果模型还有其他的评价指标）
第6步：保存模型

#将模型保存为json
json_string = model.to_json()  
#从保存的json中加载模型  
from keras.models import model_from_json  
model_re = model_from_json(json_string)

#将模型保存为json

json_string = model.to_json()

#从保存的json中加载模型

from keras.models import model_from_json

model_re = model_from_json(json_string)

【项目延伸】
上面是采用全连接的神经网络，包括输入层、一个隐含层及一个输出层。如果我们卷积神经网络是否可以？例如：一个卷积层+池化层+展平+全连接+输出层。

第3章 keras实现单层神经网络

3.1利用keras实现单层神经

本章利用Keras架构实现一个传统机器学习算法---线性回归
根据输入数据及目标数据，模拟一个线性函数y=kx+b
这里使用一个神经元，神经元中使用Relu作为激活函数。如下图：

第1步：构造数据

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
%matplotlib inline

#构造数据
X = np.linspace(-2, 2, 200)
np.random.shuffle(X)    # randomize the data
#添加一些噪音数据
Y = 0.5 * X + 2 + np.random.normal(0, 0.05, (200, ))

# 显示输入数据
plt.scatter(X, Y)
plt.show()

import numpy as np

from keras.models import Sequential

from keras.layers import Dense

import matplotlib.pyplot as plt

%matplotlib inline

#构造数据

X = np.linspace(-2, 2, 200)

np.random.shuffle(X) # randomize the data

#添加一些噪音数据

Y = 0.5 * X + 2 + np.random.normal(0, 0.05, (200, ))

# 显示输入数据

plt.scatter(X, Y)

plt.show()

把200份数据划分为训练数据、测试数据。

X_train, Y_train = X[:160], Y[:160]     # first 160 data points
X_test, Y_test = X[160:], Y[160:]       # last 40 data points

1 2	X_train, Y_train = X[:160], Y[:160] # first 160 data points X_test, Y_test = X[160:], Y[160:] # last 40 data points

第2步构造模型

# build a neural network from the 1st layer to the last layer
model = Sequential()
model.add(Dense(units=1,activation='relu', input_dim=1))

# build a neural network from the 1st layer to the last layer

model = Sequential()

model.add(Dense(units=1,activation='relu', input_dim=1))

第3步编译模型

# choose loss function and optimizing method
model.compile(loss='mse', optimizer='sgd')

1 2	# choose loss function and optimizing method model.compile(loss='mse', optimizer='sgd')

第4步训练模型

model.fit(X_train, Y_train, epochs=100,verbose=0, batch_size=64,)

1	model.fit(X_train, Y_train, epochs=100,verbose=0, batch_size=64,)

第5步测试模型

# test
print('\nTesting ------------')
cost = model.evaluate(X_test, Y_test, batch_size=40)
print('test cost:', cost)
W, b = model.layers[0].get_weights()
print('Weights=', W, '\nbiases=', b)

# test

print('\nTesting ------------')

cost = model.evaluate(X_test, Y_test, batch_size=40)

print('test cost:', cost)

W, b = model.layers[0].get_weights()

print('Weights=', W, '\nbiases=', b)

Testing ------------
40/40 [==============================] - 0s 996us/step
test cost: 0.00395184289664
Weights= [[ 0.48489931]]
biases= [ 1.95838749]

可视化结果：

# plotting the prediction
Y_pred = model.predict(X_test)
plt.scatter(X_test, Y_test)
plt.plot(X_test, Y_pred)
plt.show()

# plotting the prediction

Y_pred = model.predict(X_test)

plt.scatter(X_test, Y_test)

plt.plot(X_test, Y_pred)

plt.show()

第4章 keras实现多层神经网络

利用keras构造一个多层神经网络，用该神经网络识别手写数字，上次我们采用python来实现，这里我们采用keras来构造多层神经网络。
网络构造图形：

在整个网络设计中，输入数据的维度，优化方法、损失函数需要重点考虑。当然激活函数也很重要，特别是层数较多时。
这里为便于说明keras构建多层神经网络的方法，采用MNIST数据集，MNIST是一个手写数字0-9的数据集，它有60000个训练样本集和10000个测试样本集它是NIST数据库的一个子集。该数据集keras有现成的数据处理API(mnist.load_data())。
数据预处理：
（1）展平矩阵：
原数据为28*28图片，在利用全连接前，需要把矩阵拉平为一维数组，大小为784；
（2）规范训练数据
转换为都是0-255的像素，为提高模型的泛化能力，需要对数据规范化，即除以255，使数据范围都在[0,1]之间；
（3）规范标签数据
把标签数据转换为one-hot格式，向量维度为10，每行除一个1元素外，其它都是0，如把2标签转换为[0,0,1,0,0,0,0,0,0,0]
以下为详细计算步骤：
第1步：构建数据

import numpy as np
from keras.datasets import mnist
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adam

# download the mnist to the path '~/.keras/datasets/' 
(X_train, y_train), (X_test, y_test) = mnist.load_data()
#查看数据集维度
print("展平前")
print(X_train.shape,y_train.shape)

# data pre-processing
X_train = X_train.reshape(X_train.shape[0], -1) / 255.   # normalize
X_test = X_test.reshape(X_test.shape[0], -1) / 255.      # normalize
y_train = np_utils.to_categorical(y_train, num_classes=10)
y_test = np_utils.to_categorical(y_test, num_classes=10)
print("展平后")
print(X_train.shape,y_train.shape)

import numpy as np

from keras.datasets import mnist

from keras.utils import np_utils

from keras.models import Sequential

from keras.layers import Dense, Activation

from keras.optimizers import Adam

# download the mnist to the path '~/.keras/datasets/'

(X_train, y_train), (X_test, y_test) = mnist.load_data()

#查看数据集维度

print("展平前")

print(X_train.shape,y_train.shape)

# data pre-processing

X_train = X_train.reshape(X_train.shape[0], -1) / 255. # normalize

X_test = X_test.reshape(X_test.shape[0], -1) / 255. # normalize

y_train = np_utils.to_categorical(y_train, num_classes=10)

y_test = np_utils.to_categorical(y_test, num_classes=10)

print("展平后")

print(X_train.shape,y_train.shape)

运行结果：
Using TensorFlow backend.
展平前
(60000, 28, 28) (60000,)
展平后
(60000, 784) (60000, 10)

第2步构建网络

model = Sequential([
    Dense(32, input_dim=784),
    Activation('relu'),
    Dense(10),
    Activation('softmax'),
])

model = Sequential([

Dense(32, input_dim=784),

Activation('relu'),

Dense(10),

Activation('softmax'),

])

第3步编译模型

model.compile(optimizer='Adam',loss='categorical_crossentropy',metrics=['accuracy'])

1	model.compile(optimizer='Adam',loss='categorical_crossentropy',metrics=['accuracy'])

运行结果：
【备注】
如果我们需要对优化方法进行某些定制化，也很方便：

Adam1 =Adam(lr=0.001, beta_1=0.5, epsilon=1e-08, decay=0.0)

1	Adam1 =Adam(lr=0.001, beta_1=0.5, epsilon=1e-08, decay=0.0)

第4步训练模型

print('Training ------------')
model.fit(X_train, y_train, epochs=2,verbose=2, batch_size=32)

1 2	print('Training ------------') model.fit(X_train, y_train, epochs=2,verbose=2, batch_size=32)

运行结果：
Training ------------
Epoch 1/2
- 7s - loss: 0.1584 - acc: 0.9539
Epoch 2/2
- 6s - loss: 0.1340 - acc: 0.9607
第5步测试模型

print('\nTesting ------------')
# Evaluate the model with the metrics we defined earlier
loss, accuracy = model.evaluate(X_test, y_test)

print('test loss: ', loss)
print('test accuracy: ', accuracy)

print('\nTesting ------------')

# Evaluate the model with the metrics we defined earlier

loss, accuracy = model.evaluate(X_test, y_test)

print('test loss: ', loss)

print('test accuracy: ', accuracy)

运行结果：
test loss: 0.132445694927
test accuracy: 0.9614

本章数据集下载

第10章构建Spark ML聚类模型

前面我们介绍了推荐、分类、回归等模型，这些模型属于监督学习，在训练模型时，都提供目标值或标签数据，根据目标值训练模型，然后根据模型对测试数据或新数据进行推荐、分类或预测。
但实际数据有很多是没有标签数据，或者预先标签很难，但我们又希望或需要从这些数据中提炼一些规则或特征等，如识别异常数据、对客户进行分类等，解决这类问题就属于无监督学习。
聚类是一种无监督学习，它与分类的不同，聚类所要求划分的类是未知的。
聚类算法的思想就是物以类聚的思想，相同性质的点在空间中表现的较为紧密和接近，主要用于数据探索与异常检测。
聚类分析是一种探索性的分析，在分类的过程中，人们不必事先给出一个分类的标准，它能够从样本数据出发，自动进行分类。聚类分析也有很多方法，使用不同方法往往会得到不同的结论。从实际应用的角度看，聚类分析是数据挖掘的主要任务之一。而且聚类能够作为一个独立的工具获得数据的分布状况，观察每一簇数据的特征，集中对特定的聚簇集合作进一步地分析。聚类分析还可以作为其他算法（如分类和推荐等算法）的预处理步骤。

10.1 K-means模型简介

作为经典的聚类算法，一般的机器学习框架里都有K-means，Spark自然也不例外。
不过spark中的K-means，除有一般K-means的特点外，还进行了如下的优化：

10.2 数据加载

这里我们以某批发经销商的客户对不同产品的年度消费支出（数据来源http://archive.ics.uci.edu/ml/datasets/Wholesale+customers）
读取HDFS中的数据。

//导入需要的类
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.{Pipeline, PipelineModel}

//通过spark.read,读取HDFS中的数据
val rawdata = spark.read.format("csv").option("header", true).load("hdfs://master:9000/home/hadoop/data/customers_sale.csv")

//导入需要的类

import org.apache.spark.ml.clustering.KMeans

import org.apache.spark.ml.feature.VectorAssembler

import org.apache.spark.ml.feature.OneHotEncoder

import org.apache.spark.ml.feature.StandardScaler

import org.apache.spark.ml.{Pipeline, PipelineModel}

//通过spark.read,读取HDFS中的数据

val rawdata = spark.read.format("csv").option("header", true).load("hdfs://master:9000/home/hadoop/data/customers_sale.csv")

10.3 探索特征的相关性

//数据的样本信息
rawdata.show(3)
+-------+------+-----+----+-------+------+----------------+----------+
|Channel|Region|Fresh|Milk|Grocery|Frozen|Detergents_Paper|Delicassen|
+-------+------+-----+----+-------+------+----------------+----------+
|      2|     3|12669|9656|   7561|   214|            2674|      1338|
|      2|     3| 7057|9810|   9568|  1762|            3293|      1776|
|      2|     3| 6353|8808|   7684|  2405|            3516|      7844|
+------+-----+----+----+------+-----+--------------+---------+
//查看数据结构
rawdata.printSchema()
root
 |-- Channel: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Fresh: string (nullable = true)
 |-- Milk: string (nullable = true)
 |-- Grocery: string (nullable = true)
 |-- Frozen: string (nullable = true)
 |-- Detergents_Paper: string (nullable = true)
 |-- Delicassen: string (nullable = true)

//数据的样本信息

rawdata.show(3)

+-------+------+-----+----+-------+------+----------------+----------+

+-------+------+-----+----+-------+------+----------------+----------+

| 2| 3|12669|9656| 7561| 214| 2674| 1338|

| 2| 3| 7057|9810| 9568| 1762| 3293| 1776|

| 2| 3| 6353|8808| 7684| 2405| 3516| 7844|

+------+-----+----+----+------+-----+--------------+---------+

//查看数据结构

rawdata.printSchema()

root

|-- Channel: string (nullable = true)

|-- Region: string (nullable = true)

|-- Fresh: string (nullable = true)

|-- Milk: string (nullable = true)

|-- Grocery: string (nullable = true)

|-- Frozen: string (nullable = true)

|-- Detergents_Paper: string (nullable = true)

|-- Delicassen: string (nullable = true)

从以上分析，我们可以看出，rawdata数据集总记录数为440条，最大与最小值相差不大，已统计的特征来看，没有缺失值，数据类型为字符型，这点需要在预处理中转换为Double型。
利用pyspark我们可以画出这些特征间的相关性，这里使用pearson's r,相关系统在[-1,1]之间，如果r=1，表示特征完全正相关；r=0，表示不存在关系；r=-1,表示特征完全负相关。
实现代码：

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df=pd.read_csv('/home/hadoop/data/customer_sale/customers_sale.csv',header=0)
cols=['Channel','Region','Fresh','Milk','Grocery','Frozen','Detergents_Paper','Delicassen']
cm=np.corrcoef(df[cols].values.T)

sns.set(font_scale=1.2)
hm=sns.heatmap(cm,cbar=True,annot=True,square=True,fmt='.2f',annot_kws={'size':15},yticklabels=cols,xticklabels=cols)
plt.show()
plt.savefig('sale_corr.png')

import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

df=pd.read_csv('/home/hadoop/data/customer_sale/customers_sale.csv',header=0)

cols=['Channel','Region','Fresh','Milk','Grocery','Frozen','Detergents_Paper','Delicassen']

cm=np.corrcoef(df[cols].values.T)

sns.set(font_scale=1.2)

hm=sns.heatmap(cm,cbar=True,annot=True,square=True,fmt='.2f',annot_kws={'size':15},yticklabels=cols,xticklabels=cols)

plt.show()

plt.savefig('sale_corr.png')

10.4 数据预处理

通过数据探索，发现数据需要又字符转换为数值型,并缓存。

val data1= rawdata.select(
rawdata("Channel").cast("Double"),
rawdata("Region").cast("Double"),
rawdata("Fresh").cast("Double"),
rawdata("Milk").cast("Double"),
rawdata("Grocery").cast("Double"),
rawdata("Frozen").cast("Double"),
rawdata("Detergents_Paper").cast("Double"),
rawdata("Delicassen").cast("Double")).cache()

val data1= rawdata.select(

rawdata("Channel").cast("Double"),

rawdata("Region").cast("Double"),

rawdata("Fresh").cast("Double"),

rawdata("Milk").cast("Double"),

rawdata("Grocery").cast("Double"),

rawdata("Frozen").cast("Double"),

rawdata("Detergents_Paper").cast("Double"),

rawdata("Delicassen").cast("Double")).cache()

查看数据的统计信息：

data1.select("Fresh","Milk","Grocery","Frozen","Detergents_Paper","Delicassen").describe().show()
+-------+------------------+------------------+-----------------+-----------------+------------------+------------------+
|summary|             Fresh|              Milk|          Grocery|           Frozen|  Detergents_Paper|        Delicassen|
+-------+------------------+------------------+-----------------+-----------------+------------------+------------------+
|  count|               440|               440|              440|              440|               440|               440|
|   mean|12000.297727272728| 5796.265909090909|7951.277272727273|3071.931818181818|2881.4931818181817|1524.8704545454545|
| stddev|12647.328865076885|7380.3771745708445|9503.162828994346|4854.673332592367| 4767.854447904201|2820.1059373693965|
|    min|               3.0|              55.0|              3.0|             25.0|               3.0|               3.0|
|    max|          112151.0|           73498.0|          92780.0|          60869.0|           40827.0|           47943.0|
+-------+------------------+------------------+-----------------+-----------------+------------------+------------------+

data1.select("Fresh","Milk","Grocery","Frozen","Detergents_Paper","Delicassen").describe().show()

+-------+------------------+------------------+-----------------+-----------------+------------------+------------------+

+-------+------------------+------------------+-----------------+-----------------+------------------+------------------+

| count| 440| 440| 440| 440| 440| 440|

| mean|12000.297727272728| 5796.265909090909|7951.277272727273|3071.931818181818|2881.4931818181817|1524.8704545454545|

| stddev|12647.328865076885|7380.3771745708445|9503.162828994346|4854.673332592367| 4767.854447904201|2820.1059373693965|

| min| 3.0| 55.0| 3.0| 25.0| 3.0| 3.0|

| max| 112151.0| 73498.0| 92780.0| 60869.0| 40827.0| 47943.0|

+-------+------------------+------------------+-----------------+-----------------+------------------+------------------+

Channel、Region为类别型，其余6个字段为连续型，为此，在训练模型前，需要对类别特征先转换为二元向量，然后，对各特征进行规范化。最后得到一个新的特征向量。
对类别特征转换为二元编码：

//把channel特征转换为二元编码
val datahot1=new OneHotEncoder()
.setInputCol("Channel")
.setOutputCol("Channelvector")
.setDropLast(false)

//把Region特征转换为二元编码
val datahot2=new OneHotEncoder()
.setInputCol("Region")
.setOutputCol("Regionvector")
.setDropLast(false)

//把channel特征转换为二元编码

val datahot1=new OneHotEncoder()

.setInputCol("Channel")

.setOutputCol("Channelvector")

.setDropLast(false)

//把Region特征转换为二元编码

val datahot2=new OneHotEncoder()

.setInputCol("Region")

.setOutputCol("Regionvector")

.setDropLast(false)

把新生成的两个特征及原来的6个特征组成一个特征向量

val featuresArray =Array("Channelvector","Regionvector","Fresh","Milk","Grocery","Frozen","Detergents_Paper","Delicassen")

1	val featuresArray =Array("Channelvector","Regionvector","Fresh","Milk","Grocery","Frozen","Detergents_Paper","Delicassen")

把源数据组合成特征向量features

val vecDF = new VectorAssembler()
.setInputCols(featuresArray)
.setOutputCol("features")
对特征进行规范化
val scaledDF = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(false)

val vecDF = new VectorAssembler()

.setInputCols(featuresArray)

.setOutputCol("features")

对特征进行规范化

val scaledDF = new StandardScaler()

.setInputCol("features")

.setOutputCol("scaledFeatures")

.setWithStd(true)

.setWithMean(false)

10.5 组装

这里我们只使用了setK、setSeed两个参数，其余的使用缺省值。

val kmeans = new KMeans().setFeaturesCol("scaledFeatures").setK(4).setSeed(123)

//把转换二元向量、特征规范化转换等组装到流水线上，因pipeline中无聚类的评估函数，故，这里流水线中不纳入kmeans。具体实现如下：
val pipeline1 = new Pipeline().setStages(Array(datahot1,datahot2,vecDF,scaledDF))
val data2=pipeline1.fit(data1).transform(data1)
//训练模型
val model=kmeans.fit(data2)
val results = model.transform(data2)

//评估模型
val WSSSE = model.computeCost(data2)
println(s"Within Set Sum of Squared Errors = $WSSSE")
//显示聚类结果。
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
results.collect().foreach(row => {println( row(10) + " is predicted as cluster " + row(11))})
//部分结果
[0.0,0.0,1.0,0.0,0.0,0.0,1.0,11867.0,3327.0,4814.0,1178.0,3837.0,120.0] is predicted as cluster [0.0,0.0,2.1365167114232353,0.0,0.0,0.0,2.220261941072331,0.9383008955170281,0.4507899693071525,0.5065681906777801,0.24265278408979063,0.8047644998237361,0.0425515929773675]
[0.0,0.0,1.0,0.0,0.0,0.0,1.0,16117.0,46197.0,92780.0,1026.0,40827.0,2944.0] is predicted as cluster [0.0,0.0,2.1365167114232353,0.0,0.0,0.0,2.220261941072331,1.2743402319919055,6.259436192390298,9.763065378289248,0.21134274743304346,8.562971132213624,1.0439324143780826]
//当k=4的记录数
results.select("scaledFeatures","prediction").groupBy("prediction").count.show()
+----------+-----+                                                              
|prediction|count|
+----------+-----+
|         1|   10|
|         3|  136|
|         2|   64|
|         0|  230|
+----------+-----+
//由此可知，第0,3族较大，1,2比较小
results.select("scaledFeatures","prediction").filter(i=>i(1)==0).show(20)
val result0=results.select("scaledFeatures","prediction").filter(i=>i(1)==0).select("scaledFeatures")

val kmeans = new KMeans().setFeaturesCol("scaledFeatures").setK(4).setSeed(123)

//把转换二元向量、特征规范化转换等组装到流水线上，因pipeline中无聚类的评估函数，故，这里流水线中不纳入kmeans。具体实现如下：

val pipeline1 = new Pipeline().setStages(Array(datahot1,datahot2,vecDF,scaledDF))

val data2=pipeline1.fit(data1).transform(data1)

//训练模型

val model=kmeans.fit(data2)

val results = model.transform(data2)

//评估模型

val WSSSE = model.computeCost(data2)

println(s"Within Set Sum of Squared Errors = $WSSSE")

//显示聚类结果。

println("Cluster Centers: ")

model.clusterCenters.foreach(println)

results.collect().foreach(row => {println( row(10) + " is predicted as cluster " + row(11))})

//部分结果

[0.0,0.0,1.0,0.0,0.0,0.0,1.0,11867.0,3327.0,4814.0,1178.0,3837.0,120.0] is predicted as cluster [0.0,0.0,2.1365167114232353,0.0,0.0,0.0,2.220261941072331,0.9383008955170281,0.4507899693071525,0.5065681906777801,0.24265278408979063,0.8047644998237361,0.0425515929773675]

[0.0,0.0,1.0,0.0,0.0,0.0,1.0,16117.0,46197.0,92780.0,1026.0,40827.0,2944.0] is predicted as cluster [0.0,0.0,2.1365167114232353,0.0,0.0,0.0,2.220261941072331,1.2743402319919055,6.259436192390298,9.763065378289248,0.21134274743304346,8.562971132213624,1.0439324143780826]

//当k=4的记录数

results.select("scaledFeatures","prediction").groupBy("prediction").count.show()

+----------+-----+

|prediction|count|

+----------+-----+

| 1| 10|

| 3| 136|

| 2| 64|

| 0| 230|

+----------+-----+

//由此可知，第0,3族较大，1,2比较小

results.select("scaledFeatures","prediction").filter(i=>i(1)==0).show(20)

val result0=results.select("scaledFeatures","prediction").filter(i=>i(1)==0).select("scaledFeatures")

10.6 模型优化

聚类模型中最重要的是参数k的选择，下面我们通过循环来获取哪个k值的性能最好。

val KSSE = (2 to 20 by 1).toList.map { k =>
val kmeans = new KMeans().setFeaturesCol("scaledFeatures").setK(k).setSeed(123)
val model = kmeans.fit(data2)
// 评估性能.
val WSSSE = model.computeCost(data2)

// K，实际迭代次数，SSE，聚类类别编号，每类的记录数，类中心点
(k, model.getMaxIter, WSSSE, model.summary.cluster, model.summary.clusterSizes, model.clusterCenters)
    }

//显示k、WSSSE评估指标，并按指标排序
KSSE.map(x=>(x._1,x._3)).sortBy(x=>x._2).foreach(println)
//显示结果
(20,635.6231456631109)
(19,674.1240263779249)
(18,696.2925462727684)
(17,747.697734807987)
(15,848.393503421027)
(16,878.8045714559038)
(14,932.4137349866897)
(13,988.2458378719449)
(12,1026.9426528633646)
(11,1165.7468060138433)
(10,1201.1295734061587)
(9,1242.388169008257)
(8,1399.0770764839624)
(7,1523.4613624094593)
(6,1965.6551642041663)
(5,2405.5349119889274)
(4,2595.7328287620885)
(3,3123.9948271417393)
(2,3480.224930619828)
//把该结果保存到HDFS上
KSSE.map(x=>(x._1,x._3)).sortBy(x=>x._2).toDF.write.save("/home/hadoop/data/ksse")

val KSSE = (2 to 20 by 1).toList.map { k =>

val kmeans = new KMeans().setFeaturesCol("scaledFeatures").setK(k).setSeed(123)

val model = kmeans.fit(data2)

// 评估性能.

val WSSSE = model.computeCost(data2)

// K，实际迭代次数，SSE，聚类类别编号，每类的记录数，类中心点

(k, model.getMaxIter, WSSSE, model.summary.cluster, model.summary.clusterSizes, model.clusterCenters)

}

//显示k、WSSSE评估指标，并按指标排序

KSSE.map(x=>(x._1,x._3)).sortBy(x=>x._2).foreach(println)

//显示结果

(20,635.6231456631109)

(19,674.1240263779249)

(18,696.2925462727684)

(17,747.697734807987)

(15,848.393503421027)

(16,878.8045714559038)

(14,932.4137349866897)

(13,988.2458378719449)

(12,1026.9426528633646)

(11,1165.7468060138433)

(10,1201.1295734061587)

(9,1242.388169008257)

(8,1399.0770764839624)

(7,1523.4613624094593)

(6,1965.6551642041663)

(5,2405.5349119889274)

(4,2595.7328287620885)

(3,3123.9948271417393)

(2,3480.224930619828)

//把该结果保存到HDFS上

KSSE.map(x=>(x._1,x._3)).sortBy(x=>x._2).toDF.write.save("/home/hadoop/data/ksse")

以上数据可视化的图形如图10-2所示。

图10-2聚类模型中族K与评估指标的关系

从图10-2中不难看出，k<12时，性能（computeCost）提升比较明显，>12后，逐渐变缓。所以K越大不一定越好，恰当才是重要的。
当k=10时，聚类结果如下：
+----------+-----+
|prediction|count|
+----------+-----+
| 1| 65|
| 6| 18|
| 3| 86|
| 5| 27|
| 9| 3|
| 4| 2|
| 8| 2|
| 7| 46|
| 2| 27|
| 0| 164|
+----------+-----+
图10-3为k取10时（族0对应的channel和Region分别为1和3；族3对应的channel和Region分别为2和3），前两大族的销售均值比较图，从图中可以看出，团购冷藏食品均值大于或接近零售冷藏食品均值。说明团购对冷藏食品量比较大。

图10-3 聚类模型中K=10时0和3族平均销售额对比

10.7 小结

本章主要介绍了用Spark ML中的聚类算法，对某地多种销售数据进行聚类分析，在分析前对数据集主要特征进行了相关性分析，并对类别数据进行二元向量化，对连续性数据进行规范和标准化，然后把这些stages组装在流水线上，在模型训练中，我们尝试不同K的取值，以便获取最佳族群数。

本章数据集下载

第9章构建Spark ML回归模型

回归模型属于监督式学习，每个个体都有一个与之相关联的实数标签，并且我们希望在给出用于表示这些实体的数值特征后，所预测出的标签值可以尽可能接近实际值。
回归算法是试图采用对误差的衡量来探索变量之间的关系的一类算法。回归算法是统计机器学习的利器。在机器学习领域，人们说起回归，有时候是指一类问题，有时候是指一类算法，这一点常常会使初学者有所困惑。常见的回归算法包括：普通最小二乘法（OLS）（Ordinary Least Square），它使用损失函数是平方损失函数（1/2 (w^T x-y)^2），简单的预测就是y=w^T x，标准的最小二乘回归不使用正则化，这就意味着数据中异常数据点非常敏感，因此，在实际应用中经常使用一定程度的正则化（目的避免过拟合、提供泛化能力）。
本章主要介绍Spark ML中的回归模型，以回归分析中常用决策树回归、线性回归为例，对共享单车租赁的情况进行预测，其中介绍了一些特征转换、特征选择、交叉验证等方法的具体使用，主要内容包括：
回归模型简介
把数据加载到HDFS,Spark读取HDFS中的数据
探索特征及其分布信息
预处理数据
把pipeline的多个Stages组装到流水线上
模型优化

9.1 回归模型简介

ML目前支持回归模型有：
Linear regression (线性回归)
Generalized linear regression(广义线性回归)
Decision tree regression (决策树的回归)
Random forest regression(随机森林回归 )
Gradient-boosted tree regression (梯度提高树回归)
Survival regression(生存回归)
Isotonic regression(保序回归)

9.2 数据加载

查看数据大致情况：

####查看文件前3行数据
$ head -3 hour.csv 
instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0,3,13,16
2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0,8,32,40
###查看文件记录总数
$ wc -l hour.csv 
17380 hour.csv
###查看文件列数
cat hour.csv | head -1 | awk -F ',' '{print NF}'
17

####查看文件前3行数据

$ head -3 hour.csv

instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt

1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0,3,13,16

2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0,8,32,40

###查看文件记录总数

$ wc -l hour.csv

17380 hour.csv

###查看文件列数

cat hour.csv | head -1 | awk -F ',' '{print NF}'

从数据集前3行的数据可以看出，第一行为标题，其他为租赁数据，共有17个字段和17380条记录。
把数据文件hour.csv复制到HDFS上。

$ hadoop fs -put hour.csv  /home/hadoop/data

1	$ hadoop fs -put hour.csv /home/hadoop/data

以独立模式启动spark，然后读取数据。

$ spark-shell --master spark://master:7077 --driver-memory 1G --total-executor-cores 4

1	$ spark-shell --master spark://master:7077 --driver-memory 1G --total-executor-cores 4

导入需要使用的类。

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Row
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Dataset
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer,VectorAssembler}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.ml.regression.DecisionTreeRegressionModel
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.Row

import org.apache.spark.sql.DataFrame

import org.apache.spark.sql.Dataset

import org.apache.spark.ml.Pipeline

import org.apache.spark.ml.evaluation.RegressionEvaluator

import org.apache.spark.ml.linalg.Vectors

import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer,VectorAssembler}

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

import org.apache.spark.ml.regression.DecisionTreeRegressionModel

import org.apache.spark.ml.regression.DecisionTreeRegressor

import org.apache.spark.ml.regression.LinearRegression

import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

import org.apache.spark.sql.types._

import org.apache.spark.sql.functions._

读取数据，把第一行为列名：

val rawdata = spark.read.format("csv").option("header", true).load("hdfs://master:9000/home/hadoop/data/hour.csv")

1	val rawdata = spark.read.format("csv").option("header", true).load("hdfs://master:9000/home/hadoop/data/hour.csv")

查看前4行样本数据

rawdata.show(4)
+-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+
|instant|    dteday|season| yr|mnth| hr|holiday|weekday|workingday|weathersit|temp| atemp| hum|windspeed|casual|registered|cnt|
+-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+
|      1|2011-01-01|     1|  0|   1|  0|      0|      6|         0|         1|0.24|0.2879|0.81|        0|     3|        13| 16|
|      2|2011-01-01|     1|  0|   1|  1|      0|      6|         0|         1|0.22|0.2727| 0.8|        0|     8|        32| 40|
|      3|2011-01-01|     1|  0|   1|  2|      0|      6|         0|         1|0.22|0.2727| 0.8|        0|     5|        27| 32|
|      4|2011-01-01|     1|  0|   1|  3|      0|      6|         0|         1|0.24|0.2879|0.75|        0|     3|        10| 13|
+-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+

rawdata.show(4)

+-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+

+-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+

| 1|2011-01-01| 1| 0| 1| 0| 0| 6| 0| 1|0.24|0.2879|0.81| 0| 3| 13| 16|

| 2|2011-01-01| 1| 0| 1| 1| 0| 6| 0| 1|0.22|0.2727| 0.8| 0| 8| 32| 40|

| 3|2011-01-01| 1| 0| 1| 2| 0| 6| 0| 1|0.22|0.2727| 0.8| 0| 5| 27| 32|

| 4|2011-01-01| 1| 0| 1| 3| 0| 6| 0| 1|0.24|0.2879|0.75| 0| 3| 10| 13|

+-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+

9.3 探索特征分布

Spark读取数据后，我们就可以对数据进行探索和分析，首先查看前4行样本数据

rawdata.show(4)
+-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+
|instant|    dteday|season| yr|mnth| hr|holiday|weekday|workingday|weathersit|temp| atemp| hum|windspeed|casual|registered|cnt|
+-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+
|      1|2011-01-01|     1|  0|   1|  0|      0|      6|         0|         1|0.24|0.2879|0.81|        0|     3|        13| 16|
|      2|2011-01-01|     1|  0|   1|  1|      0|      6|         0|         1|0.22|0.2727| 0.8|        0|     8|        32| 40|
|      3|2011-01-01|     1|  0|   1|  2|      0|      6|         0|         1|0.22|0.2727| 0.8|        0|     5|        27| 32|
|      4|2011-01-01|     1|  0|   1|  3|      0|      6|         0|         1|0.24|0.2879|0.75|        0|     3|        10| 13|
+-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+

rawdata.show(4)

+-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+

+-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+

| 1|2011-01-01| 1| 0| 1| 0| 0| 6| 0| 1|0.24|0.2879|0.81| 0| 3| 13| 16|

| 2|2011-01-01| 1| 0| 1| 1| 0| 6| 0| 1|0.22|0.2727| 0.8| 0| 8| 32| 40|

| 3|2011-01-01| 1| 0| 1| 2| 0| 6| 0| 1|0.22|0.2727| 0.8| 0| 5| 27| 32|

| 4|2011-01-01| 1| 0| 1| 3| 0| 6| 0| 1|0.24|0.2879|0.75| 0| 3| 10| 13|

+-------+----------+------+---+----+---+-------+-------+----------+----------+----+------+----+---------+------+----------+---+

查看rawdata的数据结构

rawdata.printSchema
root
 |-- instant: string (nullable = true)
 |-- dteday: string (nullable = true)
 |-- season: string (nullable = true)
 |-- yr: string (nullable = true)
 |-- mnth: string (nullable = true)
 |-- hr: string (nullable = true)
 |-- holiday: string (nullable = true)
 |-- weekday: string (nullable = true)
 |-- workingday: string (nullable = true)
 |-- weathersit: string (nullable = true)
 |-- temp: string (nullable = true)
 |-- atemp: string (nullable = true)
 |-- hum: string (nullable = true)
 |-- windspeed: string (nullable = true)
 |-- casual: string (nullable = true)
 |-- registered: string (nullable = true)
 |-- cnt: string (nullable = true)

rawdata.printSchema

root

|-- instant: string (nullable = true)

|-- dteday: string (nullable = true)

|-- season: string (nullable = true)

|-- yr: string (nullable = true)

|-- mnth: string (nullable = true)

|-- hr: string (nullable = true)

|-- holiday: string (nullable = true)

|-- weekday: string (nullable = true)

|-- workingday: string (nullable = true)

|-- weathersit: string (nullable = true)

|-- temp: string (nullable = true)

|-- atemp: string (nullable = true)

|-- hum: string (nullable = true)

|-- windspeed: string (nullable = true)

|-- casual: string (nullable = true)

|-- registered: string (nullable = true)

|-- cnt: string (nullable = true)

目前这些数据的字段都是字符型，后续需要转换为数值型。
查看主要字段的统计信息

rawdata.describe("dteday","holiday","weekday","temp").show()
+-------+----------+--------------------+-----------------+-------------------+
|summary|    dteday|             holiday|          weekday|               temp|
+-------+----------+--------------------+-----------------+-------------------+
|  count|     17379|               17379|            17379|              17379|
|   mean|      null|0.028770355026181024|3.003682605443351| 0.4969871684216586|
| stddev|      null|  0.1671652763843717|2.005771456110986|0.19255612124972202|
|    min|2011-01-01|                   0|                0|               0.02|
|    max|2012-12-31|                   1|                6|                  1|
+-------+----------+--------------------+-----------------+-------------------+

rawdata.describe("dteday","holiday","weekday","temp").show()

+-------+----------+--------------------+-----------------+-------------------+

+-------+----------+--------------------+-----------------+-------------------+

| count| 17379| 17379| 17379| 17379|

| mean| null|0.028770355026181024|3.003682605443351| 0.4969871684216586|

| stddev| null| 0.1671652763843717|2.005771456110986|0.19255612124972202|

| min|2011-01-01| 0| 0| 0.02|

| max|2012-12-31| 1| 6| 1|

+-------+----------+--------------------+-----------------+-------------------+

其中有很多字段是类型型，如果使用回归算法时，需要通过OneHotEncoder把数据转换为二元向量，对一些字段或特征进行规范化。
通过pyspark可以画出主要特征的重要程度：

图9-2 各特征的重要性

通过pyspark可以画出其中一些特征的分布情况：

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df=pd.read_csv('/home/hadoop/data/bike/hour.csv',header=0)
sns.set(style='whitegrid',context='notebook')
cols=['season','yr','temp','atemp','hum','windspeed','cnt']
sns.pairplot(df[cols],size=2.5)
plt.show()

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

df=pd.read_csv('/home/hadoop/data/bike/hour.csv',header=0)

sns.set(style='whitegrid',context='notebook')

cols=['season','yr','temp','atemp','hum','windspeed','cnt']

sns.pairplot(df[cols],size=2.5)

plt.show()

图9-3特征间的关系图

9.4 数据预处理

9.4.1 特征选择

首先把字符型的特征转换为数值类型，并过滤instant、dteday、casual、registered等4个无关或冗余特征。cnt特征作为标志。

val data1= rawdata.select(
rawdata("season").cast("Double"), 
rawdata("yr").cast("Double"), 
rawdata("mnth").cast("Double"), 
rawdata("hr").cast("Double"), 
rawdata("holiday").cast("Double"), 
rawdata("weekday").cast("Double"),
rawdata("workingday").cast("Double"),
rawdata("weathersit").cast("Double"),
rawdata("temp").cast("Double"),
rawdata("atemp").cast("Double"),
rawdata("hum").cast("Double"),
rawdata("windspeed").cast("Double"),	
rawdata("cnt").cast("Double").alias("label"))

val data1= rawdata.select(

rawdata("season").cast("Double"),

rawdata("yr").cast("Double"),

rawdata("mnth").cast("Double"),

rawdata("hr").cast("Double"),

rawdata("holiday").cast("Double"),

rawdata("weekday").cast("Double"),

rawdata("workingday").cast("Double"),

rawdata("weathersit").cast("Double"),

rawdata("temp").cast("Double"),

rawdata("atemp").cast("Double"),

rawdata("hum").cast("Double"),

rawdata("windspeed").cast("Double"),

rawdata("cnt").cast("Double").alias("label"))

生成一个存放以上预测特征的特征向量

val featuresArray =Array("season","yr","mnth","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed")

1	val featuresArray =Array("season","yr","mnth","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed")

把源数据组合成特征向量features

val assembler = new VectorAssembler().setInputCols(featuresArray).setOutputCol("features")

1	val assembler = new VectorAssembler().setInputCols(featuresArray).setOutputCol("features")

9.4.2 特征转换

使用决策树回归算法前，我们对类别特征进行索引化或数值化。

val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(24)
	对前8个类别字段或特征转换为二元向量。
val data2= new OneHotEncoder().setInputCol("season").setOutputCol("seasonVec")
val data3= new OneHotEncoder().setInputCol("yr").setOutputCol("yrVec")
val data4= new OneHotEncoder().setInputCol("mnth").setOutputCol("mnthVec")
val data5= new OneHotEncoder().setInputCol("hr").setOutputCol("hrVec")
val data6= new OneHotEncoder().setInputCol("holiday").setOutputCol("holidayVec")
val data7= new OneHotEncoder().setInputCol("weekday").setOutputCol("weekdayVec")
val data8= new OneHotEncoder().setInputCol("workingday").setOutputCol("workingdayVec")
val data9= new OneHotEncoder().setInputCol("weathersit").setOutputCol("weathersitVec")

val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(24)

对前8个类别字段或特征转换为二元向量。

val data2= new OneHotEncoder().setInputCol("season").setOutputCol("seasonVec")

val data3= new OneHotEncoder().setInputCol("yr").setOutputCol("yrVec")

val data4= new OneHotEncoder().setInputCol("mnth").setOutputCol("mnthVec")

val data5= new OneHotEncoder().setInputCol("hr").setOutputCol("hrVec")

val data6= new OneHotEncoder().setInputCol("holiday").setOutputCol("holidayVec")

val data7= new OneHotEncoder().setInputCol("weekday").setOutputCol("weekdayVec")

val data8= new OneHotEncoder().setInputCol("workingday").setOutputCol("workingdayVec")

val data9= new OneHotEncoder().setInputCol("weathersit").setOutputCol("weathersitVec")

因OneHotEncoder不是Estimator，这里我们对采用回归算法的数据另外进行处理，先建立一个流水线，把以上转换组装到这个流水线上。

val pipeline_en = new Pipeline().setStages(Array(data2,data3,data4,data5,data6,data7,data8,data9))
val data_lr = pipeline_en.fit(data1).transform(data1)

1 2	val pipeline_en = new Pipeline().setStages(Array(data2,data3,data4,data5,data6,data7,data8,data9)) val data_lr = pipeline_en.fit(data1).transform(data1)

把原来的4个及转换后的8个二元特征向量，拼接成一个feature向量。

val assembler_lr = new VectorAssembler().setInputCols(Array("seasonVec","yrVec","mnthVec","hrVec", "holidayVec","weekdayVec","workingdayVec","weathersitVec","temp","atemp","hum","windspeed")).setOutputCol("features_lr")

1	val assembler_lr = new VectorAssembler().setInputCols(Array("seasonVec","yrVec","mnthVec","hrVec", "holidayVec","weekdayVec","workingdayVec","weathersitVec","temp","atemp","hum","windspeed")).setOutputCol("features_lr")

9.5 组装

1）将data1数据分为训练和测试集（30%进行测试，种子设为12）：

//对data1数据集进行随机划分，这份数据用于决策模型
val Array(trainingData, testData) = data1.randomSplit(Array(0.7, 0.3),12)

//对data2数据集进行随机划分，这份数据用于回归模型
val Array(trainingData_lr, testData_lr) = data_lr.randomSplit(Array(0.7, 0.3),12)

//对data1数据集进行随机划分，这份数据用于决策模型

val Array(trainingData, testData) = data1.randomSplit(Array(0.7, 0.3),12)

//对data2数据集进行随机划分，这份数据用于回归模型

val Array(trainingData_lr, testData_lr) = data_lr.randomSplit(Array(0.7, 0.3),12)

2）设置决策树回归模型参数

val dt = new DecisionTreeRegressor()
.setLabelCol("label")
.setFeaturesCol("indexedFeatures")
.setMaxBins(64)
.setMaxDepth(15)

val dt = new DecisionTreeRegressor()

.setLabelCol("label")

.setFeaturesCol("indexedFeatures")

.setMaxBins(64)

.setMaxDepth(15)

3）设置线性回归模型的参数

val lr =new LinearRegression()
.setFeaturesCol("features_lr")
.setLabelCol("label")
.setFitIntercept(true)
.setMaxIter(20)
.setRegParam(0.3)
.setElasticNetParam(0.8)

val lr =new LinearRegression()

.setFeaturesCol("features_lr")

.setLabelCol("label")

.setFitIntercept(true)

.setMaxIter(20)

.setRegParam(0.3)

.setElasticNetParam(0.8)

4）把决策树回归模型涉及的特征转换及模型训练组装在一个流水线上。

val pipeline = new Pipeline().setStages(Array(assembler,featureIndexer, dt))

1	val pipeline = new Pipeline().setStages(Array(assembler,featureIndexer, dt))

5）把线性回归模型涉及的特征转换、模型训练组装载一个流水上线。

val pipeline_lr= new Pipeline().setStages(Array(assembler_lr,lr))

1	val pipeline_lr= new Pipeline().setStages(Array(assembler_lr,lr))

6）训练模型

//训练决策树回归模型
val model = pipeline.fit(trainingData)
//训练线性回归模型
val lrModel = pipeline_lr.fit(trainingData_lr)

//训练决策树回归模型

val model = pipeline.fit(trainingData)

//训练线性回归模型

val lrModel = pipeline_lr.fit(trainingData_lr)

7）作出预测

//预测决策树回归的值
val predictions = model.transform(testData)
//预测线性回归模型的值
val predictions_lr = lrModel.transform(testData_lr)

//预测决策树回归的值

val predictions = model.transform(testData)

//预测线性回归模型的值

val predictions_lr = lrModel.transform(testData_lr)

8）评估模型

RegressionEvaluator.setMetricName可以定义四种评估器：rmse(缺省)、 mse、r^2、mae。
val evaluator =new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("rmse")
//决策树模型评估指标
val rmse = evaluator.evaluate(predictions)
//rmse: Double = 61.62409114645229
val rmse_lr = evaluator.evaluate(predictions_lr)
// rmse_lr: Double = 102.05406408259029

RegressionEvaluator.setMetricName可以定义四种评估器：rmse(缺省)、 mse、r^2、mae。

val evaluator =new RegressionEvaluator()

.setLabelCol("label")

.setPredictionCol("prediction")

.setMetricName("rmse")

//决策树模型评估指标

val rmse = evaluator.evaluate(predictions)

//rmse: Double = 61.62409114645229

val rmse_lr = evaluator.evaluate(predictions_lr)

// rmse_lr: Double = 102.05406408259029

从以上使用不同模型情况看来，决策树性能稍好与线性回归，但这仅是粗糙的比较，下面使用模型选择中介绍的一些方法，对线性模型进行优化。

9.6 模型优化

从图9-3可知，temp特征与atemp特征线性相关，而且从图9-2可知，atemp的贡献度较小，所以我们将过滤该特征。

val assembler_lr1 = new VectorAssembler().setInputCols(Array("seasonVec","yrVec","mnthVec","hrVec", "holidayVec","weekdayVec","workingdayVec","weathersitVec","temp","hum","windspeed")).setOutputCol("features_lr1")

1	val assembler_lr1 = new VectorAssembler().setInputCols(Array("seasonVec","yrVec","mnthVec","hrVec", "holidayVec","weekdayVec","workingdayVec","weathersitVec","temp","hum","windspeed")).setOutputCol("features_lr1")

对label标签特征进行转换，使其更接近正态分布，这里我们SQLTransformer转换器，其具体使用可参考第4章。

//导入需要的包
import org.apache.spark.ml.feature.SQLTransformer
//对特征label进行SQRT运行
val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, SQRT(label) as label1 FROM __THIS__")

//导入需要的包

import org.apache.spark.ml.feature.SQLTransformer

//对特征label进行SQRT运行

val sqlTrans = new SQLTransformer().setStatement(

"SELECT *, SQRT(label) as label1 FROM __THIS__")

这里我们利用训练验证划分法对线性回归模型进行优化，对参数进行网格化，将数据集划分为训练集、验证集和测试集。
1）导入需要用到的包。

import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

1	import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

2）建立模型，预测label1的值，设置线性回归参数。

val lr1 = new LinearRegression()
.setFeaturesCol("features_lr1")
.setLabelCol("label1")
.setFitIntercept(true)

val lr1 = new LinearRegression()

.setFeaturesCol("features_lr1")

.setLabelCol("label1")

.setFitIntercept(true)

3）设置流水线，为便于把特征组合、特征值优化、模型训练等任务组装到这条流水线上。

val pipeline_lr1 = new Pipeline().setStages(Array(assembler_lr1,sqlTrans,lr1))

1	val pipeline_lr1 = new Pipeline().setStages(Array(assembler_lr1,sqlTrans,lr1))

4）建立参数网格。

val paramGrid = new ParamGridBuilder()
.addGrid(lr1.elasticNetParam, Array(0.0, 0.8, 1.0))
.addGrid(lr1.regParam,Array(0.1,0.3,0.5))
.addGrid(lr1.maxIter, Array(20, 30))
.build()

val paramGrid = new ParamGridBuilder()

.addGrid(lr1.elasticNetParam, Array(0.0, 0.8, 1.0))

.addGrid(lr1.regParam,Array(0.1,0.3,0.5))

.addGrid(lr1.maxIter, Array(20, 30))

.build()

5）选择(prediction, label1)，计算测试误差。

val evaluator_lr1 =new RegressionEvaluator()
.setLabelCol("label1")
.setPredictionCol("prediction")
.setMetricName("rmse")
//利用交叉验证方法
val trainValidationSplit = new TrainValidationSplit()
  .setEstimator(pipeline_lr1)
  .setEvaluator(evaluator_lr1)
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)

val evaluator_lr1 =new RegressionEvaluator()

.setLabelCol("label1")

.setPredictionCol("prediction")

.setMetricName("rmse")

//利用交叉验证方法

val trainValidationSplit = new TrainValidationSplit()

.setEstimator(pipeline_lr1)

.setEvaluator(evaluator_lr1)

.setEstimatorParamMaps(paramGrid)

.setTrainRatio(0.8)

6）训练模型并自动选择最优参数。

val lrModel1 = trainValidationSplit.fit(trainingData_lr)

1	val lrModel1 = trainValidationSplit.fit(trainingData_lr)

7）查看模型全部参数

lrModel1.getEstimatorParamMaps.foreach { println }   //参数组合
lrModel1.getEvaluator.extractParamMap()  //查看评估参数
lrModel1.getEvaluator.isLargerBetter

lrModel1.getEstimatorParamMaps.foreach { println } //参数组合

lrModel1.getEvaluator.extractParamMap() //查看评估参数

lrModel1.getEvaluator.isLargerBetter

8）用最好的参数组合，做出预测。

val predictions_lr1 = lrModel1.transform(testData_lr)
val rmse_lr1 = evaluator_lr1.evaluate(predictions_lr1)
//rmse_lr1: Double = 3.1354674045018514
//显示转换后特征值的前5行信息
predictions_lr1.select("features_lr1","label","label1","prediction").show(5)
//结果显示如下：
+--------------------+-----+------------------+------------------+
|        features_lr1|label|            label1|        prediction|
+--------------------+-----+------------------+------------------+
|(55,[1,4,6,17,40,...| 39.0| 6.244997998398398| 2.544732781830004|
|(55,[1,4,6,17,40,...|  7.0|2.6457513110645907|1.1823933720401953|
|(55,[1,4,6,17,40,...|  5.0|  2.23606797749979|1.3641560005748419|
|(55,[1,4,6,17,40,...|  7.0|2.6457513110645907|1.7674507231166494|
|(55,[1,4,6,17,40,...| 12.0|3.4641016151377544|1.5020350291356124|

val predictions_lr1 = lrModel1.transform(testData_lr)

val rmse_lr1 = evaluator_lr1.evaluate(predictions_lr1)

//rmse_lr1: Double = 3.1354674045018514

//显示转换后特征值的前5行信息

predictions_lr1.select("features_lr1","label","label1","prediction").show(5)

//结果显示如下：

+--------------------+-----+------------------+------------------+

+--------------------+-----+------------------+------------------+

|(55,[1,4,6,17,40,...| 39.0| 6.244997998398398| 2.544732781830004|

|(55,[1,4,6,17,40,...| 7.0|2.6457513110645907|1.1823933720401953|

|(55,[1,4,6,17,40,...| 5.0| 2.23606797749979|1.3641560005748419|

|(55,[1,4,6,17,40,...| 7.0|2.6457513110645907|1.7674507231166494|

|(55,[1,4,6,17,40,...| 12.0|3.4641016151377544|1.5020350291356124|

看了对标签特征进行转换、利用网格参数及训练验证划分等优化方法，从102下降到3左右，效果比较明显。

9.7 小结

本章主要介绍Spark ML的线性回归模型、决策树回归模型，对共享单车的租赁信息进行预测，由于很多数据不规范，因此，对原数据进行了二元向量转换、对类别数据索引化，然后把这些转换组装到流水线上，在训练集上训练模型，在测试集上进行预测，最后，更加评估指标对模型进行优化。

第8章构建Spark ML 分类模型

在上一章中，我们通过实例介绍了Spark中基于协同过滤的推荐模型，了解了推荐模型的原理以及场景、使用流水线组装任务，使用自定义函数优化模型等。这一章我们将就Spark中分类模型为例，进一步说明如何使用Spark ML中特征选取、特征转换、流水线、模型选择或优化等方法，简化、规范化、流程化整个机器学习过程。
分类、回归和聚类是机器学习中重要的几个分支，也是日常数据处理与分析中最常用的手段。这几类的算法有着较高的成熟度，原理也较容易理解，且有着不错的效果，深受数据分析师们的喜爱。本章以Spark ML分类模型为例，主要包括以下内容：
 简介用于分类的几种常用算法
 加载数据
 探索加载后的数据
 预处理数据
 把各种任务组装到流水线上
 模型调优

8.1分类模型简介

8.1.1线性模型

8.1.2 决策树模型

决策树模型是一个强大的非概率模型，可以用来表示复杂的非线性模式和特征的相互关系。

8.1.3 朴素贝叶斯模型

关于朴素贝叶斯详细的原理，在维基百科中有更为详细的数学公式解释：http://en.wikipedia.org/wiki/Naive_Bayes_classifier。

8.2数据加载

存放路径在 /home/hadoop/data/train.tsv。
数据集下载
先使用shell命令对数据进行试探性的查看，并做一些简单的数据处理。
1) 查看前2行数据

$ head -2 train.tsv

"url" "urlid" "boilerplate"	"alchemy_category"	"alchemy_category_score"	"avglinksize"	"commonlinkratio_1"	"commonlinkratio_2"	"commonlinkratio_3"	"commonlinkratio_4"	"compression_ratio"	"embed_ratio"	"framebased"	"frameTagRatio"	"hasDomainLink"	"html_ratio"	"image_ratio"	"is_news"	"lengthyLinkDomain"	"linkwordscore"	"news_front_page"   "non_markup_alphanum_characters"	"numberOfLinks"	"numwords_in_url"	"parametrizedLinkRatio"	"spelling_errors_ratio"	"label"
"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"	"4042" "{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose Cali	"8"	.............. "0.152941176" "0.079129575"	"0"

$ head -2 train.tsv

数据集中的第1行为标题（字段名）行，下面是一些的字段说明。
2) 查看文件记录总数

$ cat train.tsv |wc -l
7396

1 2	$ cat train.tsv \|wc -l 7396

结果显示共有：数据集一共有7396条数据
3) 由于textFile目前不好过滤标题行数据，为便于spark操作数据，需要先删除标题。

$ sed  1d train.tsv >train_noheader.tsv

1	$ sed 1d train.tsv >train_noheader.tsv

4) 将数据文件上传到 hdfs

$ hdfs dfs -put train_noheader.tsv /data

1	$ hdfs dfs -put train_noheader.tsv /data

5) 查看是否上成功

hadoop@master:~/data$ hdfs dfs -ls /data
17/05/24 00:46:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 hadoop supergroup   21972457 2017-05-24 00:46 /data/train_noheader.tsv

hadoop@master:~/data$ hdfs dfs -ls /data

17/05/24 00:46:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Found 1 items

-rw-r--r-- 1 hadoop supergroup 21972457 2017-05-24 00:46 /data/train_noheader.tsv

6) 启动Spark Shell

spark-shell --master spark://master:7077 --driver-memory 2G --total-executor-cores 2

1	spark-shell --master spark://master:7077 --driver-memory 2G --total-executor-cores 2

7) 通过sc对象的textFile方法，由本地文件数据创建RDD

scala> val rawData=sc.textFile("hdfs://master:9000/data/train_noheader.tsv")
rawData: org.apache.spark.rdd.RDD[String] = hdfs://master:9000/data/train_noheader.tsv MapPartitionsRDD[1] at textFile at <console>:24

1 2	scala> val rawData=sc.textFile("hdfs://master:9000/data/train_noheader.tsv") rawData: org.apache.spark.rdd.RDD[String] = hdfs://master:9000/data/train_noheader.tsv MapPartitionsRDD[1] at textFile at <console>:24

8.3数据探索

1) 查看数据前2行

scala> rawData.take(2)
res0: Array[String] = Array("http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"	"4042"	"{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees in...

scala> rawData.take(2)

res0: Array[String] = Array("http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html" "4042" "{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees in...

由上面可以看到，得到的是只有一行字符串数组。通过常看源文件，我们可以发现字段间由制表符(\t)分割。由于后续的算法我们不需要时间戳以及网页的内容，所以这里先将其过滤掉。下面我们获取每个属性。
2) 根据以上分析，对数据进行处理，并生成新的RDD

scala> val records = rawData.map(line => line.split("\t"))
records: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26

1 2	scala> val records = rawData.map(line => line.split("\t")) records: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26

3) 查看数据结构

scala> records.first
res4: Array[String] = Array("http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html", "4042", "{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees ...

scala> records.first

res4: Array[String] = Array("http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html", "4042", "{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees ...

4) 查看总的数据行数

scala> records.count
res5: Long = 7395

1 2	scala> records.count res5: Long = 7395

5) 查看每一行数据的列数

scala> records.first.size
res6: Int = 27

1 2	scala> records.first.size res6: Int = 27

6) 获取第一行的某个值

scala> records.first.take(2)
res22: Array[String] = Array("http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html", "4042")

1 2	scala> records.first.take(2) res22: Array[String] = Array("http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html", "4042")

8.4数据预处理

1) 导入LabeledPoint

scala> import org.apache.spark.ml.feature.LabeledPoint

1	scala> import org.apache.spark.ml.feature.LabeledPoint

2) 导入Vectors矢量方法

scala> import org.apache.spark.ml.linalg.Vectors

1	scala> import org.apache.spark.ml.linalg.Vectors

3) 对数据进行1-4步的数据清洗工作

val data = records.map { r =>
         val trimmed = r.map(_.replaceAll("\"", ""))
         val label = trimmed(r.size - 1).toInt
         val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
         LabeledPoint(label, Vectors.dense(features))
}

val data = records.map { r =>

val trimmed = r.map(_.replaceAll("\"", ""))

val label = trimmed(r.size - 1).toInt

val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)

LabeledPoint(label, Vectors.dense(features))

}

上述代码可通过复制粘贴到代码行中，使用 :paste ，粘贴过后按下 Ctrl+D 即可。
4) 考虑到使用朴素贝叶斯算法时，数据需不小于0，故需要做些处理。

val nbData = records.map { r =>
         val trimmed = r.map(_.replaceAll("\"", ""))
         val label = trimmed(r.size - 1).toInt
         val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble).map(d => if (d < 0) 0.0 else d)
         LabeledPoint(label, Vectors.dense(features))
}

val nbData = records.map { r =>

val trimmed = r.map(_.replaceAll("\"", ""))

val label = trimmed(r.size - 1).toInt

val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble).map(d => if (d < 0) 0.0 else d)

LabeledPoint(label, Vectors.dense(features))

}

5) 查看清理后数据集的前2行数据

scala> data.take(2)
res0: Array[org.apache.spark.ml.feature.LabeledPoint] = Array((0.0,[0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575]), (1.0,[0.574147,3.677966102,0.50802139,0.288770053,0.213903743,0.144385027,0.468648998,0.0,0.0,0.098707403,0.0,0.203489628,0.088652482,1.0,1.0,40.0,0.0,4973.0,187.0,9.0,0.181818182,0.125448029]))

scala> data.take(2)

res0: Array[org.apache.spark.ml.feature.LabeledPoint] = Array((0.0,[0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575]), (1.0,[0.574147,3.677966102,0.50802139,0.288770053,0.213903743,0.144385027,0.468648998,0.0,0.0,0.098707403,0.0,0.203489628,0.088652482,1.0,1.0,40.0,0.0,4973.0,187.0,9.0,0.181818182,0.125448029]))

6) 通过RDD创建DataFrame

scala> val df = spark.createDataFrame(data)
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val nbDF = spark.createDataFrame(nbData)
nbDF: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val df = spark.createDataFrame(data)

df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val nbDF = spark.createDataFrame(nbData)

nbDF: org.apache.spark.sql.DataFrame = [label: double, features: vector]

7) 查看df和nbDF的数据

scala> df.show(10) 	// 查看df的前10行数据
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|[0.789131,2.05555...|
|  1.0|[0.574147,3.67796...|
|  1.0|[0.996526,2.38288...|
|  1.0|[0.801248,1.54310...|
|  0.0|[0.719157,2.67647...|
|  0.0|[0.0,119.0,0.7454...|
|  1.0|[0.22111,0.773809...|
|  0.0|[0.0,1.883333333,...|
|  1.0|[0.0,0.471502591,...|
|  1.0|[0.0,2.41011236,0...|
+-----+--------------------+
only showing top 10 rows

// 查看nbDF的第一行数据,或者使用nbDF.first也是一样的
scala> nbDF.head  
res21: org.apache.spark.sql.Row = [0.0,[0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575]]

scala> df.show(10) // 查看df的前10行数据

+-----+--------------------+

|label| features|

+-----+--------------------+

| 0.0|[0.789131,2.05555...|

| 1.0|[0.574147,3.67796...|

| 1.0|[0.996526,2.38288...|

| 1.0|[0.801248,1.54310...|

| 0.0|[0.719157,2.67647...|

| 0.0|[0.0,119.0,0.7454...|

| 1.0|[0.22111,0.773809...|

| 0.0|[0.0,1.883333333,...|

| 1.0|[0.0,0.471502591,...|

| 1.0|[0.0,2.41011236,0...|

+-----+--------------------+

only showing top 10 rows

// 查看nbDF的第一行数据,或者使用nbDF.first也是一样的

scala> nbDF.head

res21: org.apache.spark.sql.Row = [0.0,[0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575]]

8) 查看df和nbDF的Schema的信息和数据总行数

scala> df.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

scala> df.count
res4: Long = 7395

scala> nbDF.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

scala> nbDF.count
res24: Long = 7395

scala> df.printSchema

root

|-- label: double (nullable = true)

|-- features: vector (nullable = true)

scala> df.count

res4: Long = 7395

scala> nbDF.printSchema

root

|-- label: double (nullable = true)

|-- features: vector (nullable = true)

scala> nbDF.count

res24: Long = 7395

9) 随机地将数据进行划分，80%用于训练集，20%用于测试集

scala>  val Array(trainingData, testData) = df.randomSplit(Array(0.8, 0.2), seed = 1234L)
trainingData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
testData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]

scala> val Array(nbTrainingData, nbTestData) = nbDF.randomSplit(Array(0.8, 0.2), seed = 1234L)
nbTrainingData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
nbTestData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]

scala> val Array(trainingData, testData) = df.randomSplit(Array(0.8, 0.2), seed = 1234L)

trainingData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]

testData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]

scala> val Array(nbTrainingData, nbTestData) = nbDF.randomSplit(Array(0.8, 0.2), seed = 1234L)

nbTrainingData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]

nbTestData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]

10) 查看训练数据和测试数据的总行数

scala> trainingData.count
res8: Long = 5912

scala> testData.count
res9: Long = 1483

scala> trainingData.count

res8: Long = 5912

scala> testData.count

res9: Long = 1483

11) 由于后续使用网格参数和交叉验证的时候，需要多次使用到训练集和测试集，所以将这两者载入内存，可大大提高性能。

scala> trainingData.cache
res10: trainingData.type = MapPartitionsRDD[24] at randomSplit at <console>:35

scala> testData.cache
res11: testData.type = MapPartitionsRDD[25] at randomSplit at <console>:35

scala> nbTrainingData.cache
res27: nbTrainingData.type = [label: double, features: vector]

scala> nbTestData.cache
res28: nbTestData.type = [label: double, features: vector]

scala> trainingData.cache

res10: trainingData.type = MapPartitionsRDD[24] at randomSplit at <console>:35

scala> testData.cache

res11: testData.type = MapPartitionsRDD[25] at randomSplit at <console>:35

scala> nbTrainingData.cache

res27: nbTrainingData.type = [label: double, features: vector]

scala> nbTestData.cache

res28: nbTestData.type = [label: double, features: vector]

12) 导入逻辑回归分类器、决策树模型以及朴素贝叶斯模型

scala> import org.apache.spark.ml.classification.{LogisticRegression,LogisticRegressionModel}

scala> import org.apache.spark.ml.classification.{NaiveBayes,NaiveBayesModel}

scala> import org.apache.spark.ml.classification.{DecisionTreeClassifier,DecisionTreeClassificationModel}

scala> import org.apache.spark.ml.classification.{LogisticRegression,LogisticRegressionModel}

scala> import org.apache.spark.ml.classification.{NaiveBayes,NaiveBayesModel}

scala> import org.apache.spark.ml.classification.{DecisionTreeClassifier,DecisionTreeClassificationModel}

13) 创建贝叶斯模型，设置初始参数

scala> val nb = new NaiveBayes().setLabelCol("label").setFeaturesCol("features")
nbModel: org.apache.spark.ml.classification.NaiveBayes = nb_050f7aa0718e

1 2	scala> val nb = new NaiveBayes().setLabelCol("label").setFeaturesCol("features") nbModel: org.apache.spark.ml.classification.NaiveBayes = nb_050f7aa0718e

14) 通过朴素贝叶斯训练模型，对测试数据进行预测

//训练数据
scala> val nbModel = nb.fit(nbTrainingData)
nbModel: org.apache.spark.ml.classification.NaiveBayesModel = NaiveBayesModel (uid=nb_63013179fe1f) with 2 classes

// 预测数据
scala> val nbPrediction = nbModel.transform(nbTestData)
nbPrediction: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 3 more fields]

scala> nbPrediction.show(10)
+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|[0.0,0.0,0.0,0.0,...|[-46.748642409367...|[0.83980023102260...|       0.0|
|  0.0|[0.0,0.0,0.0,0.0,...|[-28.455469606463...|[0.80560314910869...|       0.0|
|  0.0|[0.0,0.0,0.0,0.0,...|[-21.419660085849...|[0.74277974559336...|       0.0|
|  0.0|[0.0,0.253731343,...|[-566.66641956697...|[1.0,1.7019208090...|       0.0|
|  0.0|[0.0,0.5,0.0,0.0,...|[-85.270246662200...|[0.83409619579026...|       0.0|
|  0.0|[0.0,0.5,0.0,0.0,...|[-109.88609079237...|[0.94215720717655...|       0.0|
|  0.0|[0.0,0.563636364,...|[-645.84504343631...|[3.97514834896843...|       1.0|
|  0.0|[0.0,0.590163934,...|[-2040.0838024687...|[0.99999724227148...|       0.0|
|  0.0|[0.0,0.677966102,...|[-432.36145227604...|[0.99888109472905...|       0.0|
|  0.0|[0.0,0.7,0.111111...|[-222.48044531991...|[0.99999999305131...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 10 rows

//训练数据

scala> val nbModel = nb.fit(nbTrainingData)

nbModel: org.apache.spark.ml.classification.NaiveBayesModel = NaiveBayesModel (uid=nb_63013179fe1f) with 2 classes

// 预测数据

scala> val nbPrediction = nbModel.transform(nbTestData)

nbPrediction: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 3 more fields]

scala> nbPrediction.show(10)

+-----+--------------------+--------------------+--------------------+----------+

+-----+--------------------+--------------------+--------------------+----------+

| 0.0|[0.0,0.0,0.0,0.0,...|[-46.748642409367...|[0.83980023102260...| 0.0|

| 0.0|[0.0,0.0,0.0,0.0,...|[-28.455469606463...|[0.80560314910869...| 0.0|

| 0.0|[0.0,0.0,0.0,0.0,...|[-21.419660085849...|[0.74277974559336...| 0.0|

| 0.0|[0.0,0.253731343,...|[-566.66641956697...|[1.0,1.7019208090...| 0.0|

| 0.0|[0.0,0.5,0.0,0.0,...|[-85.270246662200...|[0.83409619579026...| 0.0|

| 0.0|[0.0,0.5,0.0,0.0,...|[-109.88609079237...|[0.94215720717655...| 0.0|

| 0.0|[0.0,0.563636364,...|[-645.84504343631...|[3.97514834896843...| 1.0|

| 0.0|[0.0,0.590163934,...|[-2040.0838024687...|[0.99999724227148...| 0.0|

| 0.0|[0.0,0.677966102,...|[-432.36145227604...|[0.99888109472905...| 0.0|

| 0.0|[0.0,0.7,0.111111...|[-222.48044531991...|[0.99999999305131...| 0.0|

+-----+--------------------+--------------------+--------------------+----------+

only showing top 10 rows

15) 朴素贝叶斯准确性统计

//t1 存放预测值的数组,t2存放测试数据标签值
// t3存放测试数据总行数
scala>val (t1, t2, t3) = (nbPrediction.select("prediction").collect, nbTestData.select("label").collect,nbTestData.count.toInt)

// t4 为累加器
scala> var t4 = 0
t4: Int = 0

// 遍历循环，统计正确预测的次数
scala> for(i <- 0 to t3-1){if(t1(i)==t2(i)) t4+=1}

// 查看预测正确的个数
scala> t4
res63: Int = 840

// 计算准确率
scala> val nbAccuracy = 1.0*t4/t3
nbAccuracy: Double = 0.5664194200944033

//t1 存放预测值的数组,t2存放测试数据标签值

// t3存放测试数据总行数

scala>val (t1, t2, t3) = (nbPrediction.select("prediction").collect, nbTestData.select("label").collect,nbTestData.count.toInt)

// t4 为累加器

scala> var t4 = 0

t4: Int = 0

// 遍历循环，统计正确预测的次数

scala> for(i <- 0 to t3-1){if(t1(i)==t2(i)) t4+=1}

// 查看预测正确的个数

scala> t4

res63: Int = 840

// 计算准确率

scala> val nbAccuracy = 1.0*t4/t3

nbAccuracy: Double = 0.5664194200944033

可以看到，朴素贝叶斯的准确率为56.6419%。

8.5组装

1) 导入特征索引类

scala>import org.apache.spark.ml.feature.{ VectorIndexer, VectorIndexerModel}

1	scala>import org.apache.spark.ml.feature.{ VectorIndexer, VectorIndexerModel}

2) 建立特征索引

scala> val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").fit(df)
featureIndexer: org.apache.spark.ml.feature.VectorIndexerModel = vecIdx_b73ca1435eea

1 2	scala> val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").fit(df) featureIndexer: org.apache.spark.ml.feature.VectorIndexerModel = vecIdx_b73ca1435eea

3) 创建逻辑回归模型

scala> val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("indexedFeatures").setMaxIter(10).setRegParam(0.001)
lrModel: org.apache.spark.ml.classification.LogisticRegression = logreg_9bec21f2262f

1 2	scala> val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("indexedFeatures").setMaxIter(10).setRegParam(0.001) lrModel: org.apache.spark.ml.classification.LogisticRegression = logreg_9bec21f2262f

4) 创建决策树模型

scala> val dt = new DecisionTreeClassifier().setLabelCol("label").setFeaturesCol("indexedFeatures").setImpurity("entropy").setMaxBins(100).setMaxDepth(5).setMinInfoGain(0.01)
dtModel: org.apache.spark.ml.classification.DecisionTreeClassifier = dtc_8a3a01185f6b

1 2	scala> val dt = new DecisionTreeClassifier().setLabelCol("label").setFeaturesCol("indexedFeatures").setImpurity("entropy").setMaxBins(100).setMaxDepth(5).setMinInfoGain(0.01) dtModel: org.apache.spark.ml.classification.DecisionTreeClassifier = dtc_8a3a01185f6b

5) 导入网格参数和交叉验证

scala> import org.apache.spark.ml.tuning.{ ParamGridBuilder, CrossValidator }

1	scala> import org.apache.spark.ml.tuning.{ ParamGridBuilder, CrossValidator }

6) 导入流水线

import org.apache.spark.ml.{Pipeline,PipelineModel}

1	import org.apache.spark.ml.{Pipeline,PipelineModel}

7) 导入评估器

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

1	import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

8) 配置2个流水线：一个是逻辑回归的流水线，包含2个stages（ featureIndexer和lr）；
一个是决策树回归的流水线，包含2个stages（ featureIndexer 和 dt）。

scala> val lrPipeline = new Pipeline().setStages(Array(featureIndexer,lr))
lrPipeline: org.apache.spark.ml.Pipeline = pipeline_64c542dff42e
scala> val dtPipeline = new Pipeline().setStages(Array(featureIndexer,dt))
dtPipeline: org.apache.spark.ml.Pipeline = pipeline_b9ed2ccc2108

scala> val lrPipeline = new Pipeline().setStages(Array(featureIndexer,lr))

lrPipeline: org.apache.spark.ml.Pipeline = pipeline_64c542dff42e

scala> val dtPipeline = new Pipeline().setStages(Array(featureIndexer,dt))

dtPipeline: org.apache.spark.ml.Pipeline = pipeline_b9ed2ccc2108

8.6模型优化

1) 分别配置网格参数，使用ParamGridBuilder构造一个parameter grid

scala> :paste
val lrParamGrid = new ParamGridBuilder()
.addGrid(lr.regParam,Array(0.1,0.3,0.5))
.addGrid(lr.maxIter, Array(10,20,30))
  .build()

scala> :paste
val dtParamGrid = new ParamGridBuilder()
.addGrid(dt.maxDepth, Array(3,5,7))
  .build()

scala> :paste

val lrParamGrid = new ParamGridBuilder()

.addGrid(lr.regParam,Array(0.1,0.3,0.5))

.addGrid(lr.maxIter, Array(10,20,30))

.build()

scala> :paste

val dtParamGrid = new ParamGridBuilder()

.addGrid(dt.maxDepth, Array(3,5,7))

.build()

2) 分别实例化交叉验证模型

val evaluator = new BinaryClassificationEvaluator()

scala> :paste
val lrCV = new CrossValidator()
  .setEstimator(lrPipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(lrParamGrid)
  .setNumFolds(2)
lrCV: org.apache.spark.ml.tuning.CrossValidator = cv_b25c7e0f1be7

scala> :paste
val dtCV = new CrossValidator()
  .setEstimator(dtPipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(dtParamGrid)
  .setNumFolds(2)

dtCV: org.apache.spark.ml.tuning.CrossValidator = cv_5176e642601d

val evaluator = new BinaryClassificationEvaluator()

scala> :paste

val lrCV = new CrossValidator()

.setEstimator(lrPipeline)

.setEvaluator(evaluator)

.setEstimatorParamMaps(lrParamGrid)

.setNumFolds(2)

lrCV: org.apache.spark.ml.tuning.CrossValidator = cv_b25c7e0f1be7

scala> :paste

val dtCV = new CrossValidator()

.setEstimator(dtPipeline)

.setEvaluator(evaluator)

.setEstimatorParamMaps(dtParamGrid)

.setNumFolds(2)

dtCV: org.apache.spark.ml.tuning.CrossValidator = cv_5176e642601d

3) 通过交叉验证模型，获取最优参数集，并测试模型

scala>val lrCvModel = lrCV.fit(trainingData)
lrCvModel: org.apache.spark.ml.tuning.CrossValidatorModel = cv_b25c7e0f1be7
scala> val dtCvModel = dtCV.fit(trainingData)
dtCvModel: org.apache.spark.ml.tuning.CrossValidatorModel = cv_5176e642601d

scala> val lrPrediction = lrCvModel.transform(testData)
lrPrediction: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 4 more fields]

scala> val dtPrediction = dtCvModel.transform(testData)
dtPrediction: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 4 more fields]

scala>val lrCvModel = lrCV.fit(trainingData)

lrCvModel: org.apache.spark.ml.tuning.CrossValidatorModel = cv_b25c7e0f1be7

scala> val dtCvModel = dtCV.fit(trainingData)

dtCvModel: org.apache.spark.ml.tuning.CrossValidatorModel = cv_5176e642601d

scala> val lrPrediction = lrCvModel.transform(testData)

lrPrediction: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 4 more fields]

scala> val dtPrediction = dtCvModel.transform(testData)

dtPrediction: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 4 more fields]

4) 查看数据

scala> lrPrediction.select("label","prediction").show(10)
+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       1.0|
|  0.0|       1.0|
|  0.0|       1.0|
|  0.0|       1.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 10 rows

scala> dtPrediction.select("label","prediction").show(10)
+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       1.0|
|  0.0|       1.0|
|  0.0|       1.0|
|  0.0|       0.0|
+-----+----------+
only showing top 10 rows

scala> lrPrediction.select("label","prediction").show(10)

+-----+----------+

|label|prediction|

+-----+----------+

| 0.0| 0.0|

| 0.0| 1.0|

| 0.0| 0.0|

+-----+----------+

only showing top 10 rows

scala> dtPrediction.select("label","prediction").show(10)

+-----+----------+

|label|prediction|

+-----+----------+

| 0.0| 0.0|

| 0.0| 1.0|

| 0.0| 0.0|

+-----+----------+

only showing top 10 rows

5) 查看逻辑回归匹配模型的参数

scala> val lrBestModel = lrCvModel.bestModel.asInstanceOf[PipelineModel]
lrBestModel: org.apache.spark.ml.PipelineModel = pipeline_64c542dff42e
scala> val lrModel = lrBestModel.stages(1).asInstanceOf[LogisticRegressionModel]
lrModel: org.apache.spark.ml.classification.LogisticRegressionModel = logreg_100994a23a48

scala> lrModel.getRegParam
res20: Double = 0.1

scala> lrModel.getMaxIter
res21: Int = 20

scala> val lrBestModel = lrCvModel.bestModel.asInstanceOf[PipelineModel]

lrBestModel: org.apache.spark.ml.PipelineModel = pipeline_64c542dff42e

scala> val lrModel = lrBestModel.stages(1).asInstanceOf[LogisticRegressionModel]

lrModel: org.apache.spark.ml.classification.LogisticRegressionModel = logreg_100994a23a48

scala> lrModel.getRegParam

res20: Double = 0.1

scala> lrModel.getMaxIter

res21: Int = 20

6) 查看决策树匹配模型的参数

scala> val dtBestModel = dtCvModel.bestModel.asInstanceOf[PipelineModel]
dtBestModel: org.apache.spark.ml.PipelineModel = pipeline_b9ed2ccc2108

scala> val dtModel = dtBestModel.stages(1).asInstanceOf[DecisionTreeClassificationModel]
dtModel: org.apache.spark.ml.classification.DecisionTreeClassificationModel = DecisionTreeClassificationModel (uid=dtc_b10fd9474309) of depth 4 with 17 nodes

scala> dtModel.getMaxDepth
res24: Int = 3

scala> dtModel.numFeatures
res25: Int = 22

scala> val dtBestModel = dtCvModel.bestModel.asInstanceOf[PipelineModel]

dtBestModel: org.apache.spark.ml.PipelineModel = pipeline_b9ed2ccc2108

scala> val dtModel = dtBestModel.stages(1).asInstanceOf[DecisionTreeClassificationModel]

dtModel: org.apache.spark.ml.classification.DecisionTreeClassificationModel = DecisionTreeClassificationModel (uid=dtc_b10fd9474309) of depth 4 with 17 nodes

scala> dtModel.getMaxDepth

res24: Int = 3

scala> dtModel.numFeatures

res25: Int = 22

7) 统计逻辑回归的预测正确率

// t_lr 为逻辑回归预测值的数组,t_dt 为决策树预测值的数组
// t_label 为测试集的标签值的数组
scala> val (t_lr, t_dt, t_label, t_count) = (lrPrediction.select("prediction").collect, dtPrediction.select("prediction").collect,testData.select("label").collect,testData.count.toInt)

// c_lr 为统计逻辑回归预测正确个数的累加器
//c_dt 为统计决策树预测正确个数的累加器
scala> var Array(c_lr,c_dt) = Array(0,0)
t4: Int = 0

// 遍历循环，统计逻辑回归正确预测的次数
scala> for(i <- 0 to t_count-1){if(t_lr(i)==t_label(i)) c_lr+=1}

scala> c_lr
res5: Int = 899

# 统计逻辑回归正确率
scala> 1.0*c_lr/t_count
res6: Double = 0.6062036412677007

//遍历循环，统计逻辑回归正确预测的次数
scala> for(i <- 0 to t_count-1){if(t_dt(i)==t_label(i)) c_dt+=1}

scala> c_dt
res8: Int = 927

// 统计决策树正确率
scala>  1.0*c_dt / t_count
res9: Double = 0.6250842886041807

// t_lr 为逻辑回归预测值的数组,t_dt 为决策树预测值的数组

// t_label 为测试集的标签值的数组

scala> val (t_lr, t_dt, t_label, t_count) = (lrPrediction.select("prediction").collect, dtPrediction.select("prediction").collect,testData.select("label").collect,testData.count.toInt)

// c_lr 为统计逻辑回归预测正确个数的累加器

//c_dt 为统计决策树预测正确个数的累加器

scala> var Array(c_lr,c_dt) = Array(0,0)

t4: Int = 0

// 遍历循环，统计逻辑回归正确预测的次数

scala> for(i <- 0 to t_count-1){if(t_lr(i)==t_label(i)) c_lr+=1}

scala> c_lr

res5: Int = 899

# 统计逻辑回归正确率

scala> 1.0*c_lr/t_count

res6: Double = 0.6062036412677007

//遍历循环，统计逻辑回归正确预测的次数

scala> for(i <- 0 to t_count-1){if(t_dt(i)==t_label(i)) c_dt+=1}

scala> c_dt

res8: Int = 927

// 统计决策树正确率

scala> 1.0*c_dt / t_count

res9: Double = 0.6250842886041807

可以看到，我们通过交叉验证得出最优参数，从而获得最佳模型，将这个过程使用流水线连接起来，方便了我们的工作。关于模型的优化，其实我们还有很多工作要做，第11章也也出了一定的优化思路和方法。

8.7小结

本章就Spark ML中分类模型进行的详细介绍，包括逻辑回归、决策树、朴素贝叶斯模型的原理，同时介绍了分类模型的一些使用场景。通过流水线、网格参数以及交叉验证的方式，将整个机器学习过程规范化、标准化、流程化。

本章数据集下载

第7章构建Spark ML推荐模型

前面我们介绍了机器学习的一般步骤、如何探索数据、如何预处理数据、如何利用Spark Ml中的一些算法或API，以及有效处理机器学习过程中的特征转换、特征选择、训练模型，并把这些过程流程化等。从本章开始，我们将通过实例，进一步阐述这些问题，并通过实例把相关内容有机结合起来。
本章主要介绍Spark机器学习中的协同过滤（Collaborative Filtering，CF)模型，协调过滤简单来，说是利用某个兴趣相投、拥有共同经验之群体的喜好来推荐感兴趣的资讯给使用者，个人透过合作的机制给予资讯相当程度的回应（如评分）并记录下来以达到过滤的目的，进而帮助别人筛选资讯，回应不一定局限于特别感兴趣的，特别不感兴趣资讯的纪录也相当重要。在日常生活中，人们实际上经常使用这种方法，如你哪天突然想看个电影，但你不知道具体看哪部，你会怎么做？大部分的人会问问周围的朋友，最近有什么好看的电影，而我们一般更倾向于从兴趣或观点相近的朋友那里得到推荐。这就是协同过滤的思想。换句话说，就是借鉴和你相关人群的观点来进行推荐。
本章介绍Spark的推荐模型，将按以下步骤进行：
 首先简介推荐模型
 加载数据到HDFS
 Spark读取数据
 对数据进行探索
 训练模型
 组装任务
 评估、优化模型

7.1推荐模型简介

协同过滤常被用于推荐系统。这类技术目标在于填充“用户－商品”联系矩阵中的缺失项。Spark.ml目前支持基于模型的协同过滤，其中用户和商品以少量的潜在因子来描述，用以预测缺失项。Spark.ml使用交替最小二乘（ALS）算法来学习这些潜在因子。

7.2数据加载

这里使用MovieLens 100k数据集，主要包括用户属性数据（u.user）、电影数据(u.item)、用户对电影的评级数据（u.data）及题材数据（u.genre）等。在把数据复制到HDFS之前，我们先大致了解一下相关数据：
用户数据(u.user)结构：

$ head -3 u.user 
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
$ wc -l u.user 
943 u.user

$ head -3 u.user

1|24|M|technician|85711

2|53|F|other|94043

3|23|M|writer|32067

$ wc -l u.user

943 u.user

可以看出用户数据由user id、age、gender、occupation和zip code等5个字段，字段间隔符为竖线（"|"），共有943行。
电影数据(u.item)结构：

head -3 u.item 
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
$ wc -l u.item 
1682 u.item

head -3 u.item

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0

2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0

$ wc -l u.item

1682 u.item

可以看出用户数据由movie id、title、release date及其他属性，字段间隔符为竖线（"|"），共有1682行。
用户对电影评级数据(u.data)结构：

$ head -3 u.data
196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
$ wc -l u.data
100000 u.data

$ head -3 u.data

196 242 3 881250949

186 302 3 891717742

22 377 1 878887116

$ wc -l u.data

100000 u.data

可以看出用户数据由user id、movie id、rating(1-5)和timestamp等4个字段，字段间隔符为制表符（"\t"），共有100000行。
电影题材数据(u.genre):

$ head -3 u.genre 
unknown|0
Action|1
Adventure|2
$wc -l u.genre
20 u.genre

$ head -3 u.genre

unknown|0

Action|1

Adventure|2

$wc -l u.genre

20 u.genre

这个数据只有两个字段：题材及代码，以竖线分隔。共有20种电影题材。
把用户数据（u.user）复制到HDFS上,其他数据方法一样。

$ hadoop fs -put u.user /home/hadoop/data/

1	$ hadoop fs -put u.user /home/hadoop/data/

查看数据复制是否成功

$ hadoop fs -ls /home/hadoop/data/

1	$ hadoop fs -ls /home/hadoop/data/

把相关数据复制到HDFS后，我们就可以利用Pyspark对数据进行探索或简单分析，这里使用Pyspark主要考虑其可视化功能，如果不需要数据的可视化，使用Spark即可。
以spark Standalone模式启动spark集群

spark-shell --master spark://master:7077 --driver-memory 1G --total-executor-cores 2

1	spark-shell --master spark://master:7077 --driver-memory 1G --total-executor-cores 2

导入需要的包或库

import org.apache.spark.ml.evaluation.RegressionEvaluator  
import org.apache.spark.ml.recommendation.ALS  
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
	因读入数据缺省都是字符格式，故需要对数据进行格式转换。
//定义个类，来保存一次评分
case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long) 
//把一行转换成一个评分类
def parseRating(str: String): Rating = {  
  val fields = str.split("\t")  
  assert(fields.size == 4)  
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)  
}  
	读取数据，并缓存数据，因后面需要多次使用这份数据。
val ratings = spark.read.textFile("hdfs://master:9000/home/hadoop/data/u.data")  
  .map(parseRating)  
  .cache()

import org.apache.spark.ml.evaluation.RegressionEvaluator

import org.apache.spark.ml.recommendation.ALS

import org.apache.spark.sql.SparkSession

import org.apache.spark.ml.Pipeline

因读入数据缺省都是字符格式，故需要对数据进行格式转换。

//定义个类，来保存一次评分

case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)

//把一行转换成一个评分类

def parseRating(str: String): Rating = {

val fields = str.split("\t")

assert(fields.size == 4)

Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)

}

读取数据，并缓存数据，因后面需要多次使用这份数据。

val ratings = spark.read.textFile("hdfs://master:9000/home/hadoop/data/u.data")

.map(parseRating)

.cache()

7.3数据探索

数据加载到HDFS后，我们便可对数据进行探索和分析，对用户数据的探索，大家可参考2.4.3节的相关内容。用户对电影评级数据比较简单，这里我们简单查看一下导入数据抽样及统计信息。抽样数据：

ratings.show(4)
+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|   196|    242|   3.0|881250949|
|   186|    302|   3.0|891717742|
|    22|    377|   1.0|878887116|
|   244|     51|   2.0|880606923|
+------+-------+------+---------+

ratings.show(4)

+------+-------+------+---------+

+------+-------+------+---------+

| 196| 242| 3.0|881250949|

| 186| 302| 3.0|891717742|

| 22| 377| 1.0|878887116|

| 244| 51| 2.0|880606923|

+------+-------+------+---------+

用户ID、电影ID、评级数据统计信息：

ratings.describe("userId","movieId","rating").show()
+-------+------------------+------------------+------------------+
|summary|            userId|           movieId|            rating|
+-------+------------------+------------------+------------------+
|  count|            100000|            100000|            100000|
|   mean|         462.48475|         425.53013|           3.52986|
| stddev|266.61442012750905|330.79835632558473|1.1256735991443214|
|    min|                 1|                 1|               1.0|
|    max|               943|              1682|               5.0|
+-------+------------------+------------------+------------------+

ratings.describe("userId","movieId","rating").show()

+-------+------------------+------------------+------------------+

+-------+------------------+------------------+------------------+

| count| 100000| 100000| 100000|

| mean| 462.48475| 425.53013| 3.52986|

| stddev|266.61442012750905|330.79835632558473|1.1256735991443214|

| min| 1| 1| 1.0|

| max| 943| 1682| 5.0|

+-------+------------------+------------------+------------------+

由此可知，该数据集共有100000条，评级最低为1.0，最高为5.0,平均3.5左右。

7.4训练模型

这里数据比较简单，无须做数据转换和清理等数据预处理工作。在训练模型前，我们需要把数据划分为几个部分，这里先随机划分成两部分，划分比例为80%作为训练集，20%作为测试集。后续我们在性能优化时将采用另一种划分方式，然后，比较使用不同划分方法对模型性能或泛化能力的影响。

val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2),seed=1234)

val als = new ALS()  
  .setMaxIter(10) 
  .setRank(10)
  .setRegParam(0.01)
  .setNonnegative(true)
  .setUserCol("userId")  
  .setItemCol("movieId")  
  .setRatingCol("rating")

val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2),seed=1234)

val als = new ALS()

.setMaxIter(10)

.setRank(10)

.setRegParam(0.01)

.setNonnegative(true)

.setUserCol("userId")

.setItemCol("movieId")

.setRatingCol("rating")

7.5组装

1）创建流水线，把数据转换、模型训练等任务组装在一条流水线上。

val pipeline = new Pipeline().setStages(Array(als))

1	val pipeline = new Pipeline().setStages(Array(als))

2）训练模型

val model = pipeline.fit(training)

1	val model = pipeline.fit(training)

3）作出预测

val predictions = model.transform(test)

1	val predictions = model.transform(test)

4）查看预测值与原来的值

predictions.show(5)
+------+-------+------+---------+----------+                                    
|userId|movieId|rating|timestamp|prediction|
+------+-------+------+---------+----------+
|   222|    148|   2.0|881061164| 3.1357265|
|   330|    148|   4.0|876544781| 3.9583592|
|   224|    148|   3.0|888104154| 3.9787998|
|   618|    148|   3.0|891309670| 2.7060091|
|   896|    148|   2.0|887160606| 2.8391676|
+------+-------+------+---------+----------+

predictions.show(5)

+------+-------+------+---------+----------+

+------+-------+------+---------+----------+

| 222| 148| 2.0|881061164| 3.1357265|

| 330| 148| 4.0|876544781| 3.9583592|

| 224| 148| 3.0|888104154| 3.9787998|

| 618| 148| 3.0|891309670| 2.7060091|

| 896| 148| 2.0|887160606| 2.8391676|

+------+-------+------+---------+----------+

7.6评估模型

1）预测时会产生NaN，即NaN表示不推荐(预测时产生NaN是spark2.1 ALS中的一个bug，该bug在2.2中将修复)

predictions.filter(predictions("prediction").isNaN).select("userId","movieId","rating","prediction").count()

1	predictions.filter(predictions("prediction").isNaN).select("userId","movieId","rating","prediction").count()

2）删除含NaN的值的行,NaN有一定合理性，不推荐，但为评估指标，可以先过滤这些数。

val predictions1= predictions.na.drop()

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictions1)

val predictions1= predictions.na.drop()

val evaluator = new RegressionEvaluator()

.setMetricName("rmse")

.setLabelCol("rating")

.setPredictionCol("prediction")

val rmse = evaluator.evaluate(predictions1)

3）运行结果为：rmse: Double = 1.016902715345917

7.7模型优化

//导入一些包
import org.apache.spark.sql.Row
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.recommendation.{ALS,ALSModel}

//将样本评分表分成3个部分，分别用于训练 (60%), 校验 (20%), and 测试 (20%)
val splits = ratings.randomSplit(Array(0.6, 0.2,0.2),12)

//把训练样本缓存起来，加快运算速度
val training = splits(0).cache()
val validation = splits(1).toDF.cache()
val test = splits(1).toDF.cache()
//计算各集合总数
val numTraining = training.count()
val numValidation = validation.count()
val numTest = test.count()

//训练不同参数下的模型，并在校验集中验证，获取最佳参数下的模型
val ranks = List(10, 20)
val lambdas = List(0.01, 0.1)
val numIters = List(5, 10)
var bestModel: Option[ALSModel] = None
var bestValidationRmse = Double.MaxValue
var bestRank = 0
var bestLambda = 1.0
var bestNumIter = 1

def computeRmse(model:ALSModel,data:DataFrame,n:Long):Double = {
    val predictions = model.transform(data)
    val p1=predictions.na.drop().rdd.map{ x =>((x(0),x(1)),x(2))}.join(predictions.rdd.map{ x =>((x(0),x(1)),x(4))}).values
    math.sqrt(p1.map( x => (x._1.toString.toDouble - x._2.toString.toDouble) * (x._1.toString.toDouble - x._2.toString.toDouble)).reduce(_+_)/n)
  }

for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
val als = new ALS()
  .setMaxIter(numIter)
  .setRegParam(lambda)
  .setRank(rank)
.setNonnegative(true)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
val model = als.fit(training)
val validationRmse = computeRmse(model, validation, numValidation)
      println("RMSE(validation) = " + validationRmse + " for the model trained with rank = "
        + rank + ",lambda = " + lambda + ",and numIter = " + numIter + ".")
      if (validationRmse < bestValidationRmse) {
        bestModel = Some(model)
        bestValidationRmse = validationRmse
        bestRank = rank
        bestLambda = lambda
        bestNumIter = numIter
      }
    }
//运行结果
RMSE(validation) = 1.0664516122491705 for the model trained with rank = 10,lambda = 0.01,and numIter = 5.
RMSE(validation) = 1.0773258157269512 for the model trained with rank = 10,lambda = 0.01,and numIter = 10.
RMSE(validation) = 0.9509095542582248 for the model trained with rank = 10,lambda = 0.1,and numIter = 5.
RMSE(validation) = 0.9390664107785451 for the model trained with rank = 10,lambda = 0.1,and numIter = 10.
RMSE(validation) = 1.1024492428290906 for the model trained with rank = 20,lambda = 0.01,and numIter = 5.
RMSE(validation) = 1.1242105743040174 for the model trained with rank = 20,lambda = 0.01,and numIter = 10.
RMSE(validation) = 0.9393089637028184 for the model trained with rank = 20,lambda = 0.1,and numIter = 5.
RMSE(validation) = 0.9383240505365207 for the model trained with rank = 20,lambda = 0.1,and numIter = 10.

//用最佳模型预测测试集的评分，并计算和实际评分之间的均方根误差（RMSE）
val testRmse = computeRmse(bestModel.get, test, numTest)
testRmse: Double = 0.9383240505365207
//比优化前的rmse: Double = 1.016902715345917提高了7.6%左右。

//打印最优模型中的各参数值
println("The best model was trained with rank = " + bestRank + " and lambda = " + bestLambda
      + ", and numIter = " + bestNumIter + ", and its RMSE on the test set is " + testRmse + ".")

//导入一些包

import org.apache.spark.sql.Row

import org.apache.spark.sql.Dataset

import org.apache.spark.sql.DataFrame

import org.apache.spark.ml.recommendation.{ALS,ALSModel}

//将样本评分表分成3个部分，分别用于训练 (60%), 校验 (20%), and 测试 (20%)

val splits = ratings.randomSplit(Array(0.6, 0.2,0.2),12)

//把训练样本缓存起来，加快运算速度

val training = splits(0).cache()

val validation = splits(1).toDF.cache()

val test = splits(1).toDF.cache()

//计算各集合总数

val numTraining = training.count()

val numValidation = validation.count()

val numTest = test.count()

//训练不同参数下的模型，并在校验集中验证，获取最佳参数下的模型

val ranks = List(10, 20)

val lambdas = List(0.01, 0.1)

val numIters = List(5, 10)

var bestModel: Option[ALSModel] = None

var bestValidationRmse = Double.MaxValue

var bestRank = 0

var bestLambda = 1.0

var bestNumIter = 1

def computeRmse(model:ALSModel,data:DataFrame,n:Long):Double = {

val predictions = model.transform(data)

val p1=predictions.na.drop().rdd.map{ x =>((x(0),x(1)),x(2))}.join(predictions.rdd.map{ x =>((x(0),x(1)),x(4))}).values

math.sqrt(p1.map( x => (x._1.toString.toDouble - x._2.toString.toDouble) * (x._1.toString.toDouble - x._2.toString.toDouble)).reduce(_+_)/n)

}

for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {

val als = new ALS()

.setMaxIter(numIter)

.setRegParam(lambda)

.setRank(rank)

.setNonnegative(true)

.setUserCol("userId")

.setItemCol("movieId")

.setRatingCol("rating")

val model = als.fit(training)

val validationRmse = computeRmse(model, validation, numValidation)

println("RMSE(validation) = " + validationRmse + " for the model trained with rank = "

+ rank + ",lambda = " + lambda + ",and numIter = " + numIter + ".")

if (validationRmse < bestValidationRmse) {

bestModel = Some(model)

bestValidationRmse = validationRmse

bestRank = rank

bestLambda = lambda

bestNumIter = numIter

}

//运行结果

RMSE(validation) = 1.0664516122491705 for the model trained with rank = 10,lambda = 0.01,and numIter = 5.

RMSE(validation) = 1.0773258157269512 for the model trained with rank = 10,lambda = 0.01,and numIter = 10.

RMSE(validation) = 0.9509095542582248 for the model trained with rank = 10,lambda = 0.1,and numIter = 5.

RMSE(validation) = 0.9390664107785451 for the model trained with rank = 10,lambda = 0.1,and numIter = 10.

RMSE(validation) = 1.1024492428290906 for the model trained with rank = 20,lambda = 0.01,and numIter = 5.

RMSE(validation) = 1.1242105743040174 for the model trained with rank = 20,lambda = 0.01,and numIter = 10.

RMSE(validation) = 0.9393089637028184 for the model trained with rank = 20,lambda = 0.1,and numIter = 5.

RMSE(validation) = 0.9383240505365207 for the model trained with rank = 20,lambda = 0.1,and numIter = 10.

//用最佳模型预测测试集的评分，并计算和实际评分之间的均方根误差（RMSE）

val testRmse = computeRmse(bestModel.get, test, numTest)

testRmse: Double = 0.9383240505365207

//比优化前的rmse: Double = 1.016902715345917提高了7.6%左右。

//打印最优模型中的各参数值

println("The best model was trained with rank = " + bestRank + " and lambda = " + bestLambda

+ ", and numIter = " + bestNumIter + ", and its RMSE on the test set is " + testRmse + ".")

//最佳模型相关参数
The best model was trained with rank = 20 and lambda = 0.1, and numIter = 10, and its RMSE on the test set is 0.9383240505365207.

7.8小结

本章介绍了推荐模型的一般方法，Spark推荐模型的原理和算法等，然后通过一个实例具体说明实施Spark推荐模型的一般步骤、使用自定义函数优化模型等内容。下一章将以Spark ML的分类模型为例，进一步说明如何使用Spark ML提供的特征选取、特征转换、流水线、交叉验证等函数或方法。

第24章 语音识别基础

24.1 语言识别系统的架构

24.2 语音识别的原理

24.3 语音识别发展历程

第14章TensorFlowOnSpark详解

14.1TensorFlow简介

14.1.1TensorFlow的安装

14.1.2TensorFlow的发展

14.1.3TensorFlow的特点

14.1.4TensorFlow编程模型

14.1.5TensorFlow常用函数

14.1.6TensorFlow的运行原理

14.2TensorFlow实现卷积神经网络

14.2.1卷积神经网络简介

14.2.3卷积神经网络的网络结构

14.2.4.1 导入数据

14.2.4.2 权重初始化

14.2.4.3 构建卷积神经网络结构

14.2.4.4 训练评估模型

14.3TensorFlow实现循环神经网络

14.3.1循环神经网络简介

14.3.2LSTM循环神经网络简介

14.3.4TensorFlow实现循环神经网络

14.4分布式TensorFlow

14.4.1客户端、主节点和工作节点间的关系

14.4.2分布式模式

14.4.3在Pyspark集群环境运行TensorFlow

14.5TensorFlowOnSpark架构

14.6TensorFlowOnSpark安装

14.7TensorFlowOnSpark实例

14.7.1TensorFlowOnSpark单机模式实例

14.7.2TensorFlowOnSpark集群模式实例

14.8小结

第13章 使用Spark Streaming构建在线学习模型

13.1 Spark Streaming简介

13.1.1Spark Streaming常用术语

13.2 Dstream操作

13.2.1 Dstream输入

13.2.2 Dstream转换

13.2.3 Dstream修改

13.2 .4Dstream输出

13.3 Spark Streaming应用实例

13.4 Spark Streaming在线学习实例

13.5小结

12.1. Spark R简介

12.2获取数据

12.2.1 SparkDataFrame数据结构说明

12.2.2创建SparkDataFrame

12.2.3 SparkDataFrame的常用操作

12.3朴素贝叶斯分类器

12.3.1数据探查

12.3.2对原始数据集进行转换

12.3.3查看不同船舱的生还率差异

12.3.4转换成SparkDataFrame格式的数据

12.3.5模型概要

12.3.6预测

12.3.7评估模型

12.4 小结

第11章 PySpark 决策树模型

11.1 PySpark 简介

11.2 决策树简介

11.3数据加载

11.3.1 原数据集初探

11.3.2 PySpark 的启动

11.3.3 基本函数

11.4数据探索

11.5数据预处理

11.6创建决策树模型

11.7训练模型进行预测

11.8.1特征值的优化

11.8.2交叉验证和网格参数

11.9脚本方式运行

11.9.1 在脚本中添加配置信息

11.9.2运行脚本程序

11.10小结

第1章：Keras基础

1.1Keras简介

1.2keras安装

1.3 keras常用概念

1.4 keras与Tensorflow

第24章语音识别基础

第13章使用Spark Streaming构建在线学习模型

第10章构建Spark ML聚类模型

第9章构建Spark ML回归模型