Configuring a Hadoop + PySpark Environment


1. Deploying Hadoop

Set up a pseudo-distributed Hadoop environment, with all services running on a single node.

1.1. Installing the JDK

Install the JDK from the pre-built binary package; see the download page.

  • Download the JDK
$ cd /opt/local/src/
$ curl -o jdk-8u171-linux-x64.tar.gz  http://download.oracle.com/otn-pub/java/jdk/8u171-b11/512cd62ec5174c3487ac17c61aaa89e8/jdk-8u171-linux-x64.tar.gz?AuthParam=1529719173_f230ce3269ab2fccf20e190d77622fe1 
  • Extract the archive and configure the environment variables
### Extract to the target location
$ tar -zxf jdk-8u171-linux-x64.tar.gz -C /opt/local
### Create a symlink
$ cd /opt/local/
$ ln -s jdk1.8.0_171 jdk
### Configure environment variables: add the following to the current user's ~/.bashrc
$ tail ~/.bashrc
# Java
export JAVA_HOME=/opt/local/jdk
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
  • Reload the environment variables
$ source ~/.bashrc
### Verify it took effect; printing the Java version information means it works
$ java -version
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)

1.2. Configuring /etc/hosts

### Map the hostname to its IP address in /etc/hosts
$ head -n 3 /etc/hosts
# ip --> hostname or domain
192.168.20.10    node
### Verify
$ ping node -c 2
PING node (192.168.20.10) 56(84) bytes of data.
64 bytes from node (192.168.20.10): icmp_seq=1 ttl=64 time=0.063 ms
64 bytes from node (192.168.20.10): icmp_seq=2 ttl=64 time=0.040 ms

1.3. Setting up passwordless SSH login

  • Generate an SSH key
### Generate the SSH key
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  • Add the public key to the authorized_keys file
### The password is required this one time
$ ssh-copy-id node
### Verify: logging in without a password means it works
$ ssh node

1.4. Installing and configuring Hadoop

  • Download Hadoop
### Download Hadoop 2.7.6
$ cd /opt/local/src/
$ wget -c http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-2.7.6/hadoop-2.7.6.tar.gz
  • Create the Hadoop data directories
$ mkdir -p /opt/local/hdfs/{namenode,datanode,tmp}
$ tree /opt/local/hdfs/
/opt/local/hdfs/
├── datanode
├── namenode
└── tmp
  • Extract the Hadoop archive
### Extract to the target location
$ cd /opt/local/src/
$ tar -zxf hadoop-2.7.6.tar.gz -C /opt/local/
### Create a symlink
$ cd /opt/local/
$ ln -s hadoop-2.7.6 hadoop

1.5. Configuring Hadoop

1.5.1. Configuring core-site.xml

$ vim /opt/local/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/opt/local/hdfs/tmp/</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>

1.5.2. Configuring hdfs-site.xml

$ vim /opt/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/local/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/local/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>

1.5.3. Configuring mapred-site.xml

### mapred-site.xml is created by copying the template and then editing it
$ cp /opt/local/hadoop/etc/hadoop/mapred-site.xml.template  /opt/local/hadoop/etc/hadoop/mapred-site.xml
$ vim /opt/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>node:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>node:19888</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/history/done</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/history/done_intermediate</value>
  </property>
</configuration>

1.5.4. Configuring yarn-site.xml

$ vim /opt/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>node:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>node:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>node:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>node:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>node:8088</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>

1.5.5. Configuring slaves

$ cat  /opt/local/hadoop/etc/hadoop/slaves 
node

1.5.6. Configuring master

$ cat  /opt/local/hadoop/etc/hadoop/master
node

1.5.7. Configuring hadoop-env

$ vim  /opt/local/hadoop/etc/hadoop/hadoop-env.sh
### Set JAVA_HOME
export JAVA_HOME=/opt/local/jdk

1.5.8. Configuring yarn-env

$ vim  /opt/local/hadoop/etc/hadoop/yarn-env.sh
### Set JAVA_HOME
export JAVA_HOME=/opt/local/jdk

1.5.9. Configuring mapred-env

$ vim  /opt/local/hadoop/etc/hadoop/mapred-env.sh
### Set JAVA_HOME
export JAVA_HOME=/opt/local/jdk

1.5.10. Configuring Hadoop environment variables

  • Add the Hadoop settings

Add the following Hadoop environment variables to ~/.bashrc:

# hadoop
export HADOOP_HOME=/opt/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME 
export HADOOP_COMMON_HOME=$HADOOP_HOME 
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
  • Apply the configuration
$ source ~/.bashrc
### Verify
$ hadoop version
Hadoop 2.7.6
Subversion https://shv@git-wip-us.apache.org/repos/asf/hadoop.git -r 085099c66cf28be31604560c376fa282e69282b8
Compiled by kshvachk on 2018-04-18T01:33Z
Compiled with protoc 2.5.0
From source with checksum 71e2695531cb3360ab74598755d036
This command was run using /opt/local/hadoop-2.7.6/share/hadoop/common/hadoop-common-2.7.6.jar

1.6. Formatting the HDFS filesystem

### Format HDFS. Use with caution if data already exists: formatting deletes it
$ hadoop namenode -format
### The namenode storage directory now contains data
$ ls /opt/local/hdfs/namenode/
current

1.7. Starting Hadoop

Starting Hadoop mainly involves HDFS (NameNode, DataNode) and YARN (ResourceManager, NodeManager). Everything can be started with start-all.sh and stopped with stop-all.sh, or each service can be started on its own.

1.7.1. Starting DFS

Starting DFS covers the NameNode and DataNode services. They can be started together with start-dfs.sh; below they are started one by one.

1.7.1.1. Starting the NameNode
### Start the namenode
$ hadoop-daemon.sh start namenode
starting namenode, logging to /opt/local/hadoop-2.7.6/logs/hadoop-hadoop-namenode-node.out
### Check the process
$ jps
7547 Jps
7500 NameNode
### Start the SecondaryNameNode
$ hadoop-daemon.sh start secondarynamenode
starting secondarynamenode, logging to /opt/local/hadoop-2.7.6/logs/hadoop-hadoop-secondarynamenode-node.out
### Check the process
$ jps
10001 SecondaryNameNode
10041 Jps
9194 NameNode
1.7.1.2. Starting the DataNode
### Start the datanode
$ hadoop-daemon.sh start datanode
starting datanode, logging to /opt/local/hadoop-2.7.6/logs/hadoop-hadoop-datanode-node.out
### Check the process
$ jps
7607 DataNode
7660 Jps
7500 NameNode
10001 SecondaryNameNode

1.7.2. Starting YARN

Starting YARN covers the ResourceManager and NodeManager. They can be started together with start-yarn.sh; below they are started one by one.

1.7.2.1. Starting the ResourceManager
### Start the resourcemanager
$ yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /opt/local/hadoop-2.7.6/logs/yarn-hadoop-resourcemanager-node.out
### Check the process
$ jps
7607 DataNode
7993 Jps
7500 NameNode
7774 ResourceManager
10001 SecondaryNameNode
1.7.2.2. Starting the NodeManager
### Start the nodemanager
$ yarn-daemon.sh start nodemanager
starting nodemanager, logging to /opt/local/hadoop-2.7.6/logs/yarn-hadoop-nodemanager-node.out
### Check the process
$ jps
7607 DataNode
8041 NodeManager
8106 Jps
7500 NameNode
7774 ResourceManager
10001 SecondaryNameNode

1.7.3. Starting the HistoryServer

### Start the historyserver
$ mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /opt/local/hadoop/logs/mapred-hadoop-historyserver-node.out
### Check the process
$ jps
8278 JobHistoryServer
7607 DataNode
8041 NodeManager
7500 NameNode
8317 Jps
7774 ResourceManager
10001 SecondaryNameNode

1.7.4. Hadoop service overview

After Hadoop starts, the main services are:

  • HDFS: NameNode, SecondaryNameNode, DataNode
  • YARN: ResourceManager, NodeManager
  • HistoryServer: JobHistoryServer

1.8. Basic Hadoop operations

1.8.1. Common Hadoop commands

Command                      Description
hadoop fs -mkdir             Create an HDFS directory
hadoop fs -ls                List an HDFS directory
hadoop fs -copyFromLocal     Copy a local file to HDFS
hadoop fs -put               Copy a local file to HDFS; put can also read from stdin
hadoop fs -cat               Print the contents of an HDFS file
hadoop fs -copyToLocal       Copy an HDFS file to the local filesystem
hadoop fs -get               Copy an HDFS file to the local filesystem
hadoop fs -cp                Copy HDFS files
hadoop fs -rm                Delete an HDFS file or directory (with -R)

1.8.2. Hadoop command examples

1.8.2.1. Basic commands
  • Create a directory
$ hadoop fs -mkdir /user/hadoop
  • Create multiple directories
$ hadoop fs -mkdir -p /user/hadoop/{input,output} 
  • List an HDFS directory
$ hadoop fs -ls /
Found 2 items
drwxrwx---   - hadoop supergroup          0 2018-06-23 12:20 /history
drwxr-xr-x   - hadoop supergroup          0 2018-06-23 13:20 /user
$ hadoop fs -ls /user
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-06-23 13:20 /user/hadoop
  • List all directories recursively
$ hadoop fs -ls -R /
drwxrwx---   - hadoop supergroup          0 2018-06-23 12:20 /history
drwxrwx---   - hadoop supergroup          0 2018-06-23 12:20 /history/done
drwxrwxrwt   - hadoop supergroup          0 2018-06-23 12:20 /history/done_intermediate
drwxr-xr-x   - hadoop supergroup          0 2018-06-23 13:20 /user
drwxr-xr-x   - hadoop supergroup          0 2018-06-23 13:24 /user/hadoop
drwxr-xr-x   - hadoop supergroup          0 2018-06-23 13:24 /user/hadoop/input
drwxr-xr-x   - hadoop supergroup          0 2018-06-23 13:24 /user/hadoop/output
  • Upload a local file to HDFS
$ hadoop fs -copyFromLocal /opt/local/hadoop/README.txt /user/hadoop/input
  • View the contents of a file on HDFS
$ hadoop fs -cat /user/hadoop/input/README.txt
  • Download a file from HDFS to the local filesystem
$ hadoop fs -get /user/hadoop/input/README.txt ./
  • Delete a file or directory
### Deleting a file prints a confirmation message
$ hadoop fs -rm /user/hadoop/input/examples.desktop
18/06/23 13:47:06 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hadoop/input/examples.desktop
### Delete a directory
$ hadoop fs -rm -R /user/hadoop
18/06/23 13:48:17 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hadoop
1.8.2.2. Running a MapReduce job

Use Hadoop's built-in wordcount example program to count words.

  • Run the job
$ hadoop fs -put /opt/local/hadoop/README.txt /user/input
$ cd /opt/local/hadoop/share/hadoop/mapreduce
#### hadoop jar <jar name> <class> <input> <output directory>
$ hadoop jar hadoop-mapreduce-examples-2.7.6.jar wordcount /user/input/ /user/output/wordcount
  • Check the running job
#### Check the current job status; it can also be viewed at http://node:8088
$ yarn application -list
18/06/23 13:55:34 INFO client.RMProxy: Connecting to ResourceManager at node/192.168.20.10:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id      Application-Name        Application-Type          User       Queue               State         Final-State         Progress                        Tracking-URL
application_1529732240998_0001            word count               MAPREDUCE        hadoop     default             RUNNING           UNDEFINED               5%                   http://node:41713
  • Check the result
#### _SUCCESS indicates success; files starting with part hold the results
$ hadoop fs -ls /user/output/wordcount
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2018-06-23 13:55 /user/output/wordcount/_SUCCESS
-rw-r--r--   1 hadoop supergroup       1306 2018-06-23 13:55 /user/output/wordcount/part-r-00000
#### View the contents
$ hadoop fs -cat /user/output/wordcount/part-r-00000|tail
uses    1
using   2
visit   1
website 1
which   2
wiki,   1
with    1
written 1
you 1
your    1

1.9. Hadoop web UIs

  • The Hadoop NameNode HDFS web UI shows the current state of HDFS and the DataNodes
http://node:50070
  • The Hadoop ResourceManager web UI shows the node status, running applications, and task execution state
http://node:8088

2. Deploying Spark

2.1. Scala overview and installation

2.1.1. Scala overview

Spark is written in Scala (official site: https://www.scala-lang.org/), so Scala must be installed first. Scala has the following characteristics:

  • Scala compiles to Java bytecode, so it runs on the JVM (Java Virtual Machine) and is cross-platform;
  • Existing Java libraries can be used directly, so the rich Java open-source ecosystem remains available;
  • Scala is a functional language: functions are values, on the same footing as integers and strings, and can be passed as arguments to other functions;
  • Scala is also a purely object-oriented language: everything is an object and every operation is a method.

2.1.2. Installing Scala

Scala can be downloaded from https://www.scala-lang.org/files/archive/.
Since Spark 2.0, Spark is built with Scala 2.11 by default, so download a scala-2.11 release.

  • Download Scala
$ cd /opt/local/src/
$ wget -c https://www.scala-lang.org/files/archive/scala-2.11.11.tgz
  • Extract the Scala archive
#### Extract to the target location and create a symlink
$ tar -zxf scala-2.11.11.tgz -C /opt/local/
$ cd /opt/local/
$ ln -s scala-2.11.11 scala
  • Configure the Scala environment variables
#### Add the following to ~/.bashrc
$ tail -n 5 ~/.bashrc
# scala
export SCALA_HOME=/opt/local/scala
export PATH=$PATH:$SCALA_HOME/bin
#### Apply the configuration
$ source ~/.bashrc
#### Verify
$ scala -version
Scala code runner version 2.11.11 -- Copyright 2002-2017, LAMP/EPFL

2.2. Installing Spark

2.2.1. Downloading Spark

The Spark download page is http://spark.apache.org/downloads.html; choose the build for Hadoop 2.7 and later.

$ cd /opt/local/src/
$ wget -c http://mirror.bit.edu.cn/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz

2.2.2. Extracting and configuring Spark

  • Extract Spark to the target location and create a symlink
$ tar zxf spark-2.3.1-bin-hadoop2.7.tgz -C /opt/local/
$ cd /opt/local/
$ ln -s spark-2.3.1-bin-hadoop2.7 spark
  • Configure the Spark environment variables
$ tail -n 5 ~/.bashrc 
# spark
export SPARK_HOME=/opt/local/spark
export PATH=$PATH:$SPARK_HOME/bin
  • Apply the environment variables
$ source ~/.bashrc

2.3. Running pyspark

2.3.1. Running pyspark locally

Typing pyspark in a terminal starts Spark's Python shell; on startup it prints the Python and Spark versions in use.

For example, pyspark --master local[4] runs locally; local[N] means local mode with N threads, and local[*] uses as many CPU cores as possible.

  • Start pyspark locally
$ pyspark 
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-06-23 19:25:00 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 2.7.12 (default, Dec  4 2017 14:50:18)
SparkSession available as 'spark'.
  • Check the current master
>>> sc.master
u'local[*]'
  • Read a local file
>>> textFile=sc.textFile("file:/opt/local/spark/README.md")
>>> textFile.count()
103
  • Read an HDFS file
>>> textFile=sc.textFile("hdfs://node:9000/user/input/README.md")
>>> textFile.count()
103
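
Continuing in the same pyspark shell, a small word count over the README can be run directly on the RDD. This is a minimal sketch: it assumes only the sc SparkContext that pyspark creates automatically and the README.md path used above.

>>> text = sc.textFile("file:/opt/local/spark/README.md")
>>> counts = (text.flatMap(lambda line: line.split())   # split each line into words
...               .map(lambda word: (word, 1))          # pair each word with a count of 1
...               .reduceByKey(lambda a, b: a + b))     # sum the counts per word
>>> for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):  # 10 most frequent words
...     print("%s\t%d" % (word, n))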

2.3.2. Running Spark on Hadoop YARN

Spark can run on Hadoop YARN, letting YARN handle the resource management:

HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop points Spark at the Hadoop configuration directory;
pyspark is the program being run;
--master yarn --deploy-mode client selects YARN client mode.

$ HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-06-23 20:27:48 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-06-23 20:27:52 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 2.7.12 (default, Dec  4 2017 14:50:18)
SparkSession available as 'spark'.
  • Check the current master
>>> sc.master
u'yarn'
  • Read an HDFS file
>>> textFile=sc.textFile("hdfs://node:9000/user/input/README.md")
>>> textFile.count()
103  
  • Check the job in YARN
#### It can also be viewed in the web UI: http://node:8088
$ yarn application -list
18/06/23 20:34:40 INFO client.RMProxy: Connecting to ResourceManager at node/192.168.20.10:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id      Application-Name        Application-Type          User       Queue               State         Final-State         Progress                        Tracking-URL
application_1529756801315_0001          PySparkShell                   SPARK        hadoop     default             RUNNING           UNDEFINED              10%                    http://node:4040
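
The same kind of job can also be packaged as a standalone script and handed to YARN with spark-submit instead of the interactive shell. Below is a minimal sketch; the file name wordcount_yarn.py is illustrative, and the HDFS path is the README.md uploaded earlier.

# wordcount_yarn.py -- minimal sketch of a self-contained PySpark job for YARN
# (file name is illustrative; the HDFS path is the README.md uploaded earlier)
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # The master is not hard-coded; it is taken from the spark-submit command line
    spark = SparkSession.builder.appName("wordcount-yarn").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("hdfs://node:9000/user/input/README.md")
    print("line count: %d" % lines.count())

    spark.stop()

It would be submitted along the lines of: HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop spark-submit --master yarn --deploy-mode client wordcount_yarn.py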

2.3.3. Running Spark on a Standalone Cluster

Set up a pseudo-distributed Spark Standalone Cluster, with all services running on a single node.

2.3.3.1. Configuring spark-env.sh
  • Create spark-env.sh from the template
$ cp /opt/local/spark/conf/spark-env.sh.template /opt/local/spark/conf/spark-env.sh
  • Edit spark-env.sh
$ tail -n 6 /opt/local/spark/conf/spark-env.sh
#### Spark Standalone Cluster
export JAVA_HOME=/opt/local/jdk
export SPARK_MASTER_HOST=node
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=512m
export SPARK_WORKER_INSTANCES=1
2.3.3.2. Configuring slaves
#### Edit the file directly, or copy it from the template
$ tail /opt/local/spark/conf/slaves
node
2.3.3.3. Running pyspark on the Standalone Cluster
2.3.3.3.1. Starting the Standalone Cluster

The Spark Standalone Cluster can be started with a single script, ${SPARK_HOME}/sbin/start-all.sh, or the master and slaves can be started separately.

  • Start the master
$ /opt/local/spark/sbin/start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to /opt/local/spark/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-node.out
$ jps
4185 Master
  • Start the slaves
$ /opt/local/spark/sbin/start-slaves.sh 
node: starting org.apache.spark.deploy.worker.Worker, logging to /opt/local/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-node.out
$ jps
4185 Master
4313 Worker
  • View the cluster status at http://node:8080
$ w3m http://node:8080/
[spark-logo] 2.3.1 Spark Master at spark://node:7077
• URL: spark://node:7077
• REST URL: spark://node:6066 (cluster mode)
• Alive Workers: 1
• Cores in use: 1 Total, 0 Used
• Memory in use: 256.0 MB Total, 0.0 B Used
• Applications: 0 Running, 0 Completed
• Drivers: 0 Running, 0 Completed
• Status: ALIVE
Workers (1)
Worker Id                                   Address              State  Cores       Memory
worker-20180624102100-192.168.20.10-42469   192.168.20.10:42469  ALIVE  1 (0 Used)  512.0 MB (0.0 B Used)
Running Applications (0)
Completed Applications (0)
2.3.3.3.2. Running pyspark on the Standalone Cluster
  • Run pyspark
$ pyspark --master spark://node:7077 --num-executors 1 --total-executor-cores 1 --executor-memory 512m
Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-06-24 10:39:09 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 2.7.12 (default, Dec  4 2017 14:50:18)
SparkSession available as 'spark'.
  • Check the current master
>>> sc.master
u'spark://node:7077'
  • Read a local file
>>> textFile=sc.textFile("file:/opt/local/spark/README.md")
>>> textFile.count()
103
  • Read an HDFS file
>>> textFile=sc.textFile("hdfs://node:9000/user/input/README.md")
>>> textFile.count()
103
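
The master URL and executor limits passed to pyspark above can also be set in code when running a standalone script against this cluster. The following is a sketch under the same assumptions (the Standalone master at spark://node:7077 and the 512 MB single-core worker configured in spark-env.sh); the file name standalone_app.py is illustrative.

# standalone_app.py -- sketch of connecting to the Standalone master from a script;
# the resource settings mirror the pyspark flags used above (illustrative values)
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://node:7077")          # the Standalone master started above
        .setAppName("standalone-demo")
        .set("spark.executor.memory", "512m")    # same as --executor-memory 512m
        .set("spark.cores.max", "1"))            # same as --total-executor-cores 1

sc = SparkContext(conf=conf)
print(sc.textFile("hdfs://node:9000/user/input/README.md").count())
sc.stop()

Run it with spark-submit standalone_app.py; because the master is set in the code, no --master flag is needed.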

2.4. Summary

Spark can run in several modes, mainly: a standalone cluster, a YARN cluster, a Mesos cluster, or local mode.

Master value         Description
spark://host:port    Spark standalone cluster; the default port is 7077
yarn                 YARN cluster; set the HADOOP_CONF_DIR environment variable to the Hadoop configuration directory so Spark can locate the cluster
mesos://host:port    Mesos cluster; the default port is 5050
local                Local mode with 1 core
local[n]             Local mode with n cores
local[*]             Local mode using as many cores as possible
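
Any of these master values can be passed either on the command line (pyspark / spark-submit --master ...) or set in code when building the SparkSession; a short sketch (the value shown is just an example):

# Sketch: selecting the master in code (the value is illustrative)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")        # or "yarn", "spark://node:7077", "mesos://host:5050"
         .appName("master-demo")
         .getOrCreate())
print(spark.sparkContext.master)
spark.stop()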

Reposted from: https://blog.51cto.com/balich/2132160

