My colleague Param (param.bng@in.ibm.com) and I (pallavipr@in.ibm.com) are exploring various aspects of Spark integration with DB2 and DB2 Connect drivers. We have decided to write a series of blogs capturing our experimentation for the benefit of others, as we did not find any article that focuses on the different aspects of DB2 access via Spark. Currently, the Spark shell is available in Scala and Python. This article covers accessing and filtering DB2 data via the Scala shell using the DB2-supplied JDBC driver (IBM Data Server Driver for JDBC and SQLJ). Below are the step-by-step instructions -
1) Confirm that you have Java installed by running java -version from the Windows command line. JDK version 1.7 or 1.8 is recommended.
2) Install Spark on your local machine by downloading it from https://spark.apache.org/downloads.html.
3) We chose the pre-built binaries shown in Screenshot 1 (instead of the source code download) to avoid building Spark in this early experimentation phase.
4) Unzip the installation file to a local directory (say C:/spark).
5) Start Windows command prompt.
6) Navigate to the directory that contains the bin folder of the Spark installation (c:/spark/bin).
7) Download the DB2 JDBC driver jar (db2jcc.jar or db2jcc4.jar) from http://www-01.ibm.com/support/docview.wss?uid=swg21385217 into C:\ or any other location you desire.
8) Set SPARK_CLASSPATH to the location of the DB2 driver by running SET SPARK_CLASSPATH=c:\db2jcc.jar (see the note on SPARK_CLASSPATH deprecation at the end of this article).
9) Run the spark-shell.cmd script found in the bin folder to start the Spark shell using Scala.
10) If the installation was successful, you should see output like Screenshot 2, followed by a Scala prompt as in Screenshot 3.
11) In Screenshot 3, you can see that two important objects have already been created for you:
11.1) SparkContext – Any Spark application needs a SparkContext, which tells Spark how to access a cluster. In shell mode, a SparkContext is already created for you in a variable called sc.
11.2) SqlContext – This is needed to construct DataFrames (the equivalent of relational tables) from database data, and serves as the entry point for working with structured data (a quick way to verify both objects is shown below).
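Before going further, you can sanity-check both objects straight from the Scala prompt. A minimal sketch (the exact values printed will depend on your Spark version and build):

    sc.version      // prints the Spark version of your download, e.g. 1.3.x
    sqlContext      // should print an org.apache.spark.sql.SQLContext instance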
12) Once you have Spark up and running, you can issue queries through the DB2 JDBC driver to DB2 for z/OS as well as DB2 LUW. Tables from a DB2 database can be loaded as a DataFrame using the following options on load -
12.1) url – The JDBC URL to connect to.
12.2) dbtable – The JDBC table that should be read. Note that anything that is valid in a `FROM` clause of a SQL query can be used.
12.3) driver – The class name of the JDBC driver needed to connect to this URL.
13) From the Scala command line, issue:

    val employeeDF = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:db2://localhost:50000/sample:currentSchema=pallavipr;user=pallavipr;password=XXXXXX;",
      "driver" -> "com.ibm.db2.jcc.DB2Driver",
      "dbtable" -> "pallavipr.employee"))
14) You should see output containing the table metadata as shown in Screenshot 4 -
Screenshot 4
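Note that sqlContext.load("jdbc", ...) is the Spark 1.3-style API; if your Spark build reports it as deprecated, the equivalent call through the DataFrameReader API (Spark 1.4 and later) is roughly the following sketch, using the same URL, driver, and table values as in step 13:

    val employeeDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:db2://localhost:50000/sample:currentSchema=pallavipr;user=pallavipr;password=XXXXXX;")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("dbtable", "pallavipr.employee")
      .load()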
15) To see the contents of the EMPLOYEE table, issue employeeDF.show() from the Scala command line, which displays the contents of the DataFrame as captured in Screenshot 5. show() returns the first 20 rows of the table by default (out of the ~40 rows that exist).
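If you want to look beyond the default 20 rows, show() also accepts a row count, and the DataFrame exposes the usual inspection methods. A small sketch (row counts chosen arbitrarily):

    employeeDF.show(50)       // print up to 50 rows instead of the default 20
    employeeDF.count()        // total number of rows pulled back from DB2
    employeeDF.printSchema()  // column names and types derived from the DB2 table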
16) You can further narrow the search results above by using filter criteria. For example, if you want to see only the employee id, first name, last name, and job title columns out of all the existing columns, issue employeeDF.select("empno","firstnme","lastname","job").show(). This gives the results shown in Screenshot 6.
Screenshot 6
17) Now, if you want to see only those rows that have the job title DESIGNER, issue the following from the Scala shell: employeeDF.filter(employeeDF("job").equalTo("DESIGNER")).show(). You will see the results shown in Screenshot 7.
Screenshot 7
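A few variations on this filtering, shown as a sketch using the same connection values, table, and column names as above (adjust to your own schema):

    // Combine the filter with a projection in one chain
    employeeDF.filter(employeeDF("job").equalTo("DESIGNER"))
              .select("empno", "firstnme", "lastname", "job")
              .show()

    // Register the DataFrame as a temporary table and use plain SQL (Spark 1.x API)
    employeeDF.registerTempTable("employee")
    sqlContext.sql("SELECT empno, firstnme, lastname FROM employee WHERE job = 'DESIGNER'").show()

    // Push the filter down to DB2 itself by using a subquery as dbtable
    // (anything valid in a FROM clause works, per step 12.2)
    val designerDF = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:db2://localhost:50000/sample:currentSchema=pallavipr;user=pallavipr;password=XXXXXX;",
      "driver" -> "com.ibm.db2.jcc.DB2Driver",
      "dbtable" -> "(SELECT * FROM pallavipr.employee WHERE job = 'DESIGNER') AS designers"))
    designerDF.show()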
Note: SPARK_CLASSPATH was deprecated in Spark 1.0+. Instead use:
- ./spark-submit with --driver-class-path to augment the driver classpath
- spark.executor.extraClassPath to augment the executor classpath
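For example, with the layout used in this article, the shell could be started with the driver jar passed directly on the driver classpath instead of setting SPARK_CLASSPATH (a sketch; adjust the path to wherever you saved db2jcc.jar):

    c:\spark\bin> spark-shell --driver-class-path c:\db2jcc.jar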