Tuesday 16 June 2015

Spark blog 2 - Accessing DB2 data in Spark via stadalone Scala and Java programs in Eclipse.


My colleague Param (param.bng@in.ibm.com) and I (pallavipr@in.ibm.com) are exploring various aspects of Spark integration with DB2 and DB2 Connect drivers. We have decided to write a series of articles capturing our experimentation for the benefit of others as we did not find any article that focuses on different aspects of DB2 access via Spark.

Our first article in the series covered DB2 access via Spark Scala shell. This second article focuses on accessing DB2 data from via standalone Scala and Java program in Eclipse using DB2 JDBC driver and DataFrames API. Below are the detailed step by step instructions. Note that same instructions will apply to DB2 on all platforms (z/OS, LUW, I) as well as Informix.

  1. Confirm that you have Java installed by running java -version from Windows command line. JDK version 1.7 or 1.8 is recommended.
  2. Install Spark on local machine by downloading spark from https://spark.apache.org/downloads.html.
  3. We chose pre-built binaries as shown in Screenshot 1 (instead of source code download) to avoid building spark in early experimentation phase.
    Screenshot 1



  4. Unzip the installation file to a local directory (say C:/spark).
  5. Download Scala Eclipse IDE from http://scala-ide.org/download/sdk.html
  6. Unzip scala-SDK-4.1.0-vfinal-2.11-win32.win32.x86_64.zip into a folder (say c:\Eclipse_Scala)
  7. Find eclipse.exe from eclipse folder and run. Make sure you have 64-bit Java installed by running java -version from cmd prompt. Incompatibility between 64 bit Eclipse package and 32-bit Java will give an error and Eclipse would not start.
  8. Choose a workspace for your Scala project as shown in Screenshot 2.

    Screenshot 2


  9. Create a new Scala project using File->New Scala Project.
  10. Add Spark libraries downloaded in Step 6 to the newly created Scala project as shown in Screenshot 3.
    Screenshot 3
  11. You may see an error about more than 1 scala libraries as shown in Screenshot 4 since Spark has its own copy of Scala library.
Screenshot 4

  1. Remove Scala reference from the Java build path as shown in Screenshot 5 to remove the error. 
    Screenshot 5
  2. You may see another error “The version of scala library found in the build path of DB2SparkAccess (2.10.4) is prior to the one provided by scala IDE (2.11.6). Setting a Scala Installation Choice to match”. Right click Project->Properties->Scala Compiler and change project setting to 2.10 as shown in Screenshot 6.
    Screenshot 6

  3. After clicking OK, project gets rebuilt and you will only see a warning about different Scala versions that you can ignore.
  4. Now you can right click DB2SparkAccess project and choose New Scala App as shown in Screenshot 7. Enter application name and click Finish.

    Screenshot 7

  1. Copy the following source code into the new Scala application you have created (.scala file) and modify the database credentials to yours.

    import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object DB2SparkScala extends App {
val conf = new SparkConf()
.setMaster("local[1]")
.setAppName("GetEmployee")
.set("spark.executor.memory", "1g")

val sc = new SparkContext(conf)

val sqlContext = new SQLContext(sc)

val employeeDF = sqlContext.load("jdbc", Map(
"url" -> "jdbc:db2://localhost:50000/sample:currentSchema=pallavipr;user=pallavipr;password=XXXX;",
"driver" -> "com.ibm.db2.jcc.DB2Driver",
"dbtable" -> "pallavipr.employee"))

employeeDF.show();
}


  1. Right click the application and select Run As-> Scala application as shown in Screenshot 8-


    Screenshot 8
  2. You may see the following exception - Exception in thread "main" java.lang.ClassNotFoundException: com.ibm.db2.jcc.DB2Driver. To get rid of the above exception, select Project->Properties and configure Java Build Path to include the IBM DB2 JDBC driver (db2jcc.jar or db2jcc4.jar) as shown in Screenshot 9. JDBC driver can be downloaded from http://www-01.ibm.com/support/docview.wss?uid=swg21385217
    Screenshot 9


  3. Now click on your Scala application and select Run As->Scala Application again and you should see the employee data retrieved from DB2 table as shown in Screenshot 10.
    Screenshot 10
  4. To perform similar access via a standalone Java program, Click on Project->New->Other as shown in Screenshot 11.  
    Screenshot 11
  5. Select Java->Class and click Next that takes you to Screenshot 12.
    Screenshot 12
  6. Enter a name for your Java class and click Finish as shown in Screenshot 13 -
    Screenshot 13
  7. Paste the following code into your newly created class (.java file) with database credentials changed to yours.
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class DB2SparkJava {
public static void main(String[] args) {

SparkConf conf = new SparkConf().setAppName("Simple Application");

conf.setMaster("local[1]");
conf.set("spark.executor.memory", "1g");

JavaSparkContext sc = new JavaSparkContext(conf);

SQLContext sqlContext = new SQLContext(sc);

Map<String, String> options = new HashMap<String, String>();
options.put(
"url",
"jdbc:db2://localhost:50000/sample:currentSchema=pallavipr;user=pallavipr;password=XXXX;");
options.put("driver", "com.ibm.db2.jcc.DB2Driver");
options.put("dbtable", "pallavipr.employee");

DataFrame jdbcDF = sqlContext.load("jdbc", options);

jdbcDF.show();

}
}


  1. Right click your newly created Java application. Select Run As → Java application. You should see similar results as Step 19 .

Wednesday 10 June 2015

Spark blog 1 - Using Spark's interactive Scala shell for accessing DB2 data using DB2 JDBC driver and Spark's new DataFrames API


My colleague Param (param.bng@in.ibm.com) and I (pallavipr@in.ibm.com) are exploring various aspects of Spark integration with DB2 and DB2 Connect drivers. We have decided to write a series of blogs capturing our experimentation for benefit of others as we did not find any article that focuses on different aspects of DB2 access via Spark.

Currently Spark shell is available in Scala and Python. This article covers accessing and filtering DB2 data via Scala shell using DB2 supplied JDBC driver (IBM Data Server Driver for JDBC and SQLJ). Below are the step by step instructions -

1) Confirm that you have Java installed by running java -version from Windows command line. JDK version 1.7 or 1.8 is recommended.

2) Install Spark on local machine by downloading spark from https://spark.apache.org/downloads.html.

3) We chose pre-built binaries as shown in Screenshot 1 (instead of source code download) to avoid building spark in early experimentation phase.

Screenshot 1

4) Unzip the installation file to a local directory (say C:/spark).

5) Start Windows command prompt.

6) Navigate to the directory that has bin folder of spark installation (c:/spark/bin).

7) Download the DB2 JDBC driver jar (db2jcc.jar or db2jcc4.jar) from http://www-01.ibm.com/support/docview.wss?uid=swg21385217 into C:\ or any other location you desire.

8) Set spark_classpath to the location of the DB2 driver by running SET SPARK_CLASSPATH=c:\db2jcc.jar

9) Run spark-shell.cmd script found in bin folder to start Spark shell using Scala.

10) If installation was successful, you should see output like Screenshot 2, followed by a Scala prompt as in Screenshot 3.

Screenshot 2

Screenshot 3

11) In Screenshot 3, you see 2 important objects are already created for you –
11.1) SparkContext – Any Spark application needs a SparkContext which tells Spark how to access a cluster. In the shell mode, a SparkContext is already created for you in a variable called sc.
11.2) SqlContext – This is needed to construct DataFrames (equivalent to relational tables) from database data and serves as the entry point for working with structured data.

12) Once you have Spark up and running, you can issue queries to DB2 on z/OS as well as DB2 LUW through the DB2 JDBC driver. Tables from DB2 database can be loaded as a DataFrame using the following options on load -
12.1) url The JDBC URL to connect to
12.2) dbtable The JDBC table that should be read. Note that anything that is valid in a `FROM` clause of a SQL query can be used.
12.3) driver The class name of the JDBC driver needed to connect to this URL.

13) From Scala command line, issue
val employeeDF = sqlContext.load("jdbc", Map("url" -> "jdbc:db2://localhost:50000/sample:currentSchema=pallavipr;user=pallavipr;password=XXXXXX;","driver" -> "com.ibm.db2.jcc.DB2Driver","dbtable" -> "pallavipr.employee"))

14) You should see output containing the table metadata as shown in Screenshot 4 -

Screenshot 4

15) To see the contents of the EMPLOYEE table, issue employeeDF.show() from Scala command line, which shows the contents of the DataFrame as captured in Screenshot 5. Show() returns first 20 records from the table by default (out of ~40 rows that exist).

Screenshot 5

16) You can further narrow the search results above by using filter criteria. For eg. If you want to see only columns employee id, firstname, lastname and job title out of all existing columns, you will issue – employeeDF.select("empno","firstnme","lastname",”job”).show(). This gives results shown in Screenshot 6.


Screenshot 6

17) Now if you want to filter out only those rows that have job title DESIGNER, issue the following from scala shell - employeeDF.filter(employeeDF("job").equalTo("DESIGNER")).show(). You will see results shown in Screenshot 7.

Screenshot 7