Tuesday 4 August 2015

Spark Blog 3 – Simplify joining DB2 data and JSON data with Spark

Spark SQL gives powerful API to work with data across different data sources using Python, Scala and Java. In this post, we will demonstrate how easily DB2 data (both z/OS and LUW) can be loaded into Spark and joined with JSON data using DataFrames.

We will use a standalone Java program in this example. For setup of Eclipse to work with Spark, please refer to second blog in Spark series.

Step by step instructions for our sample program is given below.

  1. Create a table in DB2 (“EMPLOYEESUB”) using Command Line Processor (CLP) that contains a subset of EMPLOYEE information. DDL for EMPSUB is - CREATE TABLE "PALLAVIPR"."EMPLOYEESUB" ( "EMPNO" CHAR(6) NOT NULL , "FIRSTNME" VARCHAR(12) NOT NULL , "MIDINIT" CHAR(1) , "LASTNAME" VARCHAR(15) NOT NULL , "WORKDEPT" CHAR(3) , "COMM" DECIMAL(9,2) ). Change the schema to your schema and make sure that you are connected to the correct database you want to create your table in.



  1. Load EMPLOYEESUB table with 5 rows of data below, which is stored in a CSV file (C:\1.csv).

"000010","CHRISTINE","I","HAAS","A00",+0004220.00
"000020","MICHAEL","L","THOMPSON","B01",+0003300.00
"000030","SALLY","A","KWAN","C01",+0003060.00
"000050","JOHN","B","GEYER","E01",+0003214.00
"000060","IRVING","F","STERN","D11",+0002580.00



  1. Copy the following contents into a JSON file (employeesub.json)
    { "EMPNO":"000010", "EDLEVEL":"18", "SALARY":"152750.00", "BONUS":"1000.00" }
{ "EMPNO":"000020", "EDLEVEL":"18", "SALARY":"94250.00", "BONUS":"800.00" }
{ "EMPNO":"000030", "EDLEVEL":"20", "SALARY":"98250.00", "BONUS":"800.00" }
{ "EMPNO":"000050", "EDLEVEL":"16", "SALARY":"80175.00", "BONUS":"800.00" }
{ "EMPNO":"000060", "EDLEVEL":"16", "SALARY":"72250.00", "BONUS":"500.00" }

  1. As you can see, DB2 table EMPLOYEESUB contains 6 columns, EMPNO, FIRSTNME, MIDINIT, LASTNAME, WORKDEPT, COMM, while JSON file has following 4 keys – EMPNO, EDLEVEL, SALARY, BONUS. Our goal is to join both sets of data using EMPNO as the join key so that a combined data set can be created with all employee information in one place.

  2. Copy the following program in an Eclipse java class.

       1:   
       2:  import java.util.HashMap;
       3:  import java.util.Map;
       4:   
       5:  import org.apache.log4j.Level;
       6:  import org.apache.log4j.Logger;
       7:  import org.apache.spark.SparkConf;
       8:  import org.apache.spark.api.java.JavaSparkContext;
       9:  import org.apache.spark.sql.DataFrame;
      10:  import org.apache.spark.sql.Row;
      11:  import org.apache.spark.sql.SQLContext;
      12:   
      13:  public class DB2JsonJoinSpark {
      14:    public static void main(String[] args) {
      15:      
      16:        SparkConf conf = new SparkConf();
      17:      
      18:        conf.setMaster("local[1]");
      19:        conf.setAppName("GetUnionEmployeeTable");
      20:        conf.set("spark.executor.memory", "1g");
      21:      
      22:        String path = "C:\\Eclipse_Scala\\employeesub.json";
      23:      
      24:        JavaSparkContext sc = new JavaSparkContext(conf);
      25:        SQLContext sqlContext = new SQLContext(sc);
      26:   
      27:        Map<String, String> options = new HashMap<String, String>();
      28:        options.put("url", "jdbc:db2://localhost:50000/sample:currentSchema=pallavipr;user=pallavipr;password=XXXXXX;");
      29:        options.put("driver", "com.ibm.db2.jcc.DB2Driver");
      30:        options.put("dbtable", "pallavipr.employeesub");
      31:   
      32:        DataFrame df1 = sqlContext.load("jdbc", options);
      33:        df1.show();
      34:   
      35:        DataFrame df2 = sqlContext.jsonFile(path);
      36:        df2.show();
      37:   
      38:        DataFrame finaldf = df1.join(df2, df2.col("EMPNO").equalTo(df1.col("EMPNO")) );
      39:        System.out.println("Printing Joined Data");
      40:        finaldf.show();
      41:    }
      42:  }
     
  3. As shown above, we create SQLContext in Line 25 to work with RDMS and load DB2 data into Spark using DB2 JDBC driver as a DataFrame in Line 32. We load JSON data as Spark DataFrame in Line 35.

  4. Make sure that you have included Spark libraries and DB2 JDBC driver libraries in the Build path.   

  5. Right click the Java class and Run As Java application. You should see below output on the console.  
As you can see above, the output contains data from both DB2 table and JSON file, joined by EMPNO. This example clearly highlights the ease with which Spark enables join across disparate data sources with its powerful DataFrames API.

Enjoy your Spark testdrive with DB2!
- Pallavi (pallavipr@in.ibm.com) and Param (param.bng@in.bm.com)

3 comments:

  1. Hi - Were you able to write from dataframe into DB2 database, I am able to with int data type, but for string/varchar it's failing... Please let me know if you have a solution.

    ReplyDelete
    Replies
    1. Hi AK,
      There are couple of issues with write back of DB2 String data type, which we are working through. Watch this space for a solution.

      BTW, what is the nature of project on DB2-Spark you are involved in? Feel free to email me at pallavipr@in.ibm.com.

      -Pallavi

      Delete
  2. Hi - Were you able to write from dataframe into DB2 database, I am able to with int data type, but for string/varchar it's failing... Please let me know if you have a solution.

    ReplyDelete