Wednesday 12 August 2015

Spark blog 4 - Persisting Spark DataFrames into DB2


There are several use cases where data in Spark needs to be persisted in a backend database. Enterprise-wide analytics may require loading data into Spark from different data sources, applying transformations, performing in-memory analytics, and writing the transformed data back to an enterprise RDBMS such as DB2.

In this blog, simple techniques are shown using the latest Spark release to load data from a JSON file into Spark and write it back to DB2 using the DB2 JDBC driver supplied by IBM.

Step 1)
Download the latest pre-built Spark library (1.4.1) from http://spark.apache.org/downloads.html. With the rapid evolution of Spark, many methods from 1.3 have been deprecated, so it is best to experiment with the latest release.

Step 2)
In your Eclipse Scala IDE build path, add the Spark library and the DB2 JDBC driver as shown below -

Step 3)
Create a JSON file with the following contents -

    { "EMPNO":10, "EDLEVEL":18, "SALARY":152750, "BONUS":1000 }
    { "EMPNO":20, "EDLEVEL":18, "SALARY":94250, "BONUS":800 }

Step 4)
Create a Scala application with the following logic -
   import java.util.Properties
   import org.apache.spark.{SparkConf, SparkContext}
   import org.apache.spark.sql.SQLContext

   1:  val DB2_CONNECTION_URL = "jdbc:db2://localhost:50000/sample:" +
         "currentSchema=pallavipr;user=pallavipr;password=XXXXXX;" +
         "traceFile=C:/Eclipse_Scala/trace_scala.txt;"
   2:    
   3:    val conf = new SparkConf()
   4:      .setMaster("local[1]")
   5:      .setAppName("GetEmployee")
   6:      .set("spark.executor.memory", "1g")
   7:   
   8:    val sc = new SparkContext(conf)
   9:    val sqlcontext = new SQLContext(sc)
  10:    val path = "C:/Eclipse_Scala/empint.json"
  11:   
  12:    val empdf = sqlcontext.read.json(path)
  13:    empdf.printSchema()
  14:    empdf.show()
  15:    
  16:    Class.forName("com.ibm.db2.jcc.DB2Driver");
  17:    
  18:    val prop = new Properties()
  19:    prop.put("spark.sql.dialect" , "sql"); 
  20:      
  21:    empdf.write.jdbc(DB2_CONNECTION_URL, "PALLAVIPR.EMPLOYEESALARY", prop)


Step 5)
The JSON file is loaded into Spark in Line 12 using the new DataFrameReader API introduced in Spark 1.4.0.
 
Step 6)
The DB2 JDBC driver is loaded in Line 16 to carry out the write operation to DB2.

Step 7)
On running this Scala program, you will see the following schema output from the printSchema method on the DataFrame created from the JSON file -
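For the JSON file above, the printed schema should look roughly like the following (Spark's JSON reader infers whole numbers as long and lists fields alphabetically; exact output may vary slightly by version):

    root
     |-- BONUS: long (nullable = true)
     |-- EDLEVEL: long (nullable = true)
     |-- EMPNO: long (nullable = true)
     |-- SALARY: long (nullable = true)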

 
 
Step 8)
Printing the DataFrame using the DataFrame's show method produces the following output -
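For the two sample records, the output of show should look roughly like this (column order follows the alphabetically inferred schema):

    +-----+-------+-----+------+
    |BONUS|EDLEVEL|EMPNO|SALARY|
    +-----+-------+-----+------+
    | 1000|     18|   10|152750|
    |  800|     18|   20| 94250|
    +-----+-------+-----+------+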




Step 9)
The final write to DB2 is done using the DataFrameWriter jdbc API introduced in 1.4.0 (as shown in Line 21), which under the covers generates the CREATE TABLE and INSERT SQL statements for table EMPLOYEESALARY.
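As a rough sketch, the SQL generated under the covers is equivalent to the following (the exact statements depend on Spark's JDBC type mappings; inferred long columns map to BIGINT):

    CREATE TABLE PALLAVIPR.EMPLOYEESALARY (BONUS BIGINT, EDLEVEL BIGINT, EMPNO BIGINT, SALARY BIGINT)
    INSERT INTO PALLAVIPR.EMPLOYEESALARY VALUES (?, ?, ?, ?)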
 
Step 10)
You can verify that the table was created and the JSON data was inserted into DB2 using any tool of your choice -
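One quick check from Spark itself is to read the table back over the same JDBC connection; this is a sketch that reuses the DB2_CONNECTION_URL and prop values defined in the program above -

    val empread = sqlcontext.read.jdbc(DB2_CONNECTION_URL, "PALLAVIPR.EMPLOYEESALARY", prop)
    empread.show()

If the write succeeded, show should print the same two rows that were loaded from the JSON file.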


Note that we are working through a couple of issues with writing back String data types to DB2 from Spark; however, that should not stop you from experimenting with numeric types. Watch this space for more blogs and updates on DB2-Spark.
