There
are several use cases where data in Spark needs to be persisted in a
backend database. Enterprise wide analytics may require load of data
into Spark from different data sources, apply transformations, perform
in-memory analytics and write the transformed data back to a enterprise
RDMS such as DB2.
In
this blog, simple techniques are shown using latest Spark release to
load data from a JSON file into Spark and write that back into DB2 using
DB2 supplied JDBC driver.
Step 1)
Download latest pre-built Spark library (1.4.1) from http://spark.apache.org/downloads.html. With the rapid evolution in Spark, many methods in 1.3 have been deprecated, and it is best to experiment with the latest.
Step 2)
Step 3)
Create a json file with following contents -
{ "EMPNO":10, "EDLEVEL":18, "SALARY":152750, "BONUS":1000 }
{ "EMPNO":20, "EDLEVEL":18, "SALARY":94250, "BONUS":800 }
Step 4)
Create a Scala application with following logic -
1: val DB2_CONNECTION_URL = "jdbc:db2://localhost:50000/sample:
currentSchema=pallavipr;user=pallavipr;password=XXXXXX;traceFile=
C:/Eclipse_Scala/trace_scala.txt;";
2:
3: val conf = new SparkConf()
4: .setMaster("local[1]")
5: .setAppName("GetEmployee")
6: .set("spark.executor.memory", "1g")
7:
8: val sc = new SparkContext(conf)
9: val sqlcontext = new SQLContext(sc)
10: val path = "C:/Eclipse_Scala/empint.json"
11:
12: val empdf = sqlcontext.read.json(path)
13: empdf.printSchema()
14: empdf.show()
15:
16: Class.forName("com.ibm.db2.jcc.DB2Driver");
17:
18: val prop = new Properties()
19: prop.put("spark.sql.dialect" , "sql");
20:
21: empdf.write.jdbc(DB2_CONNECTION_URL, "PALLAVIPR.EMPLOYEESALARY", prop)
Step 5)
JSON file is loaded into Spark in Line 12 using new DataFrameReader introduced
in Spark 1.4.0.
Step 6)
DB2 JDBC driver is loaded in Line 16 to carry out the write operation to DB2.
Step 7)
On running this Scala program, you will see following schema output from
printSchema method on DataFrame created from JSON file -
Step 8)
Print of DataFrame using Dataframes show method produces following output- Step 9)
Final write to DB2 is done using DataFrameWriter jdbc API introduced in 1.4.0
(as shown in Line 21) which under the covers generate the CREATE TABLE and
INSERT sql's for table EMPLOYEESALARY.
Step 10) You can verify that the table is created and the JSON data is inserted into
DB2 using any tool of your choice -
Note that we are working through couple of issues on write back of String data types in DB2 from Spark, however, that should not stop you from experimenting with numeric types. Watch this space for more blogs and updates on DB2-Spark.