Tuesday, 16 June 2015

Spark blog 2 - Accessing DB2 data in Spark via stadalone Scala and Java programs in Eclipse.


My colleague Param (param.bng@in.ibm.com) and I (pallavipr@in.ibm.com) are exploring various aspects of Spark integration with DB2 and DB2 Connect drivers. We have decided to write a series of articles capturing our experimentation for the benefit of others as we did not find any article that focuses on different aspects of DB2 access via Spark.

Our first article in the series covered DB2 access via Spark Scala shell. This second article focuses on accessing DB2 data from via standalone Scala and Java program in Eclipse using DB2 JDBC driver and DataFrames API. Below are the detailed step by step instructions. Note that same instructions will apply to DB2 on all platforms (z/OS, LUW, I) as well as Informix.

  1. Confirm that you have Java installed by running java -version from Windows command line. JDK version 1.7 or 1.8 is recommended.
  2. Install Spark on local machine by downloading spark from https://spark.apache.org/downloads.html.
  3. We chose pre-built binaries as shown in Screenshot 1 (instead of source code download) to avoid building spark in early experimentation phase.
    Screenshot 1



  4. Unzip the installation file to a local directory (say C:/spark).
  5. Download Scala Eclipse IDE from http://scala-ide.org/download/sdk.html
  6. Unzip scala-SDK-4.1.0-vfinal-2.11-win32.win32.x86_64.zip into a folder (say c:\Eclipse_Scala)
  7. Find eclipse.exe from eclipse folder and run. Make sure you have 64-bit Java installed by running java -version from cmd prompt. Incompatibility between 64 bit Eclipse package and 32-bit Java will give an error and Eclipse would not start.
  8. Choose a workspace for your Scala project as shown in Screenshot 2.

    Screenshot 2


  9. Create a new Scala project using File->New Scala Project.
  10. Add Spark libraries downloaded in Step 6 to the newly created Scala project as shown in Screenshot 3.
    Screenshot 3
  11. You may see an error about more than 1 scala libraries as shown in Screenshot 4 since Spark has its own copy of Scala library.
Screenshot 4

  1. Remove Scala reference from the Java build path as shown in Screenshot 5 to remove the error. 
    Screenshot 5
  2. You may see another error “The version of scala library found in the build path of DB2SparkAccess (2.10.4) is prior to the one provided by scala IDE (2.11.6). Setting a Scala Installation Choice to match”. Right click Project->Properties->Scala Compiler and change project setting to 2.10 as shown in Screenshot 6.
    Screenshot 6

  3. After clicking OK, project gets rebuilt and you will only see a warning about different Scala versions that you can ignore.
  4. Now you can right click DB2SparkAccess project and choose New Scala App as shown in Screenshot 7. Enter application name and click Finish.

    Screenshot 7

  1. Copy the following source code into the new Scala application you have created (.scala file) and modify the database credentials to yours.

    import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object DB2SparkScala extends App {
val conf = new SparkConf()
.setMaster("local[1]")
.setAppName("GetEmployee")
.set("spark.executor.memory", "1g")

val sc = new SparkContext(conf)

val sqlContext = new SQLContext(sc)

val employeeDF = sqlContext.load("jdbc", Map(
"url" -> "jdbc:db2://localhost:50000/sample:currentSchema=pallavipr;user=pallavipr;password=XXXX;",
"driver" -> "com.ibm.db2.jcc.DB2Driver",
"dbtable" -> "pallavipr.employee"))

employeeDF.show();
}


  1. Right click the application and select Run As-> Scala application as shown in Screenshot 8-


    Screenshot 8
  2. You may see the following exception - Exception in thread "main" java.lang.ClassNotFoundException: com.ibm.db2.jcc.DB2Driver. To get rid of the above exception, select Project->Properties and configure Java Build Path to include the IBM DB2 JDBC driver (db2jcc.jar or db2jcc4.jar) as shown in Screenshot 9. JDBC driver can be downloaded from http://www-01.ibm.com/support/docview.wss?uid=swg21385217
    Screenshot 9


  3. Now click on your Scala application and select Run As->Scala Application again and you should see the employee data retrieved from DB2 table as shown in Screenshot 10.
    Screenshot 10
  4. To perform similar access via a standalone Java program, Click on Project->New->Other as shown in Screenshot 11.  
    Screenshot 11
  5. Select Java->Class and click Next that takes you to Screenshot 12.
    Screenshot 12
  6. Enter a name for your Java class and click Finish as shown in Screenshot 13 -
    Screenshot 13
  7. Paste the following code into your newly created class (.java file) with database credentials changed to yours.
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class DB2SparkJava {
public static void main(String[] args) {

SparkConf conf = new SparkConf().setAppName("Simple Application");

conf.setMaster("local[1]");
conf.set("spark.executor.memory", "1g");

JavaSparkContext sc = new JavaSparkContext(conf);

SQLContext sqlContext = new SQLContext(sc);

Map<String, String> options = new HashMap<String, String>();
options.put(
"url",
"jdbc:db2://localhost:50000/sample:currentSchema=pallavipr;user=pallavipr;password=XXXX;");
options.put("driver", "com.ibm.db2.jcc.DB2Driver");
options.put("dbtable", "pallavipr.employee");

DataFrame jdbcDF = sqlContext.load("jdbc", options);

jdbcDF.show();

}
}


  1. Right click your newly created Java application. Select Run As → Java application. You should see similar results as Step 19 .

Wednesday, 10 June 2015

Spark blog 1 - Using Spark's interactive Scala shell for accessing DB2 data using DB2 JDBC driver and Spark's new DataFrames API


My colleague Param (param.bng@in.ibm.com) and I (pallavipr@in.ibm.com) are exploring various aspects of Spark integration with DB2 and DB2 Connect drivers. We have decided to write a series of blogs capturing our experimentation for benefit of others as we did not find any article that focuses on different aspects of DB2 access via Spark.

Currently Spark shell is available in Scala and Python. This article covers accessing and filtering DB2 data via Scala shell using DB2 supplied JDBC driver (IBM Data Server Driver for JDBC and SQLJ). Below are the step by step instructions -

1) Confirm that you have Java installed by running java -version from Windows command line. JDK version 1.7 or 1.8 is recommended.

2) Install Spark on local machine by downloading spark from https://spark.apache.org/downloads.html.

3) We chose pre-built binaries as shown in Screenshot 1 (instead of source code download) to avoid building spark in early experimentation phase.

Screenshot 1

4) Unzip the installation file to a local directory (say C:/spark).

5) Start Windows command prompt.

6) Navigate to the directory that has bin folder of spark installation (c:/spark/bin).

7) Download the DB2 JDBC driver jar (db2jcc.jar or db2jcc4.jar) from http://www-01.ibm.com/support/docview.wss?uid=swg21385217 into C:\ or any other location you desire.

8) Set spark_classpath to the location of the DB2 driver by running SET SPARK_CLASSPATH=c:\db2jcc.jar

9) Run spark-shell.cmd script found in bin folder to start Spark shell using Scala.

10) If installation was successful, you should see output like Screenshot 2, followed by a Scala prompt as in Screenshot 3.

Screenshot 2

Screenshot 3

11) In Screenshot 3, you see 2 important objects are already created for you –
11.1) SparkContext – Any Spark application needs a SparkContext which tells Spark how to access a cluster. In the shell mode, a SparkContext is already created for you in a variable called sc.
11.2) SqlContext – This is needed to construct DataFrames (equivalent to relational tables) from database data and serves as the entry point for working with structured data.

12) Once you have Spark up and running, you can issue queries to DB2 on z/OS as well as DB2 LUW through the DB2 JDBC driver. Tables from DB2 database can be loaded as a DataFrame using the following options on load -
12.1) url The JDBC URL to connect to
12.2) dbtable The JDBC table that should be read. Note that anything that is valid in a `FROM` clause of a SQL query can be used.
12.3) driver The class name of the JDBC driver needed to connect to this URL.

13) From Scala command line, issue
val employeeDF = sqlContext.load("jdbc", Map("url" -> "jdbc:db2://localhost:50000/sample:currentSchema=pallavipr;user=pallavipr;password=XXXXXX;","driver" -> "com.ibm.db2.jcc.DB2Driver","dbtable" -> "pallavipr.employee"))

14) You should see output containing the table metadata as shown in Screenshot 4 -

Screenshot 4

15) To see the contents of the EMPLOYEE table, issue employeeDF.show() from Scala command line, which shows the contents of the DataFrame as captured in Screenshot 5. Show() returns first 20 records from the table by default (out of ~40 rows that exist).

Screenshot 5

16) You can further narrow the search results above by using filter criteria. For eg. If you want to see only columns employee id, firstname, lastname and job title out of all existing columns, you will issue – employeeDF.select("empno","firstnme","lastname",”job”).show(). This gives results shown in Screenshot 6.


Screenshot 6

17) Now if you want to filter out only those rows that have job title DESIGNER, issue the following from scala shell - employeeDF.filter(employeeDF("job").equalTo("DESIGNER")).show(). You will see results shown in Screenshot 7.

Screenshot 7

Thursday, 7 May 2015

Near continuous availability of data centric applications becomes a reality

With the advancements in DB2 server clustering technologies (both Sysplex and pureScale) and the DB2 Connect client drivers, it is now possible to achieve near continuous application availability. Availability is now no longer about only database server availability, but there is more focus on availability of applications that access the data stored on the server. Having the server up and running is not very meaningful if the applications connected to the server suffer connection failures and downtime. And customers rightfully have come to expect that their local and distributed applications originating from around the world, stay up and running 24X7.

DB2 Connect client driver availability technology has grown leaps and bounds over the last few years – this ensures that applications continue to function in case of server unplanned outages as well as planned outages such as rolling upgrades when members of a DB2 cluster are migrated one by one. Client driver availability algorithms complement server availability technologies to ensure highest availability in a distributed environment. A combination of features around workload balancing and failover have been built into DB2 Connect client drivers. While workload balancing achieves higher throughput by distributing connections and transactions to members with maximum capacity at any point in time, failover reconnects applications to next best available member in case of a connection failure.

In recent releases of DB2 Connect, default values of properties that determine high availability behavior have been changed to values that work well out of the box for majority of customers. Given the number of products in the DB2 stack (application server, client driver, DB2) and myriad high availability knobs at each layer, customers found it difficult to achieve an optimal end-to-end configuration leading to outages that could have been avoided in the fist place. Default value changes also control resource consumption and allow faster recovery from failover. For example, the default value of maxTransportObjets was changed from -1 (unlimited connections) to a more reasonable value of 1000, that prevents runaway applications from creating too many unnecessary connections and deplete system resources.

Number of features were added for easier problem determination in complex high availability setups. For eg, new statistics were introduced to better manage socket connections to clustered DB2. More details were added to the errors that are returned to the application going against a cluster – that provide the application more information on what kind of failover happened and whether the application needs to take further action for a successful transaction. For eg, reason codes and detailed message texts were added to the -30108 error that notifies the application that the transaction did not qualify for a seamless reroute to another DB2 member and that the application should resubmit the transaction for the reroute to occur.

High availability technologies built into the DB2 server and DB2 Connect client drivers complement each other and together have come a long way to satisfy customer's needs for continuous availability of business critical data and applications. To get the most availability for distributed applications, we do recommend you to upgrade DB2 Connect client drivers to the latest available levels.

Friday, 23 January 2015

DB2 – perfect partner for Smart Application Developers

Application paradigm has evolved rapidly – new programming and scripting languages geared towards Web and Mobile spring up regularly. DB2 has embraced new developments in the application space and continues to be the database of choice for application developers. It has kept up with new programming trends - whenever a new programming language or framework gets adopted by the developer community, DB2 steps up and adds or enhances support. Selection of programming language by developers depends upon several factors such as skills, performance, usage scenario, whether libraries are available for desired functionality etc.

Primary APIs for DB2 include C and C++, Visual Basic and Visual C# (for .NET applications),and Java (JDBC and SQLJ). DB2's CLI/ODBC and JDBC drivers serve as the base for several open source wrappers provided by DB2 – such as Perl, PhP, Python, Ruby and Node.js Advantage of using the JDBC and ODBC/CLI drivers as the foundation is that they not only implement standard API, but also provide advanced features such as workload balancing, failover, security, connection management, monitoring etc. that the wrappers can take advantage of to build robust enterprise applications. DB2 also contributes actively to the open source communities to keep them up-to-date. We are also seeing adoption of frameworks and Object-Relational Mapping tools such as Hibernate, JPA, iBatis and Spring for enterprise applications that take advantage of accelerators provided by DB2 for ease of use and improved performance.


Expect to see below diagram grow fast over time as application developers experiment with new APIs and DB2 continues down the path of supporting those application developers, further strengthening its position as an ideal database server for Cloud, Analytics and Mobile.

 

Tuesday, 16 December 2014

Integrating on-prem DB2 for z/OS data with Cloud applications via Cloud Integration Service on Bluemix

DB2 for z/OS is the preferred Systems of Record for mission critical enterprise data. With the advent of Cloud paradigm, enterprise customers are looking for ways in which they can easily and securely expose DB2 for z/OS data to emerging Systems of Engagement. This has led to a hybrid cloud model, whereby public cloud offerings such as Bluemix need to access DB2 z/OS data hosted on premises.

With the growing popularity of Bluemix, customers want to quickly expose DB2 z/OS data to Bluemix developers as APIs and allow seamless integration of cloud and on-prem data. IBM Cloud Integration service for Bluemix has made that possible by adding native support for DB2 for z/OS as an enterprise endpoint. With Cloud Integration service on Bluemix, you can not only connect to on prem Db2 z/OS, browse and move data, you can also create REST APIs for DB2 z/OS CRUD operations that can be invoked easily by Bluemix applications.



To allow connectivity to DB2 z/OS running behind a firewall, you need to install and configure an on premise Websphere Cast Iron Secure Connector behind the firewall such that it has direct access to DB2. Secure Connector sets up a tunnel between company's secure network and Bluemix applications. Once you are connected to on-prem DB2 for z/OS using Secure Connector, you can sync data between on premise DB2 and Cloud Database using Data Sync capability. You can also choose to not move any data, but just expose REST APIs for Bluemix users to access DB2 data residing on prem.

You can find step by step instructions to access on-prem DB2 data from Bluemix at https://www.ng.bluemix.net/docs/#services/CloudIntegration/index.html#gettingstartedwithcloudintegation.


Thursday, 4 December 2014

Faster problem determination in DB2 Connect applications with Smarter Serviceability

As DB2 Connect 10.5 FP5 goes GA, there are several exciting new capabilities to look forward to. I want to highlight our emphasis on enhancing serviceability of DB2 Connect drivers for faster problem resolution. Adoption of DB2 Connect has proliferated as customers strive to modernize their applications for a distributed environment. Most of these customers run with a complex multi-tiered stack environment with several Application Servers and Databases. When an application failure happens, it is often extremely difficult to pinpoint the cause of the failure due to lack of sufficient diagnostics.

Although DB2 Connect drivers offer rich tracing capability, collection of traces require recycle of Application Servers, which is difficult in production environments. Problem is aggravated if symptom is intermittent and workload is high, leading to longer turnaround time for resolution and frustration for all stakeholders (customer, IBM). To overcome this, DB2 Connect Java driver now offers an online tracing capability, which means that traces can be turned on/off and levels of tracing can be altered without needing to bring down the Application Server/JVM. An efficient polling mechanism is introduced whereby the tracing related properties such as TraceFile and TraceLevel are monitored at periodic intervals – these properties can be dynamically altered and will be picked up during polling. This will alleviate customer's pain point of bringing down application servers just to collect traces in 24X7 production environments.

Another area we addressed was lack of First Failure Data Capture. Again, we rely heavily on traces and there was no way to collect minimal targeted diagnostic information on a failure without needing to turn on traces (which can be heavy weight). With Cancun, we have introduced the capability to collect relevant diagnostics pertaining to specific error scenarios in both Java (JDBC) and non-Java (CLI/ODBC) drivers without turning on traces. When certain exceptions happen (which is configurable by customers), basic diagnostic information on that exception is auto logged in designated directories (which is also configurable). This prevents loss of critical diagnostic data on the first occurrence of the failure and improves the chances of resolution on the first failure itself rather than waiting for a recurrence.

We have embarked on this journey to enhance overall serviceability of DB2 Connect to drive up customer satisfaction and improve overall product experience of all stakeholders (application developers, App Server admins, DBAs, systems programmers, IBM support and dev). There are several serviceability related requirements on our radar for future releases. If you would like to see a particular serviceability feature built into the product in the long term, do drop me a note at pallavipr@in.ibm.com.

Monday, 10 November 2014

Consider direct distributed connections to DB2 and trading up to DB2 Connect Unlimited Edition

Top banks, financial services, retail and telecom companies store their mission critical data on  DB2 for z/OS, DB2 LUW and DB2 for i. Distributed workloads accessing DB2 data have proliferated due to the ubiquity of Web and Mobile applications. DB2 Connect provides a portfolio of database drivers offering distributed access to DB2 data through standard interfaces such as JDBC, SQLJ, OLE DB, ADO, ADO.NET, CLI, embedded SQL and scripting Languages such as PhP, Python, Ruby and Perl.

For distributed access to data stored on DB2 for z/OS and DB2 for i, license entitlement to DB2 Connect must be purchased. DB2 Connect is offered in several editions (Enterprise Edition, Application Server Edition, Unlimited Edition). Introduction of Unlimited Edition for both DB2 for z/OS and DB2 for i offers flexibility for direct connections to DB2 servers (without the DB2 Connect Server or Gateway) and transparently handles growth in number of users and applications. More customers anticipate the need for more users accessing DB2 data from Web, Mobile and Cloud anytime, anywhere. DB2 Connect Unlimited Edition license charges are not based on number of users or application servers, rather it is based off the processing capacity of zSeries and iSeries servers. This allows any number of distributed users, applications and workstations to connect. It also allows server side install of DB2 Connect license, instead of replicating the license on potentially hundreds of client workstations or application servers.

We are seeing a pattern whereby the decision to trade up to DB2 Connect Unlimited Edition is taken when customers adopt direct connectivity to DB2 servers (without the DB2 Connect Server or Gateway component). With the advancements in DB2 Connect client drivers around connection management, failover and workload management, more customers are adopting direct distributed connectivity to DB2 instead of going through the DB2 Connect Server or Gateway component. Direct connections to DB2 offers several advantages -

1) Simplification of distributed connectivity by reduction of one layer.
2)  Reduced resource consumption since no resources for DB2 Connect Server machines  (hardware, software) needed. No installation, maintenance, migration of this extra component.
3) Improved availability as point of failure is removed.

4)  Improved response time of distributed applications (especially OLTP applications) due to elimination of extra hop.
 

We encourage you to evaluate simplification of DB2 Connect environment by going direct to DB2 (without Gateway) and upgrading to Unlimited Edition. You will not have to worry about growing number of users and applications who become instantly available without any prior notice from different channels such as Web, Mobile and Cloud.