Introduction to Big Data in Java with Apache Spark

This article presents an introduction to this large and powerful framework, with which you can choose among Java, Scala or Python to develop your Big Data code.

Spark extends MapReduce by avoiding data movement during processing. Through features such as in-memory data storage and near-real-time processing, its performance can be several times faster than that of other Big Data technologies.

It also supports lazy, on-demand evaluation of Big Data queries, which helps optimize the overall data processing flow, and it provides a higher-level API that improves developer productivity and gives the Big Data solution architect a consistent model.

Spark holds intermediate results in memory instead of writing them to disk, which is very useful when you need to process the same data set repeatedly. It was designed as an execution engine that works both in memory and on disk, so Spark spills to disk when the data no longer fits in memory. You can therefore use it to process data sets larger than the aggregate memory of a cluster.

Spark stores as much data as possible in memory and then persists the rest to disk. It is up to the system architect to look at the data and use cases to assess memory requirements. This in-memory storage mechanism is what gives Spark its performance advantage.
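
As a minimal sketch of this behavior, assuming the Scala shell (where sc is already available), an RDD can be persisted with a storage level that spills to disk when its partitions no longer fit in memory; the data set below is purely illustrative:

import org.apache.spark.storage.StorageLevel

// Illustrative data set; partitions that do not fit in memory spill to disk
val numbers = sc.parallelize(1 to 1000000)
numbers.persist(StorageLevel.MEMORY_AND_DISK)
numbers.count()   // the first action materializes the persisted data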

Other Spark features:

  • It supports more than just the Map and Reduce functions;
  • It optimizes arbitrary operator graphs;
  • Lazy, on-demand evaluation of Big Data queries contributes to optimizing the overall data processing flow;
  • It provides concise and consistent APIs in Scala, Java and Python;
  • It offers an interactive shell for Scala and Python (the shell is not available for Java).

Spark is written in Scala and runs on a Java virtual machine. It currently supports the following languages for application development:

  • Scala
  • Java
  • Python
  • Clojure
  • R

The Spark ecosystem

In addition to the Spark API, there are additional libraries that are part of its ecosystem and provide extra capabilities in the areas of Big Data analysis and machine learning.

These libraries include:

  • Spark Streaming: Spark Streaming can be used to process real-time streaming data based on micro-batch computing. For this it uses DStreams, which are basically series of RDDs used to process the data in near real time;
  • Spark SQL: Spark SQL provides the ability to expose Spark data sets through a JDBC API, which allows you to run SQL-style queries on that data using traditional BI and visualization tools. It also allows users to ETL their data from different formats (such as JSON, Parquet, or a database), transform it, and expose it for ad-hoc queries, as shown in the sketch after this list;
  • Spark MLlib: MLlib is Spark's machine learning library, which consists of learning algorithms including classification, regression, clustering, collaborative filtering and dimensionality reduction;
  • Spark GraphX: GraphX is Spark's new API for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD for graphs. To support graph computation, GraphX exposes a set of fundamental operators (e.g. subgraphs and adjacent vertices) as well as an optimized variant of the Pregel API. Furthermore, GraphX includes a growing collection of algorithms to simplify graph analysis tasks.
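
As a minimal sketch of the Spark SQL item above, assuming a Spark 1.6 Scala shell (where sqlContext is already available) and a hypothetical people.json sample file:

// people.json is an assumed sample file with fields such as name and age
val people = sqlContext.read.json("people.json")
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()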

Beyond these libraries, other components complete the Spark ecosystem, such as BlinkDB and Tachyon.

BlinkDB is an approximate SQL query engine that can be used for running interactive queries against large volumes of data. It allows users to trade query accuracy for response time. BlinkDB works on large data sets by sampling the data and presenting results annotated with meaningful error bounds.

Tachyon is a memory-centric distributed file system that enables reliable, fast file sharing across cluster frameworks such as Spark and MapReduce. It also caches the working set files, allowing different jobs, queries and frameworks to access the cached files at memory speed.

Finally, there are also integration adapters for other products, such as Cassandra (Spark Cassandra Connector) and R (SparkR). With the Cassandra Connector you can use Spark to access data stored in a Cassandra database, and with SparkR you can perform statistical analysis with R.

The following diagram (Figure 1) shows how the different Spark ecosystem libraries are related to each other.

Figure 1. Spark framework libraries.

The architecture of Spark

Spark architecture includes the following components:

  • Data storage;
  • API;
  • Resource management.

Consider each of these components in detail.

Data storage:

Spark uses the HDFS file system for data storage. It works with any Hadoop-compatible data source, including HDFS itself, HBase, Cassandra, etc.
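
As a minimal sketch, assuming the Scala shell and a placeholder HDFS location (the namenode host, port and path below are illustrative, not from the article):

// The namenode host, port and path are placeholders
val hdfsData = sc.textFile("hdfs://namenode:9000/data/input.txt")
hdfsData.count()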

API:

The API allows application developers to create applications based on Spark using a standard API interface for Scala, Java and Python.

Resource management:

Spark can be deployed as a standalone server or on a distributed computing framework such as Mesos or YARN. Figure 2 presents the components of the Spark architecture.

Figure 2. Spark architecture.

Resilient Distributed Datasets

The Resilient Distributed Dataset, or RDD, is the central concept of the Spark framework. Think of an RDD as a database table that can store any type of data. Spark stores the RDD data in different partitions, which helps with computational reorganization and optimization during data processing.

RDDs are immutable. Although it is apparently possible to modify an RDD with a transformation, the result of that transformation is in fact a new RDD, while the original remains untouched.
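
A minimal sketch in the Scala shell of these two points, partitioning and immutability (the data and the partition count are only illustrative):

val data = sc.parallelize(1 to 100, 4)   // RDD explicitly split into 4 partitions
data.partitions.size                     // returns 4
val doubled = data.map(_ * 2)            // a new RDD; "data" itself is untouched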

The RDD supports two types of operations:

  • Transformation: Transformations do not return a single value but a new RDD. Nothing is evaluated when a transformation function is called; it only takes an RDD and returns a new RDD. Some of the transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe and coalesce.
  • Action: An action evaluates and returns a value. When an action function is called on an RDD object, all the queued data processing is computed and the resulting value is returned. Some of the action operations are reduce, collect, count, first, take, countByKey and foreach. The sketch after this list illustrates the difference.
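
A minimal sketch in the Scala shell, assuming a local README.md file, of how transformations stay lazy until an action runs:

val lines  = sc.textFile("README.md")            // RDD backed by a text file
val longer = lines.filter(l => l.length > 20)    // transformation: returns a new RDD, nothing executes yet
val n      = longer.count()                      // action: triggers the computation and returns a value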

Installing Spark

There are several ways to install and use Spark: you can install it on your machine to run it stand-alone, use a virtual machine (VM) available from vendors such as Cloudera, Hortonworks and MapR, or use a Spark installation already set up in the cloud (such as Databricks Cloud; see the Links section).

In this article, we will install Spark as a stand-alone framework and run it locally. We will use version 1.6.0 for the example application code.

How to Run Spark

Whether Spark is installed on your local machine or in the cloud, there are different modes in which you can access the Spark engine.

Table 1 shows how to configure the Master URL parameter for different modes of operation of Spark.

Table 1. Master URL parameter for all modes.
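
As a minimal sketch of where the Master URL parameter is used when a Spark application is built programmatically rather than through the shell (the application name and the "local[2]" value below are only illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// "local[2]" runs Spark locally with two threads; a cluster deployment
// would instead use a URL such as spark://HOST:PORT or a YARN/Mesos master.
// (In the interactive shell, a SparkContext named sc already exists.)
val conf = new SparkConf()
  .setAppName("SparkIntroExample")
  .setMaster("local[2]")
val sc = new SparkContext(conf)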

Interacting with Spark

Once Spark is up and running, you can connect to it using a shell for interactive data analysis. The Spark shell is available in Scala and Python.

To access them, run the commands spark-shell.cmd and pyspark.cmd, respectively.

Spark Web Console

Regardless of the run mode, you can view the results and other statistics via the Spark Web Console, available at the URL:

http://localhost:4040

The Spark Console is shown in Figure 3 below, with tabs for Stages, Storage, Environment and Executors.

Figure 3. Spark Web Console.

Shared variables

Spark offers two types of shared variables to make running on a cluster efficient: broadcast variables and accumulators.

Broadcast variables allow read-only variables to be kept cached on each machine instead of sending a copy along with the tasks. They can be used to give the cluster nodes copies of large data sets.

The following code snippet shows how to use broadcast variables:

// Broadcast Variables
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value

Accumulators allow the creation of counters or the accumulation of sums. Tasks running on the cluster can add values to an accumulator variable (using += in Scala); however, they cannot read its value, because only the driver program can read the value of an accumulator.

The following code snippet shows how to create and use an accumulator:

// Accumulator
val accum = sc.accumulator(0, "My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value

Sample of a Spark Application

The sample application in this article is a simple word count application, the same example shown in many Hadoop tutorials. We will run some data analysis queries on a text file. The file and the sample data set are small, but the same Spark queries can be used on large data sets without any code modifications.

For ease of presentation, we will use the Scala shell commands.

First, we will see how to install Spark on your local machine.

Prerequisites:

  • You must have the Java Development Kit (JDK) installed to work locally.
  • You also need to install the Spark software on your local machine. Instructions on how to do this are given below.

Note: These instructions are prepared for the Windows environment. If you are using a different operating system, you must modify the system variables and directory paths according to your environment.

Installing Apache software:

Download the latest version of Spark (see the Links section). The latest version at the time of writing this article is Spark 1.6.0. You can choose a specific Spark distribution depending on your version of Hadoop. We downloaded Spark for Hadoop 2.4 or later, whose file name is spark-1.6.0-bin-hadoop2.4.tgz.

Unzip the installation file to a local directory (e.g. "C:\dev").

To verify the installation, navigate to the Spark directory and start the Spark shell using the following commands. These are for Windows; if you are using Linux or Mac OS, adjust the commands to your operating system.

c:
cd c:\dev\spark-1.6.0-bin-hadoop2.4 
bin\spark-shell

If Spark is installed correctly, you should see messages like the following in the console output:

….
15/01/17 23:17:46 INFO HttpServer: Starting HTTP Server
15/01/17 23:17:46 INFO Utils: Successfully started service 'HTTP class server' on port 58132.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
….
15/01/17 23:17:53 INFO BlockManagerMaster: Registered BlockManager
15/01/17 23:17:53 INFO SparkILoop: Created spark context..
Spark context available as sc.

Type the following commands to check whether Spark is working correctly:

sc.version
 (or)  
sc.appName

After that, we can exit the Spark shell window by typing the following command:

:quit

To start the Spark Python shell you must have Python installed on your machine. We suggest downloading and installing Anaconda, a free Python distribution that includes several popular Python packages for math, engineering and data analysis.

Then run the following commands:

c:
cd c:\dev\spark-1.6.0-bin-hadoop2.4 
bin\pyspark

Word Count Application

Now that we have Spark up and running, we can perform data analysis queries using the Spark API.

The following are a few simple commands to read data from a file and process it. We will work with more advanced use cases in the next articles in this series.

First, we will use the Spark API to count the most popular words in the text. Open a new Spark shell and run the following commands:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  
val txtFile = "README.md" 
val txtData = sc.textFile(txtFile) 
txtData.cache()

We call the cache function to store the RDD created in the previous step, so that Spark does not have to recompute it for every later query. Note that cache() is a lazy operation; Spark does not immediately store the data in memory. The data is only materialized when an action is called on the RDD:

txtData.count()

The count() action above returns the number of lines in the text file. Now we can perform the word count itself and print the results:

val wcData = txtData.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
 
wcData.collect().foreach(println)
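
As a follow-up sketch, assuming the same shell session, the (word, count) pairs in wcData can also be ranked so that the most frequent words appear first (the top-10 cutoff is arbitrary):

// Sort the pairs by count in descending order and take the first 10
val top10 = wcData.sortBy({ case (_, count) => count }, ascending = false).take(10)
top10.foreach(println)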

Conclusions

In this article, we discussed how the Apache Spark framework helps with the processing and analysis of Big Data through its standard API. We also compared Spark with the traditional MapReduce implementation. Spark uses the same HDFS file storage system as Hadoop, so you can use Spark and MapReduce together. This is especially interesting if you already have a significant investment in Hadoop infrastructure and configuration.

You can also combine the core Spark processing with Spark SQL, MLlib machine learning and Spark Streaming.

With its various integrations and adapters, you can combine Spark with other technologies. One example is using Spark, Kafka and Apache Cassandra together: Kafka for ingesting the streaming data, Spark for the processing, and Cassandra for storing the results of that processing.

But keep in mind that Spark's ecosystem is still less mature and needs improvement in areas such as security and integration with BI tools.

Links

Databricks Cloud: http://databricks.com/product

Spark official website: https://spark.apache.org/downloads.html



Julio is a systems analyst and an information technology enthusiast. He is currently a developer at iFactory Solutions, working on the development of strategic systems, and is also a Java instructor. He has knowledge and experi...
