Spark extends the MapReduce model by avoiding data movement during processing. Through features such as in-memory data storage and near real-time processing, its performance can be several times faster than that of other Big Data technologies.
There is also support for lazy evaluation of Big Data queries, which helps optimize the overall data processing workflow. Spark additionally provides a higher-level API to improve developer productivity and a consistent architectural model for Big Data solutions.
Spark holds intermediate results in memory instead of writing them to disk, which is very useful when you need to process the same data set many times. It is designed as an execution engine that works both in memory and on disk, so Spark performs disk operations when the data no longer fits in memory. You can therefore use it to process data sets larger than the aggregate memory of a cluster.
Spark will store as much data as possible in memory and then spill the rest to disk. It is up to the system architect to look at the data and use cases to assess the memory requirements. This in-memory data storage mechanism is what gives Spark its performance advantage.
Other Spark features:
- It supports more than just the Map and Reduce functions;
- It optimizes the use of arbitrary operator graphs;
- Lazy evaluation of Big Data queries contributes to optimizing the overall data processing workflow;
- It provides concise and consistent APIs in Scala, Java and Python;
- It offers interactive shells for Scala and Python (a shell is not yet available for Java).
Spark is written in the Scala language and runs on a Java virtual machine. It currently supports the following languages for application development: Scala, Java and Python.
The Spark ecosystem
In addition to the Spark API, there are libraries that are part of its ecosystem and provide additional capabilities in the areas of Big Data analytics and machine learning.
These libraries include:
- Spark Streaming: Spark Streaming can be used to process real-time streaming data, based on micro-batch computing. For this it uses the DStream, which is basically a series of RDDs used to process the data in real time;
- Spark SQL: Spark SQL provides the capability to expose Spark data sets through a JDBC API. This allows you to run SQL-style queries on that data using traditional BI and visualization tools. It also allows users to ETL their data from different formats (such as JSON, Parquet, or a database), transform it, and expose it to ad-hoc queries;
- Spark MLlib: MLlib is the Spark machine learning library, which consists of learning algorithms, including classification, regression, clustering, collaborative filtering and dimensionality reduction;
- Spark GraphX: Spark GraphX is a new API for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD for graphs. To support graph computation, GraphX exposes a set of fundamental operators (e.g. subgraph and joinVertices) as well as an optimized variant of the Pregel API. Furthermore, GraphX includes a growing collection of algorithms to simplify graph analysis tasks.
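As a small illustration of how one of these libraries is used, the following sketch shows Spark SQL (with the Spark 1.x API) exposing an in-memory data set to ad-hoc SQL queries. It assumes a spark-shell session where the SparkContext is available as sc; the Person case class and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

// Spark 1.x style: create an SQLContext from the existing SparkContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Build a small DataFrame from an in-memory collection
val people = sc.parallelize(Seq(Person("Ana", 34), Person("Bruno", 15))).toDF()

// Register the DataFrame as a temporary table and query it with SQL
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.collect().foreach(println)
```

In a real ETL scenario, the DataFrame would instead be loaded from an external source, for example with sqlContext.read.json(...).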
Beyond these libraries, other components complete the Spark ecosystem, such as BlinkDB and Tachyon.
BlinkDB is an approximate SQL query engine that can be used to run interactive queries against large volumes of data. It allows users to trade query accuracy for response time. BlinkDB works on large data sets by running queries on data samples and presenting results annotated with meaningful error bars.
Tachyon is an in-memory distributed file system that enables reliable, fast file sharing across cluster frameworks such as Spark and MapReduce. It also caches the files being worked on, allowing different jobs/queries and frameworks to access the cached files at memory speed.
Finally, there are also integration adapters for other products, such as Cassandra (Spark Cassandra Connector) and R (SparkR). With the Cassandra Connector, you can use Spark to access data stored in a Cassandra database, and with SparkR you can perform statistical analysis on it using R.
The following diagram (Figure 1) shows how the different Spark ecosystem libraries are related to each other.
Figure 1. Spark framework libraries.
The architecture of Spark
Spark architecture includes the following components:
- Data storage;
- API;
- Management framework.
Let us consider each of these components in detail.
Spark uses the HDFS file system for data storage. It works with any Hadoop-compatible data source, including HDFS itself, HBase, Cassandra, etc.
The API allows application developers to create applications based on Spark using a standard API interface for Scala, Java and Python.
Spark can be deployed as a standalone server or on a distributed computing framework such as Mesos or YARN. Figure 2 presents the components of the Spark architecture.
Figure 2. Spark architecture.
Resilient Distributed Datasets
The Resilient Distributed Dataset, or RDD, is the central concept of the Spark framework. Think of an RDD as a database table that can store any type of data. Spark stores the RDD data in different partitions, which helps with computational reorganization and optimization of data processing.
RDDs are immutable. Although it may appear possible to modify an RDD through processing, the result of such a transformation is in fact a new RDD, while the original remains untouched.
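A minimal sketch of this behavior, assuming a spark-shell session where the SparkContext is available as sc:

```scala
// Create an RDD from a local collection, split into 2 partitions
val numbers = sc.parallelize(Array(1, 2, 3, 4), 2)

// "Modifying" the RDD actually produces a new RDD
val doubled = numbers.map(_ * 2)

numbers.collect()   // Array(1, 2, 3, 4) -- the original is untouched
doubled.collect()   // Array(2, 4, 6, 8)
```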
RDDs support two types of operations:
- Transformations: do not return a single value, but a new RDD. Nothing is evaluated when a transformation function is called; it only takes an RDD and returns a new RDD. Some of the transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe and coalesce.
- Actions: these operations evaluate and return a value. When an action function is called on an RDD object, all the pending data processing is computed and the value is returned. Some of the action operations are reduce, collect, count, first, take, countByKey and foreach.
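The distinction can be sketched in a few lines, again assuming a spark-shell session with the SparkContext available as sc:

```scala
// Transformations are lazy: nothing is computed here yet
val data = sc.parallelize(Array(1, 2, 3, 4, 5))
val evens = data.filter(_ % 2 == 0)   // transformation: returns a new RDD
val squares = evens.map(n => n * n)   // transformation: still nothing computed

// Actions trigger the computation and return a value
squares.collect()       // Array(4, 16)
squares.count()         // 2
squares.reduce(_ + _)   // 20
```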
There are a few ways to install and use Spark: you can install it on your machine to run stand-alone, or use one of the virtual machine (VM) images available from vendors such as Cloudera, Hortonworks and MapR. Alternatively, you can use a Spark instance already installed and configured in the cloud (such as Databricks Cloud - see the Links section).
In this article, we will install Spark as a stand-alone framework and run it locally. We will use version 1.6.0 for the sample application code.
How to Run Spark
Whether Spark is installed on your local machine or in the cloud, there are different modes in which you can access the Spark engine.
Table 1 shows how to configure the Master URL parameter for different modes of operation of Spark.
Table 1. Master URL parameter for all modes.
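As an illustration, when creating a SparkContext programmatically, the same Master URL values are passed via SparkConf. The application name and the cluster host below are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Run locally with a single worker thread (no parallelism)
val localConf = new SparkConf().setAppName("demo").setMaster("local")

// Run locally with 4 worker threads
val local4Conf = new SparkConf().setAppName("demo").setMaster("local[4]")

// Connect to a standalone Spark cluster master (default port 7077)
val clusterConf = new SparkConf().setAppName("demo").setMaster("spark://host:7077")

// Create the context with one of the configurations
val context = new SparkContext(localConf)
```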
Interacting with Spark
Once Spark is up and running, you can connect to it using a shell for interactive data analysis. The Spark shell is available in both Scala and Python.
To access them, run, respectively, the commands spark-shell.cmd and pyspark.cmd.
Spark Web Console
Regardless of the run mode, you can view the results and other statistics via the Web Console, available by default at the URL http://localhost:4040.
The Spark Console is shown in Figure 3 below, with tabs for Stages, Storage, Environment and Executors.
Figure 3. Spark Web Console.
Spark offers two types of shared variables to make running on a cluster efficient: broadcast variables and accumulators.
Broadcast variables: allow a read-only variable to be kept in the cache of each machine instead of sending a copy of it along with every task. These variables can be used to give the cluster nodes copies of large data sets.
The following code snippet shows how to use broadcast variables:
// Broadcast Variables
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
Accumulators: allow the creation of counters or the accumulation of sums. Jobs running on the cluster can add values to an accumulator variable using the add method. However, the tasks themselves cannot read its value: only the driver program can read the value of an accumulator.
The following code snippet shows how to create and use an accumulator:
// Accumulator
val accum = sc.accumulator(0, "My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
accum.value
Sample of a Spark Application
The sample application in this article is a simple word count application, the same example shown in many Hadoop tutorials. We will run some data analysis queries on a text file. The file and the sample data set are small, but the same Spark queries can be used on large data sets without any modifications to the code.
For ease of presentation, we will use the Scala shell commands.
First, we will see how to install Spark on your local machine.
- You must install the Java Development Kit (JDK) to work locally.
- You must also install the Spark software on your local machine. Instructions on how to do this are given below.
Note: These instructions are prepared for the Windows environment. If you are using a different operating system, you must modify the system variables and directory paths according to your environment.
Installing the Apache Spark software:
Download the latest version of Spark (see the Links section). At the time of writing, the latest version is Spark 1.6.0. You can choose a Spark build for a specific version of Hadoop. We downloaded Spark for Hadoop 2.4 or later, whose filename is spark-1.6.0-bin-hadoop2.4.tgz.
Unzip the installation file to a local directory (eg "C:\dev").
To verify the Spark installation, navigate to the directory and run the Spark shell using the following commands. These are for Windows; if you are using Linux or Mac OS, adjust the commands to work on your operating system.
c:
cd c:\dev\spark-1.6.0-bin-hadoop2.4
bin\spark-shell
If Spark is correctly installed, the following messages will appear in the console output:
….
15/01/17 23:17:46 INFO HttpServer: Starting HTTP Server
15/01/17 23:17:46 INFO Utils: Successfully started service 'HTTP class server' on port 58132.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
….
15/01/17 23:17:53 INFO BlockManagerMaster: Registered BlockManager
15/01/17 23:17:53 INFO SparkILoop: Created spark context..
Spark context available as sc.
Type the following commands to check if Spark is working correctly or not:
sc.version
(or)
sc.appName
After that, we can exit the Spark shell window by typing the following command:
:quit
To start the Spark Python shell, you must have Python installed on your machine. We suggest downloading and installing Anaconda, a free Python distribution that includes several popular Python packages for math, engineering and data analysis.
Then run the following commands:
c:
cd c:\dev\spark-1.6.0-bin-hadoop2.4
bin\pyspark
Word Count Application
Now that we have Spark up and running, we can perform data analysis queries using the Spark API.
The following are a few simple commands to read data from a file and process it. We will work with more advanced use cases in the next articles in this series.
First, we will use the Spark API to count the most popular words in a text file. Open a new Spark shell and run the following commands:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val txtFile = "README.md"
val txtData = sc.textFile(txtFile)
txtData.cache()
We call the cache function to store the RDD created in the previous step, so that Spark does not have to recompute it for each later query. Note that cache() is a lazy operation; Spark does not immediately store the data in memory. In fact, the memory is only allocated when an action is called on the RDD.
Now, we can transform this text data and count how many times each word appears in the file:
val wcData = txtData.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wcData.collect().foreach(println)
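To find the most frequent words, we can swap the key and value of each pair and sort by the count. This sketch continues the session above and assumes wcData has already been computed:

```scala
// Swap (word, count) to (count, word), sort by count descending,
// and take the ten most frequent words
val topWords = wcData.map { case (word, count) => (count, word) }
                     .sortByKey(false)
                     .take(10)
topWords.foreach(println)
```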
In this article, we discussed how the Apache Spark framework helps in the processing and analysis of Big Data with its standard API. We also compared Spark with the traditional MapReduce implementation. Spark is based on the same HDFS file storage system as Hadoop, so you can use Spark and MapReduce together. This is especially interesting if you already have a significant investment in Hadoop infrastructure and configuration.
You can also combine the core Spark processing with Spark SQL, Spark MLlib (machine learning) and Spark Streaming.
With its various integrations and adapters, you can combine Spark with other technologies. One example is using Spark, Kafka and Apache Cassandra together: Kafka for ingesting the streaming data, Spark for the processing, and finally the Cassandra database for storing the results of that processing.
But keep in mind that Spark's ecosystem is still maturing and needs improvement in areas such as security and integration with BI tools.
Links
- Databricks Cloud: http://databricks.com/product
- Spark official website: https://spark.apache.org/downloads.html