
Getting Started with Lambda Architecture in Java

This article discusses how the Lambda architecture fits into the widely accepted 3Vs model of Big Data, and how to get started with it in Java.

The Lambda architecture, proposed by its creator Nathan Marz, is currently the most advanced approach to modeling Big Data applications. In this article we will look at the evolution of Big Data toward Fast Data, a new concept that promises to speed up the processing of vast amounts of information, and discuss tools whose purpose is to facilitate software development in this scenario.

Although there is no formal definition of size, shape, or applicability that covers all of its peculiarities from a purely computational point of view, Big Data can be defined as any amount of data greater than what mainstream technology is capable of processing. Note that this definition is a moving target: what is Big Data today will not be tomorrow.

The 3Vs model describes Big Data along the axes of volume, variety, and velocity, as illustrated in Figure 1. Volume refers to the exponential growth in the amount of data generated and handled by today's computer systems. Variety refers to the many different data sources we now have, and to the contrast between the SQL era, with its unified framework for describing and accessing data, and the NoSQL era, with its many competing data models (some even advocate that there should be no data model at all, that is, that data should remain completely unstructured). Finally, velocity translates into the need for ever faster systems that can process this multitude of data, structured and unstructured, in real time.

Figure 1. The 3Vs of Big Data

The first two Vs, variety and volume, are usually addressed with NoSQL databases and MapReduce. However, to cover all three axes of Big Data we must be able to deal not only with a huge amount of data coming from various sources, but also to do so at a speed close to real time.

With this problem in mind, Nathan Marz published a generic architecture that he developed while working at Twitter. Figure 2 shows an overview of the Lambda architecture. The proposal is that the same information, recorded in the figure as "new data", triggers two independent analysis flows. The first flow has two components: the "batch layer", which is responsible for persisting the data, possibly in a NoSQL database or a distributed file system (similar to what we are used to); and the "serving layer", which is responsible for running analyses over the persisted data and making the results available through different views. In parallel, the "speed layer" produces real-time analyses. Both sets of views can be queried by the final application, for example an e-commerce site, and can be further computed, merged, or aggregated.

Figure 2. Overview of Lambda architecture
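To make these roles concrete, the sketch below models the three layers as plain Java interfaces. This is purely illustrative and every name is hypothetical; in practice the batch layer might be HDFS plus MapReduce, the speed layer Storm, and the serving layer a NoSQL store, as we will see later in this article.

// Purely illustrative sketch of the Lambda layers; all names are hypothetical.
interface BatchLayer<T> {
    void store(T newData);        // append to the immutable master dataset
    long batchView(String key);   // view recomputed periodically over all data
}

interface SpeedLayer<T> {
    void process(T newData);        // update real-time views incrementally
    long realtimeView(String key);  // covers only data not yet absorbed by the batch layer
}

interface ServingLayer {
    long query(String key);  // merges the batch and real-time views for the application
}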

The idea is that these two flows are complementary, meaning that each layer plays an important role in every application. As illustrated in Figure 3, the batch layer is always one step behind real time, since it is expected to perform the more complex analyses, and to perform them over a much larger mass of data (shown in blue in the figure). Moreover, once the real-time data has been "caught up with" by the batch analysis, the real-time views of that information can simply be discarded to make room for more recent information.

Figure 3. Relationship between the data analyzed in batch and real-time

Moreover, although it is not explicit in the model, the Lambda architecture assumes the immutability of data in the batch layer. That is, it is expected that no information persisted in the batch layer is ever deleted or changed, an interesting but controversial idea. Table 1 shows an example holding address values for two users: instead of each user having a single value for address (as we are used to), we persist the complete history of each user's addresses, each entry carrying a timestamp representing the moment of its insertion.

The idea is that there is no need to modify the data: the fact that a person changes address does not change the fact that the same person had a different address in the past. Thus, the immutability of data creates a direct relationship between information and time. This has very interesting advantages, such as the ability to create different views over the same data set, the ability to delete or disable those views as they become obsolete, greater safety regarding data consistency, and the ability to recover information damaged by a programming error. Obviously there are also problems, for example possible duplication of data and an exponential increase in the amount of information stored.

User | Address          | Date
John | New York-NY      | March/1983
Mary | San Francisco-CA | September/1986
Mary | San Francisco-CA | February/2015
John | San Diego-CA     | February/2015

Table 1. Data immutability
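To make this idea concrete, below is a minimal, hypothetical Java sketch of such an append-only store: records like those of Table 1 are only ever inserted, and the "current" address is derived from the data rather than stored by overwriting.

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Minimal sketch of the append-only model from Table 1; names are hypothetical.
public final class AddressLog {

    public static final class Entry {
        final String user;
        final String address;
        final Date insertedAt;  // timestamp of insertion, as in Table 1

        Entry(String user, String address, Date insertedAt) {
            this.user = user;
            this.address = address;
            this.insertedAt = insertedAt;
        }
    }

    private final List<Entry> entries = new ArrayList<Entry>();

    // The only write operation is append; there is no update or delete.
    public void append(String user, String address, Date when) {
        entries.add(new Entry(user, address, when));
    }

    // The "current" address is simply the entry with the latest timestamp.
    public Entry currentAddress(String user) {
        Entry latest = null;
        for (Entry e : entries) {
            if (e.user.equals(user)
                    && (latest == null || e.insertedAt.after(latest.insertedAt))) {
                latest = e;
            }
        }
        return latest;
    }
}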

Although it is not revolutionary in itself, the Lambda architecture offers a good way to organize the software architect's thinking and to facilitate the exchange of information in development projects. Still, some questions arise when we look at this model. The most important are:

  • In the real world, what kinds of applications can use the Lambda architecture?
  • Aren't MapReduce and Hadoop enough for Big Data?
  • If Hadoop is not enough, which tool should we use?
  • If it is possible, why not do all the processing in real time?

To answer the first question, we must introduce a relatively new term: Fast Data. Fast Data can be defined as the ability to analyze a huge flow of information in real time. The market is moving toward Fast Data, and many analysts are beginning to discuss the requirements of this new stage of Big Data. Some applications can already benefit from it, among which we can mention:

  • Context-dependent applications;
  • Applications that depend on the user's location;
  • Emergency applications;
  • Social networks.

In the latter case, many social networks already offer a real-time experience, as we can see when we tweet or share something on Facebook. On the other hand, there are applications, such as Waze and Google search itself, that do not offer real-time updates. These applications, probably by design decision, behave much more like MapReduce: they handle a tsunami of information with scalable, high throughput, but also with high latency, that is, new information takes a while to become available.

Therefore, it is important to note that, revolutionary as it is, it is a huge mistake to treat the MapReduce paradigm as the solution to every computational problem. By design, this paradigm was developed to solve one very specific problem: increasing throughput in data analysis. That is, MapReduce, and the frameworks that implement it, such as Hadoop, were conceived to analyze huge amounts of data. However, this increase in throughput does not necessarily imply an increase in the speed of the analysis.
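To make this contrast concrete, consider the map and reduce functions of the canonical Hadoop word count (a lightly commented variant of the classic tutorial example; the job-configuration boilerplate is omitted). It scales to enormous inputs, but it only produces results once the entire batch job has finished:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BatchWordCount {

  // Map phase: runs in parallel over splits of the input, emitting (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sums the counts for each word; the output only becomes
  // available after every mapper and reducer has finished.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}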

With this limitation in mind, many tools are being developed to solve this problem. Among them we can highlight:

  • Apache Storm, also developed by Nathan Marz, offers an interesting abstraction for the development of real-time applications. The idea is to create a cluster on which developers can publish topologies responsible for performing tasks. As illustrated in Figure 4, each topology consists of two kinds of components: spouts, which are responsible for receiving the streaming data, and bolts, which process that data. The basic unit of information that flows through this architecture is called a tuple.

Figure 4. Apache Storm

  • Apache Kafka, developed in Scala at LinkedIn, is a highly scalable real-time messaging system. As shown in Figure 5, the idea is to create a broker, a software component that sits between producers and consumers of messages to manage and accelerate data analysis (a minimal producer sketch appears after this list). The company Confluent was created by the same developers of Kafka to offer the application as a service.

Figure 5. Apache Kafka

  • Apache Spark has the trump card of distributed memory usage, performing as much of the computation as possible directly in main memory. The company Databricks was likewise created by the developers of this tool to support and lead its development.
  • Apache Flume offers an interesting abstraction on top of plain MapReduce, connecting sources of streaming data to persistence in HDFS via a channel in main memory, as depicted in Figure 6.

Figure 6. Apache Flume
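As promised above, here is a minimal sketch of publishing a message with Kafka's Java producer API. The broker address, topic name, and message content are assumptions made purely for illustration:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    // The producer publishes to the broker; consumers (for example a Storm
    // spout) read the same stream independently. "page-views" is hypothetical.
    try (KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props)) {
      producer.send(new ProducerRecord<String, String>("page-views", "user42", "/product/123"));
    }
  }
}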

The first conclusion that can be drawn from this list of tools is that the Apache Foundation is leading the efforts. This is great because, in addition to vouching for the quality of the software, it ensures that the code is available for inspection and that there is a supporting community. Besides that:

  • All of these applications are designed to run on highly scalable clusters;
  • Most use tools that are also part of the Hadoop ecosystem, such as ZooKeeper;
  • All use MapReduce in some of their components;
  • Spark and Flume try to perform their work in real time through intensive use of main memory;
  • Storm achieves real time by creating monitors that manage the time spent on each activity.

So which tool should we choose? Storm seems to be the most complete application, for being the oldest and for offering a simple yet powerful abstraction.

To illustrate this simplicity, Listings 1 to 3 contain the complete code needed to run Storm in a Java development environment. This application, distributed as an example in Storm's own code base, counts the words in a series of sentences emitted by a spout. Listing 1 shows the Maven dependency that should be added to pom.xml.

Listing 1. Maven dependency for Storm

<dependency>
  <groupId>org.apache.storm</groupId>
  <artifactId>storm-core</artifactId>
  <version>0.9.4</version>
</dependency>

Listing 2 shows how a topology is created: the most important point is the declaration of the spout and the bolts. The bolts themselves are defined as inner classes in Listing 2.

Listing 2. Topology and bolt definitions

package storm.starter;
   
  import backtype.storm.Config;
  import backtype.storm.LocalCluster;
  import backtype.storm.StormSubmitter;
  import backtype.storm.task.ShellBolt;
  import backtype.storm.topology.BasicOutputCollector;
  import backtype.storm.topology.IRichBolt;
  import backtype.storm.topology.OutputFieldsDeclarer;
  import backtype.storm.topology.TopologyBuilder;
  import backtype.storm.topology.base.BaseBasicBolt;
  import backtype.storm.tuple.Fields;
  import backtype.storm.tuple.Tuple;
  import backtype.storm.tuple.Values;
  import storm.starter.spout.RandomSentenceSpout;
   
  import java.util.HashMap;
  import java.util.Map;
   
  /**
   * This topology demonstrates Storm's stream groupings and multilang capabilities.
   */
  public class WordCountTopology {
  // SplitSentence delegates its work to splitsentence.py through Storm's
  // multilang protocol (ShellBolt), emitting one tuple per word.
  public static class SplitSentence extends ShellBolt implements IRichBolt {
   
      public SplitSentence() {
        super("python", "splitsentence.py");
      }
   
      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
      }
   
      @Override
      public Map<String, Object> getComponentConfiguration() {
        return null;
      }
    }
   
  // WordCount keeps an in-memory tally per word and emits an updated
  // (word, count) pair every time a word arrives.
  public static class WordCount extends BaseBasicBolt {
      Map<String, Integer> counts = new HashMap<String, Integer>();
   
      @Override
      public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null)
          count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
      }
   
      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
      }
    }
   
    public static void main(String[] args) throws Exception {
   
    // Wire the topology: the spout feeds the "split" bolts, and the resulting
    // words are grouped by value into the "count" bolts.
    TopologyBuilder builder = new TopologyBuilder();
   
      builder.setSpout("spout", new RandomSentenceSpout(), 5);
   
      builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
      builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
   
      Config conf = new Config();
      conf.setDebug(true);
   
   
    // With command-line arguments, submit to a real cluster; otherwise run
    // inside an in-process LocalCluster for ten seconds.
    if (args != null && args.length > 0) {
        conf.setNumWorkers(3);
   
        StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
      }
      else {
        conf.setMaxTaskParallelism(3);
   
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", conf, builder.createTopology());
   
        Thread.sleep(10000);
   
        cluster.shutdown();
      }
    }
  }

Listing 3 shows the code for the spout.

Listing 3. Spout code

 package storm.starter.spout;
   
  import backtype.storm.spout.SpoutOutputCollector;
  import backtype.storm.task.TopologyContext;
  import backtype.storm.topology.OutputFieldsDeclarer;
  import backtype.storm.topology.base.BaseRichSpout;
  import backtype.storm.tuple.Fields;
  import backtype.storm.tuple.Values;
  import backtype.storm.utils.Utils;
   
  import java.util.Map;
  import java.util.Random;
   
  public class RandomSentenceSpout extends BaseRichSpout {
    SpoutOutputCollector _collector;
    Random _rand;
   
   
    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      _collector = collector;
      _rand = new Random();
    }
   
  // Called repeatedly by Storm: emits one random sentence roughly every 100 ms.
  @Override
    public void nextTuple() {
      Utils.sleep(100);
      String[] sentences = new String[]{ "the cow jumped over the moon", "an apple a day keeps the doctor away",
          "four score and seven years ago", "snow white and the seven dwarfs", "i am at two with nature" };
      String sentence = sentences[_rand.nextInt(sentences.length)];
      _collector.emit(new Values(sentence));
    }
   
    @Override
    public void ack(Object id) {
    }
   
    @Override
    public void fail(Object id) {
    }
   
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
   
  }

At this point it is easy to arrive at a somewhat dangerous conclusion: if it is possible, it is best to do everything in real time. However, it is important to remember that working in real time has a cost, which can manifest itself in different ways:

  • A better (more expensive) computing environment;
  • More highly qualified staff;
  • Higher maintenance costs when changes are needed;
  • More difficult integration with existing environments.

Therefore, when developing real-time Big Data systems, the Lambda architecture provides an important starting point as well as an interesting middle ground: we get the best of both worlds, batch and real time, in an organized way. Figure 7 illustrates a possible instance of the Lambda architecture built from three technologies:

  1. Hadoop, with its distributed file system HDFS;
  2. The NoSQL database Apache HBase; and
  3. Apache Storm.

Hadoop is used in the batch layer to store the data in HDFS and to compute views using MapReduce. These views can be data aggregations, scores, or statistical analyses; for example, an e-commerce site could use them to compute the all-time total sales of a particular product. Storm is used to process the incoming stream and create simpler views that probably cover only a short time window; the same e-commerce site could compute, say, the most visited products of the last 15 minutes. Finally, in the serving layer these views are combined and stored in HBase, making them easy for the application to access. Interestingly, even though these technologies are developed in Java and related languages (Scala and Clojure), we can use many different programming languages to develop the integration between the components.

Figure 7. One possible instance of the Lambda architecture
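As a hypothetical sketch of the serving layer in Figure 7, the code below reads two views of the same metric from HBase, the batch total computed by MapReduce and the real-time delta maintained by Storm, and merges them into a single answer for the application. The table, column family, and qualifier names are all assumptions:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SalesServingLayer {
  private static final byte[] CF = Bytes.toBytes("views"); // hypothetical column family

  // Total sales = batch view (recomputed by MapReduce over all history)
  // + real-time delta (incremented by Storm since the last batch run).
  public static long totalSales(Connection conn, String productId) throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("product_views"))) {
      Result row = table.get(new Get(Bytes.toBytes(productId)));
      long batch = readLong(row, "batch_total");
      long realtime = readLong(row, "realtime_delta");
      return batch + realtime;
    }
  }

  private static long readLong(Result row, String qualifier) {
    byte[] value = row.getValue(CF, Bytes.toBytes(qualifier));
    return value == null ? 0L : Bytes.toLong(value);
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      System.out.println("Total sales: " + totalSales(conn, "product-123"));
    }
  }
}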

Dealing with so many components is usually not simple.

In the previous example, three tools were listed that fall well outside the traditional patterns of computing. Hence there is a notable effort to simplify the deployment of this type of architecture. The most prominent project is Buildoop, a tool similar to Apache Bigtop but focused on building the ecosystem of the Lambda architecture. Buildoop is based on Groovy and JSON for defining the tools that will be used in the architecture.

Listing 4 shows the commands for building architectures from the cluster.json "recipe" (as this kind of definition is called) for different types of environments. The tool is evolving rapidly and is still in the early stages of maturity; even so, it can already be used to build complete systems (see the Links section).

Listing 4. The cluster.json recipe

deploop -f conf/cluster.json --deploy batch
deploop -f conf/cluster.json --deploy batch,speed,bus,serving
deploop --cluster production --layer batch --stop
deploop --cluster production --layer batch --start

In summary, adding another processing layer has great advantages: historical data can be processed with high precision without losing short-term information, such as the alerts and insights provided by the real-time layer. In addition, the computational load of the new layer is offset by a drastic reduction in reads and writes to the storage device, which allows much faster access.

From a conceptual point of view, although Big Data is young, its concepts evolve very quickly. It is therefore important to stay up to date on Fast Data and its applications. The Lambda architecture website offers many resources for learning more, and also provides lists of tools suited to each of the three layers: batch, speed, and serving.


Links

Lambda Architecture: http://lambda-architecture.net/

3Vs of Big Data: https://apandre.wordpress.com/2013/11/19/datawatch/

Fast Data: http://blogs.wsj.com/cio/2015/01/06/fast-data-applications-emerge-to-manage-real-time-data/

Apache Storm: https://storm.apache.org/ | https://github.com/apache/storm

Apache Kafka: http://kafka.apache.org/ | https://github.com/apache/kafka

Apache Spark: https://spark.apache.org | https://github.com/apache/spark

Apache Flume: https://flume.apache.org/ | https://github.com/apache/flume

Deploop: http://deploop.github.io/



