Apache Hadoop is an open source software platform. It is mainly used for distributed storage and distributed processing of large volumes of data (known as big data). All the components of Apache Hadoop are designed to support distributed processing in a clustered environment. The core components are the Hadoop Distributed File System (HDFS) and the MapReduce programming model. There are also other supporting components associated with the Apache Hadoop framework.
In this article, I will talk about all these components in detail.
The term big data is becoming more confusing day by day. Big data has become an industry buzzword, and today most organizations are talking about moving to big data. As Hadoop adoption grows in the industry, business intelligence and data warehouse solutions are increasingly drawn toward the big data space. Hadoop-based solutions, e.g. Hive, are growing to become competitive data warehouse solutions.
To start, let us understand the definition of Big Data – “Any data is termed ‘Big Data’ when it exceeds the processing capacity and power of traditional relational database systems. The data becomes too large, needs to move too fast, or does not fit within the current architecture of the database. Given these issues, we need to look for alternatives to handle this large volume of data”.
Problems pertaining to Big Data vary with the volume of data, its velocity, and its variability. A database which is highly structured but very high on volume is an ideal candidate to move towards Big Data. Another advantage is that getting started with a big data platform, e.g. Hadoop, doesn’t require huge costs, as it is open source. We can also make it available on the Amazon cloud platform almost instantly.
Apache Hadoop is a big data solution used for distributed computing amongst commodity servers. Big data platforms differ in how they handle Hadoop: some implementers, typically the large enterprise vendors, incorporate a Hadoop distribution of their own, while others connect Hadoop to their existing analytical database systems. In the second category, Hadoop’s strength at processing unstructured data is used in tandem with the existing analytical database. In real-world scenarios, big data implementations fall neither purely into the structured nor purely into the unstructured category. Despite the open source approach, hardly any Hadoop solution deploys raw Apache Hadoop; rather, it is packaged into distributions. These distributions pass through a testing mechanism and include additional components, e.g. management and monitoring tools. Commonly used modern-day distributions include the ones discussed below.
Apache Hadoop Modules
Hadoop has become a brand name which encompasses the following components –
- Hadoop Distributed File System or HDFS - This is a virtual file system that looks similar to any other file system. The only exception with HDFS is that when we move a file onto HDFS, the file is split into smaller blocks. Each of these blocks is then replicated on three servers (by default) to handle any exceptional situation. This replication count is not fixed and can be customized as per our needs.
- Hadoop MapReduce - This provides a mechanism to break down every request into smaller requests which are then sent out to multiple servers. This allows the scalable power of the CPUs to be used. MapReduce is a vast topic, and we will talk about it in more detail later in this document.
- HBase - Developed in the Java programming language, HBase is a layer on top of HDFS and comes with the following features -
- Non-relational
- Fault-tolerant
- ZooKeeper - This is a centralized service which is used to maintain the following -
- Configuration Information
- Naming Information
- Synchronization Information
- Solr/Lucene - These are used as the search engine. The library is developed by Apache, and it took over 10 years of work to arrive at this robust search engine.
- Programming Languages - Two higher-level languages, Pig and Hive, are closely identified as the original Hadoop programming languages; both compile down to MapReduce jobs.
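To make the HDFS behaviour described above concrete, here is a minimal sketch in plain Java that simulates splitting a file into fixed-size blocks and assigning each block to three data nodes. The sizes, node count, and round-robin placement are illustrative assumptions only; a real cluster uses 64/128 MB blocks and the NameNode’s rack-aware placement policy through the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of how HDFS splits a file into blocks and
// replicates each block across data nodes. All sizes and the naive
// round-robin placement are hypothetical, chosen only for readability.
public class HdfsBlockSketch {

    // Split a file of fileSize units into blocks of at most blockSize units.
    static List<Long> splitIntoBlocks(long fileSize, long blockSize) {
        List<Long> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            blocks.add(Math.min(blockSize, fileSize - offset));
        }
        return blocks;
    }

    // Assign each block to `replication` distinct data nodes (round-robin).
    static List<List<Integer>> placeReplicas(int blockCount, int nodeCount, int replication) {
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blockCount; b++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                nodes.add((b + r) % nodeCount);   // naive placement, for illustration only
            }
            placement.add(nodes);
        }
        return placement;
    }

    public static void main(String[] args) {
        // A 350 "MB" file with 128 "MB" blocks yields blocks of 128, 128 and 94.
        List<Long> blocks = splitIntoBlocks(350, 128);
        System.out.println("blocks: " + blocks);
        // Three replicas of each block spread over a five-node cluster.
        System.out.println("placement: " + placeReplicas(blocks.size(), 5, 3));
    }
}
```

Running `main` prints the three block sizes and, for the first block, a placement of nodes 0, 1 and 2, mirroring the default replication count of three mentioned above.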
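The MapReduce mechanism can likewise be sketched without a cluster. The toy word count below runs a map phase that emits (word, 1) pairs and a reduce phase that sums them, all inside one JVM; a real job would extend Hadoop’s Mapper and Reducer classes, and the framework would run the phases across many servers.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy word count showing the map -> shuffle -> reduce flow of Hadoop
// MapReduce, condensed into a single JVM for illustration.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in a line.
    static List<SimpleEntry<String, Integer>> map(String line) {
        List<SimpleEntry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Shuffle + reduce phase: group the pairs by word and sum the counts.
    static Map<String, Integer> reduce(List<SimpleEntry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (SimpleEntry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<SimpleEntry<String, Integer>> pairs = new ArrayList<>();
        for (String line : new String[] {"big data on hadoop", "hadoop stores big data"}) {
            pairs.addAll(map(line));   // on a cluster, each line could be mapped on a different node
        }
        System.out.println(reduce(pairs));
    }
}
```

The key idea carried over from the real framework is that each map call is independent, so the map phase parallelizes freely across servers before the reduce phase aggregates the results.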
Integrated Hadoop systems
Most of the enterprise software vendors dealing with big data have aligned their Hadoop products with the rest of their database and analytical offerings. These vendors don’t require you to source Hadoop from another party, and offer it as a core part of their big data solutions. Their offerings integrate Hadoop into a broader enterprise setting, augmented by analytical and workflow tools.
Greenplum is a relatively new entrant into the enterprise compared to its counterparts. It has been acquired by EMC and is rapidly growing into a central part of the company’s strategy. It has successfully taken the front seat in creating a platform for analytics, and is now positioned to take analytics “beyond BI” with data science teams which follow the agile methodology.
Greenplum comes with a Unified Analytics Platform (UAP) comprising three components:
- the Greenplum MPP database, used for structured data;
- a Hadoop distribution, Greenplum HD; and
- Chorus, a productivity and groupware layer for data science teams.
The HD Hadoop layer is built on a MapReduce-compatible Hadoop distribution which replaces the file system with a faster implementation and provides other features to make the system more robust. HD and the Greenplum Database are interoperable, so a single query can access both database and Hadoop data.
Chorus is a unique feature which affirms Greenplum’s commitment to the idea of data science. It also underlines the importance of the agile team element, so that big data can be exploited in an efficient manner.
InfoSphere BigInsights is a Hadoop distribution from IBM. It is part of a suite of products offered under IBM’s “InfoSphere” information management brand. In fact, almost anything big data related at IBM is usually labelled “Big”, which is appropriate enough for a company affectionately known as “Big Blue.”
BigInsights implements Hadoop with an array of features. This includes:
- Management tools
- Administration tools
It also offers textual data analysis tools which help with entity resolution, e.g. identifying people, addresses, phone numbers, and a lot more.
The JAQL query language from IBM provides an integration point between Hadoop and other IBM products, e.g. relational databases or Netezza data warehouses.
InfoSphere BigInsights can easily operate with IBM’s other database and warehouse products, including DB2, Netezza and of course the InfoSphere warehouse and analytics lines. To enhance analytical exploration, BigInsights comes with BigSheets, a spreadsheet-like interface on top of big data.
IBM addresses the streaming of big data separately, using its InfoSphere Streams product. At present, BigInsights is not offered in appliance form, but it can be used over the cloud via Rightscale, Amazon, Rackspace, or IBM Smart Enterprise Cloud.
As a pioneer in the industry, Microsoft has accepted Hadoop as the core of its big data offering. It is also pursuing an integrated approach aimed at making big data available across its analytical tool suite, including Microsoft Excel and PowerPivot.
The big data solution from Microsoft brings Hadoop into the Windows Server platform and, in elastic form, into its cloud platform, Windows Azure. Microsoft has packaged its own distribution of Hadoop, integrated with Windows Systems Center and Active Directory. It has also contributed changes back to Apache Hadoop to make sure that an open source version of Hadoop runs smoothly on Windows.
On the server side, Microsoft integrates with its SQL Server database and its corresponding data warehouse product. Using their warehouse solutions is not mandatory; however, it is advisable to use components from a single company so that interoperability and data exchange are smoother.
Deployment can be done on the server, on the cloud, or on a combination of both. Jobs written for the Apache Hadoop distribution should be able to migrate into Microsoft’s environment with minimal changes.
Oracle announced its entry into the big data world at the end of 2011 by taking the appliance-based approach. The Big Data Appliance from Oracle integrates Hadoop, and for analytics a new database named Oracle NoSQL was launched, with connectors to the Oracle database and the Exadata data warehousing product line.
Oracle’s approach is well known for its ability to cater to the needs of high-volume data enterprises, targeting in particular the rapid-deployment, high-performance end of the spectrum. Oracle is the only vendor to include the popular R analytical language integrated with Hadoop, and to ship a NoSQL database of its own design rather than Hadoop HBase.
It should be noted that rather than developing its own Hadoop distribution, Oracle has partnered with Cloudera for Hadoop support. This brings a mature and established Hadoop solution into their arena. Database connectors also promote the integration of structured Oracle data with the unstructured data stored in Hadoop HDFS.
The NoSQL Database from Oracle is a scalable key-value database built on top of the Berkeley DB technology. In doing this, Oracle owes a debt of gratitude to Cloudera CEO Mike Olson, who was previously the CEO of Sleepycat, the creator of Berkeley DB. Oracle has successfully positioned the NoSQL database as a means of acquiring big data prior to analysis.
The Oracle R Enterprise product offers easy integration with the Oracle database as well as Hadoop. This enables R scripts to run on data without having to round-trip it out of the data stores.
Analytical databases with Hadoop connectivity
MPP, or Massively Parallel Processing, databases are specialized in processing structured big data, in contrast to the unstructured data which is Hadoop’s specialty. Along with Greenplum, Aster Data and Vertica were early pioneers of big data products before the mainstream emergence of Hadoop.
These MPP solutions are databases specialized for analytical workloads and data integration, and they provide connectors to Hadoop and data warehouses. Recently, a spate of acquisitions has seen these products become the analytical arms of data warehouse and storage vendors –
- Teradata acquired Aster Data,
- EMC acquired Greenplum, and
- HP acquired Vertica.
Directly employing Hadoop is an alternate route to create a big data solution, especially when the infrastructure doesn’t fall neatly into the product line of major vendors. Practically every modern database now has the feature to connect to Hadoop. Also we have multiple Hadoop distributions to choose from.
Reflecting the developer-driven ethos of the big data world, Hadoop distributions are frequently offered in a community edition. These editions lack the enterprise management features but contain all the functionality required for evaluation and sample development.
The initial iterations of Hadoop distributions, such as those offered by Cloudera and IBM, focused on usability and administration. We now also see performance-based improvements to Hadoop, e.g. from MapR and Platform Computing. While maintaining API compatibility, these vendors replace slow or fragile parts of the Apache distribution with better performing or more robust components.
Cloudera is the oldest-established provider of Hadoop distributions. Cloudera provides an enterprise Hadoop solution, along with services, training and other support options. Apart from Yahoo, Cloudera has made some of the largest open source contributions to the Hadoop community.
Hortonworks is a recent entrant in the big data community, but it has a long history with Hadoop. Spun out of Yahoo, which is also the originator of Hadoop, Hortonworks aims to stick close to and promote the core Apache Hadoop technology. Hortonworks is also a partner of Microsoft, assisting and accelerating Microsoft’s Hadoop integration.
Selecting the appropriate Hadoop/Big Data platform
Big data has become an important topic of discussion in most organizations these days. We all understand that there is no standard definition for the term “big data”, and we know that Hadoop has become the default choice when it comes to selecting a processing tool for big data. Today, almost all the big software giants like IBM, Oracle, SAP, and even Microsoft use Hadoop in the big data solutions they provide to their customers. However, once we have taken the decision to use Hadoop, the first questions that arise are how to start and which product to select for our big data processes. Many alternatives exist for installing Hadoop and implementing big data processes. Let us talk about these alternatives in detail.
Alternatives for Hadoop Platforms
The following picture shows the multiple alternatives for Hadoop platforms. We can choose to install just the Apache release, select one of the Hadoop distributions from the different vendors, or use a big data suite. It must be noted that every distribution contains Apache Hadoop, and every big data suite contains at least one Hadoop distribution.
Figure 1. Alternative Hadoop platforms.
The differences amongst the various Hadoop distributions are thin. While choosing an appropriate tool, we must consider the following differentiators –
- Cloudera: This is the most commonly used distribution of Hadoop so far, with the largest number of referenced deployments. Cloudera ships powerful built-in deployment, management and monitoring tools. The best example is Impala, which was developed and contributed by Cloudera and offers real-time processing of big data.
- Hortonworks: This is the only vendor which ships 100% open source Apache Hadoop without any modifications or customization. In fact, Hortonworks was the first vendor to use the Apache HCatalog functionality for metadata services. In addition, the Stinger initiative from Hortonworks massively optimizes the Hive project. Hortonworks comes with a very simple and easy-to-use sandbox which helps in getting started rapidly. Enhancements developed by Hortonworks and committed to the core trunk make Apache Hadoop run natively on the Microsoft Windows platforms, including Windows Server and Windows Azure.
- MapR: This uses concepts somewhat different from its competitors. The most significant is its support for a native UNIX file system instead of HDFS, using non-open-source components, which achieves better performance and is much easier to use. We can take advantage of UNIX by executing native UNIX commands instead of the Hadoop commands. In addition, MapR differentiates itself with high availability features, e.g. snapshots, mirroring or stateful failover.
- Amazon Elastic MapReduce (EMR): This differs from the others in that it is a hosted solution running on the web-scale infrastructure of Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). In addition to Amazon’s distribution, we can also use MapR on EMR. An important feature is its on-demand, temporary clusters: if we need to execute one-time or infrequent big data processing, EMR might be an option which saves a lot of money. But there are some disadvantages, too. Only Pig and Hive from the Hadoop ecosystem are included, so a lot is missing by default. In addition, EMR is highly tuned to work with data in S3, which has higher latency and does not locate the data on our computational nodes.
The above distributions have one thing in common: they can be used flexibly, either on their own or in combination with various big data suites. Other distributions coming up these days are not as flexible and bind us to specific software and even hardware stacks. Consider the example of EMC’s Pivotal HD, which is natively fused with Greenplum’s analytic database and offers real SQL queries and better performance on top of Hadoop; the Intel version of Hadoop is another such stack-bound distribution.
Hence we can say that if we already have a specific vendor stack in our enterprise, we must check which Hadoop distributions it supports. For example, if we are using the Greenplum database, then Pivotal HD might be a perfect choice, whereas in other cases more flexible solutions could be more adequate. For instance, if we are already using Talend ESB and want to start our big data project with Talend Big Data, then we are free to choose whichever Hadoop distribution we desire, as Talend does not rely on a specific vendor of a Hadoop distribution.
In order to make the right choice, we must study the concepts of each Hadoop distribution and try out multiple options. We also need to check out the tooling and analyse the costs for enterprise versions along with commercial support. Based on these factors, we can then decide which distribution is the right one for our organization.
When should we use a Hadoop distribution?
Because of its numerous advantages, e.g. packaging, tooling and commercial support, a Hadoop distribution should be used in most use cases. It is not advisable to take the plain Apache Hadoop release and build our own distribution on top of it; in that case we would need to test our own packaging, build our own tooling, and write patches ourselves as and when required.
Even with a Hadoop distribution, however, a lot of effort remains. We still need to write a lot of code for our MapReduce jobs and to integrate our different data sources into Hadoop. This is the area where the big data suites come in.
Big Data Suite
We can use a big data suite on top of Apache Hadoop or a Hadoop distribution. A big data suite typically supports multiple Hadoop distributions under one umbrella, though there are some vendors which implement their own Hadoop solution. Whatever approach we take, a big data suite adds some additional features to distributions in order to process big data. These features are:
- Tooling: Normally, a big data suite is based on top of an IDE, e.g. Eclipse. Additional plugins are available which make the development of big data applications easier. We can create, build and deploy big data services in the development environment of our choice.
- Modeling: Apache Hadoop or any other Hadoop distribution offers the infrastructure for Hadoop clusters. However, we still have to write a lot of complex code to build our MapReduce programs. We can write this code in plain Java, or we can use optimized languages such as Pig Latin or the Hive Query Language (HQL), which generate the MapReduce code. A big data suite offers graphical tools for modeling our big data services; the required code is auto-generated, and we just configure our jobs by defining the required parameters, if any. Realizing big data jobs thus becomes a much easier and more efficient task.
- Code Generation: The entire code is auto-generated. We do not need to write, debug, analyse or optimize our MapReduce code.
- Scheduling: The execution of big data jobs needs to be scheduled and monitored. Rather than writing scheduling jobs or code ourselves, we can use the big data suite to define and manage the execution plans in a much more efficient manner.
- Integration: A basic requirement of Hadoop is to integrate data from all kinds of technologies and products. In addition to files and SQL databases, we also need to integrate data from NoSQL databases, from social media such as Twitter or Facebook, from messaging middleware, and even from B2B products such as Salesforce or SAP. A big data suite helps by offering connectors from all these different interfaces to Hadoop and back. We do not have to write the glue code by hand; we just use the graphical tools to integrate and map all this data. The integration mechanism also includes data quality features, e.g. data cleansing, which are equally important for improving the quality of imported data.
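The scheduling feature above can be approximated in plain Java to show the kind of plumbing a suite automates. This is a minimal sketch assuming a simple fixed-rate execution plan; real suites add dependency management, retries, and monitoring on top, and the interval here is purely illustrative.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch of recurring job scheduling with plain Java, standing in
// for the execution plans a big data suite would manage graphically.
public class JobSchedulerSketch {

    // Run `job` every `periodMillis` ms until it has executed `times` times.
    static void runRepeatedly(Runnable job, long periodMillis, int times) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch done = new CountDownLatch(times);
        scheduler.scheduleAtFixedRate(() -> { job.run(); done.countDown(); },
                                      0, periodMillis, TimeUnit.MILLISECONDS);
        try {
            done.await();                        // block until the job has run `times` times
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt status
        } finally {
            scheduler.shutdownNow();             // stop the recurring schedule
        }
    }

    public static void main(String[] args) {
        // Pretend "big data job": in practice this would submit a MapReduce job.
        runRepeatedly(() -> System.out.println("running big data job"), 100, 3);
    }
}
```

The point of the sketch is the contrast: even this trivial fixed-rate plan needs careful thread and shutdown handling when written by hand, which is exactly what a suite’s scheduling layer hides.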
Let us conclude our discussion with the following bullets:
- Any data is termed to be ‘Big Data’ when it exceeds the processing capacity of the conventional database systems.
- Problems related to Big Data may vary with the volume of data, its velocity, and its variability.
- A database which is highly structured but very high on volume is an ideal candidate to move towards Big Data.
- Apache Hadoop comes with the following modules:
- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce
- HBase
- ZooKeeper
- Solr/Lucene
- Programming Languages
- There are many vendors which provide integrated Hadoop offerings -
- EMC Greenplum
- IBM InfoSphere BigInsights
- Oracle Big Data Appliance and Oracle NoSQL
- Microsoft Windows Azure
- In addition to the above, several alternatives exist for Hadoop installations. We can use just the Apache Hadoop project and create our own distribution out of the Hadoop ecosystem. Vendors of Hadoop distributions, e.g. Cloudera, Hortonworks or MapR, include several features on top of core Apache Hadoop, e.g. management and monitoring tools or commercial support, to reduce effort. We must be sure to evaluate the different alternatives before taking the final call for our big data project. The key distributions to evaluate include -
- Cloudera
- Hortonworks
- MapR
- Amazon Elastic MapReduce (EMR)
- A big data suite runs on top of Apache Hadoop or a Hadoop distribution and comes with the following features -
- Tooling
- Modeling
- Code Generation
- Scheduling
- Integration
I hope you enjoyed this article and now have a clearer picture of Hadoop and big data.