Hadoop Story
How did Hadoop get here?
As the World Wide Web grew at a dizzying pace in the late 1990s and early 2000s, search engines and indexes were created to help people find relevant information amid all of that text-based content. During the early years, search results were returned by humans. It’s true! But as the number of web pages grew from dozens to millions, automation was required. Web crawlers were created, many as university-led research projects, and search engine startups took off (Yahoo, AltaVista, etc.).
One such project was Nutch, an open-source web search engine and the brainchild of Doug Cutting and Mike Cafarella. Their goal was to return web search results faster by distributing data and calculations across different computers so that multiple tasks could be accomplished simultaneously. Around the same time, another search engine project, Google, was in progress. It was based on the same concept – storing and processing data in a distributed, automated way so that more relevant web search results could be returned faster.
In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s early work with automating distributed data storage and processing. The Nutch project was divided: the web crawler portion remained Nutch, while the distributed computing and processing portion became Hadoop (named after Cutting’s son’s toy elephant). In 2008, Yahoo released Hadoop as an open-source project, and today Hadoop’s framework and family of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors.
Why is Hadoop important?
Since its inception, Hadoop has become one of the most talked-about technologies. Why? One of the top reasons (and why it was invented) is its ability to handle huge amounts of data – any kind of data – quickly. With volumes and varieties of data growing each day, especially from social media and automated sensors, that’s a key consideration for most organizations. Other reasons include:
- Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
- Computing power. Its distributed computing model can quickly process very large volumes of data. The more computing nodes you use, the more processing power you have.
- Scalability. You can easily grow your system simply by adding more nodes. Little administration is required.
- Storage flexibility. Unlike traditional relational databases, Hadoop doesn’t require you to preprocess data before storing it – and that includes unstructured data like text, images and videos. You can store as much data as you want and decide how to use it later.
- Inherent data protection and self-healing capabilities. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. And it automatically stores multiple copies of all data.
What’s in Hadoop?
Hadoop components have funny names, which is sort of understandable knowing that “Hadoop” was the name of a yellow toy elephant owned by the son of one of its inventors. Here’s a quick rundown on names you may hear. Currently, three core components are included with your basic download from the Apache Software Foundation:
- HDFS – the Java-based distributed file system that can store all kinds of data without prior organization.
- MapReduce – a software programming model for processing large sets of data in parallel (a small word-count sketch follows this list).
- YARN – a resource management framework for scheduling and handling resource requests from distributed applications.
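To make the MapReduce model concrete, here is a minimal word-count sketch in Java, closely modeled on the classic example from the Apache Hadoop MapReduce tutorial. The mapper emits a (word, 1) pair for each word in the input, and the reducer sums those counts per word; the class name WordCount and the command-line input/output paths are illustrative placeholders rather than anything prescribed by Hadoop.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit (word, 1) for every word in an input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce step: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // optional local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar and submitted with the hadoop jar command, a job like this would read its input from HDFS, and YARN would schedule the map and reduce tasks across the cluster’s nodes – which is how the three core components fit together.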
Other components that have achieved top-level Apache project status and are available include:
- Pig – a high-level data-flow language and execution framework for parallel computation.
- Hive – a data warehouse infrastructure that provides SQL-like querying of data stored in Hadoop.
- HBase – a non-relational, distributed database that runs on top of HDFS.
- ZooKeeper – a coordination service that helps distributed applications manage configuration and synchronization.