Increasing Big Data Security and Enhancing Large-Scale Computing
With the abundance and rise of unstructured data, which now accounts for roughly 90% of all data, the time has come for enterprises to re-evaluate their approach to data storage, management, and analytics. Legacy systems will remain necessary for specific high-value, low-volume workloads, which in turn complements the use of Hadoop: optimizing the data management structure in your organization means putting the right Big Data workloads in the right systems. The cost-effectiveness, scalability, flexibility, and streamlined architecture of Hadoop will make the technology more and more attractive. In fact, the need for Hadoop is no longer in question; the only question is how best to take advantage of it. Hadoop is one of the up-and-coming technologies for performing analysis on Big Data.
Apache Hadoop is an open-source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale from a single server to thousands of machines, with a very high degree of fault tolerance. There are two major components of Hadoop: the Hadoop Distributed File System (HDFS), a distributed file system that splits large files across multiple computers, and the MapReduce framework, an application framework used to process the large data sets stored on the file system.
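The division of labor between the two components can be illustrated with the classic word-count example. The sketch below simulates the map and reduce phases in plain Python on a single machine; it uses no actual Hadoop API, and the function names and sample input are illustrative only.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input split.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# One "input split" of text, as HDFS might hand it to one mapper.
split = ["big data big clusters", "data on commodity servers"]
print(reduce_phase(map_phase(split)))
```

In a real cluster, many mappers run in parallel on different splits, and the framework shuffles their output to the reducers by key; the logic of each phase is the same as above.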
Looking back over the past few years, a survey conducted in October 2010 conveys that the initial use of the Hadoop framework was to index web pages, but it is now being viewed as an alternative to other business intelligence (BI) products that rely on data residing in databases (structured data), since it can work with unstructured data from disparate data sources that database-oriented tools are unable to handle as effectively.
A November 2011 article in Computerworld mentions that JPMorgan Chase is using Hadoop to improve fraud detection, IT risk management, and self-service applications, and that eBay is using it to build a new search engine for its auction site. The article goes on to warn that anyone using Hadoop and its associated technologies needs to consider the security implications of placing data in that environment, because the currently provided security mechanisms of access control lists and Kerberos authentication are inadequate for most high-security applications.
Similarly, reviews from February 2012 found that those already using the technology warned that Hadoop requires extensive training, along with analytics expertise not found in many IT shops today.
Hadoop can handle all types of data from disparate systems: structured, unstructured, log files, pictures, audio files, communications records, email, just about anything you can think of, regardless of its native format. Even when different types of data have been stored in unrelated systems, you can dump it all into your Hadoop cluster with no prior need for a schema. In other words, you don't need to know how you intend to query your data before you store it; Hadoop lets you decide later, and over time it can reveal questions you never even thought to ask.
The main difference to point out is between Schema on Write and Schema on Read:
When we move data from SQL database A to SQL database B, we need to have some information on hand before we write to database B. For example, we have to know the structure of database B and how to adapt the data from database A to fit that structure. This is what we call Schema on Write. Hadoop, on the other hand, takes a Schema on Read approach. When we write data into the Hadoop Distributed File System, we simply bring it in without dictating any gatekeeping rules. Then, when we want to read the data, we apply the schema in the code that reads the data, rather than preconfiguring the structure of the data ahead of time.
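A minimal sketch of Schema on Read, assuming a hypothetical two-field CSV data set: the raw text is stored with no structure enforced, and the schema (field names and types) is applied only by the code that reads it.

```python
import csv
import io

# Raw lines were landed "as is" -- no schema was enforced at write time.
raw = "alice,34\nbob,28\n"

# Schema on Read: the structure is supplied by the reading code,
# not baked into the storage layer.
schema = [("name", str), ("age", int)]

def read_with_schema(text, schema):
    # Apply field names and type casts while reading each row.
    for row in csv.reader(io.StringIO(text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}

records = list(read_with_schema(raw, schema))
print(records)  # [{'name': 'alice', 'age': 34}, {'name': 'bob', 'age': 28}]
```

A different consumer could read the very same raw bytes with a different schema, which is exactly the flexibility the Schema on Read approach buys.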
Hadoop is an open-source, Java-based implementation of Google's MapReduce framework. Hadoop is designed for any application that can take advantage of massively parallel distributed processing, particularly on clusters composed of unreliable hardware. For example, suppose you have 10 terabytes of data and you need to process or sort it. Traditionally, a high-end supercomputer with exotic hardware would be required to do this in a reasonable amount of time. Hadoop provides a framework to process data of this size using a computing cluster made from normal, commodity hardware.
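The divide-and-merge idea behind sorting data too large for one machine can be sketched in miniature: each "node" sorts its own chunk independently (the parallel step), and the sorted runs are then merged into one ordered stream. This is a single-machine simulation of the pattern, not Hadoop itself, and the chunk size is arbitrary.

```python
import heapq
import random

# Pretend each chunk lives on a different node of the cluster.
data = random.sample(range(1000), 100)
chunk_size = 25
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Step 1: every node sorts its own chunk independently, in parallel.
sorted_chunks = [sorted(chunk) for chunk in chunks]

# Step 2: the sorted runs are merged into a single ordered stream.
merged = list(heapq.merge(*sorted_chunks))
assert merged == sorted(data)
```

At 10-terabyte scale the same structure applies; the sorting of chunks happens on many machines at once, which is where the cluster's speedup comes from.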
Of course, there are many distributed computing frameworks, but what is particularly notable about Hadoop is its built-in fault tolerance. It is designed to run on commodity hardware, and it therefore expects individual machines to fail frequently.
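A toy illustration of that expectation, assuming a hypothetical flaky task: when a task fails (as it would when a node dies mid-job), the framework reschedules it elsewhere instead of aborting the whole job. The retry loop below is a sketch of that idea only, not Hadoop's actual scheduler.

```python
def run_task_with_retries(task, max_attempts=3):
    # Re-run a failed task, as the framework would reschedule
    # work on another machine when a node fails.
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up only after exhausting all attempts

def flaky_task(failures=[2]):
    # Hypothetical task that fails on its first two attempts.
    if failures[0] > 0:
        failures[0] -= 1
        raise RuntimeError("node failed")
    return "done"

print(run_task_with_retries(flaky_task))  # prints "done" on the third attempt
```

The key design point is that failure handling lives in the framework, not in the user's map and reduce code, so applications stay simple even on unreliable hardware.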