This is a list of Big Data services offered by the Central University “Marta Abreu” of Las Villas.
1- APACHE ACCUMULO
A sorted, distributed key-value store with cell-based access control
Accumulo is a low-latency, large-table data storage and retrieval system with cell-level security. Accumulo is based on Google’s Bigtable and runs on YARN, the data operating system of Hadoop. YARN provides visualization and analysis applications with predictable access to the data in Accumulo.
What Accumulo does
Accumulo was originally developed at the National Security Agency, before it was contributed to the Apache Software Foundation as an open-source incubation project. Due to its origins in the intelligence community, Accumulo provides extremely fast access to data in massive tables, while also controlling access to its billions of rows and millions of columns down to the individual cell. This is known as fine-grained data access control. Cell-level access control is important for organizations with complex policies governing who is allowed to see data. It enables the intermingling of different data sets with access control policies, giving fine-grained access to data sets that have some sensitive elements. Those with permission to see sensitive data can work alongside co-workers without those privileges, and both can access data in accordance with their permissions.
Without Accumulo, those policies are difficult to enforce systematically, but Accumulo encodes those rules for each individual data cell and controls fine-grained access.
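To make cell-level access control concrete, here is a minimal sketch using Accumulo’s Java client (2.x API). The instance name, ZooKeeper address, table name, credentials and visibility labels are placeholders, and the user is assumed to already have the corresponding authorizations granted; this is an illustration, not a production recipe.

import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.security.ColumnVisibility;
import java.util.Map;

public class CellLevelSecurityExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for an Accumulo 2.x instance.
        try (AccumuloClient client = Accumulo.newClient()
                .to("myInstance", "zk1:2181")
                .as("analyst", "secret")
                .build()) {

            // Write one cell whose visibility expression restricts who can read it.
            // The table "patients" is assumed to exist already.
            try (BatchWriter writer = client.createBatchWriter("patients")) {
                Mutation m = new Mutation("row-001");
                m.put("record", "diagnosis", new ColumnVisibility("medical&billing"), "sample value");
                writer.addMutation(m);
            }

            // Scan with authorizations; only cells whose visibility expression is
            // satisfied by these authorizations are returned.
            try (Scanner scan = client.createScanner("patients",
                    new Authorizations("medical", "billing"))) {
                for (Map.Entry<Key, Value> entry : scan) {
                    System.out.println(entry.getKey() + " -> " + entry.getValue());
                }
            }
        }
    }
}

A scan only returns cells whose visibility expression is satisfied by the authorizations passed to the scanner, which is how co-workers with different privileges can safely share the same table.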
2- APACHE AMBARI
A completely open source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters. Apache Ambari takes the guesswork out of operating Hadoop.
Apache Ambari, as part of the Hortonworks Data Platform, allows enterprises to plan, install and securely configure HDP, making it easier to provide ongoing cluster maintenance and management, no matter the size of the cluster.
What Ambari does
Ambari makes Hadoop management simpler by providing a consistent, secure platform for operational control. Ambari provides an intuitive web UI as well as a robust REST API, which is particularly useful for automating cluster operations. With Ambari, Hadoop operators gain core benefits across provisioning, management, monitoring and security.
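As one small illustration of that automation, the sketch below lists the clusters managed by an Ambari server over its REST API using plain Java. The hostname, port and credentials are assumptions for the example only; 8080 is Ambari’s usual default port.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class AmbariClustersExample {
    public static void main(String[] args) throws Exception {
        // Assumed Ambari server location and credentials; adjust for your cluster.
        String ambariUrl = "http://ambari.example.com:8080/api/v1/clusters";
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ambariUrl))
                .header("Authorization", "Basic " + auth)
                .header("X-Requested-By", "ambari") // required by Ambari for write calls; harmless on reads
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The response body is JSON describing each cluster the server manages.
        System.out.println(response.body());
    }
}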
3- APACHE ATLAS
Agile enterprise compliance through metadata
Apache Atlas provides scalable, metadata-driven governance for Enterprise Hadoop. At its core, Atlas is designed to easily model new business processes and data assets with agility. Its flexible type system allows the exchange of metadata with other tools and processes within and outside of the Hadoop stack, thereby enabling platform-agnostic governance controls that effectively address compliance requirements.
Apache Atlas is developed around a small set of guiding principles and empowers enterprises to effectively and efficiently address their compliance requirements through a scalable set of core governance services.
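As a rough sketch of that metadata exchange, the snippet below retrieves the type definitions an Atlas server currently holds through its v2 REST interface. The hostname, port and credentials are placeholder assumptions (21000 is the usual Atlas default), and the endpoint path should be checked against the Atlas documentation for the version in use.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class AtlasTypeDefsExample {
    public static void main(String[] args) throws Exception {
        // Assumed Atlas server location and credentials; adjust for your deployment.
        String url = "http://atlas.example.com:21000/api/atlas/v2/types/typedefs";
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Authorization", "Basic " + auth)
                .header("Accept", "application/json")
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // JSON listing the entity, classification, enum and struct types Atlas knows about.
        System.out.println(response.body());
    }
}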
4- APACHE FALCON
A framework for managing data life cycle in Hadoop clusters
Apache Falcon addresses enterprise challenges related to Hadoop data replication, business continuity and lineage tracing by deploying a framework for data management and processing. Falcon centrally manages the data lifecycle, facilitates quick data replication for business continuity and disaster recovery, and provides a foundation for audit and compliance by tracking entity lineage and collecting audit logs.
What Falcon does
Falcon allows an enterprise to process a single massive dataset stored in HDFS in multiple ways—for batch, interactive and streaming applications. With more data and more users of that data, Apache Falcon’s data governance capabilities play a critical role. As the value of Hadoop data increases, so does the importance of cleaning that data, preparing it for business intelligence tools, and removing it from the cluster when it outlives its useful life.
Falcon simplifies the development and management of data processing pipelines with a higher layer of abstraction, taking the complex coding out of data processing applications by providing out-of-the-box data management services. This simplifies the configuration and orchestration of data motion, disaster recovery and data retention workflows.
The Falcon framework can also leverage other HDP components, such as Pig, HDFS, and Oozie. Falcon enables this simplified management by providing a framework to define, deploy, and manage data pipelines.
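Falcon captures these policies declaratively in entity definitions rather than in application code. The short Java sketch below is purely hypothetical and is not Falcon’s API; it only models the kind of information a Falcon feed declaration carries, namely where a dataset lives in HDFS, how long it is retained, and where it is replicated for disaster recovery.

import java.time.Duration;
import java.util.List;

// Hypothetical illustration only: a plain-Java model of the facts a data-feed
// policy declares, instead of hand-coding retention/replication logic.
public class DataFeedPolicyExample {

    record FeedPolicy(String name,
                      String hdfsPath,          // where the dataset lands in HDFS
                      Duration retention,       // how long to keep it before purging
                      List<String> replicateTo  // clusters to copy it to for DR
    ) {}

    public static void main(String[] args) {
        FeedPolicy rawClicks = new FeedPolicy(
                "raw-clickstream",
                "/data/clickstream/${YEAR}-${MONTH}-${DAY}",
                Duration.ofDays(90),
                List.of("backup-cluster"));

        // In Falcon, a declaration of this kind drives the actual retention and
        // replication workflows; here we only print it.
        System.out.println(rawClicks);
    }
}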
5- APACHE FLUME
A service for streaming logs into Hadoop
Apache Flume is a distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple, flexible architecture based on streaming data flows, and it is robust and fault tolerant, with tunable reliability mechanisms for failover and recovery.
YARN coordinates data ingest from Apache Flume and other services that deliver raw data into an Enterprise Hadoop cluster.
What Flume does
Flume lets Hadoop users ingest high-volume streaming data into HDFS for storage. Specifically, Flume allows users to collect data from many different sources, aggregate it, and deliver it reliably into Hadoop for storage and analysis.
Enterprises use Flume’s powerful streaming capabilities to land data from high-throughput streams in the Hadoop Distributed File System (HDFS). Typical sources of these streams are application logs, sensor and machine data, geo-location data and social media. These different types of data can be landed in Hadoop for future analysis using interactive queries in Apache Hive, or they can feed business dashboards that are served ongoing data by Apache HBase.
In one specific example, Flume is used to log manufacturing operations. When one run of product comes off the line, it generates a log file about that run. Even if this occurs hundreds or thousands of times per day, the large volume of log file data can stream through Flume into a tool for same-day analysis with Apache Storm, or months or years of production runs can be stored in HDFS and analyzed by a quality assurance engineer using Apache Hive.
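As a hedged illustration of programmatic ingest, the snippet below uses Flume’s client SDK to send a single event to an agent that is assumed to be running an Avro source on the given host and port; from there the agent’s channel and sink (for example an HDFS sink) would carry the event into Hadoop. The host, port and agent configuration are assumptions for the example.

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeLogSender {
    public static void main(String[] args) {
        // Assumed Flume agent with an Avro source listening on this host and port.
        RpcClient client = RpcClientFactory.getDefaultInstance("flume.example.com", 41414);
        try {
            // One log line from a production run, wrapped as a Flume event.
            Event event = EventBuilder.withBody("run=4711 status=OK", StandardCharsets.UTF_8);
            client.append(event); // delivered to the agent's channel, then on to its sink
        } catch (EventDeliveryException e) {
            System.err.println("Event could not be delivered: " + e.getMessage());
        } finally {
            client.close();
        }
    }
}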
6- APACHE HADOOP
Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations.
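To make “distributed processing of very large data sets” concrete, the classic word-count job below is written against Hadoop’s MapReduce Java API (essentially the canonical example from the Hadoop documentation); the input and output paths are passed on the command line and are assumed to be HDFS directories.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}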
A brief history and benefits
The genesis of Hadoop came from the Google File System paper that was published in October 2003. This paper spawned another research paper from Google – MapReduce: Simplified Data Processing on Large Clusters. Development started in the Apache Nutch project, but was moved to the new Hadoop subproject in January 2006. The first committer added to the Hadoop project was Owen O’Malley in March 2006. Hadoop 0.1.0 was released in April 2006 and continues to be evolved by the many contributors to the Apache Hadoop project. Hadoop was named after a toy elephant belonging to the son of one of its founders.
In 2011, Rob Bearden partnered with Yahoo! to establish Hortonworks with 24 engineers from the original Hadoop team including founders Alan Gates, Arun Murthy, Devaraj Das, Mahadev Konar, Owen O’Malley, Sanjay Radia, and Suresh Srinivas.
One of the main reasons organizations use Hadoop is its ability to store, manage and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly and at low cost.