
List of available services

This is the list of Big Data services offered by the Central University “Marta Abreu” of Las Villas



1- APACHE Accumulo

A sorted, distributed key-value store with cell-based access control

Accumulo is a low-latency, large-table data storage and retrieval system with cell-level security. Accumulo is based on Google’s Bigtable and runs on YARN, the data operating system of Hadoop. YARN gives visualization and analysis applications predictable access to the data stored in Accumulo.

What Accumulo does

Accumulo was originally developed at the National Security Agency before being contributed to the Apache Software Foundation as an open-source incubator project. Because of its origins in the intelligence community, Accumulo provides extremely fast access to data in massive tables while also controlling access to its billions of rows and millions of columns down to the individual cell. This is known as fine-grained data access control.

Cell-level access control is important for organizations with complex policies governing who is allowed to see data. It enables the intermingling of different data sets, with access control policies providing fine-grained access to data sets that contain sensitive elements. Users with permission to see sensitive data can work alongside co-workers without those privileges, and each accesses the data in accordance with their own permissions.

Without Accumulo, such policies are difficult to enforce systematically; Accumulo encodes these rules for each individual data cell and enforces fine-grained access.
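To illustrate the idea, here is a minimal sketch in Python (not Accumulo’s actual Java API) of how a cell-level visibility expression can gate reads against a user’s authorizations. The cells, labels and the simplified expression grammar are all illustrative:

```python
# A toy model of cell-level visibility, assuming a simplified expression
# grammar (flat "&" / "|" with optional parentheses). Accumulo's real
# ColumnVisibility parser supports arbitrary nesting.
def can_see(visibility, auths):
    if not visibility:                 # empty expression: visible to everyone
        return True
    for clause in visibility.split("|"):           # disjunction of clauses
        labels = {t.strip("() ") for t in clause.split("&")}
        if labels <= auths:            # user holds every label in the clause
            return True
    return False

# Hypothetical cells: (row, column) -> (visibility expression, value)
cells = {
    ("row1", "loc"): ("", "Havana"),
    ("row1", "ssn"): ("admin&pii", "123-45-6789"),
}

def scan(auths):
    """Return only the cells the given authorizations may see."""
    return {k: val for k, (vis, val) in cells.items() if can_see(vis, auths)}
```

A scan with no authorizations would return only the unlabeled `loc` cell, while a user holding both `admin` and `pii` would also see the `ssn` cell.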

More detailed information about this topic



2- APACHE Ambari

A completely open-source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters. Apache Ambari takes the guesswork out of operating Hadoop.

Apache Ambari, as part of the Hortonworks Data Platform, allows enterprises to plan, install and securely configure HDP, making ongoing cluster maintenance and management easier, no matter the size of the cluster.

What Ambari does

Ambari makes Hadoop management simpler by providing a consistent, secure platform for operational control. Ambari provides an intuitive Web UI as well as a robust REST API, which is particularly useful for automating cluster operations. With Ambari, Hadoop operators get the following core benefits:

  • Simplified Installation, Configuration and Management. Easily and efficiently create, manage and monitor clusters at scale. Takes the guesswork out of configuration with Smart Configs and Cluster Recommendations. Enables repeatable, automated cluster creation with Ambari Blueprints.
  • Centralized Security Setup. Reduce the complexity to administer and configure cluster security across the entire platform. Helps automate the setup and configuration of advanced cluster security capabilities such as Kerberos and Apache Ranger.
  • Full Visibility into Cluster Health. Ensure your cluster is healthy and available with a holistic approach to monitoring. Configures predefined alerts — based on operational best practices — for cluster monitoring. Captures and visualizes critical operational metrics — using Grafana — for analysis and troubleshooting. Integrated with Hortonworks SmartSense for proactive issue prevention and resolution.
  • Highly Extensible and Customizable. Fit Hadoop seamlessly into your enterprise environment. Highly extensible with Ambari Stacks for bringing custom services under management, and with Ambari Views for customizing the Ambari Web UI.
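The REST API mentioned above exposes cluster resources under `/api/v1`; modifying requests must carry basic-auth credentials and an `X-Requested-By` header. The following Python sketch only builds such a request without sending it; the host name, cluster name and credentials are placeholders:

```python
import base64
import urllib.request

def ambari_request(host, cluster, resource, user="admin", password="admin"):
    # Cluster resources are exposed under /api/v1/clusters/<name>/...
    url = f"http://{host}:8080/api/v1/clusters/{cluster}/{resource}"
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    # Ambari rejects modifying (PUT/POST/DELETE) requests lacking this header
    req.add_header("X-Requested-By", "ambari-script")
    return req

# Build (but do not send) a request for the HDFS service of a cluster
req = ambari_request("ambari.example.com", "mycluster", "services/HDFS")
```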

More detailed information about this topic



3- APACHE Atlas

Agile enterprise compliance through metadata

Atlas is designed to exchange metadata with other tools and processes within and outside of the Hadoop stack, thereby enabling platform-agnostic governance controls that effectively address compliance requirements.

Apache Atlas provides scalable, metadata-driven governance for Enterprise Hadoop. At its core, Atlas is designed to model new business processes and data assets easily and with agility. Its flexible type system allows metadata to be exchanged with other tools and processes within and outside of the Hadoop stack, thereby enabling platform-agnostic governance controls that effectively address compliance requirements.

Apache Atlas is developed around two guiding principles:

  • Metadata Truth in Hadoop: Atlas provides true visibility into Hadoop. Using native connectors to Hadoop components, Atlas provides technical and operational tracking enriched by business taxonomical metadata. Atlas also facilitates easy exchange of metadata by letting any metadata consumer share a common metadata store, enabling interoperability across many metadata producers.
  • Developed in the Open: Engineers from Aetna, Merck, SAS, Schlumberger, and Target are working together to help ensure Atlas is purposely built to solve real data governance problems across a wide range of industries that use Hadoop. This approach is an example of open source community innovation that helps accelerate product maturity and time-to-value for the data-first enterprise.

Apache Atlas empowers enterprises to effectively and efficiently address their compliance requirements through a scalable set of core governance services. These services include:

  • Data Lineage: Captures lineage across Hadoop components at the platform level.
  • Agile Data Modeling: The type system allows custom metadata structures in a hierarchical taxonomy.
  • REST API: Modern, flexible access to Atlas services, HDP components, the UI and external tools.
  • Metadata Exchange: Leverage existing metadata and models by importing them from current tools, and export metadata to downstream systems.
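As a sketch of what the type system and REST API make possible, the following Python builds a custom entity-type definition as JSON of the shape accepted by Atlas’s v2 typedefs endpoint; the type name and attribute names are invented for illustration:

```python
import json

# A hypothetical custom type; all names here are invented for illustration.
typedef = {
    "entityDefs": [{
        "name": "legacy_report",
        "superTypes": ["DataSet"],        # extend Atlas's built-in DataSet
        "attributeDefs": [
            {"name": "owner_dept", "typeName": "string", "isOptional": False},
            {"name": "retention_days", "typeName": "int", "isOptional": True},
        ],
    }]
}

payload = json.dumps(typedef)   # body for a POST to the typedefs endpoint
```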

More detailed information about this topic



4- APACHE Falcon

A framework for managing the data life cycle in Hadoop clusters

Apache™ Falcon addresses enterprise challenges related to Hadoop data replication, business continuity, and lineage tracing by providing a framework for data management and processing. Falcon centrally manages the data lifecycle, facilitates quick data replication for business continuity and disaster recovery, and provides a foundation for audit and compliance by tracking entity lineage and collecting audit logs.

What Falcon does

Falcon allows an enterprise to process a single massive dataset stored in HDFS in multiple ways—for batch, interactive and streaming applications. With more data and more users of that data, Apache Falcon’s data governance capabilities play a critical role. As the value of Hadoop data increases, so does the importance of cleaning that data, preparing it for business intelligence tools, and removing it from the cluster when it outlives its useful life.

Falcon simplifies the development and management of data processing pipelines with a higher layer of abstraction, taking the complex coding out of data processing applications by providing out-of-the-box data management services. This simplifies the configuration and orchestration of data motion, disaster recovery and data retention workflows.

The Falcon framework can also leverage other HDP components, such as Pig, HDFS, and Oozie. Falcon enables this simplified management by providing a framework to define, deploy, and manage data pipelines.
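To make the retention side of lifecycle management concrete, here is a toy age-based eviction check in Python. It is not Falcon’s actual engine (Falcon policies are declared in XML entity definitions), and the partition names are hypothetical:

```python
from datetime import datetime, timedelta

def expired(partitions, retention, now):
    """Toy age-based retention check: name the partitions whose
    timestamp falls outside the retention window."""
    cutoff = now - retention
    return [name for name, ts in partitions.items() if ts < cutoff]

# Hypothetical daily partitions of a data feed
parts = {
    "2017-11-01": datetime(2017, 11, 1),
    "2017-11-20": datetime(2017, 11, 20),
}
old = expired(parts, timedelta(days=7), now=datetime(2017, 11, 22))
```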

More detailed information about this topic



5- APACHE Flume

A service for streaming logs into Hadoop

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple, flexible architecture based on streaming data flows, and it is robust and fault-tolerant, with tunable reliability mechanisms for failover and recovery.

YARN coordinates data ingest from Apache Flume and other services that deliver raw data into an Enterprise Hadoop cluster.

What Flume does

Flume lets Hadoop users ingest high-volume streaming data into HDFS for storage. Specifically, Flume allows users to:

  • Stream data: Ingest streaming data from multiple sources into Hadoop for storage and analysis.
  • Insulate systems: Buffer the storage platform from transient spikes when the rate of incoming data exceeds the rate at which it can be written to the destination.
  • Guarantee data delivery: Flume NG uses channel-based transactions to guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started, one on the agent that delivers the event and the other on the agent that receives the event. This ensures guaranteed delivery semantics.
  • Scale horizontally: To ingest new data streams and additional volume as needed.
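A Flume flow is declared in a properties file that wires sources, channels and sinks together. The sketch below, with placeholder host names, paths and capacities, tails an application log into HDFS through a memory channel:

```properties
# One agent (a1), one flow: exec source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

An agent defined this way would typically be launched with `flume-ng agent --name a1 --conf-file <file>`.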

Enterprises use Flume’s powerful streaming capabilities to land data from high-throughput streams in the Hadoop Distributed File System (HDFS). Typical sources of these streams are application logs, sensor and machine data, geo-location data and social media. These different types of data can be landed in Hadoop for future analysis using interactive queries in Apache Hive, or they can feed business dashboards with up-to-date data served by Apache HBase.

In one specific example, Flume is used to log manufacturing operations. When one run of product comes off the line, it generates a log file about that run. Even if this occurs hundreds or thousands of times per day, the high-volume log-file data can stream through Flume into a tool such as Apache Storm for same-day analysis, or months or years of production runs can be stored in HDFS and analyzed by a quality-assurance engineer using Apache Hive.

More detailed information about this topic

6- APACHE Hadoop


Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations.

A brief history and benefits


The genesis of Hadoop came from the Google File System paper, published in October 2003. This paper spawned another research paper from Google, “MapReduce: Simplified Data Processing on Large Clusters”. Development started in the Apache Nutch project but moved to the new Hadoop subproject in January 2006. The first committer added to the Hadoop project was Owen O’Malley, in March 2006. Hadoop 0.1.0 was released in April 2006, and the project continues to evolve through its many contributors. Hadoop was named after the toy elephant of co-founder Doug Cutting’s son.

In 2011, Rob Bearden partnered with Yahoo! to establish Hortonworks with 24 engineers from the original Hadoop team including founders Alan Gates, Arun Murthy, Devaraj Das, Mahadev Konar, Owen O’Malley, Sanjay Radia, and Suresh Srinivas.


Organizations use Hadoop for its ability to store, manage and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly and at low cost.

  • Scalability and Performance – distributed processing of data local to each node in a cluster enables Hadoop to store, manage, process and analyze data at petabyte scale.
  • Reliability – large computing clusters are prone to failure of individual nodes in the cluster. Hadoop is fundamentally resilient – when a node fails processing is re-directed to the remaining nodes in the cluster and data is automatically re-replicated in preparation for future node failures.
  • Flexibility – unlike traditional relational database management systems, you don’t have to create structured schemas before storing data. You can store data in any format, including semi-structured or unstructured formats, and then parse the data and apply a schema when it is read.
  • Low Cost – unlike proprietary software, Hadoop is open source and runs on low-cost commodity hardware.
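The schema-on-read idea behind the flexibility point can be sketched in a few lines of Python: the raw text is stored exactly as it arrived, and a schema (here just a list of field names) is applied only when the data is read:

```python
import csv
import io

# Raw data is stored untouched, exactly as it arrived
raw = "2017-11-22,alice,login\n2017-11-22,bob,logout\n"

def read_with_schema(blob, fields):
    """Apply a schema (a list of field names) only at read time."""
    return [dict(zip(fields, row)) for row in csv.reader(io.StringIO(blob))]

events = read_with_schema(raw, ["date", "user", "action"])
```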

More detailed information about Hadoop

bigdata/available.txt · Last modified: 2017/11/22 11:25 by eppelaez