Tuesday, January 4, 2022

UNIT 1



Introduction to Big Data: Introduction to Big Data Platform

Big Data:

Big data can be defined as data sets whose size or type is beyond the ability of traditional relational databases to capture, manage and process with low latency (latency being the time delay in sending data from source to destination).

Data sets:

A data set is a collection of numbers or values that relate to a particular subject.

Ex:

All hospitals must provide a standard data set of each patient's details. Likewise, the number of fish eaten by each dolphin at an aquarium is a data set.




Characteristics of big data include high volume, high velocity and high variety.

Sources of data are becoming more complex than those for traditional data because they are being driven by Artificial Intelligence, mobile devices, social media and the Internet of Things (IoT).

For example, the different types of data originate from sensors, devices, video/audio, networks, log files, transactional applications, web and social media - much of it generated in real time and at a very large scale.

With big data analytics, you can ultimately fuel better and faster decision-making, modelling and predicting of future outcomes and enhanced business intelligence. As you build your big data solution, consider open source software such as Apache Hadoop, Apache Spark and the entire Hadoop ecosystem as cost-effective, flexible data processing and storage tools designed to handle the volume of data being generated today.




Introduction to Big Data Platform:

A big data platform is an IT solution that combines several big data tools and utilities into one packaged solution, which is then used for managing as well as analyzing big data.

Why Do We Need a Big Data Platform?

A big data platform combines the capabilities and features of many big data applications into a single solution. It generally consists of big data servers, storage, databases, management utilities, and business intelligence tools.

It also focuses on providing its users with efficient analytics tools for massive data sets. Data engineers often use these platforms to aggregate, clean, and prepare data for business analysis. Data scientists use them to discover relationships and patterns in large data sets with machine learning algorithms. Users can also build custom applications for their own use cases, such as calculating customer loyalty in e-commerce; the possible use cases are countless.



What are the best Big Data Platforms?

The best platforms are judged on four criteria, abbreviated S, A, P, S: Scalability, Availability, Performance, and Security. Various tools are available to manage the hybrid data of IT systems. Some of them are listed below:
  • Hadoop Delta Lake Migration Platform

  • Data Catalog Platform

  • Data Ingestion Platform

  • IoT Analytics Platform

  • Data Integration and Management Platform

  • ETL Data Transformation Platform

Hadoop - Delta Lake Migration Platform

Hadoop is an open-source software platform managed by the Apache Software Foundation. It is used to store and manage large data sets at low cost and with great efficiency.

Data Catalog Platform

It provides a single self-service environment that helps users find, understand, and trust data sources. It also helps users discover new data sources, if there are any. Discovering and understanding data sources are the initial steps in registering them. Users search the data catalog tools based on their needs and filter for the appropriate results. In enterprises, a data lake serves business intelligence teams, data scientists, and ETL developers, all of whom need the right data. These users rely on catalog discovery to find the data that fits their needs.

Data Ingestion Platform

This layer is the first step in the journey of data arriving from various sources. The data is prioritized and categorized here, so that it flows smoothly through the later layers of the process.
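
The prioritize-and-categorize step described above can be illustrated with a minimal Python sketch. It is not how any particular ingestion product works: the category names, the PRIORITY table, and the "source" field are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical priority table (an assumption): lower number = handled first.
PRIORITY = {"transaction": 0, "sensor": 1, "log": 2}

@dataclass(order=True)
class Record:
    priority: int
    seq: int                              # arrival order, used to break ties
    category: str = field(compare=False)
    payload: dict = field(compare=False)

def categorize(raw):
    """Assign a category from the (assumed) 'source' field; unknown sources fall back to 'log'."""
    source = raw.get("source", "")
    return source if source in PRIORITY else "log"

def ingest(raw_records):
    """Yield records in priority order; records with equal priority keep arrival order."""
    heap = []
    seq = count()
    for raw in raw_records:
        category = categorize(raw)
        heapq.heappush(heap, Record(PRIORITY[category], next(seq), category, raw))
    while heap:
        yield heapq.heappop(heap)

incoming = [
    {"source": "log", "msg": "disk 80% full"},
    {"source": "transaction", "amount": 120.0},
    {"source": "sensor", "temp_c": 21.5},
    {"source": "unknown", "msg": "?"},
]
ordered = [record.category for record in ingest(incoming)]
print(ordered)  # ['transaction', 'sensor', 'log', 'log']
```

A real ingestion layer would read from message queues or files instead of a list, but the idea is the same: tag each record with a category, then let downstream layers consume the most urgent data first.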

IoT Analytics Platform

It provides a wide range of tools for working with big data; this functionality comes in handy in IoT (Internet of Things) use cases.

Big Data Integration and Management Platform

Elixir Data provides a highly customizable solution for enterprises. It offers flexibility, security, and stability for enterprise applications and big data infrastructure, deployable on-premises and in the public cloud, with cognitive insights from machine learning and artificial intelligence.

ETL Data Transformation Platform

This platform can be used to build data transformation pipelines and to schedule their runs.
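
As a rough illustration of such a pipeline, the Python sketch below chains extract, transform, and load steps using only the standard library. The CSV layout, the column names, and the in-memory "warehouse" dictionary are assumptions made for the example; scheduling is left out, since in practice it would be handled by the platform or an external scheduler.

```python
import csv
import io

# Hypothetical input standing in for a real source system.
RAW_CSV = """customer,amount
alice, 10.50
bob,not_a_number
alice,4.50
"""

def extract(text):
    """Read rows from CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Convert amounts to floats and drop rows that cannot be parsed."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip malformed rows instead of failing the whole run
        cleaned.append({"customer": row["customer"].strip(), "amount": amount})
    return cleaned

def load(rows, warehouse):
    """Aggregate the total amount per customer into the warehouse dict."""
    for row in rows:
        name = row["customer"]
        warehouse[name] = warehouse.get(name, 0.0) + row["amount"]
    return warehouse

def run_pipeline(text):
    """Run extract -> transform -> load once."""
    return load(transform(extract(text)), {})

totals = run_pipeline(RAW_CSV)
print(totals)  # {'alice': 15.0}
```

Keeping each stage as a separate function makes it easy to test the stages individually and to rerun the whole pipeline on a schedule.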

Essential components of Big Data Platform:

  • Data Ingestion, Management, ETL, and Warehouse – Provides the resources for effective data management and data warehousing, treating data as a valuable resource.

  • Stream Computing – Helps compute the streaming data that is used for real-time analytics.

  • Analytics/ Machine Learning – Features for advanced analytics and machine learning.

  • Integration – Lets users integrate big data from any source with ease.

  • Data Governance – Provides comprehensive security, data governance, and solutions to protect the data.

  • Provides Accurate Data – Delivers analytic tools that help filter out inaccurate data that has not been analyzed, so the business can make the right decisions using accurate information.

  • Scalability – Scales the application to analyze ever-growing data, sizing itself to provide efficient analysis, and offers scalable storage capacity.

  • Price Optimization – Data analytics on a big data platform provides insight for B2C and B2B enterprises, which helps them optimize the prices they charge.

  • Reduced Latency – With its warehouse, analytics tools, and efficient data transformation, the platform helps reduce data latency and provide high throughput.
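
The stream computing component listed above processes values as they arrive instead of storing everything first. A minimal Python sketch of that idea is a sliding-window average; the function name, the window size, and the sample readings are assumptions chosen for illustration.

```python
from collections import deque

def sliding_average(stream, window_size):
    """Yield the mean of the most recent `window_size` values as each value arrives."""
    window = deque(maxlen=window_size)
    total = 0.0
    for value in stream:
        if len(window) == window_size:
            total -= window[0]   # the oldest value is about to be evicted
        window.append(value)     # deque drops the oldest item automatically
        total += value
        yield total / len(window)

readings = [10, 20, 30, 40]
averages = list(sliding_average(readings, window_size=2))
print(averages)  # [10.0, 15.0, 25.0, 35.0]
```

Because the function keeps only the current window in memory, it works on streams of any length, which is the property real-time analytics depends on.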

Big Data Platform Use Cases:

  • Insurance Fraud Detection – Companies handling a large number of financial transactions use tools provided by this platform to look for any fraud that’s happening.

  • Real-Time Stream Processing – It can be used in many fields, such as media and entertainment, weather pattern analysis, the transportation industry, the banking sector, and so on.
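
To make the fraud detection use case concrete, here is a deliberately simple Python sketch. It flags any transaction that is much larger than the typical (median) amount for the same account. The rule, the factor of 5, and the sample data are assumptions for illustration; production systems generally rely on far richer features and models.

```python
from collections import defaultdict
from statistics import median

def flag_suspicious(transactions, factor=5.0):
    """Return (account, amount) pairs whose amount exceeds `factor` times the account's median."""
    by_account = defaultdict(list)
    for account, amount in transactions:
        by_account[account].append(amount)

    flagged = []
    for account, amount in transactions:
        typical = median(by_account[account])
        if typical > 0 and amount > factor * typical:
            flagged.append((account, amount))
    return flagged

# Hypothetical sample data.
transactions = [
    ("acc-1", 20.0), ("acc-1", 25.0), ("acc-1", 22.0), ("acc-1", 900.0),
    ("acc-2", 500.0), ("acc-2", 520.0),
]
flagged = flag_suspicious(transactions)
print(flagged)  # [('acc-1', 900.0)]
```

The median is used rather than the mean so that a single very large transaction does not inflate the baseline it is compared against.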

BIG DATA ANALYTICS


Objectives:

  • To learn to analyze big data using intelligent techniques.

  • To understand the various search methods and visualization techniques.

  • To learn various techniques for mining data streams.

  • To understand applications that use Map Reduce concepts.

Outcomes:

On completion of this course, the student will be able to:

  • Analyze big data analytics techniques for useful business applications.

  • Design efficient algorithms for mining data from large volumes.

  • Analyze the Hadoop and Map Reduce technologies associated with big data analytics.

  • Explore big data applications using Pig and Hive.

UNIT – I Introduction to Big Data: Introduction to Big Data Platform – Challenges of Conventional Systems – Intelligent Data Analysis – Nature of Data – Analytic Processes and Tools – Analysis vs Reporting – Modern Data Analytic Tools – Statistical Concepts: Sampling Distributions – Re-Sampling – Statistical Inference – Prediction Error.

UNIT – II Mining Data Streams: Introduction to Stream Concepts – Stream Data Model and Architecture – Stream Computing – Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream – Estimating Moments – Counting Ones in a Window – Decaying Window – Real-Time Analytics Platform (RTAP) Applications – Case Studies – Real-Time Sentiment Analysis, Stock Market Predictions.

UNIT – III Hadoop: History of Hadoop – The Hadoop Distributed File System – Components of Hadoop – Analyzing the Data with Hadoop – Scaling Out – Hadoop Streaming – Design of HDFS – Java Interfaces to HDFS Basics – Developing a Map Reduce Application – How Map Reduce Works – Anatomy of a Map Reduce Job Run – Failures – Job Scheduling – Shuffle and Sort – Task Execution – Map Reduce Types and Formats – Map Reduce Features.

UNIT – IV Hadoop Environment: Setting up a Hadoop Cluster – Cluster Specification – Cluster Setup and Installation – Hadoop Configuration – Security in Hadoop – Administering Hadoop – HDFS – Monitoring – Maintenance – Hadoop Benchmarks – Hadoop in the Cloud.

UNIT – V Frameworks: Applications on Big Data Using Pig and Hive – Data Processing Operators in Pig – Hive Services – HiveQL – Querying Data in Hive – Fundamentals of HBase and ZooKeeper – IBM InfoSphere BigInsights and Streams. Visualization: Visual data analysis techniques, interaction techniques; systems and applications.

Text Books:

1. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer, 2007.

2. Tom White, Hadoop: The Definitive Guide, Third Edition, O’Reilly Media, 2012.

3. Chris Eaton, Dirk DeRoos, Tom Deutsch, George Lapis, Paul Zikopoulos, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill Publishing, 2012.

4. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press, 2012.

Reference Books:

1. Bill Franks, Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics, John Wiley & Sons, 2012.

2. Glenn J. Myatt, Making Sense of Data, John Wiley & Sons, 2007.

3. Pete Warden, Big Data Glossary, O’Reilly, 2011.

4. Jiawei Han, Micheline Kamber, Data Mining Concepts and Techniques, Second Edition, Elsevier, Reprinted 2008.

5. Da Ruan, Guoqing Chen, Etienne E. Kerre, Geert Wets, Intelligent Data Mining, Springer, 2007.