Big Data Pipeline with AWS



Diksha Singh Tomer
Computer Science and Engineering
Banasthali University, India
Introduction
As society becomes increasingly digital, the amount of data being created and collected keeps growing at an accelerating rate. Analyzing this ever-growing data is a challenge for traditional analytical tools, so innovation is needed to bridge the gap between the data being generated and the data that can actually be analyzed.
Raw data is often too large to store as-is in any database system or data warehouse. Instead, it must be processed to extract the useful information, which is then stored and used for further processing. A data pipeline (or big data pipeline) is the solution for collecting, processing, and moving this data from source systems to data lakes or warehouses. The refined data can later be fed into analytical and recommendation systems for further processing.
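As a minimal sketch of one such pipeline stage, the Python snippet below reads raw records from a local file, extracts the useful fields, and loads the result into an S3 data lake with boto3. The file name "raw-events.jsonl", the bucket "example-data-lake", and the record fields are placeholders, not part of the paper.

import json
import boto3

s3 = boto3.client("s3")

def extract_useful_fields(record: dict) -> dict:
    # Keep only the fields needed downstream; drop the bulky raw payload.
    return {"user_id": record.get("user_id"),
            "event": record.get("event"),
            "timestamp": record.get("timestamp")}

def run_pipeline(source_path: str, bucket: str, key: str) -> None:
    # Collect: read the raw records from the source system.
    with open(source_path) as f:
        raw_records = [json.loads(line) for line in f]
    # Process: reduce each record to its useful information.
    cleaned = [extract_useful_fields(r) for r in raw_records]
    # Move: load the processed records into the data lake (S3) for later analysis.
    body = "\n".join(json.dumps(r) for r in cleaned).encode("utf-8")
    s3.put_object(Bucket=bucket, Key=key, Body=body)

run_pipeline("raw-events.jsonl", "example-data-lake", "cleaned/events.jsonl")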
Amazon Web Services (AWS) provides a broad platform of managed services that help you build, secure, and scale end-to-end big data applications quickly and easily. Whether your applications require real-time streaming or batch data processing, AWS provides the infrastructure and tools to tackle your next big data project.
This document describes general data pipeline concepts, the types of data pipelines, and some solutions for building such pipelines with AWS services.
Data Pipeline
A data pipeline is the general term for the movement of data from one location to another. The location from which the data flows is known as the data source, and the destination is called the data sink.
The data sources can be data already stored in AWS big data locations such as databases, data files, or data warehouses. Such pipelines are called batch data pipelines, because the data is well defined in advance and is transferred in discrete batches.
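A single batch run might look like the sketch below: a scheduled job exports well-defined rows from the source system and ships them to a staging location in one bulk transfer. The local SQLite database, the "orders" table, and the bucket "example-warehouse-staging" are stand-ins chosen for illustration.

import csv
import io
import sqlite3
import boto3

def export_daily_batch(db_path: str, bucket: str, key: str) -> None:
    # Extract the already-structured records from the source database.
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT id, amount, created_at FROM orders").fetchall()
    conn.close()

    # Serialize the whole batch as a single CSV file.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "amount", "created_at"])
    writer.writerows(rows)

    # One bulk transfer per scheduled run, e.g. nightly.
    boto3.client("s3").put_object(Bucket=bucket, Key=key,
                                  Body=buf.getvalue().encode("utf-8"))

export_daily_batch("orders.db", "example-warehouse-staging", "batches/orders-2024-01-01.csv")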
Other data sources, such as log files or streaming data from games and real-time applications, are not as well defined and may also vary in structure. Pipelines for this kind of data are called streaming data pipelines. Streaming data requires a special kind of solution, because records can arrive late due to network latency or an inconsistent data velocity.
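On AWS, a common entry point for a streaming pipeline is Amazon Kinesis. The producer sketch below, assuming a hypothetical stream named "example-game-events", attaches an event timestamp to every record so that downstream consumers can deal with late or out-of-order arrivals.

import json
import time
import boto3

kinesis = boto3.client("kinesis")

def send_event(stream_name: str, player_id: str, event: str) -> None:
    # Carry the event time with the record so consumers can detect late data.
    record = {"player_id": player_id, "event": event, "event_time": time.time()}
    kinesis.put_record(StreamName=stream_name,
                       Data=json.dumps(record).encode("utf-8"),
                       PartitionKey=player_id)

send_event("example-game-events", "player-42", "level_completed")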
ETL & ELT
We may also want to perform operations or transformations on the data while it moves from the data source to the data sink; such data pipelines are given special names:
• ETL (Extract, Transform, Load): In ETL, data moves from the data source to a staging area and then into the warehouse. All transformations are performed before the data is loaded into the warehouse.
• ELT (Extract, Load, Transform): ELT offers a modern alternative to ETL in which analysts load data into the warehouse before transforming it, supporting a more flexible and agile way of working. Both orderings are contrasted in the sketch after this list.
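The sketch below contrasts the two orderings, using an in-memory SQLite database as a stand-in for the warehouse (a placeholder; a real setup would target a warehouse such as Amazon Redshift). The table names and the cast-to-float transformation are illustrative assumptions.

import sqlite3

raw_records = [("alice", "10.5"), ("bob", "3.0")]

def run_etl(conn):
    # ETL: transform first (cast amounts to numbers), then load only the cleaned data.
    cleaned = [(name, float(amount)) for name, amount in raw_records]
    conn.execute("CREATE TABLE sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)

def run_elt(conn):
    # ELT: load the raw data as-is, then transform later with SQL inside the warehouse.
    conn.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_records)
    conn.execute("CREATE TABLE sales AS "
                 "SELECT name, CAST(amount AS REAL) AS amount FROM raw_sales")

run_etl(sqlite3.connect(":memory:"))
run_elt(sqlite3.connect(":memory:"))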