If you want to build an end-to-end data pipeline with AWS services, this post introducing the AWS Big Data portfolio is for you. Before diving into the AWS suite, let’s first define what Big Data is.
What’s Big Data?
According to Amazon Web Services (AWS):
Big data can be described in terms of data management challenges that – due to increasing volume, velocity and variety of data – cannot be solved with traditional databases.
In a simpler way:
A data set is considered big data when it is too big or complex to be stored or analyzed by traditional data systems.
As we have defined what Big Data is, we will continue with the AWS services portfolio that is used to help solve the challenges of Big Data.
Big Data 1st STEP: COLLECT
Collecting raw data is always a challenge for many organizations, especially for developers, because businesses typically have many complex source systems scattered throughout the company, such as ERP systems, CRM systems, and transactional databases.
Businesses must also think about how they will integrate data between these systems to create a unified view of corporate data.
AWS makes these steps easier for businesses, allowing developers to ingest data of all kinds – structured and unstructured, real-time and batch.
AWS Direct Connect
AWS Direct Connect is a cloud service solution that makes it easy to establish a dedicated network connection from your premises to AWS.
Using AWS Direct Connect, data that has previously been transferred over the internet is delivered via a private network connection between the user’s premises and AWS.
This is useful if users want consistent network performance or if they have bandwidth-intensive workloads.
Amazon Kinesis
Amazon Kinesis makes it easy to collect, process, and analyze real-time streaming data so you can get timely insights and react quickly to new information.
Amazon Kinesis enables data processing and analysis as it arrives and responds immediately, rather than having to wait until all data is collected before processing can begin.
Amazon Kinesis is fully managed and runs enterprise streaming applications without requiring any infrastructure management.
Kinesis offers four capabilities:
- Kinesis Video Streams
- Kinesis Data Streams
- Kinesis Data Firehose
- Kinesis Data Analytics
Amazon Kinesis Video Streams
Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.
Amazon Kinesis Data Streams
Kinesis Data Streams is a scalable and persistent real-time data transmission service that can continuously collect gigabytes (GB) of data per second from hundreds of thousands of different sources.
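As a sketch of how a producer might shape records for Kinesis Data Streams (the stream name and payload fields below are assumptions for illustration, not from the post):

```python
import json

def build_record(sensor_id, reading):
    """Serialize one sensor reading as a Kinesis Data Streams record.
    Records sharing a partition key are routed to the same shard."""
    return {
        "Data": json.dumps({"sensor_id": sensor_id, "reading": reading}).encode("utf-8"),
        "PartitionKey": sensor_id,  # shard routing key
    }

# Sending the record requires boto3 and AWS credentials, and assumes a
# stream named "sensor-stream" already exists:
#   import boto3
#   boto3.client("kinesis").put_record(
#       StreamName="sensor-stream", **build_record("sensor-1", 22.5))
```

The partition key matters for ordering: all records for one sensor land on one shard and are read back in order.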
Amazon Kinesis Data Firehose
Kinesis Data Firehose is the easiest way to capture, transform, and load data streams into AWS data stores for near-real-time analysis with your existing business intelligence tools.
Amazon Kinesis Data Analytics
Kinesis Data Analytics is the easiest way to process data streams in real-time with SQL or Apache Flink without having to learn new programming languages or frameworks.
AWS Snowball
An interesting way to migrate your data from on-premises to the AWS Cloud is AWS Snowball. It is a service that provides safe and secure appliances, so you can bring AWS storage and compute capabilities into edge environments and transfer data in and out of AWS.
Amazon S3
The most famous service in the AWS Big Data solution suite is probably Amazon S3. It is object storage built to store and retrieve any amount of data from anywhere.
It is a simple storage service that offers industry-leading durability, availability, performance, security, and virtually unlimited scalability at very low cost.
Amazon S3 was also the first AWS service launched in 2006.
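As a small sketch of how data-lake objects are often laid out in S3, here is a helper that builds Hive-style date-partitioned keys (the table, bucket, and file names are placeholders, not from the post):

```python
from datetime import date

def partitioned_key(table, day, filename):
    """Build an S3 object key using Hive-style date partitions, a common
    layout for data lakes on S3 that query engines can prune by date."""
    return (f"{table}/year={day.year}/month={day.month:02d}/"
            f"day={day.day:02d}/{filename}")

key = partitioned_key("orders", date(2021, 3, 15), "part-0001.parquet")

# Uploading the file requires boto3 and credentials; the bucket name here
# is an assumption:
#   import boto3
#   boto3.client("s3").upload_file("part-0001.parquet", "my-data-lake", key)
```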
Amazon S3 Glacier
Amazon S3 Glacier is an ultra-low-cost storage service that provides secure, durable, and flexible storage for data backup and archiving.
This perfectly meets the needs of businesses or organizations that need to store their data for years and even decades!
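A common way to use Glacier for archiving is an S3 lifecycle rule that transitions old objects automatically. Below is a minimal sketch of such a configuration (the prefix, rule ID, and bucket name are assumptions):

```python
# Lifecycle configuration that moves objects under "backups/" to the
# Glacier storage class after 90 days.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-backups",
            "Filter": {"Prefix": "backups/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }
    ]
}

# Applied with boto3 (bucket name is a placeholder):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake", LifecycleConfiguration=lifecycle_config)
```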
Big Data 2nd STEP: STORE
We are indeed big fans of Amazon S3, due to its scalability and ease of use. If you’re not using Amazon S3 for your data lakes, you’re probably missing out on a lot.
Clearly, many factors need to be considered when building a Big Data project. Any Big Data platform needs a secure, flexible, and durable repository to store data before or even after processing, and AWS provides several storage services depending on your specific requirements.
Amazon DynamoDB
Amazon DynamoDB is a key-value and document database that delivers single-digit-millisecond performance at any scale.
It is a fully managed AWS service, which means you don’t have to worry about provisioning infrastructure or patching software – you just use the service.
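To make the key-value model concrete, here is a sketch of shaping an item in DynamoDB’s low-level attribute-value format (the table design and key names are assumptions for illustration):

```python
def order_item(order_id, customer_id, total):
    """Shape an order as a DynamoDB item in the low-level attribute-value
    format used by the service API."""
    return {
        "pk": {"S": f"ORDER#{order_id}"},        # partition key
        "sk": {"S": f"CUSTOMER#{customer_id}"},  # sort key
        "total": {"N": str(total)},              # numbers are sent as strings
    }

# Writing the item requires boto3, credentials, and an existing table
# (the "orders" table name is a placeholder):
#   import boto3
#   boto3.client("dynamodb").put_item(
#       TableName="orders", Item=order_item("1001", "42", 99.90))
```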
Amazon RDS
Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database in the cloud.
Amazon RDS supports the Amazon Aurora, MySQL, MariaDB, Oracle, SQL Server, and PostgreSQL database engines and is typically the service used when customers migrate their databases from on-premises to AWS.
Amazon Aurora
Amazon Aurora is a relational database engine that combines the speed and reliability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases.
Big Data 3rd STEP: PROCESS & ANALYZE
This is the step where data is transformed from its raw state into a consumable format – often by sorting, aggregating, concatenating, and even implementing more advanced functions and algorithms.
The resulting data sets are then stored for further processing or made available for consumption through business intelligence and data visualization tools.
PROCESSING & ANALYZING SERVICES
Amazon Redshift
Amazon Redshift is the most widely used cloud data warehouse.
It makes analyzing all your data fast, simple, and cost-effective using standard SQL and your existing Business Intelligence (BI) tools.
It allows you to run complex analytical queries against terabytes (TB) to petabytes (PB) of structured and semi-structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution.
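A typical way to get data-lake files into Redshift is the COPY command, which loads directly from S3. Here is a small sketch that builds such a statement (the table, path, and IAM role ARN are placeholders):

```python
def copy_statement(table, s3_path, iam_role_arn):
    """Build a Redshift COPY command that bulk-loads Parquet files from S3
    using an IAM role for access."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"FORMAT AS PARQUET;"
    )

sql = copy_statement(
    "analytics.orders",
    "s3://my-data-lake/orders/",
    "arn:aws:iam::123456789012:role/redshift-load",
)
# The statement would then be executed against the cluster, e.g. via a SQL
# client or the Redshift Data API.
```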
Amazon Athena
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries you run.
We used Athena a lot in our implementations, and I have to say it really helped us with data discovery and data validation.
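Running an Athena query programmatically boils down to submitting SQL plus a result location. The sketch below builds the parameters for Athena’s StartQueryExecution API (the database, query, and S3 output location are assumptions):

```python
def athena_query(sql, database, output_location):
    """Build the parameters for Athena's StartQueryExecution API call.
    Athena writes query results to the given S3 location."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }

params = athena_query(
    "SELECT COUNT(*) FROM orders WHERE year = 2021",
    "datalake",
    "s3://my-athena-results/",
)

# Submitting the query requires boto3 and credentials:
#   import boto3
#   boto3.client("athena").start_query_execution(**params)
```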
AWS Glue
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analysis, machine learning, and application development.
AWS Glue has evolved significantly from its initial 0.9 release to AWS Glue 2.0, with improvements that help tie entire pipelines together.
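In a pipeline, Glue jobs are usually kicked off programmatically with per-run arguments. A minimal sketch of the arguments for Glue’s StartJobRun API (the job name and the `--process_date` argument are hypothetical job parameters, not built-in Glue flags):

```python
from datetime import date

def glue_job_run(job_name, day):
    """Build the arguments for Glue's StartJobRun API; the job reads its
    custom "--process_date" argument at runtime."""
    return {
        "JobName": job_name,
        "Arguments": {"--process_date": day.isoformat()},
    }

run = glue_job_run("nightly-etl", date(2021, 3, 15))

# Starting the run requires boto3, credentials, and an existing Glue job:
#   import boto3
#   boto3.client("glue").start_job_run(**run)
```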
Amazon EMR
Amazon EMR is a web service that enables businesses, researchers, data analysts, and developers to process large amounts of data easily and cost-effectively.
It uses the hosted Hadoop framework that runs on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
In contrast to Glue, which is serverless (you don’t provision any servers), EMR gives you more flexibility and control over the cluster, depending on your data processing workload.
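To illustrate that extra control, here is a sketch of a single EMR step that submits a Spark script through `command-runner.jar` (the step name, script path, and cluster ID are placeholders):

```python
def spark_step(name, script_s3_path):
    """Define one EMR step that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster alive if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_path],
        },
    }

step = spark_step("daily-aggregation", "s3://my-code/jobs/aggregate.py")

# Added to a running cluster with boto3 (cluster ID is a placeholder):
#   import boto3
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
```

Unlike a Glue job, you choose the instance types, cluster size, and Spark configuration yourself.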
Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is essentially a virtual machine in the cloud, with a range of use cases as broad as its name “Elastic” suggests.
Amazon SageMaker
Amazon SageMaker is a fully managed service that gives every developer and data scientist the ability to quickly build, train, and deploy machine learning (ML) models.
SageMaker removes the heavy lifting from every step of the machine learning process to make it easier to develop high-quality models.
AWS re:Invent 2020 also introduced a number of significant improvements to Amazon SageMaker, such as Data Wrangler, Clarify, SageMaker Pipelines, and more.
Big Data 4th STEP: VISUALIZE
Amazon QuickSight
Amazon QuickSight is a fast, easy-to-use, cloud-powered business analytics service that makes it easy for everyone in your organization to visualize data, perform ad-hoc analysis, and quickly get business insights from their data, anytime, on any device.
QuickSight is easy to use and has made some major improvements since its public release. Although relatively new compared to other major BI tools, Amazon QuickSight has a lot of potential, especially as a cost-effective BI solution.
About VTI Cloud
VTI Cloud is an Advanced Consulting Partner of AWS Vietnam with a team of over 50 AWS-certified solution engineers. With the desire to support customers in their journey of digital transformation and migration to the AWS cloud, VTI Cloud is proud to be a pioneer in consulting on solutions, developing software, and deploying AWS infrastructure for customers in Vietnam and Japan.
Building safe, high-performance, flexible, and cost-effective architectures for customers is VTI Cloud’s leading mission in enterprise technology.
In addition, VTI Cloud supports building the VIET-AWS community, one of the fastest-growing AWS User Groups, officially recognized by Amazon in the Asia Pacific (Vietnam) region.
VIET-AWS is a place for Solutions Architects, DevOps and SysOps engineers, and budding students to connect and support each other around Amazon Web Services (AWS) cloud computing. Join VIET-AWS here: https://www.facebook.com/groups/vietawscommunity