Week 16 / 2023
Fundamentals of Data Engineering
CH1: Data Engineering Described
- It builds the foundation for data science and analytics in production.
- The data engineering lifecycle: data generation, storage, ingestion, transformation, and serving.
- Data engineering has existed in some form since companies started doing things with data—such as predictive analysis, descriptive analytics, and reports.
- A data engineer gets data, stores it, and prepares it for consumption by data scientists, analysts, and others.
- Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.
- The term big data is essentially a relic to describe a particular time and approach to handling large amounts of data.
- Data engineering is increasingly a discipline of interoperation, and connecting various technologies like LEGO bricks, to serve ultimate business goals.
- A data engineer typically does not directly build ML models, create reports or dashboards, perform data analysis, build key performance indicators (KPIs), or develop software applications. Instead, a data engineer builds the infrastructure that enables these activities.
- Data Maturity Model: a framework for assessing the current state of an organization’s data practices and identifying areas for improvement. It treats data maturity as a continuum that organizations move along by improving their practices, and as a function of how well the organization meets its stakeholders’ needs. The model has four dimensions (data strategy, data culture, data infrastructure, and data governance), each with three levels (basic, intermediate, and advanced), and is designed both as a self-assessment tool and for benchmarking against other organizations.
- The simple data maturity model is the most important part.
- Data maturity is a helpful guide to understanding the types of data challenges a company will face as it grows its data capability.
- Stage 1: Starting with data
- Stage 2: Scaling with data
- Stage 3: Leading with data
- Data Engineer -> Data Lifecycle Engineer.
- As a company grows its data maturity, it will move from ad hoc data analysis to self-service analytics, allowing democratized data access to business users without needing IT to intervene.
- Internal BI faces a limited audience and generally presents a limited number of unified views. External BI is more flexible and can be used to present a wide variety of views to a wide variety of audiences.
- Reverse ETL takes processed data from the output side of the data engineering lifecycle and feeds it back into source systems, as shown in Figure 2-6. In reality, this flow is beneficial and often necessary; reverse ETL allows us to take analytics, scored models, etc., and feed these back into production systems or SaaS platforms. This is a common practice in the world of marketing automation, where we can take the results of a campaign and feed them back into the campaign to optimize it.
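A minimal reverse ETL sketch, assuming a local table of model scores and a hypothetical CRM REST endpoint (the URL, token, table, and field names are all illustrative):

```python
# Reverse ETL in miniature: read processed/scored data from the analytics side
# and push it back into an operational SaaS system via its API.
import sqlite3
import requests

CRM_URL = "https://crm.example.com/api/contacts"  # hypothetical endpoint
API_TOKEN = "..."                                  # hypothetical credential

def push_scores(db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT contact_id, churn_score FROM churn_scores").fetchall()
    conn.close()

    for contact_id, score in rows:
        # One request per record keeps the sketch simple; a real job would batch.
        resp = requests.patch(
            f"{CRM_URL}/{contact_id}",
            json={"churn_score": score},
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            timeout=10,
        )
        resp.raise_for_status()
```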
- Since money is involved, correctness is paramount.
- Data quality sits across the boundary of human and technology problems.
- Data integration and interoperability is the process of integrating data across tools and processes. As we move away from a single-stack approach to analytics and toward a heterogeneous cloud environment in which various tools process data on demand, integration and interoperability occupy an ever-widening swath of the data engineer’s job.
- Increasingly, integration happens through general-purpose APIs rather than custom database connections.
- While the complexity of interacting with data systems has decreased, the number of systems and the complexity of pipelines have dramatically increased.
- Data products differ from software products because of the way data is used.
- First and foremost, DataOps is a set of cultural habits;
- DataOps has three core technical elements: automation, monitoring and observability, and incident response.
- Observability, monitoring, logging, alerting, and tracing are all critical to getting ahead of any problems along the data engineering lifecycle.
- DODD (data observability driven development) focuses on making data observability a first-class consideration in the data engineering lifecycle.
- A system may have downtime, a new data model may break downstream reports, an ML model may become stale and provide bad predictions—countless problems can interrupt the data engineering lifecycle.
- Data engineers should proactively find issues before the business reports them.
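A toy freshness check in the spirit of the monitoring/observability bullets above; the table name and threshold are invented, and it assumes an updated_at column stored as UTC ISO-8601 text:

```python
# Toy data-freshness monitor: alert when the newest row in a table is too old,
# so engineers hear about staleness before the business does.
import sqlite3
from datetime import datetime, timedelta, timezone

def check_freshness(db_path: str, table: str, max_lag: timedelta) -> bool:
    conn = sqlite3.connect(db_path)
    (latest,) = conn.execute(f"SELECT MAX(updated_at) FROM {table}").fetchone()
    conn.close()

    if latest is None:
        print(f"ALERT: {table} is empty")
        return False

    now = datetime.now(timezone.utc).replace(tzinfo=None)  # naive UTC to match the stored text
    lag = now - datetime.fromisoformat(latest)
    if lag > max_lag:
        print(f"ALERT: {table} is {lag} behind (threshold {max_lag})")
        return False
    return True

# Example: expect orders to be at most 2 hours stale.
# check_freshness("warehouse.db", "orders", timedelta(hours=2))
```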
- A data architecture reflects the current and future state of data systems that support an organization’s long-term data needs and strategy.
- A data engineer should first understand the needs of the business and gather requirements for new use cases. Next, a data engineer needs to translate those requirements to design new ways to capture and serve data, balanced for cost and operational simplicity. This means knowing the trade-offs with design patterns, technologies, and tools in source systems, ingestion, storage, transformation, and serving data.
- Orchestration is the process of coordinating many jobs to run as quickly and efficiently as possible on a scheduled cadence.
- A pure scheduler, such as cron, is aware only of time; an orchestration engine builds in metadata on job dependencies, generally in the form of a directed acyclic graph (DAG).
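A bare-bones illustration of the DAG idea using only the standard library; real orchestrators (Airflow, Dagster, etc.) layer scheduling, retries, and metadata on top of this:

```python
# What an orchestrator adds over cron: jobs declare dependencies, and a valid
# execution order is derived from the resulting directed acyclic graph.
from graphlib import TopologicalSorter

# job -> set of jobs it depends on (illustrative pipeline)
dag = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "transform_sales": {"ingest_orders", "ingest_customers"},
    "serve_dashboard": {"transform_sales"},
}

for job in TopologicalSorter(dag).static_order():
    print("running", job)  # a real engine would also retry, alert, and log per job
```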
- Before data engineers begin engineering new internal tools, they would do well to survey the landscape of publicly available tools.
- When data engineers have to manage their infrastructure in a cloud environment, they increasingly do this through IaC frameworks rather than manually spinning up instances and installing software.
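The IaC idea in miniature: infrastructure is declared as data and a tool reconciles reality toward it. This toy reconciler only illustrates the pattern and is not any real framework's API:

```python
# Declarative infrastructure sketch: compare desired state with current state
# and derive the create/delete actions, instead of hand-running them.
desired = {"raw-bucket", "staging-bucket", "warehouse-cluster"}

def reconcile(current: set[str]) -> None:
    for resource in sorted(desired - current):
        print("creating", resource)  # a real tool would call the cloud provider API
    for resource in sorted(current - desired):
        print("deleting", resource)

reconcile(current={"raw-bucket", "old-test-bucket"})
```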
- A data engineer has several top-level goals across the data lifecycle: produce optimum ROI and reduce costs (financial and opportunity), reduce risk (security, data quality), and maximize data value and utility.
- Data without context is often meaningless and can lead to ill-informed and costly decisions. More metadata translates to being more data informed.
- Although asking a colleague about data is easy, it is highly inefficient at scale. It can be hard to re-train yourself to search for data on your own.
CH3: Designing Good Data Architecture
- Successful data engineering is built upon rock-solid data architecture.
- Enterprise architecture: business, technical, application, and data
- Enterprise architecture is the design of systems to support change in the enterprise, achieved by flexible and reversible decisions reached through careful evaluation of trade-offs.
- Technical solutions exist not for their own sake but in support of business goals.
- Data architecture is the design of systems to support the evolving data needs of an enterprise, achieved by flexible and reversible decisions reached through a careful evaluation of trade-offs.
- data engineering architecture is a subset of general data architecture.
- data architecture: operational architecture describes what needs to be done, and technical architecture details how it will happen.
- Good data architecture serves business requirements with a common, widely reusable set of building blocks while maintaining flexibility and making appropriate trade-offs
- AWS Well-Architected Framework, Google Cloud’s Five Principles for Cloud-Native Architecture.
- Event-driven architecture (EDA): the workflow boils down to three main areas: event production, routing, and consumption. An event must be produced and routed to something that consumes it without tightly coupled dependencies among the producer, event router, and consumer.
- The advantage of an event-driven architecture is that it distributes the state of an event across multiple services. This is helpful if a service goes offline, a node fails in a distributed system, or you’d like multiple consumers or services to access the same events. Anytime you have loosely coupled services, this is a candidate for event-driven architecture.
- EDA is an effective tool for reducing coupling between the components of a system by modelling interactions using the concepts of producers, consumers, events and streams.
- It may be argued that EDA is an essential element of any successful microservices deployment.
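A minimal in-process sketch of the producer / router / consumer split described in the EDA bullets above; a real deployment would put a broker or event router (Kafka, Pub/Sub, EventBridge, etc.) between the parties:

```python
# Event-driven architecture in miniature: producers publish events, a router
# fans them out, and consumers subscribe without knowing about each other.
from collections import defaultdict
from typing import Callable

subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def subscribe(event_type: str, handler: Callable[[dict], None]) -> None:
    subscribers[event_type].append(handler)

def publish(event_type: str, payload: dict) -> None:
    # Loose coupling: the producer never calls the consumers directly.
    for handler in subscribers[event_type]:
        handler(payload)

# Two independent consumers of the same event stream.
subscribe("order_placed", lambda e: print("update inventory:", e["order_id"]))
subscribe("order_placed", lambda e: print("send confirmation:", e["order_id"]))

publish("order_placed", {"order_id": 42})
```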
- A microservice is a small, loosely coupled, distributed service. Microservices are often deployed in containers and are managed by an orchestration engine such as Kubernetes.
- Microservices evolved as a solution to the scalability challenges with monolithic architectures.
- Brownfield Versus Greenfield Projects: Brownfield projects are those that are built on top of existing systems, whereas greenfield projects are those that are built from scratch.
- A data warehouse is a central data hub used for reporting and analysis.
- be aware of architecture tiers. Your architecture has layers—data, application, business logic, presentation, and so forth —and you need to know how to decouple these layers
- Architecture Tiers:
- Single tier: in a single-tier architecture, your database and application are tightly coupled, residing on a single server.
- Multitier (n-tier): in a multitier architecture, your application is split into multiple layers, each of which can run on its own server; this is the most common architecture for web applications.
- Microservices: in a microservices architecture, your application is split into multiple services, each running independently; this is the most common architecture for cloud-native applications.
- A common multitier architecture is the three-tier architecture, which consists of data, application logic, and presentation tiers.
- it is often impractical (and not advisable) to run analytics queries against production application databases. Doing so risks overwhelming the database and causing the application to become unavailable.
- Microservices architecture comprises separate, decentralized, and loosely coupled services. Each service has a specific function and is decoupled from other services operating within its domain.
- Another approach is the data mesh. With the data mesh, each software team is responsible for preparing its data for consumption across the rest of the organization.
- Traditionally, a data warehouse pulls data from application systems by using ETL. The data warehouse is then used to generate reports and dashboards. The data mesh, on the other hand, pushes data from application systems to a data lake. The data lake is then used to generate reports and dashboards.
- Google BigQuery, Snowflake, and other competitors popularized the idea of separating compute from storage.
- The ability to separate compute and storage allows database software increased availability and scalability, and has the potential to dramatically reduce cost
- “Separating compute and storage” involves designing database systems such that all persistent data is stored on remote, network-attached storage.
- Online analytical processing (OLAP) and online transactional processing (OLTP) are the two primary data processing systems used in data science.
- organizations are not typically making a decision between OLAP and OLTP.
- OLTP systems are designed to handle large volumes of transactional data involving multiple users. Relational databases rapidly update, insert, or delete small amounts of data in real time
- OLAP system is designed to process large amounts of data quickly, allowing users to analyze multiple data dimensions in tandem.
- Many OLAP systems pull their data from OLTP databases via an ETL pipeline and can provide insights.
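A tiny end-to-end illustration of that OLTP-to-OLAP flow, using SQLite for both sides; it assumes an orders table already exists in the source database, and all names are invented:

```python
# ETL in miniature: extract row-level transactional data, aggregate it, and
# load the result into a wide, analysis-friendly table.
import sqlite3

oltp = sqlite3.connect("app.db")        # transactional source (assumed to hold an orders table)
olap = sqlite3.connect("warehouse.db")  # analytical target

rows = oltp.execute("SELECT customer_id, amount, order_date FROM orders").fetchall()

olap.execute("CREATE TABLE IF NOT EXISTS daily_revenue (order_date TEXT, revenue REAL)")
daily: dict[str, float] = {}
for _customer_id, amount, order_date in rows:
    daily[order_date] = daily.get(order_date, 0.0) + amount

olap.executemany("INSERT INTO daily_revenue VALUES (?, ?)", list(daily.items()))
olap.commit()
```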
- A data mart is a more refined subset of a warehouse designed to serve analytics and reporting, focused on a single suborganization, department, or line of business; every department has its own data mart, specific to its needs. This is in contrast to the full data warehouse that serves the broader organization or business.
- the data lake allows an immense amount of data of any size and type to be stored. When this data needs to be queried or transformed, you have access to nearly unlimited computing power by spinning up a cluster on demand, and you can pick your favorite data-processing technology for the task at hand
- We should be careful not to understate the utility and power of first-generation data lakes.
- For many organizations, data lakes turned into an internal superfund site of waste, disappointment, and spiraling costs.
- Instead of choosing between a data lake or data warehouse architecture, future data engineers will have the option to choose a converged data platform based on a variety of factors, including vendor, ecosystem, and relative openness.
- Whereas past data stacks relied on expensive, monolithic toolsets, the main objective of the modern data stack is to use cloud-based, plug-and-play, easy-to-use, off-the-shelf components to create a modular and cost-effective data architecture. These components include data pipelines, storage, transformation, data management/governance, monitoring, visualization, and exploration. The domain is still in flux, and the specific tools are changing and evolving rapidly, but the core aim will remain the same: to reduce complexity and increase modularization.
- Key outcomes of the modern data stack are self-service (analytics and pipelines), agile data management, and using open source tools or simple proprietary tools with clear pricing structures
- the key concept of plug-and-play modularity with easy-to-understand pricing and implementation is the way of the future, especially in analytics engineering.
- The data mesh attempts to invert the challenges of centralized data architecture, taking the concepts of domain-driven design and applying them to data architecture. Because the data mesh has captured much recent attention, you should be aware of it.
- Domain-Driven Design (DDD):
- The solution is structured around the business model, connecting execution to the key business principles.
- Domain logic: Domain logic is the purpose of your modeling. Most commonly, it’s referred to as the business logic. This is where your business rules define the way data gets created, stored, and modified.
- Domain model: includes the ideas, knowledge, data, metrics, and goals that revolve around the problem you’re trying to solve. It contains the rules and patterns that help you deal with complex business logic and meet the requirements of the business.
- Subdomain: A domain consists of several subdomains that refer to different parts of the business logic. For example, an online retail store could have a product catalog, inventory, and delivery as its subdomains.
- Bounded context: represents the boundary within which a certain subdomain is defined and applicable.
- The Ubiquitous Language: the shared language that domain experts and developers use when they talk about the domain they are working on. It requires defining a set of terms that everyone uses; all the terms in the ubiquitous language are structured around the domain model.
- Entities: a combination of data and behavior
- Value objects and aggregates: value objects are immutable and defined only by their attributes (e.g., a money amount or an address), while aggregates group related entities and value objects into a unit with a single root that enforces consistency.
- Domain service: stateless domain logic that doesn’t naturally belong to any single entity or value object.
- Repository: an abstraction for storing and retrieving aggregates, hiding persistence details from the domain model.
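A minimal sketch of these DDD building blocks for a toy order subdomain (all names invented for illustration):

```python
# DDD building blocks in miniature: a value object, entities, an aggregate root
# that enforces its own invariants, and a repository that hides persistence.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Money:                  # value object: immutable, defined only by its attributes
    amount: float
    currency: str

@dataclass
class OrderLine:              # entity living inside the aggregate
    sku: str
    price: Money

@dataclass
class Order:                  # aggregate root: all changes go through it
    order_id: int
    lines: list[OrderLine] = field(default_factory=list)

    def add_line(self, line: OrderLine) -> None:
        if len(self.lines) >= 100:        # a domain rule lives with the domain model
            raise ValueError("order too large")
        self.lines.append(line)

class OrderRepository:        # repository: persistence hidden behind an interface
    def __init__(self) -> None:
        self._store: dict[int, Order] = {}

    def save(self, order: Order) -> None:
        self._store[order.order_id] = order

    def get(self, order_id: int) -> Order:
        return self._store[order_id]
```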
- Deep domain knowledge is needed.
- Domain-driven design is perfect for applications that have complex business logic. However, it might not be the best solution for applications with minor domain complexity but high technical complexity.
- However, if a company is small or low in its level of data maturity, a data engineer might work double duty as an architect.
CH4: Choosing Technologies Across the Data Engineering Lifecycle
- Architecture is strategic; tools are tactical.
- Architecture is the what, why, and when. Tools are used to make the architecture a reality; tools are the how.
- We strongly advise against choosing technology before getting your architecture right. Architecture first, technology second.
- Your team’s size will influence the types of technologies you adopt.
- use as many managed and SaaS tools as possible, and dedicate your limited bandwidth to solving the complex problems that directly add value to the business.
- In technology, speed to market wins. This means choosing the right technologies that help you deliver features and data faster while maintaining high-quality standards and security. It also means working in a tight feedback loop of launching, learning, iterating, and making improvements.
- Interoperability describes how various technologies or systems connect, exchange information, and interact.
- We look at costs through three main lenses: total cost of ownership, opportunity cost, and FinOps.
- Expenses fall into two big groups: capital expenses (capex) and operational expenses (opex).
- This is capex, a significant capital outlay with a long-term plan to achieve a positive ROI on the effort and expense put forth.
- In general, opex allows for a far greater ability for engineering teams to choose their software and hardware. Cloud-based services let data engineers iterate quickly with various software and technology configurations, often inexpensively.
- we urge data engineers to take an opex-first approach centered on the cloud and flexible, pay-as-you-go technologies.
- Total cost of ownership (TCO): the total estimated cost of an initiative, including the direct and indirect costs of the products and services used.
- Total opportunity cost of ownership (TOCO): the cost of the opportunities lost by choosing one technology, architecture, or process over the alternatives.
- typical cloud spending is inherently opex: companies pay for services to run critical data processes rather than making up-front purchases and clawing back value over time. The goal of FinOps is to fully operationalize financial accountability and business value by applying the DevOps-like practices of monitoring and dynamically adjusting systems.
- We have two classes of tools to consider: immutable and transitory.
- PaaS services allow engineers to ignore the operational details of managing individual machines and deploying frameworks across distributed systems. They provide turnkey access to complex, autoscaling systems with minimal operational overhead.
- Serverless products generally offer automated scaling from zero to extremely high usage rates. They are billed on a pay-as-you-go basis and allow engineers to operate without operational awareness of underlying servers.
- serverless usually means many invisible servers.
- every technology—even open source software—comes with some degree of lock-in
- open source and proprietary solutions.
- Today is possibly the most confusing time in history for evaluating and selecting technologies. Choosing technologies is a balance of use case, cost, build versus buy, and modularization.
- Serverless provides a quick time to value for the right use cases.
- Serverless promises to execute small chunks of code on an as-needed basis without having to manage a server. The main reasons for its popularity are cost and convenience. Serverless has many flavors; FaaS, BaaS, and CaaS are the most common.
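The FaaS flavor in a nutshell: you ship a handler function and the platform supplies the servers, scaling, and per-invocation billing. The signature below follows the common AWS Lambda convention, but the event shape is only an assumption for illustration:

```python
# A FaaS-style handler: no server management in the code, just a function
# invoked once per event.
import json

def handler(event, context):
    # 'event' carries the trigger payload; here we assume an HTTP-style event
    # whose body is JSON containing an "order_id" field (illustrative only).
    body = json.loads(event.get("body", "{}"))
    order_id = body.get("order_id")
    return {
        "statusCode": 200,
        "body": json.dumps({"received": order_id}),
    }
```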
- Containers play a role in both serverless and microservices.
- Abstraction will continue working its way across the data stack.
- With the data landscape morphing at warp speed, the best tool for the job is a moving target.
CH5: Data Generation in Source Systems
- But before you get raw data, you must understand where the data exists, how it is generated, and its characteristics and quirks.
- On the other hand, it will remain critical to understand the nature of data as it’s created in source systems.
- Put in the effort to read the source system documentation and understand its patterns and quirks. If your source system is an RDBMS, learn how it operates (writes, commits, queries, etc.); learn the ins and outs of the source system that might affect your ability to ingest from it.
- In addition, files are a universal medium of data exchange.
- In theory, APIs simplify the data ingestion task for data engineers.
- Fundamentally, OLTP databases work well as application backends when thousands or even millions of users might be interacting with the application simultaneously, updating and writing data concurrently. OLTP systems are less suited to use cases driven by analytics at scale, where a single query must scan a vast amount of data.
- Database ACID:
- Atomicity means that a transaction will either succeed or fail in its entirety. There is no partial success.
- Consistency means that any database read will return the last written version of the retrieved item.
- Isolation entails that if two updates are in flight concurrently for the same thing, the end database state will be consistent with the sequential execution of these updates in the order they were submitted.
- Durability indicates that committed data will never be lost, even in the event of power loss.
- Understanding the consistency model you’re working with helps you prevent disasters.
- An atomic transaction is a set of several changes that are committed as a unit.
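Atomicity made concrete with SQLite: the two updates below commit together or not at all:

```python
# An atomic transaction: the debit and the credit succeed or fail as a unit,
# so the database never exposes a half-finished transfer.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 0.0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'bob'")
except sqlite3.Error:
    print("transfer rolled back")

print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
```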
- running analytical queries on OLTP runs into performance issues due to structural limitations of OLTP or resource contention with competing transactional workloads.
- The online part of OLAP implies that the system constantly listens for incoming queries, making OLAP systems suitable for interactive analytics.
- Change data capture (CDC) is a method for extracting each change event (insert, update, delete) that occurs in a database
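A simplistic polling-based CDC sketch; log-based CDC tools (e.g., Debezium) instead tail the database's change log. The changelog table and its columns are invented for illustration:

```python
# Naive CDC by polling: keep a high-water mark and pull only the rows that
# changed since the last run.
import sqlite3

def pull_changes(conn: sqlite3.Connection, last_seen: str) -> list[tuple]:
    # Assumes a changelog table with an op column ('insert'/'update'/'delete')
    # and an updated_at timestamp stored as ISO-8601 text.
    return conn.execute(
        "SELECT id, op, updated_at FROM orders_changelog "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()

# Example: fetch everything changed since the previous checkpoint.
# changes = pull_changes(sqlite3.connect("app.db"), last_seen="2023-04-16T00:00:00")
```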