Week 18 / 2023

Mohamed Saif published on
26 min, 5044 words

Categories: data engineering

misc

  • pycharm shortcut: ctrl + shift + ;
  • cache hit, cache miss, cache hit ratio, data staleness, cache invalidation (write-through, write-around, write-back)
  • DataBricks
  • Apache Spark
  • Apache Hadoop
  • DuckDB
  • LightDash
  • Apache Superset
  • Atlas
  • internal knowledge base vs customer-service knowledge base

Fundamentals of Data Engineering

CH5: Data Generation in Source Systems, CH6: Storage
  • Message vs Streams: A message is typically sent through a message queue from a publisher to a consumer, and once the message is delivered, it is removed from the queue. A stream is a continuous flow of data that is not removed after it is consumed.
  • DBMS:
  • Lookup: How does the database find and retrieve data?
  • Query optimizer: How does the database choose the best way to execute a query?
  • Scaling and distribution: How does the database handle large amounts of data and queries? horizontal scaling vs vertical scaling?
  • Modeling patterns: How does the database model data? What are the best practices for modeling data?
  • CRUD: How does the database handle create, read, update, and delete operations? Every type of database handles CRUD operations differently.
  • Consistency: How does the database ensure that data is consistent? What are the trade-offs between consistency and availability? eventual consistency vs strong consistency?
  • Models provide the means for building quality software in a predictable manner. Patterns provide building blocks that help developers construct models faster and better.
  • Databases:
  • Relational and NoSQL databases: key-value, document, graph, wide-column, time-series, and search databases.
  • A time-series database is optimized for retrieving and statistical processing of time-series data. Time-series databases often utilize memory buffering to support fast writes and reads. The schema for a time series typically contains a timestamp and a small set of fields.
  • APIs:
  • REST: Representational State Transfer.
  • GraphQL: A query language for APIs and a runtime for fulfilling those queries with your existing data. Whereas REST APIs generally restrict your queries to a specific data model, GraphQL opens up the possibility of retrieving multiple data models in a single request.
  • Webhooks: A webhook is a way for an app to provide other applications with real-time information. A webhook delivers data to other applications as it happens, meaning you get data immediately. Webhooks are a simple event-based data-transmission pattern (a minimal receiver sketch follows this list).
  • RPC and gRPC: Remote Procedure Call (RPC) is a protocol that one program can use to request a service from a program located in another computer on a network without having to understand the network's details. gRPC is a high-performance, open-source universal RPC framework. It allows you to run a procedure on a remote system.
  • Message Queues and Event-Streaming: asynchronous, loosely coupled, and highly scalable applications. Message queues and event-streaming platforms are the backbone of event-driven architectures.
  • Message Queues: A message queue is a mechanism to asynchronously send data (usually as small individual messages, in the kilobytes) between discrete systems using a publish and subscribe model. A message queue is a temporary storage location used by sender applications while the message is routed to its destination. Message queues are a critical ingredient for decoupled microservices and event-driven architectures (a toy producer/consumer sketch follows this list).
  • Some things to keep in mind with message queues are frequency of delivery, message ordering, and scalability.
  • Message ordering and delivery guarantees are important considerations when choosing a message queue.
  • In some ways, an event-streaming platform is a continuation of a message queue with the added ability to replay messages and store them for a longer period of time. Event-streaming platforms are a critical component of event-driven architectures.
  • In an event-streaming platform, a producer streams events to a topic, a collection of related events.
  • Data engineers who abuse source systems may need to look for another job when production goes down.
  • Were the systems built with consideration of data ingestion? Often the answer is no, which has implications ranging from the ingestion putting unintended load on the system to the inability to load data incrementally.
  • First, what is the interface to the data? A database, a REST API, stream processing (Apache Kafka), a shared network file system or cloud storage bucket, a data warehouse or data lake, HDFS, or an HBase database?
  • In addition to the interface, the structure of the data will vary.
  • Common characteristics of “messy data”: duplicate, orphaned, incomplete, inconsistent, and invalid data.
  • for ensuring data cleanliness and validity:
  • The need to frequently extract high volumes of data from source systems is a common use case in a modern data stack.
  • A data warehouse is a database where data from different systems is stored and modeled to support analysis and other activities related to answering questions with it. Data in a data warehouse is structured and optimized for reporting and analysis queries.
  • A data lake is where data is stored, but without the structure or query optimization of a data warehouse. It will likely contain a high volume of data as well as a variety of data types.
  • It’s rare to find two organizations with exactly the same data infrastructure.
  • key decision points in selecting a product or tool: dollars, engineering resources, security, and legal risk tolerance.
  • Even for seasoned data engineers, designing a new data pipeline is a new journey each time.
  • Common Data Pipeline Patterns: ETL, ELT
  • Either way, the end result is loading data into the data warehouse, data lake, or other destination.
  • Emergence of ELT over ETL: The emergence of ELT over ETL is due to the increased availability of cloud storage and the ability to scale compute resources on demand. It’s now better to focus on extracting data and loading it into a data warehouse where you can then perform the necessary transformations to complete the pipeline.
  • However, in analytics the situation is reversed. Instead of the need to read and write small amounts of data frequently, we often read and write a large amount of data infrequently.
  • A columnar database, such as Snowflake, Amazon Redshift, or Google BigQuery, stores data in disk blocks by column rather than row.
  • A final benefit is reduction in storage, thanks to the fact that blocks can be fully utilized and optimally compressed since the same data type is stored in each block rather than multiple types that tend to occur in a single row-based record.
  • With ELT, data engineers can focus on the extract and load steps in a pipeline (data ingestion), while analysts can utilize SQL to transform the data that’s been ingested as needed for reporting and analysis.
  • Because of these exceptional performance characteristics, SSDs have revolutionized transactional databases and are the accepted standard for commercial deployments of OLTP systems.
  • However, SSDs are not currently the default option for high-scale analytics data storage
  • Some OLAP databases leverage SSD caching to support high-performance queries on frequently accessed data.
  • AWS was a huge step forward in terms of flexibility and scalability, but a massive step backward in terms of pricing.
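
A minimal sketch of the webhook pattern above, assuming Flask is installed. The /webhooks/orders endpoint and the payload fields are hypothetical, and a real receiver would also verify the sender's signature before trusting the payload.

```python
# Minimal webhook receiver sketch (Flask). The endpoint path and payload
# fields are hypothetical; a real integration would also verify a signature
# header from the sender before trusting the payload.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/webhooks/orders", methods=["POST"])
def handle_order_event():
    event = request.get_json(force=True)  # payload pushed by the source app as it happens
    # In practice, enqueue the event for downstream processing rather than
    # doing heavy work inside the request handler.
    print(f"received {event.get('event_type')} for order {event.get('order_id')}")
    return jsonify({"status": "accepted"}), 202

if __name__ == "__main__":
    app.run(port=5000)
```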
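
And a toy producer/consumer sketch using Python's standard-library queue.Queue. A real message queue (RabbitMQ, SQS, etc.) runs as a separate broker, but the core behavior is the same: producer and consumer are decoupled, and a message leaves the queue once it is consumed, unlike a stream, which retains events for replay.

```python
# Toy producer/consumer sketch with Python's standard-library queue.
# A real message queue lives in a separate broker, but the core idea is the
# same: the producer and consumer are decoupled, and a message disappears
# from the queue once it has been consumed.
import queue
import threading

message_queue = queue.Queue()

def producer():
    for i in range(5):
        message_queue.put({"event_id": i, "payload": f"order-{i} created"})
    message_queue.put(None)  # sentinel: no more messages

def consumer():
    while True:
        message = message_queue.get()
        if message is None:
            break
        print("consumed:", message)  # the message is now gone from the queue
        message_queue.task_done()

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
```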

Cloud Storage

  • Distributed storage coordinates the activities of multiple servers to store, retrieve, and process data faster and at a larger scale, all while providing redundancy in case a server becomes unavailable.
  • MapReduce: a programming model and software framework for processing large volumes of data in a distributed computing environment. It divides the processing task into two main phases: the map phase and the reduce phase.
  • MapReduce provides a simple and scalable approach to distributed data processing. It abstracts away the complexity of managing distributed computing tasks, fault tolerance, and data parallelization, making it easier for developers to write distributed programs.
  • From rows to columns: the basic storage mechanics of data (row vs. column orientation). Columnar databases are optimized for reading data, while row-oriented databases are optimized for writing data.
  • Row-oriented storage keeps each individual record together; column-oriented storage keeps each individual column together.
  • Data is generally read off disk by blocks of KB (or MB/GB), so a single read for 1 record brings in a lot more than what you may want.
  • File systems differ in the size of the blocks they read off disk: HDFS uses 64 MB blocks, while ext4 uses 4 KB blocks, etc.
  • Block sizes are conceptually different for a filesystem/OS, a database, and big-data stores like S3/GFS; it's a logical unit in each case. AWS EBS uses 4 KB blocks, and S3 is an object store but its minimum allocation is 128 KB.
  • For columnar block reads, the amount read depends on the column size: if a column chunk is 1 KB and you want to read one record, you read 1 KB; if it is 1 MB, you read 1 MB.
  • Postgres has both traditional row-based storage (sometimes called “heap files”) and a columnar store extension (cstore_fdw); as an extension, it is presumably not as robust as the standard row storage.
  • Columnar performance was only faster on a cold boot/cleared cache; once the DB fills its caches, traditional row storage on Postgres was consistently faster. However, columnar stores target big-data, ~TB-scale datasets larger than memory, which exhibit cold-cache behavior.
  • Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval (see the Parquet sketch after this list).
  • Database Storage Engine:
  • 1
  • 2
  • A database storage engine is an internal software component that a database server uses to store, read, update, and delete data in the underlying memory and storage systems.
  • B-Tree: A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. B-trees usually grow wide and shallow, so for most queries very few nodes need to be traversed. However, the need to maintain a well-ordered data structure with random writes usually leads to poor write performance.
  • LSM Tree: A log-structured merge-tree (or LSM tree) is a data structure with performance characteristics that make it attractive for providing indexed access to files with high insert volume, such as transactional log data. LSM trees maintain data in two or more separate structures, each of which is optimized for its respective underlying storage medium; data is synchronized between the two structures efficiently, in batches. This allows for high throughput of random inserts into the database, while sacrificing some read performance and space efficiency.
  • LSM engines are now default in popular NoSQL databases including Apache Cassandra, Elasticsearch (Lucene), Google Bigtable, Apache HBase, and InfluxDB.
  • The database API layer (SQL vs. NoSQL) is a choice independent of the storage engine since both B-tree and LSM engines are used in SQL and NoSQL databases.
  • Bloom Filter:
  • 1
  • 2
  • 3
  • The Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter); the more elements that are added to the set, the larger the probability of false positives.
  • To understand Bloom filters, you first have to understand hashing. A hash is like a fingerprint for data. A hash function takes your data — which can be any length — as an input, and gives you back an identifier of a smaller (usually), fixed (usually) length, which you can use to index or compare or identify the data.
  • There are literally thousands of named hashing functions. Some are secure, but comparatively slow to calculate. Some are very fast, but have more collisions. Some are close to perfectly uniformly distributed, but very hard to implement. You get the idea. If there’s one rule in programming it’s this: there will always be trade-offs.
  • When suggesting stories to a user, the system may come up with thousands of possibilities, but not every one will be appropriate for a user. Perhaps a story has been suggested to them too many times already. Perhaps they’ve already read it. We need to filter those stories out, but we don’t want to retrieve thousands of records from the database to do it. We need a mechanism that is particularly good at answering the question “has this been seen before?”. And for that, we use a Bloom filter.
  • Remember that using a hashed value lets a database jump straight to the record, like an index in a book.
  • Bloom filters can very quickly answer variations on the yes/no question “is this item in the set?”, like “have I seen this item before?” (a toy implementation follows this list).
  • Very rarely, it will say Yes when the answer is actually No (although it will never say No, when the answer is actually Yes).
  • database indexes
  • ACID, CAP, BASE, MapReduce
  • distributed compute, distributed file system
  • DuckDB: an embeddable SQL OLAP database management system. It can efficiently run SQL queries directly on Pandas DataFrames and offers advantages such as a query optimizer, automatic multi-threading, and larger-than-memory computation. It uses the Postgres SQL parser (see the DuckDB example after this list).
  • Result set serialization (RSS): RSS is a technique for serializing a result set into a single string. The string can then be sent over the network and deserialized back into a result set.
  • RSS has a significant impact on overall system performance when the network is slow or when the result set is large.
  • Data Engineering Storage Abstractions: Data Warehouse, Data Lake, Data Lakehouse.
  • Purpose and use case: You must first identify the purpose of storing the data.
  • Update patterns: Is the abstraction optimized for bulk updates, streaming inserts, or upserts?
  • Cost
  • Separate storage and compute: The storage and compute layers are decoupled, allowing you to scale them independently.
  • Data warehouses are a standard OLAP data architecture.
  • MPP (Massively Parallel Processing) database: a database designed to handle massive amounts of data by distributing storage and query processing across multiple nodes or servers.
  • The limitation is that cloud data warehouses cannot handle truly unstructured data, such as images, video, or audio, unlike a true data lake.
  • A lakehouse system is a metadata and file-management layer deployed with data management and transformation tools. Databricks has heavily promoted the lakehouse concept with Delta Lake, an open source storage management system.
  • Reserializing data from a proprietary database format incurs overhead in processing, time, and cost. In a data lakehouse architecture, various tools can connect to the metadata layer and read data directly from object storage.
  • Increasingly, vendors are styling their products as data platforms. These vendors have created their ecosystems of interoperable tools with tight integration into the core data storage layer.
  • some big ideas in storage—key considerations that you need to keep in mind as you build out your storage architecture.
  • Data Catalog: A data catalog is a metadata management tool designed to help organizations find and manage their data. Data catalogs centralize metadata from multiple sources, providing data analysts and other users with a single source of truth for data discovery and governance.
  • Atlan: Data Discovery, Data 360, Business Glossary
  • Google for your data. Atlan supports natural language search that scans for related words in addition to your search term. This means that you can search for “sales” and get results for “revenue” and “profit” as well. Data teams are diverse: analysts, engineers, and scientists all have their own preferences. Can we create personalizations for every user based on how they love to work?
  • The Sidebar answers questions that can help identify the right table to use, e.g., for a dashboard on customer behavior.
  • Data Asset 360:
  • Business Glossary: A business glossary is a set of definitions for commonly used business terms, aggregating information from across the organization into a single source of truth. A business glossary is a critical component of a data catalog, providing context for data assets and helping users understand the meaning of the data they’re working with.
  • you can link your terms to data assets like tables, columns, and queries, to build a connected semantic layer for your business.
  • Governance Center: A simple, scalable way to manage access policies and curate assets for your users.
  • Column-level Lineage: Column-level lineage is a feature that allows you to track the flow of data from its source to its destination, and understand how it has been transformed along the way. Column-level lineage is a critical component of data governance, helping you ensure that your data is accurate and trustworthy. Atlan's lineage is powered by automated SQL parsing, from your warehouse to your BI tool.
  • Atlan's integration with dbt Semantic Layer creates trust and visibility for your business metrics, so that you'll never be lost without context about your metrics ever again.
  • Data catalogs have both organizational and technical use cases.
  • Two major schema patterns exist: schema on write and schema on read.
  • Separation of compute and storage:
  • Ephemerality and scalability: The cloud is ephemeral. This means that you can spin up and spin down resources as needed. This is a major advantage over on-premises data centers, where you have to plan for peak capacity and pay for idle resources.
  • Data durability and availability: Data durability is the ability to recover data after a failure. Data availability is the ability to access data when you need it. The cloud offers high durability and availability, but it’s not perfect. You still need to plan for failure.
  • In practice, we constantly hybridize colocation and separation to realize the benefits of both approaches.
  • zero-copy cloning. This typically means that a new virtual copy of an object is created (e.g., a new table) without necessarily physically copying the underlying data. Typically, new pointers are created to the raw data files, and future changes to these tables will not be recorded in the old table.
  • Did you know that data has a temperature? Depending on how frequently data is accessed, we can roughly bucket the way it is stored into three categories of persistence: hot, warm, and cold.
  • Storage tier considerations. When considering the storage tier for your data, consider the costs of each tier. If you store all of your data in hot storage, all of the data can be accessed quickly. But this comes at a tremendous price!
  • If you’re using cloud-based object storage, create automated lifecycle policies for your data.
  • Data retention: Data retention is the amount of time that data is stored. Data retention is often governed by legal and regulatory requirements. For example, the European Union’s General Data Protection Regulation (GDPR) requires that personal data be deleted after a certain period of time.
  • Back in the early days of “big data,” there was a tendency to err on the side of accumulating every piece of data possible, regardless of its usefulness. The expectation was, “we might need this data in the future.”
  • Nowadays, data engineers need to consider data retention: what data do you need to keep, and how long should you keep it?
  • Factors to weigh when deciding what to retain and for how long: value, time, cost, and compliance.
  • Single-Tenant Storage: Single-tenant storage is a storage architecture in which each tenant has their own dedicated storage. Single-tenant storage is the traditional storage architecture for on-premises data centers.
  • As always, exercise the principle of least privilege. Don’t give full database access to anyone unless required.
  • Cataloging enables data scientists, analysts, and ML engineers by enabling data discovery.
  • Data versioning can help with error recovery when processes fail, and data becomes corrupted. data version control can aid ML engineers in tracking changes that lead to model performance degradation.
  • DataOps is not orthogonal to data management, and a significant area of overlap exists.
  • Consider the following data architecture tips. Design for required reliability and durability. Understand the upstream source systems and how that data, once ingested, will be stored and accessed. Understand the types of data models and queries that will occur downstream.
  • Don’t prematurely optimize, but prepare for scale if business opportunities exist in operating on large data volumes.
  • Lean toward fully managed systems, and understand provider SLAs. Fully managed systems are generally far more robust and scalable than systems you have to babysit.
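
The Parquet sketch referenced above: a rough illustration of columnar reads with pandas and pyarrow (both assumed installed). Reading a single column only touches that column's bytes, not the whole table.

```python
# Columnar storage in practice: write a small DataFrame to Parquet, then
# read back a single column. With a column-oriented file format the reader
# only has to touch the bytes for that column, not the whole table.
# Assumes pandas and pyarrow are installed.
import pandas as pd

df = pd.DataFrame({
    "order_id": range(1_000),
    "customer": [f"cust-{i % 50}" for i in range(1_000)],
    "amount": [round(i * 0.37, 2) for i in range(1_000)],
})

df.to_parquet("orders.parquet", engine="pyarrow")

# Only the 'amount' column is read off disk.
amounts = pd.read_parquet("orders.parquet", columns=["amount"])
print(amounts["amount"].sum())
```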
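
The toy Bloom filter referenced above, built on the standard-library hashlib. The bit-array size and number of hash functions are arbitrary here; real implementations derive them from the expected element count and the acceptable false-positive rate.

```python
# Toy Bloom filter: k hash functions set k bits per element. Membership
# queries can return false positives but never false negatives.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item: str):
        # Derive several bit positions by seeding the hash differently each time.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
seen.add("story-42")
print(seen.might_contain("story-42"))   # True: "possibly in set"
print(seen.might_contain("story-99"))   # almost certainly False: "definitely not in set"
```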
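
And the DuckDB example: querying a Pandas DataFrame in-process, assuming a recent duckdb release (which exposes duckdb.sql) and pandas are installed.

```python
# DuckDB can run SQL directly against an in-memory Pandas DataFrame,
# referring to it by its Python variable name.
import duckdb
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "b", "a", "c", "b", "a"],
    "amount":   [10,  20,  30,  40,  50,  60],
})

result = duckdb.sql("""
    SELECT category, SUM(amount) AS total
    FROM df
    GROUP BY category
    ORDER BY total DESC
""").df()   # back to a Pandas DataFrame

print(result)
```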
In-process database
  • SQLite is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine.
  • The database file format is cross-platform - you can freely copy a database between 32-bit and 64-bit systems or between big-endian and little-endian architectures.
  • Think of SQLite not as a replacement for Oracle but as a replacement for fopen().
  • SQLite is an in-process database: it runs as a library linked directly into the application that uses it, operating within the same process space as the application itself (see the usage sketch after this list).
  • SQLite is a serverless database, which means that it does not require a separate server process to operate.
  • SQLite is a zero-configuration database, which means that it does not need any setup or administration by a database administrator or system administrator.
  • Advantages and characteristics:
  • 1. Lightweight: SQLite is designed to be lightweight and self-contained.
  • 2. Serverless and zero-configuration: there is no need to install and configure a separate database server.
  • 3. Efficiency: In-process databases like SQLite avoid the overhead of inter-process communication (IPC) that occurs in client-server database systems.
  • 4. Single-user Focus: SQLite is primarily designed for single-user scenarios, where a single application accesses the database at a time. It doesn't provide built-in mechanisms for concurrent access and locking required for multi-user environments.
  • The opposite of an in-process database is an out-of-process or client-server database.
  • Out-of-process databases offer several benefits:
  • 1. Multi-user support: Out-of-process databases are designed to support multiple users accessing the database at the same time.
  • 2. Scalability: In client-server architectures, the database server can be scaled independently of the client applications.
  • 3. Centralized Management: administrators can centrally manage the database server, perform backups, configure security settings, and enforce access controls.
  • Single Database File: SQLite stores the entire database (definitions, tables, indices, and the data itself) as a single cross-platform file on a host machine.
  • Manifest Typing: In manifest typing, the datatype is a property of the value itself, not of the column in which the value is stored.
  • Variable-length records: SQLite uses only the amount of disk space actually needed to store the information in each row.
  • SQL statements compile into virtual machine code: SQLite compiles SQL statements into virtual machine code, which is then executed by a virtual machine. Other database engines typically interpret SQL statements or compile them into some form of byte code. The virtual machine allows the developers to see clearly and in an easily readable form what SQLite is trying to do with each statement it compiles, which is a tremendous help in debugging.
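
The usage sketch referenced above: Python's standard-library sqlite3 module, showing the zero-configuration, single-file, in-process nature of SQLite.

```python
# SQLite in practice: no server to start, no configuration; the whole
# database lives in one file and runs inside this Python process.
import sqlite3

conn = sqlite3.connect("app.db")  # creates the file if it doesn't exist
conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello from an in-process database",))
conn.commit()

for row in conn.execute("SELECT id, body FROM notes"):
    print(row)

conn.close()
```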
Row-wise vs. Columnar Databases: Theory and Practice
  • The operating system loads the whole block containing the data into memory for reading. Any further read of data from this block is served from memory and is very fast.
  • Read/write operations on disks are not slow; only seek operations are slow.
  • Because of this, sequential reads/writes are much faster on disks than random access.
  • But how does this difference translate to performance and different use cases?
  • Columnar databases are quite slow for insertions, since inserting a whole row means touching the storage for every column.
  • One advantage of this approach is that it allows for better compression of the data. Since columnar databases store homogeneous data in a single block, they can apply compression algorithms that are optimized for specific data types. For example, a compression algorithm that works well for numeric data may not work as well for text data. By compressing the data in a block, columnar databases can optimize space and reduce storage requirements (see the compression sketch after this list).
  • In addition, because columnar databases store data in columns, they tend to have less fragmentation than row-based databases. Fragmentation occurs when data is added, updated, or deleted, causing the storage space to become fragmented and leading to reduced performance. In a columnar database, since data for a specific column is stored together, adding or updating data in one column does not necessarily cause fragmentation in the storage space for other columns. This can lead to better performance over time, particularly when dealing with large datasets with many columns and rows.
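
The compression sketch referenced above: the same records serialized column-by-column (homogeneous runs of values) versus row-by-row (interleaved types), compressed with gzip. The exact numbers are illustrative, not a benchmark.

```python
# Rough illustration of the compression point: column-major serialization
# groups similar values together and usually compresses better than
# row-major interleaving of mixed types. Illustrative only, not a benchmark.
import gzip
import json

rows = [{"id": i, "country": "EG" if i % 2 else "US", "amount": i * 0.5}
        for i in range(10_000)]

row_major = json.dumps(rows).encode()
col_major = json.dumps({
    "id":      [r["id"] for r in rows],
    "country": [r["country"] for r in rows],
    "amount":  [r["amount"] for r in rows],
}).encode()

print("row-major compressed:   ", len(gzip.compress(row_major)))
print("column-major compressed:", len(gzip.compress(col_major)))
```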
Object storage
  • An object is a file plus metadata, and an object store is a collection of objects.
  • object stores are fundamentally different from file storage.
  • Individual objects are sharded across a series of storage nodes. You don’t interact with these nodes directly; instead, you interact with the router.
  • Object stores embrace two more concepts to manage this scale: versions and metadata.
  • Object storage is designed for scale, and it’s good for large data sets.
  • object storage isn’t as fast as block or file storage.
  • You also can’t edit individual objects, only write new ones. If you’re making frequent, small edits to an object, then object storage is the wrong choice. Don’t try to run a database on object storage! (See the sketch after this list.)
  • Infinitely scalable but at slower speeds
  • There’s no one way to do storage.
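
A sketch of the object-storage interaction model using boto3; the bucket name is hypothetical and running it requires AWS credentials. The point is that objects are written and read as whole units, never edited in place.

```python
# Object storage sketch with boto3 (AWS S3). The bucket name is hypothetical.
# Objects are written and read as whole units: to "edit" an object you
# upload a new version, you never modify bytes in place.
import boto3

s3 = boto3.client("s3")
bucket = "my-example-data-lake"  # hypothetical bucket

# Write an object (key + bytes + optional metadata).
s3.put_object(
    Bucket=bucket,
    Key="raw/orders/2023-05-01.json",
    Body=b'{"order_id": 1, "amount": 9.99}',
    Metadata={"source": "orders-api"},
)

# Read it back as a whole object.
obj = s3.get_object(Bucket=bucket, Key="raw/orders/2023-05-01.json")
print(obj["Body"].read())
```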
Incremental Loads
  • When moving data in an ETL process, the most efficient design pattern is to touch only the data you must, copying just the data that was newly added or modified since the last load was run. This pattern of incremental loads usually presents the least amount of risk, takes less time to run, and preserves the historical accuracy of the data.
  • An incremental load is the selective movement of data from one system to another.
  • This differs from the conventional full data load, which copies the entire set of data from a given source.
  • Because incremental loads only move the delta, you can expect more consistent performance over time (a watermark-based sketch follows this list).
  • There’s not always a clear way to identify new and changed data.
  • CDC (Change Data Capture) is a technique that identifies changes to data and records them (often in a separate change table or via the database’s change log) so downstream systems can ingest only the changes.
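
The watermark-based sketch referenced above, with sqlite3 standing in for the source system; the table and column names are made up. Only rows modified since the last successful load are pulled.

```python
# Watermark-based incremental extraction sketch. sqlite3 stands in for the
# source system; table and column names are made up. Only rows modified
# since the last successful load are pulled.
import sqlite3

def extract_incremental(conn, last_watermark: str):
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2023-05-01T10:00:00"),
     (2, 20.0, "2023-05-02T09:30:00"),
     (3, 30.0, "2023-05-03T14:45:00")],
)

delta, watermark = extract_incremental(conn, "2023-05-01T23:59:59")
print(delta)      # only the rows changed since the last load
print(watermark)  # persist this for the next run
```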

CH7: Data Ingestion

  • Data ingestion is the process of moving data from one place to another.
  • A data pipeline is the combination of architecture, systems, and processes that move data through the stages of the data engineering lifecycle.
  • A data pipeline should be flexible enough to fit any needs along the data engineering lifecycle.
  • Considerations for data ingestion:
  • 1. Bounded vs. Unbounded Data: Bounded data is data that has a clear beginning and end. Unbounded data is data that is continuously generated and has no clear beginning or end. All data is unbounded until it’s bounded.
  • 2. Frequency: How often is the data generated? Is it a one-time load or a continuous stream? Ingestion processes can be batch, micro-batch, or real-time.
  • 3. Synchronous Versus Asynchronous Ingestion: Synchronous ingestion is when the ingestion process waits for the data to be available before it starts. Asynchronous ingestion is when the ingestion process starts before the data is available. With synchronous ingestion, the source, ingestion, and destination have complex dependencies and are tightly coupled. With asynchronous ingestion, dependencies can now operate at the level of individual events, much as they would in a software backend built from microservices.
  • 4. Serialization and Deserialization: Data must be serialized for transport and deserialized at the destination; make sure the destination system can deserialize the format you send.
  • 5. Throughput and Scalability: Throughput is the amount of data that can be processed in a given amount of time. Scalability is the ability to handle increased throughput. Whenever possible, use managed services that handle the throughput scaling for you.
  • 6. Reliability and Durability: Reliability entails high uptime and proper failover for ingestion systems. Durability entails making sure that data isn’t lost or corrupted. Our advice is to evaluate the risks and build an appropriate level of redundancy and self-healing based on the impact and cost of losing data.
  • 7. Payload: The payload is the data being ingested. It has characteristics such as kind (type and format), shape (its dimensions), size, schema and data types, and metadata.
  • A great engineering challenge is understanding the underlying schema. Some source APIs expose clean, well-understood data; in other cases, engineers are not so lucky: the API is a thin wrapper around underlying systems, requiring engineers to understand application internals to use the data.
  • It’s becoming increasingly common for ingestion tools to automate the detection of schema changes and even auto-update target tables.
  • Metadata can be as critical as the data itself.
  • 8. Push Versus Pull Versus Poll Patterns: With push, the source system sends data to the destination; with pull, the destination retrieves data from the source. A related pattern is polling for data: the destination periodically checks the source for changes and, when changes are detected, pulls the data as it would in a regular pull situation (a polling sketch follows this list).
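
The polling sketch referenced above. The two source-system calls are hypothetical stubs; the point is the loop structure: check a change marker on a schedule, and pull only when it moves.

```python
# Poll pattern sketch: periodically check the source for changes and pull
# only when something new appears. The two functions below are hypothetical
# stand-ins for real source-system calls.
import time

def get_source_max_updated_at():
    """e.g. run: SELECT MAX(updated_at) FROM source_table"""
    ...

def pull_changes(since):
    """e.g. run: SELECT * FROM source_table WHERE updated_at > :since"""
    ...

def poll_forever(interval_seconds=60):
    last_seen = "1970-01-01T00:00:00"
    while True:
        latest = get_source_max_updated_at()
        if latest and latest > last_seen:   # change detected: pull like a normal pull
            pull_changes(since=last_seen)
            last_seen = latest
        time.sleep(interval_seconds)        # otherwise wait for the next poll
```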
Batch Ingestion Considerations
  • Batch ingestion means data is ingested by taking a subset of data from a source system, based either on a time interval or on the size of accumulated data.
  • Snapshot or Differential (sometimes called incremental) Extraction: Snapshot extraction is when the entire data set is extracted from the source system. Differential extraction is when only the changes since the last extraction are extracted from the source system.
  • File-Based Export and Ingestion: File-based ingestion is when data is exported from a source system into a file, and then the file is ingested into a destination system. This is the most common pattern for batch ingestion. Data is quite often moved between databases and systems using files. Common file-exchange methods are object storage, secure file transfer protocol (SFTP), electronic data interchange (EDI), or secure copy (SCP).
  • ETL Versus ELT:
  • Batch-oriented systems often perform poorly when users attempt to perform many small-batch operations rather than a smaller number of large operations. Understand the appropriate update patterns for the database or data store you’re working with. Know the limits and characteristics of your tools.
  • Most data systems perform best when data is moved in bulk rather than as individual rows or events (see the sketch after this list).
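
A small sketch of that point, using sqlite3 as a stand-in target: committing row by row versus loading the whole batch with executemany in one transaction. Timings vary, and the gap is even larger against an on-disk database where every commit must be flushed.

```python
# Bulk vs. row-at-a-time loading, using sqlite3 as a stand-in target.
# Committing after every row adds per-row overhead; loading the batch with
# executemany in a single transaction is typically much faster, which is
# the same reason batch systems dislike many tiny operations.
import sqlite3
import time

rows = [(i, f"event-{i}") for i in range(20_000)]

def row_at_a_time():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, name TEXT)")
    start = time.perf_counter()
    for row in rows:
        conn.execute("INSERT INTO events VALUES (?, ?)", row)
        conn.commit()                      # one transaction per row
    return time.perf_counter() - start

def bulk():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, name TEXT)")
    start = time.perf_counter()
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()                          # single transaction for the whole batch
    return time.perf_counter() - start

print(f"row-at-a-time: {row_at_a_time():.3f}s")
print(f"bulk:          {bulk():.3f}s")
```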