Week 20 / 2023

CH8: Queries, Modeling, and Transformation

  • By understanding queries, modeling, and transformations, you’ll have the tools to turn raw data ingredients into something consumable by downstream stakeholders.
  • Data definition language (DDL): CREATE, ALTER, DROP, TRUNCATE, RENAME, COMMENT, etc. Used to define and modify the structure of database objects: tables, views, indexes, users, etc.
  • Data manipulation language (DML): SELECT, INSERT, UPDATE, DELETE, MERGE, etc. Used to query and modify data.
  • Data control language (DCL): GRANT, DENY, REVOKE, etc. Controls access to database objects and data.
  • Transaction control language (TCL): COMMIT, ROLLBACK, SAVEPOINT, etc. Manages the changes made by DML statements and controls the details of transactions. (See the sqlite3 sketch after this list.)
  • The life of a SQL query in a database: parsing, optimization, execution, and fetching.
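  • A minimal sketch of these statement categories using Python's sqlite3 module (my choice of client, not from the book; SQLite has no DCL commands like GRANT):
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the structure of a database object
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# DML: insert and query data
cur.execute("INSERT INTO users (name) VALUES (?)", ("saif",))
print(cur.execute("SELECT id, name FROM users").fetchall())  # [(1, 'saif')]

# TCL: commit the transaction (conn.rollback() would undo uncommitted changes)
conn.commit()

# DDL again: remove the table
cur.execute("DROP TABLE users")
conn.close()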

SQL Query Life Cycle

EdgeDB

  • The question of ORM performance is more complex than simply "they generate slow queries".
  • The benchmark measures the throughput (iterations/second) and latency (milliseconds) of a set of realistic CRUD queries.

Python

Frappe/ERPNext

  • Price List Management:
  • a price list refers to a feature that allows you to define and manage different pricing structures for your products or services.
  • It is a list of prices associated with specific items or services offered by your business.
  • The price list functionality is useful for businesses that offer variable pricing based on factors such as customer groups, sales regions, currencies, or discounts.
  • By creating multiple price lists, you can tailor your pricing strategy to different market segments or specific customer requirements.
  • Customer-specific pricing: You can assign different price lists to different customers or customer groups, allowing you to offer personalized pricing based on customer relationships or agreements.
  • Volume-based pricing: Price lists enable you to set different price tiers based on order quantities. For example, you can define lower prices for customers who order in bulk.
  • Promotional pricing: You can create temporary price lists to apply special discounts or promotional offers for specific products or services during a specific period.
  • Currency conversion: If your business operates in multiple countries or deals with international customers, you can create separate price lists for each currency to account for exchange rates and localized pricing.
  • Wholesale vs. retail pricing: Price lists allow you to define separate pricing structures for wholesale customers and retail customers, accommodating different profit margins and business models.
  • By utilizing price lists in frappe/ERPNext, you can efficiently manage and maintain various pricing strategies, ensuring accurate and consistent pricing across your organization.
  • Price Not UOM Dependent: If you want to set a price for an item that is not dependent on the UOM, you can set the UOM Conversion Factor to 1.
  • Standard Buying and Selling Price Lists are created by default.
  • you can select a Price List or tag it to a Customer
  • Item Price:
  • the record in which you can log the selling and buying rate of an item.
  • It is used to set the base price of an item and can be associated with various price lists, allowing for different pricing structures and scenarios.
  • Base Price: Each item in your inventory can have a default or base price associated with it. This price serves as a starting point for calculating other prices, such as the selling price to customers.
  • Select the Price List, whether it is the Selling/Buying price list or any other price list you may have created.
  • To have Item Price fetching in the sales or purchase transaction, you should have 'Price List' selected in the transaction under Currency and Price List.
  • UOM: the unit of measure for which the price is set. If an item is sold in different UOMs, you can set a price for each UOM; the Item Price applied depends on the UOM selected in the transaction.
  • Minimum quantity: The minimum quantity of the item that must be purchased to qualify for the price.
  • Inventory Valuation:
  • Inventory valuation is the cost of unsold goods in a company’s inventory. It includes all costs incurred to get the item ready for sale, including the cost of materials used, direct labor, and overhead costs.

Python Profiling

  • bottlenecks both for CPU and memory consumption
  • It's good to be lazy when it comes to performance optimization.
  • cProfile: a built-in module that can measure the execution time of each function in our code. (See the sketch after this list for using it programmatically with pstats.)
  • python -m cProfile -s cumtime myscript.py: Functions with higher cumulative times are likely to have a larger impact on the overall performance of your code.
  • cProfile only gives information about function calls, not about individual lines of code - if you call some particular function such as append in different places, it will all be aggregated into a single line in the cProfile output.
  • py-spy: a sampling profiler for Python programs. It lets you visualize what your Python program is spending time on without restarting the program or modifying the code in any way. Py-spy is extremely low overhead: it is written in Rust for speed and doesn't run in the same process as the profiled Python program, nor does it interrupt the running program in any way.
  • ps -A -o pid,cmd | grep python: to get the pid of the running python process. -A lists all processes on the system, not just those associated with the current terminal session. -o specifies the output format for the ps command; here it specifies that the output should include the process ID (pid) and the command (cmd) associated with each process.
  • python some-code.py &: to run the code in the background. it will return the pid of the process.
  • Note that running a script in the background means that the script's output will not be displayed in the terminal by default. If you want to view the output, you can redirect it to a file using python some-code.py > output.txt &, for example, or you can bring the background process to the foreground using the fg command.
  • fg: brings the most recent background process to the foreground.
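  • A minimal sketch of using cProfile programmatically together with pstats (the slow_function workload is just a placeholder):
import cProfile
import pstats

def slow_function():
    # placeholder workload
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
slow_function()
profiler.disable()

# sort by cumulative time, like -s cumtime on the command line
stats = pstats.Stats(profiler)
stats.sort_stats("cumtime").print_stats(10)  # top 10 entries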

MariaDB

  • Reading binlog for Change Data Capture (CDC)?
  • The binlog is a binary log that records all statements that modify data in the database.
  • Debezium reads the binlog and sends change events to Kafka; you can then consume the data from Kafka and do whatever you want with it (see the consumer sketch after this list).
  • Kafka & its Use Cases:
  • Apache Kafka is a highly scalable and distributed streaming platform that is widely used for building real-time data pipelines and streaming applications.
  • Stream processing refers to the continuous processing of data streams in real-time.
  • It involves performing computations, transformations, and analyses on data as it flows through a system.
  • Instead of processing data in batches or at rest, stream processing enables handling and analyzing data in motion.
  • Stream processing is commonly used in various scenarios where data needs to be processed and analyzed as it is generated.
  • It allows organizations to derive insights, make decisions, and take actions in real-time based on the continuous stream of incoming data.
  • Real-time data processing: data is processed as soon as it is generated, rather than in scheduled batches.
  • Scalability: Stream processing systems can handle high volumes of data and scale horizontally to accommodate increased data rates.
  • Event-driven architecture: components react to events as they are published, decoupling producers from consumers.
  • Continuous data integration: Stream processing can integrate and process data from multiple sources, allowing for unified analysis and correlation of data streams.
  • Complex event processing: Stream processing systems can detect and analyze patterns in data streams, enabling the identification of complex events and the triggering of actions based on those events.
  • Low-latency processing: Stream processing systems are designed for low-latency processing, minimizing the delay between data ingestion and analysis.
  • Data Integration: Kafka can act as a central data bus for integrating data from multiple sources and systems. It allows for efficient and reliable data ingestion, decoupling data producers from consumers.
  • Messaging and Event Streaming: Kafka's publish-subscribe model enables real-time messaging and event streaming. It can be used for building event-driven architectures, inter-service communication, and real-time data processing.
  • Log Aggregation: Kafka is often used for collecting and aggregating logs from various systems and applications. It provides a fault-tolerant and scalable solution for centralized log management and analysis.
  • Data Pipeline and ETL: Kafka can serve as a backbone for building data pipelines and performing ETL processes. It enables the continuous ingestion, transformation, and delivery of data between different systems and components.
  • Stream Processing: Kafka integrates well with stream processing frameworks like Apache Flink, Apache Samza, and Apache Spark Streaming. It allows for real-time processing and analysis of streaming data, enabling applications such as fraud detection, anomaly detection, and real-time analytics.
  • Commit Log for Data Durability: Kafka's fault-tolerant design makes it an ideal choice for storing critical data and serving as a durable commit log. It provides strong durability guarantees and ensures data availability even in the presence of failures.
  • Internet of Things (IoT): Kafka's high throughput and low latency capabilities make it suitable for handling large volumes of data generated by IoT devices. It can efficiently handle real-time data streams from sensors, devices, and IoT gateways.
  • Metrics and Monitoring: Kafka can be used to collect, process, and store metrics and monitoring data from various systems. It enables real-time monitoring and analytics of key performance indicators (KPIs) and system metrics.
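  • A minimal consumer sketch for the Debezium/CDC idea above, assuming the kafka-python package and a Debezium topic named mariadb.mydb.orders (server.database.table naming; adjust to your connector config):
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "mariadb.mydb.orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    if event is None:  # tombstone record
        continue
    payload = event.get("payload", {})
    # Debezium change events carry the row state before and after the change
    print(payload.get("op"), payload.get("before"), payload.get("after"))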

The Little Book of Redis

  • Redis isn’t a one-size-fits-all solution.
  • the ecosystem around data is never going to be the same.
  • it solves a specific set of problems while at the same time being quite generic.
  • We’ll focus on learning Redis’ five data structures and look at various data modeling approaches.
  • docker run --name my-redis-container -d -p 6379:6379 redis:latest: to run redis in docker.
  • docker exec -it my-redis-container redis-cli: to connect to redis.
  • Redis’ API is best described as an explicit set of functions. It has a very simple and procedural feel to it.
  • it’s much more than a simple key-value store.
  • Redis exposes five different data structures,
  • In Redis, databases are simply identified by a number with the default database being number 0. select 1: to select database 1.
  • Every one of Redis’ five data structures has at least a key and a value.
  • For the most part, Redis treats values as a byte array and doesn’t care what they are.
  • The colon doesn’t have any special meaning, as far as Redis is concerned, but using a separator is a common approach people use to organize their keys.
  • set users:saif '{"name": "saif", "age": 30}': to set a key with a json value. users:saif is the key and {"name": "saif", "age": 30} is the value.
  • As far as Redis is concerned, keys are everything and values are nothing. Or, put another way, Redis doesn’t allow you to query an object’s values. This helps us understand why values can be anything.
  • By default, Redis snapshots the database to disk based on how many keys have changed and how much time has passed: the classic default configuration saves after 900 seconds (15 minutes) if at least 1 key changed, after 300 seconds if at least 10 changed, and after 60 seconds if at least 10,000 changed. You configure it so that if X number of keys change, then save the database every Y seconds.
  • append-only file (AOF) mode?
  • offloading persistence to a slave?
  • Redis doesn’t do auto-compression but, since it treats values as bytes, there’s no reason you can’t trade processing time for RAM by compressing/decompressing the data yourself.
  • Redis is single-threaded, which means it can only do one thing at a time. This is a feature, not a bug: it is how every command is guaranteed to be atomic.
  • query limitations, data structures and Redis’ way to store data in memory, and how to model data in Redis.
  • The real reason Redis shines versus other solutions is its specialized data structures, not just its in-memory nature.
  • However, given that we are dealing with simpler data structures, we’ll sometimes need to hit the Redis server multiple times to achieve our goal. Such data access patterns can feel unnatural at first, but in reality it tends to be an insignificant cost compared to the raw performance we gain.
  • in redis, every command is specific to a data structure.
  • flushdb: to delete all keys in the current database.
  • flushall: to delete all keys in all databases.
  • keys *: to get all keys in the current database.
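  • The same basics from Python, a minimal sketch assuming the redis-py package and a local server on the default port:
import redis

r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

# keys are everything; values are mostly opaque strings/bytes as far as Redis is concerned
r.set("users:saif", '{"name": "saif", "age": 30}')
print(r.get("users:saif"))
print(r.exists("users:saif"))  # 1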
Strings
  • set <key> <value>: to set a key with a value.
  • get <key>: to get a key's value.
  • del <key>: to delete a key.
  • exists <key>: to check if a key exists.
  • strlen <key>: to get the length of a key's value.
  • getrange <key> <start> <end>: to get a range of characters from a key's value.
  • append <key> <value>: to append a value to a key's existing value (the key is created if it doesn't exist).
  • The lesson here is that some of the commands, especially with the string data structure, only make sense given a specific type of data.
  • incr <key>: to increment a key's value by 1.
  • incrby <key> <value>: to increment a key's value by a specific value.
  • decr <key>: to decrement a key's value by 1.
  • decrby <key> <value>: to decrement a key's value by a specific value.
  • Redis strings are more powerful than they initially seem.
Hashes
  • Why isn’t calling Redis a key-value store quite accurate? Because Redis’ data structures are more than just key-value pairs.
  • Hashes are like strings but with an extra level of indirection.
  • hset <key> <field> <value>: to set a field in a hash.
  • hget <key> <field>: to get a field's value in a hash.
  • We can also set multiple fields at once, get multiple fields at once, get all fields and values, list all the fields or delete a specific field:
  • hmset <key> <field1> <value1> <field2> <value2> ...: to set multiple fields in a hash.
  • hmget <key> <field1> <field2> ...: to get multiple fields' values in a hash.
  • hgetall <key>: to get all fields and values in a hash.
  • hkeys <key>: to get all fields in a hash.
  • hdel <key> <field>: to delete a field in a hash.
Lists
  • Lists let you store and manipulate an array of values for a given key.
  • lpush <key> <value>: to push a value to the left of a list.
  • ltrim <key> <start> <end>: to trim a list to a specific range. O(n)
  • lrange <key> <start> <end>: to get a range of values from a list.
  • lrange <key> 0 -1: to get all values from a list.
  • If you were building a game, you might use one to track queued user actions (see the sketch below).
  • efficient index-based operations.
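  • A minimal sketch of that queued-user-actions idea with redis-py (key and action names are made up):
import redis

r = redis.Redis(decode_responses=True)

# producer: push actions onto the left of the list
r.lpush("game:1:actions", "move:north", "attack:goblin")

# keep only the 100 most recent actions
r.ltrim("game:1:actions", 0, 99)

# worker: pop the oldest action from the right (brpop would block until one exists)
action = r.rpop("game:1:actions")
print(action)  # move:north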
Sets
  • Sets are similar to lists, but they are unordered and don’t allow duplicates.
  • Sets provide a number of set-based operations, like unions and intersections.
  • efficient value-based operations.
  • sadd <key> <value>: to add a value to a set.
  • sismember <key> <value>: to check if a value exists in a set.
  • sinter <key1> <key2> ...: to get the intersection of multiple sets.
  • sinterstore <destination> <key1> <key2> ...: to store the intersection of multiple sets in a destination set.
Sorted Sets
  • If hashes are like strings but with fields, then sorted sets are like sets but with a score. The score provides sorting and ranking capabilities.
  • zadd <key> <score> <value>: to add a value to a sorted set with a score.
  • zcount <key> <min> <max>: to count the number of values in a sorted set with scores between min and max.
  • zrevrank <key> <value>: to get the rank of a value in a sorted set in reverse order. Redis’ default sort is from low to high.
  • zrank <key> <value>: to get the rank of a value in a sorted set in order.
  • The most obvious use-case for sorted sets is a leaderboard system.
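  • A minimal leaderboard sketch with redis-py (key and player names are made up):
import redis

r = redis.Redis(decode_responses=True)

# zadd takes a mapping of member -> score
r.zadd("leaderboard", {"saif": 300, "leto": 150, "paul": 500})

# rank 0 is the highest score when using the reverse commands
print(r.zrevrank("leaderboard", "paul"))                   # 0
print(r.zrevrange("leaderboard", 0, 2, withscores=True))   # top 3
print(r.zcount("leaderboard", 100, 400))                   # scores between 100 and 400 -> 2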
More Advanced, Yet Common, Topics and Design Patterns
  • It would be nice if Redis let you link one key to another, but it doesn’t (and it probably never will). Look at hashes.
  • Pseudo Multi Key Queries:
set users:9001 '{"id": 9001, "email": "s@s.s"}'
hset users:lookup:email s@s.s 9001
  • To get a user by email, you’d first get the user’s ID from the lookup hash and then get the user by ID.
user_id = redis.hget('users:lookup:email', 's@s.s')
user = redis.get(f'users:{user_id}')
  • References and Indexes?

  • Round Trips and Pipelining:

  • making frequent trips to the server is a common pattern in Redis.

  • Redis also supports pipelining. Normally when a client sends a request to Redis it waits for the reply before sending the next request. With pipelining you can send a number of requests without waiting for their responses. This reduces the networking overhead and can result in significant performance gains.
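
  • A minimal pipelining sketch with redis-py: the commands are buffered client-side and sent in a single round trip when execute() is called (redis-py wraps pipelines in multi/exec by default; transaction=False gives plain pipelining):
import redis

r = redis.Redis(decode_responses=True)

pipe = r.pipeline(transaction=False)  # plain pipelining, no MULTI/EXEC
pipe.set("counter", 0)
pipe.incr("counter")
pipe.incr("counter")
results = pipe.execute()  # one round trip for all three commands
print(results)  # [True, 1, 2]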

  • Transactions:

  • Every Redis command is atomic, including the ones that do multiple things. Additionally, Redis has support for transactions when using multiple commands.

  • You use a transaction by first issuing the multi command, followed by all the commands you want to execute as part of the transaction, and finally exec to actually execute them or discard to throw them away without executing.

multi
hincrby groups:1percent balance -9000000000
hincrby groups:99percent balance 9000000000
exec
  • watch?
  • Keys Anti-Pattern:
  • keys command seems like it’s well suited for a number of tasks, but it should never be used in production code. Why? Because it does a linear scan through all the keys looking for matches. Or, put simply, it’s slow.
Expiration
  • expire <key> <seconds>: to set a key's expiration time.
  • expireat <key> <timestamp>: to expire a key at an absolute time given as a Unix timestamp (seconds since January 1, 1970), whereas expire takes a relative time to live in seconds.
  • ttl <key>: to get a key's time to live.
  • persist <key>: to remove a key's expiration time.
  • setex <key> <seconds> <value>: to set a key with a value and an expiration time.
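  • A minimal caching sketch combining these with redis-py (the key name is made up):
import redis

r = redis.Redis(decode_responses=True)

# cache a value for 60 seconds (equivalent to set + expire)
r.setex("cache:weather:cairo", 60, "38C")

print(r.ttl("cache:weather:cairo"))   # seconds remaining, e.g. 60
r.persist("cache:weather:cairo")      # drop the expiration
print(r.ttl("cache:weather:cairo"))   # -1 means no expiration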
Pub/Sub
  • blpop / brpop: returns and removes the first (or last) element from a list, or blocks until one is available.
  • first cli: subscribe <channel>: to subscribe to a channel.
  • second cli: publish <channel> <message>: to publish a message to a channel. returns the number of clients that received the message.
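  • The same experiment from redis-py, a minimal sketch (the channel name is made up; in practice the subscriber and publisher run in separate processes):
import redis

r = redis.Redis(decode_responses=True)

# subscriber side
pubsub = r.pubsub()
pubsub.subscribe("alerts")
print(pubsub.get_message(timeout=1))  # subscribe confirmation

# publisher side (normally another terminal/process)
print(r.publish("alerts", "disk space low"))  # number of subscribers that received it

# the published message arrives on the subscriber's connection
print(pubsub.get_message(timeout=1))  # {'type': 'message', 'channel': 'alerts', 'data': 'disk space low'}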
Monitoring and Slow logs
  • monitor: to monitor all the commands being sent to the server.
  • It’s a great debugging tool that gives you insight into how your application is interacting with Redis.
  • slowlog: to get the slow log of the server, which acts as a great profiling tool.
  • It logs any command which takes longer than a specified number of microseconds.
  • slowlog get: to get the slow log of the server.
  • slowlog get <n>: to get the last n entries of the slow log of the server.
  • The slow log is maintained in memory, so running it in production, even with a low threshold, shouldn’t be a problem. By default it will track the last 1024 logs.
Sort
  • rpush <key> <value> <value> ...: to push a value to the right of a list.
  • rpush users:saif:guesses 5 8 7 9 6 3 2 4 8 10: to push multiple values to the right of a list.
  • sort <key>: to sort a list. sorted from lowest to highest.
  • sort users:saif:guesses: to sort that list.
  • sort <key> limit <offset> <count>: to sort a list with a limit.
  • sort <key> by <pattern>: to sort a list by a referenced value, e.g. sort users:saif:guests by users:*->score sorts the list of user IDs by each user’s score field.
  • The real power of sort is its ability to sort based on a referenced object. This is done by using the by keyword followed by a pattern. The pattern is used to get a value from a key. The value is then used to sort the list.
  • The good news is that the output of a sort can be stored: sort users:saif:guests by users:*->score store users:saif:guests:sorted.
  • Combining the store capabilities of sort with the expiration commands we’ve already seen makes for a nice combo.
Scan
  • Key Iteration: When you need to iterate through the entire keyspace or a subset of keys based on a pattern, the SCAN command allows you to retrieve keys in a cursor-based manner without blocking the Redis server.
  • Using SCAN instead of commands like KEYS is recommended for large keyspaces because it avoids blocking the Redis server for an extended period and minimizes the impact on performance.
  • Overall, the SCAN command provides a flexible and efficient way to iterate through keys in Redis, offering control, performance, and reliability when dealing with large or dynamically changing keyspaces.
  • In addition to scan, hscan, sscan and zscan commands were also added. They work the same way as scan but for hashes, sets, and sorted sets respectively.
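  • A minimal sketch of cursor-based iteration with redis-py, whose scan_iter/hscan_iter helpers hide the cursor bookkeeping (the match pattern is made up):
import redis

r = redis.Redis(decode_responses=True)

# iterate the keyspace in small batches instead of a blocking KEYS call
for key in r.scan_iter(match="users:*", count=100):
    print(key)

# hscan_iter / sscan_iter / zscan_iter do the same for hashes, sets, and sorted sets
for field, value in r.hscan_iter("users:lookup:email"):
    print(field, value)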

RQ

  • A job is a Python object, representing a function that is invoked asynchronously in a worker (background) process.

  • rq worker --with-scheduler: to run a worker with a scheduler.

  • A queue is a Redis data structure that stores jobs.

  • A worker is a Python process that runs in the background and is controlled by RQ.

  • In addition, you can add a few options to modify the behaviour of the queued job when calling q.enqueue(...). By default, these are popped out of the kwargs that will be passed to the job function:

  • job_timeout: specifies the maximum runtime of the job before it’s interrupted and marked as failed.

  • result_ttl: specifies how long the job’s return value will be kept in Redis. After this time, the return value will be deleted.

  • ttl: specifies the maximum queued time (in seconds) of the job before it’s discarded. This argument defaults to None (infinite TTL).

  • failure_ttl: specifies how long failed jobs are kept

  • depends_on: specifies another job (or list of jobs) that must complete before this job will be queued.

  • job_id: specifies the job ID. If not specified, a random ID will be generated.

  • at_front: specifies whether the job should be enqueued at the front of the queue or not. Defaults to False.

  • description: specifies a human-readable description of the job.

  • on_success: specifies a function to be called when the job succeeds.

  • on_failure: specifies a function to be called when the job fails.

  • args and kwargs: specifies the arguments and keyword arguments that will be passed to the job function when the job is executed.

  • For cases where the web process doesn’t have access to the source code running in the worker (i.e. code base X invokes a delayed function from code base Y), you can pass the function as a string reference, too.

  • look at this pattern:

from redis import Redis
from rq import Queue

q = Queue(connection=Redis())
report_job = q.enqueue(generate_report, 'some', 'args', foo='bar')
q.enqueue(send_report, depends_on=report_job)
  • multiple deps: depends_on=[foo_job, bar_job]
  • The ability to handle job dependencies allows you to split a big job into several smaller ones. By default, a job that is dependent on another is enqueued only when its dependency finishes successfully.
  • Dependency object: lets you run a job even if its dependency fails, whereas by default a dependent job is only enqueued when its dependency finishes successfully.
  • Callbacks: a superset of the on_success and on_failure options that also lets you specify a timeout for the callback itself (see the sketch after this list).
  • Success callbacks must be a function that accepts job, connection and result arguments.
  • Success callbacks are executed after job execution is complete, before dependents are enqueued.
  • Failure callbacks are functions that accept job, connection, type, value and traceback arguments.
  • q.enqueue returns a proxy object that can be used to check the outcome of the actual job.
  • For testing purposes, you can enqueue jobs without delegating the actual execution to a worker (available since version 0.3.1). is_async=False. runs without an active worker and executes the job synchronously within the same process.
  • Make sure that the function’s __module__ is importable by the worker.
  • Make sure that the worker and the work generator share exactly the same source code.
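  • A minimal callbacks sketch with the signatures above, assuming a recent RQ version that exports the Callback helper (generate_report stands in for the placeholder job used in the earlier pattern):
from redis import Redis
from rq import Callback, Queue

def generate_report(*args, **kwargs):
    # placeholder job; in real use this must live in a module the worker can import
    return "report"

def report_success(job, connection, result, *args, **kwargs):
    print(f"{job.id} finished with {result!r}")

def report_failure(job, connection, type, value, traceback):
    print(f"{job.id} failed: {value!r}")

q = Queue(connection=Redis())

# plain function callbacks
q.enqueue(generate_report, on_success=report_success, on_failure=report_failure)

# Callback objects additionally let you set a timeout for the callback itself
q.enqueue(generate_report, on_success=Callback(report_success, timeout=10))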

CH8: Queries, Modeling, and Transformation (continued)

  • the term "data grain" refers to the level of detail or granularity at which data is captured, stored, and processed.
  • It represents the specific unit or level of data that is considered as a single observation or transaction in a dataset. For example, in a sales dataset, the data grain could be at the level of individual transactions.
  • Partial Dependency: Partial dependency occurs when a non-key attribute (an attribute that is not part of the primary key) depends on only a part of the primary key rather than the entire key. In other words, a non-key attribute is functionally dependent on only a subset of the primary key. This can lead to data redundancy and anomalies.
  • Transitive Dependency: Transitive dependency occurs when an attribute depends on another non-key attribute rather than directly on the primary key. In other words, the dependency is through another attribute. Transitive dependencies can lead to data redundancy and update anomalies.
  • Normalization in database design is essential for several reasons:
  • 1. Data Integrity: Normalization helps maintain data integrity by minimizing data redundancy and inconsistencies. When data is duplicated across multiple tables or attributes, there is a risk of inconsistencies and anomalies. Normalization eliminates these issues by ensuring that each piece of data is stored in only one place, reducing the likelihood of conflicting or contradictory information.
  • 2. Minimized Redundancy: Redundant data storage wastes storage space and increases the risk of data inconsistency. Normalization reduces redundancy by structuring the database in a way that promotes efficient storage and eliminates the need for storing the same data multiple times. This improves storage efficiency and reduces the chances of inconsistencies arising from redundant data.
  • 3. Efficient Updates and Modifications: Normalized databases make it easier to update, insert, and delete data without encountering anomalies. By organizing data into smaller, well-structured tables with minimal dependencies, changes can be made to specific portions of the database without impacting other parts. This allows for more efficient and controlled data modifications, reducing the risk of data corruption and maintaining data integrity.
  • 4. Scalability and Flexibility: Normalization facilitates scalability and flexibility in database design. As the database grows and evolves, a normalized schema allows for easier modifications and expansions. The well-defined relationships and dependencies between tables make it easier to add new tables or modify existing ones without introducing inconsistencies or affecting other parts of the database.
  • 5. Query Performance: Normalized databases often result in better query performance. With normalized tables, queries can be more targeted and specific, accessing only the necessary data. This can lead to faster query execution and improved overall performance.
  • 6. Consistent Data Model: Normalization ensures a consistent and structured data model. By following normalization principles, database designers establish a clear and standardized structure for the data. This makes the database easier to understand, maintain, and collaborate on among multiple developers or teams.
  • "anomalies" refer to irregularities or inconsistencies that can occur when manipulating or querying data in a database. Anomalies occur when the database schema is not properly normalized or structured, leading to issues with data integrity, reliability, and consistency.
  • Insertion Anomalies: Insertion anomalies occur when it is not possible to insert data into a database without including additional, unnecessary data. This can happen when a table is not properly structured, and certain attributes depend on others that are not part of the primary key. For example, if a table for customer orders contains attributes like OrderID, CustomerID, CustomerName, and CustomerAddress, an insertion anomaly could occur if it is required to enter the customer's name and address when inserting a new order, even if that information is already stored in the database.
  • Update Anomalies: Update anomalies occur when updating data in a database results in inconsistencies or contradictions. This can happen when the same data is stored in multiple places, and updating one instance of the data does not propagate the change to all instances. For example, if customer information is duplicated in multiple tables, updating the customer's address in one table may not automatically update it in the other tables, leading to inconsistencies.
  • Deletion Anomalies: Deletion anomalies occur when removing data from a database unintentionally removes other necessary data. This can happen when data dependencies are not properly defined, and the deletion of one record leads to the loss of related information that should have been preserved. For example, if a customer record is deleted from a table without proper handling of related order records, it could result in the loss of valuable order data. (A small sketch of these anomalies and the normalized fix appears at the end of this section.)
  • A data mart is a specialized subset of a data warehouse that focuses on a specific area or subject of an organization's data. It is a smaller, more focused database that is designed to support the reporting, analysis, and decision-making needs of a particular department, business unit, or user group within an organization.
  • Data marts are typically created by extracting, transforming, and loading (ETL) data from a larger enterprise data warehouse or other data sources. The data in a data mart is pre-aggregated, summarized, and organized in a way that is optimized for the specific needs of the targeted users. This includes selecting relevant data, applying appropriate transformations, and structuring the data in a dimensional model such as a star schema or snowflake schema.
  • The purpose of a data mart is to provide a simplified and focused view of data for users who require quick and easy access to specific information. By creating data marts, organizations can provide tailored data sets that align with the requirements of specific departments or business functions. This enables business users to perform ad hoc queries, generate reports, and gain insights into their specific area of responsibility without needing to navigate complex and extensive data structures.
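  • A small sqlite3 sketch of the anomaly discussion above (table and column names are made up): the flat table repeats customer attributes on every order row, so changing an address means touching many rows, while the normalized split stores each fact once:
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized: customer name/address depend only on customer_id, not on the
# whole order grain -> redundancy plus insertion/update/deletion anomalies.
cur.execute("""CREATE TABLE orders_flat (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    customer_name TEXT,
    customer_address TEXT,
    amount REAL)""")

# Normalized: customer attributes live in exactly one place.
cur.execute("""CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    address TEXT)""")
cur.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount REAL)""")

cur.execute("INSERT INTO customers VALUES (1, 'saif', 'old address')")
cur.executemany("INSERT INTO orders VALUES (?, 1, ?)", [(1, 10.0), (2, 20.0)])

# One UPDATE fixes the address everywhere; in orders_flat every order row
# holding the stale address would need updating (an update anomaly).
cur.execute("UPDATE customers SET address = 'new address' WHERE customer_id = 1")
conn.commit()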