CH1: The Python Data Model
- The Python data model is a way of understanding how data structures and objects work in Python. It is essentially a collection of protocols and special methods that allow objects to interoperate with the core language (operators, iteration, string representation, and so on) in a consistent and predictable way.
- At its core, the Python data model defines a set of methods and conventions that allow objects to be treated as if they were built-in types. For example, the + operator can be used to add two numbers, concatenate two strings, or concatenate two lists. This is possible because the + operator is implemented in terms of the __add__ special method, which is defined for the int, str, and list types.
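- A minimal sketch of how this looks for a user-defined type (the Vector2D class here is made up for illustration; a fuller version would also implement __radd__ and type checks):
class Vector2D:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        # called by the + operator: v1 + v2 -> v1.__add__(v2)
        return Vector2D(self.x + other.x, self.y + other.y)

    def __repr__(self):
        return f'Vector2D({self.x!r}, {self.y!r})'

v1 = Vector2D(1, 2)
v2 = Vector2D(3, 4)
print(v1 + v2)  # Vector2D(4, 6)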
- A namedtuple is a subclass of tuple in Python that allows access to its elements by name as well as by index. It is a convenient way to define simple classes that are mostly used to store data.
- A namedtuple can be particularly useful when working with large datasets, where each record has a fixed set of fields. Instead of creating a class or a dictionary for each record, we can use a namedtuple to create a lightweight and efficient container for the data.
- Overall, namedtuple is a convenient and efficient way to create simple classes for storing data, especially when the data is read-only and has a fixed set of fields. It is also a good way to avoid the overhead of defining a full class when you only need a simple container for your data.
from collections import namedtuple
Person = namedtuple('Person', ['name', 'age'])
p = Person(name='John', age=30)
print(p.name, p.age)
- Although FrenchDeck implicitly inherits from the object class, most of its functionality is not inherited, but comes from leveraging the data model and composition. By implementing the special methods __len__ and __getitem__, our FrenchDeck behaves like a standard Python sequence, allowing it to benefit from core language features (e.g., iteration and slicing) and from the standard library, as shown by the examples using random.choice, reversed, and sorted. Thanks to composition, the __len__ and __getitem__ implementations can delegate all the work to a list object, self._cards.
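- A minimal sketch of a FrenchDeck along those lines (Card, ranks, and suits as in the classic example):
import collections

Card = collections.namedtuple('Card', ['rank', 'suit'])

class FrenchDeck:
    ranks = [str(n) for n in range(2, 11)] + list('JQKA')
    suits = 'spades diamonds clubs hearts'.split()

    def __init__(self):
        # composition: the deck delegates storage to a plain list
        self._cards = [Card(rank, suit) for suit in self.suits
                                        for rank in self.ranks]

    def __len__(self):
        return len(self._cards)

    def __getitem__(self, position):
        # supports indexing, slicing, iteration, random.choice, reversed, sorted, ...
        return self._cards[position]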
- The first thing to know about special methods is that they are meant to be called by the Python interpreter, and not by you.
- But the interpreter takes a shortcut when dealing with built-in types. So, if my_object is an instance of one of those built-ins, then len(my_object) retrieves the value of the ob_size field, and this is much faster than calling a method.
- In CPython, the internal representation of variable-sized objects such as lists, tuples, and strings is based on a C struct called PyVarObject. Its header includes ob_refcnt, a reference count that tracks the number of references to the object, and ob_size, which holds the number of items in the variable part of the object (not its size in bytes). Reading ob_size directly is why len() is so fast for these built-ins.
- The Python interpreter is the only frequent caller of most special methods.
- Python format string conversion flags: !r means "use the repr() function to get the string representation of the object", !s means "use the str() function", and !a means "use the ascii() function".
- str() returns a string that is meant to be human-readable and informative, while repr() returns a string that can (ideally) be used to recreate the original object with its exact value.
import datetime
now = datetime.datetime.now()
print(str(now)) # output: 2022-10-24 15:30:00.000000
# a Python expression that can be used to create a new datetime object with the same value.
print(repr(now)) # output: datetime.datetime(2022, 10, 24, 15, 30)
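- A small sketch of the !s / !r / !a conversion flags in an f-string:
text = 'café'
print(f'{text!s}')  # café      (str)
print(f'{text!r}')  # 'café'    (repr)
print(f'{text!a}')  # 'caf\xe9' (ascii)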
- To determine whether a value x is truthy or falsy, Python applies bool(x), which returns either True or False.
- If __bool__ is not implemented, Python tries to invoke x.__len__(), and if that returns zero, bool returns False. Otherwise bool returns True.
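- A minimal sketch of the __len__ fallback (Basket is a made-up class with no __bool__):
class Basket:
    def __init__(self, items=None):
        self._items = list(items or [])

    def __len__(self):
        return len(self._items)

print(bool(Basket()))           # False: __bool__ missing, __len__() == 0
print(bool(Basket(['apple'])))  # True: __len__() == 1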
- Since Python 3.7, the dict type is officially “ordered,” but that only means that the key insertion order is preserved. You cannot rearrange the keys in a dict however you like.
CH2: An Array of Sequences
- Sequence types can be classified by mutability: mutable sequences (e.g., list, bytearray, array.array) vs. immutable sequences (e.g., tuple, str, bytes).
- They can also be classified by what they hold: container sequences (hold references to objects of any type, e.g., list, tuple) vs. flat sequences (store the values of their items in their own memory space, e.g., str, bytes, array.array).
- List comprehensions build lists from sequences or any other iterable type by filtering and transforming items. The filter and map built-ins can be composed to do the same, but readability suffers.
>>> symbols = '$¢£¥€¤'
>>> beyond_ascii = [ord(s) for s in symbols if ord(s) > 127]
>>> beyond_ascii
[162, 163, 165, 8364, 164]
>>> beyond_ascii = list(filter(lambda c: c > 127, map(ord, symbols)))
>>> beyond_ascii
[162, 163, 165, 8364, 164]
- Listcomps are a one-trick pony: they build lists.
- Generator expressions build iterators. They are a generalization of listcomps and have the same syntax, but are enclosed in parentheses instead of square brackets. Generator expressions do not allocate the list they produce, so they are more memory efficient than listcomps when the list is large or the generator expression is part of a long pipeline of processing.
- A generator expression saves memory because it yields items one by one using the iterator protocol instead of building a whole list just to feed another constructor.
- The six-item list of T-shirts is never built in memory: the generator expression feeds the for loop, producing one item at a time. If the two lists used in the Cartesian product had a thousand items each, using a generator expression would save the cost of building a list with a million items just to feed the for loop.
>>> colors = ['black', 'white']
>>> sizes = ['S', 'M', 'L']
>>> for tshirt in (f'{c} {s}' for c in colors for s in sizes):
...     print(tshirt)
...
black S
black M
black L
white S
white M
white L
- Generator expressions are commonly used to initialize sequences other than lists, or to produce output that you don't need to keep in memory.
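- For example (a small sketch), a generator expression can feed the array constructor directly, with no intermediate list:
from array import array

squares = array('d', (x * x for x in range(10)))  # genexp consumed item by item
print(squares)  # array('d', [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0, 81.0])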
- Tuples are not just immutable lists: they also serve as records with no field names.
- The % formatting operator understands tuples and treats each item as a separate field: '%s %s' % (42, 7).
- However, in a match/case statement, _ is a wildcard that matches any value but is not bound to a value.
- Using tuples as records: often, there is no need to go through the trouble of creating a class just to name the fields, especially if you leverage unpacking and avoid using indexes to access the fields.
- When you see a tuple in code, you know its length will never change. A tuple uses less memory than a list of the same length.
- However, be aware that the immutability of a tuple only applies to the references contained in it. References in a tuple cannot be deleted or replaced. But if one of those references points to a mutable object, and that object is changed, then the value of the tuple changes.
- An object is only hashable if its value can never change. An unhashable tuple cannot be used as a dict key or a set element. But a tuple that contains only hashable items is hashable, and can be used as a dict key or set element.
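- A quick sketch of both points: the inner list can still change, and that same list makes the tuple unhashable:
t = (1, 2, [30, 40])
t[-1].append(99)      # the references in t are fixed, but the list they point to is not
print(t)              # (1, 2, [30, 40, 99])

try:
    hash(t)           # a tuple holding a mutable item is unhashable
except TypeError as e:
    print(e)          # unhashable type: 'list'

print(hash((1, 2, (30, 40))))  # only hashable items -> usable as a dict key or set element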
- Are tuples more efficient than lists in Python? Generally yes: a tuple is allocated exactly the space it needs, a list keeps spare capacity for future appends, and tuple(t) returns t itself while list(l) must build a copy.
- tuple supports all list methods that do not involve adding or removing items, with one exception: tuple lacks the __reversed__ method. However, that method exists only as an optimization; reversed(my_tuple) works without it.
- Tuple, list, and iterable unpacking covers: parallel assignment, swapping the values of variables without a temporary variable, prefixing an argument with a star when calling a function, and unpacking arguments to a function.
# parallel assignment
>>> a, b = (1, 2)
>>> a, b
(1, 2)
# swapping the values of variables without using a temporary variable
>>> a, b = b, a
>>> a, b
(2, 1)
# prefixing an argument with a star when calling a function
>>> divmod(20, 8)
(2, 4)
>>> t = (20, 8)
>>> divmod(*t)
(2, 4)
# unpacking arguments to a function
>>> quotient, remainder = divmod(*t)
>>> quotient, remainder
(2, 4)
- Defining function parameters with *args to grab arbitrary excess arguments is a classic Python feature. In the context of parallel assignment, the * prefix can be applied to exactly one variable, but it can appear in any position.
- PEP 448—Additional Unpacking Generalizations introduced more flexible syntax for iterable unpacking, best summarized in “What’s New In Python 3.5”.
- Unpacking assignment can also be used with a special syntax to handle variable-length iterables, such as lists or tuples of unknown length. This is known as extended iterable unpacking, and it uses the * operator to assign the remaining values to a variable.
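- A doctest-style sketch of extended iterable unpacking (the starred target can appear in any position):
>>> a, b, *rest = range(5)
>>> a, b, rest
(0, 1, [2, 3, 4])
>>> *head, b, c = range(5)
>>> head, b, c
([0, 1, 2], 3, 4)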
- Nested unpacking:
def query_returning_single_row():
    # Here we are simulating a database query whose result set contains a single row
    return [['Alice', 30, 'New York']]
# Example usage of query_returning_single_row()
# [record] unpacks the one-row result set into the row itself
[record] = query_returning_single_row()
print(record)  # output: ['Alice', 30, 'New York']
def query_returning_single_row_with_single_field():
    # Here we are simulating a database query that returns a single value as a nested list
    return [['Alice']]
# Example usage of query_returning_single_row_with_single_field()
[[name]] = query_returning_single_row_with_single_field()
print(name)  # output: Alice
- The most visible new feature in Python 3.10 is pattern matching with the match/case statement
- Key terms: match, subject, case, pattern, guard, and action.
- The default case is written case _: where the underscore is a wildcard that matches any value but is not bound to a value.
- One key improvement of match over switch is destructuring—a more advanced form of unpacking. The match statement can destructure the subject object into its constituent parts, and use them in the pattern matching.
database_records = [
    ('Alice', 10, 'New York'),
    ('Bob', 20, 'London'),
    ('Charlie', 30, 'Seattle'),
    ('Dave', 40, 'Portland'),
]

# pattern matching version
for record in database_records:
    match record:
        # The optional guard clause is evaluated only if the pattern matches,
        # and can reference variables bound in the pattern
        case (name, age, city) if age >= 30:
            print(f'{name} is {age} years old and lives in {city}')

# non-pattern matching version
for record in database_records:
    name, age, city = record
    if age >= 30:
        print(f'{name} is {age} years old and lives in {city}')
- Unlike the C switch statement, match/case does not suffer from the fallthrough and dangling-else problems.
- We can make patterns more specific by adding type information: str(name) or float(lat) inside a pattern performs a runtime type check (not a conversion) as part of the match. Separately, an optional guard (an if clause after the pattern) is evaluated only after the pattern matches; if the guard is falsy, the case is not a match and the next case is tried.
- Clever:
# Match any subject sequence starting with a str and ending with a nested sequence of two floats;
# *_ matches any number of items in between without binding them to a variable.
# Using *extra instead of *_ would bind those items to extra as a list with 0 or more items.
case [str(name), *_, (float(lat), float(lon))]:
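- A minimal runnable sketch around that pattern (the metro_areas data below is made up for illustration):
metro_areas = [
    ('Tokyo', 'JP', 36.933, (35.689722, 139.691667)),
    ('Delhi NCR', 'IN', 21.935, (28.613889, 77.208889)),
]

for record in metro_areas:
    match record:
        # name binds only if the first item is a str; lat/lon bind only if the last
        # item is a 2-item sequence of floats; *_ skips everything in between
        case [str(name), *_, (float(lat), float(lon))]:
            print(f'{name}: {lat:.4f}, {lon:.4f}')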
- Pattern matching is an example of declarative programming: the code describes “what” you want to match, instead of “how” to match it. The shape of the code follows the shape of the data. This is a powerful paradigm shift that makes code easier to read and understand.
- In other words, to evaluate a[i, j], Python calls a.__getitem__((i, j)).
- Slices are not just useful to extract information from sequences; they can also be used to change mutable sequences in place—that is, without rebuilding them from scratch.
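- A doctest-style sketch of changing a mutable sequence in place through slices:
>>> l = list(range(10))
>>> l[2:5] = [20, 30]
>>> l
[0, 1, 20, 30, 5, 6, 7, 8, 9]
>>> del l[5:7]
>>> l
[0, 1, 20, 30, 5, 8, 9]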
- Both + and * always create a new object, and never change their operands.
- Beware of expressions like a * n when a is a sequence containing mutable items: my_list = [[]] * 3 results in a list with three references to the same inner list, which is probably not what you want.
- To initialize a list with a certain number of nested lists, use a list comprehension:
my_list = [["_"] * 3 for _ in range(3)]
# output: [['_', '_', '_'], ['_', '_', '_'], ['_', '_', '_']]
# Wrong way
my_list = [["_"] * 3] * 3
# output: [['_', '_', '_'], ['_', '_', '_'], ['_', '_', '_']]
# But the inner lists are the same object, so if you modify one of them, you will see the change in all three lists.
# That is because the outer list is built by repeating a reference to the same inner list three times.
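- A quick check of the aliasing effect (sketch):
wrong = [['_'] * 3] * 3
wrong[0][0] = 'X'
print(wrong)   # [['X', '_', '_'], ['X', '_', '_'], ['X', '_', '_']]

right = [['_'] * 3 for _ in range(3)]
right[0][0] = 'X'
print(right)   # [['X', '_', '_'], ['_', '_', '_'], ['_', '_', '_']]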