Tips and Tricks
###############

Tips, tricks and gotchas
========================

This lecture addresses some gotchas that can arise in daily
programming; moreover, at the beginning we introduce some helpful
objects that make coding easier.

First of all, some imports as usual:

.. code:: ipython3

    from collections import defaultdict, Counter

    import matplotlib.pyplot as plt
    %matplotlib inline
    plt.rcParams['figure.figsize'] = (8, 8)

A grouping pattern, avoiding *quadratic* time
---------------------------------------------

Assume we have two lists that have to be related in some way, namely
using a predicate :math:`P`. In the following example we want to build
a list of all pairs (boy, girl) such that their names start with the
same letter. Here is the input:

.. code:: ipython3

    girls = ['alice', 'allie', 'bernice', 'brenda', 'clarice', 'cilly']
    boys = ['chris', 'christopher', 'arald', 'arnold', 'bob']

The bad way, in quadratic time:

.. code:: ipython3

    [(b, g) for b in boys for g in girls if b[0] == g[0]]

.. parsed-literal::

    [('chris', 'clarice'),
     ('chris', 'cilly'),
     ('christopher', 'clarice'),
     ('christopher', 'cilly'),
     ('arald', 'alice'),
     ('arald', 'allie'),
     ('arnold', 'alice'),
     ('arnold', 'allie'),
     ('bob', 'bernice'),
     ('bob', 'brenda')]

There is a better approach that avoids quadratic time, moving toward
`defaultdict <https://docs.python.org/3/library/collections.html#collections.defaultdict>`__:
group the girls by initial once, then look each boy's initial up in
the grouping.

.. code:: ipython3

    letterGirls = {}
    for girl in girls:
        letterGirls.setdefault(girl[0], []).append(girl)

    [(b, g) for b in boys for g in letterGirls[b[0]]]

.. parsed-literal::

    [('chris', 'clarice'),
     ('chris', 'cilly'),
     ('christopher', 'clarice'),
     ('christopher', 'cilly'),
     ('arald', 'alice'),
     ('arald', 'allie'),
     ('arnold', 'alice'),
     ('arnold', 'allie'),
     ('bob', 'bernice'),
     ('bob', 'brenda')]

However there is an even better solution, as pointed out in the
`examples <https://docs.python.org/3/library/collections.html#defaultdict-examples>`__
subsection of the previous link: use ``defaultdict`` instead of
repeatedly calling the ``setdefault`` method for each new key. From
the official documentation:

.. code:: ipython3

    >>> s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
    >>> d = defaultdict(list)
    >>> for k, v in s:
    ...     d[k].append(v)
    ...
    >>> sorted(d.items())
    [('blue', [2, 4]), ('red', [1]), ('yellow', [1, 3])]

.. parsed-literal::

    [('blue', [2, 4]), ('red', [1]), ('yellow', [1, 3])]
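Applying the same idea to the boys/girls example is a one-line change;
in the following sketch (the name ``letter_girls`` is ours) missing
keys spring into existence as empty lists, producing the same ten
pairs as above:

.. code:: ipython3

    letter_girls = defaultdict(list)
    for girl in girls:
        letter_girls[girl[0]].append(girl)   # no `setdefault` dance needed

    [(b, g) for b in boys for g in letter_girls[b[0]]]

Note that a boy whose initial matches no girl simply contributes no
pairs, because the lookup ``letter_girls[b[0]]`` yields an empty list.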
The *Bunch* pattern
-------------------

A very good book on algorithms implemented in Python is the one by
Magnus Hetland, https://www.apress.com/gp/book/9781484200568, with the
companion GitHub repository https://github.com/apress/python-algorithms-14.

Hetland, page 34, proposes the following pattern to build a container
of properties, avoiding a vanilla ``dict`` (adapted from item 4.18 of
Alex Martelli’s *Python Cookbook*):

.. code:: ipython3

    class Bunch(dict):
        def __init__(self, *args, **kwds):
            super(Bunch, self).__init__(*args, **kwds)
            self.__dict__ = self

.. code:: ipython3

    >>> T = Bunch
    >>> t = T(left=T(left="a", right="b"), right=T(left="c"))
    >>> t.left

.. parsed-literal::

    {'left': 'a', 'right': 'b'}

.. code:: ipython3

    >>> t.left.right

.. parsed-literal::

    'b'

.. code:: ipython3

    >>> t['left']['right']

.. parsed-literal::

    'b'

.. code:: ipython3

    >>> "left" in t.right

.. parsed-literal::

    True

.. code:: ipython3

    "right" in t.right

.. parsed-literal::

    False

However, inheriting from ``dict`` is discouraged by Alex:

    A further tempting but not fully sound alternative is to have the
    Bunch class inherit from ``dict``, and set attribute access special
    methods equal to the item access special methods, as follows:

    ::

        class DictBunch(dict):
            __getattr__ = dict.__getitem__
            __setattr__ = dict.__setitem__
            __delattr__ = dict.__delitem__

    One problem with this approach is that, with this definition, an
    instance x of ``DictBunch`` has many attributes it doesn’t really
    have, because it inherits all the attributes (methods, actually,
    but there’s no significant difference in this context) of ``dict``.
    So, you can’t meaningfully check ``hasattr(x, someattr)``, as you
    could with the classes ``Bunch`` and ``EvenSimplerBunch`` (which
    sets the dictionary directly, without using ``update``) previously
    shown, unless you can somehow rule out the value of someattr being
    any of several common words such as ``keys``, ``pop``, and ``get``.

    Python’s distinction between attributes and items is really a
    wellspring of clarity and simplicity. Unfortunately, many newcomers
    to Python wrongly believe that it would be better to confuse items
    with attributes, generally because of previous experience with
    JavaScript and other such languages, in which attributes and items
    are regularly confused. But educating newcomers is a much better
    idea than promoting item/attribute confusion.

Alex’s original definition reads as follows:

.. code:: ipython3

    class Bunch(object):
        def __init__(self, **kwds):
            self.__dict__.update(kwds)

It is interesting to observe that this idiom has been merged into the
*standard library*, starting from Python **3.3**, under the name of
`SimpleNamespace <https://docs.python.org/3/library/types.html#types.SimpleNamespace>`__:

.. code:: ipython3

    from types import SimpleNamespace

    x, y = 32, 64
    point = SimpleNamespace(datum=y, squared=y*y, coord=x)
    point

.. parsed-literal::

    namespace(datum=64, squared=4096, coord=32)

.. code:: ipython3

    point.datum, point.squared, point.coord

.. parsed-literal::

    (64, 4096, 32)

.. code:: ipython3

    [i for i in point]

::

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input> in <module>
    ----> 1 [i for i in point]

    TypeError: 'types.SimpleNamespace' object is not iterable

If you need ``point`` to be iterable use the structured object
`namedtuple <https://docs.python.org/3/library/collections.html#collections.namedtuple>`__
instead, as sketched below.
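Here is a minimal sketch of the ``namedtuple`` alternative (the name
``Point`` and its fields are our own choice, mirroring the
``SimpleNamespace`` above):

.. code:: ipython3

    from collections import namedtuple

    Point = namedtuple('Point', ['datum', 'squared', 'coord'])
    point = Point(datum=y, squared=y*y, coord=x)
    [i for i in point]   # iteration yields the fields in order: [64, 4096, 32]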
Python’s ``list.append`` isn’t Lisp’s ``cons``
----------------------------------------------

Python ``list`` objects behave like ``stack`` objects, such that it is
*cheap* to ``append`` and ``pop`` at the *top*, which is the *right*
end. On the other hand, Lisp ``pair`` objects allow us to *easily*
``cons`` at the *beginning*, the very *opposite* end.

.. code:: ipython3

    def fast_countdown(count):
        nums = []
        for i in range(count):
            nums.append(i)
        nums.reverse()
        return nums

    def slow_countdown(count):
        nums = []
        for i in range(count):
            nums.insert(0, i)
        return nums

    def printer(lst, chunk=10):
        print("{}...{}".format(" ".join(map(str, lst[:chunk])),
                               " ".join(map(str, lst[-chunk:]))))

.. code:: ipython3

    %timeit nums = fast_countdown(10**5)

.. parsed-literal::

    5.13 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

.. code:: ipython3

    %timeit nums = slow_countdown(10**5)

.. parsed-literal::

    1.61 s ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Citing Hetland, page 11:

    Python lists aren’t really lists in the traditional computer
    science sense of the word, and that explains the puzzle of why
    append is so much more efficient than insert. A classical list - a
    so-called linked list - is implemented as a series of nodes, each
    (except for the last) keeping a reference to the next. The
    underlying implementation of Python’s list type is a bit
    different. Instead of several separate nodes referencing each
    other, a list is basically a single, contiguous slab of memory -
    what is usually known as an array. This leads to some important
    differences from linked lists. For example, while iterating over
    the contents of the list is equally efficient for both kinds
    (except for some overhead in the linked list), directly accessing
    an element at a given index is much more efficient in an array.
    This is because the position of the element can be calculated, and
    the right memory location can be accessed directly. In a linked
    list, however, one would have to traverse the list from the
    beginning.

    The difference we’ve been bumping up against, though, has to do
    with insertion. In a linked list, once you know where you want to
    insert something, insertion is cheap; it takes roughly the same
    amount of time, no matter how many elements the list contains.
    That’s not the case with arrays: An insertion would have to move
    all elements that are to the right of the insertion point,
    possibly even moving all the elements to a larger array, if
    needed. A specific solution for appending is to use what’s often
    called a dynamic array, or vector. The idea is to allocate an
    array that is too big and then to reallocate it in linear time
    whenever it overflows. It might seem that this makes the append
    just as bad as the insert. In both cases, we risk having to move a
    large number of elements. The main difference is that it happens
    less often with the append. In fact, if we can ensure that we
    always move to an array that is bigger than the last by a fixed
    percentage (say 20 percent or even 100 percent), the average cost,
    amortized over many appends, is constant.

enhance with ``deque`` objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``deque`` objects implement *FIFO* queues: they are as cheap to append
to on the right as a normal ``list``, but they also support *cheap*
insertion at the *front*.

.. code:: ipython3

    from collections import deque

    def enhanced_slow_countdown(count):
        nums = deque()
        for i in range(count):
            nums.appendleft(i)
        return nums

.. code:: ipython3

    %timeit nums = enhanced_slow_countdown(10**5)

.. parsed-literal::

    5.19 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Hidden squares: concerning ``list``\ s and ``set``\ s
-----------------------------------------------------

.. code:: ipython3

    from random import randrange

    max_value = 10000
    checks = 1000
    L = [randrange(max_value) for i in range(checks)]

.. code:: ipython3

    %timeit [randrange(max_value) in L for _ in range(checks)]

.. parsed-literal::

    12.7 ms ± 644 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

.. code:: ipython3

    S = set(L)  # convert the list to a set object.
    %timeit [randrange(max_value) in S for _ in range(checks)]

.. parsed-literal::

    439 µs ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Hetland’s words, page 35:

    They’re both pretty fast, and it might seem pointless to create a
    set from the list—unnecessary work, right? Well, it depends. If
    you’re going to do many membership checks, it might pay off,
    because membership checks are linear for lists and constant for
    sets. What if, for example, you were to gradually add values to a
    collection and for each step check whether the value was already
    added? […] Using a list would give you quadratic running time,
    whereas using a set would be linear. That’s a huge difference.
    **The lesson is that it’s important to pick the right built-in data structure for the job.**
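To make the quoted scenario concrete, here is a minimal sketch of the
gradual-add pattern (both function names and sizes are our own
choices): the ``list`` version re-scans ``seen`` on every membership
check, hence the quadratic total, while the ``set`` version pays a
constant price per check.

.. code:: ipython3

    def dedup_with_list(items):
        seen = []
        for x in items:
            if x not in seen:   # linear scan of `seen`: quadratic overall
                seen.append(x)
        return seen

    def dedup_with_set(items):
        seen = set()
        for x in items:
            if x not in seen:   # constant-time check: linear overall
                seen.add(x)
        return seen

    data = [randrange(max_value) for _ in range(2*10**3)]
    %timeit dedup_with_list(data)
    %timeit dedup_with_set(data)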
A related hidden square lurks when flattening a list of lists by
summing them:

.. code:: ipython3

    lists = [[1, 2], [3, 4, 5], [6]]
    sum(lists, [])

.. parsed-literal::

    [1, 2, 3, 4, 5, 6]

Hetland, page 36:

    This works, and it even looks rather elegant, but it really isn’t.
    You see, under the covers, the sum function doesn’t know all too
    much about what you’re summing, and it has to do one addition
    after another. That way, you’re right back at the quadratic
    running time of the += example for strings. Here’s a better way:
    Just try timing both versions. As long as lists is pretty short,
    there won’t be much difference, but it shouldn’t take long before
    the sum version is thoroughly beaten.

.. code:: ipython3

    res = []
    for lst in lists:
        res.extend(lst)
    res

.. parsed-literal::

    [1, 2, 3, 4, 5, 6]

Try to do that with more populated lists, as sketched below…
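A hedged timing sketch of that suggestion, with sizes that are
arbitrary choices of ours; ``itertools.chain.from_iterable`` is a
further linear alternative to the ``extend`` loop:

.. code:: ipython3

    from itertools import chain

    many_lists = [[i] * 10 for i in range(10**3)]    # a thousand short lists

    %timeit sum(many_lists, [])                      # repeated list additions: quadratic
    %timeit list(chain.from_iterable(many_lists))    # a single linear pass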
concerning ``string``\ s
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    def string_producer(length):
        # note: `randrange` excludes its upper bound, so 'z' never occurs
        return ''.join([chr(randrange(ord('a'), ord('z'))) for _ in range(length)])

.. code:: ipython3

    %%timeit
    s = ""
    for chunk in string_producer(10**5):
        s += chunk

.. parsed-literal::

    74.4 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Some optimization is likely performed here because ``s`` is a ``str``
object: CPython can resize the string in place when ``s`` holds the
only reference to it, so the loop is faster than a naive quadratic
analysis would suggest.

.. code:: ipython3

    %%timeit
    chunks = []
    for chunk in string_producer(10**5):
        chunks.append(chunk)
    s = ''.join(chunks)

.. parsed-literal::

    61.5 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The above is a better approach, using the constant-time ``append`` at
the top (the right end) followed by a single ``join``.

.. code:: ipython3

    %timeit s = ''.join(string_producer(10**5))

.. parsed-literal::

    60.1 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Maybe a little better still, since it doesn’t loop with an explicit
``for``.

Counting
========

Max permutation
---------------

Eight persons with very particular tastes have bought tickets to the
movies. Some of them are happy with their seats, but most of them are
not. Let’s say each of them has a favorite seat, and you want to find
a way to let them switch seats to make as many people as possible
happy with the result. However, all of them refuse to move to another
seat if they can’t get their favorite.

The following function ``max_perm`` computes the maximum permutation
that can be applied given a desired one; namely, it produces a new
permutation that moves as many elements as it can, in order to ensure
the *one-to-one* property: no one in the set points outside it, and
each seat (in the set) is pointed to exactly once. It can be seen as a
function that *fixes* a given permutation according to the required
behavior.

.. code:: ipython3

    def perm_isomorphism(M, domain):
        iso = dict(enumerate(domain))
        return [iso[M[i]] for i in range(len(M))]

    def fix_perm(M, fix):
        return [M[i] if i in fix else i for i in range(len(M))]

The following is a naive implementation, recursive but in
:math:`\mathcal{O}(n^{2})`, where :math:`n` is the permutation length.

.. code:: ipython3

    def naive_max_perm(M, A=None):
        '''
        Fix a permutation such that it is one-to-one and maximal, recursively.

        consumes:
        M - a permutation as a list of integers
        A - a set of positions allowed to move

        produces:
        a set `fix` of positions that makes M maximal, ensuring it is one-to-one
        '''
        if A is None: A = set(range(len(M)))    # handle first invocation: all elems can move
        if len(A) == 1: return A                # base: a lone remaining element desires its own seat
        B = set(M[i] for i in A)                # b in B iff b is desired by someone
        C = A - B                               # c in C iff c isn't desired, so discard it
        return naive_max_perm(M, A - C) if C else A     # recur with desired positions only

.. code:: ipython3

    I = range(8)    # the identity permutation
    letters = "abcdefgh"
    perm_isomorphism(I, letters)

.. parsed-literal::

    ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

.. code:: ipython3

    M = [2, 2, 0, 5, 3, 5, 7, 4]
    perm_isomorphism(M, letters)

.. parsed-literal::

    ['c', 'c', 'a', 'f', 'd', 'f', 'h', 'e']

.. code:: ipython3

    fix = naive_max_perm(M)
    max_M = fix_perm(M, fix)
    perm_isomorphism(max_M, letters)

.. parsed-literal::

    ['c', 'b', 'a', 'd', 'e', 'f', 'g', 'h']

Hetland, page 78:

    The function ``naive_max_perm`` receives a set ``A`` of remaining
    people and creates a set ``B`` of seats that are pointed to. If it
    finds an element in ``A`` that is not in ``B``, it removes the
    element and solves the remaining problem recursively. Let’s use
    the implementation on our example, M = ``[2, 2, 0, 5, 3, 5, 7, 4]``:

.. code:: ipython3

    naive_max_perm(M)

.. parsed-literal::

    {0, 2, 5}

    So, a, c, and f can take part in the permutation. The others will
    have to sit in nonfavorite seats. The implementation isn’t too
    bad. The handy set type lets us manipulate sets with ready-made
    high-level operations, rather than having to implement them
    ourselves. There are some problems, though. For one thing, we
    might want an iterative solution. […] A worse problem, though, is
    that the algorithm is quadratic! (Exercise 4-10 asks you to show
    this.) The most wasteful operation is the repeated creation of the
    set B. If we could just keep track of which chairs are no longer
    pointed to, we could eliminate this operation entirely. One way of
    doing this would be to keep a count for each element. We could
    decrement the count for chair x when a person pointing to x is
    eliminated, and if x ever got a count of zero, both person and
    chair x would be out of the game.

        This idea of reference counting can be useful in general. It
        is, for example, a basic component in many systems for garbage
        collection (a form of memory management that automatically
        deallocates objects that are no longer useful). You’ll see
        this technique again in the discussion of topological sorting.

    There may be more than one element to be eliminated at any one
    time, but we can just put any new ones we come across into a
    “to-do” list and deal with them later. If we needed to make sure
    the elements were eliminated in the order in which we discover
    that they’re no longer useful, we would need to use a first-in,
    first-out queue such as the deque class (discussed in Chapter 5).
    We don’t really care, so we could use a set, for example, but just
    appending to and popping from a list will probably give us quite a
    bit less overhead. But feel free to experiment, of course.

.. code:: ipython3

    def max_perm(M):
        n = len(M)                      # how many elements?
        A = set(range(n))               # A = {0, 1, ..., n-1}
        count = Counter(M)              # desired positions, by frequencies
        Q = deque([i for i in A if not count[i]])   # useless elements
        while Q:                        # while useless elements are left...
            i = Q.pop()                 # get one of them
            A.remove(i)                 # remove it from the maximal permutation
            j = M[i]                    # get its desired position
            count[j] -= 1               # and release it for someone else
            if not count[j]:            # if that position isn't desired anymore,
                Q.appendleft(j)         # enqueue it, in order to discard it too
        return A

.. code:: ipython3

    fix = max_perm(M)
    max_M = fix_perm(M, fix)
    perm_isomorphism(max_M, letters)

.. parsed-literal::

    ['c', 'b', 'a', 'd', 'e', 'f', 'g', 'h']
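To observe the quadratic behavior of ``naive_max_perm`` against the
linear ``max_perm``, one can feed both a worst case such as a
descending chain, where each round discards exactly one element. The
following sketch is our own; the size is kept well below Python’s
default recursion limit, since ``naive_max_perm`` recurs once per
round:

.. code:: ipython3

    n = 800    # small enough to avoid RecursionError in the naive version
    chain_perm = [max(i - 1, 0) for i in range(n)]   # everyone desires the seat before theirs

    %timeit naive_max_perm(chain_perm)   # one elimination per round: quadratic
    %timeit max_perm(chain_perm)         # reference counting: linear

Both return ``{0}``, since only the first person already sits in her
favorite seat.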
Counting Sort
-------------

Hetland, page 85:

    By default, I’m just sorting objects based on their values. By
    supplying a key function, you can sort by anything you’d like.
    Note that the keys must be integers in a limited range. If this
    range is :math:`0\ldots k-1`, running time is then
    :math:`\mathcal{O}(n + k)`. (Note that although the common
    implementation simply counts the elements and then figures out
    where to put them in ``B``, Python makes it easy to just build
    value lists for each key and then concatenate them.) If several
    values have the same key, they’ll end up in the original order
    with respect to each other. Sorting algorithms with this property
    are called *stable*.

.. code:: ipython3

    def counting_sort(A, key=None, sort_boundary=None):
        '''
        Sorts the given collection A in linear time, assuming its
        elements are hashable and their keys are integers.

        This is a vanilla counting sort, running in time linear in the
        iterable length plus the spacing between keys. It works best
        if the keys are evenly, namely *uniformly*, distributed over
        the domain; in contrast, if they are sparse and concentrated
        near a few accumulation points, traversing the gaps between
        them is time consuming.

        If `sort_boundary` is a float within [0, 1] and the number of
        distinct keys is at most that fraction of len(A), the key
        domain is sorted with a classic loglinear algorithm instead of
        being scanned exhaustively, before building the result.
        '''
        if key is None: key = lambda x: x
        B, C = [], defaultdict(list)
        for x in A:
            C[key(x)].append(x)
        domain = sorted(C) if sort_boundary and len(C) <= len(A)*sort_boundary \
                 else range(min(C), max(C)+1)
        for k in domain:
            B.extend(C[k])
        return B

.. code:: ipython3

    A = [randrange(50) for i in range(2*10**3)]
    assert sorted(A) == counting_sort(A)

.. code:: ipython3

    n, bins, patches = plt.hist(A, 10, facecolor='green', alpha=0.5)
    plt.xlabel('elements'); plt.ylabel('frequencies'); plt.grid(True)
    plt.show()

.. image:: gotchas_files/gotchas_72_0.png

.. code:: ipython3

    %timeit counting_sort(A)

.. parsed-literal::

    219 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

.. code:: ipython3

    %timeit counting_sort(A, sort_boundary=1)

.. parsed-literal::

    206 µs ± 8.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

``B`` below is bimodal: half of the keys lie near :math:`0` and half
near :math:`10^{4}`, so the key range is long but sparsely populated,
exactly the case where ``sort_boundary`` pays off.

.. code:: ipython3

    B = ([randrange(50) for i in range(10**3)] +
         [10**4 + randrange(50) for i in range(10**3)])

.. code:: ipython3

    n, bins, patches = plt.hist(B, 100, facecolor='green', alpha=0.5)
    plt.xlabel('elements'); plt.ylabel('frequencies'); plt.grid(True)
    plt.show()

.. image:: gotchas_files/gotchas_76_0.png

.. code:: ipython3

    assert sorted(B) == counting_sort(B)

.. code:: ipython3

    %timeit counting_sort(B)

.. parsed-literal::

    2.01 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

.. code:: ipython3

    %timeit counting_sort(B, sort_boundary=1/8)

.. parsed-literal::

    247 µs ± 20.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
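As a closing illustration of the ``key`` parameter (a toy example of
ours, not from the book), sorting words by their length uses small
integer keys and also exhibits the *stability* quoted above:
``'kiwi'`` and ``'plum'`` keep their original relative order.

.. code:: ipython3

    words = ['banana', 'kiwi', 'apple', 'fig', 'plum']
    counting_sort(words, key=len)   # -> ['fig', 'kiwi', 'plum', 'apple', 'banana']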