Python Hash Tables: Understanding Dictionaries

2020-08-21 05:59:49 +0000

Hi guys, have you ever wondered how Python dictionaries can be so fast and reliable? The answer is that they are built on top of another data structure: hash tables.

Knowing how Python hash tables work will give you a deeper understanding of how dictionaries work, and that is a real advantage because dictionaries are almost everywhere in Python.

Hash Functions

Before introducing hash tables and their Python implementation, you have to know what a hash function is and how it works.

A hash function is a function that can map a piece of data of any length to a fixed-length value, called hash.

Hash functions have three major characteristics:

  1. They are fast to compute: calculating the hash of a piece of data has to be a fast operation.
  2. They are deterministic: the same input will always produce the same hash.
  3. They produce fixed-length values: it doesn’t matter if your input is one, ten, or ten thousand bytes, the resulting hash will always have a fixed, predetermined length.

Another characteristic that is quite common in hash functions is that they are often one-way functions: because the function deliberately discards information, you can get a hash from a string but you can’t get the original string back from a hash. This is not a mandatory feature for every hash function, but it becomes important when a hash function has to be cryptographically secure.

Some popular hash algorithms are MD5, SHA-1, SHA-2, NTLM.

If you want to try one of these algorithms by yourself, just point your browser to https://www.md5online.org, insert a text of any length in the textbox, click the crypt button and get your 128bit MD5 hash back.
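If you would rather stay in the interpreter, the standard library's hashlib module exposes the same algorithms. The snippet below is only a small sketch: the md5_hex() helper and the sample inputs are made up for illustration. It checks two of the properties listed above, fixed length and determinism:

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


short_digest = md5_hex(b"a")
long_digest = md5_hex(b"a" * 10_000)

# Fixed length: MD5 always yields 128 bits, that is 32 hexadecimal characters.
print(len(short_digest), len(long_digest))

# Deterministic: hashing the same bytes again gives the same digest.
print(md5_hex(b"a") == short_digest)
```

Keep in mind that MD5 is fine for playing around but is no longer considered safe for security purposes.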

Common Usages of Hashes

There are a lot of things that rely on hashes, and hash tables are just one of them. Other common usages of hashes are for cryptographic and security reasons.

A concrete example of this is when you download open-source software from the internet. Usually, you also find a companion file that contains the checksum of the download (sometimes loosely called its signature). This checksum is just the hash of the original file, and it’s very useful: if you calculate the hash of the file you downloaded and it matches the value the site provides, you can be confident that the file hasn’t been tampered with.
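A check like this takes only a few lines with hashlib. The following is a sketch under some assumptions: the helper names file_sha256() and matches_published_hash() are hypothetical, SHA-256 is just one of the algorithms a site might publish, and a throwaway temporary file stands in for a real download:

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """Compare the locally computed digest with the one the site published."""
    return file_sha256(path) == published_hex.strip().lower()


# Demo: a temporary file plays the role of the downloaded installer.
content = b"pretend this is an installer"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
    demo_path = tmp.name

published = hashlib.sha256(content).hexdigest()
intact = matches_published_hash(demo_path, published)    # digest matches
tampered = matches_published_hash(demo_path, "0" * 64)   # digest differs
os.remove(demo_path)

print(intact, tampered)
```

Reading the file in chunks keeps memory usage flat even for very large downloads.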

Another common use of hashes is storing user passwords. Have you ever asked yourself why, when you forget the password of a website and try to recover it, the site usually lets you choose a new password instead of giving you back the original one? The answer is that the website doesn’t store the password you chose, just its hash.

This is done for security reasons: if an attacker gets access to the site’s database, they won’t see your password but only its hash, and since hash functions are often one-way functions, they can’t simply compute your password back from the hash.

The Python hash() Function

Python has a built-in function to generate the hash of an object, the hash() function. This function takes an object as input and returns the hash as an integer.

Internally, this function invokes the .__hash__() method of the input object, so if you want to make your custom class hashable, all you have to do is to implement the .__hash__() method to return an integer based on the internal state of your object.

Now, try to start the Python interpreter and play with the hash() function a little bit. For the first experiment, try to hash some numeric values:

>>> hash(1)
1
>>> hash(10)
10
>>> hash(10.00)
10
>>> hash(10.01)
230584300921368586
>>> hash(-10.01)
-230584300921368586

If you are wondering why these hashes seem to have different lengths, remember that the Python hash() function returns an integer that always fits in a fixed number of bits (64 on a standard 64-bit Python 3 interpreter), no matter how many decimal digits it takes to print it.

As you can see, by default the hash of a small integer is the integer itself (CPython makes an exception for -1, whose hash is -2). Note that this works regardless of the type of the value you are hashing, so the integer 1 and the float 1.0 have the same hash: 1.

What’s so special about this? Well, it shows what you learned earlier, namely that hash functions are often one-way functions: if two different objects can have the same hash, it’s impossible to go back from a hash to the original object. In this case, the information about the type of the original object has been lost.
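A practical consequence of equal numbers sharing a hash shows up as soon as such values are used as dictionary keys. This short sketch (the merged dictionary is just an illustration) shows that 1, 1.0 and True end up as one and the same key:

```python
# Values that compare equal must hash equally, whatever their type.
print(hash(1) == hash(1.0) == hash(True))  # True

# Since 1, 1.0 and True are equal and share a hash, they are the same key:
# the first key object is kept and only the value is overwritten.
merged = {1: "int", 1.0: "float", True: "bool"}
print(len(merged))  # 1
print(merged)       # {1: 'bool'}
```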

Another couple of interesting things you can notice by hashing numbers is that decimal numbers have hashes that are different from their value and that negative values have negative hashes. But what happens if you hash the integer that you got as the hash of a decimal value? You get the same hash again, as shown in the following example:

>>> hash(0.1)
230584300921369408
>>> hash(230584300921369408)
230584300921369408
>>> hash(0.1) == hash(230584300921369408)
True

As you can see, the hash of the integer 230584300921369408 is the same as the hash of the number 0.1. This is perfectly normal if you think about what you learned earlier: a hash function maps any number or any string to a fixed-length value, and since a fixed-length value can only represent a finite number of distinct results, some different inputs must share the same hash. These duplicates are called collisions. When two objects have the same hash, they are said to collide.

Hashing a string is not much different from hashing a numeric value. Start your Python interpreter and try hashing a string:

>>> hash("Bad Behaviour")
7164800052134507161

As you can see, a string is hashable and produces a numeric value as well, but if you ran this command you probably noticed that your Python interpreter didn’t return the same result as the example above. That’s because, starting from Python 3.3, the hashes of str and bytes objects are salted with a random value. The salt is chosen when the interpreter starts, so the hash of a string is stable within one process but changes from one run to the next. If you want to override this behaviour, you can set the PYTHONHASHSEED environment variable to a fixed integer before starting the interpreter (any positive value gives repeatable hashes, and 0 turns the randomization off).

As you may expect, this is a security feature. Earlier you learned that websites usually store the hash of your password instead of the password itself, to prevent an attacker who breaks into the site’s database from stealing all the passwords. If a website stores the plain hash exactly as it is calculated, it can still be easy for attackers to find out the original password. They just need to get a big list of commonly used passwords (the web is full of these lists) and precompute the corresponding hashes, building a lookup table that is usually called a rainbow table.

By using a rainbow table the attacker may not be able to get every password in the database, but may still recover a large share of them. To prevent this kind of attack, a good idea is to salt the passwords before hashing them, that is, to combine each password with a random value before calculating the hash.
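To make the idea of salting concrete, here is a minimal sketch built on the standard library. Note the assumptions: hash_password() and verify_password() are hypothetical helper names, and PBKDF2 (a deliberately slow, salted derivation available as hashlib.pbkdf2_hmac) is used here as one reasonable choice, not as the method any particular website uses:

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=100_000):
    """Return (salt, digest) for password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt for every password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, digest


def verify_password(password, salt, expected_digest, iterations=100_000):
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected_digest)


salt_a, digest_a = hash_password("correct horse")
salt_b, digest_b = hash_password("correct horse")

# Same password, different salts: the stored digests differ, so a table
# precomputed from common passwords no longer matches them.
print(digest_a != digest_b)

print(verify_password("correct horse", salt_a, digest_a))  # True
print(verify_password("wrong guess", salt_a, digest_a))    # False
```

The salt is not a secret: it is stored next to the digest, and its only job is to make every stored hash unique.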

Starting from Python 3.3, the interpreter by default salts every string and bytes object before hashing it, to prevent possible DoS attacks such as the one demonstrated by Scott Crosby and Dan Wallach in this 2003 paper.

A DoS attack (where DoS stands for Denial of Service) is an attack where the resources of a computer system are deliberately exhausted by the attacker so that the system is no longer able to provide service to its clients. In the specific attack demonstrated by Crosby and Wallach, the attacker floods the target system with a lot of data whose hashes collide, making the target system spend far more computing power to resolve the collisions.
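If you want to see the effect of PYTHONHASHSEED without restarting your own session, you can start fresh interpreters from a script. This is a sketch that assumes Python 3.7 or later; the helper name string_hash_in_new_interpreter() is made up for the example:

```python
import os
import subprocess
import sys


def string_hash_in_new_interpreter(text, seed):
    """Start a fresh interpreter with the given PYTHONHASHSEED and hash text."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


first = string_hash_in_new_interpreter("Bad Behaviour", 46)
second = string_hash_in_new_interpreter("Bad Behaviour", 46)

# Two separate processes, same seed: the string hash is reproducible.
print(first == second)  # True
```

Calling the helper with a different seed will normally give a different value, which is exactly what happens when the seed is left random.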

Python Hashable Types

So at this point, you may wonder whether every Python type is hashable. The answer is no: by default, only immutable types are hashable in Python. If you are using an immutable container (like a tuple), its contents must be hashable as well for the container to be hashable.

If you try to get the hash of an unhashable type in Python, you will get a TypeError from the interpreter, as shown in the following example:

>>> hash(["R","e","a","l","P","y","t","h","o","n"])
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
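The rule about containers is easy to verify. In this sketch, is_hashable() is a hypothetical helper that simply reports whether hash() raises a TypeError:

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable((1, "two", 3.0)))    # True: a tuple of immutable items
print(is_hashable((1, ["two"], 3.0)))  # False: the tuple contains a list
print(is_hashable(frozenset({1, 2})))  # True: immutable counterpart of set
print(is_hashable({1, 2}))             # False: an ordinary set is mutable
```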

However, every instance of a custom class is hashable in Python, and by default its hash is derived from its id. That means that two different instances of the same class have different hashes by default, as shown in the following example:

>>> class Car():
...     velocity = 0
...     direction = 0
...     damage = 0
...
>>> first_car = Car()
>>> second_car = Car()
>>> hash(first_car)
274643597
>>> hash(second_car)
274643604

As you can see, two different instances of the same custom object by default have different hash values. However, this behavior can be modified by implementing a .__hash__() method inside the custom class.
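As a sketch of what such a method could look like, the variant of Car below hashes a tuple of its attributes. The _state() helper is an assumption made for this example, and __eq__() is defined together with __hash__() because objects that compare equal are expected to have the same hash:

```python
class Car:
    def __init__(self, velocity=0, direction=0, damage=0):
        self.velocity = velocity
        self.direction = direction
        self.damage = damage

    def _state(self):
        """The values that define the identity of a car (hypothetical helper)."""
        return (self.velocity, self.direction, self.damage)

    def __eq__(self, other):
        if not isinstance(other, Car):
            return NotImplemented
        return self._state() == other._state()

    def __hash__(self):
        # Delegate to the hash of a tuple built from the same state used by __eq__.
        return hash(self._state())


first_car = Car()
second_car = Car()
print(first_car == second_car)              # True
print(hash(first_car) == hash(second_car))  # True
print(len({first_car, second_car}))         # 1: the set sees a single car
```

One caveat of this design: if you change an attribute after the object has been used as a dictionary key or set member, its hash changes and the object can no longer be found, so state-based hashes are best reserved for objects you treat as immutable.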

Hash Tables

Now that you know what a hash function is, you can start examining hash tables. A hash table is a data structure that allows you to store a collection of key-value pairs.

In a hash table, the key of every key-value pair must be hashable, because the pairs stored are indexed by using the hash of their keys. Hash tables are very useful because the average number of instructions needed to look up an element of the table is independent of the number of elements stored in the table itself. That means that even if your table grows ten or ten thousand times, the average time to look up a specific element is not affected.

A hash table is typically implemented by creating a variable number of buckets that will contain your data and indexing this data by hashing their keys. The hash value of the key will determine the correct bucket to be used for that particular piece of data.

In the example below, you can find an implementation of a basic hash table in Python. It is only meant to give you an idea of how a hash table could work because, as you will see later, in Python there’s no need to write your own hash table: dictionaries already are one. Let’s see how this implementation works:

import pprint

class Hashtable:
    def __init__(self, elements):
        self.bucket_size = len(elements)
        self.buckets = [[] for i in range(self.bucket_size)]
        self._assign_buckets(elements)

    def _assign_buckets(self, elements):
        for key, value in elements:
            hashed_value = hash(key)
            index = hashed_value % self.bucket_size
            self.buckets[index].append((key, value))

    def get_value(self, input_key):
        hashed_value = hash(input_key)
        index = hashed_value % self.bucket_size
        bucket = self.buckets[index]
        for key, value in bucket:
            if key == input_key:
                return value
        return None

    def __str__(self):
        return pprint.pformat(self.buckets) # here pformat is used to return a printable representation of the object

if __name__ == "__main__":
    capitals = [
        ('France', 'Paris'),
        ('United States', 'Washington D.C.'),
        ('Italy', 'Rome'),
        ('Canada', 'Ottawa')
    ]
    hashtable = Hashtable(capitals)
    print(hashtable)
    print(f"The capital of Italy is {hashtable.get_value('Italy')}")

Look at the for loop in the _assign_buckets() method. For each key-value pair in the input, this code calculates the hash of the key, derives the index of the bucket from that hash with the modulo operator, and appends the pair to that bucket as a tuple.

Try to run the example above after setting the environment variable PYTHONHASHSEED to the value 46 and you will get the following output, where two buckets are empty and the two other buckets contain two key-value pairs each:

[[('United States', 'Washington D.C.'), ('Canada', 'Ottawa')],
 [],
 [],
 [('France', 'Paris'), ('Italy', 'Rome')]]
The capital of Italy is Rome

Note that if you run the program without setting the PYTHONHASHSEED variable, you will probably get a different result because, as you already know, starting from Python 3.3 the hash function salts every string with a random seed before the hashing process.

In the example above you have implemented a Python hash table that takes a list of tuples as input and organizes them in a number of buckets equal to the length of the input list with a modulo operator to distribute the hashes in the table.

However, as you can see in the output, you got two empty buckets while the other two have two different values each. When this happens, it’s said that there’s a collision in the Python hash table.

With Python’s built-in hash() function, collisions in a hash table are unavoidable. You could decide to use a higher number of buckets to lower the risk of a collision, but you will never reduce the risk to zero.

Moreover, the more you increase the number of buckets you will handle, the more space you will waste. To test this you can simply change the bucket size of your previous example using a number of buckets that is two times the length of the input list:

class Hashtable:
    def __init__(self, elements):
        self.bucket_size = len(elements) * 2
        self.buckets = [[] for i in range(self.bucket_size)]
        self._assign_buckets(elements)

Running this example, I ended up with a better distribution of the input data, but I still had one collision and five unused buckets:

[[],
 [],
 [],
 [('Canada', 'Ottawa')],
 [],
 [],
 [('United States', 'Washington D.C.'), ('Italy', 'Rome')],
 [('France', 'Paris')]]
The capital of Italy is Rome

As you can see, two hashes collided and have been inserted into the same bucket.

Since collisions are often unavoidable, implementing a hash table also requires you to implement a collision resolution method. The common strategies to resolve collisions in a hash table are:

  * separate chaining
  * open addressing

Separate chaining is the strategy you already implemented in the example above: it consists of creating a chain of values in the same bucket by using another data structure. In that example, you used a nested list that has to be scanned entirely when looking for a specific value in an overcrowded bucket.

In the open addressing strategy, if the bucket you should use is busy, you just keep searching for a new bucket to be used. To implement this solution, you need to make a couple of changes to both how you assign buckets to new elements and how you retrieve values for a key. Starting from the _assign_buckets() method, you have to initialize your buckets with a default value and keep looking for an empty bucket if the one you should use has already been taken:

    def _assign_buckets(self, elements):
        self.buckets = [None] * self.bucket_size

        for key, value in elements:
            hashed_value = hash(key)
            index = hashed_value % self.bucket_size

            while self.buckets[index] is not None:
                print(f"The key {key} collided with {self.buckets[index]}")
                index = (index + 1) % self.bucket_size

            self.buckets[index] = ((key, value))

As you can see, all the buckets are set to a default None value before the assignment, and the while loop keeps looking for an empty bucket to store the data.

Since the assignment of the buckets has changed, the retrieval process has to change as well: in the get_value() method you now need to compare keys while probing, to be sure that the data you found is the data you were looking for:

    def get_value(self, input_key):
        hashed_value = hash(input_key)
        index = hashed_value % self.bucket_size
        while self.buckets[index] is not None:
            key,value = self.buckets[index]
            if key == input_key:
                return value
            index = (index + 1) % self.bucket_size

During the lookup process, in the get_value() method you use the None value to check when you need to stop looking for a key and then you check the key of the data to be sure that you are returning the correct value.

Running the example above, the key for Italy collided with a previously inserted element (France) and for this reason has been relocated to the first free bucket available. However, the search for Italy worked as expected:

The key Italy collided with ('France', 'Paris')
[None,
 None,
 ('Canada', 'Ottawa'),
 None,
 ('France', 'Paris'),
 ('Italy', 'Rome'),
 None,
 ('United States', 'Washington D.C.')]
The capital of Italy is Rome

The main problem of the open addressing strategy is that if you also have to handle deletions of elements in your table, you need to perform logical deletions instead of physical ones, because if you physically delete a value that was occupying a bucket during a collision, the other collided elements will never be found.

In our previous example, Italy collided with a previously inserted element (France) and was relocated to the very next bucket. Removing the France element would make Italy unreachable: it does not occupy its natural destination bucket, and that bucket would now look empty to the lookup code.

So, when using the open addressing strategy, to delete an element you have to replace its bucket with a dummy value, which indicates to the interpreter that it has to be considered deleted for new insertion but occupied for retrieval purposes.

Dictionaries: Implementing Python Hash Tables

Now that you know what hash tables are, let’s have a look at their most important Python implementation: dictionaries. Dictionaries in Python are built using hash tables and the open addressing collision resolution method.

As you already know a dictionary is a collection of key-value pairs, so to define a dictionary you need to provide a comma-separated list of key-value pairs enclosed in curly braces, as in the following example:

>>> chess_players = {
...    "Carlsen": 2863,
...    "Caruana": 2835,
...    "Ding": 2791,
...    "Nepomniachtchi": 2784,
...    "Vachier-Lagrave": 2778,
... }

Here you have created a dictionary named chess_players that contains the top five chess players in the world and their current ratings.

To retrieve a specific value you just need to specify the key using square brackets:

>>> chess_players["Nepomniachtchi"]
2784

If you try to access a nonexistent element, the Python interpreter raises a KeyError exception:

>>> chess_players["Mastromatteo"]
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
KeyError: 'Mastromatteo'
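When a missing key is a normal situation rather than an error, there are a few common ways to avoid the exception. The sketch below uses a shortened copy of the chess_players dictionary:

```python
chess_players = {"Carlsen": 2863, "Caruana": 2835, "Ding": 2791}

# .get() returns None, or a default of your choice, instead of raising KeyError.
print(chess_players.get("Mastromatteo"))     # None
print(chess_players.get("Mastromatteo", 0))  # 0

# The in operator tests for a key without retrieving its value.
print("Carlsen" in chess_players)            # True

# try/except is the usual choice when a missing key really is exceptional.
try:
    rating = chess_players["Mastromatteo"]
except KeyError as error:
    rating = None
    missing_key = error.args[0]
    print(f"no rating for {missing_key}")
```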

To iterate over the entire dictionary you can use the .items() method, which returns an iterable view of all the key-value pairs as tuples:

>>> for (k, v) in chess_players.items():
...     print(k,v)
... 
Carlsen 2863
Caruana 2835
Ding 2791
Nepomniachtchi 2784
Vachier-Lagrave 2778

To iterate over the keys or over the values of the Python dictionary, you can use the .keys() or the .values() methods as well:

>>> chess_players.keys()
dict_keys(['Carlsen', 'Caruana', 'Ding', 'Nepomniachtchi', 'Vachier-Lagrave'])
>>> chess_players.values()
dict_values([2863, 2835, 2791, 2784, 2778])

To insert another element into the dictionary you just need to assign a value to a new key:

>>> chess_players["Grischuk"] = 2777
>>> chess_players
{'Carlsen': 2863, 'Caruana': 2835, 'Ding': 2791, 'Nepomniachtchi': 2784, 'Vachier-Lagrave': 2778, 'Grischuk': 2777}

To update the value of an existing key, just assign a different value to the previously inserted key.

Please note that since dictionaries are built on top of hash tables, you can only insert an element if its key is hashable. If the key of your element is not hashable (like a list, for example), the interpreter raises a TypeError exception:

>>> my_list = ["Giri", "Mamedyarov"]
>>> chess_players[my_list] = 2764
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

To delete an element, you need to use the del statement, specifying the key you want to delete:

>>> del chess_players["Grischuk"]
>>> chess_players
{'Carlsen': 2863, 'Caruana': 2835, 'Ding': 2791, 'Nepomniachtchi': 2784, 'Vachier-Lagrave': 2778}

Deleting an entry doesn’t physically remove its slot from the dictionary: the slot is replaced with a dummy value so that the open addressing collision resolution method continues to work. The interpreter handles all this complexity for you and ignores the deleted element.

The Pythonic Implementation of Python Hash Tables

Now you know that dictionaries are Python hash tables, but you may wonder how the implementation works under the hood, so in this section I will try to give you some information about the actual implementation of Python hash tables.

Bear in mind that the information I will provide here is based on recent versions of Python, because with Python 3.6 dictionaries changed a lot and are now smaller, faster and even more powerful, as they are now insertion ordered (insertion ordering was introduced as an implementation detail in Python 3.6 and became an official language guarantee, endorsed by Guido, in Python 3.7).
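Here is a small sketch of what insertion ordering means in practice on Python 3.7 or later (the ratings dictionary is just sample data): updating a key keeps its position, while deleting and re-inserting it moves it to the end.

```python
ratings = {}
ratings["Ding"] = 2791
ratings["Carlsen"] = 2863
ratings["Caruana"] = 2835
print(list(ratings))  # ['Ding', 'Carlsen', 'Caruana']

# Updating an existing key does not change its position.
ratings["Ding"] = 2800
order_after_update = list(ratings)
print(order_after_update)  # ['Ding', 'Carlsen', 'Caruana']

# Deleting a key and inserting it again moves it to the end.
del ratings["Ding"]
ratings["Ding"] = 2791
order_after_reinsert = list(ratings)
print(order_after_reinsert)  # ['Carlsen', 'Caruana', 'Ding']
```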

Try to create an empty Python dictionary and check its size and you will find out that an empty Python dictionary takes 240 bytes of memory:

>>> import sys
>>> my_dict = {}
>>> sys.getsizeof(my_dict)
240

By running this example you can see that the base size of a Python dictionary is 240 bytes (the exact figures in this section depend on your Python version and platform). But what happens if you decide to add a value? Well, it may seem odd, but the size doesn’t change:

>>> my_dict["a"] = 100
>>> sys.getsizeof(my_dict)
240

So, why hasn’t the size of the dictionary changed? Because starting from Python 3.6 the entries are stored in a separate, compact array and the hash table itself contains just indexes that point to where each entry is stored. Moreover, when you create an empty dictionary, Python allocates a hash table with 8 buckets that takes just 240 bytes, so adding the first element to our dictionary hasn’t changed the size at all.

Now start again from an empty dictionary, add some more elements and see how it behaves. You will see that the dictionary grows:

>>> my_dict = {}
>>> for i in range(20):
...     my_dict[i] = 100
...     print(f"elements = {i+1} size = {sys.getsizeof(my_dict)}")
... 
elements = 1 size = 240
elements = 2 size = 240
elements = 3 size = 240
elements = 4 size = 240
elements = 5 size = 240
elements = 6 size = 368
elements = 7 size = 368
elements = 8 size = 368
elements = 9 size = 368
elements = 10 size = 368
elements = 11 size = 648
elements = 12 size = 648
elements = 13 size = 648
elements = 14 size = 648
elements = 15 size = 648
elements = 16 size = 648
elements = 17 size = 648
elements = 18 size = 648
elements = 19 size = 648
elements = 20 size = 648

As you can see, the dict grew when you inserted the sixth and the eleventh element, but why? Because to keep our Python hash table fast and reduce collisions, the interpreter resizes the dictionary whenever it becomes two-thirds full.
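Since the exact sizes change between Python versions, it can be handy to let the interpreter tell you where the resizes happen on your machine. This sketch assumes CPython (where sys.getsizeof() works on dictionaries); resize_points() is a hypothetical helper, and no expected output is shown because the numbers depend on your interpreter:

```python
import sys


def resize_points(max_items=100):
    """Return (item_count, size_in_bytes) pairs at which a growing dict changes size."""
    data = {}
    points = [(0, sys.getsizeof(data))]
    for i in range(max_items):
        data[i] = None
        size = sys.getsizeof(data)
        if size != points[-1][1]:
            # The reported size changed while inserting this item.
            points.append((len(data), size))
    return points


for count, size in resize_points():
    print(f"{count:>3} items -> {size} bytes")
```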

Now, try to delete all the elements in your dictionary, one at a time, and when you have finished, check the size again. You will find that even though the dictionary is empty, the space hasn’t been freed:

>>> keys = list(my_dict.keys())
>>> for key in keys:
...     del my_dict[key]
...
>>> my_dict
{}
>>> sys.getsizeof(my_dict)
648

This happens because dictionaries have a really small memory footprint and deletions are not frequent when working with dictionaries, so the interpreter prefers to waste a little bit of space rather than dynamically resize the dictionary after every deletion. However, if you empty your dictionary by calling the .clear() method, which is a bulk deletion, the space is freed and the dictionary shrinks to its minimum of 72 bytes:

>>> my_dict.clear()
>>> sys.getsizeof(my_dict)
72

As you may imagine, the first insertion on this dictionary will make the interpreter reserve the space for 8 buckets, going back to the initial situation.

Conclusions

In this article you have learned what hash tables are and how they are implemented in Python.

A huge part of this article is based on Raymond Hettinger’s talk at PyCon 2017.

Raymond Hettinger is a Python core developer and his contribution to Python development has been invaluable so far.

Manning Publications

2020-05-06 23:18:49 +0000

Hey guys, today I’m thrilled to announce that the people at Manning Publications have made a wonderful gift to our beloved visitors of The Python Corner: 40% off ANY single Python book in their catalog. Just use the code nlcorner40 during the checkout process and enjoy your 40% off!

Manning Publications has a great catalog of books about Python, both in print and as eBooks, and what I love most is that they have a modern idea of what a book is.

What do I mean? Well, it’s easily explained in three points:

  1. If you buy a printed book on Manning Publications, you get the eBook version FOR FREE as well. Their idea is simple: you’re buying the content, not the book. And I couldn’t agree more on that!
  2. If you have bought an eBook from Manning you can easily upgrade to the printed version at any time, and it costs as little as $12 + shipping. Pretty cool, uh?
  3. They have the “Manning Early Access Program” (or MEAP) that allows you to read a book chapter by chapter while it’s being written. That’s an amazing opportunity both for you and for the author… and obviously, you will get the final book as soon it’s finished!

So what are you waiting for? Go to Manning Publications and have a look at their catalog. They also have some FREE books, so why not give them a try?

My recommendation is “The Well-Grounded Python Developer” by Doug Farrell. It’s only at chapter 2 right now, but Doug is an incredible professional who writes articles for Real Python, so it’s what I call a “sure bet”! ;)

Enjoy! D.

Working with EBCDIC in Python

2020-04-29 23:18:49 +0000

A couple of months ago I had to rewrite a program that used to run on an IBM System z9. For those of you who don’t know what I’m talking about… “it’s a mainframe, kiddo”!

However, even if I do know what a mainframe is, when I looked at the input files I was having to work with, I was like… “oh my gosh, what is this stuff?!?!”

It took me a while to understand that what I was seeing wasn’t a set of standard ASCII files but a set of weird EBCDIC files. EBCDIC is an eight-bit character encoding used on IBM mainframes, or to put it another way… “it’s 1960’s technology, baby”!

It’s something so old that its standard was designed to be “punched card friendly”. What? You don’t know what a punched card is? Lucky you, kiddo… check this out.

However… when I looked at these EBCDIC-encoded files I felt lost… “why am I still doing this work?” I kept repeating. And then, since I’m a Pythonista, without even noticing it, I started my console and I tried …

$ pip install ebcdic
Collecting ebcdic
  Downloading ebcdic-1.1.1-py2.py3-none-any.whl (128 kB)
     |████████████████████████████████| 128 kB 1.9 MB/s
Installing collected packages: ebcdic
Successfully installed ebcdic-1.1.1

Wait… what? No error? Are you telling me that someone wrote an EBCDIC package to handle this stuff?

GOD BLESS PYTHON! (and Thomas Aglassinger who wrote this useful library)

Once you have installed the library, import it (importing ebcdic is what registers the extra codecs) and then use the .encode() method on a string and the .decode() method on bytes, specifying one of the supported codecs:

* cp290 - Japan (Katakana)
* cp420 - Arabic bilingual
* cp424 - Israel (Hebrew)
* cp833 - Korea Extended (single byte)
* cp838 - Thailand
* cp870 - Eastern Europe (Poland, Hungary, Czech, Slovakia, Slovenia, Croatian, Serbia, Bulgarian); represents Latin-2
* cp1097 - Iran (Farsi)
* cp1140 - Australia, Brazil, Canada, New Zealand, Portugal, South Africa, USA
* cp1141 - Austria, Germany, Switzerland
* cp1142 - Denmark, Norway
* cp1143 - Finland, Sweden
* cp1144 - Italy
* cp1145 - Latin America, Spain
* cp1146 - Great Britain, Ireland, North Ireland
* cp1147 - France
* cp1148 - International
* cp1148ms - International, Microsoft interpretation; similar to cp1148 except that 0x15 is mapped to 0x85 (“next line”) instead of 0x0a (“linefeed”)
* cp1149 - Iceland
* cp037 - Australia, Brazil, Canada, New Zealand, Portugal, South Africa; similar to cp1140 but without Euro sign
* cp273 - Austria, Germany, Switzerland; similar to cp1141 but without Euro sign
* cp277 - Denmark, Norway; similar to cp1142 but without Euro sign
* cp278 - Finland, Sweden; similar to cp1143 but without Euro sign
* cp280 - Italy; similar to cp1144 but without Euro sign
* cp284 - Latin America, Spain; similar to cp1145 but without Euro sign
* cp285 - Great Britain, Ireland, North Ireland; similar to cp1146 but without Euro sign
* cp297 - France; similar to cp1147 but without Euro sign
* cp500 - International; similar to cp1148 but without Euro sign
* cp500ms - International, Microsoft interpretation; identical to codecs.cp500, similar to ebcdic.cp500 except that 0x15 is mapped to 0x85 (“next line”) instead of 0x0a (“linefeed”)
* cp871 - Iceland; similar to cp1149 but without Euro sign
* cp875 - Greece; similar to cp9067 but without Euro sign and a few other characters
* cp1025 - Cyrillic
* cp1047 - Open Systems (MVS C compiler)
* cp1112 - Estonia, Latvia, Lithuania (Baltic)
* cp1122 - Estonia; similar to cp1157 but without Euro sign
* cp1123 - Ukraine; similar to cp1158 but without Euro sign

So, in my case, when I needed to encode a string, I had to use this piece of code to get a sequence of encoded EBCDIC bytes:

>>> import ebcdic
>>> my_string = "is this fun or useless?"
>>> my_encoded_string = my_string.encode('cp1141')
>>> print(my_encoded_string)
b'\x89\xa2@\xa3\x88\x89\xa2@\x86\xa4\x95@\x96\x99@\xa4\xa2\x85\x93\x85\xa2\xa2o'

… and this piece of code when I had something to be decoded:

>>> print(my_encoded_string.decode('cp1141'))
is this fun or useless?
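As a side note, a handful of EBCDIC code pages, such as cp037 and cp500, ship with Python’s standard library, so a quick round trip is possible even without the extra package. This is just a sketch with the same sample string; for the national variants listed above (like cp1141) you still need the ebcdic library:

```python
text = "is this fun or useless?"

# cp037 and cp500 are EBCDIC codecs bundled with Python itself.
for codec in ("cp037", "cp500"):
    encoded = text.encode(codec)
    print(codec, encoded)
    print(encoded.decode(codec) == text)  # True: the round trip is lossless

# The same bytes decoded with an ASCII-compatible codec are unreadable,
# which is what an EBCDIC file looks like when opened as ordinary text.
ebcdic_bytes = text.encode("cp037")
print(repr(ebcdic_bytes.decode("latin-1")))

round_trip = ebcdic_bytes.decode("cp037")
```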

Super easy, uh? Keep up the great work, kiddo(s), and happy Pythoning! =)

D.

Representing geographic data in Python - feat. Coronavirus

2020-03-01 23:18:49 +0000

As you may know, I live in Italy, a beautiful country made famous by style, fashion, and food. But in recent days, we’ve also become famous for something a little less cool: the Coronavirus.

At the end of February, in fact, we became overnight the country with the third-highest number of infections in the world, after China and South Korea. And Milan (the city I live in) is one of the most affected Italian cities.

As this disease started to spread, everyone in town became obsessed with the contagion map, an interactive map that lets you track the disease around the world. And I was obsessed with this map as well: not because of the contagion, not because of the coronavirus, but because I wanted to know… how to create a map like this in Python!

It turns out that it’s easier than you think with the right tools.

The first tool we will need is Folium. Folium is a package that makes it super easy to build a map like that with the data you want. You just need to create the map object and place your data on the map by specifying latitude, longitude and the data you need.

For this example, I will use Jupyter Notebooks. According to the official site:

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

It’s somewhat similar to a playground for Python, which also supports some markdown and other useful stuff. Once you have developed a notebook, there are a lot of sites that will host it almost for free. One of the most famous is Google Colab.

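So let’s recap:

  * Folium, to draw the interactive map
  * a Jupyter Notebook, to write and run the code
  * Google Colab, to host the notebook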

Am I missing something to create my interactive map? Oh, sure… data! I need some data to plot. And when you need data a good place to start is kaggle.

Kaggle is a site that allows you to download a lot of datasets for any kind of use. Weather, Virus, Hotel booking … there’s plenty of topics to choose from and there are data for any kind of use. Brief research on Kaggle brings me to this dataset from Vignesh Coumarane. I don’t know if the data are accurate or not… but who cares, I don’t need precise data, I just need something to create an interactive map, right?

Now, to use Kaggle data you have to register (for free) to Kaggle and create an API Token. From the official docs of Kaggle:

To use the Kaggle’s public API, you must first authenticate using an API token. From the site header, click on your user profile picture, then on “My Account” from the dropdown menu. This will take you to your account settings at https://www.kaggle.com/account. Scroll down to the section of the page labeled API: To create a new token, click on the “Create New API Token” button. This will download a fresh authentication token onto your machine.

Is everything clear? So, let’s start!

Let’s go to Google Colab and create a new Jupyter Notebook. For the first block we have to import some stuff:

# import some stuff
import folium
import pandas as pd
import os
import json

then we have to set up two variables that will keep our Kaggle username and our Kaggle API KEY:

# set kaggle username and API token
USERNAME="YOUR USERNAME GOES HERE"
KEY="YOUR KEY GOES HERE"

then we will need to create a JSON file with this information under ~/.kaggle/kaggle.json

# pack everything into a json file
!mkdir -p ~/.kaggle
!mkdir -p /content/.kaggle

token = {"username":USERNAME,"key":KEY} 
with open('/content/.kaggle/kaggle.json', 'w') as file:
    json.dump(token, file)

!cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json
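If you prefer to stay in Python instead of mixing in shell commands, the same step can be sketched as below. This is just a minimal sketch: the write_kaggle_token name and its directory parameter are made up for this example, and restricting the file to owner-only permissions is my own assumption, based on the warning the Kaggle client prints when the token is readable by other users.

```python
import json
import os
from pathlib import Path


def write_kaggle_token(username, key, directory=None):
    """Write kaggle.json into `directory` (default: ~/.kaggle) and return its path.

    Hypothetical helper for this post, not part of the Kaggle API.
    """
    target_dir = Path(directory) if directory else Path.home() / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    token_path = target_dir / "kaggle.json"
    with open(token_path, "w") as file:
        json.dump({"username": username, "key": key}, file)
    # Owner read/write only (this has limited effect on Windows).
    os.chmod(token_path, 0o600)
    return token_path
```

In the notebook you would then call write_kaggle_token(USERNAME, KEY) once, before authenticating.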

Ok. We’re ready to authenticate against Kaggle now

# let's authenticate on kaggle
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

Ok, now that we’re in, we need to download the dataset. We can still use the Kaggle API for that:

# download the data we need
dataset_name="vignesh1694/covid19-coronavirus"
filename="time_series_19-covid-Confirmed.csv"
api.dataset_download_file(dataset_name, filename)

and now that we have downloaded the file, we can read this CSV and plot the data on the map. Let’s start by creating a pandas DataFrame with the data we need:

# create a pandas DataFrame with the data
df=pd.read_csv(filename)

Cool. Now if you examine the dataset you will see that the “Province/State” field is populated just for Chinese regions. Let’s create a new “name” field that contains the “Province/State” when indicated and the “Country/Region” field otherwise.

# transform your dataset to coalesce the Province/State and the Country/Region
df['name']=df['Province/State'].mask(pd.isnull, df['Country/Region'])

Now the cool stuff. Let’s create a map with folium:

# create an empty map
map = folium.Map(zoom_start=1.5,width=1000,height=750,location=[0,0], tiles = 'Stamen Toner')

And now start looping over your data frame to add all the rows with confirmed cases to your map. As you can see, you just need to create a CircleMarker object specifying the location, the radius of the point and the color:

# loop over your data to populate the map
for row in df.itertuples():
    lat=getattr(row, "Lat")
    long=getattr(row, "Long")
    confirmed=int(row[-2])
    name=getattr(row, "name")
    tooltip = f"{name} - {confirmed}"
    radius = 30 if confirmed/10>30 else confirmed/10

    if confirmed>0:
        folium.vector_layers.CircleMarker(
            location=(lat, long),
            radius=radius,
            tooltip=tooltip,
            color='red',
            fill_color='red'
        ).add_to(map)
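A quick note on the radius expression used in the loop: it gives one pixel of radius for every 10 confirmed cases and caps the result at 30, so that the biggest outbreaks don't cover the whole map. As a small sketch, the same rule can be written with min(). The marker_radius name and its scale and cap parameters are my own and are not part of Folium:

```python
def marker_radius(confirmed, scale=10, cap=30):
    """Return the circle radius for a number of confirmed cases.

    Hypothetical helper: one unit of radius every `scale` cases, never above `cap`.
    """
    return min(confirmed / scale, cap)


# A few sample values to see the cap kick in
for cases in (5, 150, 888):
    print(cases, marker_radius(cases))
```

Keeping the scale and the cap as parameters makes it easy to retune the map when the numbers grow.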

and now your map is ready to be shown:

# output the map
map

Et voilà:

map

Oh my God… 888 cases right now in Italy! … it’s time to go now!

Bye D.

Serialization in Python with JSON

2020-02-19 23:00:49 +0000


In 2016 I wrote a post about serialization in Python by using the pickle Python module.

In this article, we will try to serialize Python objects by using another module: json.

According to Wikipedia “JSON is an open-standard file format or data interchange format that uses human-readable text to transmit data objects consisting of attribute-value pairs and array data types (or any other serializable value)”.

But why should you use JSON instead of the official pickle module? Well, it depends on what you have to do… JSON is a safer format, it’s human-readable, and it’s a standard adopted by a lot of languages out there (so it’s the best choice if you need a language-independent format). pickle, on the other hand, is faster and allows you to serialize even custom-defined objects.

So, there isn’t a silver bullet if you want to serialize and deserialize objects, you have to choose which module to use depending on your specific use case.

Now, before starting I need to tell you something: do not expect me to explain what serialization means. If you don’t know that, start by reading my previous article on this topic.

However, if you just need a hint, I could also quote my 2016 self:

To serialize an object means to transform it in a format that can be stored, to be able to deserialize it later, recreating the original object from the serialized format. To do all these operations we will use the pickle module.

Pretty straightforward, uh? And now… it’s JSON time!

I still remember the first time I met JSON… what a wonderful time! Someone told me that the JS part of the JSON word means “JavaScript”, and so I decided to retire to a Buddhist temple for the rest of my life.


Then I had an enlightenment and understood that I had better decipher the whole acronym… and I found out that JSON stands for JavaScript Object Notation.

So the JavaScript part is just about the “Object Notation” token? Why did they decide to use the word JavaScript for something not related to that $)%!(£& little language? Who knows…

However, I decided to dig a little bit more and this is how a JSON object appears to be:

{"widget": {
    "debug": "on",
    "window": {
        "title": "Sample Konfabulator Widget",
        "name": "main_window",
        "width": 500,
        "height": 500
    },
    "image": { 
        "src": "Images/Sun.png",
        "name": "sun1",
        "hOffset": 250,
        "vOffset": 250,
        "alignment": "center"
    },
    "text": {
        "data": "Click Here",
        "size": 36,
        "style": "bold",
        "name": "text1",
        "hOffset": 250,
        "vOffset": 100,
        "alignment": "center",
        "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;"
    }
}}

Yes, it’s just like a Python dictionary! Isn’t that great?

So how can we serialize a Python object in a JSON format?

Quite easy: you just need to import the json module and then use the dumps() and loads() functions.

Let’s see an example of how this works:

# json1.py
import json

my_list = ["this","is","a","simple","list",35]
my_json_object = json.dumps(my_list)
print(my_json_object)

my_second_list = json.loads(my_json_object)
print(my_second_list)

Really easy, isn’t it?
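One thing worth knowing is that the round trip does not always give back exactly the same Python types, because JSON has fewer types than Python. Here is a small sketch (the sample values are just made up for illustration):

```python
import json

# A tuple is written as a JSON array, so it comes back as a list.
print(json.loads(json.dumps((1, 2, 3))))

# JSON object keys are always strings, so integer keys come back as str.
print(json.loads(json.dumps({1: "one", 2: "two"})))

# A list of strings and numbers survives the round trip unchanged.
original = ["this", "is", "a", "simple", "list", 35]
print(json.loads(json.dumps(original)) == original)
```

So if your code relies on tuples or on non-string dictionary keys, you have to convert them back yourself after loading.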

It’s also possible to use the dump() and load() functions to serialize and deserialize objects. In this case, instead of strings, you will directly use files, like this:

# json2.py
import json

my_list = ["this","is","a","simple","list",35]
with open("my_file", "w") as my_file:
    json.dump(my_list, my_file)  # dump() writes to the file and returns None

with open("my_file", "r") as my_file_read:
    my_second_list = json.load(my_file_read)

print(my_second_list)

In this last example, with the first dump() method you create a file named my_file with the JSON result of the serialization process and with the next load() method you read the previously created file and deserialize the JSON to a Python object.

Super easy, isn’t it?
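If the file has to be read by humans too, dump() accepts a couple of layout options. The following is only a sketch: the settings dictionary and the settings.json file name are invented for this example, and I'm writing to a temporary folder so that nothing is left behind on disk.

```python
import json
import os
import tempfile

settings = {"title": "Sample Konfabulator Widget", "width": 500, "height": 500}

# Work in a temporary folder that is deleted at the end of the block.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "settings.json")
    with open(path, "w") as out_file:
        # indent and sort_keys only change the layout, not the data
        json.dump(settings, out_file, indent=2, sort_keys=True)

    with open(path) as in_file:
        raw_text = in_file.read()

    with open(path) as in_file:
        restored = json.load(in_file)

print(raw_text)
print(restored == settings)
```

The indented file is bigger than the compact one, so this is a choice for configuration files rather than for large data dumps.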

Please remember that not every Python object can be serialized to JSON. Out of the box, this process works just with the following types: dict, list, tuple, str, int, float, bool (True and False), and None.

However, it’s possible to serialize and deserialize even custom objects with just a little bit of extra magic… do you want to know how? Stay tuned for the next article! ;)
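In the meantime, a quick way to check which values the json module accepts out of the box is simply to try them. This is a small sketch with sample values I picked myself; anything that is not supported makes dumps() raise a TypeError:

```python
import json

samples = {
    "dict": {"a": 1},
    "list": [1, 2],
    "tuple": (1, 2),
    "str": "text",
    "int": 35,
    "float": 3.5,
    "bool": True,
    "None": None,
    "set": {1, 2},      # not supported by default
    "bytes": b"raw",    # not supported by default
    "complex": 1 + 2j,  # not supported by default
}

supported = {}
for label, value in samples.items():
    try:
        json.dumps(value)
        supported[label] = True
    except TypeError:
        supported[label] = False

print(supported)
```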

Happy Pythoning D.

Creating command-line interfaces in Python with Argparse

2020-02-11 23:00:49 +0000


If you have ever written a script in Python you surely know the importance of having a good command-line interface for your application.

And this is why today I’m posting here a link to another post I wrote for Real Python on this topic some months ago.

Real Python is one of the most important resources on the web about Python and I’m really proud to be part of that community full of brilliant people that share a common passion for this incredible programming language.

If you want to read this article just visit this link

Happy Pythoning! D.

The Detailed Guide on Sending Emails from your Python App

2019-10-21 22:00:45 +0000


Hey there! You are reading a quick but detailed guide on adding an essential piece of functionality to your Python web app: email sending. In this post, you will learn about the capabilities of the native Python modules for email sending and then get the practical steps for creating a message with images and attachments. With plenty of code examples, you will be able to craft and send your own emails using an SMTP server.

Before we start

Just a bit of theory and a couple of notes before we move on to coding. In Python, there is an email package designed for handling email messages. We will explain how to use its main modules and components. It’s simple but comprehensive, so you don’t need any additional libraries, at least to start with.

This guide was created and tested with Python version 3.7.2.

How to configure email sending

First of all, we need to import the necessary modules. The main one for sending emails is smtplib. From the very beginning, help(smtplib) is indeed very helpful: it will provide you with the list of all available classes and arguments as well as confirm whether smtplib was properly imported:

import smtplib
help(smtplib)

Define SMTP server

Before we can move to the first code sample, we should define an SMTP server. We strongly recommend starting with a testing setup and switching to the production server only when everything is set up and tested.

Python provides an smtpd module for testing your emails in the local environment. It has a DebuggingServer feature that discards your sent messages and prints them to stdout.

Set your SMTP server to localhost:1025

python -m smtpd -n -c DebuggingServer localhost:1025
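With the debugging server running in one terminal, a script like the following could be used from another one to check that messages arrive. Treat it as a sketch: the two helper names and the addresses are placeholders we made up for this example, and it uses the EmailMessage class from the standard email package to build the message.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, receiver):
    """Create a small plain-text message (the addresses are placeholders)."""
    message = EmailMessage()
    message["Subject"] = "Local test"
    message["From"] = sender
    message["To"] = receiver
    message.set_content("Hello from the local debugging server test.")
    return message


def send_to_debug_server(message, host="localhost", port=1025):
    """Send `message` to the local server; no login or TLS is needed there."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(message)


# Usage, with the debugging server running in another terminal:
# send_to_debug_server(build_test_message("my@example.com", "your@example.com"))
```

The message should then be printed in the terminal where the debugging server is running, and nothing is actually delivered.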

With local debugging, you can check whether your code works and detect possible problems. Nevertheless, you won’t be able to preview your email template and verify whether it renders as designed. For this purpose, we would advise you to use a dedicated testing tool.

Sending emails via Gmail or another external SMTP server

To send an email via any SMTP server, you have to know the hostname and port as well as get your username and password.

The difference in sending emails via Gmail is that you need to grant access for your applications. You can do it in two ways: allowing less secure apps (2-step verification should be disabled) or using the OAuth2 authorization protocol. The latter is more secure.

Gmail’s SMTP server is smtp.gmail.com, and port 465 is used for connections over SSL:

import smtplib, ssl
port = 465  
password = input("your password")
context = ssl.create_default_context()
with smtplib.SMTP_SSL("smtp.gmail.com", port, context=context) as server:
    server.login("my@gmail.com", password)
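Gmail also accepts connections on port 587, where the session starts unencrypted and is then upgraded with STARTTLS. Below is a minimal sketch of that variant. The open_gmail_starttls helper and the two constants are names we invented for this example, and the function is only defined here, not called, because it needs real credentials and network access.

```python
import smtplib
import ssl

GMAIL_HOST = "smtp.gmail.com"
GMAIL_STARTTLS_PORT = 587


def open_gmail_starttls(username, password):
    """Return a logged-in SMTP connection that was upgraded with STARTTLS."""
    context = ssl.create_default_context()
    server = smtplib.SMTP(GMAIL_HOST, GMAIL_STARTTLS_PORT)
    server.starttls(context=context)  # switch the plain connection to TLS
    server.login(username, password)
    return server


# Usage (needs real credentials):
# from getpass import getpass
# server = open_gmail_starttls("my@gmail.com", getpass("Password: "))
# ... send your messages ...
# server.quit()
```

Reading the password with getpass() instead of input() keeps it from being echoed on the screen.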

Alternatively, you can try Yagmail, a dedicated Gmail/SMTP client that simplifies email sending with Gmail:

import yagmail
yag = yagmail.SMTP()
contents = [
    "This is the body, and here is just text http://somedomain/image.png",
    "You can find an audio file attached.", '/local/path/to/song.mp3'
]
yag.send('to@someone.com', 'subject', contents)

Finally, let’s review the whole example. We will use some external SMTP server:

import smtplib
port = 2525
smtp_server = "smtp.yourserver.com"
login = "1a2b3c4d5e6f7g" # paste your login 
password = "1a2b3c4d5e6f7g" # paste your password 
# specify the sender's and receiver's email addresses
sender = "my@example.com"
# make sure you are not sending test emails to real email addresses 
receiver = "your@example.com"
# type your message: leave a blank line between the headers and the message body, and use an f-string to insert variables in the text automatically
message = f"""\
Subject: Hi there
To: {receiver}
From: {sender}

This is my first message with Python."""
# send your message with the credentials specified above
with smtplib.SMTP(smtp_server, port) as server:
    server.login(login, password)
    server.sendmail(sender, receiver, message)
print('Sent')
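The blank line between the headers and the body is what tells a mail parser where the headers end. A quick way to see this without sending anything is to parse the raw text with the standard email package, as in this small sketch (the addresses are the placeholders used above):

```python
from email import message_from_string

raw_message = """\
Subject: Hi there
To: your@example.com
From: my@example.com

This is my first message with Python."""

# Everything before the first blank line is parsed as headers,
# everything after it becomes the body.
parsed = message_from_string(raw_message)
print(parsed["Subject"])
print(parsed["To"])
print(parsed.get_payload())
```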

Sending personalized emails to multiple recipients

Python lets you send multiple emails with dynamic content with almost no extra effort, with the help of loops. Make a database in .csv format and save it in the same folder as your Python script (the script below expects it to be named contacts.csv). The simplest example is a table with two columns, name and email address, as follows:

#name,email
John Johnson,john@johnson.com
Peter Peterson,peter@peterson.com

The file will be opened with the script and its rows will be looped over line by line. In this case, the {name} will be replaced with the value from the “name” column:

import csv, smtplib
port = 2525 
smtp_server = "smtp.yourserver.com"
login = "1a2b3c4d5e6f7g" # paste your login 
password = "1a2b3c4d5e6f7g" # paste your password 
message = """Subject: Order confirmation
To: {recipient}
From: {sender}

Hi {name}, thanks for your order! We are processing it now and will contact you soon"""
sender = "new@example.com"

with smtplib.SMTP(smtp_server, port) as server:
    server.login(login, password)
    with open("contacts.csv") as file:
        reader = csv.reader(file)
        next(reader)  # it skips the header row
        for name, email in reader:
            server.sendmail(
              sender,
              email,
              message.format(name=name, recipient=email, sender=sender)
            )
            print(f'Sent to {name}')

As a result, you should receive the following response:

Sent to John Johnson
Sent to Peter Peterson
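Before sending to real people, it can be handy to render the personalized texts without touching an SMTP server at all. The sketch below does that with an in-memory copy of the sample CSV; the render_messages helper and the TEMPLATE and SAMPLE_CSV names are ours, introduced only for this illustration.

```python
import csv
import io

TEMPLATE = """Subject: Order confirmation
To: {recipient}
From: {sender}

Hi {name}, thanks for your order! We are processing it now and will contact you soon"""

# In-memory stand-in for contacts.csv, so nothing is read from disk.
SAMPLE_CSV = """#name,email
John Johnson,john@johnson.com
Peter Peterson,peter@peterson.com
"""


def render_messages(csv_file, sender):
    """Return a list of (email, message_text) pairs, one per CSV row."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    return [
        (email, TEMPLATE.format(name=name, recipient=email, sender=sender))
        for name, email in reader
    ]


messages = render_messages(io.StringIO(SAMPLE_CSV), "new@example.com")
for email, text in messages:
    print(email)
    print(text.splitlines()[-1])
```

Separating the rendering from the sending also makes it easier to spot a broken placeholder before any email leaves your machine.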

Let’s add HTML content

We have examined how the email sending works. Now it’s time to create email templates containing images and file attachments.

In Python, this can be done with the email.mime package, which handles the MIME message types. Write a plain-text version alongside the HTML one, and then merge them with a MIMEMultipart("alternative") instance.

import smtplib
from email.mime.text import MIMEText 
from email.mime.multipart import MIMEMultipart 
port = 2525 
smtp_server = "smtp.yourserver.com" 
login = "1a2b3c4d5e6f7g" # paste your login 
password = "1a2b3c4d5e6f7g" # paste your password 
sender_email = "sender@example.com" 
receiver_email = "new@example.com" 
message = MIMEMultipart("alternative") 
message["Subject"] = "multipart test" 
message["From"] = sender_email 
message["To"] = receiver_email 

# write the plain text part
text = """\
Hi,
Check out the new post on our blog:
How to Send Emails with Python?
https://blog.example.com/send-email-python/
Feel free to let us know what content would be useful for you!"""

# write the HTML part
html = """\
<html>
  <body>
    <p>Hi,<br>
       Check out the new post on our blog:</p>
    <p><a href="https://blog.example.com/send-email-python/">How to Send Emails with Python?</a></p>
    <p>Feel free to <strong>let us</strong> know what content would be useful for you!</p>
  </body>
</html>
"""

# convert both parts to MIMEText objects and add them to the MIMEMultipart message
part1 = MIMEText(text, "plain")
part2 = MIMEText(html, "html")
message.attach(part1)
message.attach(part2)

# send your email
with smtplib.SMTP("smtp.yourserver.com", 2525) as server:
    server.login(login, password)
    server.sendmail(sender_email, receiver_email, message.as_string())
print('Sent')
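As a side note, the standard library also offers the EmailMessage class, which can build the same kind of two-version message with fewer moving parts. This is only a sketch of that alternative, not the approach used in the rest of this guide, and the texts are shortened placeholders:

```python
from email.message import EmailMessage

message = EmailMessage()
message["Subject"] = "multipart test"
message["From"] = "sender@example.com"
message["To"] = "new@example.com"

# The plain-text body comes first...
message.set_content("Hi,\nCheck out the new post on our blog.")
# ...and the HTML version is added as an alternative to it.
message.add_alternative(
    "<html><body><p>Hi,<br>Check out the <strong>new post</strong> on our blog.</p></body></html>",
    subtype="html",
)

# Inspect the structure that was built
print(message.get_content_type())
for part in message.iter_parts():
    print(part.get_content_type())
```

Such a message can be passed to server.send_message(message) instead of calling sendmail() with message.as_string().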

How to attach files in Python

In Python, email attachments are treated as MIME objects. But first, you need to encode them in base64, which the example below does with the email.encoders module.

You can attach images, text and audio, as well as applications. Each of the file types should be defined by the corresponding email class, for example email.mime.image.MIMEImage or email.mime.audio.MIMEAudio. For details, follow this section of the Python documentation.

Example of attaching a PDF file:

import smtplib
# import the corresponding modules
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

port = 2525 
smtp_server = "smtp.yourserver.com"
login = "1a2b3c4d5e6f7g" # paste your login 
password = "1a2b3c4d5e6f7g" # paste your password 
subject = "An example of boarding pass"
sender_email = "sender@example.com"
receiver_email = "new@example.com"

message = MIMEMultipart()
message["From"] = sender_email
message["To"] = receiver_email
message["Subject"] = subject

# Add body to email
body = "This is an example of how you can send a boarding pass in attachment with Python"
message.attach(MIMEText(body, "plain"))
filename = "yourBP.pdf"

# Open PDF file in binary mode
# We assume that the file is in the directory where you run your Python script from
with open(filename, "rb") as attachment:
    # The content type "application/octet-stream" means that a MIME attachment is a binary file
    part = MIMEBase("application", "octet-stream")
    part.set_payload(attachment.read())

    # Encode to base64
    encoders.encode_base64(part)

    # Add header 
    part.add_header(
        "Content-Disposition",
        f"attachment; filename= {filename}",
    )

    # Add attachment to your message and convert it to string
    message.attach(part)
    text = message.as_string()

    # send your email
    with smtplib.SMTP("smtp.yourserver.com", 2525) as server:
        server.login(login, password)
        server.sendmail(
            sender_email, receiver_email, text
        )

    print('Sent')

Call the message.attach() method several times to add several attachments.
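To illustrate the idea of several attachments, here is a sketch based on the EmailMessage class, whose add_attachment() method takes care of the base64 encoding. The attach_bytes helper is a name we made up, and the attached bytes are placeholders rather than real files:

```python
import mimetypes
from email.message import EmailMessage


def attach_bytes(message, data, filename):
    """Attach `data` to `message`, guessing the MIME type from the file name."""
    guessed_type, _ = mimetypes.guess_type(filename)
    # Fall back to a generic binary type when the extension is unknown.
    maintype, subtype = (guessed_type or "application/octet-stream").split("/", 1)
    message.add_attachment(data, maintype=maintype, subtype=subtype, filename=filename)


message = EmailMessage()
message["Subject"] = "An example of boarding pass"
message["From"] = "sender@example.com"
message["To"] = "new@example.com"
message.set_content("Two files are attached.")

# Placeholder bytes stand in for real files read with open(path, "rb").
attach_bytes(message, b"%PDF-1.4 placeholder", "yourBP.pdf")
attach_bytes(message, b"plain notes", "notes.bin")

for part in message.iter_attachments():
    print(part.get_filename(), part.get_content_type())
```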

Embed an image

There are three common ways to include an image in an email message: base64 image (inline embedding), CID attachment (embedded as a MIME object), and linked image.

In the example below we will experiment with inline embedding.

For this purpose, we will use the base64 module:

# import the necessary components first
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
import base64

port = 2525
smtp_server = "smtp.yourserver.com"
login = "1a2b3c4d5e6f7g" # paste your login 
password = "1a2b3c4d5e6f7g" # paste your password 
sender_email = "sender@example.com"
receiver_email = "new@example.com"
message = MIMEMultipart("alternative")
message["Subject"] = "inline embedding"
message["From"] = sender_email
message["To"] = receiver_email

# The image file is in the same directory that you run your Python script from
with open("illustration.jpg", "rb") as image_file:
    encoded = base64.b64encode(image_file.read()).decode()
html = f"""\
<html>
 <body>
   <img src="data:image/jpeg;base64,{encoded}">
 </body>
</html>
"""

part = MIMEText(html, "html")
message.attach(part)

# send your email
with smtplib.SMTP("smtp.yourserver.com", 2525) as server:
    server.login(login, password)
    server.sendmail(
        sender_email, receiver_email, message.as_string()
    )
print('Sent')
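For comparison, the CID attachment approach mentioned above could look roughly like this. It is a sketch built on the EmailMessage class and follows the pattern shown in the Python documentation examples; the image bytes are a placeholder, and the example.com domain passed to make_msgid() is an assumption made to keep the generated ID predictable.

```python
from email.message import EmailMessage
from email.utils import make_msgid

message = EmailMessage()
message["Subject"] = "CID embedding"
message["From"] = "sender@example.com"
message["To"] = "new@example.com"
message.set_content("This message contains an embedded image.")

# make_msgid() returns an ID wrapped in <>, while the HTML needs it without them.
image_cid = make_msgid(domain="example.com")
message.add_alternative(
    f'<html><body><img src="cid:{image_cid[1:-1]}"></body></html>',
    subtype="html",
)

# Placeholder bytes; in practice read them with open("illustration.jpg", "rb").
image_bytes = b"\xff\xd8\xff placeholder, not a real JPEG"
html_part = message.get_payload()[1]
html_part.add_related(image_bytes, "image", "jpeg", cid=image_cid)

# The HTML part is now a container holding the HTML and the image.
print(message.get_payload()[1].get_content_type())
```

With this layout the image travels as a separate MIME part, and the HTML only refers to it through the cid: URL.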

That’s it!

Useful resources for sending emails with Python

Python offers a wide set of capabilities for email sending. In this article, we went through the main steps. To go further, you can refer to the Python documentation and also try additional libraries such as Flask Mail or Marrow Mailer.

Here you will find a really awesome list of Python resources sorted by their functionality.

This article was originally published on Mailtrap’s blog: Sending emails with Python.