Writing a FUSE filesystem in Python

2017-02-27 23:11:19 +0000


We ran into a problem last week. Our web application produces a lot of documents that are accessed frequently for a couple of months after they’re created. After less than a year, however, these documents are almost never accessed anymore, but we need to keep them available for the web application and for tons of other legacy apps that might need them.

Now, these documents take up a lot of space on our expensive but super fast storage system (let’s call it the primary storage system, or PSS, from now on), and we would like to move them to the cheaper, slower storage system (which we’re going to call the secondary storage system, or SSS) once we believe they will no longer be accessed.

Our idea was to move the older files to the SSS and to modify all the software that accesses the storage so that it looks at the PSS first and, if nothing is found there, falls back to the SSS. This approach, however, meant modifying every single client application we had…

“There are no problems, only opportunities” — I.R.

So, wouldn’t it be great if we could create a virtual filesystem to map both the PSS and the SSS into a single directory?

And that’s what we’re gonna do today.

From the client software perspective, everything will remain unchanged, but under the hood all our read and write operations will be forwarded to the correct storage system.

Please note: I’m not saying that this is the best possible solution for this specific problem. There are probably better ways to address it but… we have to talk about Python, don’t we?

What we’ll need

To start this project we just need to satisfy a couple of prerequisites: Python and an operating system that supports FUSE.

I assume that you already have Python (if not… what are you doing here?), and as for the OS, keep in mind that this article is based on FUSE.

According to Wikipedia, FUSE is

a software interface for Unix-like computer operating systems that lets non-privileged users create their own file systems without editing kernel code. This is achieved by running file system code in user space while the FUSE module provides only a “bridge” to the actual kernel interfaces.

FUSE is available for Linux, FreeBSD, OpenBSD, NetBSD (as puffs), OpenSolaris, Minix 3, Android and macOS.

So, if you use macOS you need to download and install FUSE; if you use Linux, keep in mind that FUSE was merged into the mainline Linux kernel in version 2.6.14, originally released on October 27th, 2005, so every recent version of Linux already includes it.

If you use Windows… well… I mean… I’m sorry buddy, but you didn’t satisfy the second prerequisite…

The fusepy module

First of all, to communicate with the FUSE module from Python you need to install the fusepy module. This module is just a simple interface to FUSE and MacFUSE, nothing more, so go ahead and install it using pip:

pip install fusepy

Let’s start

There’s a great starting point for building our filesystem: the passthrough example by Stavros Korokithakis. What Stavros made is available on his GitHub repo and I will reproduce it here:

#!/usr/bin/env python

from __future__ import with_statement

import os
import sys
import errno

from fuse import FUSE, FuseOSError, Operations


class Passthrough(Operations):
    def __init__(self, root):
        self.root = root

    # Helpers
    # =======

    def _full_path(self, partial):
        if partial.startswith("/"):
            partial = partial[1:]
        path = os.path.join(self.root, partial)
        return path

    # Filesystem methods
    # ==================

    def access(self, path, mode):
        full_path = self._full_path(path)
        if not os.access(full_path, mode):
            raise FuseOSError(errno.EACCES)

    def chmod(self, path, mode):
        full_path = self._full_path(path)
        return os.chmod(full_path, mode)

    def chown(self, path, uid, gid):
        full_path = self._full_path(path)
        return os.chown(full_path, uid, gid)

    def getattr(self, path, fh=None):
        full_path = self._full_path(path)
        st = os.lstat(full_path)
        return dict((key, getattr(st, key)) for key in ('st_atime', 'st_ctime',
                     'st_gid', 'st_mode', 'st_mtime', 'st_nlink', 'st_size', 'st_uid'))

    def readdir(self, path, fh):
        full_path = self._full_path(path)

        dirents = ['.', '..']
        if os.path.isdir(full_path):
            dirents.extend(os.listdir(full_path))
        for r in dirents:
            yield r

    def readlink(self, path):
        pathname = os.readlink(self._full_path(path))
        if pathname.startswith("/"):
            # Path name is absolute, sanitize it.
            return os.path.relpath(pathname, self.root)
        else:
            return pathname

    def mknod(self, path, mode, dev):
        return os.mknod(self._full_path(path), mode, dev)

    def rmdir(self, path):
        full_path = self._full_path(path)
        return os.rmdir(full_path)

    def mkdir(self, path, mode):
        return os.mkdir(self._full_path(path), mode)

    def statfs(self, path):
        full_path = self._full_path(path)
        stv = os.statvfs(full_path)
        return dict((key, getattr(stv, key)) for key in ('f_bavail', 'f_bfree',
            'f_blocks', 'f_bsize', 'f_favail', 'f_ffree', 'f_files', 'f_flag',
            'f_frsize', 'f_namemax'))

    def unlink(self, path):
        return os.unlink(self._full_path(path))

    def symlink(self, name, target):
        return os.symlink(name, self._full_path(target))

    def rename(self, old, new):
        return os.rename(self._full_path(old), self._full_path(new))

    def link(self, target, name):
        return os.link(self._full_path(target), self._full_path(name))

    def utimens(self, path, times=None):
        return os.utime(self._full_path(path), times)

    # File methods
    # ============

    def open(self, path, flags):
        full_path = self._full_path(path)
        return os.open(full_path, flags)

    def create(self, path, mode, fi=None):
        full_path = self._full_path(path)
        return os.open(full_path, os.O_WRONLY | os.O_CREAT, mode)

    def read(self, path, length, offset, fh):
        os.lseek(fh, offset, os.SEEK_SET)
        return os.read(fh, length)

    def write(self, path, buf, offset, fh):
        os.lseek(fh, offset, os.SEEK_SET)
        return os.write(fh, buf)

    def truncate(self, path, length, fh=None):
        full_path = self._full_path(path)
        with open(full_path, 'r+') as f:
            f.truncate(length)

    def flush(self, path, fh):
        return os.fsync(fh)

    def release(self, path, fh):
        return os.close(fh)

    def fsync(self, path, fdatasync, fh):
        return self.flush(path, fh)


def main(mountpoint, root):
    FUSE(Passthrough(root), mountpoint, nothreads=True, foreground=True)

if __name__ == '__main__':
    main(sys.argv[2], sys.argv[1])

Take a minute to analyze Stavros’ code. It implements a “passthrough filesystem” that simply mounts a directory onto a mount point: every operation requested on the mount point is forwarded to the corresponding file of the mounted directory, using the matching Python function from the os module.

So, to try this code just save this file as Passthrough.py and run

python Passthrough.py [directoryToBeMounted] [directoryToBeUsedAsMountpoint]

That’s it! Now your brand new filesystem is mounted on the directory you specified in the [directoryToBeUsedAsMountpoint] parameter, and every operation you perform on this mount point will be silently forwarded to the directory you specified in the [directoryToBeMounted] parameter.

Really cool, even if a little bit useless so far… :)

So, how can we implement the filesystem described at the beginning? Thanks to Stavros, our job is quite simple: we just need to create a class that inherits from his base class and override some methods.

The first method we have to override is the _full_path method. In the original code, this method takes a path relative to the mount point and translates it into the real path on the mounted directory. In our filesystem this will be the most difficult piece of code, because we need to add some logic to decide whether the requested path belongs to the PSS or to the SSS. Even this “most difficult piece of code”, however, is quite trivial.

We just need to check whether the requested path exists in at least one of the storage systems. If it does, we return the real path; if not, we assume that the path is being requested for a write operation on a file that does not exist yet, so we check in which storage system the directory part of the path exists and return the corresponding path.

A look at the code will make things more clear:

    def _full_path(self, partial, useFallBack=False):
        if partial.startswith("/"):
            partial = partial[1:]

        # Find out the real path. If the fallback path has been
        # requested, use it
        path = primaryPath = os.path.join(
            self.fallbackPath if useFallBack else self.root, partial)

        # If the path does not exist and we haven't been asked for the
        # fallback path, try to look on the fallback filesystem
        if not os.path.exists(primaryPath) and not useFallBack:
            path = fallbackPath = os.path.join(self.fallbackPath, partial)

            # If the path does not exist in the fallback filesystem
            # either, it's likely to be a write operation, so use the
            # primary filesystem... unless the directory of the path
            # exists in the fallback FS!
            if not os.path.exists(fallbackPath):
                # This is probably a write operation, so prefer the
                # primary path if the directory of the path exists in
                # the primary FS or does not exist in the fallback FS

                primaryDir = os.path.dirname(primaryPath)
                fallbackDir = os.path.dirname(fallbackPath)

                if os.path.exists(primaryDir) or not os.path.exists(fallbackDir):
                    path = primaryPath

        return path

With this done, we have almost finished. If we’re on a Linux system, we also have to override the “getattr” function so that it returns the ‘st_blocks’ attribute as well (it turned out that without this attribute the “du” shell command doesn’t work as expected).

So, we just need to override this method and return the extra attribute:

    def getattr(self, path, fh=None):
        full_path = self._full_path(path)
        st = os.lstat(full_path)
        return dict((key, getattr(st, key)) for key in ('st_atime', 'st_ctime',
                                                        'st_gid', 'st_mode', 'st_mtime', 'st_nlink', 'st_size', 'st_uid', 'st_blocks'))

And then we need to override the “readdir” function, the generator function that is called when someone runs an “ls” on our mount point. In our case, the “ls” command has to list the content of both our primary storage system and our secondary storage system.

    def readdir(self, path, fh):
        dirents = ['.', '..']
        full_path = self._full_path(path)
        # print("listing " + full_path)
        if os.path.isdir(full_path):
            dirents.extend(os.listdir(full_path))
        if self.fallbackPath not in full_path:
            full_path = self._full_path(path, useFallBack=True)
            # print("listing_ext " + full_path)
            if os.path.isdir(full_path):
                dirents.extend(os.listdir(full_path))
        for r in list(set(dirents)):
            yield r

We’ve almost finished: we just need to modify the “main” function, because we need an extra parameter (in the original code we had one directory to be mounted and one directory to be used as a mount point; in our filesystem we have two directories to be merged into the mount point).

So here is the full code of our new filesystem “dfs” (the “Dave File System” :D )

#!/usr/bin/env python

import os
import sys
import errno

from fuse import FUSE, FuseOSError, Operations
from Passthrough import Passthrough

class dfs(Passthrough):
    def __init__(self, root, fallbackPath):
        self.root = root
        self.fallbackPath = fallbackPath
        
    # Helpers
    # =======
    def _full_path(self, partial, useFallBack=False):
        if partial.startswith("/"):
            partial = partial[1:]
        # Find out the real path. If the fallback path has been
        # requested, use it
        path = primaryPath = os.path.join(
            self.fallbackPath if useFallBack else self.root, partial)
        # If the path does not exist and we haven't been asked for the
        # fallback path, try to look on the fallback filesystem
        if not os.path.exists(primaryPath) and not useFallBack:
            path = fallbackPath = os.path.join(self.fallbackPath, partial)
            # If the path does not exist in the fallback filesystem
            # either, it's likely to be a write operation, so use the
            # primary filesystem... unless the directory of the path
            # exists in the fallback FS!
            if not os.path.exists(fallbackPath):
                # This is probably a write operation, so prefer the
                # primary path if the directory of the path exists in
                # the primary FS or does not exist in the fallback FS
                primaryDir = os.path.dirname(primaryPath)
                fallbackDir = os.path.dirname(fallbackPath)
                if os.path.exists(primaryDir) or not os.path.exists(fallbackDir):
                    path = primaryPath
        return path
      
    def getattr(self, path, fh=None):
        full_path = self._full_path(path)
        st = os.lstat(full_path)
        return dict((key, getattr(st, key)) for key in ('st_atime', 'st_ctime',
                                                        'st_gid', 'st_mode', 'st_mtime', 'st_nlink', 'st_size', 'st_uid', 'st_blocks')) 

    def readdir(self, path, fh):
        dirents = ['.', '..']
        full_path = self._full_path(path)
        # print("listing " + full_path)
        if os.path.isdir(full_path):
            dirents.extend(os.listdir(full_path))
        if self.fallbackPath not in full_path:
            full_path = self._full_path(path, useFallBack=True)
            # print("listing_ext " + full_path)
            if os.path.isdir(full_path):
                dirents.extend(os.listdir(full_path))
        for r in list(set(dirents)):
            yield r
            
def main(mountpoint, root, fallbackPath):
    FUSE(dfs(root, fallbackPath), mountpoint, nothreads=True,
         foreground=True, **{'allow_other': True})

if __name__ == '__main__':
    mountpoint = sys.argv[3]
    root = sys.argv[1]
    fallbackPath = sys.argv[2]
    main(mountpoint, root, fallbackPath)

That’s it, now if we issue the command …

python dfs.py /home/dave/Desktop/PrimaryFS/ /home/dave/Desktop/FallbackFS/ /home/dave/Desktop/myMountpoint/

… we get a mount point (/home/dave/Desktop/myMountpoint/) that lists both the content of /home/dave/Desktop/PrimaryFS/ and /home/dave/Desktop/FallbackFS/ and that works as expected.
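As a quick sanity check (assuming the mount succeeded; the file names below are made up for the example), a file created in either storage system shows up in the mount point:

$ touch /home/dave/Desktop/PrimaryFS/new-document.txt
$ touch /home/dave/Desktop/FallbackFS/old-document.txt
$ ls /home/dave/Desktop/myMountpoint/
new-document.txt  old-document.txt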

Yes, it was THAT easy!

A couple of notes

It’s worth noting that:

That’s all folks, now stop reading and start to develop your first filesystem with Python! :)

D.

Working with Exception in Python

2017-01-10 18:02:01 +0000


According to the official documentation, an exception is an error detected during execution that is not unconditionally fatal. Let’s start the interpreter and write:

>>> 5/0
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    5/0
ZeroDivisionError: division by zero

As you can see, we asked the interpreter to divide the number 5 by 0. Even though our request was syntactically correct, when the interpreter tried to compute it, it “raised” a ZeroDivisionError exception to signal that we asked for something impossible. There are a lot of built-in exceptions in the standard library to handle different kinds of errors (system errors, value errors, I/O errors, arithmetic errors etc…), and knowing how to handle these errors is very important for every Python developer.

Handling exceptions

Handling an exception means defining what to do when a specific exception happens, so that the execution can proceed smoothly.

The basic way to handle an exception is by using the try statement. Basically, we need to specify what we want to do, which kinds of exceptions we expect, and what to do when one of these exceptions is raised.

As an example, let’s say we want to open a file, read it and show its content. To accomplish this task we can write this script:

f = open("myfile.txt")

for line in f:
    print(line)

The problem is: what happens if the file myfile.txt does not exist? Let’s try…

Traceback (most recent call last):
  File "exceptions.py", line 1, in <module>
    f = open("myfile.txt")
FileNotFoundError: [Errno 2] No such file or directory: 'myfile.txt'

Well, we get a FileNotFoundError and the execution stops. To handle this exception, we can modify the code as follows:

try:
    f = open("myfile.txt")

    for line in f:
        print(line)
except FileNotFoundError:
    print("The file does not exist")

In this way, the interpreter tries to execute what’s inside the try block and, if a FileNotFoundError exception is raised, instead of printing the exception’s details on screen and exiting, it continues with what’s inside the except block. If the file existed, the exception would not be raised and the except block would be skipped.

Now, executing the script again, the result would be:

The file does not exist

It’s worth noting that we can use more than one except clause for a single try block. For example, the code could be modified as follows:

try:
    f = open("myfile.txt")
    for line in f:
        print(line)
except FileNotFoundError:
    print("The file does not exist")
except PermissionError:
    print("You don't have the permission to open the file")
except Exception:
    print("Unexpected error occured")

In this last example, we are trapping the exception raised when the file does not exist, the exception raised when the file exists but the user does not have permission to read it, and any other error that could happen at runtime, by catching the general Exception exception. This is made possible by the fact that in Python everything is an object, even exceptions. This means that almost all the exceptions that can be raised at runtime actually derive from the Exception class.
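Since exceptions are classes, an except clause also catches the subclasses of the exception it names. For instance, looking at the hierarchy reported below, FileNotFoundError and PermissionError are both subclasses of OSError, so (just as a sketch) the two specific handlers above could be collapsed into a single one:

try:
    f = open("myfile.txt")
    for line in f:
        print(line)
except OSError as ex:
    # catches FileNotFoundError, PermissionError and any other OSError subclass
    print("Could not read the file:", ex)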

Here’s the complete hierarchy as it appears in the official docs:

BaseException
 +-- SystemExit
 +-- KeyboardInterrupt
 +-- GeneratorExit
 +-- Exception
      +-- StopIteration
      +-- StopAsyncIteration
      +-- ArithmeticError
      |    +-- FloatingPointError
      |    +-- OverflowError
      |    +-- ZeroDivisionError
      +-- AssertionError
      +-- AttributeError
      +-- BufferError
      +-- EOFError
      +-- ImportError
           +-- ModuleNotFoundError
      +-- LookupError
      |    +-- IndexError
      |    +-- KeyError
      +-- MemoryError
      +-- NameError
      |    +-- UnboundLocalError
      +-- OSError
      |    +-- BlockingIOError
      |    +-- ChildProcessError
      |    +-- ConnectionError
      |    |    +-- BrokenPipeError
      |    |    +-- ConnectionAbortedError
      |    |    +-- ConnectionRefusedError
      |    |    +-- ConnectionResetError
      |    +-- FileExistsError
      |    +-- FileNotFoundError
      |    +-- InterruptedError
      |    +-- IsADirectoryError
      |    +-- NotADirectoryError
      |    +-- PermissionError
      |    +-- ProcessLookupError
      |    +-- TimeoutError
      +-- ReferenceError
      +-- RuntimeError
      |    +-- NotImplementedError
      |    +-- RecursionError
      +-- SyntaxError
      |    +-- IndentationError
      |         +-- TabError
      +-- SystemError
      +-- TypeError
      +-- ValueError
      |    +-- UnicodeError
      |         +-- UnicodeDecodeError
      |         +-- UnicodeEncodeError
      |         +-- UnicodeTranslateError
      +-- Warning
           +-- DeprecationWarning
           +-- PendingDeprecationWarning
           +-- RuntimeWarning
           +-- SyntaxWarning
           +-- UserWarning
           +-- FutureWarning
           +-- ImportWarning
           +-- UnicodeWarning
           +-- BytesWarning
           +-- ResourceWarning

Finally, you may need to execute some code after the try block, whether or not the code in the try block has raised an exception. In this case, you can add a finally clause. For example:

try:
    f = open("myfile.txt")
    for line in f:
        print(line)
except FileNotFoundError:
    print("The file does not exist")
except PermissionError:
    print("You don't have the permission to open the file")
except Exception:
    print("Unexpected error occured")
finally:
    print("The execution will now be terminated")

In this last example, whatever happens, the message “The execution will now be terminated” will be shown before leaving the try/except block.

Raising exceptions

Now that we know what an exception is and how to handle exceptions, let’s see how we can raise exceptions ourselves. Look at this code:

def get_numeric_value_from_keyboard():
    '''Get a value from the keyboard; if the value is not a valid number, raise a "ValueError" exception'''
    input_value = input("Please, enter an integer: ")
    if not input_value.isdigit():
        raise ValueError("The value inserted is not a number")

    return input_value

while True:
    try:
        numeric_value = get_numeric_value_from_keyboard()
        print("You have inserted the value " + str(numeric_value))
        break
    except ValueError as ex:
        print(ex)

In this example, we have created a function that gets input from the user’s keyboard. If the input is numeric, it just returns the input to the caller; if it’s not, it raises a “ValueError” exception. Note that in this example we’re not just raising a ValueError exception, we are also specifying a custom message for the exception. In the except clause, we catch the exception, assign it to the ex variable and then use ex to print the message for the user.

Another possibility you have is to re-raise an exception once it gets caught in an except block. For example, try to modify the code as follows:

def get_numeric_value_from_keyboard():
    '''Get a value from keyboard, if the value is not a valid number, raise a "ValueError" exception'''
    input_value = input("Please, enter an integer: ")
    if not input_value.isdigit():
        raise ValueError("The value inserted is not a number")

    return input_value

while True:
    try:
        numeric_value = get_numeric_value_from_keyboard()
        print("You have inserted the value " + str(numeric_value))
        break
    except ValueError as ex:
        print("Something strange happened...")
        raise

Now, if you run this code and insert a non-numeric value, the execution will be interrupted and you will get this message:

Please, enter an integer: asd
Something strange happened...
Traceback (most recent call last):
  File "exceptions.py", line 12, in <module>
    numeric_value = get_numeric_value_from_keyboard()
  File "exceptions.py", line 5, in get_numeric_value_from_keyboard
    raise ValueError("The value inserted is not a number")
ValueError: The value inserted is not a number

As you can see, in the except block we’ve caught the ValueError exception, we’ve done something (printed the “Something strange happened…” message) and then we’ve re-raised the same exception we had just caught. Obviously, since there were no other blocks to catch our exception, the program exited and the exception was written to the console.

Defining custom exceptions

Knowing how to raise an exception is especially important if we want to build our own custom exceptions. We’ve already said that there’s a hierarchy of exceptions, so to create a custom exception we just need to create a new class that inherits from the Exception class.

So, let’s say that we are coding the software for an ATM: we could need a special “WithdrawLimitError” exception to be raised when the user asks for more money than allowed. In this case, we can create our custom exception like this:

class WithdrawLimitError(Exception):
    pass

Now, we can use it in our code just like any other exception.
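For example, here is a minimal sketch of how it could be used (the withdraw function and the 500 limit are made up for the example):

def withdraw(amount):
    # made-up business rule, just for the example
    if amount > 500:
        raise WithdrawLimitError("You can't withdraw more than 500")
    print("Here are your {0} bucks".format(amount))

try:
    withdraw(1000)
except WithdrawLimitError as ex:
    print(ex)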

The bottom line

In several programming languages developers are told to use exceptions only to handle actual “errors”, because handling an exception can lead to performance issues. Well, the Python approach is completely different. Python’s internals rely on exceptions (for example, in a simple “for” loop the StopIteration exception is used to signal that there are no further items to iterate over), and using an exception to indicate a failure is encouraged, even when failures are expected on a regular basis.

So, for example, if you have to open a file don’t check whether it exists or not: just open it and handle the exception if something goes wrong. It makes the code more readable, more Pythonic, and easier to maintain.
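In other words, prefer the “Easier to Ask Forgiveness than Permission” style over “Look Before You Leap”. A quick sketch of the two styles:

import os

# Look Before You Leap (not the Python way): the check and the open are
# separate operations, so the file could disappear between the two (race condition)
if os.path.exists("myfile.txt"):
    f = open("myfile.txt")

# Easier to Ask Forgiveness than Permission (the Python way)
try:
    f = open("myfile.txt")
except FileNotFoundError:
    print("The file does not exist")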

Enjoy! D.

Python Metaclasses

2016-12-22 11:31:01 +0000


Working with Python means working with objects because, in Python, everything is an object. So, for example:

>>> type(1)
<class 'int'>
>>> type('x')
<class 'str'>

As you can see, even basic types like integers and strings are objects; in particular, they are instances of the int and str classes respectively. So, since everything is an object and an object is an instance of a class… what is a class?

Let’s check it:

>>> type(int)
<class 'type'>
>>> type(str)
<class 'type'>

It turns out that classes are objects too: specifically, they are instances of the “type” class or, better, of the “type” metaclass.

A metaclass is the class of a class, and using metaclasses can be convenient for some specific tasks like logging, profiling and more.

So, let’s start by demonstrating that a class is just an instance of a metaclass. We’ve said that type is the base metaclass and that by instantiating it we can create classes, so… let’s try it:

>>> my_class = type("Foo", (), {"bar": 1})
>>> print(my_class)
<class '__main__.Foo'>

Here you can see that we have created a class named “Foo” just by instantiating the metaclass type. The parameters we have passed are: the name of the class ("Foo"), a tuple of the base classes to inherit from (empty in this case) and a dictionary with the class attributes ({"bar": 1}).

If everything is clear so far, we can try to create and use a custom metaclass. To define a custom metaclass it’s enough to subclass the type class.

Look at this example:

class Logging_Meta(type):
    def __new__(cls, name, bases, attrs, **kwargs):
        print(str.format("Allocating memory space for class {0} ", cls))
        return super().__new__(cls, name, bases, attrs, **kwargs)

    def __init__(self, name, bases, attrs, **kwargs):
        print(str.format("Initializing object {0}", self))
        return super().__init__(name, bases, attrs)

class foo(metaclass=Logging_Meta):
    pass

foo_instance = foo()
print(foo_instance)
print(type(foo))

On my PC, this code prints:

Allocating memory space for class <class '__main__.Logging_Meta'>
Initializing object <class '__main__.foo'>
<__main__.foo object at 0x000000B54ACC0B00>
<class '__main__.Logging_Meta'>

In this example we have defined a metaclass called Logging_Meta and, using the magic methods __new__ and __init__, we have redefined the behavior of the class when the class object is created and initialized. Then we’ve declared a foo class specifying Logging_Meta as its metaclass and, as you can see, the class behavior changes according to the Logging_Meta implementation.
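Another classic use-case (just a sketch of mine, the names are made up) is a metaclass that automatically registers every class created with it, which can come in handy for plugin systems:

class Registry_Meta(type):
    registry = {}

    def __new__(cls, name, bases, attrs, **kwargs):
        new_class = super().__new__(cls, name, bases, attrs, **kwargs)
        # every class created with this metaclass ends up in the registry
        cls.registry[name] = new_class
        return new_class

class plugin_a(metaclass=Registry_Meta):
    pass

class plugin_b(metaclass=Registry_Meta):
    pass

# prints something like:
# {'plugin_a': <class '__main__.plugin_a'>, 'plugin_b': <class '__main__.plugin_b'>}
print(Registry_Meta.registry)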

A concrete use-case: Abstract Base classes (ABC’s)

A concrete use of metaclasses is the abc module, a standard library module that provides the infrastructure for defining abstract base classes. Using abc you can ensure that a derived class that inherits from an abstract base class implements all the abstract methods of the superclass, with the check happening when the derived class is instantiated.

For example:

from abc import ABCMeta, abstractmethod

class my_base_class(metaclass=ABCMeta):
    @abstractmethod
    def foo(self):
        pass

class my_derived_class(my_base_class):
    pass

a_class = my_derived_class()

If you try this example, you will see that the last line (the one that tries to instantiate the derived class) will raise the following exception:

TypeError: Can't instantiate abstract class my_derived_class with abstract methods foo

That’s because my_derived_class does not implement the method foo as requested from the abstract base class.

It’s worth saying that if you subclass a base class that uses a specific metaclass, your new class will use that metaclass as well. In fact, since Python 3.4 the abc module also provides the ABC class, a generic class that uses the ABCMeta metaclass. This means that the last example can be rewritten as follows:

from abc import ABC, abstractmethod

class my_base_class(ABC):
    @abstractmethod
    def foo(self):
        pass

class my_derived_class(my_base_class):
    pass

a_class = my_derived_class()

This was just a brief introduction to metaclasses in Python. It’s more or less what I think should be known about this topic, because it can lead to a better understanding of some of Python’s internals.

But let’s be clear: this is not something every single Python user needs to know in order to start working in Python. As the “Python Guru” Tim Peters once said:

“Metaclasses are deeper magic than 99% of users should ever worry about. If you wonder whether you need them, you don’t (the people who actually need them know with certainty that they need them, and don’t need an explanation about why).” — Tim Peters

Enjoy! D.

Syntax sugar in Python 3.6

2016-12-12 23:00:49 +0000


On December the 8th, Guido van Rossum (also known as the BDFL of the Python project) announced on his Twitter account that Python 3.6 rc1 had been officially released. That means that if no major problems are found with this latest version, the final release is just around the corner: it’s scheduled for December the 16th, carrying among other things some improvements to the Python syntax.

Let’s have a look at this new “syntax sugar”!

Formatted string literals

Formatted string literals are one of my favorite features ever! Until today, if you wanted to create a string with some variable value inside, you could do something like:

>>> name = "Dave"
>>> print ("Hi " + name)

or:

>>> name = "Dave"
>>> print ("Hi %s" % name)

or again:

>>> name = "Dave"
>>> print(str.format("Hi {0}", name))

We’re not going to describe the pros and cons of these methods here, but the shiny new method to accomplish this task is:

>>> name = "Dave"
>>> print (f"Hi {name}")

Pretty cool, uh? And f-strings work great also with other variable types, like numbers. For example:

>>> number = 10/3
>>> print(f"And the number is: {number:5.3}")

Note that in this last example we have also formatted the number specifying the width (5) and the precision (3).
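One more detail worth knowing: inside the curly braces you can put not just a variable name, but any Python expression, which is evaluated at runtime:

>>> name = "Dave"
>>> print(f"{name.upper()} has {len(name)} letters")
DAVE has 4 letters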

Underscores in Numeric Literals

This feature is better seen than explained: what’s the value of the variable “big_number” after the following assignment?

>>> big_number = 1000000000

If you, like me, need to count the zeroes to say that big_number is a billion, you will probably be happy to know that from now on this assignment can be written like this:

>>> big_number = 1_000_000_000
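The underscores are allowed in any kind of numeric literal, even after a base prefix, and the same PEP (PEP 515) also added _ as a thousands separator option in format specifications:

>>> flags = 0b_0101_1111
>>> address = 0x_FF_FF_00_00
>>> f"{1000000:_}"
'1_000_000'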

Syntax for variable annotations

If you need to annotate the type of a variable now you can use this special syntax:

variable: type

this means that you can write something like

>>> some_list: [int] = []

this doesn’t do anything more than before; Python is a dynamically typed language and will remain so. But if you read some third-party code, this notation lets you know that some_list is not just a list: it’s intended to be a list of integers.

Please note: I say it’s intended to be just because nothing prevents you from assigning some_list any other kind of value! This new variable annotation is just meant to avoid writing something like:

>>> some_list = [] # this is a list of integers
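These annotations are not enforced, but they are stored, so you (or your tools) can inspect them at runtime through the __annotations__ dictionary. A quick check in the interpreter:

>>> some_list: [int] = []
>>> __annotations__
{'some_list': [<class 'int'>]}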

Asynchronous generators

Asynchronous generators have been awaited since Python 3.5, which introduced the async/await syntax. In fact, one of the limitations of that release was that it was impossible to use the yield and the await statements in the same function body. This restriction has now been lifted and, in the documentation, there’s quite an interesting example that shows a function called “ticker” that generates the numbers from 0 to a given limit, one every X seconds (both the limit and the delay are passed as parameters).

import asyncio

async def ticker(delay, to):
    """Yield numbers from 0 to *to* every *delay* seconds."""
    for i in range(to):
        yield i
        await asyncio.sleep(delay)

async def main():
    async for x in ticker(1,10):
        print(x)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

Asynchronous comprehensions

If I asked you to modify the previous example so as to create a list using the generator, you would probably modify the code this way:

import asyncio

async def ticker(delay, to):
    """Yield numbers from 0 to *to* every *delay* seconds."""
    for i in range(to):
        yield i
        await asyncio.sleep(delay)

async def main():
    mylist = []
    async for x in  ticker(1,10):
        mylist.append(x)

    print (mylist)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

But wouldn’t it be great if we could use list comprehensions like we normally do? Now we can:

import asyncio

async def ticker(delay, to):
    """Yield numbers from 0 to *to* every *delay* seconds."""
    for i in range(to):
        yield i
        await asyncio.sleep(delay)

async def main():

    mylist = [x async for x in ticker(1, 10)]
    print (mylist)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

That’s all for now, happy coding in Python 3.6!

Object Serialization in Python With the Pickle Module

2016-12-05 23:00:49 +0000


DISCLAIMER: There’s a newer (and probably better) article about this topic that I wrote for the Real Python website.

It’s longer and more detailed, and it also has a section about the security concerns of using the pickle module, so… I want to be honest and suggest you read that article… but if you like it, don’t forget to come back here and buy me a coffee! :)


Today we’re going to explore a wonderful feature that the Python library offers out of the box: serialization. To serialize an object means to transform it into a format that can be stored, so as to be able to deserialize it later, recreating the original object from the serialized format. To do all these operations we will use the pickle module.

Pickling

Pickling is the name of the serialization process in Python. By pickling, we can convert an object hierarchy into a binary format (usually not human readable) that can be stored. To pickle an object we just need to import the pickle module and call the dumps() function, passing the object to be pickled as a parameter.

For example:

import pickle

class Animal:
    def __init__(self, number_of_paws, color):
        self.number_of_paws = number_of_paws
        self.color = color

class Sheep(Animal):
    def __init__(self, color):
        Animal.__init__(self, 4, color)

mary = Sheep("white")

print (str.format("My sheep mary is {0} and has {1} paws", mary.color, mary.number_of_paws))
my_pickled_mary = pickle.dumps(mary)

print ("Would you like to see her pickled? Here she is!")
print (my_pickled_mary)

So, in the example above we have created an instance of the Sheep class and then pickled it, transforming our sheep instance into a simple array of bytes. It was easy, wasn’t it?

Now we can easily store this byte array in a binary file or in a database field, and restore it from our storage at a later time to transform this bunch of bytes back into an object hierarchy.

Note that if you want to create a file with a pickled object, you can use the dump() function (instead of dumps()), passing also an open binary file; the pickling result will be written to the file automatically.

To do so, the previous example could be changed like this:

import pickle

class Animal:
    def __init__(self, number_of_paws, color):
        self.number_of_paws = number_of_paws
        self.color = color

class Sheep(Animal):
    def __init__(self, color):
        Animal.__init__(self, 4, color)

mary = Sheep("white")

print (str.format("My sheep mary is {0} and has {1} paws", mary.color, mary.number_of_paws))

binary_file = open('my_pickled_mary.bin', mode='wb')
pickle.dump(mary, binary_file)
binary_file.close()

Unpickling

The process that takes a binary array and converts it back into an object hierarchy is called unpickling.

The unpickling process is done by using the loads() function of the pickle module (or load() if you’re reading from a file), which returns a complete object hierarchy from a simple array of bytes. Let’s try to use the loads() function on the example above:

import pickle

class Animal:
    def __init__(self, number_of_paws, color):
        self.number_of_paws = number_of_paws
        self.color = color

class Sheep(Animal):
    def __init__(self, color):
        Animal.__init__(self, 4, color)

# Step 1: Let's create the sheep Mary
mary = Sheep("white")

# Step 2: Let's pickle Mary
my_pickled_mary = pickle.dumps(mary)

# Step 3: Now, let's unpickle our sheep Mary, creating another instance, another sheep... Dolly!
dolly = pickle.loads(my_pickled_mary)

# Dolly and Mary are two different objects; in fact, if we specify another color for Dolly
# there are no consequences for Mary
dolly.color = "black"

print (str.format("Dolly is {0} ", dolly.color))
print (str.format("Mary is {0} ", mary.color))

In this example you can see that, after having pickled the first sheep object (Mary), we have unpickled it into another variable (Dolly), and so we have, in a sense, cloned Mary to create Dolly (yes, we’re cloning sheep… lol!).

It goes without saying that changing an attribute value on one of these objects leaves the other one untouched, because we haven’t just copied a reference to the original object: we have actually cloned the original object and its state to create a perfect copy in a completely different instance.

Note: in this example we have cloned an object using the trick of pickling it and unpickling the resulting binary stream into another variable. This is OK, and there are several languages where this approach could even be advised, but if you need to clone an object in Python it’s probably better to use the copy module of the standard library. Since it’s designed to clone objects, it works far better.
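For reference, the same cloning done with the copy module would look like this (a minimal sketch, reusing the mary object from the example above):

import copy

dolly = copy.deepcopy(mary)   # deep copy: nested objects are cloned too
dolly.color = "black"         # mary.color is still "white"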

Some notes about pickling

All I’ve said so far is just to whet your appetite, because there’s a lot more that could be said about pickling. One important thing to know is that there are several versions (or protocols) of pickling, because this technique has evolved along with the language.

So, there are currently 5 pickling protocols, numbered from 0 to 4.

According to the official documentation:

Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.

Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.

Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.

Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.

Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.
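If you need a specific protocol you can pass it explicitly to dumps() or dump(); the pickle module also exposes the highest protocol available as a constant. A quick sketch:

import pickle

data = {"answer": 42}
legacy = pickle.dumps(data, protocol=2)                        # readable by Python 2.3 and later
latest = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)  # newest, most efficient format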

Another important thing to keep in mind is that not every object is picklable. Some objects (like DB connections, handles to open files, etc…) can’t be pickled, and if you try to pickle an unpicklable object (or to unpickle something that is not a valid pickle), a pickle.PickleError exception or one of its subclasses (PicklingError and UnpicklingError) is raised.

For example:

import pickle

my_custom_pickle = bytes("this is unpicklable", encoding="UTF-8")

# this next statement will raise a _pickle.UnpicklingError
my_new_object = pickle.loads(my_custom_pickle)

The problem with having an unpicklable object in the hierarchy of the object you want to pickle is that it prevents you from serializing (and storing) the entire object. Fortunately, Python offers two convenient methods to specify what you want to pickle and how to re-initialize (during the unpickling process) the state that you haven’t pickled. These methods are __getstate__() and __setstate__().

For example:

import pickle

class my_zen_class:
    number_of_meditations = 0

    def __init__(self, name):
        self.number_of_meditations = 0
        self.name = name

    def meditate(self):
        self.number_of_meditations = self.number_of_meditations + 1

    def __getstate__(self):
        # this method is called when you are
        # going to pickle the class, to know what to pickle

        state = self.__dict__.copy()

        # You will never get the Buddha state if you count
        # meditations, so
        # don't pickle this counter, the next time you will just
        # start meditating from scratch :)
        del state['number_of_meditations']

        return state

    def __setstate__(self, state):
        # this method is called when you are going to
        # unpickle the class,
        # if you need some initialization after the
        # unpickling you can do it here.

        self.__dict__.update(state)

# I start meditating
my_zen_object = my_zen_class("Dave")
for i in range(100):
    my_zen_object.meditate()

# Now I pickle my meditation experience
print(str.format("I'm {0}, and I've meditated {1} times", my_zen_object.name, my_zen_object.number_of_meditations))
my_pickled_zen_object = pickle.dumps(my_zen_object)
my_zen_object = None

# Now I get my meditation experience back
my_new_zen_object = pickle.loads(my_pickled_zen_object)

# As expected, the number_of_meditations property
# has not been restored because it hasn't been pickled
print(str.format("I'm {0}, and I don't have a beginner mind yet because I've meditated only {1} times", my_new_zen_object.name, my_new_zen_object.number_of_meditations))

The security concern

Now you know what it means to serialize and deserialize objects in Python, but… have you ever thought about what it means from a security perspective? If not, go ahead and have a look at this article I wrote for “Real Python”. It has a lot of details and a specific section about the security topic.

Enjoy! D.

Using Virtual Environments in Python

2016-11-28 23:00:49 +0000


Working with Python is like having superpowers. When you need to do something and you don’t know where to start, you can google your problem and usually you find out that someone has already had the same problem and has created a library for the community. Whatever your need is…

“there’s a lib for that!”

This is so deep inside the Python culture that if you start your Python interpreter and do a…

>>> import antigravity

… you are redirected to this xkcd comic.

Ok, so Python encourages the usage of third-party libraries, but using these libraries means you will soon have to deal with the versioning problem.

Let’s say that your super new project xxx needs version 2.5 of the library yyy, but you can’t install it because your old project zzz needs version 1.7 of the same library. Moreover, the old project zzz uses Python 2.7 while the new one has been written for Python 3.5.

What are you supposed to do to sort this mess out? The Pythonic answer to this common problem is virtual environments.

Using virtual environments you can give each project you’re working on a completely separate environment, installing different libraries and dependencies for each one and, why not, even a different Python interpreter! And all this while leaving the global site-packages directory of your PC free from any third-party package.

It sounds cool, uh?

Let’s see how to do this.

virtualenv, virtualenvwrapper, pyvenv, python3 -m venv… Are you kidding me?

There are several ways to create virtual environments, and I think this is the main reason why beginners often don’t use them: it’s quite common to get lost on this topic. In this article, we’ll try to make the subject a little bit clearer.

virtualenv

The first tool you can use to create virtual environments is Ian Bicking’s virtualenv. It has been around for a long time and allows you to create virtual environments for both Python 2.7 and Python 3.x. When you create a virtual environment, it lets you specify which version of Python to use and it automatically installs the pip utility in the created environment, so that you can just start to pip install whatever you need. To install virtualenv just use pip:

$ pip install virtualenv

Once you’ve done this, you can create a virtualenv simply typing:

$ virtualenv NameOfYourVirtualEnvironment

where NameOfYourVirtualEnvironment is the name of the virtual environment you’re going to create. This command will create, in the current directory, a subdirectory named NameOfYourVirtualEnvironment containing all the stuff you need: the Python interpreter and the pip utility.

Once you have created a virtual environment, you can start using it by executing the activate script that you will find in the NameOfYourVirtualEnvironment/bin directory (if you are using Windows it is named activate.bat and lives in the NameOfYourVirtualEnvironment\Scripts directory).

So, to activate your new virtual environment simply type:

$ source ./NameOfYourVirtualEnvironment/bin/activate

your prompt will change and you will see the name of your virtual environment inside parentheses at the beginning of the command line, meaning that you have activated the virtual environment correctly. Now, let’s try to install a package to see what happens:

(NameOfYourVirtualEnvironment) $ pip install pytyler

Perfect: if everything went OK, you have now installed the package pytyler only in your virtual environment. If you’re skeptical, try to start the Python interpreter and import the module:

(NameOfYourVirtualEnvironment) $ python -c "import tyler"

and look, no import errors! :)

To exit your virtual environment you just need to issue the deactivate command:

(NameOfYourVirtualEnvironment) $ deactivate

doing this, you will see your command prompt change again, back to the standard prompt, meaning that you are no longer inside your virtual environment. Now, try to import the module tyler again:

$ python -c "import tyler"

and you will get:

ImportError: No module named tyler

This is the expected behavior and proves that you have installed the pytyler package ONLY into the virtual environment, so it is not available system-wide.

Ok, if everything is clear so far let’s take a step forward. As I said before, when you create a virtual environment the virtualenv utility puts inside it the default Python interpreter installed on your computer: the one that starts when you just type python on the command line, which is probably /usr/bin/python (type which python if you want to be sure). This usually means Python 2.7 (at least on my Debian Jessie machine). But what if you want to use another version of Python? Well, this is quite easy actually… try to write:

$ virtualenv -p /usr/bin/python3 AnotherVirtualEnvironmentName

and you will create another virtual environment named AnotherVirtualEnvironmentName that uses Python 3. Deleting a virtual environment is as easy as removing its directory, so to destroy the virtual environment you have just created type:

$ rm -rf NameOfYourVirtualEnvironment

virtualenvwrapper

virtualenvwrapper is just a wrapper around virtualenv that makes working with virtual environments even easier (yes, it’s possible). Let’s start with the installation of virtualenvwrapper (if you use Windows the package name is virtualenvwrapper-win):

$ pip install virtualenvwrapper

Please note that virtualenv is a dependency of virtualenvwrapper, so if you don’t have virtualenv installed yet this command will install it for you before installing virtualenvwrapper. Now, once you’ve installed the wrapper, you just need to execute the virtualenvwrapper.sh script every time you want to use it, so let’s put it into your .bashrc file:

$ cd
$ echo source /usr/local/bin/virtualenvwrapper.sh >> .bashrc

Now, quit the current terminal and reopen it to execute the script and start working with virtualenvwrapper. Here are the basic commands you can use:

$ mkvirtualenv NameOfTheVirtualEnv    # create a virtual environment
$ workon NameOfTheVirtualEnv          # activate a virtual environment
$ rmvirtualenv NameOfTheVirtualEnv    # delete a virtual environment
$ lsvirtualenv                        # list all your virtual environments

One of the features I like most about this script is that all the environments you create with mkvirtualenv end up under the ~/.virtualenvs directory, so… no more mess on the filesystem! If you want to change the directory where the virtual environments are stored, you just need to put the directory of your choice in the WORKON_HOME bash variable.
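For example, to keep your environments in a directory of your choice, export the variable before sourcing the wrapper script (the path here is just an example):

$ export WORKON_HOME=~/my-virtual-envs
$ source /usr/local/bin/virtualenvwrapper.sh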

the venv module

Now that you know almost everything about virtual environments, you can easily create, activate and destroy them. So… why should you need to know anything about venv? The answer is in PEP 405: venv is a Python module very similar to virtualenv, but it comes by default as part of the standard Python library since the release of Python 3.3, which dates back to September the 29th of 2012. It’s basically virtualenv done the right way: being part of the standard distribution of Python, it can use some Python internals that virtualenv couldn’t.

So, while virtualenv tries to trick the system with some hacks to make everything work, venv doesn’t. Moreover, venv being part of the Python distribution means that you don’t need to install anything to start using it: if you use a recent version of Python, it’s already there and works out of the box. The only drawback is that venv is not available for Python versions prior to 3.3, so if you work on a project written in Python 2.7, for example, you can’t use venv and you’re stuck with virtualenv.

At its first release, the venv module was used through a script named pyvenv, but according to the official documentation:

The pyvenv script has been deprecated as of Python 3.6 in favor of using python3 -m venv to help prevent any potential confusion as to which Python interpreter a virtual environment will be based on.

So, since it has been deprecated, forget the pyvenv script and just use the module as suggested. To create a virtual environment with the venv module just type:

$ python3 -m venv NameOfTheVirtualEnv

To activate the virtual environment use the activate script in the ./bin subfolder of the created virtual environment directory and, as always, to delete the virtual environment simply get rid of its directory.
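So the whole life cycle with the venv module looks like this:

$ python3 -m venv NameOfTheVirtualEnv
$ source ./NameOfTheVirtualEnv/bin/activate
(NameOfTheVirtualEnv) $ deactivate
$ rm -rf NameOfTheVirtualEnv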

The bottom line

I hope this article has made the virtual environment topic clearer. To sum up: if you need to support Python 2, use virtualenv (maybe with virtualenvwrapper on top to make your life easier); if you work with Python 3.3 or later, the venv module of the standard library is all you need.

You haven’t decided yet?

Now, stop reading and go coding on a virtual environment! :)

Enjoy! D.

Iterators and Generators in Python

2016-11-21 23:00:49 +0000


If you have written some code in Python, something more than the simple “Hello World” program, you have probably used iterable objects. Iterable objects are objects that conform to the Iteration Protocol and can hence be used in a loop.

For example:

for i in range(50):
    print(i)

In this example, the range(50) is an iterable object that provides, at each iteration, a different value that is assigned to the i variable.

Quite easy, but what if we would like to create an iterable object ourselves?

The iteration protocol

Creating an iterable object in Python is as easy as implementing the iteration protocol. Let’s pretend that we want to create an object that would let us iterate over the Fibonacci sequence. The Fibonacci sequence is a sequence of integer numbers characterized by the fact that every number after the first two is the sum of the two preceding ones. So the sequence starts with 0 and 1 and then each number that follows is just the sum of the two previous numbers in the sequence. So the third number is 1 (0+1), the fourth is 2 (1+1), the fifth is 3 (1+2), the sixth is 5 (2+3) and so on.

Enough said, let’s let the code talk:

class fibonacci:

    def __init__(self, max=1000000):
        self.a, self.b = 0, 1
        self.max = max

    def __iter__(self):
        # Return the iterable object (self)
        return self

    def next(self):
        # When we need to stop the iteration we just need to raise
        # a StopIteration exception
        if self.a > self.max:
            raise StopIteration

        # save the value that has to be returned
        value_to_be_returned = self.a

        # calculate the next values of the sequence
        self.a, self.b = self.b, self.a + self.b

        return value_to_be_returned

    def __next__(self):
        # For compatibility with Python3
        return self.next()


if __name__ == '__main__':
    MY_FIBONACCI_NUMBERS = fibonacci()
    for fibonacci_number in MY_FIBONACCI_NUMBERS:
        print(fibonacci_number)

As you can see, all we’ve done is create a class that implements the iteration protocol. This protocol consists of two methods: the __iter__() method, which returns the iterable object itself and is called once when the iteration starts, and the __next__() method, which returns the next value of the sequence and raises a StopIteration exception when there are no more values to produce.

Please note that the protocol in Python 2 is a little different: the __next__() method is called just next(). That’s why it is quite common to generate the value in the old Python 2 style method and then add the Python 3 style method that simply returns the value produced by the former, so as to have code that works with both Python 2 and Python 3.

Generators

Generators in Python are just another way of creating iterable objects, and they are usually used when you need to create an iterable object quickly, without going through the creation of a class that adopts the iteration protocol. To create a generator you just need to define a function and then use the yield keyword instead of return.

So, the Fibonacci sequence in a generator could be something like this:

# mario.py
def fibonacci(max):
    a, b = 0, 1
    while a < max:
        yield a
        a, b = b, a+b

if __name__ == '__main__':
    # Create a generator of fibonacci numbers smaller than 1 million
    fibonacci_generator = fibonacci(1000000)

    # print out all the sequence
    for fibonacci_number in fibonacci_generator:
        print(fibonacci_number)

Yes, so simple! Now, run it and see the Fibonacci sequence generated right away:

$ python mario.py
0
1
1
2
3
5
8
13
21
34
55
89
144
233
377
610
987
1597
2584
4181
6765
10946
17711
28657
46368
75025
121393
196418
317811
514229
832040

Please note that once we have consumed the generator, we can’t use it anymore because generators in Python can’t be rewound.

So if, after the code above, we tried to print out the whole sequence again, we wouldn’t get any values:

    # since the sequence is over, we will not get any value here
    for fibonacci_number in fibonacci_generator:
        print(fibonacci_number)

And if you need to use the generator again, you have to call the generator function again

    # So, if you need to use the generator again... recreate it!
    fibonacci_generator = fibonacci(1000000)

    # Ok, let's list 'em again
    for fibonacci_number in fibonacci_generator:
        print(fibonacci_number)

Now, if you can, take some time to debug the generator code above and look at how the values are generated and returned. You will find out that the values are generated lazily: each value is produced only when it’s requested, and it is returned as soon as the yield statement is hit. Hence, the line after the yield is executed only when the next value is requested.

Speaking of debugging, I have to say that one of the best tools I know to write and debug Python code is Visual Studio Code, from Microsoft. It’s really good and available for Windows, macOS and Linux, for free.

Playing with iterable objects

Iterable objects give you a lot of possibilities. For example, if you need to create a list from the previous generator you can simply do:

    my_fibonacci_list = list(fibonacci(100000))
    print("My fibonacci list: {0}".format(my_fibonacci_list))

and you will get:

My fibonacci list: [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025]

Another way of creating a list from an iterable object is by using a list comprehension, which allows you to create a list in a very natural way, specifying also which elements to include. For example, if you need a list with only the odd Fibonacci numbers you can do:

    fibonacci_odds_list = [x for x in fibonacci(100000) if x % 2 != 0]
    print("The odd numbers are: {0}".format(fibonacci_odds_list))

and you’ll get:

The odd numbers are: [1, 1, 3, 5, 13, 21, 55, 89, 233, 377, 987, 1597, 4181, 6765, 17711, 28657, 75025]
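By the way, if you don’t actually need the whole list in memory, you can write the same thing as a generator expression: same syntax but with parentheses instead of square brackets, and the values are produced lazily, one at a time. For example, to sum the odd Fibonacci numbers without building the list first:

    fibonacci_odds = (x for x in fibonacci(100000) if x % 2 != 0)
    print("The sum of the odd numbers is: {0}".format(sum(fibonacci_odds)))
    # prints: The sum of the odd numbers is: 135721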

And you can also feed iterables to all the built-in functions that work on them, like sum(), max(), min() and so on, like this:

    print("The min number is: {0}".format(min(fibonacci(1000000))))
    print("The max number is: {0}".format(max(fibonacci(1000000))))
    print("The sum of is: {0}".format(sum(fibonacci(1000000))))

and running this example you’ll get:

The min number is: 0
The max number is: 832040
The sum is: 2178308

… or to functional programming functions like map() and reduce()… but that’s another story, for a future article.

Happy Pythoning! D.


This article has been written in loving memory of one of the most amazing human beings I’ve ever known, who taught me a lot. Thank you Mario T., rest in peace.