17 August 2016

Plotting the (actual) frequencies in a FreqDist in NLTK

Some days ago, trying to visualise a frequency distribution of tokens in a text via NLTK, I was quite surprised (and slightly disappointed) to see that the plot() method of the FreqDist class does not support a kwarg for plotting the actual frequencies, rather than the counts.

Now, the term frequency in NLP is borrowed from Linguistics, where it's used to mean the count, not the actual frequency of occurrence of a linguistic something. I've never quite liked this usage of the word as I find it pretty confusing: a frequency is, typically, the ratio of the count of occurrences of that something to the total count of all somethings in the set.

FreqDist creates a dictionary of counts, not frequencies, which is quite alright. You can then plot them directly by calling the class method plot(), without needing to call pyplot yourself. I was expecting that among the kwargs accepted by the method there would be something to normalise said counts into frequencies. As of NLTK version 3.2.1, there isn't. The freq(sample) method gives the frequency of a given sample, but there is no way to plot frequencies directly.

My hack to obtain this is sketched below: we simply normalise the counts to their sum, paying attention to the fact that N(), which returns this sum, changes when we change the values, so we need to store it beforehand. The existing kwargs of plot() can be preserved for consistency.
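A minimal sketch of the idea (the function name and its exact signature are mine; only the cumulative and title kwargs of plot() are carried over here):

def plot_freqdist_freq(fd, max_num=None, cumulative=False, title='Frequency plot'):
    """
    As FreqDist.plot(), but plots relative frequencies instead of counts.
    Works on a copy, so the original FreqDist is left untouched.
    """
    tmp = fd.copy()
    norm = fd.N()                           # total count: store it before changing values
    for key in tmp.keys():
        tmp[key] = float(fd[key]) / norm    # count -> frequency

    if max_num:
        tmp.plot(max_num, cumulative=cumulative, title=title)
    else:
        tmp.plot(cumulative=cumulative, title=title)

which can be used, for instance, as

from nltk import FreqDist
from nltk.book import text2    # Austen's Sense and Sensibility

fd = FreqDist(text2)
plot_freqdist_freq(fd, max_num=50, title='Frequency plot for Sense and Sensibility')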

The result is in the figure below: for the text of Austen's Sense and Sensibility (available as a book in the NLTK data), we're plotting the frequencies of the 50 most frequent tokens.

Looks like 'to' is the most frequent token (note that no pre-processing/removals have been employed), with a frequency of around 6.5%. Having this number might be much more interesting than the raw count (which is around 9000, for reference).

31 July 2016

Interacting with a DynamoDB via boto3


Boto3 is the Python SDK to interact with Amazon Web Services. DynamoDB is AWS's NoSQL database service, and boto3 contains classes/methods to deal with its tables. This post assumes the AWS CLI (the tool to set up access/authorisation to the cloud) has been configured, which can easily be done from the terminal.

I've written this Gist for the methods outlined here.

I find its docs a bit convoluted, and more than once I've had to go back and hunt for the part describing exactly how to perform this or that action. So what did I decide to do about this? Whenever I figure out how to do something, I write some small wrappers around boto3's methods/objects so I can easily run them later, saving me the hassle of having to dig the information out of the docs once again.

This post outlines some operations on DynamoDB databases, run through boto3.

Let us assume you have a certain table in DynamoDB. We'll start by importing the relevant stuff and by initialising the resource for the DynamoDB:
from boto3 import resource
from boto3.dynamodb.conditions import Key

# The boto3 dynamoDB resource
dynamodb_resource = resource('dynamodb')
You'd call your table as
table = dynamodb_resource.Table(table_name)
where table_name is just the string specifying the name of your table in DynamoDB. Easy peasy!

Because certain operations can be expensive on DynamoDB and there isn't really any way to run aggregations (differently from MongoDB, which has a full aggregation framework), you might want to start by getting to know your table a bit. Specifically, scan operations get slower the more items your table holds, as they have to walk the whole table. It's typically useful to know the size, the number of items, which field plays the role of the primary key, and so on. So, I've collected the relevant attributes in a convenient dict as
def get_table_metadata(table_name):
    """
    Get some metadata about chosen table.
    """
    table = dynamodb_resource.Table(table_name)

    return {
        'num_items': table.item_count,
        'primary_key_name': table.key_schema[0]['AttributeName'],  # key_schema is a list of dicts
        'status': table.table_status,
        'bytes_size': table.table_size_bytes,
        'global_secondary_indices': table.global_secondary_indexes
    }
Say, for instance, you have hundreds of thousands of items in the table; then a scan might not be a great idea. At least you know beforehand!
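For example (the table name here is just a hypothetical one):

metadata = get_table_metadata('users')
print metadata['num_items'], metadata['bytes_size']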

Now, a GET, a PUT and a DELETE can be performed as

def read_table_item(table_name, pk_name, pk_value):
    """
    Return item read by primary key.
    """
    table = dynamodb_resource.Table(table_name)
    response = table.get_item(Key={pk_name: pk_value})

    return response


def add_item(table_name, col_dict):
    """
    Add one item (row) to table. col_dict is a dictionary {col_name: value}.
    """
    table = dynamodb_resource.Table(table_name)
    response = table.put_item(Item=col_dict)

    return response


def delete_item(table_name, pk_name, pk_value):
    """
    Delete an item (row) in table from its primary key.
    """
    table = dynamodb_resource.Table(table_name)
    response = table.delete_item(Key={pk_name: pk_value})

    return response

The two main operations you can run to retrieve items from a DynamoDB table are query and scan. The AWS docs explain that while a query is useful to search for items via primary key, a scan walks the full table, but filters can be applied. The basic way to achieve this in boto3 is via the query and scan APIs:

def scan_table(table_name, filter_key=None, filter_value=None):
    """
    Perform a scan operation on table.
    Can specify filter_key (col name) and its value to be filtered.
    """
    table = dynamodb_resource.Table(table_name)

    if filter_key and filter_value:
        filtering_exp = Key(filter_key).eq(filter_value)
        response = table.scan(FilterExpression=filtering_exp)
    else:
        response = table.scan()

    return response


def query_table(table_name, filter_key=None, filter_value=None):
    """
    Perform a query operation on the table.
    A query needs a key condition, so filter_key (the key name)
    and filter_value must both be given.
    """
    table = dynamodb_resource.Table(table_name)

    if filter_key and filter_value:
        filtering_exp = Key(filter_key).eq(filter_value)
        response = table.query(KeyConditionExpression=filtering_exp)
    else:
        # DynamoDB rejects a query without a key condition, so fail early
        raise ValueError('Both filter_key and filter_value are needed for a query')

    return response

The actual items of the table will be in the 'Items' key of the response dictionary.
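For example, grabbing the items out of a filtered scan (the table and field names here are hypothetical):

response = scan_table('users', filter_key='city', filter_value='London')
users = response['Items']
print response['Count']    # number of items returned in this page of results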

The issue here is that results from a DynamoDB scan are paginated, hence it is not guaranteed that a single scan call will grab all the data in the table; this is yet another reason to keep track of how many items there are and how many you actually end up with when scanning.

In order to scan the table page by page, we need to loop over the LastEvaluatedKey each response gives us, passing it as ExclusiveStartKey to the next call, until we have seen the full table. So you can do a loop as in:

def scan_table_allpages(table_name, filter_key=None, filter_value=None):
    """
    Perform a scan operation on table.
    Can specify filter_key (col name) and its value to be filtered.
    This gets all pages of results. Returns list of items.
    """
    table = dynamodb_resource.Table(table_name)

    if filter_key and filter_value:
        filtering_exp = Key(filter_key).eq(filter_value)
        response = table.scan(FilterExpression=filtering_exp)
    else:
        response = table.scan()

    items = response['Items']

    # keep scanning until there are no more pages, passing the filter
    # (if any) and the LastEvaluatedKey to each new page request
    while response.get('LastEvaluatedKey'):
        if filter_key and filter_value:
            response = table.scan(FilterExpression=filtering_exp,
                                  ExclusiveStartKey=response['LastEvaluatedKey'])
        else:
            response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
        items += response['Items']

    return items
Happy dynamoying!

27 July 2016

Python for (some) Elasticsearch queries

This post will be a quick roundup of the most common ES queries, to be run via the low-level Python client Elasticsearch.

Assuming you have an Elasticsearch cluster somewhere, either locally or remotely, you'd use the client to connect to it as follows (here we grab the remote URL from an environment variable and pass it to the constructor; if we don't pass anything, it will connect to a local instance):

from elasticsearch import Elasticsearch
from os import environ

ES_cluster_URL = environ['ES_CLUSTER_URL']

# use one or the other:
es_client = Elasticsearch()                  # local
es_client = Elasticsearch([ES_cluster_URL])  # remote

and then you'd build the prototypical query body as

body = {
    "from": 10,                       # start from the 10th doc
    "size": 100,                      # get 100 docs (default = 10)
    "fields": ["wanted_field"],       # get only the wanted fields
    "query": {                        # the query itself
        "term": {
            "some_field": "some_value"
        }
    },
    "sort": {                         # how to sort the results
        "date_field": {
            "order": "desc"
        }
    }
}


and you'd run the query as

r = es_client.search(index='myindex',
                     doc_type='mytype',
                     body=body)


By exploring the structure of r you'd find what you need (the structure of what you get back will change based on the type of query you run).
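For instance, for a plain search the matching documents sit under the hits key (if you restricted the returned fields in the body, look under each hit's fields key instead of _source):

hits = r['hits']['hits']           # the list of matching documents
first_doc = hits[0]['_source']     # the body of the first matching document
total = r['hits']['total']         # how many documents matched overall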

Let's see how to do some specific/commonplace queries by tweaking the body object.

term query

You run a term query when you want to retrieve all documents matching a condition on a field.

# A term query
body = {
    "query": {
        "term": {
            "your_field": "needed_value"
         }
     },
}

range query

# A range query
body = {
    "query": {
        "range": {
            "date_field": {
                "gte": start_date,
                "lt": final_date
            }
        }
    }
}



Here, start_date and final_date are datetime objects; gt and lt mean "greater than" and "less than" respectively, and the trailing e (as in gte) signifies that the bound is inclusive.
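For instance (these dates are just illustrative):

from datetime import datetime

start_date = datetime(2016, 1, 1)    # included, because of the "e" in gte
final_date = datetime(2016, 7, 1)    # excluded ("lt")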

bool query 

To perform an AND, you need to run a so-called bool query, which can be used for all sorts of logical queries, but here I give the prototype of an AND.

# A bool query for an AND
body = {
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "field1": "value1"
                    }
                },
                {
                    "term": {
                        "field2": "value2"
                    }
                }
            ]
        }
    }
}

Here we're asking for all documents where field1 matches value1 AND field2 matches value2. In a similar way, we could use a must_not keyword to ask for documents that do not match a given value, as sketched below. There are also other keywords one can use depending on the use case.
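For example, a minimal sketch of a must_not (field and value names are placeholders):

# A bool query with a must_not
body = {
    "query": {
        "bool": {
            "must_not": [
                {
                    "term": {
                        "field1": "value1"
                    }
                }
            ]
        }
    }
}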

aggregation query 

There are many situations where you need to aggregate documents on a field. A prototype to obtain this (here we aggregate on a field called my_field) would be:
body = {
    "size": 0,
    "aggs": {
        "my_field_agg": {
            "terms": {
                "size": 100,
                "field": "my_field"
            }
        }
    }
}

The size parameter in the aggregation has to be tweaked to make sure the returned sum_other_doc_count in r is 0; otherwise, it means not all documents have been aggregated.
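The buckets themselves can then be read from the response, under the aggregation name chosen in the body above:

# Reading the aggregation results back
agg = r['aggregations']['my_field_agg']
for bucket in agg['buckets']:
    print bucket['key'], bucket['doc_count']

# if this is not 0, increase the "size" in the aggregation body
print agg['sum_other_doc_count']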

All this I've reported in a Gist here.

14 July 2016

Break it to get it: class and instance attributes in Python

Let us investigate the relation between class and instance attributes in Python. It can be rather confusing, I believe.

This post is about trying things out in order to break them and better understand what they do. It is somewhat inspired by this source here, which is the most comprehensive I've found on the topic, describing and explaining in detail how things work.

Suppose I have a class describing a cake. I want to identify the cake by its type, which will be a string, and I want to give it a diameter attribute, which is set to 20 (cm) as default, as it seems reasonable for a typical cake. So I'll write
class Cake(object):

    def __init__(self, cake_type, diameter=20):
        self.cake_type = cake_type
        self.diameter = diameter

Nothing special here. Now, suppose I write this instead
class Cake(object):

    # A class attribute
    garnishes = []

    def __init__(self, cake_type, diameter=20):

        # Instance attributes
        self.cake_type = cake_type
        self.diameter = diameter
This version has a class attribute, garnishes, which is a list supposed to contain all the garnishes I decide to add to my cake. If I instantiate the class, I can access this attribute both from the class itself and from the instance:
c = Cake('carrot')
c.garnishes
# []
Cake.garnishes
# []
This leads to the funny behaviour the docs explicitly warn about. In fact, if I add a method to the class to append a garnish and I create two instances of the class, then use the method to add a garnish to one of the instances, I end up with the same garnish being added to the other instance as well:
class Cake(object):

    # A class attribute
    garnishes = []

    def __init__(self, cake_type, diameter=20):

        # Instance attributes
        self.cake_type = cake_type
        self.diameter = diameter

    def add_garnish(self, garnish):
        self.garnishes.append(garnish)


c1 = Cake('carrot')
c2 = Cake('brownie')

c1.add_garnish('cream')
print c1.garnishes
#['cream']
print c2.garnishes
#['cream']
I typically don't like cream on my brownie, so this is clearly a problem. The code design is not good for the job.

What is happening here is that all the instances of the class are making use of the same list. In order to make each cake instance have its own set of garnishes, I'd have to redefine their garnishes lists separately on the instances, and only then could I happily append to them:
c1 = Cake('carrot')
c2 = Cake('brownie')

c1.garnishes = ['cream']
c2.garnishes = ['chocolate_chips']

c1.garnishes.append('jam')
c2.garnishes.append('cherry')

print c1.garnishes, c2.garnishes
# ['cream', 'jam'] ['chocolate_chips', 'cherry']
I got stupidly confused on this matter when I was playing around with another example. Instead of having a list of garnishes, let's design the class to have an integer attribute counting the garnishes and let's create a method meant to increment the counter:
class Cake(object):

    # A class attribute
    garnishes = 0

    def __init__(self, cake_type, diameter=20):

        # Instance attributes
        self.cake_type = cake_type
        self.diameter = diameter

    def add_garnish(self):
        self.garnishes += 1

c1 = Cake('carrot')
c2 = Cake('brownie')

c1.add_garnish()

print c1.garnishes, c2.garnishes
# 1 0
The second instance does not get its garnishes count incremented! We broke it. Let's try to understand.

A fact: an instance attribute has priority over the class attribute when access is performed from the instance.

The "difference" in the behaviour in the two cases is in the list being a mutable type while the int being an immutable type.
While in the list case we are appending to the list, which is shared by all instances of the Cake class, in the second case what we're actually doing when we call the method is
self.garnishes = self.garnishes + 1
that is, because of the immutability of the type, an instance attribute is created for instance c1 and it is incremented. The second instance does not see this attribute at all because it belongs to the first instance and the class attribute is not changed. The type is immutable so I cannot mutate a 0 into a 1, that is why.

To make it even clearer to myself, I did this:
class Cake(object):

    # A class attribute
    garnishes = 0

    def __init__(self, cake_type, diameter=20):

        # Instance attributes
        self.cake_type = cake_type
        self.diameter = diameter


c1 = Cake('carrot')
c2 = Cake('brownie')

# Changing the class attribute from the class
Cake.garnishes = 2
print c1.garnishes, c2.garnishes
# 2 2
Makes sense, obviously: we are changing the class attribute so all the instances will see it changed. If I now change the attribute from one of the instances instead:
class Cake(object):

    # A class attribute
    garnishes = 0

    def __init__(self, cake_type, diameter=20):

        # Instance attributes
        self.cake_type = cake_type
        self.diameter = diameter


c1 = Cake('carrot')
c2 = Cake('brownie')

# Changing the class attribute from the first instance
c1.garnishes = 2
print c1.garnishes, c2.garnishes
# 2 0
I get the expected: only the first instance sees it modified. And in fact the class attribute is not changed:
print Cake.garnishes
# 0
In the case of a mutable type, the object can by definition be mutated in place, so when I append something to the list from the first instance, it is the class attribute itself (the shared list) that gets modified.
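A stripped-down version of the list-based class from above makes this explicit:

class Cake(object):

    # A class attribute: one single list, shared by all instances
    garnishes = []

    def add_garnish(self, garnish):
        self.garnishes.append(garnish)


c1 = Cake()
c1.add_garnish('cream')

print Cake.garnishes                   # ['cream']: the class attribute itself has changed
print c1.garnishes is Cake.garnishes   # True: it is the very same list object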

9 July 2016

Some Matplotlib idiosyncrasies

Matplotlib, the main data plotting library in Python, is pretty comprehensive in what it can do, but not the best in how quickly/clearly you can customise your plots to your specific needs. It's also not the best plotting tool around in terms of the quality (read: beauty) of the end result, but still, it makes stuff come easy and is, of course, immediately available in a Jupyter notebook, so ...

For publication-quality plots, especially if we're talking about "maths-heavy" ones, I'd always and stubbornly suggest Gnuplot. It's just beautiful, but it's typically a pain to get good things done. I have literally spent hours creating the scripts for a plot for a journal article.

Matplotlib is just the thing you use for the regular data visualisation in your everyday life. I'm writing this post more as a reminder to myself about how to achieve stuff.
I typically forget Matplotlib's idiosyncrasies all the time, and then what I do is just go to one of the notebooks I have, copy-paste the relevant code where I've done something similar, and voilà, food is served. I know it's ugly, but maybe I'm lazy and I'd rather copy-paste my own code than search the docs again, because I don't have a good memory.

This is not meant to be a tutorial; by the way, I've just discovered there are two excellent ones here and here. This is literally just me writing down some stuff so I remember it better. Maybe it can be useful to someone else as well (?!). Also, I have of course not explored all of the library's capabilities yet. For instance, other than the pyplot API, I don't think I've ever used the other parts of it.

It's usually just a process of "I need to do this", I look for docs/examples, I do it. The following time it goes as "I need to do this, but I'm sure I've already done it somewhere", I look for the example within my code, then copy-paste and tweak it. It's just an inefficient process.

Let's see.

Plotting a function with one independent variable

First of all, assuming pyplot and numpy have been imported as
import matplotlib.pyplot as plt
import numpy as np
I can plot a sine function as simply as
x = np.linspace(0, 10)
plt.plot(x, np.sin(x), linewidth=2)

plt.title('Sine function', fontweight='bold', fontsize=16)
plt.xlabel('x')
plt.ylabel('sin(x)')
and the result would be



Pretty basic, eh? Not great. Now, from Matplotlib version 1.4 onwards we can use the wonderful ggplot style (borrowed from R's ggplot2), which we can load as

plt.style.use('ggplot')
and the plot changes (for the muuuch better) as



On a quick note, Seaborn, the "statistical" plotting library for Python has the ggplot style loaded by default. We will keep this style from now on.
Also note that the ggplot style loads a grid automatically, while in the basic style case we'd have to add it.
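In the default style, you'd switch it on explicitly:

plt.grid(True)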

Now, obviously we can do lots of tweaks to the aesthetics by changing the colour, the type of line/points, the fonts in the labels, and so on; it's all pretty straightforward from the API. We can also do several other types of plots, such as a bar plot, a scatter plot, and so on, as sketched below.
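For instance, a quick sketch of such tweaks, reusing the x from above (the colour, marker and sizes are just illustrative choices):

plt.scatter(x, np.sin(x), c='darkred', marker='o', s=30)   # a scatter plot instead of a line
plt.xticks(fontsize=12)                                     # bigger tick labels
plt.yticks(fontsize=12)
plt.xlabel('x')
plt.ylabel('sin(x)')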

Plotting a surface

Plotting surfaces is a little trickier; this is my example (for a paraboloid centred on the origin):
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D    # needed to register the '3d' projection

x = np.arange(-100, 100)
y = np.arange(-100, 100)
x, y = np.meshgrid(x, y)

def f(x, y):
    return x**2 + y**2

fig = plt.figure()
ax = fig.gca(projection='3d')
parabola = ax.plot_surface(x, y, f(x, y), cmap=cm.RdPu)
plt.xlabel('x')
plt.ylabel('y')
which results in



I find this quite nice. You'd have to work a bit on the tick sizes, but all in all it's a good one.

The log scale

One thing I keep forgetting is how to get a log scale on either or both of the axes. It's simple.
An exponential will look like a line on a semilog plot (log scale on the y axis), so here's how to get one:
x = np.linspace(0, 1, 100)
plt.semilogy(x, np.exp(x))

Sure, we could make it look better. Note that a log scale on the x axis is done similarly (with semilogx). Now, a log-log plot (let's plot a power law so it'll look like a line) is achieved by
plt.loglog(x, x**(-0.6))


Putting a legend when there are more curves plotted

We need a handler in this case. You can choose where to place the legend via the loc parameter (loc=4 is the lower-right corner; string locations such as 'lower right' work as well).
from matplotlib.legend_handler import HandlerLine2D

x = np.linspace(0, 10)
sin_line, = plt.plot(x, np.sin(x), label='sin(x)')
cos_line, = plt.plot(x, np.cos(x), label='cos(x)')
legend = plt.legend(handler_map={sin_line: HandlerLine2D(numpoints=2)}, loc=4)
plt.title('Sin and cos', fontweight='bold', fontsize=16)
plt.xlabel('x')


Obviously, this was just a short list of the most compelling, difficult-to-remember-how-to-achieve things I've come across while plotting stuff here and there; I'm not claiming it's comprehensive.