Back from the Dead

September 9, 2010

The previous post in this blog is more than nine months old, and before that it had already been inactive for six months more. There have been good reasons for that long silence. Chiefly, the birth of Gabriela, our second child, in November 2009.

Well, all that is about to change!

Business as usual will resume shortly with a post on Fibonacci numbers, and there’s also stuff in the making regarding the Chinese remainder theorem as a preliminary to the prime-factor FFT algorithm. The general plan from over a year ago sort of still holds. And I’m sticking with Python 2.x, at least for the time being.

But I have more ambitious plans, and have to decide if this blog will be part of them…

Over the past months I have spent quite a bit of time looking into several physics problems. Things like the speed at which a liquid evaporates, the rolling resistance of a deformable wheel or the stability of a bicycle. Yes, my wife agrees: I have weird ways of enjoying my free time… So that my newly acquired knowledge will not be lost like tears in the rain, I have the firm intention of writing about all that stuff. Not that any of those subjects would in any way fit within this blog as it is, so I have registered another one just in case.

Also, after 10 years in the same company, and 5 years doing the same job, I am starting to feel an itching urge to give my career a good shake. As a Spanish saying goes, “el hombre propone y Dios dispone” (man proposes and then God decides), but the firm intention is to leave Spain and find a challenging, enjoyable, better-paying job abroad within the next couple of years. If you happen to want to hire an experienced R&D engineer with an EU passport, do contact me.

Not that I expect my inbox to be flooded by this announcement, so I will also be working on honing some skills to better sell myself in the free market, and will definitely want to leave written testimony of my exploits. Some statistics and data analysis posts could see the light here, but I don’t really see relational databases or web programming fitting in this blog. I could of course open yet another one devoted to more generic programming topics, but at some point I’ll have to decide if it isn’t better to have a single one with more varied content.

Anyway, if there’s anyone out there still listening: welcome back, make yourselves comfortable, and enjoy the ride!


Good boy!

November 30, 2009

Just a quick pat on my own shoulder…

Project Euler’s problem 266, posted just this past weekend, was proposed by yours truly:

The divisors of 12 are: 1,2,3,4,6 and 12.
The largest divisor of 12 that does not exceed the square root of 12 is 3.
We shall call the largest divisor of an integer n that does not exceed the square root of n the pseudo square root (PSR) of n.
It can be seen that PSR(3102)=47.

Let p be the product of the primes below 190.
Find PSR(p) mod 10^16.

I had something much more modest in mind, such as “all primes below 100”, since the way I figured it, this was to be solved by treating it as a variation of the knapsack problem and then using dynamic programming on it. An approach that will see you die of old age if you try it for the proposed value…

But I then had the pleasure of witnessing how xan, hk and daniel.is.fischer started coming up with alternative, much more sophisticated and elegant methods that pushed the question all the way up to where it is now. It is both a humbling and edifying experience to see these people at work, so if you have an idea for a problem, do propose it to them.

I won’t go into any more details as to how to solve this particular problem: the right place to ask questions is here.


Binary Exponentiation

June 5, 2009

(The code examples in this post are also available here.)

Although I have already written something on binary exponentiation, as it applied to modular exponentiation, there’s plenty more to be said about it in more general settings. Basically, today we are after computing the n-th power of a, where a need not be an integer, rational, real or even complex number: it can be any more general mathematical object that can be raised to a power, such as a square matrix or a polynomial…

For an n-th power to exist, the object being exponentiated must fulfill certain conditions, which can basically be summarized as belonging to a more general set that has a well-behaved multiplication operation defined. Good behavior can be translated into closure and associativity, which in mathematical slang can be expressed as the object belonging to a semigroup. This will often mean that the object belongs to a monoid, and even to a ring or rng… But I’ll follow V.I. Arnold‘s advice, and not get too carried away with the abstract generalizations, so enough said: the thing must be multipliable, and the result must be of the same kind.

Defining Multiplication

Because we are dealing with general mathematical objects, we first need to define the multiplication operation. There are two ways of dealing with this in python. To work on an example, say we want to deal with 2×2 matrices. We could define a Matrix2x2 class, and write a __mul__ method for it, so that whenever we write a*b, what gets evaluated is a.__mul__(b). For the same price, I’ve thrown in an __add__ method…

class Matrix2x2(object) :
    def __init__(self,elements = None) :
        if elements :
            self._data_ = elements[:4]
        else :
            self._data_ = [0] * 4

    def __repr__(self) :
        as_string = [str(j) for j in self._data_]
        str_length = [len(j) for j in as_string]
        longest = max(str_length)
        for j in xrange(4) :
            as_string[j] = ' ' * (longest - str_length[j]) + as_string[j]
        ret = 'Matrix2x2 object:'
        for j in xrange(2) :
            ret += '\n[ %s %s ]' % (as_string[2*j], as_string[2*j+1])
        return ret
        

    def __mul__(self, b) :
        ret = [self._data_[0]*b._data_[0] + self._data_[1]*b._data_[2]]
        ret += [self._data_[0]*b._data_[1] + self._data_[1]*b._data_[3]]
        ret += [self._data_[2]*b._data_[0] + self._data_[3]*b._data_[2]]
        ret += [self._data_[2]*b._data_[1] + self._data_[3]*b._data_[3]]
        return Matrix2x2(ret)

    def __add__(self, b) :
        return Matrix2x2([self._data_[j] + b._data_[j] for j in xrange(4)])

Or if we don’t want to go through the hassle of dealing with objects, we can agree with ourselves to keep the matrix’s data in a list in row-major order, and define a standalone multiplication function…

def mat_mul(a, b) :
    """
    Returns the product of two 2x2 square matrices.

    Computes the product of two 2x2 matrices, each stored in a four element
    list in row major order.
    """

    ret = [a[0]*b[0] + a[1]*b[2]]
    ret += [a[0]*b[1] + a[1]*b[3]]
    ret += [a[2]*b[0] + a[3]*b[2]]
    ret += [a[2]*b[1] + a[3]*b[3]]
    return ret
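
As a quick sanity check, and assuming the definitions above have been loaded, both representations should of course agree on a simple product:

>>> Matrix2x2([1, 2, 3, 4]) * Matrix2x2([5, 6, 7, 8])
Matrix2x2 object:
[ 19 22 ]
[ 43 50 ]
>>> mat_mul([1, 2, 3, 4], [5, 6, 7, 8])
[19, 22, 43, 50]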

Direct Multiplication

To set a baseline for performance, the starting point has to be the most direct implementation of exponentiation: repeated multiplication. There really isn’t much mystery to it…

import nrp_base

@nrp_base.performance_timer
def direct_pow(a, n, **kwargs) :
    """
    Computes a**n by direct multiplication.

    Arguments
     a - The object to be exponentiated.
     n - The integral exponent.

    Keyword Argument
     mul - A function taking two arguments of the same type as a, and
           returning their product. If undefined, a's __mul__ method will
           be used.
           
    """
    mul = kwargs.pop('mul',None)
    ret = a
    if mul is None :
        mul = lambda x,y : x*y
    for j in xrange(n-1) :
        ret = mul(ret,a)
    return ret
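
A quick usage check, assuming the definitions above are in scope (and that nrp_base is importable), raising the same matrix to the fifth power in both representations:

>>> direct_pow(Matrix2x2([1, 1, 1, 0]), 5)
Matrix2x2 object:
[ 8 5 ]
[ 5 3 ]
>>> direct_pow([1, 1, 1, 0], 5, mul=mat_mul)
[8, 5, 5, 3]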

The Two Flavors of Binary Exponentiation

OK, so now we are ready for the fun to begin… The basic idea behind speeding exponentiation up is that a^4 can be computed with two, instead of three, multiplications, if rather than doing it as a^4 = a·a·a·a, we do it as a^2 = a·a, a^4 = a^2·a^2, storing the intermediate result for a^2.

To expand this idea to exponents other than 4 we basically need to write the exponent in binary. Examples are great for figuring out this kind of thing, so we’ll take computing a^19 as ours. Knowing that 19 in binary is 10011, we now have a choice on how to proceed, which basically amounts to using the bits of the exponent starting either from the least significant one or from the most significant one…

Least Significant Bit First

That 19 in binary is 10011 can also be interpreted as 19 = 2^0 + 2^1 + 2^4, and so a^19 = a^(2^0)·a^(2^1)·a^(2^4)… The algorithm then is pretty straightforward:

  • start computing sequentially from j = 0 the values of a^(2^j), where each new value is the square of the preceding one,
  • when the first non-zero bit is found, set the return value to the corresponding a^(2^j) value,
  • for subsequent non-zero bits, multiply the return value by the corresponding a^(2^j) value,
  • once all bits of the exponent have been processed, the return value holds the sought value.

If the exponent is n, then there will be log₂ n squarings, to compute the a^(2^j) values, plus one multiplication less than there are 1's in the binary expansion of the exponent.
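
For plain numbers the recipe above fits in a few lines. This is just a throwaway sketch of the idea, not part of the nrp modules; the general-purpose bin_pow further down handles arbitrary objects and both flavors:

def lsb_pow_sketch(a, n) :
    """
    Least-significant-bit-first exponentiation sketch, for integer n >= 1.
    """
    ret = None
    val = a            # val runs through the successive a**(2**j)
    while n :
        if n & 1 :     # this bit contributes the current a**(2**j)
            ret = val if ret is None else ret * val
        val = val * val
        n >>= 1
    return ret

# lsb_pow_sketch(3, 19) == 3**19 -> True

Note the wasted final squaring of val after the last bit has been consumed; bin_pow below is written so as to avoid it.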

Most Significant Bit First

It is also possible to write a^19 = ((a^(2^3)·a)^2)·a. While it is easy to verify that this is the case, it may be harder to see where it comes from. The algorithm is as follows:

  • Take the exponent bits from most to least significant,
  • since the first bit is always a 1, set the return value to a,
  • for every bit processed, first square the return value, and then, if the bit is a 1, multiply the return value by a,
  • again, once the last bit is used, the return value holds the sought value.

The analysis of the algorithm is similar to the previous one: there will again be log₂ n squarings, and the same number of multiplications as well.

This second flavor has one clear disadvantage: there’s no simple way to generate the bits of a number from most to least significant. So this approach requires computing and storing the bits from least to most significant, and then retrieving them in reverse order. Not that it is anything too complicated, but there are additional time costs involved.

So why would one even worry about this scheme then? Well, it also has a potential advantage: the multiplications performed always involve the original object being exponentiated. This can prove beneficial in two different ways:

  1. it lends itself better to optimization of particular cases, and
  2. if the original element involves small numbers, but the end result involves large numbers, the multiplications are usually faster than in the previous algorithm.

To see the differences with examples, it is better to first write the python code…

@nrp_base.performance_timer
def bin_pow(a, n, **kwargs) :
    """
    Computes a**n by binary exponentiation.

    Arguments
     a - The object to be exponentiated.
     n - The integral exponent.

    Keyword Argument
     mul - A function taking two arguments of the same type as a, and
           returning their product. If undefined, a's __mul__ method will
           be used.
     sqr - A function taking a single argument of the same type as a, and
           returning its square. If undefined, mul(a,a) will be used. To
           use a method of the object's class, use sqr=ClassName.function_name.
     lsb - Set to True to use least significant bits first. Default is False.
     msb - Set to True to use most significant bits first. Overrides lsb.
           Default is True.
           
    """
    mul = kwargs.pop('mul',None)
    sqr = kwargs.pop('sqr',None)
    lsb = kwargs.pop('lsb',None)
    msb = kwargs.pop('msb',None)
    if mul is None :
        mul = lambda x, y : x * y
    if sqr is None :
        sqr = lambda x : mul(x, x)
    if lsb is None and msb is None :
        msb = True
        lsb = False
    elif msb is None :
        msb = not lsb
    else :
        lsb = not msb

    if msb :
        bits = []
        while n > 1 : # The last bit is always a 1...
            if n & 1 :
                bits += [1]
            else :
                bits += [0]
            n >>= 1
        ret = a
        while bits :
            ret = sqr(ret)
            if bits.pop() :
                ret = mul(ret,a)
    else :
        val = a
        while not n & 1 :
            n >>= 1
            val = sqr(val)
        ret = val
        while n > 1:
            n >>= 1
            val = sqr(val)
            if n & 1 :
                ret = mul(ret,val)
    return ret

Taking bin_pow for a Ride…

To get a better grip on how the algorithm performs, we’ll take a very simple matrix, one which we will find again in a future post, when we discuss Fibonacci numbers…

>>> a = Matrix2x2([1,1,1,0])
>>> a
Matrix2x2 object:
[ 1 1 ]
[ 1 0 ]

So let’s raise it first to a small power, and see how the two algorithms perform…

>>> val = bin_pow(a,10,lsb=True,verbose=True,timer_loops=100)
Call to bin_pow took:
2.84952e-05 sec. (min)
0.000205613 sec. (max)
3.363e-05 sec. (avg)
0.000178629 sec. (st_dev)
>>> val = bin_pow(a,10,msb=True,verbose=True,timer_loops=100)
Call to bin_pow took:
3.12889e-05 sec. (min)
5.16825e-05 sec. (max)
3.32891e-05 sec. (avg)
2.93951e-05 sec. (st_dev)

Not many surprises here, as the algorithm having to store and retrieve bits is a little slower than the other. But let’s see what happens if we scale things…

>>> val = bin_pow(a,1000000,lsb=True,verbose=True)
Call to bin_pow took 2.599 sec.
>>> val = bin_pow(a,1000000,msb=True,verbose=True)
Call to bin_pow took 1.76194 sec.

This sort of shows that the multiplication thing can be very relevant. A last little experiment gives more insight into the workings of this…

>>> 2**20
1048576
>>> val = bin_pow(a,2**20,lsb=True,verbose=True)
Call to bin_pow took 1.95651 sec.
>>> val = bin_pow(a,2**20,msb=True,verbose=True)
Call to bin_pow took 1.90171 sec.

Funny, isn’t it? Why would it be that the time differences are so much smaller now, even though the exponents are roughly the same? These other timings may help answer that…

>>> val = bin_pow(a,2**20-1,lsb=True,verbose=True)
Call to bin_pow took 2.61069 sec.
>>> val = bin_pow(a,2**20-1,msb=True,verbose=True)
Call to bin_pow took 1.89222 sec.

Got it now? 2^20 has a binary expansion full of 0's, 20 of them, preceded by a single 1. This means that neither algorithm is performing any multiplication, just squarings, so the performance is almost identical. On the other hand, 2^20-1 has a binary expansion full of 1's, 20 of them, with no 0 anywhere. This means that, although there will be one squaring less than in the previous case, there will be 19 more multiplications. And here’s where the msb approach excels, because the multiplications involved are simpler.
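
For reference, the two binary expansions can be checked directly (bin is available from Python 2.6 onward):

>>> bin(2**20)
'0b100000000000000000000'
>>> bin(2**20 - 1)
'0b11111111111111111111'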


The Cooley-Tukey FFT Algorithm for General Factorizations

May 29, 2009

(All code examples in the post have been included in the nrp_base.py module, which can be downloaded from this repository.)

As presented in the previous post, Cooley-Tukey’s FFT algorithm has a clear limitation: it can only be used to speed the calculation of DFTs of a size that is a power of two. It isn’t hard, though, to extend the same idea to a more general factorization of the input signal size. We start again from the formula that defines the DFT,

\displaystyle X_k = \sum_{n=0}^{N-1}{ x_n e^{-2\pi i kn/N}}

Say that N = P·Q. Following the previous approach, we could try to split the calculation of the DFT into Q DFTs of size P, by rewriting it as

\displaystyle X_k = \sum_{q=0}^{Q-1}{ e^{-2\pi i \frac{kq}{N}} \sum_{p=0}^{P-1}{x_{Qp+q} e^{-2\pi i \frac{kp}{P}}}}

where n = Q·p + q. We are again faced with the same issue to turn the inner summation into a proper DFT: while p ranges from 0 to P-1, k goes all the way up to N-1. To get around this, we can make k = P·r + s, and replacing in the previous formula we get

\displaystyle X_{Pr+s} = \sum_{q=0}^{Q-1}{ e^{-2\pi i \frac{qr}{Q}} e^{-2\pi i \frac{qs}{N}} \sum_{p=0}^{P-1}{ x_{Qp+q} e^{-2\pi i \frac{ps}{P}}}}

where the inner summation has been reduced to a standard DFT of size P. Furthermore, the splitting of the first exponential in two factors unveils that the outer summation is also a standard DFT, this one of size Q, of the product of the DFT of size P by twiddle factors.

The subscripts in the above formula can get confusing to follow, so take note that q and r go from 0 to Q-1, while p and s go from 0 to P-1. Therefore the computation of the size N DFT has effectively been split into the computation of Q DFTs of size P, followed by twiddle factor multiplication, and then by P DFTs of size Q. It may be helpful to picture the linear array of input data as a P×Q matrix stored in row-major order. Then the Q DFTs of size P are performed on the columns of this matrix, and the P DFTs of size Q on the rows. Note though that the output has to be rearranged, in what would be equivalent to a matrix transposition, before getting to the final answer. As an example, the following picture illustrates the process on an input of size 15.

The operations in a step of a generalized factorization Cooley-Tukey FFT algorithm.
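
Before writing the real thing, it may help to check the index bookkeeping numerically. This is a throwaway, self-contained sketch (it does not use the nrp modules) that performs a single decomposition step for N = 15 = 3·5 and compares the result with the direct DFT:

import cmath, math

def naive_dft(x) :
    # direct O(N**2) DFT, used as the reference
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in xrange(N)) for k in xrange(N)]

def ct_step(x, P, Q) :
    # one Cooley-Tukey step for N = P*Q: Q column DFTs of size P, twiddle
    # factors, P row DFTs of size Q, and a final transposition
    N = P * Q
    X = [0j] * N
    for q in xrange(Q) :                  # inner DFTs over p (the columns)
        col = naive_dft(x[q::Q])          # col[s] = sum_p x[Q*p+q] e^(-2 pi i p s / P)
        for s in xrange(P) :
            X[Q * s + q] = col[s] * cmath.exp(-2j * math.pi * q * s / N)
    ret = [0j] * N
    for s in xrange(P) :                  # outer DFTs over q (the rows)
        row = naive_dft(X[Q * s:Q * (s + 1)])
        for r in xrange(Q) :
            ret[P * r + s] = row[r]       # the transposition: this is X_(P*r+s)
    return ret

x = [float(j) for j in xrange(15)]
# the difference should be at the level of floating point rounding error
print max(abs(a - b) for a, b in zip(naive_dft(x), ct_step(x, 3, 5)))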

Planning a Generalized Cooley-Tukey FFT

Disregarding the multiplication by twiddle factors and the matrix transpositions, the algorithm turns a calculation which requires a time proportional to N^2 = P^2·Q^2, into one which requires a time proportional to P·Q·(P+Q). This very simplistic complexity analysis suggests that the factoring of the size of the original signal into two multiplicative factors isn’t arbitrary, and will affect the performance of the algorithm. And in principle a split into two similarly sized factors, P ~ Q ~ √N, should, although often doesn’t, outperform any other approach.

In many practical applications a small P or Q is used, which is then called the radix. If the smaller DFT is in the outer loop, i.e. if Q is the radix, the scheme is termed decimation in time (DIT), while if it is in the inner loop, i.e. if P is the radix, we have decimation in frequency (DIF), which explains the long name of the simplest Cooley-Tukey algorithm described earlier.

The splitting of the input size, N, into the two multiplicative factors, P and Q, can be done in d(N) – 2 different ways, where d(N) is the divisor function, i.e. the number of divisors of N. If we use the same algorithm recursively, the possible paths to the solution can grow very fast, and determining the optimal one is anything but easy. It is in any case a point where we don’t want to make excessive assumptions, and which should be left sufficiently open for user experimentation. So I’ve decided to copy FFTW‘s approach: the first step to perform an FFT is to plan it. In our case, a plan is a data structure which is similar to a tree, holding the information of the factorizations chosen, the function that should be called to perform the next step, and references to those next steps. The following picture sketches two possible plans for a DFT of size 210 = 2·3·5·7.

In line with the image, I have coded a planning function that gives two choices to the user:

  • decimation in time (dit=True) vs. decimation in frequency (dif=True)
  • minimum possible radix (min_rad=True) vs. maximum possible radix (max_rad=True)

Note that the standard radix 2 with decimation in time plan would result from setting dit=True and min_rad=True when the input size is a power of two, and that it would be the default behavior of this planner.

@nrp_base.performance_timer
def plan_fft(x, **kwargs) :
    """
    Returns an FFT plan for x.

    The plan uses direct DFT for prime sizes, and the generalized CT algorithm
    with either maximum or minimum radix for others.

    Keyword Arguments
     dif     -- If True, will use decimation in frequency. Default is False.
     dit     -- If True, will use decimation in time. Overrides dif. Default
                is True.
     max_rad -- If True, will use the largest possible radix. Default is False.
     min_rad -- If True, will use smallest possible radix. Overrides max_rad.
                Default is True.
               
    """
    dif = kwargs.pop('dif', None)
    dit = kwargs.pop('dit', None)
    max_rad = kwargs.pop('max_rad', None)
    min_rad = kwargs.pop('min_rad', None)
    
    if dif is None and dit is None :
        dif = False
        dit = True
    elif dit is None :
        dit = not dif
    else :
        dif = not dit

    if max_rad is None and min_rad is None :
        max_rad = False
        min_rad = True
    elif min_rad is None :
        min_rad = not max_rad
    else :
        max_rad = not min_rad
        
    N = len(x)
    plan = []

    if min_rad :
        fct = sorted(nrp_numth.naive_factorization(N))
        NN = N
        ptr = plan
        for f in fct :
            NN //= f
            P = [f, dft, None]
            Q = [NN, fft_CT, []]
            if dit :
                ptr += [Q, P]
            else :
                ptr += [P, Q]
            if NN == fct[-1] :
                Q[1:3] = [dft, None]
                break
            else :
                ptr = Q[2]
    else :
        items ={}
        def follow_tree(plan, N, fct, **kwargs) :
            if fct is None:
                fct = nrp_numth.naive_factorization(N)
            divs = sorted(nrp_numth.divisors(fct))
            val_P = divs[(len(divs) - 1) // 2]
            P = [val_P]
            if val_P in items :
                P = items[val_P]
            else :
                fct_P = nrp_numth.naive_factorization(val_P)
                if len(fct_P) > 1 :
                    P += [fft_CT, []]
                    follow_tree(P[2], val_P, fct_P, dit = dit)
                else :
                    P +=[dft, None]
                items[val_P] = P
            val_Q = N // val_P
            Q = [val_Q]
            if val_Q in items :
                Q = items[val_Q]
            else :
                fct_Q = nrp_numth.naive_factorization(val_Q)
                if len(fct_Q) > 1 :
                    Q += [fft_CT, []]
                    follow_tree(Q[2], val_Q, fct_Q, dit = dit)
                else :
                    Q +=[dft, None]
                items[val_Q] = Q
            if dit :
                plan += [Q, P]
            else :
                plan += [P, Q]

        follow_tree(plan, N, None, dit = dit)

    return plan

While the planner can probably be further optimized, especially for max_rad=True, with all the repeated factorizations, this shouldn’t be the critical part of the DFT calculation. Especially so since the plan can be calculated in advance, and reused for all inputs of the same size.

Twiddle Factors Reloaded

As already discussed in previous posts, one of the simplest ways of trimming down the computation time is to store the twiddle factors, which are almost certain to show up again, and reuse them whenever possible, rather than going through complex exponentiations over and over again. There is a slight difference though with regard to the radix 2 case. There, when computing a size N FFT, the twiddle factors that intervened in the calculations were all those with numerators in the exponent (apart from the ±2·π·i) ranging from 0 to N/2 – 1, inclusive. If N is split as P·Q, rather than as 2·N/2, then the largest numerator that will appear will be (P-1)·(Q-1), which is definitely larger than N/2, but not all of the items up to that value will be required, with many of them missing.

Take for instance a DFT of size 512. If we go for the radix 2 FFT, we will in total use 256 different twiddle factors, ranging from 0 to 255. If rather than that we go for a max_rad=True approach, the first split will be 512 = 16·32, and so the largest twiddle factor will have numerator 15·31 = 465. But with all the intermediate twiddle factors not showing up in the calculation, overall only 217 different ones are needed.

It is also interesting to note that the same twiddle factors that are used in the FFTs may appear in the DFTs of the leaves of the planning tree. So it is worth modifying the current dft function, so that it can accept precalculated twiddles. Because the twiddle factor sequence will have gaps, a list is no longer the most convenient data structure to hold these values, a dictionary being a better fit for the job. Combining all of these ideas, we can rewrite our function as

@nrp_base.performance_timer
def dft(x, **kwargs) :
    """
    Computes the (inverse) discrete Fourier transform naively.

    Arguments
     x         - The list of values to transform

    Keyword Arguments
     inverse   - True to compute an inverse transform. Default is False.
     normalize - True to normalize inverse transforms. Default is same as
                 inverse.
     twiddles  - Dictionary of twiddle factors to be used by function, used
                 for compatibility as last step for FFT functions, don't set
                 manually!!!

    """
    inverse = kwargs.pop('inverse',False)
    normalize = kwargs.pop('normalize',inverse)
    N = len(x)
    inv = -1 if not inverse else 1
    twiddles = kwargs.pop('twiddles', {'first_N' : N, 0 : 1})
    NN = twiddles['first_N']
    X =[0] * N
    for k in xrange(N) :
        for n in xrange(0,N) :
            index = (n * k * NN // N) % NN
            twid= twiddles.setdefault(index,
                                      math.e**(inv*2j*math.pi*index/NN))
            X[k] += x[n] * twid
        if inverse and normalize :
             X[k] /= NN
    return X
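
As a quick sanity check of the rewritten function (magnitudes rounded, since the complex exponentials leave tiny numerical residues):

>>> X = dft([1.0, 2.0, 3.0, 4.0])
>>> [round(abs(v), 6) for v in X]
[10.0, 2.828427, 2.0, 2.828427]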

Note also the changes to the user interface, with the addition of the normalize keyword argument. This has been introduced to correct a notorious bug in the inverse DFT calculations for the radix 2 FFT algorithm. The changes are implemented and can be checked in the module nrp_dsp.py, although they are very similar in nature to what will shortly be presented for the generalized factorization FFT.

Putting it all Together

We now have most of the building blocks to put together an FFT function using Cooley-Tukey’s algorithm for general factorizations, a possible realization of which follows:

@nrp_base.performance_timer
def fft_CT(x, plan,**kwargs) :
    """
    Computes the (inverse) DFT using Cooley-Tukey's generalized algorithm.
    
    Arguments
     x        - The list of values to transform.
     plan     - A plan data structure for FFT execution.

    Keyword Arguments
     inverse  - if True, computes an inverse FFT. Default is False.
     twiddles - dictionary of twiddle factors to be used by function, used
                for optimization during recursion, don't set manually!!!
                                
    """
    inverse = kwargs.pop('inverse', False)
    inv = 1 if inverse else -1
    N = len(x)
    twiddles = kwargs.pop('twiddles', {'first_N' : N, 0 : 1})
    NN = twiddles['first_N']
    X = [0] * N
    P = plan[0]
    Q = plan[1]
    # do inner FFT
    for q in xrange(Q[0]) :
        X[q::Q[0]] = P[1](x[q::Q[0]], plan=P[2], twiddles=twiddles,
                          inverse=inverse,normalize=False)
    # multiply by twiddles
    for q in xrange(Q[0]) :
        for s in xrange(P[0]) :
            index = q * s * NN // N
            
            mult = twiddles.setdefault(index,
                                       math.e**(inv*2j*math.pi*index/NN))
            X[Q[0]*s+q] *= mult
    # do outer FFT
    for s in xrange(P[0]) :
        X[Q[0]*s:Q[0]*(s+1)] = Q[1](X[Q[0]*s:Q[0]*(s+1)],plan=Q[2],
                                    twiddles=twiddles,inverse=inverse,
                                    normalize=False)
    # transpose matrix
    ret = [0] * N
    for s in xrange(P[0]) :
        ret[s::P[0]] = X[s*Q[0]:(s+1)*Q[0]]
    if inverse and N == NN : # normalize inverse only in first call
        ret = [val / N for val in ret]
    return ret

For power of two sizes, this algorithm is noticeably slower, by a factor of more than 2 depending on the input size, than the hardcoded radix 2 prior version. This is mostly due to the time spent doing the matrix transposition in the last step. Also due to the transposition, minimum radix plans usually outperform maximum radix ones, despite the benefits in FFT and twiddle factor computations the latter present over the former. Of course none of this matters if the input size presents several prime factors other than two, in which case the new algorithm clearly outperforms the radix 2 version, which is then close to the DFT version. Still, it is worth noting for later optimizations that the transposition should be done in a more efficient fashion.


Signature Preserving Function Decorators

May 25, 2009

(The code examples in this post are also available here.)

In the comments to the post on the greatest common divisor, Steve pointed out that the code to time function performance could be embedded in the code in a neater way using function decorators. At the time, I confess I did not have a clue of what a decorator is. But I’ve been doing my homework, and so here’s my take on the subject…

So what the @#$%& is a decorator?

Many moons ago, when I ditched my ZX Spectrum for an 80286 PC (with an i287 math coprocessor), and moved from BASIC to C, I eventually found out that there were such things as pointers to functions. What a sick thing to do to functions! What were these Kernighan and Ritchie guys!? Perverts!? So, still mourning for my lost GOTOs and GOSUBs, I dismissed the whole subject as pornographic, and decided not to learn it on moral grounds. Well, I’ve come a long way since then, am much more open-minded now, and have even developed a keen appreciation for errr… adult entertainment. So, bring in the syntactic porn sugar!!!

The basic idea is that when you write something like…

@my_decorator
def decorated_function(a, b, c) :
    <function definition goes here>

…it is evaluated the same as if you had written…

def decorated_function(a, b, c) :
    <function definition goes here>
decorated_function = my_decorator(decorated_function)

If this last assignment has sent you into stupor, that is simply the sweet way in which python’s indirection handles function pointers, so all there is to it is that my_decorator has to be a function (or a callable object, i.e. one with a __call__ method) that takes a function as its only argument, and returns a function (or a callable object). This returned function has to be compatible with the arguments of decorated_function, since calls to the latter will actually call the former.

You can even raise the bar by passing arguments to the decorator, but the basic scheme is the same: if you write this…

@my_decorator(dec_a, dec_b, dec_c)
def decorated_function(a, b, c) :
    <function definition goes here>

…python does the same as if you had written this…

def decorated_function(a, b, c) :
    <function definition goes here>
decorated_function = my_decorator(dec_a, dec_b, dec_c)(decorated_function)

…where my_decorator(dec_a, dec_b, dec_c) should return a function, possibly different depending on the passed parameters, which takes a function as a parameter and returns yet another function, which had better be compatible with the arguments of decorated_function, for the same reasons as before.

Decorators can even be nested, but the functioning is quite predictable as well. You write this…

@one_decorator(one_args)
@two_decorators(two_args)
def decorated_function(a, b, c) :
    <function definition goes here>

…and python won’t know the difference from doing this…

def decorated_function(a, b, c) :
    <function definition goes here>
decorated_function = one_decorator(one_args)(two_decorator(two_args)(decorated_function))

…and by now you should be able to figure out what is being returned by what.

If you are still in distress, it may be a good idea to stop here and first read David‘s great article at Siafoo, or Bruce Eckel‘s series (1, 2, 3) at Artima. To see them suckers at work, check the Python Decorator Library. There’s also the option to drink from the pristine springs of PEP 318, not that I recommend it to the uninitiated, though.

A Decorator to Time Function Execution

So let’s learn by doing, and build a decorator to time the performance of the decorated function, trying to mimic as much as possible the approach I have been following up till now: the last of the function’s arguments is verbose, which defaults to False, and setting it to True activates printout of the time required by the function to run. This is possible, but messy. On the other hand, what can more easily be achieved is to control the timing with a keyword argument, as we will implement…

I have added more goodies… I learnt from the folks at Leonhard Euler’s Flying Circus that, while time.clock has better resolution than time.time under Windows, it is the opposite on Linux, so we will choose our timing function at run time, through os.name, which holds a string with the operating system your computer is running.

from __future__ import division
from os import name as OP_SYS
import time

def performance_timer_1(_func_) :
    """
    A decorator function to time execution of a function.

    Timing is activated by setting the verbose keyword to True.
    """

    stopwatch = time.time
    if OP_SYS in ('nt', 'dos') :
        stopwatch = time.clock

    def _wrap_(*args, **kwargs) :
        """
        This is a function wrapper.
        """
        
        verbose = kwargs.pop('verbose',False)
        t = stopwatch()
        ret = _func_(*args, **kwargs)
        t = stopwatch() - t
        if verbose :
            print "Call to",_func_.__name__,"took",t,"sec."
        return ret

    return _wrap_

Run this file on IDLE, or import it into an interactive python session, and you can do things like these:

>>> @performance_timer_1
def decorated_sum(a,b) :
    """
    This is a function.
    """
    return a+b

>>> decorated_sum(3,5)
8

>>> decorated_sum(3,5,verbose=True)
Call to decorated_sum took 2.72380986743e-06 sec.
8

If you are wondering about the silly docstrings in the function and the wrapper, try the following:

>>> help(decorated_sum)
Help on function _wrap_:

_wrap_(*args, **kwargs)
This is a function wrapper.

Now that’s odd, isn’t it? Well, not really, as it is a natural byproduct of the devilish practice of assigning functions to functions: we are getting help for the function with which the decorator replaced the original function during decoration. Part of it is easy to fix, though, as the following code demonstrates…

import inspect

def performance_timer_2(_func_) :
    """
    A decorator function to time execution of a function.

    Timing is activated by setting the verbose keyword to True.
    """

    stopwatch = time.time
    if OP_SYS in ('nt', 'dos') :
        stopwatch = time.clock

    def _wrap_(*args, **kwargs) :
        """
        This is a function wrapper.
        """
        
        verbose = kwargs.pop('verbose',False)
        t = stopwatch()
        ret = _func_(*args, **kwargs)
        t = stopwatch() - t
        if verbose :
            print "Call to",_func_.__name__,"took",t,"sec."
        return ret
    _wrap_.__name__ = _func_.__name__
    _wrap_.__doc__ = _func_.__doc__
    return _wrap_

If you try the same thing again, this time you will get…

>>> @performance_timer_2
def decorated_sum(a,b) :
    """
    This is a function.
    """
    return a+b

>>> help(decorated_sum)
Help on function decorated_sum:

decorated_sum(*args, **kwargs)
This is a function.

We’re almost there!!! There is just one disturbing thing remaining: the function signature. See, if we had invoked help on the undecorated decorated_sum, we would have got decorated_sum(a, b), not the generic arguments and keyword arguments thing.

Preserving the Function Signature

There are ways to get around it, but things are going to get messy. The primal source is Michele Simionato‘s decorator module. Most of what I will present in the rest of this post is shamelessly ripped off from (ahem, inspired by) his code. You can check the documentation, or the last section of David’s Siafoo article, if you want to use that module.

There are two problems though… The first is of course having to rely on a third party module which isn’t part of the standard library: the rules forbid it!

But there is also a more fundamental issue: we would like to preserve the original function signature, but at the same time we are changing it!!! Going back to the prior example, if we preserved the signature unaltered as decorated_sum(a, b), we would get a TypeError: decorated_sum() got an unexpected keyword argument 'verbose' every time we tried to activate the timer. Trust me, I found out the hard way…

What we would like to get is something like decorated_sum(a, b, **kwargs). To achieve something like this we are going to have to dynamically generate a function at run-time…

def performance_timer_3(_func_) :
    """
    A decorator function to time execution of a function.

    Timing is activated by setting the verbose keyword to True.
    """
    
    stopwatch = time.time
    if OP_SYS in ('nt', 'dos') :
        stopwatch = time.clock

    def _wrap_(*args, **kwargs) :
        """
        This is a function wrapper.
        """
        
        verbose = kwargs.pop('verbose',False)
        t = stopwatch()
        ret = _func_(*args, **kwargs)
        t = stopwatch() - t
        if verbose :
            print "Call to",_func_.__name__,"took",t,"sec."
        return ret
        
    sig = list(inspect.getargspec(_func_))
    wrap_sig = list(inspect.getargspec(_wrap_))
    if not sig[2] :
        sig[2] = wrap_sig[2]
    src =  'def %s%s :\n' %(_func_.__name__, inspect.formatargspec(*sig))
    call_sig = list(sig)
    call_sig[3] = None # keep the defaults out of the pass-through call
    src += '    return _wrap_%s\n' % (inspect.formatargspec(*call_sig))
    evaldict = {'_wrap_' : _wrap_}
    exec src in evaldict
    ret = evaldict[_func_.__name__]
    ret.__doc__ = _func_.__doc__
    return ret

If you are not familiar with them, you may want to check the documentation for the inspect module, as well as for the exec statement, to get a hang of what’s going on. It is important to note how the _wrap_ function is not simply named in the dynamically generated source code, but included in evaldict. If we didn’t do it this way, it would be out of scope once we left the decorator, and an error would occur. Again, you can trust me on this one…
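
Here is a minimal illustration of that scoping point, with a stand-in _wrap_ (nothing here comes from the actual decorator): the generated function can only resolve names that live in the dictionary it was exec'd into.

>>> namespace = {'_wrap_' : lambda *args, **kwargs : sum(args)}
>>> exec "def f(a, b) :\n    return _wrap_(a, b)\n" in namespace
>>> namespace['f'](3, 5)
8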

But the thing is we have finally arrived where we wanted to…

>>> @performance_timer_3
def decorated_sum(a,b) :
    """
    This is a function.
    """
    return a+b

>>> help(decorated_sum)
Help on function decorated_sum:

decorated_sum(a, b, **kwargs)
This is a function.

So every time you want to do a function decorator with a wrapper activated by a keyword, all you need to do is take this last example as a template, and with some cutting and pasting you’ll be done in no time.

Now wait a minute!!! Wasn’t avoiding this kind of thing what decorators were supposed to be all about? Didn’t we get into this mess to avoid cutting and pasting a couple of lines of code to the beginning and end of every function? What if we built a decorator that automated the decorating process?

def signature_decorator(_wrap_, has_kwargs = False) :
    """
    A decorator to create signature preserving wrappers.

    _wrap_     : the wrapper function, which must take a function as first
                 argument, can then take several optional arguments, which
                 must have defaults, plus the *args and **kwargs to pass to
                 the wrapped function.
    has_kwargs : should be set to True if the wrapper also accepts keyword
                 arguments to alter operation.
    """
    def decorator(func):
        """
        An intermediate decorator function
        """
        sig = list(inspect.getargspec(func))
        wrap_sig = list(inspect.getargspec(_wrap_))
        sig[0], wrap_sig[0] = sig[0] + wrap_sig[0][1:], wrap_sig[0] + sig[0]
        wrap_sig[0][0] = '_func_'
        sig[3] = list(sig[3]) if sig[3] is not None else []
        sig[3] += list(wrap_sig[3]) if wrap_sig[3] is not None else []
        sig[3] = tuple(sig[3]) if sig[3] else None
        wrap_sig[3] = None
        if sig[2] is None and has_kwargs:
            sig[2] = wrap_sig[2]
        wrap_sig[1:3] = sig[1:3]
        src = 'def %s%s :\n' % (func.__name__, inspect.formatargspec(*sig))
        src += '    return _wrap_%s\n' % (inspect.formatargspec(*wrap_sig))
        evaldict = {'_func_':func, '_wrap_': _wrap_}
        exec src in evaldict
        ret = evaldict[func.__name__]
        ret.__doc__ = func.__doc__
        return ret
    return decorator
    
def timer_wrapper_1(_func_, verbose = False, *args, **kwargs) :
    """
    A wrapper to time functions, activated by the last argument.

    For use with signature_decorator
    """
    stopwatch = time.time
    if OP_SYS in ('nt','dos') :
        stopwatch = time.clock
    t = stopwatch()
    ret = _func_(*args,**kwargs)
    t = stopwatch() - t
    if verbose :
        print "Call to",_func_.__name__,"took",t,"sec."
    return ret

def timer_wrapper_2(func, *args, **kwargs) :
    """
    A wrapper to time functions, activated by a keyword argument.

    For use with signature_decorator
    """
    stopwatch = time.time
    if OP_SYS in ('nt','dos') :
        stopwatch = time.clock
    verbose = kwargs.pop('verbose',False)
    t = stopwatch()
    ret = func(*args,**kwargs)
    t = stopwatch() - t
    if verbose :
        print "Call to",_func_.__name__,"took",t,"sec."
    return ret                                     

Do note that the wrappers are now defined outside the decorator, and that they must take the wrapped function as first argument. Do note as well the decorator’s two parameters, the first being the wrapper, the second whether the wrapper takes keyword arguments or not. To get a hang of how this works try the following…

>>> @signature_decorator(timer_wrapper_1)
def decorated_sum(a,b=2) :
    """
    This is a function.
    """
    return a+b

>>> help(decorated_sum)
Help on function decorated_sum:

decorated_sum(a, b=2, verbose=False)
This is a function.

>>> @signature_decorator(timer_wrapper_2,True)
def decorated_sum(a,b=2) :
    """
    This is a function.
    """
    return a+b

>>> help(decorated_sum)
Help on function decorated_sum:

decorated_sum(a, b=2, **kwargs)
This is a function.

There are a million things that could go wrong with this last decorator, especially with conflicting variable or keyword names in the function and the wrapper. If you check the decorator module source code, you can find ways to build robustness to such issues. As for myself, a week’s worth of sleep deprivation is enough: I already know more about decorators than I ever thought possible or wished. If you are curious, you can check the final, a little more elaborate, version of the performance_timer decorator I’m sticking to, which does not use this last signature_decorator refinement, in nrp_base.py.


Home Sweet Home

May 25, 2009

The move to wordpress is now completed… All the relevant content that was in blogspot is now here for your enjoyment. The importing was smooth, but I had to do a lot of reformatting by hand, and with python’s sensitivity to whitespace errors, I can’t guarantee that I haven’t messed anything up. If you find anything broken, let me know and I’ll fix it.

Once I got started, I got a little carried away, and have done some more housekeeping:

  • There now is a Coming Soon page on the top menu.
  • There is also a Post Index there, which should provide a convenient way to access the content.

And the big news is that I have set up a Mercurial repository at bitbucket, here’s the link. This should provide two very convenient features for the future:

  • No longer will anyone wanting to try out some of the code I post have to cut and paste from several older posts to get the functions running, as any dependencies will now be handled through a few modules stored in the repository.
  • For future blog entries, rather than cutting and pasting from the posted code, a single file containing all the code will also be available at bitbucket.

Lastly, I am starting to consider moving on to Python 3.0 to be a wise choice. If there are any regular readers out there with an opinion on this, your comments are more than welcome.


Divisors

May 13, 2009

When I wrote the first post of the FFT series I confessed it was out of boredom from number theory. That statement got me into a coffee break discussion with a friend and mathematician, who claimed that the FFT is as number theory as it gets, and went on to explain why. Since I usually don’t understand a word of what he is saying when he gets started on one of these rants, which invariably lead to Riemann’s zeta function, I just pretended to acknowledge his doubts, and then forgot about the conversation altogether.

Well, a few weeks down the road it seems he was right: while trying to make my FFT code work, I have found myself dealing with all sorts of number theoretical stuff, ranging from simple topics, like factorization, to arcane ones, like the Chinese remainder theorem. And since I have this thing about complete expositions, I’m afraid that this interleaving of apparently unrelated number theory posts with the FFT ones is going to be the norm. As a matter of fact the odd post on the Euclidean algorithm was supposed to be a stepping stone to an improved radix 2 FFT, which I finally managed to do without.

But let’s get going with today’s subject: computing the divisors of a positive integer. It may not seem a very sexy topic at first glance, but there’s beauty in it, as the striking Divisor Plot website put together by Jeffrey Ventrella proves…

Getting into the algorithmic treatment of the problem, there is of course the dumb approach, along the lines of “to compute all divisors of a number, do trial division by all numbers smaller than it.” I will not pursue that line of thinking though, and will focus on computing the divisors once the prime factorization of the number is known, which is the efficient way of dealing with this problem.

So I will assume that we know the expansion of the number into its prime factors as

\displaystyle n = \prod_{i=1}^{\omega(n)} p_i^{a_i},

where the function ω(n) represents the number of distinct prime factors, p_i, each appearing with the corresponding exponent, a_i.

A divisor will have a subset of the prime factors of n as its own factors, so we could represent any divisor as a list of ω(n) numbers, the i-th ranging from 0 up to and including a_i. This scheme makes it very easy to calculate the total number of divisors as

\displaystyle d(n)=\prod_{i=1}^{\omega(n)} (a_i+1).
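
As a tiny illustration of this formula, here's a throwaway helper, assuming the factorization is handled as a plain list of prime factors, as elsewhere in this blog:

def count_divisors(factors) :
    """
    Number of divisors from a list of prime factors, e.g. [2, 2, 3, 3, 3, 5].
    """
    d = 1
    for p in set(factors) :
        d *= factors.count(p) + 1
    return d

print count_divisors([2, 2, 3, 3, 3, 5])   # 540 = 2**2 * 3**3 * 5, prints 24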

Unordered Generation of Divisors

Furthermore, it points out that the computation of all the divisors could be achieved using ω(n) nested loops. Of course the number of loops to nest will vary from input to input, so recursion is probably the best of choices, or so thinks Tim Peters, as the code below has only cosmetic modifications to what he posted on the Python-list@python.org mailing list:

def divisors(factors) :
    """
    Generates all divisors, unordered, from the prime factorization.
    """
    ps = sorted(set(factors))
    omega = len(ps)

    def rec_gen(n = 0) :
        if n == omega :
            yield 1
        else :
            pows = [1]
            for j in xrange(factors.count(ps[n])) :
                pows += [pows[-1] * ps[n]]
            for q in rec_gen(n + 1) :
                for p in pows :
                    yield p * q

    for p in rec_gen() :
        yield p

This approach, while extremely fast, is not much use if you want to produce the divisors sorted, as you will then have to first generate and store all divisors, then sort them.
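
For instance, the 24 divisors of 540 = 2^2·3^3·5, passed as the factor list [2, 2, 3, 3, 3, 5] and sorted after generation:

>>> sorted(divisors([2, 2, 3, 3, 3, 5]))
[1, 2, 3, 4, 5, 6, 9, 10, 12, 15, 18, 20, 27, 30, 36, 45, 54, 60, 90, 108, 135, 180, 270, 540]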

Ordered Generation of Divisors

If what you are after is not the full set of divisors, or any arbitrary subset of them, but something like “the first divisor larger than x” or “the last divisor smaller than x,” it may be worth trying other algorithms which, while being less efficient, manage to produce the divisors properly sorted.

There’s little literature on the subject, so it’s hard to find a valid benchmark. Even PARI/GP, which is the gold standard in computational number theory, seems, from the little sense I have managed to make out of their Subversion repository, to be generating all divisors (in a similar manner to above) and then sorting them.

So here’s a calculation scheme that can be effectively modified to yield ordered divisors… We start with a pivot value of 1, and an ordered list of all prime factors. Before every factor we split our flow in two:

  • one branch skips the factor, and proceeds to the first factor different from the current one, keeping the pivot unchanged
  • the other branch first updates the pivot, multiplying it by the factor, yields the result, and then proceeds to the next factor, with the modified pivot.

The flow is probably easier to understand with the example below, showing the branching when computing the divisors of 540 = 22·33·5. Blue arrows skip to the first different factors, red ones go to the next factors, a line going through a factor implies multiplication by it…

The branching of the algorithm to generate all the divisors of 540 in order.

To keep track of this structure, I’m going to get fancy, and will build a tree-like data structure, where each item will be a list, holding the factor in the first position, and pointers, i.e. nested lists, to the next elements in the second and third ones. Again, to make things clear, here’s the data structure for 540, using the same color code as before…

The data structure holding the prime factors of 540, arranged for ordered calculation of the divisors.

Rather than traversing this structure recursively, we will approach it in a more structured way, by keeping a separate list of factors that can be visited in a next step. In the example above, when we start out, we can visit the first factor (a 2), or skip to the third (the first 3), so the list will hold two elements. If we select going to the first 2, we then have two new options, i.e. going on to the second 2, or taking the shortcut to the first 3. Note that the two instances pointing at the first 3 are different, since one has a pivot value of 1 (and will thus generate the divisor 3), while the other has it set to 2 (and will generate the divisor 6).

If we don’t care about the order of the divisors, this method can even be implemented by hand, by keeping a list of pivots and remaining factors, and on each step erasing the first element of the list and generating a divisor, plus adding up to two new items at the bottom. The first steps, plus the resulting list of divisors, are worked out below, again for the case of n = 540.

Working out the divisors of 540

Now, while on paper there is not much choice other than processing the items sequentially, in code we could always process the element that produces the smallest divisor, and thus produce them in order. The simplest implementation of this idea would simply use a sort on the list of processable items before yielding the next divisor…

def divisors_ordered(factors) :
    """
    Generates all divisors, ordered, from the prime factorization
    """
    factors =  [(j,factors.count(j)) for j in sorted(set(factors))]
    omega = len(factors)
    fcts =[factors[0][0],[],[]]
    aux = [fcts, None]
    for j in xrange(omega) :
        p,a = factors[j]
        if omega - j > 1 :
            aux[1] = [factors[j+1][0],[],[]]
        else :
            aux[1] = None
        for k in xrange(1, a) :
            b = [p,[],[]]
            aux[0][1] = b
            aux[0][2] = aux[1]
            aux[0] = b
        aux[0][1] = aux[1]
        aux[0][2] = aux[1]
        aux[0] = aux[1]
    div_tree = [[fcts[0],1, fcts]]
    yield 1
    while div_tree :
        div_tree.sort()
        divisor, pivot, pointer = div_tree.pop(0)
        yield divisor
        if pointer[1] :
            div_tree += [[divisor * pointer[1][0] , divisor, pointer[1]]]
        if pointer[2] :
            div_tree += [[pivot * pointer[2][0], pivot, pointer[2]]]

Of course doing a full sort before yielding every divisor kills the efficiency of this algorithm. The fancy (or should I simply say correct?) way of keeping an ordered list, popping the smallest item only, and pushing arbitrary elements, is with a priority queue, which I will implement as a binary heap. Since we will be building our heap from scratch, we only need to worry about push and pop functions.

def push_heap(item, heap = None, compare = cmp) :
    """
    Pushes an item into a min-heap, or creates one.
    """
    heap_size = None
    if not heap :
        heap = [item, 1]
    else :
        heap_size = heap.pop()
        heap += [item, heap_size + 1]
    i = heap[-1] - 1
    val = heap[i]
    heap_ordered = False
    while i > 0 and not heap_ordered :
        n_i = i - 1
        if n_i & 1 : # faster than n_i % 2
            n_i -= 1
        n_i >>= 1 # faster than n_i // 2
        n_val = heap[n_i]
        if compare(val, n_val) < 0 :
            heap[i], heap[n_i] = heap[n_i], heap[i]
            i, val = n_i, heap[n_i]
        else :
            heap_ordered = True
    return heap

def pop_heap(heap, compare = cmp) :
    """
    Pops the smallest item from a min-heap.
    """
    heap_size = heap.pop()
    if not heap_size:
        return None
    elif heap_size == 1 :
        return heap.pop()
    ret, heap[0] = heap[0], heap.pop()
    heap_size -= 1
    heap += [heap_size]
    i = 0
    heap_ordered = False
    while not heap_ordered :
        n_i = 2 * i + 1
        if n_i < heap_size :
            if n_i + 1 < heap_size and compare(heap[n_i],heap[n_i + 1]) > 0 :
                n_i += 1
            if compare(heap[i], heap[n_i]) > 0 :
                heap[i], heap[n_i] = heap[n_i], heap[i]
                i = n_i
            else :
                heap_ordered = True
        else :
            heap_ordered = True
    return ret

def divisors_ordered(factors) :
    """
    Generates an ordered list of all divisors
    """
    factors =  [(j,factors.count(j)) for j in sorted(set(factors))]
    omega = len(factors)
    fcts =[factors[0][0],[],[]]
    aux = [fcts, None]
    for j in xrange(omega) :
        p,a = factors[j]
        if omega - j > 1 :
            aux[1] = [factors[j+1][0],[],[]]
        else :
            aux[1] = None
        for k in xrange(1, a) :
            b = [p,[],[]]
            aux[0][1] = b
            aux[0][2] = aux[1]
            aux[0] = b
        aux[0][1] = aux[1]
        aux[0][2] = aux[1]
        aux[0] = aux[1]
    yield 1
    div_heap = push_heap([fcts[0],1, fcts])
    while div_heap :
        divisor, pivot, pointer = pop_heap(div_heap)
        yield divisor
        if pointer[1] :
            div_heap = push_heap([divisor * pointer[1][0] , divisor, pointer[1]], div_heap)
        if pointer[2] :
            div_heap = push_heap([pivot * pointer[2][0], pivot, pointer[2]], div_heap)
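
And a usage sketch (assuming the definitions above), including the kind of early-exit query that motivates the ordered generator:

>>> from itertools import islice, dropwhile
>>> list(islice(divisors_ordered([2, 2, 3, 3, 3, 5]), 8))
[1, 2, 3, 4, 5, 6, 9, 10]
>>> dropwhile(lambda d : d <= 100, divisors_ordered([2, 2, 3, 3, 3, 5])).next()
108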

The following graph shows the time required by each of the three implementations presented to produce an ordered list of all divisors of pn#, i.e. the product of the first n prime numbers. Do notice how the lower computational complexity of the heap sorting eventually beats the initial speed advantage of the sorting version. Still, for full lists of divisors, generating all of them and then sorting beats any other approach. It is also worth pointing out how the time required for the heap version of the ordered algorithm eventually settles around being 14 times slower than the unordered generator with delayed sorting, i.e. the computational complexities of both are the same, which probably means that sorting dominates the time of the unordered algorithm.

Time to compute all ordered divisors of the first few primorials.

Just to show why we could want to use an algorithm that is more than ten times slower, the following graph shows the time needed to compute the first divisor larger than 1000 for the first few primorials. While the time for the heap version of the ordered algorithm is close to constant, the unordered algorithm shows the same behavior as in the previous test.

Time to compute the first divisor larger than 1000 for the first few primorials.

It is nevertheless worth noting that the break-even point is p_11# = 200,560,490,130, so unless you are after the first few divisors of a huge number, the first algorithm presented should always be the choice.

