The Cooley-Tukey FFT Algorithm for General Factorizations

May 29, 2009

(All code examples in the post have been included in the nrp_base.py module, which can be downloaded from this repository.)

As presented in the previous post, Cooley-Tukey’s FFT algorithm has a clear limitation: it can only be used to speed up the calculation of DFTs whose size is a power of two. It isn’t hard, though, to extend the same idea to a more general factorization of the input signal size. We start again from the formula that defines the DFT,

\displaystyle X_k = \sum_{n=0}^{N-1}{ x_n e^{-2\pi i kn/N}}

Say that N = P·Q. Following the previous approach, we could try to split the calculation of the DFT into Q DFTs of size P, by rewriting it as

\displaystyle X_k = \sum_{q=0}^{Q-1}{ e^{-2\pi i \frac{kq}{N}} \sum_{p=0}^{P-1}{x_{Qp+q} e^{-2\pi i \frac{kp}{P}}}}

where n = Q·p + q. We are again faced with the same issue when trying to turn the inner summation into a proper DFT: while p ranges from 0 to P-1, k goes all the way up to N-1. To get around this, we can write k = P·r + s, and substituting into the previous formula we get

\displaystyle X_{Pr+s} = \sum_{q=0}^{Q-1}{ e^{-2\pi i \frac{qr}{Q}} e^{-2\pi i \frac{qs}{N}} \sum_{p=0}^{P-1}{ x_{Qp+q} e^{-2\pi i \frac{ps}{P}}}}

where the inner summation has been reduced to a standard DFT of size P. Furthermore, splitting the first exponential into two factors reveals that the outer summation is also a standard DFT, this one of size Q, applied to the product of the size P DFTs with the twiddle factors.

The subscripts in the above formula can get confusing to follow, so take note that q and r go from 0 to Q-1, while p and s go from 0 to P-1. The computation of the size N DFT has therefore effectively been split into the computation of Q DFTs of size P, followed by twiddle factor multiplication, and then by P DFTs of size Q. It may be helpful to picture the linear array of input data as a P×Q matrix stored in row-major order. Then the Q DFTs of size P are performed on the columns of this matrix, and the P DFTs of size Q on the rows. Note though that the output has to be rearranged, in what is equivalent to a matrix transposition, before getting to the final answer. As an example, the following picture illustrates the process on an input of size 15.

The operations in a step of a generalized factorization Cooley-Tukey FFT algorithm.
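To make the index gymnastics concrete, here is a quick numerical check of the above formula for N = 15 = 3·5 against a naive DFT. This is just a standalone illustration, not part of nrp_base.py:

import cmath, math

def naive_dft(x) :
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in xrange(N)) for k in xrange(N)]

P, Q = 3, 5
N = P * Q
x = [float(j) for j in xrange(N)]
X = [0] * N
for r in xrange(Q) :
    for s in xrange(P) :
        val = 0
        for q in xrange(Q) :
            # inner DFT of size P over the subsequence x[Qp+q], at frequency s
            inner = sum(x[Q * p + q] * cmath.exp(-2j * math.pi * p * s / P)
                        for p in xrange(P))
            # twiddle factor times the outer, size Q, DFT term at frequency r
            val += (cmath.exp(-2j * math.pi * q * s / N) *
                    cmath.exp(-2j * math.pi * q * r / Q) * inner)
        X[P * r + s] = val

print max(abs(a - b) for a, b in zip(X, naive_dft(x))) # should be tiny, ~1e-13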

Planning a Generalized Cooley-Tukey FFT

Disregarding the multiplication by twiddle factors and the matrix transpositions, the algorithm turns a calculation which requires a time proportional to N² = P²·Q², into one which requires a time proportional to P·Q·(P+Q). This very simplistic complexity analysis suggests that the factoring of the size of the original signal into two multiplicative factors isn’t arbitrary, and will affect the performance of the algorithm. In principle a split into two similarly sized factors, P ~ Q ~ √N, should outperform any other approach, although in practice it often doesn’t.

In many practical applications a small P or Q is used, which is then called the radix. If the smaller DFT is in the outer loop, i.e. if Q is the radix, the scheme is termed decimation in time (DIT); if it is in the inner loop, i.e. if P is the radix, we have decimation in frequency (DIF). This explains the long name of the simplest Cooley-Tukey algorithm described earlier.

The splitting of the input size, N, into the two multiplicative factors, P and Q, can be done in d(N) – 2 different ways, where d(N) is the divisor function, i.e. the number of divisors of N. If we use the same algorithm recursively, the possible paths to the solution can grow very fast, and determining the optimal one is anything but easy. It is in any case a point where we don’t want to make excessive assumptions, and which should be left sufficiently open for user experimentation. So I’ve decided to copy FFTW‘s approach: the first step to perform an FFT is to plan it. In our case, a plan is a data structure similar to a tree, holding the information of the factorizations chosen, the function that should be called to perform the next step, and references to those next steps. The following picture sketches two possible plans for a DFT of size 210 = 2·3·5·7.

In line with the image, I have coded a planning function that gives two choices to the user:

  • decimation in time (dit=True) vs. decimation in frequency (dif=True)
  • minimum possible radix (min_rad=True) vs. maximum possible radix (max_rad=True)

Note that the standard radix 2 with decimation in time plan results from setting dit=True and min_rad=True when the input size is a power of two, and that this is the default behavior of this planner.

@nrp_base.performance_timer
def plan_fft(x, **kwargs) :
    """
    Returns an FFT plan for x.

    The plan uses direct DFT for prime sizes, and the generalized CT algorithm
    with either maximum or minimum radix for others.

    Keyword Arguments
     dif     -- If True, will use decimation in frequency. Default is False.
     dit     -- If True, will use decimation in time. Overrides dif. Default
                is True.
     max_rad -- If True, will use the largest possible radix. Default is False.
     min_rad -- If True, will use smallest possible radix. Overrides max_rad.
                Default is True.
               
    """
    dif = kwargs.pop('dif', None)
    dit = kwargs.pop('dit', None)
    max_rad = kwargs.pop('max_rad', None)
    min_rad = kwargs.pop('min_rad', None)
    
    if dif is None and dit is None :
        dif = False
        dit = True
    elif dit is None :
        dit = not dif
    else :
        dif = not dit

    if max_rad is None and min_rad is None :
        max_rad = False
        min_rad = True
    elif min_rad is None :
        min_rad = not max_rad
    else :
        max_rad = not min_rad
        
    N = len(x)
    plan = []

    if min_rad :
        fct = sorted(nrp_numth.naive_factorization(N))
        NN = N
        ptr = plan
        for f in fct :
            NN //= f
            P = [f, dft, None]
            Q = [NN, fft_CT, []]
            if dit :
                ptr += [Q, P]
            else :
                ptr += [P, Q]
            if NN == fct[-1] :
                Q[1:3] = [dft, None]
                break
            else :
                ptr = Q[2]
    else :
        items ={}
        def follow_tree(plan, N, fct, **kwargs) :
            if fct is None:
                fct = nrp_numth.naive_factorization(N)
            divs = sorted(nrp_numth.divisors(fct))
            val_P = divs[(len(divs) - 1) // 2]
            P = [val_P]
            if val_P in items :
                P = items[val_P]
            else :
                fct_P = nrp_numth.naive_factorization(val_P)
                if len(fct_P) > 1 :
                    P += [fft_CT, []]
                    follow_tree(P[2], val_P, fct_P, dit = dit)
                else :
                    P +=[dft, None]
                items[val_P] = P
            val_Q = N // val_P
            Q = [val_Q]
            if val_Q in items :
                Q = items[val_Q]
            else :
                fct_Q = nrp_numth.naive_factorization(val_Q)
                if len(fct_Q) > 1 :
                    Q += [fft_CT, []]
                    follow_tree(Q[2], val_Q, fct_Q, dit = dit)
                else :
                    Q +=[dft, None]
                items[val_Q] = Q
            if dit :
                plan += [Q, P]
            else :
                plan += [P, Q]

        follow_tree(plan, N, None, dit = dit)

    return plan
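To illustrate the shape of the resulting data structure, tracing the min_rad branch by hand for an input of size 12, with the default dit=True and min_rad=True, should produce the following nested lists (function objects shown here by name):

# plan_fft([0] * 12), traced by hand; every node is [size, function, sub-plan]
plan = [[6, fft_CT, [[3, dft, None],    # inner transform of size 6 = 3*2,
                     [2, dft, None]]],  # itself split into leaf DFTs
        [2, dft, None]]                 # outer DFTs of size 2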

While the planner can probably be further optimized, particularly for max_rad=True with all its repeated factorizations, this shouldn’t be the critical part of the DFT calculation, especially since the plan can be calculated in advance and reused for all inputs of the same size.

Twiddle Factors Reloaded

As already discussed in previous posts, one of the simplest ways of trimming down the computation time is to store the twiddle factors, which are almost certain to show up again, and reuse them whenever possible, rather than going through complex exponentiations over and over again. There is a slight difference though with regard to the radix 2 case. There, when computing a size N FFT, the twiddle factors that intervened in the calculations were all those with numerators in the exponent (apart from the ±2·π·i) ranging from 0 to N/2 – 1, inclusive. If N is split as P·Q, rather than as 2·N/2, then the largest numerator that will appear will be (P-1)·(Q-1), which is definitely larger than N/2, but not all of the values up to that one will be required: many of them never show up.

Take for instance a DFT of size 512. If we go for the radix 2 FFT, we will use 256 different twiddle factors in total, with numerators ranging from 0 to 255. If we instead go for a max_rad=True approach, the first split will be 512 = 16·32, and so the largest twiddle factor will have numerator 15·31 = 465. But since many of the intermediate numerators never show up in the calculation, overall only 217 different twiddle factors are needed.

It is also interesting to note that the same twiddle factors that are used in the FFTs may appear in the DFTs at the leaves of the planning tree. So it is worth modifying the current dft function so that it can accept precalculated twiddles. Because the twiddle factor sequence will have gaps, a list is no longer the most convenient data structure to hold these values, a dictionary being a better fit for the job. Combining all of these ideas, we can rewrite our function as

@nrp_base.performance_timer
def dft(x, **kwargs) :
    """
    Computes the (inverse) discrete Fourier transform naively.

    Arguments
     x         - The list of values to transform

    Keyword Arguments
     inverse   - True to compute an inverse transform. Default is False.
     normalize - True to normalize inverse transforms. Default is same as
                 inverse.
     twiddles  - Dictionary of twiddle factors to be used by function, used
                 for compatibility as last step for FFT functions, don't set
                 manually!!!

    """
    inverse = kwargs.pop('inverse',False)
    normalize = kwargs.pop('normalize',inverse)
    N = len(x)
    inv = -1 if not inverse else 1
    twiddles = kwargs.pop('twiddles', {'first_N' : N, 0 : 1})
    NN = twiddles['first_N']
    X =[0] * N
    for k in xrange(N) :
        for n in xrange(0,N) :
            index = (n * k * NN // N) % NN
            twid= twiddles.setdefault(index,
                                      math.e**(inv*2j*math.pi*index/NN))
            X[k] += x[n] * twid
        if inverse and normalize :
            X[k] /= NN
    return X
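As a quick sanity check of the new interface, a transform followed by its inverse should reproduce the input, up to floating point noise:

>>> x = [1.0, 2.0, 3.0, 4.0]
>>> X = dft(x)
>>> [round(v.real, 10) for v in dft(X, inverse=True)]
[1.0, 2.0, 3.0, 4.0]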

Note also the changes to the user interface, with the addition of the normalize keyword argument. This has been introduced to correct a notorious bug in the inverse DFT calculations of the radix 2 FFT algorithm. The changes are implemented and can be checked in the module nrp_dsp.py, although they are very similar in nature to what will shortly be presented for the generalized factorization FFT.

Putting it all Together

We now have most of the building blocks to put together a FFT function using Cooley-Tukey’s algorithm for general factorizations, a possible realization of which follows:

@nrp_base.performance_timer
def fft_CT(x, plan,**kwargs) :
    """
    Computes the (inverse) DFT using Cooley-Tukey's generalized algorithm.
    
    Arguments
     x        - The list of values to transform.
     plan     - A plan data structure for FFT execution.

    Keyword Arguments
     inverse  - if True, computes an inverse FFT. Default is False.
     twiddles - dictionary of twiddle factors to be used by function, used
                for optimization during recursion, don't set manually!!!
                                
    """
    inverse = kwargs.pop('inverse', False)
    inv = 1 if inverse else -1
    N = len(x)
    twiddles = kwargs.pop('twiddles', {'first_N' : N, 0 : 1})
    NN = twiddles['first_N']
    X = [0] * N
    P = plan[0]
    Q = plan[1]
    # do inner FFT
    for q in xrange(Q[0]) :
        X[q::Q[0]] = P[1](x[q::Q[0]], plan=P[2], twiddles=twiddles,
                          inverse=inverse,normalize=False)
    # multiply by twiddles
    for q in xrange(Q[0]) :
        for s in xrange(P[0]) :
            index = q * s * NN // N
            
            mult = twiddles.setdefault(index,
                                       math.e**(inv*2j*math.pi*index/NN))
            X[Q[0]*s+q] *= mult
    # do outer FFT
    for s in xrange(P[0]) :
        X[Q[0]*s:Q[0]*(s+1)] = Q[1](X[Q[0]*s:Q[0]*(s+1)],plan=Q[2],
                                    twiddles=twiddles,inverse=inverse,
                                    normalize=False)
    # transpose matrix
    ret = [0] * N
    for s in xrange(P[0]) :
        ret[s::P[0]] = X[s*Q[0]:(s+1)*Q[0]]
    if inverse and N == NN : # normalize inverse only in first call
        ret = [val / N for val in ret]
    return ret
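Putting the planner and the transform together, a minimal check against the naive dft might look like this:

>>> x = [float(j) for j in xrange(12)]
>>> plan = plan_fft(x)
>>> X = fft_CT(x, plan)
>>> max(abs(a - b) for a, b in zip(X, dft(x))) < 1e-9
True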

For power-of-two sizes, this algorithm is noticeably slower, by a factor of more than 2 depending on the input size, than the hardcoded radix 2 version presented earlier. This is mostly due to the time spent doing the matrix transposition in the last step. Also due to the transposition, minimum radix plans usually outperform maximum radix ones, despite the benefits in FFT and twiddle factor computations the latter present over the former. Of course none of this matters if the input size has several prime factors other than two, in which case the new algorithm clearly outperforms the radix 2 version, which is then close to the plain DFT. Still, it is worth noting for later optimizations that the matrix transposition should be done in a more efficient fashion.


Signature Preserving Function Decorators

May 25, 2009

(The code examples in this post are also available here.)

In the comments to the post on the greatest common divisor, Steve pointed out that the code to time function performance could be embedded in the code in a neater way using function decorators. At the time, I confess I did not have a clue what a decorator was. But I’ve been doing my homework, and so here’s my take on the subject…

So what the @#$%& is a decorator?

Many moons ago, when I ditched my ZX Spectrum for a 80286 PC (with a i287 math coprocessor), and moved from BASIC to C, I eventually found out that there were such things as pointers to functions. What a sick thing to do to functions! Who were these Kernighan and Ritchie!? Perverts!? So, still mourning for my lost GOTOs and GOSUBs, I dismissed the whole subject as pornographic, and decided not to learn it on moral grounds. Well, I’ve come a long way since then, am much more open-minded now, and have even developed a keen appreciation for errr… adult entertainment. So, bring in the syntactic porn sugar!!!

The basic idea is that when you write something like…

@my_decorator
def decorated_function(a, b, c) :
    <function definition goes here>

…it is evaluated the same as if you had written…

def decorated_function(a, b, c) :
    <function definition goes here>
decorated_function = my_decorator(decorated_function)

If this last assignment has sent you into stupor, that is simply the sweet way in which python’s indirection handles function pointers, so all there is to it is that my_decorator has to be a function (or a callable object, i.e. one with a __call__ method) that takes a function as its only argument, and returns a function (or a callable object). This returned function has to be compatible with the arguments of decorated_function, since calls to the latter will actually call the former.
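For instance, a minimal my_decorator that just announces each call before delegating could look like this; a sketch of the general pattern, nothing more:

def my_decorator(func) :
    def wrapper(*args, **kwargs) :
        # announce, then delegate to the original function
        print "about to call", func.__name__
        return func(*args, **kwargs)
    return wrapper

@my_decorator
def decorated_function(a, b, c) :
    return a + b + c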

You can even raise the bar by passing arguments to the decorator, but the basic scheme is the same. If you write this…

@my_decorator(dec_a, dec_b, dec_c)
def decorated_function(a, b, c) :
    <function definition goes here>

…python does the same as if you had written this…

def decorated_function(a, b, c) :
    <function definition goes here>
decorated_function = my_decorator(dec_a, dec_b, dec_c)(decorated_function)

…where my_decorator(dec_a, dec_b, dec_c) should return a function, possibly different depending on the passed parameters, which takes a function as a parameter, which should return yet another function, that better be compatible with the arguments of decorated_function, for the same reasons as before.
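Again as a minimal sketch, a parameterized decorator is just one more level of nesting, with the outermost function manufacturing the actual decorator:

def my_decorator(dec_a, dec_b, dec_c) :
    def actual_decorator(func) :
        def wrapper(*args, **kwargs) :
            # the decorator's own parameters remain visible via closure
            print "decorated with", dec_a, dec_b, dec_c
            return func(*args, **kwargs)
        return wrapper
    return actual_decorator

@my_decorator(1, 2, 3)
def decorated_function(a, b, c) :
    return a + b + c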

Decorators can even be nested, but the functioning is quite predictable as well. You write this…

@one_decorator(one_args)
@two_decorator(two_args)
def decorated_function(a, b, c) :
    <function definition goes here>

…and python won’t know the difference from doing this…

def decorated_function(a, b, c) :
    <function definition goes here>
decorated_function = one_decorator(one_args)(two_decorator(two_args)(decorated_function))

…and by now you should be able to figure out what is being returned by what.

If you are still in distress, it may be a good idea to stop here and first read David‘s great article at Siafoo, or Bruce Eckel‘s series (1, 2, 3) at Artima. To see them suckers at work, check the Python Decorator Library. There’s also the option to drink from the pristine springs of PEP 318, not that I recommend it to the uninitiated, though.

A Decorator to Time Function Execution

So let’s learn by doing, and build a decorator to time the performance of the decorated function, trying to mimic as much as possible the approach I have been following up till now: the last of the function’s arguments is verbose, which defaults to False, and setting it to True activates printout of the time required by the function to run. This is possible, but messy. What can more easily be achieved is to control the timing with a keyword argument, as we will implement…

I have added more goodies… I learnt from the folks at Leonhard Euler’s Flying Circus that, while time.clock has better resolution than time.time under Windows, it is the opposite on Linux, so we will choose our timing function at run time, through os.name, which holds a string identifying the operating system your computer is running.

from __future__ import division
from os import name as OP_SYS
import time

def performance_timer_1(_func_) :
    """
    A decorator function to time execution of a function.

    Timing is activated by setting the verbose keyword to True.
    """

    stopwatch = time.time
    if OP_SYS in ('nt', 'dos') :
        stopwatch = time.clock

    def _wrap_(*args, **kwargs) :
        """
        This is a function wrapper.
        """
        
        verbose = kwargs.pop('verbose',False)
        t = stopwatch()
        ret = _func_(*args, **kwargs)
        t = stopwatch() - t
        if verbose :
            print "Call to",_func_.__name__,"took",t,"sec."
        return ret

    return _wrap_

Run this file on IDLE, or import it into an interactive python session, and you can do things like these:

>>> @performance_timer_1
def decorated_sum(a,b) :
    """
    This is a function.
    """
    return a+b

>>> decorated_sum(3,5)
8

>>> decorated_sum(3,5,verbose=True)
Call to decorated_sum took 2.72380986743e-06 sec.
8

If you are wondering about the silly docstrings in the function and the wrapper, try the following:

>>> help(decorated_sum)
Help on function _wrap_:

_wrap_(*args, **kwargs)
    This is a function wrapper.

Now that’s odd, isn’t it? Well, not really, as it is a natural byproduct of the devilish practice of assigning functions to functions: we are getting help for the function with which the decorator replaced the original function during decoration. Part of it is easy to fix, though, as the following code demonstrates…

import inspect

def performance_timer_2(_func_) :
    """
    A decorator function to time execution of a function.

    Timing is activated by setting the verbose keyword to True.
    """

    stopwatch = time.time
    if OP_SYS in ('nt', 'dos') :
        stopwatch = time.clock

    def _wrap_(*args, **kwargs) :
        """
        This is a function wrapper.
        """
        
        verbose = kwargs.pop('verbose',False)
        t = stopwatch()
        ret = _func_(*args, **kwargs)
        t = stopwatch() - t
        if verbose :
            print "Call to",_func_.__name__,"took",t,"sec."
        return ret
    _wrap_.__name__ = _func_.__name__
    _wrap_.__doc__ = _func_.__doc__
    return _wrap_

If you try the same thing again, this time you will get…

>>> @performance_timer_2
def decorated_sum(a,b) :
    """
    This is a function.
    """
    return a+b

>>> help(decorated_sum)
Help on function decorated_sum:

decorated_sum(*args, **kwargs)
    This is a function.

We’re almost there!!! There is just one disturbing thing remaining: the function signature. See, if we had invoked help on the undecorated decorated_sum, we would have got decorated_sum(a, b), not the generic arguments and keyword arguments thing.

Preserving the Function Signature

There are ways to get around it, but things are going to get messy. The primal source is Michele Simionato‘s decorator module. Most of what I will present in the rest of this post is shamelessly ripped off from (ahem, inspired by) his code. You can check the documentation, or the last section of David’s Siafoo article, if you want to use that module.

There are two problems though… The first is of course having to rely on a third party module which isn’t part of the standard library: the rules forbid it!

But there is also a more fundamental issue: we would like to preserve the original function signature, but at the same time we are changing it!!! Going back to the prior example, if we preserved the signature unaltered as decorated_sum(a, b), we would get a TypeError: decorated_sum() got an unexpected keyword argument 'verbose' every time we tried to activate the timer. Trust me, I found out the hard way…

What we would like to get is something like decorated_sum(a, b, **kwargs). To achieve something like this we are going to have to dynamically generate a function at run-time…

def performance_timer_3(_func_) :
    """
    A decorator function to time execution of a function.

    Timing is activated by setting the verbose keyword to True.
    """
    
    stopwatch = time.time
    if OP_SYS in ('nt', 'dos') :
        stopwatch = time.clock

    def _wrap_(*args, **kwargs) :
        """
        This is a function wrapper.
        """
        
        verbose = kwargs.pop('verbose',False)
        t = stopwatch()
        ret = _func_(*args, **kwargs)
        t = stopwatch() - t
        if verbose :
            print "Call to",_func_.__name__,"took",t,"sec."
        return ret
        
    sig = list(inspect.getargspec(_func_))
    wrap_sig = list(inspect.getargspec(_wrap_))
    if not sig[2] :
        sig[2] = wrap_sig[2]
    # strip default values from the call signature: the def line keeps them,
    # but the call must pass the variables through, not the literal defaults
    call_sig = list(sig)
    call_sig[3] = None
    src =  'def %s%s :\n' % (_func_.__name__, inspect.formatargspec(*sig))
    src += '    return _wrap_%s\n' % (inspect.formatargspec(*call_sig))
    evaldict = {'_wrap_' : _wrap_}
    exec src in evaldict
    ret = evaldict[_func_.__name__]
    ret.__doc__ = _func_.__doc__
    return ret

If you are not familiar with them, you may want to check the documentation for the inspect module, as well as for the exec statement, to get the hang of what’s going on. It is important to note how the _wrap_ function is not simply named in the dynamically generated source code, but included in evaldict. If we didn’t do it this way, it would be out of scope once we left the decorator, and an error would occur. Again, you can trust me on this one…

But the thing is we have finally arrived where we wanted to…

>>> @performance_timer_3
def decorated_sum(a,b) :
    """
    This is a function.
    """
    return a+b

>>> help(decorated_sum)
Help on function decorated_sum:

decorated_sum(a, b, **kwargs)
    This is a function.

So every time you want to do a function decorator with a wrapper activated by a keyword, all you need to do is take this last example as a template, and with some cutting and pasting you’ll be done in no time.

Now wait a minute!!! Wasn’t avoiding this kind of thing what decorators were supposed to be all about? Didn’t we get into this mess to avoid cutting and pasting a couple of lines of code to the beginning and end of every function? What if we built a decorator that automated the decorating process?

def signature_decorator(_wrap_, has_kwargs = False) :
    """
    A decorator to create signature preserving wrappers.

    _wrap_     : the wrapper function, which must take a function as first
                 argument, can then take several optional arguments, which
                 must have defaults, plus the *args and **kwargs to pass to
                 the wrapped function.
    has_kwargs : should be set to True if the wrapper also accepts keyword
                 arguments to alter operation.
    """
    def decorator(func):
        """
        An intermediate decorator function
        """
        sig = list(inspect.getargspec(func))
        wrap_sig = list(inspect.getargspec(_wrap_))
        sig[0], wrap_sig[0] = sig[0] + wrap_sig[0][1:], wrap_sig[0] + sig[0]
        wrap_sig[0][0] = '_func_'
        sig[3] = list(sig[3]) if sig[3] is not None else []
        sig[3] += list(wrap_sig[3]) if wrap_sig[3] is not None else []
        sig[3] = tuple(sig[3]) if sig[3] else None
        wrap_sig[3] = None
        if sig[2] is None and has_kwargs:
            sig[2] = wrap_sig[2]
        wrap_sig[1:3] = sig[1:3]
        src = 'def %s%s :\n' % (func.__name__, inspect.formatargspec(*sig))
        src += '    return _wrap_%s\n' % (inspect.formatargspec(*wrap_sig))
        evaldict = {'_func_':func, '_wrap_': _wrap_}
        exec src in evaldict
        ret = evaldict[func.__name__]
        ret.__doc__ = func.__doc__
        return ret
    return decorator
    
def timer_wrapper_1(_func_, verbose = False, *args, **kwargs) :
    """
    A wrapper to time functions, activated by the last argument.

    For use with signature_decorator
    """
    stopwatch = time.time
    if OP_SYS in ('nt','dos') :
        stopwatch = time.clock
    t = stopwatch()
    ret = _func_(*args,**kwargs)
    t = stopwatch() - t
    if verbose :
        print "Call to",_func_.__name__,"took",t,"sec."
    return ret

def timer_wrapper_2(func, *args, **kwargs) :
    """
    A wrapper to time functions, activated by a keyword argument.

    For use with signature_decorator
    """
    stopwatch = time.time
    if OP_SYS in ('nt','dos') :
        stopwatch = time.clock
    verbose = kwargs.pop('verbose',False)
    t = stopwatch()
    ret = func(*args,**kwargs)
    t = stopwatch() - t
    if verbose :
        print "Call to",_func_.__name__,"took",t,"sec."
    return ret                                     

Do note that the wrappers are now defined outside the decorator, and that they must take the wrapped function as their first argument. Do note as well the decorator’s two parameters, the first being the wrapper, the second whether the wrapper takes keyword arguments or not. To get the hang of how this works, try the following…

>>> @signature_decorator(timer_wrapper_1)
def decorated_sum(a,b=2) :
    """
    This is a function.
    """
    return a+b

>>> help(decorated_sum)
Help on function decorated_sum:

decorated_sum(a, b=2, verbose=False)
    This is a function.

>>> @signature_decorator(timer_wrapper_2,True)
def decorated_sum(a,b=2) :
    """
    This is a function.
    """
    return a+b

>>> help(decorated_sum)
Help on function decorated_sum:

decorated_sum(a, b=2, **kwargs)
    This is a function.

There are a million things that could go wrong with this last decorator, especially with conflicting variable or keyword names in the function and the wrapper. If you check the decorator module source code, you can find ways to build robustness against such issues. As for myself, a week’s worth of sleep deprivation is enough: I already know more about decorators than I ever thought possible or wished. If you are curious, you can check the final, a little more elaborate, version of the performance_timer decorator I’m sticking to, which does not use this last refinement of signature_decorator, in nrp_base.py.


Divisors

May 13, 2009

When I wrote the first post of the FFT series I confessed it was out of boredom with number theory. That statement got me into a coffee break discussion with a friend and mathematician, who claimed that the FFT is as number theory as it gets, and went on to explain why. Since I usually don’t understand a word of what he is saying when he gets started on one of these rants, which invariably lead to Riemann’s zeta function, I just pretended to acknowledge his doubts, and then forgot about the conversation altogether.

Well, a few weeks down the road it seems he was right: while trying to make my FFT code work, I have found myself dealing with all sorts of number theoretical stuff, ranging from the simple, like factorization, to the arcane, like the Chinese remainder theorem. And since I have this thing about complete expositions, I’m afraid that this interleaving of apparently unrelated number theory posts with the FFT ones is going to be the norm. As a matter of fact the odd post on the Euclidean algorithm was supposed to be a stepping stone to an improved radix 2 FFT, which I finally managed to do without.

But let’s get going with today’s subject: computing the divisors of a positive integer. It may not seem a very sexy topic at first glance, but there’s beauty in it, as the striking Divisor Plot website put together by Jeffrey Ventrella proves…

Getting into the algorithmic treatment of the problem, there is of course the dumb approach, along the lines of “to compute all divisors of a number, do trial division by all numbers smaller than it.” I will not pursue that line of thinking though, and will focus on computing the divisors once the prime factorization of the number is known, which is the efficient way of dealing with this problem.

So I will assume that we know the expansion of the number into its prime factors as

\displaystyle n = \prod_{i=1}^{\omega(n)} p_i^{a_i},

where the function ω(n) represents the number of distinct prime factors, p_i, each appearing with the corresponding exponent, a_i.

A divisor will have a subset of the prime factors of n as its own factors, so we could represent any divisor as a list of ω(n) numbers, the i-th ranging from 0 up to and including a_i. This scheme makes it very easy to calculate the total number of divisors as

\displaystyle d(n)=\prod_{i=1}^{\omega(n)} (a_i+1).

Unordered Generation of Divisors
Furthermore, it points out that the computation of all the divisors could be achieved using ω(n) nested loops. Of course the number of loops to nest will vary from input to input, so recursion is probably the best of choices, or so thinks Tim Peters, as the code below has only cosmetic modifications to what he posted on the Python-list@python.org mailing list:

def divisors(factors) :
    """
    Generates all divisors, unordered, from the prime factorization.
    """
    ps = sorted(set(factors))
    omega = len(ps)

    def rec_gen(n = 0) :
        if n == omega :
            yield 1
        else :
            pows = [1]
            for j in xrange(factors.count(ps[n])) :
                pows += [pows[-1] * ps[n]]
            for q in rec_gen(n + 1) :
                for p in pows :
                    yield p * q

    for p in rec_gen() :
        yield p

This approach, while extremely fast, is not much use if you want to produce the divisors sorted, as you will then have to first generate and store all divisors, then sort them.
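For instance, for 540 = 2²·3³·5:

>>> sorted(divisors([2, 2, 3, 3, 5]))
[1, 2, 3, 4, 5, 6, 9, 10, 12, 15, 18, 20, 27, 30, 36, 45, 54, 60, 90, 108, 135, 180, 270, 540]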

Ordered Generation of Divisors
If what you are after is not the full set of divisors, or any arbitrary subset of them, but something like “the first divisor larger than x” or “the last divisor smaller than x,” it may be worth trying other algorithms which, while being less efficient, manage to produce the divisors properly sorted.

There’s little literature on the subject, so it’s hard to find a valid benchmark. Even PARI/GP, which is the gold standard in computational number theory, seems, from the little sense I have managed to make out of their Subversion repository, to be generating all divisors (in a similar manner to above) and then sorting them.

So here’s a calculation scheme that can effectively be modified to yield ordered divisors… We start with a pivot value of 1, and an ordered list of all prime factors. Before every factor we split our flow in two:

  • one branch skips the factor, and proceeds to the first factor different from the current one, keeping the pivot unchanged
  • the other branch first updates the pivot, multiplying it by the factor, yields the result, and then proceeds to the next factor, with the modified pivot.

The flow is probably easier to understand with the example below, showing the branching when computing the divisors of 540 = 2²·3³·5. Blue arrows skip to the first different factor, red ones go to the next factor, and a line going through a factor implies multiplication by it…

The branching of the algorithm to generate all the divisors of 540 in order.

To keep track of this structure, I’m going to get fancy, and will build a tree-like data structure, where each item will be a list, holding the factor in the first position, and pointers, i.e. nested lists, to the next elements in the second and third ones. Again, to make things clear, here’s the data structure for 540, using the same color code as before…

The data structure holding the prime factors of 540, arranged for ordered calculation of the divisors.

Rather than traversing this structure recursively, we will approach it in a more structured way, by keeping a separate list of factors that can be visited in the next step. In the example above, when we start out, we can visit the first factor (a 2), or skip to the third (the first 3), so the list will hold two elements. If we select the first 2, we then have two new options, i.e. going on to the second 2, or taking the shortcut to the first 3. Note that the two instances pointing at the first 3 are different, since one has a pivot value of 1 (and will thus generate the divisor 3), while the other has it set to 2 (and will generate the divisor 6).

If we don’t care about the order of divisors, this method can even be implemented by hand, by keeping a list of pivots and remaining factors, and on each step erasing the first element of the list and generating a divisor, plus adding up to two new items at the bottom. The first steps, plus the resulting list of divisors, are worked out below, again for the case of n = 540.

Working out the divisors of 540

Now, while when doing it on paper there is not much choice other than processing the items sequentially, we could choose to always process the element that produces the smallest divisor, and thus produce them in order. The simplest implementation of this idea would simply use a sort on the list of processable items before yielding the next divisor…

def divisors_ordered(factors) :
    """
    Generates all divisors, ordered, from the prime factorization
    """
    factors =  [(j,factors.count(j)) for j in sorted(set(factors))]
    omega = len(factors)
    fcts =[factors[0][0],[],[]]
    aux = [fcts, None]
    for j in xrange(omega) :
        p,a = factors[j]
        if omega - j > 1 :
            aux[1] = [factors[j+1][0],[],[]]
        else :
            aux[1] = None
        for k in xrange(1, a) :
            b = [p,[],[]]
            aux[0][1] = b
            aux[0][2] = aux[1]
            aux[0] = b
        aux[0][1] = aux[1]
        aux[0][2] = aux[1]
        aux[0] = aux[1]
    div_tree = [[fcts[0],1, fcts]]
    yield 1
    while div_tree :
        div_tree.sort()
        divisor, pivot, pointer = div_tree.pop(0)
        yield divisor
        if pointer[1] :
            div_tree += [[divisor * pointer[1][0] , divisor, pointer[1]]]
        if pointer[2] :
            div_tree += [[pivot * pointer[2][0], pivot, pointer[2]]]

Of course doing a full sort before yielding every divisor kills the efficiency of this algorithm. The fancy (or should I simply say correct?) way of keeping an ordered list, popping the smallest item only, and pushing arbitrary elements, is with a priority queue, which I will implement as a binary heap. Since we will be building our heap from scratch, we only need to worry about push and pop functions.

def push_heap(item, heap = None, compare = cmp) :
    """
    Pushes an item into a min-heap, or creates one.
    """
    heap_size = None
    if not heap :
        heap = [item, 1]
    else :
        heap_size = heap.pop()
        heap += [item, heap_size + 1]
    i = heap[-1] - 1
    val = heap[i]
    heap_ordered = False
    while i > 0 and not heap_ordered :
        n_i = i - 1
        if n_i & 1 : # faster than n_i % 2
            n_i -= 1
        n_i >>= 1 # faster than n_i // 2
        n_val = heap[n_i]
        if compare(val, n_val) < 0 :
            heap[i], heap[n_i] = heap[n_i], heap[i]
            i, val = n_i, heap[n_i]
        else :
            heap_ordered = True
    return heap

def pop_heap(heap, compare = cmp) :
    """
    Pops the smallest item from a min-heap.
    """
    heap_size = heap.pop()
    if not heap_size:
        return None
    elif heap_size == 1 :
        return heap.pop()
    ret, heap[0] = heap[0], heap.pop()
    heap_size -= 1
    heap += [heap_size]
    i = 0
    heap_ordered = False
    while not heap_ordered :
        n_i = 2 * i + 1
        if n_i < heap_size :
            if n_i + 1 < heap_size and compare(heap[n_i], heap[n_i + 1]) > 0 :
                n_i += 1
            if compare(heap[i], heap[n_i]) > 0 :
                heap[i], heap[n_i] = heap[n_i], heap[i]
                i = n_i
            else :
                heap_ordered = True
        else :
            heap_ordered = True
    return ret

def divisors_ordered(factors) :
    """
    Generates an ordered list of all divisors
    """
    factors =  [(j,factors.count(j)) for j in sorted(set(factors))]
    omega = len(factors)
    fcts =[factors[0][0],[],[]]
    aux = [fcts, None]
    for j in xrange(omega) :
        p,a = factors[j]
        if omega - j > 1 :
            aux[1] = [factors[j+1][0],[],[]]
        else :
            aux[1] = None
        for k in xrange(1, a) :
            b = [p,[],[]]
            aux[0][1] = b
            aux[0][2] = aux[1]
            aux[0] = b
        aux[0][1] = aux[1]
        aux[0][2] = aux[1]
        aux[0] = aux[1]
    yield 1
    div_heap = push_heap([fcts[0],1, fcts])
    while div_heap :
        divisor, pivot, pointer = pop_heap(div_heap)
        yield divisor
        if pointer[1] :
            div_heap = push_heap([divisor * pointer[1][0] , divisor, pointer[1]], div_heap)
        if pointer[2] :
            div_heap = push_heap([pivot * pointer[2][0], pivot, pointer[2]], div_heap)
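As a usage sketch, the first divisor of 540 larger than 10 can be picked lazily off the ordered generator:

>>> from itertools import dropwhile
>>> dropwhile(lambda d : d <= 10, divisors_ordered([2, 2, 3, 3, 5])).next()
12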

The following graph shows the time required by each of the three implementations presented to produce an ordered list of all divisors of p_n#, i.e. the product of the first n prime numbers. Do notice how the lower computational complexity of the heap sorting eventually beats the initial speed advantage of the sorting version. Still, for full lists of divisors, generating them all and then sorting beats any other approach. It is also worth pointing out how the time required by the heap version of the ordered algorithm eventually settles at about 14 times slower than the unordered generator with delayed sorting, i.e. the computational complexities of both are the same, which probably means that sorting dominates the time of the unordered algorithm.

Time to compute all ordered divisors of the first few primorials.

Just to show why we might want to use an algorithm that is more than ten times slower, the following graph shows the time needed to compute the first divisor larger than 1000 for the first few primorials. While the time for the heap version of the ordered algorithm is close to constant, the unordered algorithm shows the same behavior as in the previous test.

Time to compute the first divisor larger than 1000 for the first few primorials.

It is nevertheless worth noting that the break-even point is p_11# = 200,560,490,130, so unless you are after the first few divisors of a huge number, the first algorithm presented should always be the choice.


The radix-2 Cooley-Tukey FFT Algorithm with Decimation in Time

April 30, 2009

EDIT May 29th 2009: The code presented in this post has a major bug in the calculation of inverse DFTs using the FFT algorithm. Some explanation can be found here, and fixed code can be found here.

Once the DFT has been introduced, it is time to start computing it efficiently. Focusing on the direct transform,

\displaystyle X_k = \sum_{n=0}^{N-1}{x_n e^{-2\pi i k n / N}},

if the size of the input is even, we can write N = 2·M and it is possible to split the N element summation in the previous formula into two M element ones, one over n = 2·m, another over n = 2·m + 1. By doing so, after some cancellations in the exponents we arrive at

\displaystyle X_k = \sum_{m=0}^{M-1}{x_{2m} e^{-2\pi i \frac{m k}{M} }} + e^{-2\pi i\frac{k}{N}} \sum_{m=0}^{M-1}{x_{2m+1} e^{-2\pi i\frac{m k}{M}}}.

There’s something very promising here: each of the two summations is very close to being the expression of a DFT of half the size of the original. The only thing preventing them from fully being so is the fact that k runs from 0 to N-1, rather than up to M-1 only. To get around this, we will consider two different cases: k < M and k ≥ M. The first trivially produces a pair of half-sized DFTs, which we will term E_k and O_k, owing to one running over the even indexes, the other over the odd ones:

\displaystyle E_k = \sum_{m=0}^{M-1}{x_{2m} e^{-2\pi i \frac{m k}{M} }},  O_k = \sum_{m=0}^{M-1}{x_{2m+1} e^{-2\pi i \frac{m k}{M} }}.

As for the case k ≥ M, we may as well write it out as k = M + k’. And if we plug this value into the original formula, work a little with the algebra, and then undo the variable change, i.e. plug k’ = k – M back in, we arrive at the following combined formula,

X_k  =  \left\{ \begin{array}{ll} E_k +  e^{-2\pi i \frac{k}{N}} O_k & \mbox{if } k < M \\ E_{k-M} -  e^{-2\pi i\frac{k-M}{N}} O_{k-M} & \mbox{if } k \geq M \end{array} \right..

The cool thing about this decomposition is that we have turned our N element DFT into two N/2 DFTs, plus some overhead composing the results, which includes calculating twiddle factors, i.e. the complex exponential multiplying the O’s in the last formula. Given the quadratic nature of the time required to calculate the DFT, this amounts to cutting the time in half. And the best part is that we need not stop after one such decomposition: if the original size is a power of two, we can keep doing the same trick recursively until we reach a DFT of only two elements, for which we already have produced explicit formulas. A similar approach can be used for the inverse DFT, the details are built into the following code…

from __future__ import division
import math
import time

def fft_CT(x, inverse = False, verbose = False) :
    """
    Computes the DFT of x using Cooley-Tukey's FFT algorithm.
    """
    t = time.clock()
    N = len(x)
    inv = -1 if not inverse else 1
    if N % 2 :
        return dft(x, inverse, False)
    x_e = x[::2]
    x_o  = x[1::2]
    X_e = fft_CT(x_e, inverse, False)
    X_o  = fft_CT(x_o, inverse, False)
    X = []
    M = N // 2
    for k in range(M) :
        X += [X_e[k] + X_o[k] * math.e ** (inv * 2j * math.pi * k / N)]
    for k in range(M,N) :
        X += [X_e[k-M] - X_o[k-M] * math.e ** (inv * 2j * math.pi * (k-M) / N)]
    if inverse :
        X = [j/2 for j in X]
    t = time.clock() - t
    if verbose :
        print "Computed","an inverse" if inverse else "a","CT FFT of size",N,
        print "in",t,"sec."
    return X

The speed gains with this approach, restricted for now to power-of-two sizes, are tremendous even for moderately large input sizes, mostly due to the fact that the new algorithm scales as N log N. For instance, a 2²⁰ (~10⁶) item input can be processed with the FFT approach in a couple of minutes, while the projected duration using the naive DFT would be more like a couple of weeks…

Twiddle Factors Again
While it may not be immediately obvious, the previous code is generously wasting resources by calculating the same twiddle factors over and over again. Imagine, for instance, a 1024 element DFT computed using this algorithm. At some point in the recursion the original DFT will have been exploded into 64 DFTs of 16 elements each. Each of these 64 DFTs must then be composed with the exact same 8 different twiddle factors, but right now they are being calculated from scratch each of the 64 times. Furthermore, half of these 8 twiddle factors are the same ones that were needed to compose the 128 DFTs of 8 elements into the 64 DFTs of 16 elements… Summing it all up, to compute an FFT of size 2ⁿ, which requires the calculation of 2ⁿ⁻¹ different twiddle factors, we are calculating as many as n·2ⁿ⁻¹ of them.

We can easily work around this by precalculating them first, and then passing the list to other functions. For instance as follows…

def fft_CT_twiddles(x, inverse = False, verbose = False, twiddles = None) :
    """
    Computes the DFT of x using Cooley-Tukey's FFT algorithm. Twiddle factors
    are precalculated in the first function call, then passed down recursively.
    """
    t = time.clock()
    N = len(x)
    inv = -1 if not inverse else 1
    if N % 2 :
        return dft(x, inverse, False)
    M = N // 2
    if not twiddles :
        twiddles = [math.e**(inv*2j*math.pi*k/N) for k in xrange(M)]+[N]
    x_e = x[::2]
    x_o  = x[1::2]
    X_e = fft_CT_twiddles(x_e, inverse, False, twiddles)
    X_o  = fft_CT_twiddles(x_o, inverse, False, twiddles)
    X = []
    for k in range(M) :
        X += [X_e[k] + X_o[k] * twiddles[k * twiddles[-1] // N]]
    for k in range(M,N) :
        X += [X_e[k-M] - X_o[k-M] * twiddles[(k - M) * twiddles[-1] // N]]
    if inverse :
        X = [j/2 for j in X]
    t = time.clock() - t
    if verbose :
        print "Computed","an inverse" if inverse else "a","CT FFT of size",N,
        print "in",t,"sec."
    return X

This additional optimization does not improve the time scaling of the algorithm, but reduces the calculation time to about two thirds of the original one.


Computing the Greatest Common Divisor

April 30, 2009

Here comes my take on one of the most ubiquitous algorithms you can find googling…

The least common multiple of 2 numbers, lcm(a,b), is the smallest number that is divisible by both a and b. The greatest common divisor of 2 numbers, gcd(a,b), is the largest number that divides both a and b. But just how do you go about calculating them? A most important property is that one can be calculated from the other based on the relation lcm(a,b)·gcd(a,b) = a·b, which ends up meaning that in practice the gcd is always calculated first. But let’s not get ahead of ourselves…

Brute Force
I will disregard the dumber brute force approaches, such as “to get gcd(a,b), try all numbers down from min(a,b), until you find one that divides both a and b”, but a very resource-wasteful implementation could use a sieving approach for the lcm. See, the lcm has to be no larger than a·b, which is definitely a multiple of both a and b. So if we set up an array of zeros of size a·b, we could use sieving to first set all multiples of a to 1, then do a search over all multiples of b to find the first common multiple. This approach will definitely break down for even moderately large a or b, but it is probably as good as most of us would get on our own without Wikipedia…

from time import clock

def lcmBruteForce(a, b, verbose = False) :
    t = clock()
    sieve = [0]*a*b
    a,b = max(a,b), min(a,b)
    for j in range(a-1,a*b,a) :
        sieve[j] += 1
    ret = 0
    for j in range(b-1,a*b,b) :
        if sieve[j] == 1 :
            ret = j + 1
            break
    t = clock()-t
    if verbose :
        print "Calculated the lcm of",a,"and",b,"in",t,"sec."
    return ret
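A quick check:

>>> lcmBruteForce(12, 18)
36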

Integer Factorization
In my case I believe it was Ernesto, back in 1985 or 1986, who let me in on the secret of calculating the lcm and the gcd from the prime factorization of both numbers:

  • to find the gcd, keep only factors common to both a and b, raised to the smaller exponent, and
  • for the lcm, keep all factors, common or not, raised to the largest exponent.

This approach will prove to be very inefficient, but let’s give it a go. We will reuse the factorAndSieve function from the post on integer factorization. I have also written an auxiliary function, list_to_dict, that turns the list of factors with repeating entries into a dictionary, where keys are factors and the corresponding entries are exponents:

import time

def list_to_dict(seq) :
    ret = dict()
    for j in seq :
        if j in ret :
            ret[j] += 1
        else :
            ret[j] = 1
    return ret

def gcd_factor(a, b, verbose = False) :
    t = time.clock()
    aa = list_to_dict(factorAndSieve(a))
    bb = list_to_dict(factorAndSieve(b))
    rr = dict()
    for j in bb :
        if j in aa :
            rr[j] = min(aa[j], bb[j])
    ret = 1
    for j in rr :
        ret *= j**rr[j]
    t = time.clock() - t
    if verbose :
        print "Calculated the gcd of",a,"and",b,"in",t,"sec."
    return ret

The Euclidean Algorithm
When we looked at the calculation of the modular multiplicative inverse, we already took a look at the extended Euclidean algorithm, and pointed out that it gives the gcd of the two passed numbers as a by-product. Not that it was an interesting thing then, since when calculating multiplicative inverses we are normally interested in relatively prime number pairs, i.e. gcd(a,b) = 1. But if the adjective extended is dropped from the name, we are left with the Euclidean algorithm, which is the standard way to calculate the gcd.

The basic idea behind the algorithm is the observation that if a > b, and g is the gcd of a and b, then it is also the gcd of a – b and b. To make it more efficient, rather than subtracting, we can cut corners by using the equivalent fact that, if g is the gcd of a and b, then it is also the gcd of b and a mod b. If we perform that same operation recursively, the result of the modular operation will eventually reach zero after a finite number of iterations, and the gcd is then equal to the last non-zero remainder. Or, in code…

import time

def gcd_euclid(a, b, verbose = False) :
    t = time.clock()
    aa, bb = a,b
    while b != 0 :
        a, b = b, a%b
    t = time.clock() - t
    if verbose :
        print "Calculated the gcd of",aa,"and",bb,"in",t,"sec."
    return a
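And with a working gcd, the lcm mentioned at the top of the post comes almost for free; a minimal sketch, dividing before multiplying to keep the intermediate values small:

>>> def lcm(a, b) :
    return a // gcd_euclid(a, b) * b

>>> lcm(12, 18)
36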

The Binary GCD Algorithm
Euclid’s algorithm is so fast that there really is not much point in trying to improve it. If we were using C rather than python, we could have taken advantage of the fact that a computer does all its arithmetic in binary, and used the binary GCD algorithm. In python the resulting code is less clear, and the run times are slightly slower, probably owing to the fact that bitwise operations “don’t get optimized down to single CPU instructions (…) like in C,” or so says a commenter here… The truth is I haven’t been able to implement the algorithm so that it runs faster than the standard Euclidean one, so I’ll spare you my code…


The Discrete Fourier Transform

April 30, 2009

I’m currently a little fed up with number theory, so it’s time to change topics completely. Especially since the post on basic integer factorization completes what I believe is a sufficient toolkit to tackle a very cool subject: the fast Fourier transform (FFT).

I have some mixed feelings about how Fourier analysis qualifies for the “uncomplicated complexity” rule I imposed on myself when starting this blog. There certainly is a lot of very elaborate math behind it, but I think that if you skip all the mathematical subtleties and dive head first into a “the Fourier transform gives you the frequencies present in the signal” type of intuitive description, the journey from the more naive implementations of the discrete Fourier transform (DFT) to the many flavors of the FFT is a very enjoyable one, which requires only a limited amount of complication, but gentle doses of complexity. Which is exactly what I’m after…

So I will take the definition of the DFT for granted, and concentrate on the algorithms to compute it efficiently. There will have to be a few leaps of faith, but I will try to mark them clearly, so you can whisper “amen” and keep going without a second thought. Although things may unfold differently, I think there is material for about half a dozen posts. Towards the end I may devote some time to optimizing the code, with things such as avoiding recursion, or doing the calculations in place, but be warned that most of my efforts will be devoted to optimizing the math behind the code, not the code itself.

The Discrete Fourier Transform
Without further explanation, we will begin by writing down the analytical expression of the DFT,

\displaystyle X_k = \sum_{n=0}^{N-1}{x_n e^{-2\pi i k n / N}},

and of its corresponding inverse transform,

\displaystyle x_n =\frac{1}{N} \sum_{k=0}^{N-1}{X_k e^{2\pi i k n / N}}.

With the built-in support for complex arithmetic, there really isn’t much mystery in turning these two formulas into python functions, or as I have chosen, one with an inverse switch:

from __future__ import division
import math
import time

def dft(x, inverse = False, verbose = False) :
    t = time.clock()
    N = len(x)
    inv = -1 if not inverse else 1
    X =[0] * N
    for k in xrange(N) :
        for n in xrange(N) :
            X[k] += x[n] * math.e**(inv * 2j * math.pi * k * n / N)
        if inverse :
            X[k] /= N
    t = time.clock() - t
    if verbose :
        print "Computed","an inverse" if inverse else "a","DFT of size",N,
        print "in",t,"sec."
    return X

Of course, having two nested loops over the size of the input means that computation time will grow with N², which is extremely inefficient.

Twiddle Factors
Apart from the FFT algorithms we will start with in the next post, there are some pretty straightforward optimizations at hand. While some texts reserve the name for similar factors appearing in FFT algorithms, the complex exponentials coming up in the DFT formula are often known as twiddle factors. While in principle there are N² of them to calculate, we can take advantage of the fact that

\displaystyle e^{\varphi i} = \cos \varphi + i \sin \varphi,

and owing to the periodic nature of trigonometric functions, it isn’t hard to see that n·k can be replaced by n·k mod N without altering the result, which means only N different ones do exist.

A python implementation with pre-calculated twiddle factors could look something like this…

from __future__ import division
import math
import time

def dft(x, inverse = False, verbose = False) :
    t = time.clock()
    N = len(x)
    inv = -1 if not inverse else 1
    twiddles = [math.e**(inv*2j*math.pi*k/N) for k in xrange(N)]
    X =[0] * N
    for k in xrange(N) :
        for n in xrange(N) :
            X[k] += x[n] * twiddles[n * k % N]
        if inverse :
            X[k] /= N
    t = time.clock() - t
    if verbose :
        print "Computed","an inverse" if inverse else "a","DFT of size",N,
        print "in",t,"sec."
    return X

On my machine this code is about 5 times faster than the original one, although it still scales with time as N², so we are only postponing the inevitable. Do notice as well that the increase in speed comes at the cost of additional memory: an array of N complex values.

Particular Cases
Lacking any insightful inspiration, the only other route to optimization is working out the math on paper for particular cases, and explicitly coding the resulting formulas to avoid doing costly exponentiations.

As examples, if N = 2, it is easy to show that

\displaystyle X_0 = x_0 + x_1, \\ X_1 = x_0 - x_1,

and if N = 4, one can similarly arrive at

\displaystyle X_0 = x_0 + x_1  + x_2 + x_3, \\ X_1 = x_0 - x_2 - i (x_1 - x_3), \\ X_2 = x_0 + x_2 - (x_1+x_3), \\ X_3 = x_0 - x_2 + i (x_1 - x_3).

Adding these particular cases to the python code could be done as follows…

def dft(x, inverse = False, verbose = False) :
    t = time.clock()
    N = len(x)
    X =[0] * N
    inv = -1 if not inverse else 1
    if N == 1 :
        X[0] = x[0]
    elif N == 2 :
        X[0] = x[0] + x[1]
        X[1] = x[0] - x[1]
        if inverse :
            X = [k/N for k in X]
    elif N == 4 :
        a = x[0] + x[2]
        b = x[0] - x[2]
        c = x[1] + x[3]
        d = x[1] - x[3]
        X[0] = a + c
        X[1] = b + inv * 1j * d
        X[2] = a - c
        X[3] = b - inv * 1j * d
        if inverse :
            X = [k/N for k in X]
    else :
        twiddles = [math.e**(inv*2j*math.pi*k/N) for k in xrange(N//2+1)]
        for k in xrange(N//2+1,N) :
            twiddles += [twiddles[N-k].conjugate()]
        for k in xrange(N) :
            for n in xrange(N) :
                X[k] += x[n] * twiddles[n * k % N]
            if inverse :
                X[k] /= N
    t = time.clock() - t
    if verbose :
        print "Computed","an inverse" if inverse else "a","DFT of size",N,
        print "in",t,"sec."
    return X
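
As a quick sanity check of the hardcoded branches, a round trip through the transform and its inverse should reproduce the input up to rounding error; a minimal sketch, assuming the dft above is in scope:

import random

x = [random.random() for j in xrange(4)] # size 4 exercises the hardcoded branch
y = dft(dft(x), inverse = True)
print max(abs(a - b) for a, b in zip(x, y)) # should be rounding noise, ~1e-16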

Overall, and for these particular cases only, an additional factor of 3 increase in speed is achieved in this way. But this solution does not scale to arbitrary input sizes, and so wiser approaches are needed to tackle even moderately large inputs of a few tens of thousands of samples…


Naive Integer Factorization

April 9, 2009

After three posts (1, 2, 3) on calculating prime numbers, it is probably worth putting that knowledge to a more useful task. As we will see in the near future, integer factorization, i.e. breaking down a (composite) number into its prime factors, is one such task. Strictly speaking, factoring a number n is simply decomposing it as the product of two smaller non-trivial divisors, i.e. divisors different from 1 and n itself. But by repeatedly factoring the divisors one will eventually come up with a unique set of primes which, when multiplied together, render the original number, e.g. 60 = 2·2·3·5, or so says the fundamental theorem of arithmetic… The point is, we will consider factorization a synonym of prime decomposition, be it formally correct or not.

There are some very sophisticated methods to factor very large numbers, but they use a lot of extremely complex math, so I doubt they will ever find their way onto this blog. So we are going to be left with the naive, straightforward approach as our only option, although I will try to give it an efficiency boost. What is this naive approach? Trial division, of course: given a number n, we know that if it is composite its smallest prime factor will be no larger than the square root of n, so we can simply try and see if any of those numbers divide it. No, I will not try to code that yet… If you have read the entries on determining prime numbers, it should come as no surprise that we really do not need to do trial division by all numbers smaller than the square root of n, but only by the primes among them. This is a consequence of the fact that, if a composite number divides n, then each of the prime factors of that composite number will also divide n. According to the prime number theorem the number of primes below x is asymptotic to x / log x. So by limiting our trials to prime numbers we can reduce the number of tests from √n to something around 2√n / log n. If we rescue the primeListSofE function from the post on the sieve of Eratosthenes, a python implementation of naive factorization could look something like this…

from time import clock

def factor(n, verbose = False) :
    """
    Returns all prime factors of n, using trial division by prime
    numbers only. Returns a list of (possibly repeating) prime factors
    """
    t = clock()
    ret = []
    nn = n
    maxFactor = int(n**0.5)
    primes = primeListSofE(maxFactor, verbose) # from the sieve of Eratosthenes post
    for p in primes :
        while nn % p == 0 :
            nn //= p
            ret += [p]
        if nn == 1 :
            break
    if nn != 1 :
        ret += [nn] # whatever remains is itself a prime larger than n**0.5
    t = clock() - t
    if verbose :
        print "Calculated factors of",n,"in",t,"sec."
    return ret
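
By way of illustration, a hypothetical interactive session, assuming primeListSofE from the sieve post is importable, might look as follows; the numbers are small enough that the answers are easy to check by hand:

>>> factor(360)
[2, 2, 2, 3, 3, 5]
>>> factor(1001)
[7, 11, 13]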

While this function will be about as good as we can make it for numbers which are the product of two large prime factors, it will be terribly inefficient for most numbers. Consider, as an extreme example, that we are trying to factor 2^55 ≈ 3.6·10^16. We would first calculate all primes up to 1.9·10^8, a challenging feat in itself with our available tools, only to find out that we only needed the first of those primes, i.e. 2. Taking into account that 50% of all numbers are divisible by 2, 33% are divisible by 3, 20% are divisible by 5… it doesn’t seem wise to disregard the potential time savings. What we can do to profit from this is to do the trial division checks at the same time as we determine the prime numbers, updating the largest prime to test on the fly. This has to be done in two stages: the first while we sieve with primes up to n^1/4, the second while we inspect the rest of the sieve, up to √n, for more primes. The following Franken-code has been written mostly by cut-and-paste from primeListSofE and factor, which hopefully hasn’t affected its readability much:

from time import clock

def factorAndSieve(n, verbose = False) :
    """
    Returns all prime factors of n, using trial division while sieving
    for primes. Returns a list of (possibly repeating) prime factors
    """
    t = clock()
    ret = []
    nn = n
    while nn % 2 == 0 : # remove 2's first, as 2 is not in sieve
        nn //= 2
        ret += [2]
    # the sieve holds odd numbers only: index i stands for the number 2*i+3
    maxFactor = int(nn**0.5)
    maxI = (maxFactor-3) // 2
    maxP = int(maxFactor**0.5)
    sieve = [True for j in xrange(maxI+1)]
    i = 0
    for p in xrange(3, maxP+1, 2) : # we then sieve as far as needed
        if p > maxP : # maxP shrinks as factors are divided out
            break
        i = (p-3) // 2
        if sieve[i] :
            while nn % p == 0 :
                nn //= p
                ret += [p]
                maxFactor = int(nn**0.5) # shrink the limits to the reduced nn
                maxI = (maxFactor-3) // 2
                maxP = int(maxFactor**0.5)
            if nn == 1 :
                break
            else :
                i2 = (p*p - 3) // 2 # first multiple to strike out is p*p
                for k in xrange(i2, maxI+1, p) :
                    sieve[k] = False
    index = i
    for i in xrange(index, maxI+1) : # and inspect the rest of the sieve
        if i > maxI : # maxI may also have shrunk in the meantime
            break
        if sieve[i] :
            p = 2*i + 3
            while nn % p == 0 :
                nn //= p
                ret += [p]
                maxFactor = int(nn**0.5)
                maxI = (maxFactor-3) // 2
                maxP = int(maxFactor**0.5)
            if nn == 1 :
                break
    if nn != 1 :
        ret += [nn] # whatever remains is itself prime
    t = clock() - t
    if verbose :
        print "Calculated factors of",n,"in",t,"sec."
        print "Stopped trial division at",2*i+3,"instead of",int(n**0.5)
    return ret
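
Since both functions compute the same thing, a cheap cross-check is to compare their outputs over a range of inputs. A throwaway sketch, starting at a safe lower bound since I haven’t checked how primeListSofE copes with very small limits:

# both implementations should return identical lists of prime factors
for n in xrange(10, 2000) :
    assert factor(n) == factorAndSieve(n), n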

This new code will very often be much faster than the previous one, but at times it will be just about as slow, or even slower, since the mixing of both codes introduces some inefficiencies. The most extreme examples of such cases are a prime number, or the square of a prime, on one side, where trial division cannot stop early, and a power of 2 on the other, where it stops almost immediately.

The graph above plots times to calculate the factors of numbers between 10^6 and 10^6 + 100. Prime numbers in this interval stick out as the red dots among the blue ones: 10^6 + 3, +33, the twin primes +37 and +39, +81 and +99. And numbers with many small prime factors populate the bottom of the red cloud.

If the above graph is not enough to convince you of the benefits of the second approach, maybe these timings for very large numbers will:


>>> factor(10**15+37,True)
Calculated primes to 31622776 in 6.760 sec.
Calculated factors of 1000000000000037 in 8.466 sec.
[1000000000000037L]
>>> factorAndSieve(10**15+37,True)
Calculated factors of 1000000000000037 in 8.666 sec.
Stopped trial division at 31622775 instead of 31622776
[1000000000000037L]

>>> factor(2**55,True)
Calculated primes to 189812531 in 42.811 sec.
Calculated factors of 36028797018963968 in 43.261 sec.
[2, ..., 2]
>>> factorAndSieve(2**55,True)
Calculated factors of 36028797018963968 in 8.632e-05 sec.
Stopped trial division at 3 instead of 189812531
[2, ..., 2]