## The Cooley-Tukey FFT Algorithm for General Factorizations

May 29, 2009

(All code examples in the post have been included in the nrp_base.py module, which can be downloaded from this repository.)

As presented in the previous post, Cooley-Tukey’s FFT algorithm has a clear limitation: it can only be used to speed the calculation of DFTs of a size that is a power of two. It isn’t hard, though, to extend the same idea to a more general factorization of the input signal size. We start again from the formula that defines the DFT,

$\displaystyle X_k = \sum_{n=0}^{N-1}{ x_k e^{-2\pi i kn/N}}$

Say that N = P·Q. Following the previous approach, we could try to split the calculation of the DFT into Q DFTs of size P, by rewriting it as

$\displaystyle X_k = \sum_{q=0}^{Q-1}{ e^{-2\pi i \frac{kq}{N}} \sum_{p=0}^{P-1}{x_{Qp+q} e^{-2\pi i \frac{kp}{P}}}}$

where n = Q·p + q. We are again faced with the same issue to turn the inner summation into a proper DFT: while p ranges from 0 to P-1, k goes all the way up to N-1. To get around this, we can make k = P·r + s, and replacing in the previous formula we get

$\displaystyle X_{Pr+s} = \sum_{q=0}^{Q-1}{ e^{-2\pi i \frac{qr}{Q}} e^{-2\pi i \frac{qs}{N}} \sum_{p=0}^{P-1}{ x_{Qp+q} e^{-2\pi i \frac{ps}{P}}}}$

where the inner summation has been reduced to a standard DFT of size P. Furthermore, the splitting of the first exponential in two factors unveils that the outer summation is also a standard DFT, this one of size Q, of the product of the DFT of size P by twiddle factors.

The subscripts in the above formula can get confusing to follow, so take note that q and r go from 0 to Q-1, while p and s go from 0 to P-1. Therefore the computation of the size N DFT has effectively been split into the computation of Q DFTs of size P, followed by twiddle factor multiplication, and then by P DFTs of size Q. It may be helpful to picture the linear array of input data as a PxQ matirx stored in row-major order. Then the Q DFTs of size P are performed on the columns of this matrix, and the P DFTs of size Q on the rows. Note though that the output has to be rearranged, in what would be equivalent to a matrix transposition, before getting to the final answer. As an example, the following picture illustrates the process on an input of size 15.

#### Planning a Generalized Cooley-Tukey FFT

Disregarding the multiplication by twiddle factors and the matrix transpositions, the algorithm turns a calculation which requires a time proportional to N2 = P2Q2, into one which requires a time proportional to P·Q·(P+Q). This very simplistic complexity analysis suggests that the factoring of the size of the original signal into two multiplicative factors isn’t arbitrary, and will affect the performance of the algorithm. And in principle a split into two similarly sized factors, P ~ Q ~ √N should, although often doesn’t, outperform any other approach.

In many practical applications a small P or Q is used, which is then called the radix. If the smaller DFT is in the outer loop. i.e. if Q is the radix, the scheme is termed decimation in time (DIT), while if it is in the inner loop, i.e. if P is the radix, we have decimation in frequency (DIF), which explains the long name of the simplest Cooley-Tukey algorithm described earlier.

The splitting of the input size, N, into the two multiplicative factors, P and Q, can be done in d(N) – 2 different ways, where d(N) is the divisor function, i.e. the number of divisors of N. If we use the same algorithm recursively, the possible paths to the solution can grow very fast, and determining the optimal one is anything but easy. It is in any case a point where we don’t want to make excessive assumptions, and which should be left sufficiently open for user experimentation. So I’ve decided to copy FFTW‘s approach: the first step to peform a FFT is to plan it. In our case, a plan is a data structure which is similar to a tree, holding the information of the factorizations chosen, the function that should be called to perform the next step, and references to those next steps. The following picture sketches two possible plans for a DFT of size 210 = 2·3·5·7.

In line with the image, I have coded a planning function that gives two choices to the user:

• decimation in time (dit=True) vs. decimation in frequency (dif=True)
• minimum possible radix (min_rad=True) vs. maximum possible radix (max_rad=True)

Note that the standard radix 2 with decimation in time plan would result from setting dit=True and min_rad=True when the input size is a power of two, and that it would be the default behavior of this planner.

@nrp_base.performance_timer
def plan_fft(x, **kwargs) :
"""
Returns an FFT plan for x.

The plan uses direct DFT for prime sizes, and the generalized CT algorithm
with either maximum or minimum radix for others.

Keyword Arguments
dif     -- If True, will use decimation in frequency. Default is False.
dit     -- If True, will use decimation in time. Overrides dif. Default
is True.
max_rad -- If True, will use tje largest possible radix. Default is False.
min_rad -- If True, will use smallest possible radix. Overrides max_rad.
Default is True.

"""
dif = kwargs.pop('dif', None)
dit = kwargs.pop('dit', None)

if dif is None and dit is None :
dif = False
dit = True
elif dit is None :
dit = not dif
else :
dif = not dit

if max_rad is None and min_rad is None :
elif min_rad is None :
else :

N = len(x)
plan = []

fct = sorted(nrp_numth.naive_factorization(N))
NN = N
ptr = plan
for f in fct :
NN //= f
P = [f, dft, None]
Q = [NN, fft_CT, []]
if dit :
ptr += [Q, P]
else :
ptr += [P, Q]
if NN == fct[-1] :
Q[1:3] = [dft, None]
break
else :
ptr = Q[2]
else :
items ={}
def follow_tree(plan, N, fct, **kwargs) :
if fct is None:
fct = nrp_numth.naive_factorization(N)
divs = sorted(nrp_numth.divisors(fct))
val_P = divs[(len(divs) - 1) // 2]
P = [val_P]
if val_P in items :
P = items[val_P]
else :
fct_P = nrp_numth.naive_factorization(val_P)
if len(fct_P) > 1 :
P += [fft_CT, []]
follow_tree(P[2], val_P, fct_P, dit = dit)
else :
P +=[dft, None]
items[val_P] = P
val_Q = N // val_P
Q = [val_Q]
if val_Q in items :
Q = items[val_Q]
else :
fct_Q = nrp_numth.naive_factorization(val_Q)
if len(fct_Q) > 1 :
Q += [fft_CT, []]
follow_tree(Q[2], val_Q, fct_Q, dit = dit)
else :
Q +=[dft, None]
items[val_Q] = Q
if dit :
plan += [Q, P]
else :
plan += [P, Q]

follow_tree(plan, N, None, dit = dit)

return plan


While the planner can probably be further optimized, especially for max_rad=True, with all the repeated factorizations, this shouldn’t be the critical part of the DFT calculation. Specially so since the plan can be calculated in advance, and reused for all inputs of the same size.

#### Twiddle Factors Reloaded

As already discussed in previous posts, one of the simplest ways of trimming down the computation time is to store the twiddle factors, which are almost certain to show up again, and reuse them whenever possible, rather than going through complex exponentiations over and over again. There is a slight difference though with regard to the radix 2 case. There, when computing a size N FFT, the twiddle factors that intervened in the calculations where all those with numerators in the exponent (apart from the ±2·π·i) ranging from 0 to N/2 – 1, inclusive. If N is split as P·Q, rather than as 2·N/2, then the largest numerator that will appear will be (P-1)·(Q-1), which is definitely larger than N/2, but not all of the items up to that value will be required, with many of them missing.

Take for instance a DFT of size 512. If we go for the radix 2 FFT, we will in total use 256 different twiddle factors, ranging from 0 to 255. If rather than that we go for a max_rad=True approach, the first split will be 512 = 16·32, and so the largest twiddle factor will have numerator 15·31 = 465. But with all the intermediate twiddle factors not showing up in the calculation, overall only 217 different ones are needed.

It is also interesting to note that the same twiddle factors that are used in the FFTs, may appear in the DFTs of the leaves of the planning tree. So it is worth modifying the current dft function, so that it can accept precalculated twiddles. Because the twiddle factors sequence will have gaps, a list is no longer the most convenient data structure to hold these values, a dictionary being a better fit for the job. Combining all of these ideas, we can rewrite our function as

@nrp_base.performance_timer
def dft(x, **kwargs) :
"""
Computes the (inverse) discrete Fourier transform naively.

Arguments
x         - Te list of values to transform

Keyword Arguments
inverse   - True to compute an inverse transform. Default is False.
normalize - True to normalize inverse transforms. Default is same as
inverse.
twiddles  - Dictionary of twiddle factors to be used by function, used
for compatibility as last step for FFT functions, don't set
manually!!!

"""
inverse = kwargs.pop('inverse',False)
normalize = kwargs.pop('normalize',inverse)
N = len(x)
inv = -1 if not inverse else 1
twiddles = kwargs.pop('twiddles', {'first_N' : N, 0 : 1})
NN = twiddles['first_N']
X =[0] * N
for k in xrange(N) :
for n in xrange(0,N) :
index = (n * k * NN // N) % NN
twid= twiddles.setdefault(index,
math.e**(inv*2j*math.pi*index/NN))
X[k] += x[n] * twid
if inverse and normalize :
X[k] /= NN
return X


Note also the changes to the user interface, with the addition of the normalize keyword argument. This has been introduced to correct a notorious bug in the inverse DFT calculations for the radix 2 FFT algorithm. The changes are implemented and can be checked in the module nrp_dsp.py, although they are very similar in nature to what will shortly be presented for the generalized factorization FFT.

#### Putting it all Together

We now have most of the building blocks to put together a FFT function using Cooley-Tukey’s algorithm for general factorizations, a possible realization of which follows:

@nrp_base.performance_timer
def fft_CT(x, plan,**kwargs) :
"""
Computes the (inverse) DFT using Cooley-Tukey's generalized algorithm.

Arguments
x        - The list of values to transform.
plan     - A plan data structure for FFT execution.

Keyword Arguments
inverse  - if True, computes an inverse FFT. Default is False.
twiddles - dictionary of twiddle factors to be used by function, used
for ptimization during recursion, don't set manually!!!

"""
inverse = kwargs.pop('inverse', False)
inv = 1 if inverse else -1
N = len(x)
twiddles = kwargs.pop('twiddles', {'first_N' : N, 0 : 1})
NN = twiddles['first_N']
X = [0] * N
P = plan[0]
Q = plan[1]
# do inner FFT
for q in xrange(Q[0]) :
X[q::Q[0]] = P[1](x[q::Q[0]], plan=P[2], twiddles=twiddles,
inverse=inverse,normalize=False)
# multiply by twiddles
for q in xrange(Q[0]) :
for s in xrange(P[0]) :
index = q * s * NN // N

mult = twiddles.setdefault(index,
math.e**(inv*2j*math.pi*index/NN))
X[Q[0]*s+q] *= mult
# do outer FFT
for s in xrange(P[0]) :
X[Q[0]*s:Q[0]*(s+1)] = Q[1](X[Q[0]*s:Q[0]*(s+1)],plan=Q[2],
twiddles=twiddles,inverse=inverse,
normalize=False)
# transpose matrix
ret = [0] * N
for s in xrange(P[0]) :
ret[s::P[0]] = X[s*Q[0]:(s+1)*Q[0]]
if inverse and N == NN : # normalize inverse only in first call
ret = [val / N for val in ret]
return ret


For power of two sizes, this algorithm is noticeably slower, by a factor of more than 2 depending on the input size, than the hardcoded radix 2 prior version. This is mostly due to the time spent doing matrix inversion in the last step. Also due to the matrix inversion, minimum radix plans usually outperform maximum radix ones, despite the benefits in FFT and twiddle factor computations the latter present over the former. Of course none of this matters if the input size presents several prime factors other than two, in which case the new algorithm clearly outperforms the radix 2 version, which is then close to the DFT version. Still, it is worth noting for later optimizations that matrix inversion should be done in a more efficient fashion.

## Signature Preserving Function Decorators

May 25, 2009

(The code examples in this post are also available here.)

On the comments to the post on the greatest common divisor, Steve pointed out that the code to time function performance could be embedded in the code in a neater way using function decorators. At the time, I confess I did not have a clue of what a decorator is. But I’ve been doing my homework, and so here’s my take on the subject…

#### So what the @#\$%& is a decorator?

Many moons ago, when I ditched my ZX Spectrum for a 80286 PC (with a i287 math coprocessor), and moved from BASIC to C, I eventually found out that there were such things as pointers to functions. What a sick thing to do to functions! What were this Kernighan and Ritchie!? Perverts!? So, still mourning for my lost GOTOs and GOSUBs, I dismissed the whole subject as pornographic, and decided not to learn it on moral grounds. Well, I’ve come a long way since then, am much more open-minded now, and have even developed a keen appreciation for errr… adult entertainment. So, bring in the syntactic porn sugar!!!

The basic idea is that when you write something like…

@my_decorator
def decorated_function(a, b, c) :
<function definition goes here>


…it is evaluated the same as if you had written…

def decorated_function(a, b, c) :
<function definition goes here>
decorated_function = my_decorator(decorated_function)


If this last assignment has sent you into stupor, that is simply the sweet way in which python’s indirection handles function pointers, so all there is to it is that my_decorator has to be a function (or a callable object, i.e. one with a __call__ method) that takes a function as only argument, and returns a function (or a callable object). This last return function has to be compatible with the arguments of decorated_function, since calls to the latter will actually call the former.

You can even raise the bar by passing arguments to the decorator, but the basic scheme is the same, if you write this…

@my_decorator(dec_a, dec_b, dec_c)
def decorated_function(a, b, c) :
<function definition goes here>


…python does the same as if you had written this…

def decorated_function(a, b, c) :
<function definition goes here>
decorated_function = my_decorator(dec_a, dec_b, dec_c)(decorated_function)


…where my_decorator(dec_a, dec_b, dec_c) should return a function, possibly different depending on the passed parameters, which takes a function as a parameter, which should return yet another function, that better be compatible with the arguments of decorated_function, for the same reasons as before.

Decorators can even be nested, but the functioning is quite predictable s well. You write this…

@one_decorator(one_args)
@two_decorators(two_args)
def decorated_function(a, b, c) :
<function definition goes here>


…and python won’t know the difference from doing this…

def decorated_function(a, b, c) :
<function definition goes here>
decorated_function = one_decorator(one_args)(two_decorator(two_args)(decorated_function))


…and by now you should be able to figure out what is being returned by what.

If you are still in distress, it may be a good idea to stop here and first read David‘s great article at Siafoo , or Bruce Eckel‘s series (1, 2, 3) at Artima. To see them suckers at work, check the Python Decorator Library. There’s also the option to drink from the pristine springs of PEP 318, not that I recommend it to the uninitiated, though.

#### A Decorator to Time Function Execution

So lets learn by doing, and build a decorator to time the performance of the decorated function, trying to mimic as much as possible the approach I have been following up till now: the last of the function’s argument is verbose, which defaults to False, and setting it to True activates printout of the time required by the function to run. This is possible, but messy. On the other hand, what can more easily be achieved is to control the timing with a keyword argument, as we will implement…

I have added more goodies… I learnt from the folks at Leohnard Euler’s Flying Circus that, while time.clock has superior performance than time.time under Windows, it is the opposite on Linux, so we will choose our timing function at run time, through os.name, which holds a string with the operating system your computer is running.

from __future__ import division
from os import name as OP_SYS
import time

def performance_timer_1(_func_) :
"""
A decorator function to time execution of a function.

Timing is activated by setting the verbose keyword to True.
"""

stopwatch = time.time
if OP_SYS in ('nt', 'dos') :
stopwatch = time.clock

def _wrap_(*args, **kwargs) :
"""
This is a function wrapper.
"""

verbose = kwargs.pop('verbose',False)
t = stopwatch()
ret = _func_(*args, **kwargs)
t = stopwatch() - t
if verbose :
print "Call to",_func_.__name__,"took",t,"sec."
return ret

return _wrap


Run this file on IDLE, or import it into an interactive python session, and you can do things like these:

>>> @performance_timer_1 def decorated_sum(a,b) : """ This is a function. """ return a+b

>>> decorated_sum(3,5) 8

>>> decorated_sum(3,5,verbose=True) Running function decorated_sum took 2.72380986743e-06 sec. 8

If you are wondering about the silly docstrings in the function and the wrapper, try the following:

>>> help(decorated_sum) Help on function wrapper:

wrapper(*args, **kwargs) This is a function wrapper.

Now that’s odd, isn’t it? Well, not really, as it is a natural byproduct of the devilish practice of assigning functions to functions: we are getting help for the function with which the decorator replaced the original function during decoration. Part of it is easy to fix, though, as the following code demonstrates…

import inspect

def performance_timer_2(_func_) :
"""
A decorator function to time execution of a function.

Timing is activated by setting the verbose keyword to True.
"""

stopwatch = time.time
if OP_SYS in ('nt', 'dos') :
stopwatch = time.clock

def _wrap_(*args, **kwargs) :
"""
This is a function wrapper.
"""

verbose = kwargs.pop('verbose',False)
t = stopwatch()
ret = _func_(*args, **kwargs)
t = stopwatch() - t
if verbose :
print "Call to",_func_.__name__,"took",t,"sec."
return ret
_wrap_.__name__ = _func_.__name__
_wrap_.__doc__ = _func_.__doc__
return _wrap_


If you try the same thing again, this time you will get…

@performance_timer_2 def decorated_sum(a,b) : """ This is a function. """ return a+b

>>> help(decorated_sum) Help on function decorated_sum:

decorated_sum(*args, **kwargs) This is a function.

We’re almost there!!! There is just one disturbing thing remaining: the function signature. See, if we had invoked help on the undecorated decorated_sum, we would have got decorated_sum(a, b), not the generic arguments and keyword arguments thing.

#### Preserving the Function Signature

There are ways to get around it, but things are going to get messy. The primal source is Michele Simionato‘s decorator module. Most of what I will present in the rest of this post is shamelessly ripped off from inspired on his code. You can check the documentation, or the last section of David’s Siafoo article, if you want to use that module.

There are two problems though… The first is of course having to rely on a third party module which isn’t part of the standard library: the rules forbid it!

But there is also a more fundamental issue: we would like to preserve the original function signature, but at the same time we are changing it!!! Going back to the prior example, if we preserved the signature unaltered as decorated_sum(a, b), we would get a TypeError: decorated_sum() got an unexpected keyword argument 'verbose' every time we tried to activate the timer. Trust me, I found the hard way…

What we would like to get is something like decorated_sum(a, b, **kwargs). To achieve something like this we are going to have to dynamically generate a function at run-time…

def performance_timer_3(_func_) :
"""
A decorator function to time execution of a function.

Timing is activated by setting the verbose keyword to True.
"""

stopwatch = time.time
if OP_SYS in ('nt', 'dos') :
stopwatch = time.clock

def _wrap_(*args, **kwargs) :
"""
This is a function wrapper.
"""

verbose = kwargs.pop('verbose',False)
t = stopwatch()
ret = _func_(*args, **kwargs)
t = stopwatch() - t
if verbose :
print "Call to",_func_.__name__,"took",t,"sec."
return ret

sig = list(inspect.getargspec(_func_))
wrap_sig = list(inspect.getargspec(_wrap_))
if not sig[2] :
sig[2] = wrap_sig[2]
src =  'def %s%s :\n' %(_func_.__name__, inspect.formatargspec(*sig))
src += '    return _wrap_%s\n' % (inspect.formatargspec(*sig))
evaldict = {'_wrap_' : _wrap_}
exec src in evaldict
ret = evaldict[_func_.__name__]
ret.__doc__ = _func_.__doc__
return ret


If you are not familiar with them, you may want to check the documentation for the inspect module, as well as for the exec statement, to get a hang of what’s going on. It is important to note how the _wrap_ function is not simply named in the dynamically generated source code, but included in evaldict. If we didn’t do it this way, it would be out of scope once we left the decorator, and an error will occur. Again, you can trust me on this one…

But the thing is we have finally arrived where we wanted to…

>>> @performance_timer_3 def decorated_sum(a,b) : """ This is a function. """ return a+b

>>> help(decorated_sum) Help on function decorated_sum:

decorated_sum(a, b, **kwargs) This is a function.

So every time you want to do a function decorator with a wrapper activated by a keyword, all you need to do is take this last example as a template, and with some cutting and pasting you’ll be done in no time.

Now wait a minute!!! Wasn’t avoiding this kind of thing what decorators where supposed to be all about? Didn’t we get into this mess to avoid cutting and pasting a couple of lines of code to the beginning and end of every function? What if we built a decorator that automated the decorating process?

def signature_decorator(_wrap_, has_kwargs = False) :
"""
A decorator to create signature preserving wrappers.

_wrap_     : the wrapper function, which must take a function as first
argument, can then take several optional arguments, which
must have defaults, plus the *args and **kwargs to pass to
the wrapped function.
has_kwargs : should be set to True if the wrapper also accepts keyword
arguments to alter operation.
"""
def decorator(func):
"""
An intermediate decorator function
"""
sig = list(inspect.getargspec(func))
wrap_sig = list(inspect.getargspec(_wrap_))
sig[0], wrap_sig[0] = sig[0] + wrap_sig[0][1:], wrap_sig[0] + sig[0]
wrap_sig[0][0] = '_func_'
sig[3] = list(sig[3]) if sig[3] is not None else []
sig[3] += list(wrap_sig[3]) if wrap_sig[3] is not None else []
sig[3] = tuple(sig[3]) if sig[3] else None
wrap_sig[3] = None
if sig[2] is None and has_kwargs:
sig[2] = wrap_sig[2]
wrap_sig[1:3] = sig[1:3]
src = 'def %s%s :\n' % (func.__name__, inspect.formatargspec(*sig))
src += '    return _wrap_%s\n' % (inspect.formatargspec(*wrap_sig))
evaldict = {'_func_':func, '_wrap_': _wrap_}
exec src in evaldict
ret = evaldict[func.__name__]
ret.__doc__ = func.__doc__
return ret
return decorator

def timer_wrapper_1(_func_, verbose = False, *args, **kwargs) :
"""
A wrapper to time functions, activated by the last argument.

For use with signature_decorator
"""
stopwatch = time.time
if OP_SYS in ('nt','dos') :
stopwatch = time.clock
t = stopwatch()
ret = _func_(*args,**kwargs)
t = stopwatch() - t
if verbose :
print "Call to",_func_.__name__,"took",t,"sec."
return ret

def timer_wrapper_2(func, *args, **kwargs) :
"""
A wrapper to time functions, activated by a keyword argument..

For use with signature_decorator
"""
stopwatch = time.time
if OP_SYS in ('nt','dos') :
stopwatch = time.clock
verbose = kwargs.pop('verbose',False)
t = stopwatch()
ret = func(*args,**kwargs)
t = stopwatch() - t
if verbose :
print "Call to",_func_.__name__,"took",t,"sec."
return ret


Do note that the wrappers are now defined outside the decorator, and that they must take the wrapped function as first argument. Do note as well the decorators two parameters, the first being the wrapper, the second whether the wrapper takes keyword arguments or not. To get a hang of how this works try the following…

>>> @signature_decorator(timer_wrapper_1) def decorated_sum(a,b=2) : """ This is a function. """ return a+b

>>> help(decorated_sum) Help on function decorated_sum:

decorated_sum(a, b=2, verbose=False) This is a function.

>>> @signature_decorator(timer_wrapper_2,True) def decorated_sum(a,b=2) : """ This is a function. """ return a+b

>>> help(decorated_sum) Help on function decorated_sum:

decorated_sum(a, b=2, **kwargs) This is a function.

There are a million things that could go wrong with this last decorator, specially with conflicting variable or keyword names in the function and the wrapper. If you check the decorator module source code, you can find ways to build robustness to such issues. As for myself, a week’s worth of sleep deprivation is enough: I already know more about decorators than ever thought possible or wished. If you are curious, you can check the final, a little more elaborate, version of the performance_timer decorator I’m sticking to, which does not use this last refinement of the decorator_signature, checking nrp_base.py.

## Home Sweet Home

May 25, 2009

The move to wordpress is now completed… All the relevant content that was in blogspot, is now here for your enjoyment. The importing was smooth, but I had to do a lot of reformatting by hand, and with python’s sensitivity to whitespace errors, I can’t guarantee that I haven’t messed anything up. If you find anything broken, let me know and I’ll fix it.

Once I got started, I got a little carried away, and have done some more housekeeping:

• There now is a Coming Soon page on the top menu.
• There is also a Post Index there, which should provide a convenient way to access the content.

And the big news is that I have set up a Mercurial repository at bitbucket, here’s the link. This should provide two very convenient features for the future:

• No longer will anyone wanting to try out some of the code I post have to cut and paste from several older posts to get the functions running, as any dependencies will now be handled through a few modules stored in the repository.
• For future blog entries, rather than cutting and pasting from the posted code, a single file containing all the code will also be available at bitbucket.

Lastly, I am starting to consider moving on to python 3.0 a wise choice. If there are any regular readers out there with an opinion on this, your comments are more than welcome.

## Divisors

May 13, 2009

When I wrote the first post of the FFT series I confessed it was out of boredom from number theory. That statement got me into a coffee break discussion with a friend and mathematician, who claimed that the FFT is as number theory as it gets, and went on to explain why. Since I usually don’t understand a word of what he is saying when he gets started on one of this rants, which invariably lead to Riemann’s zeta function, I just pretended to acknowledge his doubts, and then forgot about the conversation altogether.

Well, a few weeks down the road it seems he was right: while trying to make my FFT code work, I have found myself dealing with all sorts of number theoretical stuff, ranging from simple ones, like factorization, to arcane, as the Chinese remainder theorem. And since I have this thing about complete expositions, I’m afraid that this interleaving of apparently unrelated number theory posts with the FFT ones is going to be the norm. As a matter of fact the odd post on the Euclidean algorithm was supposed to be a stepping stone to an improved radix 2 FFT, which I finally managed to do without.

But lets get going with today’s subject: computing the divisors of a positive integer. May not seem a very sexy topic at first glance, but there’s beauty in it, as the striking Divisor Plot website put together by Jeffrey Ventrella proves…

Getting into the algorithmic treatment of the problem, there is of course the dumber approach, on the line of “to compute all divisors of a number, do trial division by all numbers smaller than it.” I will not pursue that line of thinking though, and will focus on computing the divisors once the prime factorization of the number is known, which is the efficient way of dealing with this problem.

So I will assume that we know the expansion of the number into its prime factors as

$\displaystyle n = \prod_{i=1}^{\omega(n)} p_i^{a_i}$,

where the function ω(n) represents the number of distinct prime factors, pi, each appearing with the corresponding exponent, ai.

A divisor will have a subset of the prime factors of n as its own factors, so we could represent any divisor as a list of ω(n) numbers, the i-th ranging from 0 up to and including ai. This scheme makes it very easy to calculate the total number of divisors as

$\displaystyle d(n)=\prod_{i=1}^{\omega(n)} (a_i+1)$.

Unordered Generation of Divisors
Furthermore, it points out that the computation of all the divisors could be achieved using ω(n) nested loops. Of course the number of loops to nest will vary from input to input, so recursion is probably the best of choices, or so thinks Tim Peters, as the code below has only cosmetic modifications to what he posted on the Python-list@python.org mailing list:

def divisors(factors) :
"""
Generates all divisors, unordered, from the prime factorization.
"""
ps = sorted(set(factors))
omega = len(ps)

def rec_gen(n = 0) :
if n == omega :
yield 1
else :
pows = [1]
for j in xrange(factors.count(ps[n])) :
pows += [pows[-1] * ps[n]]
for q in rec_gen(n + 1) :
for p in pows :
yield p * q

for p in rec_gen() :
yield p


This approach, while extremely fast, is not much use if you want to produce the divisors sorted, as you will then have to first generate and store all divisors, then sort them.

Ordered Generation of Divisors
If what you are after is not the full set of divisors, or any arbitrary subset of them, but something like “the first divisor larger than x” or “the last divisor smaller than x,” it may be worth trying other algorithms which, while being less efficient, manage to produce the divisors properly sorted.

There’s little literature on the subject, so its hard to find a valid benchmark . Even PARI/GP, which is the golden standard on computational number theory seems, from the little sense I have managed to make out of their Subversion repository, to be generating all divisors (in a similar manner to above) and then sorting them.

So here’s a calculation scheme that can be effectively modified to yield ordered divisors… We start with a pivot value of 1, and an ordered list of all prime factors. Before every factor we split our flow in two:

• one branch skips the factor, and proceeds to the first factor different from the current one, keeping the pivot unchanged
• the other branch first updates the pivot, multiplying it by the factor, yields the result, and then proceeds to the next factor, with the modified pivot.

The flow is probably easier to understand with the example below, showing the branching when computing the divisors of 540 = 22·33·5. Blue arrows skip to the first different factors, red ones go to the next factors, a line going through a factor implies multiplication by it…

To keep track of this structure, I’m going to get fancy, and will build a tree-like data structure, where each item will be a list, holding the factor in the first position, and pointers, i.e. nested lists, to the next elements in the second and third ones. Again, to make things clear, here’s the data structure for 540, using the same color code as before…

Rather than traversing this structure recursively, we will approach it in a more structured way, by keeping a separate list of factors that can be visited in a next step. In the example above, when we start out, we can visit the first factor (a 2), or skip to the third (the first 3), so the list will hold two elements. If we select going to the first 2, we then have two new options, i.e. going on to the second 2, or taking the shortcut to the first three. Note that the two instances point at the first three are different, since once has a pivot value of 1 (and will thus generate the divisor 3), while the other has it set to 2 (an will generate the divisor 6).

If we don’t care about the other of divisors, this method can even be implemented by hand, by keeping a list of pivots and remaining factors, and on each step erasing the first element of the list and generating a divisor, plus adding up to two new items at the bottom. The first steps, plus the resulting list of divisors, are worked out below, again for the case of n = 540.

Now, while when doing it on paper there is not much more choice other than processing the items sequentially, we could choose to process always the element that produces the smallest divisor, and thus produce them in order. The simplest implementation of this idea would simply use a sort on the list of processable items before yielding the next divisor…

def divisors_ordered(factors) :
"""
Generates all divisors, ordered, from the prime factorization
"""
factors =  [(j,factors.count(j)) for j in sorted(set(factors))]
omega = len(factors)
fcts =[factors[0][0],[],[]]
aux = [fcts, None]
for j in xrange(omega) :
p,a = factors[j]
if omega - j > 1 :
aux[1] = [factors[j+1][0],[],[]]
else :
aux[1] = None
for k in xrange(1, a) :
b = [p,[],[]]
aux[0][1] = b
aux[0][2] = aux[1]
aux[0] = b
aux[0][1] = aux[1]
aux[0][2] = aux[1]
aux[0] = aux[1]
div_tree = [[fcts[0],1, fcts]]
yield 1
while div_tree :
div_tree.sort()
divisor, pivot, pointer = div_tree.pop(0)
yield divisor
if pointer[1] :
div_tree += [[divisor * pointer[1][0] , divisor, pointer[1]]]
if pointer[2] :
div_tree += [[pivot * pointer[2][0], pivot, pointer[2]]]


Of course doing a full sort before yielding every divisor kills the efficiency of this algorithm. The fancy (or should I simply say correct?) way of keeping an ordered list, popping the smallest item only, and pushing arbitrary elements, is with a priority queue, which I will implement as a binary heap. Since we will be building our heap from scratch, we only need to worry about push and pop functions.

def push_heap(item, heap = None, compare = cmp) :
"""
Pushes an item into a min-heap, or creates one.
"""
heap_size = None
if not heap :
heap = [item, 1]
else :
heap_size = heap.pop()
heap += [item, heap_size + 1]
i = heap[-1] - 1
val = heap[i]
heap_ordered = False
while i > 0 and not heap_ordered :
n_i = i - 1
if n_i & 1 : # faster than n_i % 2
n_i -= 1
n_i >>= 1 # faster than n_i // 2
n_val = heap[n_i]
if compare(val, n_val) < 0 :
heap&#91;i&#93;, heap&#91;n_i&#93; = heap&#91;n_i&#93;, heap&#91;i&#93;
i, val = n_i, heap&#91;n_i&#93;
else :
heap_ordered = True
return heap

def pop_heap(heap, compare = cmp) :
"""
Pops the smallest item from a min-heap.
"""
heap_size = heap.pop()
if not heap_size:
return None
elif heap_size == 1 :
return heap.pop()
ret, heap&#91;0&#93; = heap&#91;0&#93;, heap.pop()
heap_size -= 1
heap += &#91;heap_size&#93;
i = 0
heap_ordered = False
while not heap_ordered :
n_i = 2 * i + 1
if n_i < heap_size :
if n_i + 1 < heap_size and compare(heap&#91;n_i&#93;,heap&#91;n_i + 1&#93;) > 0 :
n_i += 1
if compare(heap[i], heap[n_i]) > 0 :
heap[i], heap[n_i] = heap[n_i], heap[i]
i = n_i
else :
heap_ordered = True
else :
heap_ordered = True
return ret

def divisors_ordered(factors) :
"""
Generates an ordered list of all divisors
"""
factors =  [(j,factors.count(j)) for j in sorted(set(factors))]
omega = len(factors)
fcts =[factors[0][0],[],[]]
aux = [fcts, None]
for j in xrange(omega) :
p,a = factors[j]
if omega - j > 1 :
aux[1] = [factors[j+1][0],[],[]]
else :
aux[1] = None
for k in xrange(1, a) :
b = [p,[],[]]
aux[0][1] = b
aux[0][2] = aux[1]
aux[0] = b
aux[0][1] = aux[1]
aux[0][2] = aux[1]
aux[0] = aux[1]
yield 1
div_heap = push_heap([fcts[0],1, fcts])
while div_heap :
divisor, pivot, pointer = pop_heap(div_heap)
yield divisor
if pointer[1] :
div_heap = push_heap([divisor * pointer[1][0] , divisor, pointer[1]], div_heap)
if pointer[2] :
div_heap = push_heap([pivot * pointer[2][0], pivot, pointer[2]], div_heap)


The following graph shows the time required by each of the three implementations presented to produce an ordered list of all divisors of pn#, i.e. the product of the first n prime numbers. Do notice how the lower computational complexity of the heap sorting eventually beats the initial speed advantage of the sorting version. Still, for full lists of divisors, generating all of them and then sorting beats any other approach. It is also worth pointing out how the time required for the heap version of the ordered algorithm eventually settles around being 14 times slower than the unordered generator with delayed sorting, i.e. the computational complexities of both are the same, which probably means that sorting dominates the time of the unordered algorithm.

Just to show why we could want to use an algorithm that is more than ten times slower, the following graph shows the time needed to compute the first divisor larger than 1000 for the first few primorials. While using the heap version of the ordered algorithm time is close to constant, the unordered algorithm shows the same behavior as in the previous test.

It is nevertheless worth noting that the break even point is p11# = 200,560,490,130, so unless you are after the first few divisors of a huge number, the first algorithm presented should always be the choice.