Well, all that is about to change!

Business as usual will resume shortly with a post on Fibonacci numbers, and there’s also stuff in the making regarding the Chinese remainder theorem as a preliminary to the prime-factor FFT algorithm. The general plan from over a year ago sort of still holds. And I’m sticking with Python 2.x, at least for the time being.

But I have more ambitious plans, and have to decide if this blog will be part of them…

Over the past months I have spent quite a bit of time looking into several physics problems. Things like the speed at which a liquid evaporates, the rolling resistance of a deformable wheel or the stability of a bicycle. Yes, my wife agrees: I have weird ways of enjoying my free time… So that my newly acquired knowledge will not be lost like tears in the rain, I have the firm intention of writing about all that stuff. Not that any of those subjects would in anyway fit within this blog as it is, so I have registered another one just in case.

Also, after 10 years in the same company, and 5 years doing the same job, I am starting to feel an itching urge to give my career a good shake. As a Spanish saying goes, “el hombre propone y Dios dispone,” (man proposes and then God decides) but the firm intention is to leave Spain and find a challenging, enjoyable, better paying job abroad within the next couple of years. If you happen to want to hire an experienced R&D engineer with an EU passport, do contact me.

Not that I expect my inbox to be flooded by this announcement, so I will also be working on honing some skills to better sell myself in the free market, and will definitely want to leave written testimony of my exploits. Some statistics and data analysis posts could see the light here, but I don’t really see relational databases or web programming fitting in this blog. I could of course open yet another one devoted to more generic programming topics, but at some point I’ll have to decide if it isn’t better to have a single one with more varied content.

Anyway, if there’s anyone out there still listening: welcome back, make yourselves comfortable, and enjoy the ride!

]]>

Project Euler’s problem 266, posted this very last weekend, was proposed by yours truly:

The divisors of 12 are: 1,2,3,4,6 and 12.

The largest divisor of 12 that does not exceed the square root of 12 is 3.

We shall call the largest divisor of an integer n that does not exceed the square root of n the pseudo square root (PSR) of n.

It can be seen that PSR(3102)=47.Let p be the product of the primes below 190.

Find PSR(p) mod 10^{16}.

I had something much more modest in mind, such as “all primes below 100”, since the way I figured it out this was to be solved treating it as a variation of the knapsack problem and then using dynamic programming on it. An approach that will see you die of old age if you try it for the proposed value…

But I then had the pleasure of witnessing how xan, hk and daniel.is.fischer started coming up with alternative, much more sophisticated and elegant methods, that pushed the question all the way up to were it is now. It is both a humbling and edifying experience to see these people at work, so if you have an idea for a problem, do propose it to them.

I won’t go into any more details as to how to solve this particular problem: the right place to ask questions is here.

]]>

Although I have already written something on binary exponentiation, as it applied to modular exponentiation, there’s plenty more to be said about it in more general settings. Basically, today we are after computing the n-th power of a, but a need not be an integer, rational, real or even complex number, but can be any more general mathematical object that can be raised to a power, as a square matrix or a polynomial could be…

For an n-th power to exist, the object being exponentiated must fulfill certain conditions, which can basically be summarized in belonging to a more general set that has a well-behaved multiplication operation defined. Good behavior can be translated into closure and associativity, which in mathematical slang can be expressed as the object belonging to a semigroup. This will often mean that the object belongs to a monoid, and even to a ring or rng… But I’ll follow V.I. Arnold‘s advice, and not get too carried away with the abstract generalizations, so enough said: the thing must be multipliable, and the result must be of the same kind.

Because we are dealing with general mathematical objects, we first need to define the multiplication operation. There are two ways of dealing with this in python. To work on an example, say we want to deal with 2×2 matrices. We could define a `Matrix2x2`

class, and write a `__mul__`

method for it, so that whenever we write `a*b`

, what gets evaluated is `a.__mul__(b)`

. For the same price, I’ve thrown in an `__add__`

method…

class Matrix2x2(object) : def __init__(self,elements = None) : if elements : self._data_ = elements[:4] else : self._data_ = [0] * 4 def __repr__(self) : as_string = [str(j) for j in self._data_] str_length = [len(j) for j in as_string] longest = max(str_length) for j in xrange(4) : as_string[j] =' '*(longest - str_length[j]) + as_string[j] ret = 'Matrix2x2 object:' for j in xrange(2) : ret += '\n[ %s %s ]' % (as_string[2*j], as_string[2*j+1]) return ret def __mul__(self, b) : ret = [self._data_[0]*b._data_[0] + self._data_[1]*b._data_[2]] ret += [self._data_[0]*b._data_[1] + self._data_[1]*b._data_[3]] ret += [self._data_[2]*b._data_[0] + self._data_[3]*b._data_[2]] ret += [self._data_[2]*b._data_[1] + self._data_[3]*b._data_[3]] return Matrix2x2(ret) def __add__(self, b) : return Matrix2x2([self._data_[j] + b._data_[j] for j in xrange(4)])

Or if we don’t want to go through the hassle of dealing with objects, we can agree with ourselves in keeping the matrix’s data in a list in row-major order, and define a standalone multiplication function…

def mat_mul(a, b) : """ Returns the product of two 2x2 square matrices. Computes the product of two 2x2 matrices, each stored in a four element list in row major order. """ ret = [a[0]*b[0] + a[1]*b[2]] ret += [a[0]*b[1] + a[1]*b[3]] ret += [a[2]*b[0] + a[3]*b[2]] ret += [a[2]*b[1] + a[3]*b[3]] return ret

To set a baseline for performance, the starting point has to be the most direct implementation of exponentiation: repeated multiplication. There really isn’t much mistery to it…

import nrp_base @nrp_base.performance_timer def direct_pow(a, n, **kwargs) : """ Computes a**n by direct multiplication. Arguments a - The object to be exponentiated. n - The integral exponent. Keyword Argument mul - A function taking two arguments of the same type as a, and returning their product. If undefined, a's __mul__ method will be used. """ mul = kwargs.pop('mul',None) ret = a if mul is None : mul = lambda x,y : x*y for j in xrange(n-1) : ret = mul(ret,a) return ret

OK, so now we are ready for the fun to begin… The basic idea behind speeding exponentiation up is that a^{4} can be computed with two, instead of three, multiplications, if rather than doing it as a^{4} = a·a·a·a, we do it as a^{2} = a·a, a^{4} = a^{2}·a^{2}, storing the intermediate result for a^{2}…

To expand this idea to exponents other than 4 we basically need to write the exponent in binary. Examples are great for figuring out this kind of things, so we’ll take computing a^{19} as ours. Knowing that 19 in binary is 10011, we now have a choice on how to proceed, which basically amount to start using the bits of the exponent from the least significant one, or from the most significant one…

That 19 in binary is 10011 can also be interpreted as 19 = 2^{0} + 2^{1} + 2^{4}, and so a^{19} = a^{20}·a^{21}·a^{24}… The algorithm then is pretty straightforward:

- start computing sequentially from j = 0 the values of a
^{2j}, where each new value is the square of the preceding one, - when the first non-zero bit is found, set the return value to the corresponding a
^{2j}value, - for subsequent non-zero bits, multiply the return value by the corresponding a
^{2j}value, - once all bits of the exponent have been searched, the return value holds the sought value.

If the exponent is n, then there will be log_{2} n squarings, to compute the a^{2j} values, plus one multiplication less than there are 1’s in the binary expansion of the exponent.

It is also possible to write a^{19} = ((a^{23}·a)^{2})·a. While it is easy to verify that is the case, it may be harder to see where does it come from. The algorithm is as follows:

- Take the exponent bits from most to least significant,
- since the first bit is always a 1, set the return value to a,
- for every bit processed, first square the return value, and then, if the bit is a 1, multiply the return value by a,
- again, once the last bit is used, the return value holds the sought value.

The analysis of the algorithm is similar to the previous one: there will again be log_{2} n squarings, and the same number of multiplications as well.

These second flavor has one clear disadvantage: there’s no simple way to generate the bits of a number from most to least significant. So this approach requires computing and storing the bits from least to most significant, and then retrieve them in inverse order. No that it is anything too complicated, but there are additional time costs involved.

So why would one even worry about this scheme then? Well, it also has a potential advantage: the multiplications performed always involves the original object being exponentiated. This can prove beneficial in two different ways:

- it lends itself better for optimization of particular cases, and
- if the original element involves small numbers, but the end result involves large numbers, the multiplications are usually faster than in the previous algorithm.

To see the differences with examples, it is better to first write the python code…

@nrp_base.performance_timer def bin_pow(a, n, **kwargs) : """ Computes a**n by binary exponentiation. Arguments a - The object to be exponentiated. n - The integral exponent. Keyword Argument mul - A function taking two arguments of the same type as a, and returning their product. If undefined, a's __mul__ method will be used. sqr - A function taking a single argument of the same type as a, and returning its square. If undefined, mul(a,a) will be used. To use a method of the object's class, use sqr=ClassName.function_name. lsb - Set to True to use least significant bits first. Default is False. msb - Set to True to use most significant bits first. Overrides lsb. Default is True. """ mul = kwargs.pop('mul',None) sqr = kwargs.pop('sqr',None) lsb = kwargs.pop('lsb',None) msb = kwargs.pop('msb',None) if mul is None : mul = lambda x, y : x * y if sqr is None : sqr = lambda x : mul(x, x) if lsb is None and msb is None : msb = True lsb = False elif msb is None : msb = not lsb else : lsb = not msb if msb : bits = [] while n > 1 : # The last bit is always a 1... if n & 1 : bits += [1] else : bits += [0] n >>= 1 ret = a while bits : ret = sqr(ret) if bits.pop() : ret = mul(ret,a) else : val = a while not n & 1 : n >>= 1 val = sqr(val) ret = val while n > 1: n >>= 1 val = sqr(val) if n & 1 : ret = mul(ret,val) return ret

`bin_pow`

for a Ride…To get a better grip on how the algorithm performs, we’ll take a very simple matrix, which we will find again in a future post, when we discuss Fibonacci numbers…

`>>> a = Matrix2x2([1,1,1,0])`

>>> a

Matrix2x2 object:

[ 1 1 ]

[ 1 0 ]

So lets raise it first to a small power, and see how the two algorithms perform…

`>>> val = bin_pow(a,10,lsb=True,verbose=True,timer_loops=100)`

Call to bin_pow took:

2.84952e-05 sec. (min)

0.000205613 sec. (max)

3.363e-05 sec. (avg)

0.000178629 sec. (st_dev)

>>> val = bin_pow(a,10,msb=True,verbose=True,timer_loops=100)

Call to bin_pow took:

3.12889e-05 sec. (min)

5.16825e-05 sec. (max)

3.32891e-05 sec. (avg)

2.93951e-05 sec. (st_dev)

Not many surprises here, as the algorithm having to store and retrieve bits is a little slower than the other. But lets see what happens if we scale things…

`>>> val = bin_pow(a,1000000,lsb=True,verbose=True)`

Call to bin_pow took 2.599 sec.

>>> val = bin_pow(a,1000000,msb=True,verbose=True)

Call to bin_pow took 1.76194 sec.

This sort of shows that the multiplication thing can be very relevant. A last little experiment gives more insight into the workings of this…

`>>> 2**20`

1048576

>>> val = bin_pow(a,2**20,lsb=True,verbose=True)

Call to bin_pow took 1.95651 sec.

>>> val = bin_pow(a,2**20,msb=True,verbose=True)

Call to bin_pow took 1.90171 sec.

Funny, isn’t it? Why would it be that the time differences are so much smaller now, even though the exponents are roughly the same? This other times may help answer that…

`>>> val = bin_pow(a,2**20-1,lsb=True,verbose=True)`

Call to bin_pow took 2.61069 sec.

>>> val = bin_pow(a,2**20-1,msb=True,verbose=True)

Call to bin_pow took 1.89222 sec.

Got it now? 2^{20} has a binary expression full of 0’s, 20 of them, preceded by a single 1. This means that neither algorithm is performing any multiplication, just squarings, so the performance is almost identical. On the other hand, 2^{20}-1 has a binary expression full of 1’s, 20 of them, with no 0 anywhere. This means that, although there will be one less squaring than for the previous case, there will be 19 multiplications more. And here’s where the `msb`

approach excels, because the involved multiplications are simpler.

]]>

As presented in the previous post, Cooley-Tukey’s FFT algorithm has a clear limitation: it can only be used to speed the calculation of DFTs of a size that is a power of two. It isn’t hard, though, to extend the same idea to a more general factorization of the input signal size. We start again from the formula that defines the DFT,

Say that N = P·Q. Following the previous approach, we could try to split the calculation of the DFT into Q DFTs of size P, by rewriting it as

where n = Q·p + q. We are again faced with the same issue to turn the inner summation into a proper DFT: while p ranges from 0 to P-1, k goes all the way up to N-1. To get around this, we can make k = P·r + s, and replacing in the previous formula we get

where the inner summation has been reduced to a standard DFT of size P. Furthermore, the splitting of the first exponential in two factors unveils that the outer summation is also a standard DFT, this one of size Q, of the product of the DFT of size P by twiddle factors.

The subscripts in the above formula can get confusing to follow, so take note that q and r go from 0 to Q-1, while p and s go from 0 to P-1. Therefore the computation of the size N DFT has effectively been split into the computation of Q DFTs of size P, followed by twiddle factor multiplication, and then by P DFTs of size Q. It may be helpful to picture the linear array of input data as a PxQ matirx stored in row-major order. Then the Q DFTs of size P are performed on the columns of this matrix, and the P DFTs of size Q on the rows. Note though that the output has to be rearranged, in what would be equivalent to a matrix transposition, before getting to the final answer. As an example, the following picture illustrates the process on an input of size 15.

Disregarding the multiplication by twiddle factors and the matrix transpositions, the algorithm turns a calculation which requires a time proportional to N^{2} = P^{2}Q^{2}, into one which requires a time proportional to P·Q·(P+Q). This very simplistic complexity analysis suggests that the factoring of the size of the original signal into two multiplicative factors isn’t arbitrary, and will affect the performance of the algorithm. And in principle a split into two similarly sized factors, P ~ Q ~ √N should, although often doesn’t, outperform any other approach.

In many practical applications a small P or Q is used, which is then called the radix. If the smaller DFT is in the outer loop. i.e. if Q is the radix, the scheme is termed decimation in time (DIT), while if it is in the inner loop, i.e. if P is the radix, we have decimation in frequency (DIF), which explains the long name of the simplest Cooley-Tukey algorithm described earlier.

The splitting of the input size, N, into the two multiplicative factors, P and Q, can be done in d(N) – 2 different ways, where d(N) is the divisor function, i.e. the number of divisors of N. If we use the same algorithm recursively, the possible paths to the solution can grow very fast, and determining the optimal one is anything but easy. It is in any case a point where we don’t want to make excessive assumptions, and which should be left sufficiently open for user experimentation. So I’ve decided to copy FFTW‘s approach: the first step to peform a FFT is to plan it. In our case, a plan is a data structure which is similar to a tree, holding the information of the factorizations chosen, the function that should be called to perform the next step, and references to those next steps. The following picture sketches two possible plans for a DFT of size 210 = 2·3·5·7.

In line with the image, I have coded a planning function that gives two choices to the user:

**decimation in time**(`dit=True`

) vs.**decimation in frequency**(`dif=True`

)**minimum possible radix**(`min_rad=True`

) vs.**maximum possible radix**(`max_rad=True`

)

Note that the standard radix 2 with decimation in time plan would result from setting `dit=True`

and `min_rad=True`

when the input size is a power of two, and that it would be the default behavior of this planner.

@nrp_base.performance_timer def plan_fft(x, **kwargs) : """ Returns an FFT plan for x. The plan uses direct DFT for prime sizes, and the generalized CT algorithm with either maximum or minimum radix for others. Keyword Arguments dif -- If True, will use decimation in frequency. Default is False. dit -- If True, will use decimation in time. Overrides dif. Default is True. max_rad -- If True, will use tje largest possible radix. Default is False. min_rad -- If True, will use smallest possible radix. Overrides max_rad. Default is True. """ dif = kwargs.pop('dif', None) dit = kwargs.pop('dit', None) max_rad = kwargs.pop('max_rad', None) min_rad = kwargs.pop('min_rad', None) if dif is None and dit is None : dif = False dit = True elif dit is None : dit = not dif else : dif = not dit if max_rad is None and min_rad is None : max_rad = False min_rad = True elif min_rad is None : min_rad = not max_rad else : max_rad = not min_rad N = len(x) plan = [] if min_rad : fct = sorted(nrp_numth.naive_factorization(N)) NN = N ptr = plan for f in fct : NN //= f P = [f, dft, None] Q = [NN, fft_CT, []] if dit : ptr += [Q, P] else : ptr += [P, Q] if NN == fct[-1] : Q[1:3] = [dft, None] break else : ptr = Q[2] else : items ={} def follow_tree(plan, N, fct, **kwargs) : if fct is None: fct = nrp_numth.naive_factorization(N) divs = sorted(nrp_numth.divisors(fct)) val_P = divs[(len(divs) - 1) // 2] P = [val_P] if val_P in items : P = items[val_P] else : fct_P = nrp_numth.naive_factorization(val_P) if len(fct_P) > 1 : P += [fft_CT, []] follow_tree(P[2], val_P, fct_P, dit = dit) else : P +=[dft, None] items[val_P] = P val_Q = N // val_P Q = [val_Q] if val_Q in items : Q = items[val_Q] else : fct_Q = nrp_numth.naive_factorization(val_Q) if len(fct_Q) > 1 : Q += [fft_CT, []] follow_tree(Q[2], val_Q, fct_Q, dit = dit) else : Q +=[dft, None] items[val_Q] = Q if dit : plan += [Q, P] else : plan += [P, Q] follow_tree(plan, N, None, dit = dit) return plan

While the planner can probably be further optimized, especially for `max_rad=True`

, with all the repeated factorizations, this shouldn’t be the critical part of the DFT calculation. Specially so since the plan can be calculated in advance, and reused for all inputs of the same size.

As already discussed in previous posts, one of the simplest ways of trimming down the computation time is to store the twiddle factors, which are almost certain to show up again, and reuse them whenever possible, rather than going through complex exponentiations over and over again. There is a slight difference though with regard to the radix 2 case. There, when computing a size N FFT, the twiddle factors that intervened in the calculations where all those with numerators in the exponent (apart from the ±2·π·i) ranging from 0 to N/2 – 1, inclusive. If N is split as P·Q, rather than as 2·N/2, then the largest numerator that will appear will be (P-1)·(Q-1), which is definitely larger than N/2, but not all of the items up to that value will be required, with many of them missing.

Take for instance a DFT of size 512. If we go for the radix 2 FFT, we will in total use 256 different twiddle factors, ranging from 0 to 255. If rather than that we go for a `max_rad=True`

approach, the first split will be 512 = 16·32, and so the largest twiddle factor will have numerator 15·31 = 465. But with all the intermediate twiddle factors not showing up in the calculation, overall only 217 different ones are needed.

It is also interesting to note that the same twiddle factors that are used in the FFTs, may appear in the DFTs of the leaves of the planning tree. So it is worth modifying the current `dft`

function, so that it can accept precalculated twiddles. Because the twiddle factors sequence will have gaps, a list is no longer the most convenient data structure to hold these values, a dictionary being a better fit for the job. Combining all of these ideas, we can rewrite our function as

@nrp_base.performance_timer def dft(x, **kwargs) : """ Computes the (inverse) discrete Fourier transform naively. Arguments x - Te list of values to transform Keyword Arguments inverse - True to compute an inverse transform. Default is False. normalize - True to normalize inverse transforms. Default is same as inverse. twiddles - Dictionary of twiddle factors to be used by function, used for compatibility as last step for FFT functions, don't set manually!!! """ inverse = kwargs.pop('inverse',False) normalize = kwargs.pop('normalize',inverse) N = len(x) inv = -1 if not inverse else 1 twiddles = kwargs.pop('twiddles', {'first_N' : N, 0 : 1}) NN = twiddles['first_N'] X =[0] * N for k in xrange(N) : for n in xrange(0,N) : index = (n * k * NN // N) % NN twid= twiddles.setdefault(index, math.e**(inv*2j*math.pi*index/NN)) X[k] += x[n] * twid if inverse and normalize : X[k] /= NN return X

Note also the changes to the user interface, with the addition of the `normalize`

keyword argument. This has been introduced to correct a notorious bug in the inverse DFT calculations for the radix 2 FFT algorithm. The changes are implemented and can be checked in the module nrp_dsp.py, although they are very similar in nature to what will shortly be presented for the generalized factorization FFT.

We now have most of the building blocks to put together a FFT function using Cooley-Tukey’s algorithm for general factorizations, a possible realization of which follows:

@nrp_base.performance_timer def fft_CT(x, plan,**kwargs) : """ Computes the (inverse) DFT using Cooley-Tukey's generalized algorithm. Arguments x - The list of values to transform. plan - A plan data structure for FFT execution. Keyword Arguments inverse - if True, computes an inverse FFT. Default is False. twiddles - dictionary of twiddle factors to be used by function, used for ptimization during recursion, don't set manually!!! """ inverse = kwargs.pop('inverse', False) inv = 1 if inverse else -1 N = len(x) twiddles = kwargs.pop('twiddles', {'first_N' : N, 0 : 1}) NN = twiddles['first_N'] X = [0] * N P = plan[0] Q = plan[1] # do inner FFT for q in xrange(Q[0]) : X[q::Q[0]] = P[1](x[q::Q[0]], plan=P[2], twiddles=twiddles, inverse=inverse,normalize=False) # multiply by twiddles for q in xrange(Q[0]) : for s in xrange(P[0]) : index = q * s * NN // N mult = twiddles.setdefault(index, math.e**(inv*2j*math.pi*index/NN)) X[Q[0]*s+q] *= mult # do outer FFT for s in xrange(P[0]) : X[Q[0]*s:Q[0]*(s+1)] = Q[1](X[Q[0]*s:Q[0]*(s+1)],plan=Q[2], twiddles=twiddles,inverse=inverse, normalize=False) # transpose matrix ret = [0] * N for s in xrange(P[0]) : ret[s::P[0]] = X[s*Q[0]:(s+1)*Q[0]] if inverse and N == NN : # normalize inverse only in first call ret = [val / N for val in ret] return ret

For power of two sizes, this algorithm is noticeably slower, by a factor of more than 2 depending on the input size, than the hardcoded radix 2 prior version. This is mostly due to the time spent doing matrix inversion in the last step. Also due to the matrix inversion, minimum radix plans usually outperform maximum radix ones, despite the benefits in FFT and twiddle factor computations the latter present over the former. Of course none of this matters if the input size presents several prime factors other than two, in which case the new algorithm clearly outperforms the radix 2 version, which is then close to the DFT version. Still, it is worth noting for later optimizations that matrix inversion should be done in a more efficient fashion.

]]>

On the comments to the post on the greatest common divisor, Steve pointed out that the code to time function performance could be embedded in the code in a neater way using function decorators. At the time, I confess I did not have a clue of what a decorator is. But I’ve been doing my homework, and so here’s my take on the subject…

Many moons ago, when I ditched my ZX Spectrum for a 80286 PC (with a i287 math coprocessor), and moved from BASIC to C, I eventually found out that there were such things as pointers to functions. What a sick thing to do to functions! What were this Kernighan and Ritchie!? Perverts!? So, still mourning for my lost GOTOs and GOSUBs, I dismissed the whole subject as pornographic, and decided not to learn it on moral grounds. Well, I’ve come a long way since then, am much more open-minded now, and have even developed a keen appreciation for errr… adult entertainment. So, bring in the syntactic ~~porn~~ sugar!!!

The basic idea is that when you write something like…

@my_decorator def decorated_function(a, b, c) : <function definition goes here>

…it is evaluated the same as if you had written…

def decorated_function(a, b, c) : <function definition goes here> decorated_function = my_decorator(decorated_function)

If this last assignment has sent you into stupor, that is simply the sweet way in which python’s indirection handles function pointers, so all there is to it is that `my_decorator`

has to be a function (or a callable object, i.e. one with a __call__ method) that takes a function as only argument, and returns a function (or a callable object). This last return function has to be compatible with the arguments of `decorated_function`

, since calls to the latter will actually call the former.

You can even raise the bar by passing arguments to the decorator, but the basic scheme is the same, if you write this…

@my_decorator(dec_a, dec_b, dec_c) def decorated_function(a, b, c) : <function definition goes here>

…python does the same as if you had written this…

def decorated_function(a, b, c) : <function definition goes here> decorated_function = my_decorator(dec_a, dec_b, dec_c)(decorated_function)

…where `my_decorator(dec_a, dec_b, dec_c)`

should return a function, possibly different depending on the passed parameters, which takes a function as a parameter, which should return yet another function, that better be compatible with the arguments of `decorated_function`

, for the same reasons as before.

Decorators can even be nested, but the functioning is quite predictable s well. You write this…

@one_decorator(one_args) @two_decorators(two_args) def decorated_function(a, b, c) : <function definition goes here>

…and python won’t know the difference from doing this…

def decorated_function(a, b, c) : <function definition goes here> decorated_function = one_decorator(one_args)(two_decorator(two_args)(decorated_function))

…and by now you should be able to figure out what is being returned by what.

If you are still in distress, it may be a good idea to stop here and first read David‘s great article at Siafoo , or Bruce Eckel‘s series (1, 2, 3) at Artima. To see them suckers at work, check the Python Decorator Library. There’s also the option to drink from the pristine springs of PEP 318, not that I recommend it to the uninitiated, though.

So lets learn by doing, and build a decorator to time the performance of the decorated function, trying to mimic as much as possible the approach I have been following up till now: the last of the function’s argument is `verbose`

, which defaults to `False`

, and setting it to `True`

activates printout of the time required by the function to run. This is possible, but messy. On the other hand, what can more easily be achieved is to control the timing with a keyword argument, as we will implement…

I have added more goodies… I learnt from the folks at Leohnard Euler’s Flying Circus that, while `time.clock`

has superior performance than `time.time`

under Windows, it is the opposite on Linux, so we will choose our timing function at run time, through `os.name`

, which holds a string with the operating system your computer is running.

from __future__ import division from os import name as OP_SYS import time def performance_timer_1(_func_) : """ A decorator function to time execution of a function. Timing is activated by setting the verbose keyword to True. """ stopwatch = time.time if OP_SYS in ('nt', 'dos') : stopwatch = time.clock def _wrap_(*args, **kwargs) : """ This is a function wrapper. """ verbose = kwargs.pop('verbose',False) t = stopwatch() ret = _func_(*args, **kwargs) t = stopwatch() - t if verbose : print "Call to",_func_.__name__,"took",t,"sec." return ret return _wrap

Run this file on IDLE, or import it into an interactive python session, and you can do things like these:

`>>> @performance_timer_1`

def decorated_sum(a,b) :

"""

This is a function.

"""

return a+b

`>>> decorated_sum(3,5)`

8

`>>> decorated_sum(3,5,verbose=True)`

Running function decorated_sum took 2.72380986743e-06 sec.

8

If you are wondering about the silly docstrings in the function and the wrapper, try the following:

`>>> help(decorated_sum)`

Help on function wrapper:

`wrapper(*args, **kwargs)`

This is a function wrapper.

Now that’s odd, isn’t it? Well, not really, as it is a natural byproduct of the devilish practice of assigning functions to functions: we are getting help for the function with which the decorator replaced the original function during decoration. Part of it is easy to fix, though, as the following code demonstrates…

import inspect def performance_timer_2(_func_) : """ A decorator function to time execution of a function. Timing is activated by setting the verbose keyword to True. """ stopwatch = time.time if OP_SYS in ('nt', 'dos') : stopwatch = time.clock def _wrap_(*args, **kwargs) : """ This is a function wrapper. """ verbose = kwargs.pop('verbose',False) t = stopwatch() ret = _func_(*args, **kwargs) t = stopwatch() - t if verbose : print "Call to",_func_.__name__,"took",t,"sec." return ret _wrap_.__name__ = _func_.__name__ _wrap_.__doc__ = _func_.__doc__ return _wrap_

If you try the same thing again, this time you will get…

`@performance_timer_2`

def decorated_sum(a,b) :

"""

This is a function.

"""

return a+b

`>>> help(decorated_sum)`

Help on function decorated_sum:

`decorated_sum(*args, **kwargs)`

This is a function.

We’re almost there!!! There is just one disturbing thing remaining: the function signature. See, if we had invoked `help`

on the undecorated `decorated_sum`

, we would have got `decorated_sum(a, b)`

, not the generic arguments and keyword arguments thing.

There are ways to get around it, but things are going to get messy. The primal source is Michele Simionato‘s decorator module. Most of what I will present in the rest of this post is ~~shamelessly ripped off from~~ inspired on his code. You can check the documentation, or the last section of David’s Siafoo article, if you want to use that module.

There are two problems though… The first is of course having to rely on a third party module which isn’t part of the standard library: the rules forbid it!

But there is also a more fundamental issue: we would like to preserve the original function signature, but at the same time we are changing it!!! Going back to the prior example, if we preserved the signature unaltered as `decorated_sum(a, b)`

, we would get a `TypeError: decorated_sum() got an unexpected keyword argument 'verbose'`

every time we tried to activate the timer. Trust me, I found the hard way…

What we would like to get is something like `decorated_sum(a, b, **kwargs)`

. To achieve something like this we are going to have to dynamically generate a function at run-time…

def performance_timer_3(_func_) : """ A decorator function to time execution of a function. Timing is activated by setting the verbose keyword to True. """ stopwatch = time.time if OP_SYS in ('nt', 'dos') : stopwatch = time.clock def _wrap_(*args, **kwargs) : """ This is a function wrapper. """ verbose = kwargs.pop('verbose',False) t = stopwatch() ret = _func_(*args, **kwargs) t = stopwatch() - t if verbose : print "Call to",_func_.__name__,"took",t,"sec." return ret sig = list(inspect.getargspec(_func_)) wrap_sig = list(inspect.getargspec(_wrap_)) if not sig[2] : sig[2] = wrap_sig[2] src = 'def %s%s :\n' %(_func_.__name__, inspect.formatargspec(*sig)) src += ' return _wrap_%s\n' % (inspect.formatargspec(*sig)) evaldict = {'_wrap_' : _wrap_} exec src in evaldict ret = evaldict[_func_.__name__] ret.__doc__ = _func_.__doc__ return ret

If you are not familiar with them, you may want to check the documentation for the inspect module, as well as for the `exec`

statement, to get a hang of what’s going on. It is important to note how the `_wrap_`

function is not simply named in the dynamically generated source code, but included in `evaldict`

. If we didn’t do it this way, it would be out of scope once we left the decorator, and an error will occur. Again, you can trust me on this one…

But the thing is we have finally arrived where we wanted to…

`>>> @performance_timer_3`

def decorated_sum(a,b) :

"""

This is a function.

"""

return a+b

`>>> help(decorated_sum)`

Help on function decorated_sum:

`decorated_sum(a, b, **kwargs)`

This is a function.

So every time you want to do a function decorator with a wrapper activated by a keyword, all you need to do is take this last example as a template, and with some cutting and pasting you’ll be done in no time.

Now wait a minute!!! Wasn’t avoiding this kind of thing what decorators where supposed to be all about? Didn’t we get into this mess to avoid cutting and pasting a couple of lines of code to the beginning and end of every function? What if we built a decorator that automated the decorating process?

def signature_decorator(_wrap_, has_kwargs = False) : """ A decorator to create signature preserving wrappers. _wrap_ : the wrapper function, which must take a function as first argument, can then take several optional arguments, which must have defaults, plus the *args and **kwargs to pass to the wrapped function. has_kwargs : should be set to True if the wrapper also accepts keyword arguments to alter operation. """ def decorator(func): """ An intermediate decorator function """ sig = list(inspect.getargspec(func)) wrap_sig = list(inspect.getargspec(_wrap_)) sig[0], wrap_sig[0] = sig[0] + wrap_sig[0][1:], wrap_sig[0] + sig[0] wrap_sig[0][0] = '_func_' sig[3] = list(sig[3]) if sig[3] is not None else [] sig[3] += list(wrap_sig[3]) if wrap_sig[3] is not None else [] sig[3] = tuple(sig[3]) if sig[3] else None wrap_sig[3] = None if sig[2] is None and has_kwargs: sig[2] = wrap_sig[2] wrap_sig[1:3] = sig[1:3] src = 'def %s%s :\n' % (func.__name__, inspect.formatargspec(*sig)) src += ' return _wrap_%s\n' % (inspect.formatargspec(*wrap_sig)) evaldict = {'_func_':func, '_wrap_': _wrap_} exec src in evaldict ret = evaldict[func.__name__] ret.__doc__ = func.__doc__ return ret return decorator def timer_wrapper_1(_func_, verbose = False, *args, **kwargs) : """ A wrapper to time functions, activated by the last argument. For use with signature_decorator """ stopwatch = time.time if OP_SYS in ('nt','dos') : stopwatch = time.clock t = stopwatch() ret = _func_(*args,**kwargs) t = stopwatch() - t if verbose : print "Call to",_func_.__name__,"took",t,"sec." return ret def timer_wrapper_2(func, *args, **kwargs) : """ A wrapper to time functions, activated by a keyword argument.. For use with signature_decorator """ stopwatch = time.time if OP_SYS in ('nt','dos') : stopwatch = time.clock verbose = kwargs.pop('verbose',False) t = stopwatch() ret = func(*args,**kwargs) t = stopwatch() - t if verbose : print "Call to",_func_.__name__,"took",t,"sec." return ret

Do note that the wrappers are now defined outside the decorator, and that they must take the wrapped function as first argument. Do note as well the decorators two parameters, the first being the wrapper, the second whether the wrapper takes keyword arguments or not. To get a hang of how this works try the following…

`>>> @signature_decorator(timer_wrapper_1)`

def decorated_sum(a,b=2) :

"""

This is a function.

"""

return a+b

`>>> help(decorated_sum)`

Help on function decorated_sum:

`decorated_sum(a, b=2, verbose=False)`

This is a function.

`>>> @signature_decorator(timer_wrapper_2,True)`

def decorated_sum(a,b=2) :

"""

This is a function.

"""

return a+b

`>>> help(decorated_sum)`

Help on function decorated_sum:

`decorated_sum(a, b=2, **kwargs)`

This is a function.

There are a million things that could go wrong with this last decorator, specially with conflicting variable or keyword names in the function and the wrapper. If you check the decorator module source code, you can find ways to build robustness to such issues. As for myself, a week’s worth of sleep deprivation is enough: I already know more about decorators than ever thought possible or wished. If you are curious, you can check the final, a little more elaborate, version of the `performance_timer`

decorator I’m sticking to, which does not use this last refinement of the `decorator_signature`

, checking nrp_base.py.

]]>

The move to wordpress is now completed… All the relevant content that was in blogspot, is now here for your enjoyment. The importing was smooth, but I had to do a lot of reformatting by hand, and with python’s sensitivity to whitespace errors, I can’t guarantee that I haven’t messed anything up. If you find anything broken, let me know and I’ll fix it.

Once I got started, I got a little carried away, and have done some more housekeeping:

- There now is a Coming Soon page on the top menu.
- There is also a Post Index there, which should provide a convenient way to access the content.

And the big news is that I have set up a Mercurial repository at bitbucket, here’s the link. This should provide two very convenient features for the future:

- No longer will anyone wanting to try out some of the code I post have to cut and paste from several older posts to get the functions running, as any dependencies will now be handled through a few modules stored in the repository.
- For future blog entries, rather than cutting and pasting from the posted code, a single file containing all the code will also be available at bitbucket.

Lastly, I am starting to consider moving on to python 3.0 a wise choice. If there are any regular readers out there with an opinion on this, your comments are more than welcome.

]]>

Well, a few weeks down the road it seems he was right: while trying to make my FFT code work, I have found myself dealing with all sorts of number theoretical stuff, ranging from simple ones, like factorization, to arcane, as the Chinese remainder theorem. And since I have this thing about complete expositions, I’m afraid that this interleaving of apparently unrelated number theory posts with the FFT ones is going to be the norm. As a matter of fact the odd post on the Euclidean algorithm was supposed to be a stepping stone to an improved radix 2 FFT, which I finally managed to do without.

But lets get going with today’s subject: computing the divisors of a positive integer. May not seem a very sexy topic at first glance, but there’s beauty in it, as the striking Divisor Plot website put together by Jeffrey Ventrella proves…

Getting into the algorithmic treatment of the problem, there is of course the dumber approach, on the line of “to compute all divisors of a number, do trial division by all numbers smaller than it.” I will not pursue that line of thinking though, and will focus on computing the divisors once the prime factorization of the number is known, which is the efficient way of dealing with this problem.

So I will assume that we know the expansion of the number into its prime factors as

,

where the function ω(n) represents the number of distinct prime factors, p_{i}, each appearing with the corresponding exponent, a_{i}.

A divisor will have a subset of the prime factors of n as its own factors, so we could represent any divisor as a list of ω(n) numbers, the i-th ranging from 0 up to and including a_{i}. This scheme makes it very easy to calculate the total number of divisors as

.

**Unordered Generation of Divisors**

Furthermore, it points out that the computation of all the divisors could be achieved using ω(n) nested loops. Of course the number of loops to nest will vary from input to input, so recursion is probably the best of choices, or so thinks Tim Peters, as the code below has only cosmetic modifications to what he posted on the Python-list@python.org mailing list:

def divisors(factors) : """ Generates all divisors, unordered, from the prime factorization. """ ps = sorted(set(factors)) omega = len(ps) def rec_gen(n = 0) : if n == omega : yield 1 else : pows = [1] for j in xrange(factors.count(ps[n])) : pows += [pows[-1] * ps[n]] for q in rec_gen(n + 1) : for p in pows : yield p * q for p in rec_gen() : yield p

This approach, while extremely fast, is not much use if you want to produce the divisors sorted, as you will then have to first generate and store all divisors, then sort them.

**Ordered Generation of Divisors**

If what you are after is not the full set of divisors, or any arbitrary subset of them, but something like “the first divisor larger than x” or “the last divisor smaller than x,” it may be worth trying other algorithms which, while being less efficient, manage to produce the divisors properly sorted.

There’s little literature on the subject, so its hard to find a valid benchmark . Even PARI/GP, which is the golden standard on computational number theory seems, from the little sense I have managed to make out of their Subversion repository, to be generating all divisors (in a similar manner to above) and then sorting them.

So here’s a calculation scheme that can be effectively modified to yield ordered divisors… We start with a pivot value of 1, and an ordered list of all prime factors. Before every factor we split our flow in two:

- one branch skips the factor, and proceeds to the first factor different from the current one, keeping the pivot unchanged
- the other branch first updates the pivot, multiplying it by the factor, yields the result, and then proceeds to the next factor, with the modified pivot.

The flow is probably easier to understand with the example below, showing the branching when computing the divisors of 540 = 2^{2}·3^{3}·5. Blue arrows skip to the first different factors, red ones go to the next factors, a line going through a factor implies multiplication by it…

To keep track of this structure, I’m going to get fancy, and will build a tree-like data structure, where each item will be a list, holding the factor in the first position, and pointers, i.e. nested lists, to the next elements in the second and third ones. Again, to make things clear, here’s the data structure for 540, using the same color code as before…

Rather than traversing this structure recursively, we will approach it in a more structured way, by keeping a separate list of factors that can be visited in a next step. In the example above, when we start out, we can visit the first factor (a 2), or skip to the third (the first 3), so the list will hold two elements. If we select going to the first 2, we then have two new options, i.e. going on to the second 2, or taking the shortcut to the first three. Note that the two instances point at the first three are different, since once has a pivot value of 1 (and will thus generate the divisor 3), while the other has it set to 2 (an will generate the divisor 6).

If we don’t care about the other of divisors, this method can even be implemented by hand, by keeping a list of pivots and remaining factors, and on each step erasing the first element of the list and generating a divisor, plus adding up to two new items at the bottom. The first steps, plus the resulting list of divisors, are worked out below, again for the case of n = 540.

Now, while when doing it on paper there is not much more choice other than processing the items sequentially, we could choose to process always the element that produces the smallest divisor, and thus produce them in order. The simplest implementation of this idea would simply use a sort on the list of processable items before yielding the next divisor…

def divisors_ordered(factors) : """ Generates all divisors, ordered, from the prime factorization """ factors = [(j,factors.count(j)) for j in sorted(set(factors))] omega = len(factors) fcts =[factors[0][0],[],[]] aux = [fcts, None] for j in xrange(omega) : p,a = factors[j] if omega - j > 1 : aux[1] = [factors[j+1][0],[],[]] else : aux[1] = None for k in xrange(1, a) : b = [p,[],[]] aux[0][1] = b aux[0][2] = aux[1] aux[0] = b aux[0][1] = aux[1] aux[0][2] = aux[1] aux[0] = aux[1] div_tree = [[fcts[0],1, fcts]] yield 1 while div_tree : div_tree.sort() divisor, pivot, pointer = div_tree.pop(0) yield divisor if pointer[1] : div_tree += [[divisor * pointer[1][0] , divisor, pointer[1]]] if pointer[2] : div_tree += [[pivot * pointer[2][0], pivot, pointer[2]]]

Of course doing a full sort before yielding every divisor kills the efficiency of this algorithm. The fancy (or should I simply say correct?) way of keeping an ordered list, popping the smallest item only, and pushing arbitrary elements, is with a priority queue, which I will implement as a binary heap. Since we will be building our heap from scratch, we only need to worry about push and pop functions.

def push_heap(item, heap = None, compare = cmp) : """ Pushes an item into a min-heap, or creates one. """ heap_size = None if not heap : heap = [item, 1] else : heap_size = heap.pop() heap += [item, heap_size + 1] i = heap[-1] - 1 val = heap[i] heap_ordered = False while i > 0 and not heap_ordered : n_i = i - 1 if n_i & 1 : # faster than n_i % 2 n_i -= 1 n_i >>= 1 # faster than n_i // 2 n_val = heap[n_i] if compare(val, n_val) < 0 : heap[i], heap[n_i] = heap[n_i], heap[i] i, val = n_i, heap[n_i] else : heap_ordered = True return heap def pop_heap(heap, compare = cmp) : """ Pops the smallest item from a min-heap. """ heap_size = heap.pop() if not heap_size: return None elif heap_size == 1 : return heap.pop() ret, heap[0] = heap[0], heap.pop() heap_size -= 1 heap += [heap_size] i = 0 heap_ordered = False while not heap_ordered : n_i = 2 * i + 1 if n_i < heap_size : if n_i + 1 < heap_size and compare(heap[n_i],heap[n_i + 1]) > 0 : n_i += 1 if compare(heap[i], heap[n_i]) > 0 : heap[i], heap[n_i] = heap[n_i], heap[i] i = n_i else : heap_ordered = True else : heap_ordered = True return ret def divisors_ordered(factors) : """ Generates an ordered list of all divisors """ factors = [(j,factors.count(j)) for j in sorted(set(factors))] omega = len(factors) fcts =[factors[0][0],[],[]] aux = [fcts, None] for j in xrange(omega) : p,a = factors[j] if omega - j > 1 : aux[1] = [factors[j+1][0],[],[]] else : aux[1] = None for k in xrange(1, a) : b = [p,[],[]] aux[0][1] = b aux[0][2] = aux[1] aux[0] = b aux[0][1] = aux[1] aux[0][2] = aux[1] aux[0] = aux[1] yield 1 div_heap = push_heap([fcts[0],1, fcts]) while div_heap : divisor, pivot, pointer = pop_heap(div_heap) yield divisor if pointer[1] : div_heap = push_heap([divisor * pointer[1][0] , divisor, pointer[1]], div_heap) if pointer[2] : div_heap = push_heap([pivot * pointer[2][0], pivot, pointer[2]], div_heap)

The following graph shows the time required by each of the three implementations presented to produce an ordered list of all divisors of p_{n}#, i.e. the product of the first n prime numbers. Do notice how the lower computational complexity of the heap sorting eventually beats the initial speed advantage of the sorting version. Still, for full lists of divisors, generating all of them and then sorting beats any other approach. It is also worth pointing out how the time required for the heap version of the ordered algorithm eventually settles around being 14 times slower than the unordered generator with delayed sorting, i.e. the computational complexities of both are the same, which probably means that sorting dominates the time of the unordered algorithm.

Just to show why we could want to use an algorithm that is more than ten times slower, the following graph shows the time needed to compute the first divisor larger than 1000 for the first few primorials. While using the heap version of the ordered algorithm time is close to constant, the unordered algorithm shows the same behavior as in the previous test.

It is nevertheless worth noting that the break even point is p_{11}# = 200,560,490,130, so unless you are after the first few divisors of a huge number, the first algorithm presented should always be the choice.

]]>

Once the DFT has been introduced, it is time to start computing it efficiently. Focusing on the direct transform,

if the size of the input is even, we can write N = 2·M and it is possible to split the N element summation in the previous formula into two M element ones, one over n = 2·m, another over n = 2·m + 1. By doing so, after some cancellations in the exponents we arrive at

There’s something very promising here: each of the two summations are very close to being the expression of a DFT of half the size of the original. The only thing preventing them from fully being so is the fact that k runs from 0 to N-1, rather than up to M-1 only. To get around this, we will consider two different cases: k < M and k ≥ M. The first trivially produces a pair of half sized DFTs, which we will term E_{k} and O_{k}, owing to one running over the **e**ven indexes, the other over the **o**dd ones:

.

As for the case k ≥ M, we may as well write it out as k = M + k’. And if we plug this value into the original formula, work a little with the algebra, and then undo the variable change, i.e. plug k’ = M – k back in, we arrive at the following combined formula,

.

The cool thing about this decomposition is that we have turned our N element DFT into two N/2 DFTs, plus some overhead composing the results, which includes calculating twiddle factors, i.e. the complex exponential multiplying the O’s in the last formula. Given the quadratic nature of the time required to calculate the DFT, this amounts to cutting the time in half. And the best part is that we need not stop after one such decomposition: if the original size is a power of two, we can keep doing the same trick recursively until we reach a DFT of only two elements, for which we already have produced explicit formulas. A similar approach can be used for the inverse DFT, the details are built into the following code…

from __future__ import division import math import time def fft_CT(x, inverse = False, verbose = False) : """ Computes the DFT of x using Cooley-Tukey's FFT algorithm. """ t = time.clock() N = len(x) inv = -1 if not inverse else 1 if N % 2 : return dft(x, inverse, False) x_e = x[::2] x_o = x[1::2] X_e = fft_CT(x_e, inverse, False) X_o = fft_CT(x_o, inverse, False) X = [] M = N // 2 for k in range(M) : X += [X_e[k] + X_o[k] * math.e ** (inv * 2j * math.pi * k / N)] for k in range(M,N) : X += [X_e[k-M] - X_o[k-M] * math.e ** (inv * 2j * math.pi * (k-M) / N)] if inverse : X = [j/2 for j in X] t = time.clock() - t if verbose : print "Computed","an inverse" if inverse else "a","CT FFT of size",N, print "in",t,"sec." return X

The speed gain with this approach, restricted for now to power of two sizes, are tremendous even for moderately large input sizes. Mostly so due to the fact the new algorithm scales as N log N. For instance, a 2^{20} (~10^{6}) item input can be processed with the FFT approach in a couple of minutes, while the projected duration using the naive DFT will be more like a couple of weeks…

**Twiddle Factors Again**

While it may not be immediately obvious, the previous code is generously wasting resources by calculating over and over again the same twiddle factors. Imagine, for instance, a 1024 element DFT computed using this algorithm. At some point in the recursion the original DFT will have been exploded into 64 DFTs of 16 elements each. Each of this 64 DFTs must then be composed with the exact same 8 different twiddle factors, but right now they are being calculated from scratch each of the 64 times. Furthermore, half of this 8 different twiddle factors are the same that where needed to compose the 128 DFTs of 8 elements into the 64 DFTs of 16 elements… Summing it all up, to compute a FFT of size 2^{n}, which requires the calculation of 2^{n-1} different twiddle factors, we are calculating as many as n·2^{n-1} of them.

We can easily work around this by precalculating them first, and then passing the list to other functions. For instance as follows…

def fft_CT_twiddles(x, inverse = False, verbose = False, twiddles = None) : """ Computes the DFT of x using Cooley-Tukey's FFT algorithm. Twiddle factors are precalculated in the first function call, then passed down recursively. """ t = time.clock() N = len(x) inv = -1 if not inverse else 1 if N % 2 : return dft(x, inverse, False) M = N // 2 if not twiddles : twiddles = [math.e**(inv*2j*math.pi*k/N) for k in xrange(M)]+[N] x_e = x[::2] x_o = x[1::2] X_e = fft_CT_twiddles(x_e, inverse, False, twiddles) X_o = fft_CT_twiddles(x_o, inverse, False, twiddles) X = [] for k in range(M) : X += [X_e[k] + X_o[k] * twiddles[k * twiddles[-1] // N]] for k in range(M,N) : X += [X_e[k-M] - X_o[k-M] * twiddles[(k - M) * twiddles[-1] // N]] if inverse : X = [j/2 for j in X] t = time.clock() - t if verbose : print "Computed","an inverse" if inverse else "a","CT FFT of size",N, print "in",t,"sec." return X

This additional optimization does not improve the time scaling of the algorithm, but reduces the calculation time to about two thirds of the original one.

]]>

The least common multiple of 2 numbers, lcm(a,b), is the smallest number that is divisible by both a and b. The greatest common divisor of 2 numbers, gcd(a,b), is the largest number that divides both a and b. But just how do you go about calculating them? A most important property is that one can be calculated from the other based on the relation lcm(a,b)·gcd(a,b) = a·b, which ends up meaning that in practice the gcd is always calculated first. But lets not get ahead of ourselves…

Brute Force

I will disregard the dumber brute force approaches, such as “to get gcd(a,b), try all numbers down from min(a,b), until you find one that divides both a and b”, but a very resource wasteful implementation could use a sieving approach for the lcm. See, the lcm has to be smaller than a·b, which is definitely a multiple of both a and b. So if we set up an array of zeros of size a·b, we could use sieving to first set all multiples of a to 1, then do a search over all multiples of b to find the first common multiple. This approach will definitely break down for even moderately large a or b, but it is probably as good as most of us would get on our own without wikipedia…

from time import clock def lcmBruteForce(a, b, verbose = False) : t = clock() sieve = [0]*a*b a,b = max(a,b), min(a,b) for j in range(a-1,a*b,a) : sieve[j] += 1 ret = 0 for j in range(b-1,a*b,b) : if sieve[j] == 1 : ret = j + 1 break t = clock()-t if verbose : print "Calculated the lcm of",a,"and",b,"in",t,"sec." return ret

Integer Factorization

In my case I believe it was Ernesto, back in 1985 or 1986, who showed me into the secret of calculating the lcm and the gcd from the prime factorization of both numbers:

- to find the gcd, keep only factors common to both and b, raised to the smaller exponent, and
- for the lcm, keep all factors, common or not, raised to the largest exponent.

This approach will prove to be very inefficient, but lets give it a go. We will reuse the `factorAndSieve`

function from the post on integer factorization. I have also written an auxiliary function, `listToDict`

, that turns the list of factors with repeating entries, into a dictionary, where keys are factors and corresponding entries are exponents:

import time def list_to_dict(seq) : ret = dict() for j in seq : if j in ret : ret[j] += 1 else : ret[j] = 1 return ret def gcd_factor(a, b, verbose = False) : t = time.clock() aa = list_to_dict(factorAndSieve(a)) bb = list_to_dict(factorAndSieve(b)) rr = dict() for j in bb : if j in aa : rr[j] = min(aa[j], bb[j]) ret = 1 for j in rr : ret *= j**rr[j] t = time.clock() - t if verbose : print "Calculated the gcd of",a,"and",b,"in",t,"sec." return ret

The Euclidean Algorithm

When we looked at the calculation of the modular multiplicative inverse, we already took a look at the extended Euclidean algorithm, and pointed out it gives the gcd of the two passed numbers as a by-product. Not that it was an interesting thing then, since when calculating multiplicative inverses, we normally are interested in relatively prime number pairs, i.e. gcd(a,b) = 1. But if the adjective extended is dropped from the name, we are left with the Euclidean algorithm, which is the standard to calculate the gcd.

The basic idea behind the algorithm is the observation that if a > b, and g is the gcd of a and b, then it is also the gcd of a – b and b. To make it more efficient, rather than subtracting, we can cut corners by using the equivalent fact that, if g is the gcd of a and b, then it is also the gcd of b and a mod b. If we perform that same operation recursively, after a finite number of iterations, the result of the modular operation will eventually reach zero, and the gcd then is equal to the last non-zero remainder. Or

import time def gcd_euclid(a, b, verbose = False) : t = time.clock() aa, bb = a,b while b != 0 : a, b = b, a%b t = time.clock() - t if verbose : print "Calculated the gcd of",aa,"and",bb,"in",t,"sec." return a

**The Binary GCD Algorithm**

Euclid’s algorithm is so fast that there really is not much point in trying to improve it. If we were using C rather than python, we could have taken advantage of the fact that a computer that does all its arithmetic in binary, and use the binary GCD algorithm. In python the resulting code is less clear, and the run times are slightly slower, probably owing to the fact that bitwise operations “don’t get optimized down to single CPU instructions (…) like in C,” or so says a commenter here… The truth is I haven’t been able to implement the algorithm so that it runs faster than the standard Euclidean one, so I’ll spare you my code…

]]>

I have some mixed feelings about how does Fourier analysis qualify for the “uncomplicated complexity” rule I imposed on myself when starting this blog. There certainly is a lot of very elaborate math behind it, but I think that if you skip all the mathematical subtleties and dive head first into a “the Fourier transform gives you the frequencies present in the signal” type of intuitive description, the journey from the more naive implementations of the discrete Fourier transform (DFT) to the many flavors of the FFT is a very enjoyable one, which requires only a limited amount of complication, but gentle doses of complexity. Which is exactly what I’m after…

So I will take the definition of the DFT for granted, and concentrate on the algorithms to compute it efficiently. There will have to be a few leaps of faith, but I will try to mark them clearly, so you can whisper “amen” and keep going without a second thought. Although things may unfold differently, I think there is material for about half a dozen posts. Towards the end I may devote some time to optimizing the code, with things such as avoiding recursion, or doing the calculations in place, but be warned that most of my efforts will be devoted to optimizing the math behind the code, not the code itself.

**The Discrete Fourier Transform**

Without further explanation, we will begin by writing down the analytical expression of the DFT,

and of its corresponding inverse transform,

With the built-in support for complex arithmetic, there really isn’t much mistery in turning these two formulas into python functions, or as I have chosen, one with an `inverse`

switch:

from __future__ import division import math import time def dft(x, inverse = False, verbose = False) : t = time.clock() N = len(x) inv = -1 if not inverse else 1 X =[0] * N for k in xrange(N) : for n in xrange(N) : X[k] += x[n] * math.e**(inv * 2j * math.pi * k * n / N) if inverse : X[k] /= N t = time.clock() - t if verbose : print "Computed","an inverse" if inverse else "a","DFT of size",N, print "in",t,"sec." return X

Of course, having two nested loops of the size of the input means that computation time will grow with N^{2}, which is extremely inefficient.

**Twiddle Factors**

Apart from the FFT algorithms we will start with in the next post, there are some pretty straightforward optimizations at hand. While some texts reserve the name for similar factors appearing in FFT algorithms, the complex exponentials coming up in the DFT formula are often known as twiddle factors. While in principle there are N^{2} of them to calculate, we can take advantage of the fact that

,

and owing to the periodic nature of trigonometric functions, it isn’t hard to see that n·k can be replaced by n·k mod N without altering the result, which means only N different ones do exist.

A python implementation with pre-calculated twiddle factors could look something like this…

from __future__ import division import math import time def dft(x, inverse = False, verbose = False) : t = time.clock() N = len(x) inv = -1 if not inverse else 1 twiddles = [math.e**(inv*2j*math.pi*k/N) for k in xrange(N)] X =[0] * N for k in xrange(N) : for n in xrange(N) : X[k] += x[n] * twiddles[n * k % N] if inverse : X[k] /= N t = time.clock() - t if verbose : print "Computed","an inverse" if inverse else "a","DFT of size",N, print "in",t,"sec." return X

On my machine this code is about 5 times faster than the original one, although it still scales with time as N^{2}, so we are only postponing the inevitable. Do notice as well that the increase in speed comes at the cost of an additional array of complex of size N memory requirement.

**Particular Cases**

Without being touched with insightful inspiration, the only other route to optimization is by working out the math on paper for particular cases, and code explicitly the resulting formulas to avoid doing costly exponentiations.

As examples, if N = 2, it is easy to show, that

and if N = 4, one can similarly arrive at

Adding these particular cases to the python code could be done as follows…

def dft(x, inverse = False, verbose = False) : t = time.clock() N = len(x) X =[0] * N inv = -1 if not inverse else 1 if N == 1 : X[0] = x[0] elif N == 2 : X[0] = x[0] + x[1] X[1] = x[0] - x[1] elif N == 4 : a = x[0] + x[2] b = x[0] - x[2] c = x[1] + x[3] d = x[1] - x[3] X[0] = a + c X[1] = b + inv * 1j * d X[2] = a - c X[3] = b - inv * 1j * d if inverse : X = [k/N for k in X] else : twiddles = [math.e**(inv*2j*math.pi*k/N) for k in xrange(N//2+1)] for k in xrange(N//2+1,N) : twiddles += [twiddles[N-k].conjugate()] for k in xrange(N) : for n in xrange(N) : X[k] += x[n] * twiddles[n * k % N] if inverse : X[k] /= N t = time.clock() - t if verbose : print "Computed","an inverse" if inverse else "a","DFT of size",N, print "in",t,"sec." return X

Overall, and for this particular cases only, an additional factor of 3 increase in speed is achieved in this way. But this solution is not scalable to arbitrary input sizes, and so wiser approaches are needed to tackle even moderately large inputs of few tens of thousands samples…

]]>