2008-06-07

Prime Number Profiling

I've been trying to write an efficient prime number generator. So far the best I've been able to do is about 2.3 seconds for the prime numbers less than one million. Using a normal sieve, it is possible for a C++ program to generate the primes in less than 0.05 seconds. So my goal is to get it to 0.5 seconds without having to resort to funky unsafe operations or basically writing the C solution in Haskell.

Here is the current code:
import Data.List
import Prime.Queue

type Prime = Int
type PrimeCache = ([Prime], Queue Prime)

data Primes = Primes !(Prime, PrimeCache) Primes

primeList :: [Prime]
primeList = 2:3:(nextPrimes (3, ([2], enq newQueue 3)))
where
p2l (Primes (p, _) next) = p:(p2l next)

nextPrimes :: (Prime, PrimeCache) -> [Prime]
nextPrimes (p, cache) = newP:(nextPrimes h)
where
h@(newP, _) = sweep (fst cache, enq (snd cache) p) (p+2)

sweep c n | not $ any ((== 0).(mod n)) (fst shortList) = (n, shortList)
| otherwise = sweep shortList (n+2) -- once odd, always odd
where
shortList = extendedList (limit n) c

extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l@(x:xs), q) | x>=limit || empty q
= (l, q)
| otherwise
= extendedList limit (min:l, newSet)
where
(min, newSet) = deq q

limit n = floor $ sqrt (fromIntegral n :: Float)


Yes, it's not pretty. All it does is keep a list of prime numbers already found and see if the candidate is a factor of any of them.

It implements two basic optimization:

  • After 2, there are no more even primes, so starting at 3 it adds 2 to the next number to test for primeness.
  • It doesn't have to check all previously found prime numbers, only prime numbers that are < sqrt(p).

It uses the handy Queue code from Eric Kidd. So before any optimization is really done other than the simplest algorithmic optimizations, we'll benchmark it.

The profiling has:
 Sun Jun  8 00:14 2008 Time and Allocation Profiling Report  (Final)

10 +RTS -P -s10.stats -RTS

total time = 3.80 secs (190 ticks @ 20 ms)
total alloc = 532,086,816 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc ticks bytes

sweep Prime.PrimeList3 82.6 77.3 157 102850044
limit Prime.PrimeList3 15.8 18.4 30 24537068
nextPrimes Prime.PrimeList3 0.5 1.1 1 1491443
main Main 0.5 1.4 1 1851678
extendedList Prime.PrimeList3 0.0 1.1 0 1502535


individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc ticks bytes

MAIN MAIN 1 0 0.0 0.0 100.0 100.0 0 152
main Main 174 1 0.0 0.0 0.0 0.0 0 8792
CAF Main 168 3 0.0 0.0 0.5 1.4 0 24
main Main 175 0 0.5 1.4 0.5 1.4 1 7397920
CAF GHC.Num 156 1 0.0 0.0 0.0 0.0 0 104
CAF GHC.Handle 120 4 0.0 0.0 0.0 0.0 0 8620
CAF Prime.PrimeList3 95 4 0.0 0.0 99.5 98.6 0 32
primeList Prime.PrimeList3 176 1 0.0 0.0 99.5 98.6 0 12
enq Prime.Queue 183 1 0.0 0.0 0.0 0.0 0 24
nextPrimes Prime.PrimeList3 177 156994 0.5 1.1 99.5 98.6 1 5965772
sweep Prime.PrimeList3 179 500000 82.6 77.3 98.4 96.9 157 411400176
extendedList Prime.PrimeList3 181 500169 0.0 1.1 0.0 1.1 0 6010140
deq Prime.Queue 184 169 0.0 0.0 0.0 0.0 0 13856
empty Prime.Queue 182 169 0.0 0.0 0.0 0.0 0 0
limit Prime.PrimeList3 180 500000 15.8 18.4 15.8 18.4 30 98148272
enq Prime.Queue 178 78207 0.5 0.6 0.5 0.6 1 3132920


Not surprisingly, the function that does a square root and type conversion is rather expensive in time as well as space (15.8% of the time and 18.4% in space). The biggest factor though is the sweep function. Nothing jumps out as a problem in sweep, but we can fix the limit problem by doing the comparison. Instead of x>sqrt(y), we can do (x*x)>y:
extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l@(x:_), q) | (x*x)>=limit || empty q -- n to the square rather then x to root(n)
= (l, q)
| otherwise
= extendedList limit (min:l, newSet)
where
(min, newSet) = deq q

limit n=n


Now for the profile:

 Sun Jun  8 00:43 2008 Time and Allocation Profiling Report  (Final)

10 +RTS -P -s10.stats -RTS

total time = 3.36 secs (168 ticks @ 20 ms)
total alloc = 434,489,520 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc ticks bytes

sweep Prime.PrimeList3 98.8 94.8 166 102989282
main Main 0.6 1.7 1 1851678
nextPrimes Prime.PrimeList3 0.0 1.4 0 1491443
extendedList Prime.PrimeList3 0.0 1.4 0 1502535


individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc ticks bytes

MAIN MAIN 1 0 0.0 0.0 100.0 100.0 0 152
main Main 174 1 0.0 0.0 0.0 0.0 0 8792
CAF Main 168 3 0.0 0.0 0.6 1.7 0 24
main Main 175 0 0.6 1.7 0.6 1.7 1 7397920
CAF GHC.Num 156 1 0.0 0.0 0.0 0.0 0 104
CAF GHC.Handle 120 4 0.0 0.0 0.0 0.0 0 8620
CAF Prime.PrimeList3 95 4 0.0 0.0 99.4 98.3 0 32
primeList Prime.PrimeList3 176 1 0.0 0.0 99.4 98.3 0 12
enq Prime.Queue 182 1 0.0 0.0 0.0 0.0 0 24
nextPrimes Prime.PrimeList3 177 156994 0.0 1.4 99.4 98.3 0 5965772
sweep Prime.PrimeList3 179 500000 98.8 94.8 98.8 96.2 166 411957128
extendedList Prime.PrimeList3 180 500169 0.0 1.4 0.0 1.4 0 6010140
deq Prime.Queue 183 169 0.0 0.0 0.0 0.0 0 11432
empty Prime.Queue 181 169 0.0 0.0 0.0 0.0 0 0
enq Prime.Queue 178 78059 0.6 0.7 0.6 0.7 1 3129368


The execution time (without profiling) dropped to about 1.9 seconds, or 82% the time.

Not bad, but now sweep is where all the time is being spent and all the memory is being allocated. The nastiest part of the function is the test for factors, splitting it out:
sweep c n | hasFactor n (fst shortList) = (n, shortList)
| otherwise = sweep shortList (n+2) -- once odd, always odd
where
shortList = extendedList (limit n) c

hasFactor n = not.(any ((== 0).(mod n)))


And running that, we find (as a diff):
-sweep                          Prime.PrimeList3      98.8   94.8    166 102989282
-main Main 0.6 1.7 1 1851678
+hasFactor Prime.PrimeList3 95.8 91.9 159 99832288
+sweep Prime.PrimeList3 2.4 2.9 4 3156994
+main Main 1.2 1.7 2 1851678


Most of the time and allocation is done in the fairly simple hasFactor function. Here's a guess, most of the time, the small numbers are the factors, but the list is in reverse order. So put the list in the other order:
extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l, q) | (x*x)>=limit || empty q
= (l, q)
| otherwise
= extendedList limit (l++[min], newSet)
where
(min, newSet) = deq q
x = last l


And we now have a running time of 0.9 seconds, the profile diff looks like:
<       total time  =        3.62 secs   (181 ticks @ 20 ms)
< total alloc = 438,489,520 bytes (excludes profiling overheads)
---
> total time = 1.44 secs (72 ticks @ 20 ms)
> total alloc = 154,277,996 bytes (excludes profiling overheads)
10,14c10,15
< hasNoFactor Prime.PrimeList3 96.7 91.1 175 99832288
< nextPrimes Prime.PrimeList3 2.2 1.4 4 1491443
< sweep Prime.PrimeList3 0.6 3.8 1 4156994
< main Main 0.6 1.7 1 1851678
< extendedList Prime.PrimeList3 0.0 1.4 0 1502535
---
> hasNoFactor Prime.PrimeList3 56.9 74.4 41 28678852
> extendedList Prime.PrimeList3 36.1 4.2 26 1603090
> main Main 4.2 4.8 3 1851678
> nextPrimes Prime.PrimeList3 2.8 3.9 2 1491443
> enq Prime.Queue 0.0 2.0 0 782348
> sweep Prime.PrimeList3 0.0 10.8 0 4156994


Figuring out if a number has a factor is now only using half the time, but it's still doing the bulk of the allocations. I'm not sure anything more can be done to speed it up, but on a guess, perhaps it's the partial functions and composition which is causing all the allocations. Lets remove them:


hasNoFactor n = any (\ x -> 0==(n `mod` x))


This does result in an improvement. The run time for the -O2 (no profiling) version takes a small amount under 0.9 seconds. But the amount of allocation is way down.
<       total time  =        1.44 secs   (72 ticks @ 20 ms)
< total alloc = 154,277,996 bytes (excludes profiling overheads)
---
< total time = 1.18 secs (59 ticks @ 20 ms)
< total alloc = 43,562,588 bytes (excludes profiling overheads)
10,15c10,15
< hasNoFactor Prime.PrimeList3 56.9 74.4 41 28678852
< extendedList Prime.PrimeList3 36.1 4.2 26 1603090
< main Main 4.2 4.8 3 1851678
< nextPrimes Prime.PrimeList3 2.8 3.9 2 1491443
< enq Prime.Queue 0.0 2.0 0 782348
< sweep Prime.PrimeList3 0.0 10.8 0 4156994
---
< hasNoFactor Prime.PrimeList3 71.2 9.2 42 1000000
< extendedList Prime.PrimeList3 27.1 14.7 16 1603090
< nextPrimes Prime.PrimeList3 1.7 13.7 1 1491443
< enq Prime.Queue 0.0 7.2 0 782348
< sweep Prime.PrimeList3 0.0 38.2 0 4156994
< main Main 0.0 17.0 0 1851678


I don't think we're going to squeeze any more speed out of the factoring part, but 27% of the time is spent extending the list. The likely culprit is the last function being run all the time. So lets keep track of the last number in that list, and while we're at it, keep the squared number instead of the original since it's squared:

import Data.List
import qualified Data.IntSet as IntSet
import Prime.Queue

type Prime = Int
type PrimeCache = (([Prime], Int), Queue Prime)

data Primes = Primes (Prime, PrimeCache) Primes

primeList :: [Prime]
primeList = 2:3:(nextPrimes (3, (([2], 4), enq newQueue 3)))
where
p2l (Primes (p, _) next) = p:(p2l next)

nextPrimes :: (Prime, PrimeCache) -> [Prime]
nextPrimes (p, cache) = newP:(nextPrimes h)
where
h@(newP, _) = sweep (fst cache, enq (snd cache) p) (p+2)

sweep c n | hasNoFactor n (fst $ fst shortList) = sweep shortList (n+2) -- once odd, always odd
| otherwise = (n, shortList)
where
shortList = extendedList n c

hasNoFactor n = any ((== 0).(mod n))

extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l, q) | x>=limit || empty q
= (l, q)
| otherwise
= extendedList limit (((fst l)++[min], min*min), newSet)
where
(min, newSet) = deq q
x = snd l


Amazingly, this change brought the running time down to 0.6 seconds. The profile is now:
 total time  =        1.10 secs   (55 ticks @ 20 ms)
total alloc = 164,277,996 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc ticks bytes

hasNoFactor Prime.PrimeList3 92.7 69.8 51 28678852
sweep Prime.PrimeList3 3.6 16.2 2 6656994
enq Prime.Queue 1.8 1.9 1 782348
nextPrimes Prime.PrimeList3 1.8 3.6 1 1491443
extendedList Prime.PrimeList3 0.0 3.9 0 1603090
main Main 0.0 4.5 0 1851678


individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc ticks bytes

MAIN MAIN 1 0 0.0 0.0 100.0 100.0 0 152
main Main 174 1 0.0 0.0 0.0 0.0 0 8792
CAF Main 168 3 0.0 0.0 0.0 4.5 0 24
main Main 175 0 0.0 4.5 0.0 4.5 0 7397920
CAF GHC.Num 156 1 0.0 0.0 0.0 0.0 0 104
CAF GHC.Handle 120 4 0.0 0.0 0.0 0.0 0 8620
CAF Prime.PrimeList3 95 4 0.0 0.0 100.0 95.5 0 32
primeList Prime.PrimeList3 176 1 0.0 0.0 100.0 95.5 0 12
enq Prime.Queue 182 1 0.0 0.0 0.0 0.0 0 24
nextPrimes Prime.PrimeList3 177 156994 1.8 3.6 100.0 95.5 1 5965772
sweep Prime.PrimeList3 179 500000 3.6 16.2 96.4 89.9 2 26627976
hasNoFactor Prime.PrimeList3 184 500000 92.7 69.8 92.7 69.8 51 114715408
extendedList Prime.PrimeList3 180 500169 0.0 3.9 0.0 3.9 0 6412360
deq Prime.Queue 183 169 0.0 0.0 0.0 0.0 0 11432
empty Prime.Queue 181 169 0.0 0.0 0.0 0.0 0 0
enq Prime.Queue 178 78059 1.8 1.9 1.8 1.9 1 3129368


At 0.6 seconds to find all prime numbers under 1E6, I'm happy to stop, especially since the C++ version cannot create an infinite list of prime numbers.

2008-05-02

Foreign Function Interface in Haskell

I'm messing around learning to make foreign function interfaces in Haskell. It was sufficiently annoying that I decided to actually document what I did. I picked something, anything. It didn't have to be useful, and I would certainly not implement it in a sane way. The point is to get a chance to try various aspects of the Haskell FFI. I settled just boolean operator evaluation. And not even any boolean operations, just integer operations. So to start, the C header:


/* c_boolean.h */
typedef int (*Operation)(int, int);

int eval(Operation o, int a, int b);
int plus(int a, int b);
int toString(int a, char *buffer, int bufferLen);
void print(char *buffer, int bufferLen);


Pretty simple, define the binary operation as Operation, create a function to evaluate an operation (although really, that's unnecessary but the point of the exercise is to try different things so I had to included function pointers). Then define a simple operation plus, then to add some string manipulation and Haskell managed memory a toString function that converts the int to a string. And finally, a function that returns nothing but actually does something, print.

The C code is pretty dull:


/* c_boolean.c */
#include <stdio.h>
#include "c_boolean.h"

int eval(Operation o, int a, int b) {
return o(a, b);
}

int plus(int a, int b) {
return a+b;
}

int toString(int a, char *buffer, int bufferLen) {
return snprintf(buffer, bufferLen, "%d", a);
}

void print(char *buffer, int bufferLen) {
printf("%s", buffer);
flush(stdout);
}



Like I said, boring. But the interesting stuff is coming up, the Haskell interface to the C code. To write the code, I referred to several places:
http://www.haskell.org/haskellwiki/FFI_Introduction
http://blog.danieroux.com/2007/01/01/simple-demonstration-of-haskell-ffi/
and of course the standard Foreign API functions in:
http://haskell.org/ghc/docs/latest/html/libraries/index.html


-- boolean.hs
import Foreign
import Foreign.C.Types
import Foreign.C.String
import Foreign.Ptr

type Operation = CInt -> CInt -> IO CInt
type C_Operation = FunPtr Operation
foreign import ccall "wrapper"
mkOperation :: Operation -> IO (FunPtr Operation)

foreign import ccall "c_boolean.h eval"
c_eval :: C_Operation -> CInt -> CInt -> IO CInt
my_eval :: Operation -> CInt -> CInt -> IO CInt
my_eval o a b = do
co <- mkOperation o c_eval co a b

foreign import ccall "c_boolean.h plus"
c_plus :: CInt -> CInt -> IO CInt
foreign import ccall "c_boolean.h toString"
c_toString :: CInt -> (Ptr CChar) -> CInt -> IO CInt

with_toString :: Int -> (CStringLen -> IO a) -> CInt -> IO a
with_toString l f n = allocaArray l runF
where
runF str = do
len <- c_toString (fromIntegral n) str n
f (str, l)

foreign import ccall "c_boolean.h print"
c_print :: (Ptr CChar) -> CInt -> IO ()

my_print :: CStringLen -> IO ()
my_print (s, n) = c_print s (fromIntegral n)

minus :: CInt -> CInt -> IO CInt
minus a b = return (a-b)

main = do
cp <- my_eval c_plus 2 6
m <- my_eval minus 5 2
sequence_ [ putStr "C plus: ",
with_toString 10 my_print cp,
putStr "\nHaskell minus: ",
with_toString 10 my_print m,
putStr "\n" ]


Then the compilation and running:

$ gcc -I. c_boolean.c -c && ar rc c_boolean.a c_boolean.o && ranlib c_boolean.a && ghc -L. -lc_boolean -fglasgow-exts -ffi --make boolean.hs -o boolean
[1 of 1] Compiling Main ( boolean.hs, boolean.o )
Linking boolean ...
$ ./boolean
C plus:
Haskell minus:
83$


One problem, even though I used sequence_ and flushed the printf in C, the 8 and 3 are printed at the very end.

Update, fixed formatting.