2008-06-07

Prime Number Profiling

I've been trying to write an efficient prime number generator. So far the best I've been able to do is about 2.3 seconds for the prime numbers less than one million. Using a normal sieve, it is possible for a C++ program to generate the primes in less than 0.05 seconds. So my goal is to get it to 0.5 seconds without having to resort to funky unsafe operations or basically writing the C solution in Haskell.

Here is the current code:
import Data.List
import Prime.Queue

type Prime = Int
type PrimeCache = ([Prime], Queue Prime)

data Primes = Primes !(Prime, PrimeCache) Primes

primeList :: [Prime]
primeList = 2:3:(nextPrimes (3, ([2], enq newQueue 3)))
where
p2l (Primes (p, _) next) = p:(p2l next)

nextPrimes :: (Prime, PrimeCache) -> [Prime]
nextPrimes (p, cache) = newP:(nextPrimes h)
where
h@(newP, _) = sweep (fst cache, enq (snd cache) p) (p+2)

sweep c n | not $ any ((== 0).(mod n)) (fst shortList) = (n, shortList)
| otherwise = sweep shortList (n+2) -- once odd, always odd
where
shortList = extendedList (limit n) c

extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l@(x:xs), q) | x>=limit || empty q
= (l, q)
| otherwise
= extendedList limit (min:l, newSet)
where
(min, newSet) = deq q

limit n = floor $ sqrt (fromIntegral n :: Float)


Yes, it's not pretty. All it does is keep a list of prime numbers already found and see if the candidate is a factor of any of them.

It implements two basic optimization:

  • After 2, there are no more even primes, so starting at 3 it adds 2 to the next number to test for primeness.
  • It doesn't have to check all previously found prime numbers, only prime numbers that are < sqrt(p).

It uses the handy Queue code from Eric Kidd. So before any optimization is really done other than the simplest algorithmic optimizations, we'll benchmark it.

The profiling has:
 Sun Jun  8 00:14 2008 Time and Allocation Profiling Report  (Final)

10 +RTS -P -s10.stats -RTS

total time = 3.80 secs (190 ticks @ 20 ms)
total alloc = 532,086,816 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc ticks bytes

sweep Prime.PrimeList3 82.6 77.3 157 102850044
limit Prime.PrimeList3 15.8 18.4 30 24537068
nextPrimes Prime.PrimeList3 0.5 1.1 1 1491443
main Main 0.5 1.4 1 1851678
extendedList Prime.PrimeList3 0.0 1.1 0 1502535


individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc ticks bytes

MAIN MAIN 1 0 0.0 0.0 100.0 100.0 0 152
main Main 174 1 0.0 0.0 0.0 0.0 0 8792
CAF Main 168 3 0.0 0.0 0.5 1.4 0 24
main Main 175 0 0.5 1.4 0.5 1.4 1 7397920
CAF GHC.Num 156 1 0.0 0.0 0.0 0.0 0 104
CAF GHC.Handle 120 4 0.0 0.0 0.0 0.0 0 8620
CAF Prime.PrimeList3 95 4 0.0 0.0 99.5 98.6 0 32
primeList Prime.PrimeList3 176 1 0.0 0.0 99.5 98.6 0 12
enq Prime.Queue 183 1 0.0 0.0 0.0 0.0 0 24
nextPrimes Prime.PrimeList3 177 156994 0.5 1.1 99.5 98.6 1 5965772
sweep Prime.PrimeList3 179 500000 82.6 77.3 98.4 96.9 157 411400176
extendedList Prime.PrimeList3 181 500169 0.0 1.1 0.0 1.1 0 6010140
deq Prime.Queue 184 169 0.0 0.0 0.0 0.0 0 13856
empty Prime.Queue 182 169 0.0 0.0 0.0 0.0 0 0
limit Prime.PrimeList3 180 500000 15.8 18.4 15.8 18.4 30 98148272
enq Prime.Queue 178 78207 0.5 0.6 0.5 0.6 1 3132920


Not surprisingly, the function that does a square root and type conversion is rather expensive in time as well as space (15.8% of the time and 18.4% in space). The biggest factor though is the sweep function. Nothing jumps out as a problem in sweep, but we can fix the limit problem by doing the comparison. Instead of x>sqrt(y), we can do (x*x)>y:
extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l@(x:_), q) | (x*x)>=limit || empty q -- n to the square rather then x to root(n)
= (l, q)
| otherwise
= extendedList limit (min:l, newSet)
where
(min, newSet) = deq q

limit n=n


Now for the profile:

 Sun Jun  8 00:43 2008 Time and Allocation Profiling Report  (Final)

10 +RTS -P -s10.stats -RTS

total time = 3.36 secs (168 ticks @ 20 ms)
total alloc = 434,489,520 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc ticks bytes

sweep Prime.PrimeList3 98.8 94.8 166 102989282
main Main 0.6 1.7 1 1851678
nextPrimes Prime.PrimeList3 0.0 1.4 0 1491443
extendedList Prime.PrimeList3 0.0 1.4 0 1502535


individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc ticks bytes

MAIN MAIN 1 0 0.0 0.0 100.0 100.0 0 152
main Main 174 1 0.0 0.0 0.0 0.0 0 8792
CAF Main 168 3 0.0 0.0 0.6 1.7 0 24
main Main 175 0 0.6 1.7 0.6 1.7 1 7397920
CAF GHC.Num 156 1 0.0 0.0 0.0 0.0 0 104
CAF GHC.Handle 120 4 0.0 0.0 0.0 0.0 0 8620
CAF Prime.PrimeList3 95 4 0.0 0.0 99.4 98.3 0 32
primeList Prime.PrimeList3 176 1 0.0 0.0 99.4 98.3 0 12
enq Prime.Queue 182 1 0.0 0.0 0.0 0.0 0 24
nextPrimes Prime.PrimeList3 177 156994 0.0 1.4 99.4 98.3 0 5965772
sweep Prime.PrimeList3 179 500000 98.8 94.8 98.8 96.2 166 411957128
extendedList Prime.PrimeList3 180 500169 0.0 1.4 0.0 1.4 0 6010140
deq Prime.Queue 183 169 0.0 0.0 0.0 0.0 0 11432
empty Prime.Queue 181 169 0.0 0.0 0.0 0.0 0 0
enq Prime.Queue 178 78059 0.6 0.7 0.6 0.7 1 3129368


The execution time (without profiling) dropped to about 1.9 seconds, or 82% the time.

Not bad, but now sweep is where all the time is being spent and all the memory is being allocated. The nastiest part of the function is the test for factors, splitting it out:
sweep c n | hasFactor n (fst shortList) = (n, shortList)
| otherwise = sweep shortList (n+2) -- once odd, always odd
where
shortList = extendedList (limit n) c

hasFactor n = not.(any ((== 0).(mod n)))


And running that, we find (as a diff):
-sweep                          Prime.PrimeList3      98.8   94.8    166 102989282
-main Main 0.6 1.7 1 1851678
+hasFactor Prime.PrimeList3 95.8 91.9 159 99832288
+sweep Prime.PrimeList3 2.4 2.9 4 3156994
+main Main 1.2 1.7 2 1851678


Most of the time and allocation is done in the fairly simple hasFactor function. Here's a guess, most of the time, the small numbers are the factors, but the list is in reverse order. So put the list in the other order:
extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l, q) | (x*x)>=limit || empty q
= (l, q)
| otherwise
= extendedList limit (l++[min], newSet)
where
(min, newSet) = deq q
x = last l


And we now have a running time of 0.9 seconds, the profile diff looks like:
<       total time  =        3.62 secs   (181 ticks @ 20 ms)
< total alloc = 438,489,520 bytes (excludes profiling overheads)
---
> total time = 1.44 secs (72 ticks @ 20 ms)
> total alloc = 154,277,996 bytes (excludes profiling overheads)
10,14c10,15
< hasNoFactor Prime.PrimeList3 96.7 91.1 175 99832288
< nextPrimes Prime.PrimeList3 2.2 1.4 4 1491443
< sweep Prime.PrimeList3 0.6 3.8 1 4156994
< main Main 0.6 1.7 1 1851678
< extendedList Prime.PrimeList3 0.0 1.4 0 1502535
---
> hasNoFactor Prime.PrimeList3 56.9 74.4 41 28678852
> extendedList Prime.PrimeList3 36.1 4.2 26 1603090
> main Main 4.2 4.8 3 1851678
> nextPrimes Prime.PrimeList3 2.8 3.9 2 1491443
> enq Prime.Queue 0.0 2.0 0 782348
> sweep Prime.PrimeList3 0.0 10.8 0 4156994


Figuring out if a number has a factor is now only using half the time, but it's still doing the bulk of the allocations. I'm not sure anything more can be done to speed it up, but on a guess, perhaps it's the partial functions and composition which is causing all the allocations. Lets remove them:


hasNoFactor n = any (\ x -> 0==(n `mod` x))


This does result in an improvement. The run time for the -O2 (no profiling) version takes a small amount under 0.9 seconds. But the amount of allocation is way down.
<       total time  =        1.44 secs   (72 ticks @ 20 ms)
< total alloc = 154,277,996 bytes (excludes profiling overheads)
---
< total time = 1.18 secs (59 ticks @ 20 ms)
< total alloc = 43,562,588 bytes (excludes profiling overheads)
10,15c10,15
< hasNoFactor Prime.PrimeList3 56.9 74.4 41 28678852
< extendedList Prime.PrimeList3 36.1 4.2 26 1603090
< main Main 4.2 4.8 3 1851678
< nextPrimes Prime.PrimeList3 2.8 3.9 2 1491443
< enq Prime.Queue 0.0 2.0 0 782348
< sweep Prime.PrimeList3 0.0 10.8 0 4156994
---
< hasNoFactor Prime.PrimeList3 71.2 9.2 42 1000000
< extendedList Prime.PrimeList3 27.1 14.7 16 1603090
< nextPrimes Prime.PrimeList3 1.7 13.7 1 1491443
< enq Prime.Queue 0.0 7.2 0 782348
< sweep Prime.PrimeList3 0.0 38.2 0 4156994
< main Main 0.0 17.0 0 1851678


I don't think we're going to squeeze any more speed out of the factoring part, but 27% of the time is spent extending the list. The likely culprit is the last function being run all the time. So lets keep track of the last number in that list, and while we're at it, keep the squared number instead of the original since it's squared:

import Data.List
import qualified Data.IntSet as IntSet
import Prime.Queue

type Prime = Int
type PrimeCache = (([Prime], Int), Queue Prime)

data Primes = Primes (Prime, PrimeCache) Primes

primeList :: [Prime]
primeList = 2:3:(nextPrimes (3, (([2], 4), enq newQueue 3)))
where
p2l (Primes (p, _) next) = p:(p2l next)

nextPrimes :: (Prime, PrimeCache) -> [Prime]
nextPrimes (p, cache) = newP:(nextPrimes h)
where
h@(newP, _) = sweep (fst cache, enq (snd cache) p) (p+2)

sweep c n | hasNoFactor n (fst $ fst shortList) = sweep shortList (n+2) -- once odd, always odd
| otherwise = (n, shortList)
where
shortList = extendedList n c

hasNoFactor n = any ((== 0).(mod n))

extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l, q) | x>=limit || empty q
= (l, q)
| otherwise
= extendedList limit (((fst l)++[min], min*min), newSet)
where
(min, newSet) = deq q
x = snd l


Amazingly, this change brought the running time down to 0.6 seconds. The profile is now:
 total time  =        1.10 secs   (55 ticks @ 20 ms)
total alloc = 164,277,996 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc ticks bytes

hasNoFactor Prime.PrimeList3 92.7 69.8 51 28678852
sweep Prime.PrimeList3 3.6 16.2 2 6656994
enq Prime.Queue 1.8 1.9 1 782348
nextPrimes Prime.PrimeList3 1.8 3.6 1 1491443
extendedList Prime.PrimeList3 0.0 3.9 0 1603090
main Main 0.0 4.5 0 1851678


individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc ticks bytes

MAIN MAIN 1 0 0.0 0.0 100.0 100.0 0 152
main Main 174 1 0.0 0.0 0.0 0.0 0 8792
CAF Main 168 3 0.0 0.0 0.0 4.5 0 24
main Main 175 0 0.0 4.5 0.0 4.5 0 7397920
CAF GHC.Num 156 1 0.0 0.0 0.0 0.0 0 104
CAF GHC.Handle 120 4 0.0 0.0 0.0 0.0 0 8620
CAF Prime.PrimeList3 95 4 0.0 0.0 100.0 95.5 0 32
primeList Prime.PrimeList3 176 1 0.0 0.0 100.0 95.5 0 12
enq Prime.Queue 182 1 0.0 0.0 0.0 0.0 0 24
nextPrimes Prime.PrimeList3 177 156994 1.8 3.6 100.0 95.5 1 5965772
sweep Prime.PrimeList3 179 500000 3.6 16.2 96.4 89.9 2 26627976
hasNoFactor Prime.PrimeList3 184 500000 92.7 69.8 92.7 69.8 51 114715408
extendedList Prime.PrimeList3 180 500169 0.0 3.9 0.0 3.9 0 6412360
deq Prime.Queue 183 169 0.0 0.0 0.0 0.0 0 11432
empty Prime.Queue 181 169 0.0 0.0 0.0 0.0 0 0
enq Prime.Queue 178 78059 1.8 1.9 1.8 1.9 1 3129368


At 0.6 seconds to find all prime numbers under 1E6, I'm happy to stop, especially since the C++ version cannot create an infinite list of prime numbers.