2010-02-15

Parsing with Parsec

At work I deal with geospatial data a lot. One of the formats I often work with is called RMX. It's a text-based format for describing geometry and the attribution on that geometry. If I want to do any complicated work with RMX using Haskell, I need to parse it in Haskell. I've chosen to use Parsec to learn a bit more about using parser combinators, and I've chosen to parse RMX because it is simple, but its grammar is not entirely trivial.
Here's a sample of RMX:

A|Data Format|ASCII||
A|Default Language|FRE||
A|Build Date|Fri Nov 7 16:15:36 2008||
S|-129.9804805|55.2963299|-129.9823406|55.3020698|541-|B2|ALASKA|541-|B1|UNITED STATES||
S|3.7562004|45.7552403|3.7562795|45.7555702|122+|BUILT-UP AREA|La Chamba||
S|3.7589094|45.7558103|3.7588263|45.7553744|122+|BUILT-UP AREA|La Chamba||
S|3.7607802|45.8210094|3.7604503|45.8206902|222+|BUILT-UP AREA|Noirétable|142+|INDU COMPLEX|Z.I. Rue De L'auvergne||
P|-91.376790|39.931760|130|D\RECREATION|\MADISON PARK\2434\\\\\:ID/37951151||

I won't go into detail about what it all means; I'm just going to describe the basic grammar. An RMX file is composed of line-based records. Each record ends with ||<newline>. There are three kinds of records: annotation (start with A|), point (start with P|), and segment (start with S|).
  • Annotation records have two free-form pipe-delimited text fields.
  • Point records have a point (with the coordinate values pipe separated), followed by one or more UDM triples.
  • Segment records have two points (with the coordinate values pipe separated), followed by one or more UDM triples.
  • A UDM triple is just three pipe-delimited fields which hold attribution information. The first element of a triple must have at least one character (actually it has to have 3-5 characters, but I don't care about that level of detail right now). My reading of the grammar is sketched below.
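Putting those rules together, roughly (an informal EBNF-ish sketch of my own, not anything official):

file   := { record '||' newline }
record := arec | prec | srec
arec   := 'A' '|' field '|' field
prec   := 'P' '|' point triple+
srec   := 'S' '|' point '|' point triple+
point  := number '|' number
triple := '|' field1 '|' field '|' field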
It's a pretty simple format, but not entirely trivial, so it's a good starting point for learning to write a Parsec parser. First part, the boring imports:

import System.IO
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Error
Next part, declarations of what we want the results of parsing to look like:

type UDMTriple = (String, String, String)

data RMXRecord a = ARec String String
                 | PRec (a, a) [UDMTriple]
                 | SRec ((a, a), (a, a)) [UDMTriple]
                 deriving Show

We define a UDM triple as a triple of strings, and an RMX record is exactly as described above.
Now, onto the actual Parsec stuff. The most basic terminal, the pipe character, is used a lot, so I named it:

pipe = char '|'
Next are two more simple definitions. Fields are all pipe delimited, but in case the input is bad we also want to allow newlines to terminate a field:

field1 = many1 (noneOf "|\n\r")
field = many (noneOf "|\n\r")

The difference between field1 and field is that field1 returns a field of one or more characters, while field will return a field of zero or more characters.
The parser for the numbers is more involved:

number = do
  sign <- option 1 (do s <- oneOf "+-"
                       return $ if s == '-' then (-1.0) else 1.0)
  i <- many1 digit
  -- the fractional part is optional; defaulting to "0" keeps read happy
  d <- option "0" (try (char '.' >> many1 digit))
  return $ sign * read (i ++ "." ++ d)

This was a PITA. There's no predefined parser for signed decimal numbers in Parsec, so I had to write my own. Some searching on the web located similar solutions. I'm sure it's not the best way to do it, but it works. It first pulls off the (optional) sign, then the digits before the decimal point, then the (optional) decimal point followed by its digits. Then it reassembles the thing into a string and uses read to turn it into a number. Ugly.
Using the number parser is the point parser, which simply parses two pipe-delimited numbers:

point = do
  x <- number <?> "Floating point number"
  pipe
  y <- number <?> "Floating point number"
  return (x, y)

Here we have the first use of <?>, which attaches a meaningful error message that gets reported when the parser it labels fails.
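Parsec's parseTest is handy for poking at these parsers from GHCi; a hypothetical session (the exact error wording varies between Parsec versions) might look like:

ghci> parseTest point "3.7562004|45.7552403"
(3.7562004,45.7552403)
ghci> parseTest point "3.7562004|x"
parse error at (line 1, column 11):
expecting Floating point number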
Continuing from the bottom of the parser up, the next building block is a parser for triples:

triples = do
  try (pipe >> notFollowedBy (char '|'))
  acc <- field1 <?> "Attribute code"
  pipe
  s1 <- field <?> "String 1"
  pipe
  s2 <- field <?> "String 2"
  return (acc, s1, s2)

The notFollowedBy part is to disambiguate the case of the records ending with ||, which would otherwise be parsed as the start of a new (incomplete) triple with a zero length acc. The function returns a single UDMTriple.
The end of an RMX record is defined with:

eor = try (string "||\n\r")
  <|> try (string "||\r\n")
  <|> try (string "||\n") -- try here too, so a partial "||" match is put back before "||\r" is attempted
  <|> string "||\r"
  <?> "end of record"

The try function tries the match, and if the match fails the characters it pulled off the stream are put back. This is important because it has to try to match against strings of different lengths. This also has the first use of the <|> operator. <|> is basically an or operator: if the left side doesn't match (without consuming input), the right side is attempted.
The next chunk of code defines the parsers for the three different kinds of records:

arec :: GenParser Char st (RMXRecord Double)
arec = do
  char 'A'
  key <- (pipe >> field)
  value <- (pipe >> field)
  return $ ARec key value

prec = do
  char 'P'
  p <- (pipe >> point)
  attrs <- (many1 triples) <?> "UDM triples"
  return $ PRec p attrs

srec = do
  char 'S'
  p1 <- (pipe >> point)
  p2 <- (pipe >> point)
  attrs <- (many1 triples) <?> "UDM triples"
  return $ SRec (p1, p2) attrs

They simply restate the description from the top of the post: the records are composed of points, and triples or fields. Each parser returns a single RMXRecord.
The other part of parsing a single record is the top level record definition. A record can be any of the three kinds:

line :: GenParser Char st (RMXRecord Double)
line = arec <|> prec <|> srec
And the definition of what an RMX file is:

rmxFile :: GenParser Char st [RMXRecord Double]
rmxFile = endBy line ((try eor) <|> (string "||" >> eof >> return ""))

An RMX file is a bunch of lines, each terminated by eor or by || at the end of the file. The second case handles a file missing the newline terminator on the last record.
Finally, a helper function that takes care of opening the file, parsing it, and running a function on the records:

type ParsedLine = (String, Either ParseError (RMXRecord Double))

withRMX :: String -> (ParsedLine -> IO a) -> IO [a]
withRMX fn fun = withFile fn ReadMode (processFile <=< hGetContents)
  where
    processFile c = sequence $! map (\ x -> fun (x, unlist $ parseRMX fn x)) (lines c)

unlist :: Either ParseError [RMXRecord Double] -> Either ParseError (RMXRecord Double)
unlist (Left x) = Left x
unlist (Right (y:ys)) = Right y

parseRMX :: String -> String -> Either ParseError [RMXRecord Double]
parseRMX fn rmx = parse rmxFile fn rmx

The function splits the file into lines and parses each line separately. This is mainly because Parsec (generally) has to load the entire input into memory and parse the whole thing before you get any output, which would not be good for a 5 GB input file. So withRMX uses a lazy contents reader and works line by line.
Another item to note is that parseRMX returns a list of RMXRecords, but the function used by withRMX gets only one RMXRecord at a time. withRMX splits the input into lines, so it knows that each input line has exactly one output record. parseRMX cannot make that assumption, since it could be given a string containing newlines and multiple records.
Another helper function makes it simpler to process only records that don't have errors:

type GoodLine = (String, RMXRecord Double)

good :: (GoodLine -> IO a) -> (ParsedLine -> IO a)
good fun = (\ (l, e) -> either err (\ r -> fun (l, r)) e)
  where
    err e = error $ foldl (\ a b -> a++(messageString b)) "" (errorMessages e)

The good function takes care of creating an error if there's a parse error (it will abort everything though), and if there's no error then it runs the given function on the records.
Here's a simple program that makes use of the parser to only output annotation records:

import System.Environment (getArgs) -- for getArgs

main = do
  (from:rest) <- getArgs
  withRMX from (good onlyAnnotations)

onlyAnnotations :: GoodLine -> IO ()
onlyAnnotations (line, ARec _ _) = putStrLn line
onlyAnnotations _ = return ()

There are, of course, much simpler ways of obtaining all the annotation records, since all annotation records start with 'A'. But the function could do anything, for example extracting segments and points inside a bounding box.
Here's the entire code of the parsing module:

module Data.RMX (good, withRMX, parseRMX, UDMTriple,
                 RMXRecord(ARec, PRec, SRec), ParsedLine, GoodLine) where

import System.IO
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Error
import Control.Monad

type UDMTriple = (String, String, String)

data RMXRecord a = ARec String String
                 | PRec (a, a) [UDMTriple]
                 | SRec ((a, a), (a, a)) [UDMTriple]
                 deriving Show

type ParsedLine = (String, Either ParseError (RMXRecord Double))
type GoodLine = (String, RMXRecord Double)

good :: (GoodLine -> IO a) -> (ParsedLine -> IO a)
good fun = (\ (l, e) -> either err (\ r -> fun (l, r)) e)
  where
    err e = error $ foldl (\ a b -> a++(messageString b)) "" (errorMessages e)

withRMX :: String -> (ParsedLine -> IO a) -> IO [a]
withRMX fn fun = withFile fn ReadMode (processFile <=< hGetContents)
  where
    processFile c = sequence $! map (\ x -> fun (x, unlist $ parseRMX fn x)) (lines c)

unlist :: Either ParseError [RMXRecord Double] -> Either ParseError (RMXRecord Double)
unlist (Left x) = Left x
unlist (Right (y:ys)) = Right y

parseRMX :: String -> String -> Either ParseError [RMXRecord Double]
parseRMX fn rmx = parse rmxFile fn rmx

rmxFile :: GenParser Char st [RMXRecord Double]
rmxFile = endBy line ((try eor) <|> (string "||" >> eof >> return ""))

line :: GenParser Char st (RMXRecord Double)
line = arec <|> prec <|> srec

arec :: GenParser Char st (RMXRecord Double)
arec = do
  char 'A'
  key <- (pipe >> field)
  value <- (pipe >> field)
  return $ ARec key value

prec = do
  char 'P'
  p <- (pipe >> point)
  attrs <- (many1 triples) <?> "UDM triples"
  return $ PRec p attrs

srec = do
  char 'S'
  p1 <- (pipe >> point)
  p2 <- (pipe >> point)
  attrs <- (many1 triples) <?> "UDM triples"
  return $ SRec (p1, p2) attrs

point = do
  x <- number <?> "Floating point number"
  pipe
  y <- number <?> "Floating point number"
  return (x, y)

triples = do
  try (pipe >> notFollowedBy (char '|'))
  acc <- field1 <?> "Attribute code"
  pipe
  s1 <- field <?> "String 1"
  pipe
  s2 <- field <?> "String 2"
  return (acc, s1, s2)

field1 = many1 (noneOf "|\n\r")
field = many (noneOf "|\n\r")

pipe = char '|'

number = do
  sign <- option 1 (do s <- oneOf "+-"
                       return $ if s == '-' then (-1.0) else 1.0)
  i <- many1 digit
  -- the fractional part is optional; defaulting to "0" keeps read happy
  d <- option "0" (try (char '.' >> many1 digit))
  return $ sign * read (i ++ "." ++ d)

eor = try (string "||\n\r")
  <|> try (string "||\r\n")
  <|> try (string "||\n") -- try here too, so a partial "||" match is put back before "||\r" is attempted
  <|> string "||\r"
  <?> "end of record"

2009-08-08

GHC 6.10.4 on RH4

Getting GHC 6.10.4 working on RHEL4 is a bit of a task. RH certainly provides no RPM for such an old version of their OS, and the Linux binary provided by the GHC team doesn't work with such an old libc. So I had to build it. Unfortunately, trying to build it from scratch (bootstrapping) resulted in an error saying I needed a newer version of GNU make: it needs gmake 3.81, and RH4 has gmake 3.80. Since I don't have root access to the machine, I had to build that from source too and install it to ~/opt. No surprises in building or installing make.

Back to ghc 6.10.4 for the bootstrapping. Now that there's a newer version of make, the configure step completes. But when make is run, it complains that it can't run '-Wall'. Clearly, it didn't put the gcc executable into the command line, so it's trying to execute the first parameter intended for gcc. Rather than figure out what was going wrong, I decided to find an older binary of ghc which would run on RH4. GHC 6.8 did not run, but 6.6 did. So I installed GHC 6.6, then built ghc 6.10.4 directly from that. The non-bootstrapped 6.10.4 build worked without a problem.

Next came the testsuite. There's not much point in having a compiler if I don't know it works. So I ran the test suite, and tried building my own project while it was running. My process hit a wall:
can't load .so/.DLL for: rt (/usr/lib/librt.so: symbol __librt_multiple_threads, version GLIBC_PRIVATE not defined in file libc.so.6 with link time reference)

Also, many of the tests in the testsuite were failing.

With the help of the #ghc IRC channel, I found a known bug, and some pointers to possible solutions. It turns out that in RH4, glibc and the kernel have different linuxthread implementations. A program built to run with non-NPTL threads will give the above complaint if the system tries to run it with the NPTL libraries. The standard way of getting the system to pick the non-NPTL libraries is to set an environment variable:
LD_ASSUME_KERNEL=2.4.1

This got rid of the pthread error, but I was faced with another one:
[x@bldrh4 HaskellRME]$ LD_ASSUME_KERNEL=2.4.1 runhaskell Setup.lhs configure
Configuring haskellrme-0.0...
Setup.lhs: ghc version >=6.4 is required but the version of
/home/x/opt/ghc-6.10.4/bin/ghc could not be determined.
[x@bldrh4 HaskellRME]$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.10.4


So, WTF? Just to be sure my new GHC was fully working, I rebuilt GHC with the 6.10.4 I had just built. The compiler turned out to be just fine: it ran way faster than 6.6 and produced much smaller output. The rebuild of ghc took under 2 hours on this machine, while the first time around with 6.6 it took more than 3 hours; with the 6.6 compiler the whole 6.10.4 build directory was nearly 2 GB, but with the rebuild it was only 1.2 GB. So there was no problem with my compiler, but I installed the new one anyways.

Now the testsuite runs with only two unexpected errors (break017(ghci) and ghcpkg02(normal)), but the cabal stuff still can't seem to determine the version of GHC I have. Thanks to the patient help of the IRC channel again, I narrowed the problem down to something to do with multi-threading and spawning sub-processes. The problem, it appears, shows up when the sub-process exits too quickly for the parent. When that happens, the parent gets nothing and, in this case, thinks that ghc has no version. Using runghc with -v3 would slow down the execution enough that it would sometimes work:
[x@bldrh4 HaskellRME]$ runghc Setup.lhs configure -v3
Configuring haskellrme-0.0...
Creating dist (and its parents)
searching for ghc in path.
found ghc at /home/x/opt/ghc-6.10.4/bin/ghc
("/home/x/opt/ghc-6.10.4/bin/ghc",["--numeric-version"])
/home/x/opt/ghc-6.10.4/bin/ghc is version 6.10.4
looking for package tool: ghc-pkg near compiler in
/home/x/opt/ghc-6.10.4/bin
found package tool in /home/x/opt/ghc-6.10.4/bin/ghc-pkg
("/home/x/opt/ghc-6.10.4/bin/ghc-pkg",["--version"])
/home/x/opt/ghc-6.10.4/bin/ghc-pkg is version 6.10.4
("/home/x/opt/ghc-6.10.4/bin/ghc",["--supported-languages"])
Setup.lhs: waitForProcess: does not exist (No child processes)

[x@bldrh4 HaskellRME]$ runghc Setup.lhs configure -v3
Configuring haskellrme-0.0...
Creating dist (and its parents)
searching for ghc in path.
found ghc at /home/x/opt/ghc-6.10.4/bin/ghc
("/home/x/opt/ghc-6.10.4/bin/ghc",["--numeric-version"])
/home/x/opt/ghc-6.10.4/bin/ghc is version 6.10.4
looking for package tool: ghc-pkg near compiler in
/home/x/opt/ghc-6.10.4/bin
found package tool in /home/x/opt/ghc-6.10.4/bin/ghc-pkg
("/home/x/opt/ghc-6.10.4/bin/ghc-pkg",["--version"])
/home/x/opt/ghc-6.10.4/bin/ghc-pkg is version 6.10.4
("/home/x/opt/ghc-6.10.4/bin/ghc",["--supported-languages"])
Reading installed packages...
("/home/x/opt/ghc-6.10.4/bin/ghc-pkg",["dump","--global"])
Setup.lhs: At least the following dependencies are missing:
regex-tdfa -any && -any


Just compiling the Setup.lhs (with --make) resulted in a binary that always worked. After that, it was just a matter of installing cabal-install, installing regex-tdfa, and then building my package with the compiled Setup executable, all of which went without a hitch.
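For reference, that step is just the usual Cabal dance (flags from memory, so treat this as a sketch):

$ ghc --make Setup.lhs -o Setup
$ ./Setup configure
$ ./Setup build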

So thanks to all the help, I now have a working RH4 GHC and working binaries for the ancient RH4.

2009-06-28

Compojure and SQLite Part 2

Continuing from the previous post, I added table inserts. In addition, I started adding some decorators to help avoid repeating code. Since there was some refactoring, many functions changed a lot.

Here's the top chunk:
(ns db-edit
  (:require [clojure.contrib.sql :as sql])
  (:use clojure.contrib.json.write)
  (:use compojure))

; Information about the SQLite database
(def db-params {:classname "org.sqlite.JDBC"
                :subprotocol "sqlite"
                :subname "./database.sqlite"})

(def http-error-map
  {400 {:name "Bad Request"
        :long "The request could not be understood by the server due to malformed syntax."}
   401 {:name "Unauthorized"
        :long "The request requires user authentication."} ; Note WWW-Authenticate header should be added
   404 {:name "Not Found"
        :long "The server has not found anything matching the Request-URI."}})

(defn error [err]
  (if-let [errMap (get http-error-map err)]
    {:status err
     :headers {}
     :body (str "HTTP error " err ": " (get errMap :name) "\n" (get errMap :long))}
    {:status err
     :headers {}
     :body (str "HTTP Error " err " encountered")}))

(defn sanitize-sql
  "Cleans the string of nasty SQL related special chars"
  [s]
  (. s (replaceAll "[\\s'\";#%]" "")))

(defn cleanPath
  "Sanitize the incoming string for '..'"
  [p]
  (.. p (replaceAll "/\\.+" "/") (replaceAll "/+" "/")))

(defn log [level & rest]
  (if (= level :debug)
    (. org.mortbay.log.Log debug (str rest))
    (if (= level :info)
      (. org.mortbay.log.Log info (str rest))
      (. org.mortbay.log.Log warn (str rest)))))


None of that code changed, although I did add the log function to help with debugging. It just sends messages to the Jetty log, since I'm using Jetty as the servlet container.

The three functions for retrieving data were simplified a little. They no longer call sql/with-connection or json-str, so they now contain nothing specific to Compojure or the web, and nothing for setting up the database connection. The functions are now much more focused on exactly what they are intended to do. Here are the new definitions:
(defn get-table-meta
  "Demonstrate getting table info"
  [f]
  (map f
       (resultset-seq
        (-> (sql/connection)
            (.getMetaData)
            (.getTables nil nil nil nil)))))

(defn get-column-meta
  "Given a table name, return a collection of the field names"
  [t f]
  (map f
       (resultset-seq
        (-> (sql/connection)
            (.getMetaData)
            (.getColumns nil nil t nil)))))

(defn get-table
  "Given a table name, return a collection of structmaps for all the data in the table"
  [t]
  (sql/with-query-results res
    [(str "SELECT * FROM " (sanitize-sql t))]
    (into [] res)))


In order to make sure those functions have a database connection and have their output converted to JSON, I added two chunks of middleware:
(defn with-db
  [handler]
  (fn [request]
    (sql/with-connection db-params (handler request))))

(defn with-json
  [handler]
  (fn [request]
    (json-str (handler request))))


In Compojure, middleware is a wrapper around the request handlers. A middleware function returns a new function that does something to the request before calling the handler, or does something to the handler's response afterwards. In the first example above, the with-db function calls the handler inside the body of sql/with-connection. There's no magic here. A handler function used to look like:
(defn get-table-meta
  "Demonstrate getting table info"
  [f]
  (sql/with-connection db-params
    (map f
         (resultset-seq
          (-> (sql/connection)
              (.getMetaData)
              (.getTables nil nil nil nil))))))


But with the decorator, calling with-connection is factored out.
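To make the mechanics concrete, here's a tiny hand-rolled example (ping-handler is hypothetical, not part of the app), showing that a decorator is just a function that returns a wrapped handler:

;; a trivial handler, for illustration only
(defn ping-handler [request]
  {:status 200 :headers {} :body "pong"})

;; with-db returns a new handler: calling it opens a database
;; connection, runs ping-handler inside it, then closes the connection
(def ping-with-db (with-db ping-handler))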

In the second example decorator, with-json, the result of the handler is converted to a JSON string. This has the potential to cause problems: if a handler returns an error code or a response map, the client won't get that error or response; it'll get the JSON string version of the error or response map. Fortunately, decorators have to be explicitly added to handlers:

(defn tables-handler [request]
  (get-table-meta table-name))

(defn columns-handler [request]
  (get-column-meta (:table (:params request)) column-name))

(defn table-handler [request]
  (get-table (:table (:params request))))

(defn insert-handler [request]
  (insert-row (:table (:params request)) (dissoc (conj (:params request) {}) :table)))

(decorate tables-handler (with-json) (with-db))
(decorate table-handler (with-json) (with-db))
(decorate columns-handler (with-json) (with-db))
(decorate insert-handler (with-json) (with-db))

(defroutes my-routes
  (POST "/insert/:table" insert-handler)
  (ANY "/columns/:table" columns-handler)
  (ANY "/table/:table" table-handler)
  (ANY "/tables" tables-handler)
  (ANY "/*" (or (serve-file "./static" (cleanPath (params :*))) :next))
  (ANY "/*" (error 400)))

(run-server {:port 8080}
            "/*" (servlet my-routes))


As you can see, the route definitions now use plain handler functions. Where before one of the routes was (ANY "/columns/:table" (get-column-meta (params :table) column-name)), it's now (ANY "/columns/:table" columns-handler), and the columns-handler function takes care of pulling the arguments out of the request and calling get-column-meta. This way the decorators can be added to the handlers, and the logic stays out of the otherwise simple path specifications in defroutes.

Inserting a new record is pretty simple:
(defn insert-row [table fields]
  (sql/insert-records table fields)
  ["Added record" fields])


That's it. The with-db decorator is added so it has a db connection to use, and the handler takes care of making sure the arguments are correct.
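An insert can also be exercised from the command line; assuming a hypothetical pets table with a name column, the response should look something like:

$ curl -d "name=Rex" http://localhost:8080/insert/pets
["Added record",{"name":"Rex"}]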

Here's the full code:

./db-edit.clj:
(ns db-edit
  (:require [clojure.contrib.sql :as sql])
  (:use clojure.contrib.json.write)
  (:use compojure))

; Information about the SQLite database
(def db-params {:classname "org.sqlite.JDBC"
                :subprotocol "sqlite"
                :subname "./database.sqlite"})

(def http-error-map
  {400 {:name "Bad Request"
        :long "The request could not be understood by the server due to malformed syntax."}
   401 {:name "Unauthorized"
        :long "The request requires user authentication."} ; Note WWW-Authenticate header should be added
   404 {:name "Not Found"
        :long "The server has not found anything matching the Request-URI."}})

(defn error [err]
  (if-let [errMap (get http-error-map err)]
    {:status err
     :headers {}
     :body (str "HTTP error " err ": " (get errMap :name) "\n" (get errMap :long))}
    {:status err
     :headers {}
     :body (str "HTTP Error " err " encountered")}))

(defn sanitize-sql
  "Cleans the string of nasty SQL related special chars"
  [s]
  (. s (replaceAll "[\\s'\";#%]" "")))

(defn cleanPath
  "Sanitize the incoming string for '..'"
  [p]
  (.. p (replaceAll "/\\.+" "/") (replaceAll "/+" "/")))

(defn log [level & rest]
  (if (= level :debug)
    (. org.mortbay.log.Log debug (str rest))
    (if (= level :info)
      (. org.mortbay.log.Log info (str rest))
      (. org.mortbay.log.Log warn (str rest)))))

(defn with-db
  [handler]
  (fn [request]
    (sql/with-connection db-params (handler request))))

(defn with-json
  [handler]
  (fn [request]
    (json-str (handler request))))

(defn get-table-meta
  "Demonstrate getting table info"
  [f]
  (map f
       (resultset-seq
        (-> (sql/connection)
            (.getMetaData)
            (.getTables nil nil nil nil)))))

(defn get-column-meta
  "Given a table name, return a collection of the field names"
  [t f]
  (map f
       (resultset-seq
        (-> (sql/connection)
            (.getMetaData)
            (.getColumns nil nil t nil)))))

(defn get-table
  "Given a table name, return a collection of structmaps for all the data in the table"
  [t]
  (sql/with-query-results res
    [(str "SELECT * FROM " (sanitize-sql t))]
    (into [] res)))

(defn table-name
  "Given a mapstruct from the get-table-meta function, give back a string"
  [t]
  (:table_name t))

(defn column-name
  "Given a mapstruct from the table-col-names, give back the column name"
  [c]
  (:column_name c))

(defn insert-row [table fields]
  (sql/insert-records table fields)
  ["Added record" fields])

(defn tables-handler [request]
  (get-table-meta table-name))

(defn columns-handler [request]
  (get-column-meta (:table (:params request)) column-name))

(defn table-handler [request]
  (get-table (:table (:params request))))

(defn insert-handler [request]
  (insert-row (:table (:params request)) (dissoc (conj (:params request) {}) :table)))

(decorate tables-handler (with-json) (with-db))
(decorate table-handler (with-json) (with-db))
(decorate columns-handler (with-json) (with-db))
(decorate insert-handler (with-json) (with-db))

(defroutes my-routes
  (POST "/insert/:table" insert-handler)
  (ANY "/columns/:table" columns-handler)
  (ANY "/table/:table" table-handler)
  (ANY "/tables" tables-handler)
  (ANY "/*" (or (serve-file "./static" (cleanPath (params :*))) :next))
  (ANY "/*" (error 400)))

(run-server {:port 8080}
            "/*" (servlet my-routes))


static/index.html:
<script type="text/javascript">
  // You may specify partial version numbers, such as "1" or "1.3",
  // with the same result. Doing so will automatically load the
  // latest version matching that partial revision pattern
  // (i.e. both 1 and 1.3 would load 1.3.2 today).
  google.load("jquery", "1.3.2");
  google.load("jqueryui", "1.7.1");
  google.setOnLoadCallback(function() {
    loadTableNames();
  });
</script>
<script type="text/javascript" src="db-edit.js"></script>
</head>
<body>
  <h1>DB Edit</h1>
  <div id="content"></div>
</body>
</html>


static/db-edit.js
function tableNamesJSON(json) {
  var str = "<ul>\n";
  $.each(json, function(i, t) {
    str += " <li><a href='#' onClick='loadTable(\"" + t + "\")'>" + t + "</a></li>\n";
  });
  str += "</ul>\n";
  return str;
}

function loadTableNames() {
  $.getJSON("/tables", function(data, textStatus) {
    $("#content").html(tableNamesJSON(data));
  });
}

function tableDataJSON(json, columns) {
  var str = "";
  $.each(json, function(i, r) {
    str += " <tr>\n";
    $.each(r, function(j, c) {
      str += " <td>" + c + "</td>\n";
    });
    str += " </tr>\n";
  });
  return str;
}

function columnsJSON(json) {
  var str = "";
  $.each(json, function(i, c) {
    str += " <td>" + c + "</td>\n";
  });
  return str;
}

function loadTable(t) {
  $("#content").html("<table id='t' border='1'><tr id='t_head'></tr></table>");
  $.getJSON("/columns/" + t, function(data, status) {
    $("#t_head").html(columnsJSON(data));
    loadTableData(t, data);
  });
  // createInsertForm appends the form itself inside its callback,
  // so it's called for effect rather than having its result appended
  createInsertForm(t);
}

function loadTableData(t, columns) {
  $.getJSON("/table/" + t, function(data, textStatus) {
    $("#t").append(tableDataJSON(data, columns));
  });
}

function createInsertForm(t) {
  $.getJSON("/columns/" + t, function(data, status) {
    var str = "<form method='POST' action='/insert/" + t + "'>\n";
    $.each(data, function(i, c) {
      str += " " + c + ": <input type='text' name='" + c + "'>\n";
      numCol = i;
    });
    str += " <input type='submit'>\n";
    str += "</form>";
    $("#content").append(str);
  });
}


Update: Escape JS and HTML. Thanks to a handy dandy page.

2009-06-22

Compojure and SQLite

Continuing with learning Clojure and Compojure, I decided I'd wade through some basic DB access. Eric Lavigne has a great post with all the basics. I decided to write a small web app that allows a database to be viewed and (eventually) modified. Again, it's fairly simple so I don't confuse myself. The full clj file is included at the bottom of the post.

The app starts with some basic definitions:
(ns db-edit
  (:require [clojure.contrib.sql :as sql])
  (:use clojure.contrib.json.write)
  (:use compojure))

This creates a namespace for the app called "db-edit", and imports the Clojure sql and json libs, as well as the Compojure stuff. I'm still not quite sure of the full meaning of Clojure namespaces, or the syntax used, but it works.

Next, define the stuff for creating a database connection:
(def db-params {:classname "org.sqlite.JDBC"
                :subprotocol "sqlite"
                :subname "./database.sqlite"})

The :classname is obviously the class to use for database access. The :subprotocol is the second part of the standard Connection.getConnection url parameter (the form of that url is "jdbc:subprotocol:subname") and is DB driver dependent. The :subname is, of course, the third part of the url parameter.
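With the values above, the connection URL the sql lib builds should come out as jdbc:sqlite:./database.sqlite.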

Following the DB definition are some handy definitions for HTTP errors:
(def http-error-map
  {400 {:name "Bad Request"
        :long "The request could not be understood by the server due to malformed syntax."}
   401 {:name "Unauthorized"
        :long "The request requires user authentication."} ; Note WWW-Authenticate header should be added
   404 {:name "Not Found"
        :long "The server has not found anything matching the Request-URI."}})

It's just a map from HTTP error codes to an English name and a longer description. It's followed by:
(defn error [err]
  (if-let [errMap (get http-error-map err)]
    {:status err
     :headers {}
     :body (str "HTTP error " err ": " (get errMap :name) "\n" (get errMap :long))}
    {:status err
     :headers {}
     :body (str "HTTP Error " err " encountered")}))

The error function returns a Compojure response for the error. It sets the HTTP status to the error code and gives a body with the name and description.
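At the REPL the result is easy to see; given the map above, this is what comes back:

(error 404)
;=> {:status 404,
;    :headers {},
;    :body "HTTP error 404: Not Found\nThe server has not found anything matching the Request-URI."}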

Since there's going to be some static file serving and some user-submitted data going into the db, two functions are defined to do a cheap-ass sanitization of strings. Although the path sanitizer is probably good enough, the sql one is probably not something you'd use in a real production app. Of course, in a real production app you wouldn't be getting table names from the user, so it may be a moot point.
(defn sanitize-sql
  "Cleans the string of nasty SQL related special chars"
  [s]
  (. s (replaceAll "[\\s'\";#%]" "")))

(defn cleanPath
  "Sanitize the incoming string for '..'"
  [p]
  (.. p (replaceAll "/\\.+" "/") (replaceAll "/+" "/")))


Now, I'm going to go out of order. Here are the routes the app is going to handle:

  • /columns/TableName will return a JSON array with the column names of the table
  • /table/TableName will return a JSON array of objects. Each object is a row of the table
  • /tables will return a JSON array with the names of all tables in the database
  • Anything else is searched for in the ./static directory and served up as a regular file

(defroutes my-routes
  (ANY "/columns/:table" (get-column-meta (params :table) column-name))
  (ANY "/table/:table" (get-table (params :table)))
  (ANY "/tables" (get-table-meta table-name))
  (ANY "/*" (or (serve-file "./static" (cleanPath (params :*))) :next))
  (ANY "/*" (error 400)))

(run-server {:port 8080}
            "/*" (servlet my-routes))


The serve-file line is taken pretty much directly out of one of the wikibooks pages. The error line doesn't do anything unless the file serve fails (that's the point of the :next).

Now, into the database stuff. First up: getting the list of table names:
(defn get-table-meta
  "Demonstrate getting table info"
  [f]
  (sql/with-connection db-params
    (json-str (map f
                   (resultset-seq
                    (-> (sql/connection)
                        (.getMetaData)
                        (.getTables nil nil nil nil)))))))

(defn table-name
  "Given a mapstruct from the get-table-meta function, give back a string"
  [t]
  (:table_name t))

The get-table-meta is pretty much taken directly from the sql contrib docs, but you also give it a function to extract only the meta information you need. The getTables method returns (in Clojure) a sequence of structmaps. The table-name function just retrieves the table name from the structmap. The -> is Clojure's thread-first macro: it takes the result of the first form and threads it through the following forms as their first argument, which avoids deeply nested method calls.
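To see what the macro actually does, here is the expansion of the call used above (just a source-level rewrite, nothing is evaluated differently):

(-> (sql/connection)
    (.getMetaData)
    (.getTables nil nil nil nil))
;; expands to:
(.getTables (.getMetaData (sql/connection)) nil nil nil nil)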

Finding the column names of a table is nearly the same:
(defn get-column-meta
  "Given a table name, return a collection of the field names"
  [t f]
  (sql/with-connection db-params
    (json-str (map f
                   (resultset-seq
                    (-> (sql/connection)
                        (.getMetaData)
                        (.getColumns nil nil t nil)))))))

(defn column-name
  "Given a mapstruct from the table-col-names, give back the column name"
  [c]
  (:column_name c))

The main function takes the table name and a filter function (only column-name is used at the moment), and returns a JSON array of column names. The filter function could instead return a structmap which includes the type as well, and then the JSON would be an array of objects.

The final chunk is the function to return an entire table as an array of json objects:
(defn get-table
  "Given a table name, return a collection of structmaps for all the data in the table"
  [t]
  (json-str (sql/with-connection db-params
              (sql/with-query-results res
                [(str "SELECT * FROM " (sanitize-sql t))]
                (into [] res)))))

You can't parametrize the table name in an SQL statement, so concatenating the user-given table name onto the select is a rather unsafe operation. The sanitize-sql is used, but it's not exactly industrial-strength security.
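For what it's worth, the character stripping just mangles an injection attempt rather than rejecting it:

(sanitize-sql "users; DROP TABLE users")
;=> "usersDROPTABLEusers"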

That's all there was to it. I'm still not up to speed on Clojure idioms so I still have to look into certain things to see how they work. But putting it together was relatively painless considering I haven't written Java for several years and I'm still new to Clojure. I'm still pleased with Compojure, the goal of the program was quite simple so I hoped the code would be equally simple and I was not disappointed. The code is short and fairly clear, without too many magic chunks.

Next time, a basic UI and probably inserting rows.

Full code:
(ns db-edit
  (:require [clojure.contrib.sql :as sql])
  (:use clojure.contrib.json.write)
  (:use compojure))

; Information about the SQLite database
(def db-params {:classname "org.sqlite.JDBC"
                :subprotocol "sqlite"
                :subname "./database.sqlite"})

(def http-error-map
  {400 {:name "Bad Request"
        :long "The request could not be understood by the server due to malformed syntax."}
   401 {:name "Unauthorized"
        :long "The request requires user authentication."} ; Note WWW-Authenticate header should be added
   404 {:name "Not Found"
        :long "The server has not found anything matching the Request-URI."}})

(defn error [err]
  (if-let [errMap (get http-error-map err)]
    {:status err
     :headers {}
     :body (str "HTTP error " err ": " (get errMap :name) "\n" (get errMap :long))}
    {:status err
     :headers {}
     :body (str "HTTP Error " err " encountered")}))

(defn sanitize-sql
  "Cleans the string of nasty SQL related special chars"
  [s]
  (. s (replaceAll "[\\s'\";#%]" "")))

(defn cleanPath
  "Sanitize the incoming string for '..'"
  [p]
  (.. p (replaceAll "/\\.+" "/") (replaceAll "/+" "/")))

(defn get-table-meta
  "Demonstrate getting table info"
  [f]
  (sql/with-connection db-params
    (json-str (map f
                   (resultset-seq
                    (-> (sql/connection)
                        (.getMetaData)
                        (.getTables nil nil nil nil)))))))

(defn get-column-meta
  "Given a table name, return a collection of the field names"
  [t f]
  (sql/with-connection db-params
    (json-str (map f
                   (resultset-seq
                    (-> (sql/connection)
                        (.getMetaData)
                        (.getColumns nil nil t nil)))))))

(defn get-table
  "Given a table name, return a collection of structmaps for all the data in the table"
  [t]
  (json-str (sql/with-connection db-params
              (sql/with-query-results res
                [(str "SELECT * FROM " (sanitize-sql t))]
                (into [] res)))))

(defn table-name
  "Given a mapstruct from the get-table-meta function, give back a string"
  [t]
  (:table_name t))

(defn column-name
  "Given a mapstruct from the table-col-names, give back the column name"
  [c]
  (:column_name c))

(defn insert-row [table params] (error 404))

(defroutes my-routes
  (POST "/insert/:table" (insert-row (params :table) params))
  (ANY "/columns/:table" (get-column-meta (params :table) column-name))
  (ANY "/table/:table" (get-table (params :table)))
  (ANY "/tables" (get-table-meta table-name))
  (ANY "/*" (or (serve-file "./static" (cleanPath (params :*))) :next))
  (ANY "/*" (error 400)))

(run-server {:port 8080}
            "/*" (servlet my-routes))


Update: The second part includes inserting data, decorators, and middleware.

2009-06-09

Basic Compojure Calculator

Compojure turned out to be relatively simple for making a basic URL-driven calculator. Go to a URL like http://localhost:8080/add/4/5 and you get a page that simply says, "9.0". The Compojure documentation isn't complete yet (the site is quite upfront about that, and honesty is always nice), but along with examples online, there's probably enough to make a real web app. The full code is at the end of the post, but first I'll try to explain the code section by section.

(use 'compojure)
Gives us the defroutes thing (function or macro? I still get them confused) that we'll need later.

(def op-mapping
  {"add" +,
   "subtract" -,
   "multiply" *,
   "divide" /})

Obviously this just makes a hash mapping the English name for an operator to the actual function. The English names are what are used in the URL.

(defn doop
  "Do the operation on the two values"
  [op ls rs]
  (if (contains? op-mapping op)
    (str ((get op-mapping op) ls rs))
    "Unrecognized operation"))

Looks up the operation in the map and returns an error message if it can't be found; otherwise it runs the function it maps to on the two given values. Nothing is Compojure-specific yet.
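A quick sanity check at the REPL, before any web plumbing:

(doop "add" 4 5)
;=> "9"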

(defroutes calc
  (GET "/:op/:ls/:rs" (doop (params :op) (. Float valueOf (params :ls)) (. Float valueOf (params :rs))))
  (ANY "/*" "Bad URL"))

This is the first chunk of Compojure specific code. All it does is define a route, the URL to what-do-you-want-to-do mapping. The first GET definition is a pretty neat way of specifying parameters. The pattern is given "/:op/:ls/:rs" and then the route parameters are available later with (params :op) and the like. Makes it easy to parse things out and you don't have to number everything.

(run-server {:port 8080}
            "/*" (servlet calc))

The last chunk of Compojure specific code. Also the last chunk of the web app. It just starts up the web server and points it to the route chains.
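With the server running, a quick check from another terminal looks like this (9.0 rather than 9, because the route converts the operands with Float valueOf):

$ wget -O - http://localhost:8080/add/4/5 2>/dev/null
9.0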

Every small chunk was simple, clear, and had a specific useful purpose. Overall, the web framework is looking good so far. I think I'll keep experimenting with it.

Now for the full code:
(use 'compojure)

;;; Available operations
(def op-mapping
  {"add" +,
   "subtract" -,
   "multiply" *,
   "divide" /})

(defn doop
  "Do the operation on the two values"
  [op ls rs]
  (if (contains? op-mapping op)
    (str ((get op-mapping op) ls rs))
    "Unrecognized operation"))

(defroutes calc
  (GET "/:op/:ls/:rs" (doop (params :op) (. Float valueOf (params :ls)) (. Float valueOf (params :rs))))
  (ANY "/*" "Bad URL"))

(run-server {:port 8080}
            "/*" (servlet calc))

2009-06-08

Clojure and Compojure

A while back I tried out Clojure. It was my first real foray into the land of Lisp-like languages, other than copying, pasting, and slightly modifying elisp chunks for Emacs, which doesn't really count. It was very... interesting. I was writing a small IM bot to try it out, but couldn't find any reliable Java libraries for protocols other than XMPP. Nevertheless, I did get a basic bot working with XMPP and learned some basic Clojure as well.

The language recently had its 1.0 release and I figure it's time to try it out again. The last time I tried it out, the docs were somewhat confusing, but the main problem was simply getting the libraries to interact with Clojure well and I ran out of time. The only major issue I had at the time was needing some small Java code to handle something... static variables or overriding private methods or something like that. This time, I'm going to try making a small web site using Clojure and Compojure.

Compojure looks like an interesting, fairly minimal web framework.

Step 1, download:
$ git clone git://github.com/weavejester/compojure.git
Initialized empty Git repository in compojure/.git/
remote: Counting objects: 3371, done.
remote: Compressing objects: 100% (1619/1619), done.
remote: Total 3371 (delta 1680), reused 2806 (delta 1335)
Receiving objects: 100% (3371/3371), 5.25 MiB | 55 KiB/s, done.
Resolving deltas: 100% (1680/1680), done.


Yay! Easy enough.

Step 2, get other packages it needs like Clojure and Jetty, and compile Compojure:
$ ant deps
Buildfile: build.xml

deps:
[get] Getting: http://cloud.github.com/downloads/weavejester/compojure/deps.zip
[get] To: compojure/deps.zip
[unzip] Expanding: compojure/deps.zip into compojure

BUILD SUCCESSFUL
Total time: 2 minutes 20 seconds
$ ls deps
clojure-contrib.jar commons-io-1.4.jar jetty-util-6.1.15.jar
clojure.jar grizzly-http-servlet-1.9.10.jar servlet-api-2.5-20081211.jar
commons-codec-1.3.jar grizzly-http-webserver-1.9.10.jar
commons-fileupload-1.2.1.jar jetty-6.1.15.jar
$ ant
....
BUILD SUCCESSFUL
Total time: 6 seconds


Step 3, Write basic helloworld and start the server:
helloworld$ cat helloworld.clj
(ns example-app
(:use compojure.http.servlet)
(:use compojure.server.jetty))

(defn hello-world [request]
{:status 200
:headers {}
:body "Hello World"})

(run-server {:port 8080}
"/*" (servlet hello-world))
helloworld$ /usr/lib/j2sdk/bin/java -cp $(for i in ../compojure/deps/*.jar; do echo -n $i: ; done)../compojure/compojure.jar clojure.lang.Script ./helloworld.clj
2009-06-08 22:43:51.799::INFO: Logging to STDERR via org.mortbay.log.StdErrLog
clojure.proxy.javax.servlet.http.HttpServlet
2009-06-08 22:43:51.859::INFO: jetty-6.1.15
2009-06-08 22:43:51.952::INFO: Started SocketConnector@0.0.0.0:8080

And in another terminal:

$ wget -O - http://localhost:8080/ 2>/dev/null
Hello World


Cool! From nothing to the simplest example in half an hour (including figuring out how to run the clj file from the command line). Next time, hopefully I'll have managed a more complicated web app.

2008-06-07

Prime Number Profiling

I've been trying to write an efficient prime number generator. So far the best I've been able to do is about 2.3 seconds for the prime numbers less than one million. Using a normal sieve, it is possible for a C++ program to generate the primes in less than 0.05 seconds. So my goal is to get it to 0.5 seconds without having to resort to funky unsafe operations or basically writing the C solution in Haskell.

Here is the current code:
import Data.List
import Prime.Queue

type Prime = Int
type PrimeCache = ([Prime], Queue Prime)

data Primes = Primes !(Prime, PrimeCache) Primes

primeList :: [Prime]
primeList = 2:3:(nextPrimes (3, ([2], enq newQueue 3)))
  where
    p2l (Primes (p, _) next) = p:(p2l next)

nextPrimes :: (Prime, PrimeCache) -> [Prime]
nextPrimes (p, cache) = newP:(nextPrimes h)
  where
    h@(newP, _) = sweep (fst cache, enq (snd cache) p) (p+2)

sweep c n | not $ any ((== 0).(mod n)) (fst shortList) = (n, shortList)
          | otherwise = sweep shortList (n+2) -- once odd, always odd
  where
    shortList = extendedList (limit n) c

extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l@(x:xs), q) | x >= limit || empty q
                                   = (l, q)
                                 | otherwise
                                   = extendedList limit (min:l, newSet)
  where
    (min, newSet) = deq q

limit n = floor $ sqrt (fromIntegral n :: Float)


Yes, it's not pretty. All it does is keep a list of the prime numbers already found and check whether any of them is a factor of the candidate.

It implements two basic optimizations:

  • After 2, there are no more even primes, so starting at 3 it adds 2 to the next number to test for primeness.
  • It doesn't have to check all previously found prime numbers, only the primes up to the square root of the candidate.

It uses the handy Queue code from Eric Kidd. So before any optimization is really done other than the simplest algorithmic optimizations, we'll benchmark it.
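The Queue module itself isn't shown in this post; as a stand-in for anyone following along, a minimal two-list functional queue with the four operations the code uses would look something like this (my own sketch, not Eric Kidd's actual module):

module Prime.Queue (Queue, newQueue, enq, deq, empty) where

-- enqueue onto the back list; dequeue from the front list,
-- reversing the back list whenever the front runs dry
data Queue a = Queue [a] [a]

newQueue :: Queue a
newQueue = Queue [] []

empty :: Queue a -> Bool
empty (Queue front back) = null front && null back

enq :: Queue a -> a -> Queue a
enq (Queue front back) x = Queue front (x:back)

deq :: Queue a -> (a, Queue a)
deq (Queue [] [])      = error "deq: empty queue"
deq (Queue [] back)    = deq (Queue (reverse back) [])
deq (Queue (x:f) back) = (x, Queue f back)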

The profiling has:
 Sun Jun  8 00:14 2008 Time and Allocation Profiling Report  (Final)

10 +RTS -P -s10.stats -RTS

total time = 3.80 secs (190 ticks @ 20 ms)
total alloc = 532,086,816 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc ticks bytes

sweep Prime.PrimeList3 82.6 77.3 157 102850044
limit Prime.PrimeList3 15.8 18.4 30 24537068
nextPrimes Prime.PrimeList3 0.5 1.1 1 1491443
main Main 0.5 1.4 1 1851678
extendedList Prime.PrimeList3 0.0 1.1 0 1502535


individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc ticks bytes

MAIN MAIN 1 0 0.0 0.0 100.0 100.0 0 152
main Main 174 1 0.0 0.0 0.0 0.0 0 8792
CAF Main 168 3 0.0 0.0 0.5 1.4 0 24
main Main 175 0 0.5 1.4 0.5 1.4 1 7397920
CAF GHC.Num 156 1 0.0 0.0 0.0 0.0 0 104
CAF GHC.Handle 120 4 0.0 0.0 0.0 0.0 0 8620
CAF Prime.PrimeList3 95 4 0.0 0.0 99.5 98.6 0 32
primeList Prime.PrimeList3 176 1 0.0 0.0 99.5 98.6 0 12
enq Prime.Queue 183 1 0.0 0.0 0.0 0.0 0 24
nextPrimes Prime.PrimeList3 177 156994 0.5 1.1 99.5 98.6 1 5965772
sweep Prime.PrimeList3 179 500000 82.6 77.3 98.4 96.9 157 411400176
extendedList Prime.PrimeList3 181 500169 0.0 1.1 0.0 1.1 0 6010140
deq Prime.Queue 184 169 0.0 0.0 0.0 0.0 0 13856
empty Prime.Queue 182 169 0.0 0.0 0.0 0.0 0 0
limit Prime.PrimeList3 180 500000 15.8 18.4 15.8 18.4 30 98148272
enq Prime.Queue 178 78207 0.5 0.6 0.5 0.6 1 3132920


Not surprisingly, the function that does a square root and a type conversion is rather expensive in both time and space (15.8% of the time and 18.4% of the allocation). The biggest factor, though, is the sweep function. Nothing jumps out as a problem in sweep, but we can fix the limit problem by changing the comparison: instead of x > sqrt(y), we can test x*x > y:
extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l@(x:_), q) | (x*x) >= limit || empty q -- compare x*x against n rather than x against sqrt n
                                  = (l, q)
                                | otherwise
                                  = extendedList limit (min:l, newSet)
  where
    (min, newSet) = deq q

limit n = n


Now for the profile:

 Sun Jun  8 00:43 2008 Time and Allocation Profiling Report  (Final)

10 +RTS -P -s10.stats -RTS

total time = 3.36 secs (168 ticks @ 20 ms)
total alloc = 434,489,520 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc ticks bytes

sweep Prime.PrimeList3 98.8 94.8 166 102989282
main Main 0.6 1.7 1 1851678
nextPrimes Prime.PrimeList3 0.0 1.4 0 1491443
extendedList Prime.PrimeList3 0.0 1.4 0 1502535


individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc ticks bytes

MAIN MAIN 1 0 0.0 0.0 100.0 100.0 0 152
main Main 174 1 0.0 0.0 0.0 0.0 0 8792
CAF Main 168 3 0.0 0.0 0.6 1.7 0 24
main Main 175 0 0.6 1.7 0.6 1.7 1 7397920
CAF GHC.Num 156 1 0.0 0.0 0.0 0.0 0 104
CAF GHC.Handle 120 4 0.0 0.0 0.0 0.0 0 8620
CAF Prime.PrimeList3 95 4 0.0 0.0 99.4 98.3 0 32
primeList Prime.PrimeList3 176 1 0.0 0.0 99.4 98.3 0 12
enq Prime.Queue 182 1 0.0 0.0 0.0 0.0 0 24
nextPrimes Prime.PrimeList3 177 156994 0.0 1.4 99.4 98.3 0 5965772
sweep Prime.PrimeList3 179 500000 98.8 94.8 98.8 96.2 166 411957128
extendedList Prime.PrimeList3 180 500169 0.0 1.4 0.0 1.4 0 6010140
deq Prime.Queue 183 169 0.0 0.0 0.0 0.0 0 11432
empty Prime.Queue 181 169 0.0 0.0 0.0 0.0 0 0
enq Prime.Queue 178 78059 0.6 0.7 0.6 0.7 1 3129368


The execution time (without profiling) dropped to about 1.9 seconds, or about 82% of the original time.

Not bad, but now sweep is where all the time is being spent and all the memory is being allocated. The nastiest part of the function is the test for factors, so let's split it out:

sweep c n | hasNoFactor n (fst shortList) = (n, shortList)
          | otherwise = sweep shortList (n+2) -- once odd, always odd
  where
    shortList = extendedList (limit n) c

hasNoFactor n = not.(any ((== 0).(mod n)))


And running that, we find (as a diff):
-sweep                          Prime.PrimeList3      98.8   94.8    166 102989282
-main Main 0.6 1.7 1 1851678
+hasNoFactor Prime.PrimeList3 95.8 91.9 159 99832288
+sweep Prime.PrimeList3 2.4 2.9 4 3156994
+main Main 1.2 1.7 2 1851678


Most of the time and allocation is spent in the fairly simple hasNoFactor function. Here's a guess: most of the time the small primes are the factors, but the list is in reverse order. So put the list in the other order:

extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l, q) | (x*x) >= limit || empty q
                            = (l, q)
                          | otherwise
                            = extendedList limit (l++[min], newSet)
  where
    (min, newSet) = deq q
    x = last l


And we now have a running time of 0.9 seconds; the profile diff looks like:
<       total time  =        3.62 secs   (181 ticks @ 20 ms)
< total alloc = 438,489,520 bytes (excludes profiling overheads)
---
> total time = 1.44 secs (72 ticks @ 20 ms)
> total alloc = 154,277,996 bytes (excludes profiling overheads)
10,14c10,15
< hasNoFactor Prime.PrimeList3 96.7 91.1 175 99832288
< nextPrimes Prime.PrimeList3 2.2 1.4 4 1491443
< sweep Prime.PrimeList3 0.6 3.8 1 4156994
< main Main 0.6 1.7 1 1851678
< extendedList Prime.PrimeList3 0.0 1.4 0 1502535
---
> hasNoFactor Prime.PrimeList3 56.9 74.4 41 28678852
> extendedList Prime.PrimeList3 36.1 4.2 26 1603090
> main Main 4.2 4.8 3 1851678
> nextPrimes Prime.PrimeList3 2.8 3.9 2 1491443
> enq Prime.Queue 0.0 2.0 0 782348
> sweep Prime.PrimeList3 0.0 10.8 0 4156994


Figuring out whether a number has a factor is now only using about half the time, but it's still doing the bulk of the allocations. I'm not sure anything more can be done to speed it up, but on a guess, perhaps it's the partially applied and composed functions which are causing all the allocations. Let's remove them:

hasNoFactor n xs = not (any (\ x -> n `mod` x == 0) xs)


This does result in an improvement. The run time for the -O2 (no profiling) version comes in a little under 0.9 seconds, but the amount of allocation is way down.
<       total time  =        1.44 secs   (72 ticks @ 20 ms)
< total alloc = 154,277,996 bytes (excludes profiling overheads)
---
>       total time  =        1.18 secs   (59 ticks @ 20 ms)
> total alloc = 43,562,588 bytes (excludes profiling overheads)
10,15c10,15
< hasNoFactor Prime.PrimeList3 56.9 74.4 41 28678852
< extendedList Prime.PrimeList3 36.1 4.2 26 1603090
< main Main 4.2 4.8 3 1851678
< nextPrimes Prime.PrimeList3 2.8 3.9 2 1491443
< enq Prime.Queue 0.0 2.0 0 782348
< sweep Prime.PrimeList3 0.0 10.8 0 4156994
---
> hasNoFactor Prime.PrimeList3 71.2 9.2 42 1000000
> extendedList Prime.PrimeList3 27.1 14.7 16 1603090
> nextPrimes Prime.PrimeList3 1.7 13.7 1 1491443
> enq Prime.Queue 0.0 7.2 0 782348
> sweep Prime.PrimeList3 0.0 38.2 0 4156994
> main Main 0.0 17.0 0 1851678


I don't think we're going to squeeze any more speed out of the factoring part, but 27% of the time is spent extending the list. The likely culprit is the last function being run all the time. So let's keep track of the last number in the list, and while we're at it, store its square instead of the original, since the square is what the comparison uses:

import Data.List
import qualified Data.IntSet as IntSet
import Prime.Queue

type Prime = Int
type PrimeCache = (([Prime], Int), Queue Prime)

data Primes = Primes (Prime, PrimeCache) Primes

primeList :: [Prime]
primeList = 2:3:(nextPrimes (3, (([2], 4), enq newQueue 3)))
  where
    p2l (Primes (p, _) next) = p:(p2l next)

nextPrimes :: (Prime, PrimeCache) -> [Prime]
nextPrimes (p, cache) = newP:(nextPrimes h)
  where
    h@(newP, _) = sweep (fst cache, enq (snd cache) p) (p+2)

sweep c n | hasNoFactor n (fst $ fst shortList) = (n, shortList)
          | otherwise = sweep shortList (n+2) -- once odd, always odd
  where
    shortList = extendedList n c

hasNoFactor n xs = not (any (\ x -> n `mod` x == 0) xs)

extendedList :: Int -> PrimeCache -> PrimeCache
extendedList limit (l, q) | x >= limit || empty q
                            = (l, q)
                          | otherwise
                            = extendedList limit (((fst l)++[min], min*min), newSet)
  where
    (min, newSet) = deq q
    x = snd l


Amazingly, this change brought the running time down to 0.6 seconds. The profile is now:
 total time  =        1.10 secs   (55 ticks @ 20 ms)
total alloc = 164,277,996 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc ticks bytes

hasNoFactor Prime.PrimeList3 92.7 69.8 51 28678852
sweep Prime.PrimeList3 3.6 16.2 2 6656994
enq Prime.Queue 1.8 1.9 1 782348
nextPrimes Prime.PrimeList3 1.8 3.6 1 1491443
extendedList Prime.PrimeList3 0.0 3.9 0 1603090
main Main 0.0 4.5 0 1851678


individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc ticks bytes

MAIN MAIN 1 0 0.0 0.0 100.0 100.0 0 152
main Main 174 1 0.0 0.0 0.0 0.0 0 8792
CAF Main 168 3 0.0 0.0 0.0 4.5 0 24
main Main 175 0 0.0 4.5 0.0 4.5 0 7397920
CAF GHC.Num 156 1 0.0 0.0 0.0 0.0 0 104
CAF GHC.Handle 120 4 0.0 0.0 0.0 0.0 0 8620
CAF Prime.PrimeList3 95 4 0.0 0.0 100.0 95.5 0 32
primeList Prime.PrimeList3 176 1 0.0 0.0 100.0 95.5 0 12
enq Prime.Queue 182 1 0.0 0.0 0.0 0.0 0 24
nextPrimes Prime.PrimeList3 177 156994 1.8 3.6 100.0 95.5 1 5965772
sweep Prime.PrimeList3 179 500000 3.6 16.2 96.4 89.9 2 26627976
hasNoFactor Prime.PrimeList3 184 500000 92.7 69.8 92.7 69.8 51 114715408
extendedList Prime.PrimeList3 180 500169 0.0 3.9 0.0 3.9 0 6412360
deq Prime.Queue 183 169 0.0 0.0 0.0 0.0 0 11432
empty Prime.Queue 181 169 0.0 0.0 0.0 0.0 0 0
enq Prime.Queue 178 78059 1.8 1.9 1.8 1.9 1 3129368


At 0.6 seconds to find all prime numbers under 1E6, I'm happy to stop, especially since the C++ version cannot create an infinite list of prime numbers.
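The driver itself never appears in the post; a minimal main for these timings (the module name is taken from the profiles above, the rest is an assumption) would be something like:

import Prime.PrimeList3 (primeList)

-- count the primes below one million, forcing the whole list
main :: IO ()
main = print (length (takeWhile (< 1000000) primeList))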