2010-02-15

Parsing with Parsec

At work I deal with geospatial data a lot. One of the formats I often work with lot is called RMX. It's a text based format for describing geometry and attribution on geometry. If I want to do any complicated work with RMX using Haskell I need to parse it in Haskell. I've chosen to use Parsec as the parser to learn a bit more about using parser combinators, and I've chosen to parse RMX because it is simple, but its grammar is not entirely trivial.
Here's a sample of RMX: A|Data Format|ASCII|| A|Default Language|FRE|| A|Build Date|Fri Nov 7 16:15:36 2008|| S|-129.9804805|55.2963299|-129.9823406|55.3020698|541-|B2|ALASKA|541-|B1|UNITED STATES|| S|3.7562004|45.7552403|3.7562795|45.7555702|122+|BUILT-UP AREA|La Chamba|| S|3.7589094|45.7558103|3.7588263|45.7553744|122+|BUILT-UP AREA|La Chamba|| S|3.7607802|45.8210094|3.7604503|45.8206902|222+|BUILT-UP AREA|Noirétable|142+|INDU COMPLEX|Z.I. Rue De L'auvergne|| P|-91.376790|39.931760|130|D\RECREATION|\MADISON PARK\2434\\\\\:ID/37951151|| I won't go into detail about what it all means, I'm just going to describe the basic grammar. An RMX file is composed of line based records. Each record ends with ||<newline>. There are three kinds of records: annotation (start with A|), point (start with P|), and segment (start with S|).
  • Annotation records have two free form pipe delimited text fields.
  • Point records have a point (with the coordinate values pipe separated), followed by one or more udm triples.
  • Segment records have two points (with the coordinate values pipe separated), followed by one or more udm triples.
  • A UDM triple is just three pipe delimited fields which hold attribution information. The first element of a triple must have at least one character (actually it has to have 3-5 characters, but I don't care about that level of detail right now)
It's a pretty simple format, but not entirely trivial so it's a good starting point for learning to write a Parsec parser. First part, boring imports: import System.IO import Control.Monad import Text.ParserCombinators.Parsec import Text.ParserCombinators.Parsec.Error
Next part, declarations of what we want the results of parsing to look like: type UDMTriple = (String, String, String) data RMXRecord a = ARec String String | PRec (a, a) [UDMTriple] | SRec ((a, a), (a, a)) [UDMTriple] deriving Show We define a udm triple as a triple of strings, and an RMX record is exactly as described above.
Now, onto actual Parsec stuff:. The most basic terminal, a pipe character is used a lot so I named it: pipe = char '|'
Next, are two more simple definitions, fields are all pipe delimited. However, in case the input is bad we also want to allow new lines to terminate a field: field1 = many1 (noneOf "|\n\r") field = many (noneOf "|\n\r") The difference between field1 and field is that field1 returns a field of one or more characters, while field will return a field of zero or more characters.
The parser for the points is more involved: number = do sign <- option 1 ( do s <- oneOf "+-" return $ if s == '-' then (-1.0) else (1.0)) i <- many digit d <- try (char '.' >> try (many (digit))) return $ sign*(read (i++"."++d)) This was a PITA. There's no predefined parser for signed decimal numbers in Parsec so I had to write my own. Some search on the web located other similar solutions. I'm sure it's not the best way to do it, but it works. It first has to try to pull off the (optional) sign, then the part of the float before the decimal point, then the (optional) decimal followed by more (optional) digits. Then it reassembles the thing into a string and uses read to turn it into a number. Ugly.
Using the number parser is the point parser, which simply parses two pipe delimited numbers: point = do x <- number <?> "Floating point number" pipe y <- number <?> "Floating point number" return (x, y) Here we have the first use of <?>, which is used when the previous match fails and we want a meaningful error message.
Continuing from the bottom of the parser up, the next building block is a parser for triples: triples = do try (pipe >> notFollowedBy (char '|')) acc <- field1 <?> "Attribute code" pipe s1 <- field <?> "String 1" pipe s2 <- field <?> "String 2" return (acc, s1, s2) The notFollowedBy part is to disambiguate the case of the records ending with ||, which would otherwise be parsed as the start of a new (incomplete) triple with a zero length acc. The function returns a single UDMTriple.
The end of an RMX record is defined with: eor = try (string "||\n\r") <|> try (string "||\r\n") <|> string "||\n" <|> string "||\r" <?>"end of record" The try function tries the match, and if the match fails the characters it pulled off the stream are put back. This is important because it has to try to match against different length strings. This also has the first use of the <|> operator. <|> is basically an or operator. If the left side doesn't match then the right side is attempted.
The next chunk of code defines the parsers for the three different kinds of records: arec :: GenParser Char st (RMXRecord Double) arec = do char 'A' key <- (pipe >> field) value <- (pipe >> field) return $ ARec key value prec = do char 'P' p <- (pipe >> point) attrs <- (many1 triples) <?> "UDM triples" return $ PRec p attrs srec = do char 'S' p1 <- (pipe >> point) p2 <- (pipe >> point) attrs <- (many1 triples) <?> "UDM triples" return $ SRec (p1, p2) attrs They are simply stating the definition of the description at the top of the post, they are composed of points, and triples or fields. The functions each return a single RMXRecord.
The other part of parsing a single record is the top level record definition: A record can be any of the three kinds: line :: GenParser Char st (RMXRecord Double) line = arec <|> prec <|> srec
And the definition of what an rmx file is: rmxFile :: GenParser Char st [RMXRecord Double] rmxFile = endBy line ((try eor) <|> (string "||" >> eof >> return "")) The definition of an rmx file. It's a bunch of lines terminated by eor or ||eof. That definition handles the case of a file missing the newline terminator on the last record.
Finally, a helper function that takes care of opening the file, parsing it, and running a function on the records: type ParsedLine = (String, Either ParseError (RMXRecord Double)) withRMX :: String -> (ParsedLine -> IO a) -> IO [a] withRMX fn fun = withFile fn ReadMode (processFile <=< hGetContents) where processFile c = sequence $! map (\ x -> fun (x, unlist $ parseRMX fn x)) (lines c) unlist :: Either ParseError [RMXRecord Double] -> Either ParseError (RMXRecord Double) unlist (Left x) = Left x unlist (Right (y:ys)) = Right y parseRMX :: String -> String -> Either ParseError [RMXRecord Double] parseRMX fn rmx = parse rmxFile fn rmx The function splits the file by lines, and parses each line separately. This is mainly because (generally) Parsec has to load the entire input into memory and parse the whole thing before you get any output. This would not be good for a 5 GB input file. So the withRMX uses a lazy contents reader and reads line by line.
Another item to note is that parseRMX returns a list of RMXRecords, but the function used by withRMX gets only one RMXRecord at a time. withRMX splits the input into lines so it knows that each input line has exactly one output record. parseRMX cannot make that assumption since it could be given a string with new lines and multiple records in the string.
Another helper function makes it simpler to process only records that don't have errors: type GoodLine = (String, RMXRecord Double) good :: (GoodLine -> IO a) -> (ParsedLine -> IO a) good fun = (\ (l, e) -> either err (\ r -> fun (l, r)) e) where err e = error $ foldl (\ a b -> a++(messageString b)) "" (errorMessages e) The good function takes care of creating an error if there's a parse error (it will abort everything though), and if there's no error then it runs the given function on the records.
Here's a simple program that makes use of the parser to only output annotation records. main = do (from:rest) <- getArgs withRMX from (good onlyAnnotations) onlyAnnotations :: GoodLine -> IO () onlyAnnotations (line, ARec _ _) = putStrLn line onlyAnnotations _ = return () There are, of course, much simpler ways of obtaining all the annotation records since all annotation records start with 'A'. But the function could do anything, for example extracting segments and points inside a bounding box.
Here's the entire code of the parsing module: module Data.RMX (good, withRMX, parseRMX, UDMTriple, RMXRecord(ARec, PRec, SRec), ParsedLine, GoodLine) where import System.IO import Text.ParserCombinators.Parsec import Text.ParserCombinators.Parsec.Error import Control.Monad type UDMTriple = (String, String, String) data RMXRecord a = ARec String String | PRec (a, a) [UDMTriple] | SRec ((a, a), (a, a)) [UDMTriple] deriving Show type ParsedLine = (String, Either ParseError (RMXRecord Double)) type GoodLine = (String, RMXRecord Double) good :: (GoodLine -> IO a) -> (ParsedLine -> IO a) good fun = (\ (l, e) -> either err (\ r -> fun (l, r)) e) where err e = error $ foldl (\ a b -> a++(messageString b)) "" (errorMessages e) withRMX :: String -> (ParsedLine -> IO a) -> IO [a] withRMX fn fun = withFile fn ReadMode (processFile <=< hGetContents) where processFile c = sequence $! map (\ x -> fun (x, unlist $ parseRMX fn x)) (lines c) unlist :: Either ParseError [RMXRecord Double] -> Either ParseError (RMXRecord Double) unlist (Left x) = Left x unlist (Right (y:ys)) = Right y parseRMX :: String -> String -> Either ParseError [RMXRecord Double] parseRMX fn rmx = parse rmxFile fn rmx rmxFile :: GenParser Char st [RMXRecord Double] rmxFile = endBy line ((try eor) <|> (string "||" >> eof >> return "")) line :: GenParser Char st (RMXRecord Double) line = arec <|> prec <|> srec arec :: GenParser Char st (RMXRecord Double) arec = do char 'A' key <- (pipe >> field) value <- (pipe >> field) return $ ARec key value prec = do char 'P' p <- (pipe >> point) attrs <- (many1 triples) <?> "UDM triples" return $ PRec p attrs srec = do char 'S' p1 <- (pipe >> point) p2 <- (pipe >> point) attrs <- (many1 triples) <?> "UDM triples" return $ SRec (p1, p2) attrs point = do x <- number <?> "Floating point number" pipe y <- number <?> "Floating point number" return (x, y) triples = do try (pipe >> notFollowedBy (char '|')) acc <- field1 <?> "Attribute code" pipe s1 <- field <?> "String 1" pipe s2 <- field <?> "String 2" return (acc, s1, s2) field1 = many1 (noneOf "|\n\r") field = many (noneOf "|\n\r") pipe = char '|' number = do sign <- option 1 ( do s <- oneOf "+-" return $ if s == '-' then (-1.0) else (1.0)) i <- many digit d <- try (char '.' >> try (many (digit))) return $ sign*(read (i++"."++d)) eor = try (string "||\n\r") <|> try (string "||\r\n") <|> string "||\n" <|> string "||\r" <?> "end of record"

No comments: