Here's a sample of RMX:
A|Data Format|ASCII||
A|Default Language|FRE||
A|Build Date|Fri Nov 7 16:15:36 2008||
S|-129.9804805|55.2963299|-129.9823406|55.3020698|541-|B2|ALASKA|541-|B1|UNITED STATES||
S|3.7562004|45.7552403|3.7562795|45.7555702|122+|BUILT-UP AREA|La Chamba||
S|3.7589094|45.7558103|3.7588263|45.7553744|122+|BUILT-UP AREA|La Chamba||
S|3.7607802|45.8210094|3.7604503|45.8206902|222+|BUILT-UP AREA|Noirétable|142+|INDU COMPLEX|Z.I. Rue De L'auvergne||
P|-91.376790|39.931760|130|D\RECREATION|\MADISON PARK\2434\\\\\:ID/37951151||
I won't go into detail about what it all means, I'm just going to describe the basic grammar. An RMX file is composed of line based records. Each record ends with ||<newline>
. There are three kinds of records: annotation (start with A|
), point (start with P|
), and segment (start with S|
).
- Annotation records have two free form pipe delimited text fields.
- Point records have a point (with the coordinate values pipe separated), followed by one or more udm triples.
- Segment records have two points (with the coordinate values pipe separated), followed by one or more udm triples.
- A UDM triple is just three pipe delimited fields which hold attribution information. The first element of a triple must have at least one character (actually it has to have 3-5 characters, but I don't care about that level of detail right now)
import System.IO
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Error
Next part, declarations of what we want the results of parsing to look like:
type UDMTriple = (String, String, String)
data RMXRecord a = ARec String String
| PRec (a, a) [UDMTriple]
| SRec ((a, a), (a, a)) [UDMTriple]
deriving Show
We define a udm triple as a triple of strings, and an RMX record is exactly as described above. Now, onto actual Parsec stuff:. The most basic terminal, a pipe character is used a lot so I named it:
pipe = char '|'
Next, are two more simple definitions, fields are all pipe delimited. However, in case the input is bad we also want to allow new lines to terminate a field:
field1 = many1 (noneOf "|\n\r")
field = many (noneOf "|\n\r")
The difference between field1
and field
is that field1
returns a field of one or more characters, while field
will return a field of zero or more characters.The parser for the points is more involved:
number = do
sign <- option 1 ( do s <- oneOf "+-"
return $ if s == '-' then (-1.0) else (1.0))
i <- many digit
d <- try (char '.' >> try (many (digit)))
return $ sign*(read (i++"."++d))
This was a PITA. There's no predefined parser for signed decimal numbers in Parsec so I had to write my own. Some search on the web located other similar solutions. I'm sure it's not the best way to do it, but it works. It first has to try to pull off the (optional) sign, then the part of the float before the decimal point, then the (optional) decimal followed by more (optional) digits. Then it reassembles the thing into a string and uses read to turn it into a number. Ugly.Using the number parser is the point parser, which simply parses two pipe delimited numbers:
point = do
x <- number <?> "Floating point number"
pipe
y <- number <?> "Floating point number"
return (x, y)
Here we have the first use of <?>
, which is used when the previous match fails and we want a meaningful error message. Continuing from the bottom of the parser up, the next building block is a parser for triples:
triples = do
try (pipe >> notFollowedBy (char '|'))
acc <- field1 <?> "Attribute code"
pipe
s1 <- field <?> "String 1"
pipe
s2 <- field <?> "String 2"
return (acc, s1, s2)
The notFollowedBy part is to disambiguate the case of the records ending with ||
, which would otherwise be parsed as the start of a new (incomplete) triple with a zero length acc
. The function returns a single UDMTriple.The end of an RMX record is defined with:
eor = try (string "||\n\r")
<|> try (string "||\r\n")
<|> string "||\n"
<|> string "||\r"
<?>"end of record"
The try
function tries the match, and if the match fails the characters it pulled off the stream are put back. This is important because it has to try to match against different length strings. This also has the first use of the <|> operator. <|> is basically an or operator. If the left side doesn't match then the right side is attempted.The next chunk of code defines the parsers for the three different kinds of records:
arec :: GenParser Char st (RMXRecord Double)
arec = do
char 'A'
key <- (pipe >> field)
value <- (pipe >> field)
return $ ARec key value
prec = do
char 'P'
p <- (pipe >> point)
attrs <- (many1 triples) <?> "UDM triples"
return $ PRec p attrs
srec = do
char 'S'
p1 <- (pipe >> point)
p2 <- (pipe >> point)
attrs <- (many1 triples) <?> "UDM triples"
return $ SRec (p1, p2) attrs
They are simply stating the definition of the description at the top of the post, they are composed of points, and triples or fields. The functions each return a single RMXRecord.The other part of parsing a single record is the top level record definition: A record can be any of the three kinds:
line :: GenParser Char st (RMXRecord Double)
line = arec <|> prec <|> srec
And the definition of what an rmx file is:
rmxFile :: GenParser Char st [RMXRecord Double]
rmxFile = endBy line ((try eor) <|> (string "||" >> eof >> return ""))
The definition of an rmx file. It's a bunch of lines terminated by eor
or ||eof
. That definition handles the case of a file missing the newline terminator on the last record.Finally, a helper function that takes care of opening the file, parsing it, and running a function on the records:
type ParsedLine = (String, Either ParseError (RMXRecord Double))
withRMX :: String -> (ParsedLine -> IO a) -> IO [a]
withRMX fn fun = withFile fn ReadMode (processFile <=< hGetContents)
where
processFile c = sequence $! map (\ x -> fun (x, unlist $ parseRMX fn x)) (lines c)
unlist :: Either ParseError [RMXRecord Double] -> Either ParseError (RMXRecord Double)
unlist (Left x) = Left x
unlist (Right (y:ys)) = Right y
parseRMX :: String -> String -> Either ParseError [RMXRecord Double]
parseRMX fn rmx = parse rmxFile fn rmx
The function splits the file by lines, and parses each line separately. This is mainly because (generally) Parsec has to load the entire input into memory and parse the whole thing before you get any output. This would not be good for a 5 GB input file. So the withRMX
uses a lazy contents reader and reads line by line.Another item to note is that
parseRMX
returns a list of RMXRecord
s, but the function used by withRMX
gets only one RMXRecord
at a time. withRMX
splits the input into lines so it knows that each input line has exactly one output record. parseRMX
cannot make that assumption since it could be given a string with new lines and multiple records in the string.Another helper function makes it simpler to process only records that don't have errors:
type GoodLine = (String, RMXRecord Double)
good :: (GoodLine -> IO a) -> (ParsedLine -> IO a)
good fun = (\ (l, e) -> either err (\ r -> fun (l, r)) e)
where
err e = error $ foldl (\ a b -> a++(messageString b)) "" (errorMessages e)
The good
function takes care of creating an error if there's a parse error (it will abort everything though), and if there's no error then it runs the given function on the records.Here's a simple program that makes use of the parser to only output annotation records.
main = do
(from:rest) <- getArgs
withRMX from (good onlyAnnotations)
onlyAnnotations :: GoodLine -> IO ()
onlyAnnotations (line, ARec _ _) = putStrLn line
onlyAnnotations _ = return ()
There are, of course, much simpler ways of obtaining all the annotation records since all annotation records start with 'A'. But the function could do anything, for example extracting segments and points inside a bounding box.Here's the entire code of the parsing module:
module Data.RMX (good, withRMX, parseRMX, UDMTriple, RMXRecord(ARec, PRec, SRec), ParsedLine, GoodLine) where
import System.IO
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Error
import Control.Monad
type UDMTriple = (String, String, String)
data RMXRecord a = ARec String String
| PRec (a, a) [UDMTriple]
| SRec ((a, a), (a, a)) [UDMTriple]
deriving Show
type ParsedLine = (String, Either ParseError (RMXRecord Double))
type GoodLine = (String, RMXRecord Double)
good :: (GoodLine -> IO a) -> (ParsedLine -> IO a)
good fun = (\ (l, e) -> either err (\ r -> fun (l, r)) e)
where
err e = error $ foldl (\ a b -> a++(messageString b)) "" (errorMessages e)
withRMX :: String -> (ParsedLine -> IO a) -> IO [a]
withRMX fn fun = withFile fn ReadMode (processFile <=< hGetContents)
where
processFile c = sequence $! map (\ x -> fun (x, unlist $ parseRMX fn x)) (lines c)
unlist :: Either ParseError [RMXRecord Double] -> Either ParseError (RMXRecord Double)
unlist (Left x) = Left x
unlist (Right (y:ys)) = Right y
parseRMX :: String -> String -> Either ParseError [RMXRecord Double]
parseRMX fn rmx = parse rmxFile fn rmx
rmxFile :: GenParser Char st [RMXRecord Double]
rmxFile = endBy line ((try eor) <|> (string "||" >> eof >> return ""))
line :: GenParser Char st (RMXRecord Double)
line = arec <|> prec <|> srec
arec :: GenParser Char st (RMXRecord Double)
arec = do
char 'A'
key <- (pipe >> field)
value <- (pipe >> field)
return $ ARec key value
prec = do
char 'P'
p <- (pipe >> point)
attrs <- (many1 triples) <?> "UDM triples"
return $ PRec p attrs
srec = do
char 'S'
p1 <- (pipe >> point)
p2 <- (pipe >> point)
attrs <- (many1 triples) <?> "UDM triples"
return $ SRec (p1, p2) attrs
point = do
x <- number <?> "Floating point number"
pipe
y <- number <?> "Floating point number"
return (x, y)
triples = do
try (pipe >> notFollowedBy (char '|'))
acc <- field1 <?> "Attribute code"
pipe
s1 <- field <?> "String 1"
pipe
s2 <- field <?> "String 2"
return (acc, s1, s2)
field1 = many1 (noneOf "|\n\r")
field = many (noneOf "|\n\r")
pipe = char '|'
number = do
sign <- option 1 ( do s <- oneOf "+-"
return $ if s == '-' then (-1.0) else (1.0))
i <- many digit
d <- try (char '.' >> try (many (digit)))
return $ sign*(read (i++"."++d))
eor = try (string "||\n\r")
<|> try (string "||\r\n")
<|> string "||\n"
<|> string "||\r"
<?> "end of record"
No comments:
Post a Comment