\documentclass[letterpaper,twoside,11pt]{article} \usepackage{anysize} % lhs2TeX --poly src/ExprParse.lhs > ExprParse.ltx; pdflatex ExprParse.ltx \parindent0pt\parskip0.6ex % Better for literate code. %include polycode.fmt \title{CS 3EA3: Example Haskell Code for Recursive Descent Parsing} \author{Wolfram Kahl} \date{\today} \usepackage{hyperref} \begin{document} \maketitle The module |Data.Char| provides us with the functions |isDigit|, |isLetter|, and |ord :: Char -> Int|. \begin{code} module ExprParse where import Data.Char \end{code} An extremely simple expression datatype (abstract syntax): \begin{code} data Expr = Var String | Num Integer | Add Expr Expr | Mul Expr Expr \end{code} For the concrete syntax, we encode the different precedence levels into separate nonterminals to achieve a context-free grammar that allows recursive descent parsing: \begin{verbatim} Expr ::= Term + Expr | Term Term ::= Factor * Term | Factor Factor ::= Number | Identifier | ( Expr ) \end{verbatim} The precedence levels can easil be accommodated by the unparsing function% \footnote{Idiomatic Haskell would define the more efficient |showsPrecExpr :: Int -> Expr -> String -> String|. }%footnote ; whether parentheses are added is decided depending on the precedence argument, which stands for the precedence of the \emph{surrounding} constructor. Adding parentheses uses an auxiliary function |paren|: \begin{code} paren :: String -> String paren s = '(' : s ++ ")" \end{code} \begin{code} showPrecExpr :: Int -> Expr -> String showPrecExpr p (Var s) = s showPrecExpr p (Num k) = show k showPrecExpr p (Add e1 e2) = (if p > 4 then paren else id) (showPrecExpr 4 e1 ++ " + " ++ showPrecExpr 4 e2) showPrecExpr p (Mul e1 e2) = (if p > 5 then paren else id) (showPrecExpr 5 e1 ++ " * " ++ showPrecExpr 5 e2) \end{code} For installing this as standard |show| function for |Expr| values, we choose lowest precedence by default: \begin{code} instance Show Expr where show = showPrecExpr 0 \end{code} For recursive descent parsing see also \url{http://en.wikipedia.org/wiki/Recursive_descent}; for an extremely ``low-tech'' Haskell version we use a parser type that takes a |String| as argument, and returns a pair consisting of a parsed |Expr|, and the not-yet-parsed rest of the |String|: \begin{code} parseExpr, parseTerm, parseFactor :: String -> (Expr, String) \end{code} This is a very simple choice; it forces us to use Haskell run-time errors for reacting to parse errors. The grammar rules can now straight-forwardly be translated into Haskell, always continuing to work on the rest strings: \begin{code} parseExpr cs = let (t, cs') = parseTerm cs in case cs' of '+' : cs'' -> let (e, cs''') = parseExpr cs'' in (Add t e, cs''') _ -> (t, cs') \end{code} \begin{code} parseTerm cs = let (f, cs') = parseFactor cs in case cs' of '*' : cs'' -> let (t, cs''') = parseTerm cs'' in (Mul f t, cs''') _ -> (f, cs') \end{code} \begin{code} parseFactor ('(' : cs) = let (e, cs') = parseExpr cs in case cs' of ')' : cs'' -> (e, cs'') _ -> error "closing parenthesis expected" parseFactor (c : cs) | isLetter c = let (ident, cs') = parseIdentRest [c] cs in (Var ident, cs') | isDigit c = let (num, cs') = parseNumberRest (numFromDigit c) cs in (Num num, cs') parseFactor cs = error ("factor expected; " ++ show cs ++ " found") \end{code} Since we allow multi-letter dentifiers and multi-digit numbers, we use auxiliary parser to complete them after we recognised the first character: \begin{code} parseIdentRest :: String -> String -> (String, String) parseIdentRest s (c : cs) | isLetter c = parseIdentRest (s ++ [c]) cs parseIdentRest s cs = (s, cs) parseNumberRest :: Integer -> String -> (Integer, String) parseNumberRest n (c : cs) | isDigit c = parseNumberRest (10 * n + numFromDigit c) cs parseNumberRest n cs = (n, cs) numFromDigit :: Char -> Integer numFromDigit c = toInteger (ord c - ord '0') \end{code} \end{document}