Parsec Parser Testing with QuickCheck
July 29, 2007
I must admit I’ve liked the concept of QuickCheck ever since I first found it. I tend to write tests that are “too nice” to my code, often missing corner cases and errors caused by strange input. QuickCheck helps me avoid this by generating test data for me. Even in trivial projects it has helped me detect bugs I wouldn’t have found through unit tests written myself.
In this entry I begin testing the parser for HST – A Smalltalk interpreter developed in Haskell (the code developed so far for the parser can be found here and here). To do this I’m going to make use of QuickCheck. I couldn’t find much information on testing a Parsec based parser using QuickCheck, but I’ll have a go and see what happens.
The approach I’ve decided to try here is to have QuickCheck randomly generate an instance of the Abstract Syntax Tree (AST) I’ve developed. I’ll develop a pretty printer that can convert this to a
String. Parsing the
String should then result in the originally generated AST. I’m not entirely sure this will work, but at least it should give me a good starting point.
First I must develop a pretty printer for my AST type. For this I’m going to use the pretty printing library that is part of Haskell’s standard library. It appears in the package Text.PrettyPrint.HughesPJ. Writing the pretty printer requires me to write a method converting each type to the libraries
This turns out to be (mostly) quite straight forward for the limited AST that’s been implemented so far. The trickiest part is
StringLiteral. I need to deal with single quote characters, as they are the string delimiters in Smalltalk. To include a single quote character in a string literal the programmer uses two single quotes (e.g., The string “Here’s a quote” would be programmed as
'Here''s a quote'). To account for this a string escaping function is written.
identifier (Identifier i) = text i
literal (StringLiteral s) = text $
"\\'" ++ (concatMap stringEscape s) ++ "\\'"
stringEscape c | c == '\\'' = "''"
| otherwise = [c]
primary (PrimaryLiteral p) = literal p
primary (PrimaryIdentifier i) = identifier i
keyword (Keyword k) = text $ k ++ ":"
<+> function allows us to specify that two
Docs are ‘beside’ each other and separated by a space. This is useful for those syntax constructs that are combinations of others.
message (KeywordMessage k p) = keyword k <+> primary p
expression (BasicExpression p m) = primary p <+> message m
Testing in ghci:
> case (parse basicExpression "" "Transcript show: 'Hello World'") of
Right e -> expression e
Transcript show: 'Hello World'
So far, so good.
First up, I’m going to define the QuickCheck property that’s going to be tested. That will get the easy step out of the way.
checkExpressionParser ast =
case (parse basicExpression "" (render $ expression ast)) of
Left _ -> False
Right a -> ast == a
where types = ast :: BasicExpression
This results in a compilation error. This message is that there is no instance of
Expression. This was expected, as I have not defined how QuickCheck is to generate these types.
For those types that are composed of others, defining an instance of
Arbitrary is easy.
instance Arbitrary Expression where
arbitrary = liftM2 BasicExpression arbitrary arbitrary
instance Arbitrary Message where
arbitrary = liftM2 KeywordMessage arbitrary arbitrary
Primary is slightly more involved, as it could result in a
PrimaryIdentifier. QuickCheck provides the
oneof function which can be used for this.
instance Arbitrary Primary where
arbitrary = oneof
[ liftM PrimaryLiteral arbitrary
, liftM PrimaryIdentifier arbitrary
For keywords and identifiers I’ll create a common generator, as a keyword in a keyword message is simply an identifier followed by a colon. This generator is required to generate a letter, followed by a sequence of letters or digits. The definition of letter can be implementation defined, but for now I’m just going to worry about the minimums, which is the lowercase and uppercase alphabet along with underscore.
I begin by defining lists that can be passed to QuickCheck’s frequency list for letters or digits. This is followed by the generators for letters and digits themselves. Finally these can be used to create an generator for identifiers. After this the instances of
Keyword become trivial.
[ (26, choose ('a', 'z'))
, (26, choose ('A', 'Z'))
, (1, elements ['_'])
digitFreqList = [ (10, choose ('0', '9')) ]
letter = frequency alphaFreqList
letterOrDigit = frequency $ alphaFreqList ++ digitFreqList
identifierGenerator = liftM2 (:) letter $
sized (\\n -> replicateM n letterOrDigit)
instance Arbitrary Identifier where
arbitrary = liftM2 Identifier identifierGenerator
instance Arbitrary Keyword where
arbitrary = liftM2 Keyword identifierGenerator
StringLiteral we allow any ASCII character and also zero length strings.
instance Arbitrary StringLiteral where
arbitrary = liftM StringLiteral $
sized (\\n -> replicateM n $ choose ('\\0', '\\255'))
Running the Tests
I’ve defined a method that runs the check. I will run this from ghci
runChecks = quickCheck checkExpressionParser
Running this produces the following output (note that this may be different on each run, as the test data is generated randomly):
Falsifiable, after 0 tests:
BasicExpression (PrimaryIdentifier (Identifier "fkY"))
(KeywordMessage (Keyword "y4d")
(PrimaryIdentifier (Identifier "U0_")))
This corresponds to the Smalltalk code "
fkY y4d: U0_". Getting a failure here is a relief, as I knew the parser wasn't complete yet. This test is failing because of the digits in the keyword and identifier. I can add in this feature with the following parser code:
stLetter = choice [letter, char '_', digit]
identifierString = many1 stLetter
keyword = Keyword `liftM` (identifierString >>~ char ':')
identifier = Identifier `liftM` identifierString
Now, the parser is able to make it through 39 tests before QuickCheck reports and error. This time, it is with escaped quote characters in string literals. Again this is something left out of the original parser, so it is good that QuickCheck finds this. Modifying the parser again:
stringLiteral = StringLiteral `liftM` (quote >> stString >>~ quote)
quote = char '\\''
stString = many $
choice [noneOf ['\\''], try $ string "''" >> return '\\'']
Now, all QuickCheck tests pass.
It is worth noting that this testing currently has short comings. The two most notable are:
- No test for ensuring that a syntactially invalid program fails parsing,
- Whitespace variations are not tested (e.g., tabs, line breaks etc.).
I'm not sure of how these can be tested using QuickCheck properties yet.
I must admit, I was expecting this step of developing HST to take longer and be more difficult. I am impressed with how quickly and easily Parsec, QuickCheck, and Text.PrettyPrint.HughesPJ have allowed me to develop this. Although the testing shown here does have shortcomings I'm very pleased with the testing that is in place and I think I have a good base from which to continue developing.