2007-04-28
Writing a parser: ADL parser node types
In this post, I'll show you the different types of parser nodes. If you remember from one of the first posts I made in this series, the parser analyzes the tokens it gets from the tokenizer and turns it into structured trees. The nodes in this tree represent the different constructs (additions, multiplications, comparisons, ...).
- ParserNode
- Statement
- Assignment:
x := ... - IfStatement:
if ... then ... else ... end if - WhileStatement:
while ... do ... end while - ForStatement:
for i := ... to ... by ... do ... end for - FunctionCallStatement:
f(...)
- Assignment:
- Expression
- UnaryExpression
- Negation:
-x - NotExpression:
not x
- Negation:
- BinaryExpression
- Addition:
x + y - Subtraction:
x - y - Multiplication:
x * y - Division:
x / y - AndExpression:
x and y - OrExpression:
x or y - Comparison:
x < yorx > yorx == yor ...
- Addition:
- Variable:
x - IntegerConstant:
123 - StringConstant:
"abc" - FunctionCall:
f(...)
- UnaryExpression
- Statement
A statement is an instruction that does not return a value. An expression is an instruction that does return a value. A unary expression has an operator and a single child expression. A binary expression has an operator and 2 child expressions.
You might have noticed that I defined both a FunctionCall and a FunctionCallStatement. The difference is that a FunctionCall is an expression that returns the value of the function call, and a FunctionCallStatement is a statement that ignores the value of the function call.
The basic types look like this:public abstract class ParserNode { } public abstract class Statement : ParserNode { } public abstract class Expression : ParserNode { }As you can see, they don't contain any code. Because I'm just explaining how parsers work, they don't have to contain code. You should add code if you use these classes to build an interpreter or a compiler. For an interpreter, the Statement class could have a method Execute to execute the statement, and the Expression class could have a method Evaluate to evaluate the expression and return the result.
There's no point in showing the source code in this blog post for all the classes, so you can download them here. Just create a new folder ParserNodes in the TC.Adl project and include the downloaded C# files.
Next time, we'll write the actual parser and a test application.
2007-04-25
Writing a parser: ADL Tokenizer correction
char.IsWhiteSpace() to test for white-space. The reason I didn't, was ignorance: I had totally forgot about that method. So it's probably a good idea to change the method SkipWhitespace to this: void SkipWhitespace() { while (char.IsWhiteSpace(fCurrentChar)) ReadNextChar(); }
2007-04-24
Writing a parser: ADL Tokenizer
We'll now start writing the tokenizer.
One thing I haven't explained yet is how we will deal with white-space. The easiest solution is to just skip the white-space between tokens, and that's what we'll do. Usually white-space is defined as one of the following characters: space (' '), tab ('\t'), new-line ('\n') or carriage-return ('\r'). But I've noticed that any character below space (' ') is never used as text. So we can just consider any character <= ' ' to be white-space, which makes it easier to test a character.
Because we want our tokenizer to read characters from strings as well as streams, we'll use a TextReader. TextReader has 2 derived classes: StringReader (to read text from a string) and StreamReader (to read text from a stream). The only method in TextReader we will use is Read(), which returns the next character or -1 if the end of the stream or string has been reached.
To extract a token, we read the first character and decide what type of token it will be. For each token, we will have to store the characters it contains. We'll use a StringBuilder for this.
using System; using System.Collections.Generic; using System.Text; using System.IO; namespace TC.Adl { public class Tokenizer { TextReader fSource; char fCurrentChar; StringBuilder fTokenValueBuffer; } }Our Tokenizer class has only 3 fields:
fSource- The TextReader to read characters from.
fCurrentChar- The current character (most recently read).
fTokenValueBuffer- The StringBuilder to store the characters of the current token.
fSource, initialize fTokenValueBuffer and read the first character: public Tokenizer(TextReader source) { if (source == null) throw new ArgumentNullException("source"); fSource = source; fTokenValueBuffer = new StringBuilder(); ReadNextChar(); }
Now we'll add some private helper methods.
Reading a character is simple. Just callTextReader.Read(), which returns an int. Because a char cannot be -1 (which is returned by TextReader.Read() at the end of the stream), we will use character '\0' as the end-of-stream character. void ReadNextChar() { int lChar = fSource.Read(); if (lChar > 0) fCurrentChar = (char)lChar; else fCurrentChar = '\0'; }Skipping white-space is also easy: just keep reading characters until we get a character that is not white-space:
void SkipWhitespace() { while (fCurrentChar > '\0' && fCurrentChar <= ' ') ReadNextChar(); }To determine if we're at the end of the source, we just have to check the current character for
'\0': bool AtEndOfSource { get { return fCurrentChar == '\0'; } }We'll also write a simple method to store the current character and read the next:
void StoreCurrentCharAndReadNext()
{
fTokenValueBuffer.Append(fCurrentChar);
ReadNextChar();
}And a method to extract the stored characters and clear the buffer: string ExtractStoredChars() { string lValue = fTokenValueBuffer.ToString(); fTokenValueBuffer.Length = 0; return lValue; }Because the source code can have errors, we'll need some methods that throw exceptions:
void CheckForUnexpectedEndOfSource() { if (AtEndOfSource) throw new ParserException("Unexpected end of source."); } void ThrowInvalidCharException() { if (fTokenValueBuffer.Length == 0) throw new ParserException("Invalid character '" + fCurrentChar.ToString() + "'."); else { throw new ParserException("Invalid character '" + fCurrentChar.ToString() + "' after '" + fTokenValueBuffer.ToString() + "'."); } }Which introduces our ParserException class (a standard Exception class):
using System; using System.Collections.Generic; using System.Text; using System.Runtime.Serialization; namespace TC.Adl { [Serializable] public class ParserException : Exception { public ParserException() { } public ParserException(string message) : base(message) { } public ParserException(string message, Exception innerException) : base(message, innerException) { } protected ParserException(SerializationInfo info, StreamingContext context) : base(info, context) { } } }Now that we've written the private helper methods, we can write the only public method: the
ReadNextToken method. This method reads a token and returns it. First we skip the initial white-space. If we've reached the end of the source, we'll return null, else we'll check the first character to determine the type of token:
- If the token starts with a letter, it's a word.
- If the token starts with a digit, it's an integer constant.
- If the token starts with a quote, it's a string constant.
- If it's any other character, we assume it's a symbol.
public Token ReadNextToken() { SkipWhitespace(); if (AtEndOfSource) return null; if (char.IsLetter(fCurrentChar)) return ReadWord(); if (char.IsDigit(fCurrentChar)) return ReadIntegerConstant(); if (fCurrentChar == '"') return ReadStringConstant(); return ReadSymbol(); }A word starts with a letter (already tested in
ReadNextToken()) followed by zero or more letters or digits. So all we have to do is keep reading until we've reached a character that is not a letter or digit: Token ReadWord()
{
do
{
StoreCurrentCharAndReadNext();
}
while (char.IsLetterOrDigit(fCurrentChar));
return new Token(TokenType.Word, ExtractStoredChars());
}An integer constant just contains digits: Token ReadIntegerConstant()
{
do
{
StoreCurrentCharAndReadNext();
}
while (char.IsDigit(fCurrentChar));
return new Token(TokenType.Integer, ExtractStoredChars());
}A string constant contains a sequence of characters, enclosed in quotes. Because the quote character is used as a delimiter, the characters in between cannot be quotes. But all other characters are allowed. If the end of the source is reached before the closing quote, we throw an exception. We don't want the quotes to be included in the value of the token, so we'll skip them with ReadNextChar. Token ReadStringConstant()
{
ReadNextChar();
while (!AtEndOfSource && fCurrentChar != '"')
{
StoreCurrentCharAndReadNext();
}
CheckForUnexpectedEndOfSource();
ReadNextChar();
return new Token(TokenType.String, ExtractStoredChars());
}Reading a symbol is more complicated. A symbol can be one or two characters, so we'll have to treat each case individually: Token ReadSymbol()
{
switch (fCurrentChar)
{
// the symbols + - * / ( ) ,
case '+':
case '-':
case '*':
case '/':
case '(':
case ')':
case ',':
StoreCurrentCharAndReadNext();
return new Token(TokenType.Symbol, ExtractStoredChars());
// the symbols := ==
case ':':
case '=':
StoreCurrentCharAndReadNext();
if (fCurrentChar == '=')
{
StoreCurrentCharAndReadNext();
return new Token(TokenType.Symbol, ExtractStoredChars());
}
CheckForUnexpectedEndOfSource();
ThrowInvalidCharException();
break;
// the symbols < <> <=
case '<':
StoreCurrentCharAndReadNext();
if (fCurrentChar == '>' || fCurrentChar == '=')
{
StoreCurrentCharAndReadNext();
}
return new Token(TokenType.Symbol, ExtractStoredChars());
// the symbols > >=
case '>':
StoreCurrentCharAndReadNext();
if (fCurrentChar == '=')
{
StoreCurrentCharAndReadNext();
}
return new Token(TokenType.Symbol, ExtractStoredChars());
default:
CheckForUnexpectedEndOfSource();
ThrowInvalidCharException();
break;
}
return null;
}That's it, our Tokenizer class is done. To test it, we'll write a small application. Add a new Form to the project TC.Adl.Test, named TokenizerTest. It will have a multiline TextBox where we can type in source code (TextBoxSource), a Button to tokenize the source code (ButtonTokenize), and a ListBox that will display the tokens (ListBoxTokens). It should look like this (follow link to enlarge):
The Click event handler of the button creates a new Tokenizer, reads all the tokens, and adds them to the ListBox: private void ButtonTokenize_Click(object sender, EventArgs e) { ListBoxTokens.Items.Clear(); ListBoxTokens.BeginUpdate(); try { using (StringReader lSource = new StringReader(TextBoxSource.Text)) { Tokenizer lTokenizer = new Tokenizer(lSource); Token lToken = lTokenizer.ReadNextToken(); while (lToken != null) { ListBoxTokens.Items.Add(lToken.Type.ToString() + ":\t" + lToken.Value); lToken = lTokenizer.ReadNextToken(); } } } catch (ParserException lException) { MessageBox.Show(this, lException.Message, this.Text, MessageBoxButtons.OK, MessageBoxIcon.Error); } finally { ListBoxTokens.EndUpdate(); } }Now you can start the application, run the TokenizerTest and try out some code to see what tokens are generated. Try out this sample:
for i := 1 to 100 do if i < 50 then print(i, " < 50") else print(i, " >= 50") end if end forAs you can see, converting characters to tokens is not very complicated. You just have to get used to it.
2007-04-22
Writing a parser: ADL tokens
- Word
- A word starts with a letter, followed by zero or more letters or numbers.
- Examples:
x,abc,f2,else. - Examples:
- Integer
- An integer is a sequence of one or more digits.
- Examples:
1,42,3141592654. - Examples:
- String
- A string is a sequence of characters enclosed in quotes.
- Examples:
"x","abc","Quid pro quo.". - Examples:
- Symbol
- A symbol is one of the following sequences:
+ - * / ( ) , := == < > <> <= >=
TokenType: namespace TC.Adl { public enum TokenType { None = 0, Word, Integer, String, Symbol } }(The value
None is the default value and should not occur.)
A token has a type (of type TokenType) and a value (of type string). The value is the sequence of characters that represent the token.
using System; using System.Collections.Generic; using System.Text; namespace TC.Adl { public class Token { public Token(TokenType type, string value) { fType = type; fValue = value; } readonly TokenType fType; public TokenType Type { get { return fType; } } readonly string fValue; public string Value { get { return fValue; } } public bool Equals(TokenType type, string value) { return fType == type && fValue == value; } } }
You may have noticed that there are no comments or argument validation code. This is just to make the code simpler and easier to understand at first sight. The code I'm writing in Visual Studio is fully commented and has all the necessary argument validation code. I'll release the entire library afterwards.
Next time, we'll write the tokenizer.
Writing a parser: base VS2005 solution
I've created an (almost) empty VS2005 solution that will be the base solution to build our parser from. It contains 2 projects: TC.Adl (the ADL Class Library that will contain the parser) and TC.Adl.Test (the WinForms application for testing the library).
The test application has a main form (FormMain) that displays a list of test cases to run. This list is automatically filled with all the classes in TC.Adl.Test that derive from System.Windows.Forms.Form. That way, we can add test cases without having to modify FormMain each time.
You can download it here.
2007-04-21
Writing a parser: introduction to ADL
This post will introduce ADL (Acronymic Demonstrational Language), a language I designed to demonstrate how to write a parser.
Statements
ADL has 5 different types of statements:- if-statement
- while-loop
- for-loop
- assignment
- function call
if expression then
statements
end ifAn if-statement can also have an else-clause: if expression then
statements
else
statements
end ifA while-loop looks like this: while expression do
statements
end whileA for-loop looks like this: for variable := expression to expression do
statements
end forA for-loop can also have a by-clause that defines the number to add to the variable on each iteration: for variable := expression to expression by expression do
statements
end forAn assignment looks like this: variable := expressionA function call looks like this:
function()It can also have one or more arguments separated by commas:
function(expression, expression, expression)
Expressions
Expressions in ADL consist of variables, constants, function calls, unary expressions, binary expressions and groups.
Variable names start with a letter, followed by zero or more letters or numbers.
ADL has 2 types of constants: strings and integer numbers. A string is a sequence of characters, enclosed between 2 quotes. An integer is a sequence of 1 or more digits.
Unary expressions are expressions with an operator that has only 1 operand:- positive operator: +expression
- negative operator: -expression
- logical NOT-operator: not expression
- logical AND-operator: expression and expression
- logical OR-operator: expression or expression
- comparison: expression comparator expression (comparator is one of the following: ==, <>, <, >, <=, >=)
- addition: expression + expression
- subtraction: expression - expression
- multiplication: expression * expression
- division: expression / expression
Precedence
When an expression is built from multiple operators, the expression is evaluated in the following order:- Grouping:
( ) - Unary expression:
+ - not - Multiplicative expression:
* / - Additive expression:
+ - - Comparison:
== <> < > <= >= - Logical OR:
or - Logical AND:
and
Writing a parser: basic terminology
Programming code (or any code that should be parsed) is usually plain text. It's just a sequence of characters. A tokenizer converts this sequence of characters into a sequence of tokens. A token is a group of characters that form a meaningful unit.
The parser analyzes this sequence of tokens and transforms it into a tree of operators (primitive functions) and operands (the arguments of the operators).
Writing a parser
Inspired by a thread about parsers on Channel 9, I've decided to start a series of blog posts on writing a parser. The parser I (and hopefully some of you) will write, will be able to parse a simple BASIC-like language with enough features to make it interesting, without being too complex.
The name of the language will be ADL, which stands for Acronymic Demonstrational Language. A pretty lame name but if you have a better name, please leave a comment here. I'll try to show you how easy it is to write a good parser using simple techniques (reading 1 character at a time, no state-machine) without a lot of dry theory. The first introductory post (after this one) will have some theory and some terminology you need to know, but after that the fun will start.
2007-04-08
Basic Notation, fourth draft
I've just written the fourth draft of the specification for Basic Notation 1.0. You can download it in XPS format (333 KB), or PDF format (300 KB).
I did not blog about the third draft, in which I added support for namespaces, because I wasn't sure if adding namespaces was a good idea. Namespaces add complexity and make it a bit harder for people to understand the specification. So, in the fourth draft, I've removed namespaces, but I've kept prefixes, because they're easier to understand, don't add much complexity and can be used instead of namespaces.



