2007-04-24
Writing a parser: ADL Tokenizer
We'll now start writing the tokenizer.
One thing I haven't explained yet is how we will deal with white-space. The easiest solution is to just skip the white-space between tokens, and that's what we'll do. Usually white-space is defined as one of the following characters: space (' '), tab ('\t'), new-line ('\n') or carriage-return ('\r'). But I've noticed that any character below space (' ') is never used as text. So we can just consider any character <= ' ' to be white-space, which makes it easier to test a character.
Because we want our tokenizer to read characters from strings as well as streams, we'll use a TextReader. TextReader has 2 derived classes: StringReader (to read text from a string) and StreamReader (to read text from a stream). The only method in TextReader we will use is Read(), which returns the next character or -1 if the end of the stream or string has been reached.
To extract a token, we read the first character and decide what type of token it will be. For each token, we will have to store the characters it contains. We'll use a StringBuilder for this.
using System; using System.Collections.Generic; using System.Text; using System.IO; namespace TC.Adl { public class Tokenizer { TextReader fSource; char fCurrentChar; StringBuilder fTokenValueBuffer; } }Our Tokenizer class has only 3 fields:
fSource- The TextReader to read characters from.
fCurrentChar- The current character (most recently read).
fTokenValueBuffer- The StringBuilder to store the characters of the current token.
fSource, initialize fTokenValueBuffer and read the first character: public Tokenizer(TextReader source) { if (source == null) throw new ArgumentNullException("source"); fSource = source; fTokenValueBuffer = new StringBuilder(); ReadNextChar(); }
Now we'll add some private helper methods.
Reading a character is simple. Just callTextReader.Read(), which returns an int. Because a char cannot be -1 (which is returned by TextReader.Read() at the end of the stream), we will use character '\0' as the end-of-stream character. void ReadNextChar() { int lChar = fSource.Read(); if (lChar > 0) fCurrentChar = (char)lChar; else fCurrentChar = '\0'; }Skipping white-space is also easy: just keep reading characters until we get a character that is not white-space:
void SkipWhitespace() { while (fCurrentChar > '\0' && fCurrentChar <= ' ') ReadNextChar(); }To determine if we're at the end of the source, we just have to check the current character for
'\0': bool AtEndOfSource { get { return fCurrentChar == '\0'; } }We'll also write a simple method to store the current character and read the next:
void StoreCurrentCharAndReadNext()
{
fTokenValueBuffer.Append(fCurrentChar);
ReadNextChar();
}And a method to extract the stored characters and clear the buffer: string ExtractStoredChars() { string lValue = fTokenValueBuffer.ToString(); fTokenValueBuffer.Length = 0; return lValue; }Because the source code can have errors, we'll need some methods that throw exceptions:
void CheckForUnexpectedEndOfSource() { if (AtEndOfSource) throw new ParserException("Unexpected end of source."); } void ThrowInvalidCharException() { if (fTokenValueBuffer.Length == 0) throw new ParserException("Invalid character '" + fCurrentChar.ToString() + "'."); else { throw new ParserException("Invalid character '" + fCurrentChar.ToString() + "' after '" + fTokenValueBuffer.ToString() + "'."); } }Which introduces our ParserException class (a standard Exception class):
using System; using System.Collections.Generic; using System.Text; using System.Runtime.Serialization; namespace TC.Adl { [Serializable] public class ParserException : Exception { public ParserException() { } public ParserException(string message) : base(message) { } public ParserException(string message, Exception innerException) : base(message, innerException) { } protected ParserException(SerializationInfo info, StreamingContext context) : base(info, context) { } } }Now that we've written the private helper methods, we can write the only public method: the
ReadNextToken method. This method reads a token and returns it. First we skip the initial white-space. If we've reached the end of the source, we'll return null, else we'll check the first character to determine the type of token:
- If the token starts with a letter, it's a word.
- If the token starts with a digit, it's an integer constant.
- If the token starts with a quote, it's a string constant.
- If it's any other character, we assume it's a symbol.
public Token ReadNextToken() { SkipWhitespace(); if (AtEndOfSource) return null; if (char.IsLetter(fCurrentChar)) return ReadWord(); if (char.IsDigit(fCurrentChar)) return ReadIntegerConstant(); if (fCurrentChar == '"') return ReadStringConstant(); return ReadSymbol(); }A word starts with a letter (already tested in
ReadNextToken()) followed by zero or more letters or digits. So all we have to do is keep reading until we've reached a character that is not a letter or digit: Token ReadWord()
{
do
{
StoreCurrentCharAndReadNext();
}
while (char.IsLetterOrDigit(fCurrentChar));
return new Token(TokenType.Word, ExtractStoredChars());
}An integer constant just contains digits: Token ReadIntegerConstant()
{
do
{
StoreCurrentCharAndReadNext();
}
while (char.IsDigit(fCurrentChar));
return new Token(TokenType.Integer, ExtractStoredChars());
}A string constant contains a sequence of characters, enclosed in quotes. Because the quote character is used as a delimiter, the characters in between cannot be quotes. But all other characters are allowed. If the end of the source is reached before the closing quote, we throw an exception. We don't want the quotes to be included in the value of the token, so we'll skip them with ReadNextChar. Token ReadStringConstant()
{
ReadNextChar();
while (!AtEndOfSource && fCurrentChar != '"')
{
StoreCurrentCharAndReadNext();
}
CheckForUnexpectedEndOfSource();
ReadNextChar();
return new Token(TokenType.String, ExtractStoredChars());
}Reading a symbol is more complicated. A symbol can be one or two characters, so we'll have to treat each case individually: Token ReadSymbol()
{
switch (fCurrentChar)
{
// the symbols + - * / ( ) ,
case '+':
case '-':
case '*':
case '/':
case '(':
case ')':
case ',':
StoreCurrentCharAndReadNext();
return new Token(TokenType.Symbol, ExtractStoredChars());
// the symbols := ==
case ':':
case '=':
StoreCurrentCharAndReadNext();
if (fCurrentChar == '=')
{
StoreCurrentCharAndReadNext();
return new Token(TokenType.Symbol, ExtractStoredChars());
}
CheckForUnexpectedEndOfSource();
ThrowInvalidCharException();
break;
// the symbols < <> <=
case '<':
StoreCurrentCharAndReadNext();
if (fCurrentChar == '>' || fCurrentChar == '=')
{
StoreCurrentCharAndReadNext();
}
return new Token(TokenType.Symbol, ExtractStoredChars());
// the symbols > >=
case '>':
StoreCurrentCharAndReadNext();
if (fCurrentChar == '=')
{
StoreCurrentCharAndReadNext();
}
return new Token(TokenType.Symbol, ExtractStoredChars());
default:
CheckForUnexpectedEndOfSource();
ThrowInvalidCharException();
break;
}
return null;
}That's it, our Tokenizer class is done. To test it, we'll write a small application. Add a new Form to the project TC.Adl.Test, named TokenizerTest. It will have a multiline TextBox where we can type in source code (TextBoxSource), a Button to tokenize the source code (ButtonTokenize), and a ListBox that will display the tokens (ListBoxTokens). It should look like this (follow link to enlarge):
The Click event handler of the button creates a new Tokenizer, reads all the tokens, and adds them to the ListBox: private void ButtonTokenize_Click(object sender, EventArgs e) { ListBoxTokens.Items.Clear(); ListBoxTokens.BeginUpdate(); try { using (StringReader lSource = new StringReader(TextBoxSource.Text)) { Tokenizer lTokenizer = new Tokenizer(lSource); Token lToken = lTokenizer.ReadNextToken(); while (lToken != null) { ListBoxTokens.Items.Add(lToken.Type.ToString() + ":\t" + lToken.Value); lToken = lTokenizer.ReadNextToken(); } } } catch (ParserException lException) { MessageBox.Show(this, lException.Message, this.Text, MessageBoxButtons.OK, MessageBoxIcon.Error); } finally { ListBoxTokens.EndUpdate(); } }Now you can start the application, run the TokenizerTest and try out some code to see what tokens are generated. Try out this sample:
for i := 1 to 100 do if i < 50 then print(i, " < 50") else print(i, " >= 50") end if end forAs you can see, converting characters to tokens is not very complicated. You just have to get used to it.
however... correct me if i'm wrong but the tokeniser doesn't treat negative integers aswell as it could do... the '-' will be in a separate operator token, would it not be better to group this as one integer token with the actual value being negative? If you get my drift that is..
I'm assuming you didn't support escaped character sequences in string symbols on purpose? Was that just to avoid over-complicating the example?
I have a question though. If I want the ADL language to have comments embedded in it, should the tokenizer ignore them, or the parser? I want to use the tokens in syntax coloring later, so I am thinking that I should tokenize them. What would you recommend?
Thank you
Other cases where comments should not be ignored is when inline documentation tags can be embedded in comments, like in C#.
Very interesting series.
I have a question about handling hex and octal numbers.
Would you consider it the job of the lexer to understand the '0x' and '0' prefixes and convert these tokens into an "integer modifier" type?
thanks,
You would just have to change the ReadIntegerConstant-method. If you want to store the value as integer in the Tokens, you would have to change the Token-class and use sub-classes per Token-type instead of the simple TokenType-enum I'm using.
"below space (' ') is never used as text" - can be changed as - "below ASCII char value of space (' ')" For a second I was confused when you mentioned below..
Links to this post:


