String to Expression Part 1: The Tokenizer
Recently I had to make a rich filtering API that had to handle all the typical query niceties:
ORs, equality, comparisons and functions. Naturally I did what any good developer would do: I went and found a library that did it all for me and moved on to more important things.
But it did make me pause for a bit and consider: how would I implement that myself? Would I be able to make a simple compiler to convert a string into a .NET expression tree? This series of posts is me answering that question.
Hopefully by the end of this we can convert something like:
"( 2 + 3 ) * 6 + 2" into an expression tree,
or, with only a configuration change, convert:
"birthdate > 1987-02-19 and Name = 'Henry'" into an expression tree.
The Tokenizer - A less pretentious Lexer
All experts agree the first step to a compiler is a Tokenizer (or, as the sophisticated call it, a Lexer). A Tokenizer has one simple goal: take in a string
and output a list of labeled symbols we'll call Tokens.
The Tokenizer has no idea what these symbols mean or how to use them; it simply does the messy string parsing so we only have to think about a list of Tokens.
Thankfully for us, almost all programming languages have a core library designed to make string parsing easier: for this task we will be using regular expressions.
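In .NET that means `System.Text.RegularExpressions`, and in particular named capture groups, which label whatever they match. A quick illustration (the group name here is just for demonstration):

```csharp
using System;
using System.Text.RegularExpressions;

// A named capture group lets us ask for the match by label.
var match = Regex.Match("42 apples", @"(?<NUMBER>\d+)");
Console.WriteLine(match.Groups["NUMBER"].Value); // 42
```

That ability to name a pattern and get the name back out of a match is all the machinery our tokenizer needs.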
Just so everyone is on the same page, here are the objects we will be dealing with:
TokenDefinition allows us to configure which parts of the string we are interested in; we give it a name and a regex.
Token is a single piece of the string; it has a link back to the definition that created it, and the value of the match.
Tokenizer is where all the work is done; given a bunch of TokenDefinitions and a string, it'll output a list of Tokens.
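As a sketch in code, the two data-holding objects might look something like this (record stand-ins with my guessed members; the repo's real classes carry more detail):

```csharp
// A named regex describing one kind of symbol. Ignore lets us match
// but discard noise such as whitespace.
public record TokenDefinition(string Name, string Regex, bool Ignore = false);

// One matched piece of the input, linked back to the definition
// that produced it.
public record Token(TokenDefinition Definition, string Value);
```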
So after we finish putting it together, we expect to configure a few TokenDefinitions, hand the Tokenizer a string, and get back a list of Tokens.
Filling in the gaps
With all the infrastructure in place, what we need to do is flesh out our
Tokenizer. Our handling of the token definitions is going to be very simple: take all the regexes, make them named captures, and throw them into one giant regular expression.
Now every match of that regular expression will be a single token.
We can work out which token definition it matched by looking at the capture’s name.
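Putting those two ideas together, a compact, self-contained sketch of the Tokenizer could look like this. The record types are minimal stand-ins for the TokenDefinition and Token objects described above, and the API shape is my guess, not the repo's verbatim code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Configure the symbols we care about, then tokenize.
var tokenizer = new Tokenizer(
    new TokenDefinition("NUMBER", @"\d+"),
    new TokenDefinition("PLUS", @"\+"),
    new TokenDefinition("WHITESPACE", @"\s+", Ignore: true));

foreach (var token in tokenizer.Tokenize("2 + 3"))
    Console.WriteLine($"{token.Definition.Name}: '{token.Value}'");
// NUMBER: '2'
// PLUS: '+'
// NUMBER: '3'

// Stand-ins for the TokenDefinition and Token objects described above,
// just enough for this snippet to run on its own.
public record TokenDefinition(string Name, string Regex, bool Ignore = false);
public record Token(TokenDefinition Definition, string Value);

public class Tokenizer
{
    private readonly TokenDefinition[] _definitions;
    private readonly Regex _regex;

    public Tokenizer(params TokenDefinition[] definitions)
    {
        _definitions = definitions;
        // Every definition becomes one named capture, all OR'd together
        // into a single giant pattern.
        _regex = new Regex(string.Join("|",
            definitions.Select(d => $"(?<{d.Name}>{d.Regex})")));
    }

    public IEnumerable<Token> Tokenize(string input)
    {
        foreach (Match match in _regex.Matches(input))
        {
            // The capture's name tells us which definition matched; if
            // several succeeded, the first definition wins.
            var definition = _definitions
                .First(d => match.Groups[d.Name].Success);

            // Ignored definitions (e.g. whitespace) are matched but dropped.
            if (!definition.Ignore)
                yield return new Token(definition, match.Value);
        }
    }
}
```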
You may have also noted we added an ignore flag on our token definitions so we can remove annoying things such as whitespace during tokenization. Ignored tokens are matched but never returned.
Also, if you follow the logic, you'll see that in the odd case where multiple TokenDefinitions match the same token, we just take the first TokenDefinition's match. This does mean that the order of your
TokenDefinitions matters, so you will need to define the more specific definitions before the broader ones.
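For example, if we had definitions for both `>` and `>=` (hypothetical group names, shown here with raw regexes), the combined pattern tries the alternatives left to right, so `>=` has to come first:

```csharp
using System;
using System.Text.RegularExpressions;

// Broad ">" listed first: it wins, and ">=" is never matched whole.
Console.WriteLine(Regex.Match(">=", @"(?<GREATER>>)|(?<GREATER_EQUAL>>=)").Value);
// >

// Specific ">=" listed first: the whole operator is matched.
Console.WriteLine(Regex.Match(">=", @"(?<GREATER_EQUAL>>=)|(?<GREATER>>)").Value);
// >=
```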
And with very minimal code we have a fully functional, and configurable, Tokenizer.
What is next
As awesome as our tokenizer is, all we have done is convert the problem of "converting a string into an expression" into the problem of "converting a list of tokens into an expression". It is a small step, but one in the right direction. The larger problem is for our
Parser to solve, which is what we are going to write in the next post.
The full code of the entire parser can be found at https://github.com/codecutout/StringToExpression
The code for the
Tokenizer is at https://github.com/codecutout/StringToExpression/tree/master/StringToExpression/Tokenizer. It may not look exactly like the excerpts in this post, because I omitted some of the error-checking code, but the idea is the same.