SQL Parsing with Python, Pt. I
Some time ago I was in search of some kind of SQL parser module for Python. As you can guess, my search wasn’t really successful. pyparsing needs a grammar, but I’m not really interested in writing a full-blown grammar for SQL with its various dialects. AFAICT Gadfly implements a customized SQL parser, but I was not able to figure out how to use it. And there’s sqlliterals, but as the name suggests it’s just for identifying literals within statements.
I expect such a module to do the following:
- It should be fast ;-)
- scalable – what the parser needs to know about a string containing SQL statements depends on what I want to do with it. If I just want to split that string into separate statements, I don’t need to know as much as when I want to find out which identifiers occur in it.
- non-validating – Parsing shouldn’t fail if the statement is syntactically incorrect as I want to use it for an SQL editor where the statements are (naturally) incorrect most of the time.
- some beautifying is a nice-to-have – We all know SQL generated by a script and copied from a logging output to a database front-end – it always looks ugly and is pretty unreadable.
The only thing I’ve found that comes close to my needs is Pygments (yes, the syntax highlighter). It does a really good job highlighting SQL, so it “understands” at least something. To be concrete, it has a rock-solid and fast lexer. There’s a FAQ entry on whether it’s possible to use Pygments for programming language processing. The answer is:
The Pygments lexing machinery is quite powerful and can be used to build lexers for basically all languages. However, parsing them is not possible, though some lexers go some steps in this direction in order to e.g. highlight function names differently.
Also, error reporting is not the scope of Pygments. It focuses on correctly highlighting syntactically valid documents, not finding and compensating errors.
There’s a nice distinction between lexers and parsers in this answer. The former does lexical and the latter syntactical analysis of a given stream of characters. For my needs this suggested a two-step process: first do the lexing using Pygments’ machinery (BTW, read “focuses on […] syntactically valid documents, not finding […] errors” as “non-validating”) and then add parsing on top of the token stream.
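To get a feeling for what the first step delivers, Pygments can already produce such a token-type/value stream out of the box (a minimal sketch using the stock SqlLexer, not my stripped-down tokenizer):

```python
from pygments.lexers import SqlLexer

# Step 1 only: Pygments emits a flat (token type, value) stream.
# Note the typo "frm" -- lexing still succeeds, i.e. it is non-validating.
for ttype, value in SqlLexer().get_tokens('select * frm foo'):
    print(ttype, repr(value))
```

The typo simply comes out as a Name token instead of a keyword; nothing fails, which is exactly the behavior I need for an editor.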
So I stripped the lexing mechanism for SQL out of Pygments to get rid of some overhead not needed for my purposes (e.g. the loading of extensions). The only changes to the Pygments tokenizer were to replace a huge regex for finding keywords with a dictionary-based lookup and to add a few new tokens to make life easier in the second processing stage. The performance improvement achieved by replacing the regex doesn’t play a significant role for Pygments itself, it just speeds up the lexing a bit.
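The idea behind that change can be sketched like this (a simplified illustration, not the actual Pygments or module code; the keyword table and token names are made up and heavily trimmed):

```python
# Hypothetical, heavily trimmed keyword table -- the real one is much larger.
KEYWORDS = {'SELECT', 'FROM', 'WHERE', 'INSERT', 'UPDATE', 'DELETE'}

def classify(word):
    # A single set/dict lookup replaces matching the word against a huge
    # alternation regex like (SELECT|FROM|WHERE|...).
    return ('Keyword' if word.upper() in KEYWORDS else 'Name', word)

print(classify('select'))  # prints ('Keyword', 'select')
print(classify('users'))   # prints ('Name', 'users')
```

The lexer still uses one generic word regex to find identifier-like strings; only the classification of each match is a constant-time lookup instead of a giant alternation.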
In the second step the very efficient token-type/value stream generated in step 1 is grouped together into nested lists of simple classes. Instantiating a bunch of classes costs some performance, but comes with the benefit of helpful methods for analyzing the now classified tokens. As the classes are pretty simple, most of the performance loss was recaptured by using __slots__.
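A token class along these lines might look as follows (a sketch of the idea, not the module’s actual class; the match helper is a made-up example of the “helpful methods” mentioned above):

```python
class Token:
    # __slots__ avoids a per-instance __dict__, which recovers most of the
    # overhead of wrapping every (token type, value) pair in an object.
    __slots__ = ('ttype', 'value')

    def __init__(self, ttype, value):
        self.ttype = ttype
        self.value = value

    def match(self, ttype, value):
        # Small helper for the grouping stage: case-insensitive comparison,
        # since SQL keywords may appear in any case.
        return self.ttype == ttype and self.value.upper() == value.upper()

t = Token('Keyword', 'select')
print(t.match('Keyword', 'SELECT'))  # prints True
```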
The grouping mechanism is again non-validating. It tries to find patterns like identifiers and their aliases or WHERE clauses and their conditions in the token stream, but it just leaves the tokens unchanged if it doesn’t find what it’s looking for. So it actually defines some sort of “grammar”, but a very unrestrictive one. That means the quality of answers to questions like “Is there a WHERE clause?” or “Which tables are affected?” heavily depends on the syntactical clearness of the given statement at the time it’s processed, and it’s up to the user or application using the module to interpret the answers correctly. As the module is meant as a helper during the development of SQL statements, it doesn’t insist on syntactical correctness the way the targeted database system would when the statement is executed.
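To illustrate what such unrestrictive grouping means, here is a hypothetical sketch that works on plain (type, value) tuples rather than the module’s real token classes: if a WHERE keyword is found, the following tokens are collected into a nested group; if not, the stream passes through untouched:

```python
def group_where(tokens):
    # Non-validating grouping: collect a WHERE clause into a nested list.
    # If no WHERE is present, the stream comes back unchanged.
    grouped, clause = [], None
    stop = {'GROUP', 'ORDER', 'HAVING', 'LIMIT'}  # keywords ending the clause
    for ttype, value in tokens:
        if clause is None and ttype == 'Keyword' and value.upper() == 'WHERE':
            clause = [(ttype, value)]
        elif clause is not None and ttype == 'Keyword' and value.upper() in stop:
            grouped.append(('Where', clause))
            clause = None
            grouped.append((ttype, value))
        else:
            (clause if clause is not None else grouped).append((ttype, value))
    if clause is not None:
        grouped.append(('Where', clause))
    return grouped

out = group_where([('Keyword', 'WHERE'), ('Name', 'x')])
print(out)  # prints [('Where', [('Keyword', 'WHERE'), ('Name', 'x')])]
```

Nothing here raises an error on malformed input; a half-typed statement simply yields less (or less useful) grouping.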
All in all, these are the three (simple) processing steps:
- Lex a given string using regular expressions and generate a token-type/value stream (in fact, this is the Pygments lexer).
- Instantiate a simple token class for each item in the stream to ease syntactical analysis.
- Group those tokens into nested lists using simple pattern-finding functions.
Depending on what should be done with the statements, only the first step is required. For example, splitting statements or simple formatting rules don’t require parsing.
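Splitting, for example, can be done on the flat token stream alone (a sketch under the simplifying assumption that every semicolon token ends a statement; strings and comments are assumed to have been lexed into single tokens already, so semicolons inside them don’t split):

```python
def split_statements(tokens):
    # Cut the flat (token type, value) stream at semicolons -- no parsing
    # needed, just the lexer output from step 1.
    stmt = []
    for ttype, value in tokens:
        stmt.append((ttype, value))
        if ttype == 'Punctuation' and value == ';':
            yield stmt
            stmt = []
    if stmt:  # trailing statement without a terminating semicolon
        yield stmt

stream = [('Keyword', 'SELECT'), ('Number', '1'), ('Punctuation', ';'),
          ('Keyword', 'SELECT'), ('Number', '2'), ('Punctuation', ';')]
print(len(list(split_statements(stream))))  # prints 2
```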
Pt. II of this blog post will cover the two remaining points: scalability and the beautifying goodie. And it will take a closer look at the module I came up with, including some performance stats. For the impatient: the source code of this module is available here.