
SQL Parsing with Python, Pt. II

March 29, 2009

After describing some basics in part I, let’s have a closer look at the actual Python module called sqlparse. First off, have a look at the project page for how to download and install the module in case you’re interested. But now, let’s have some fun…

The API of the module is pretty simple. It provides three functions at the module level: sqlparse.split(sql) splits sql into separate statements, sqlparse.parse(sql) parses sql and returns a tree-like structure, and sqlparse.format(sql, **kwds) returns a beautified version of sql according to kwds.

As mentioned in my previous post, the effort needed to build the return values depends on how much lexing and parsing work is required to find the result. For example, sqlparse.split() does the following (a small sketch follows the list):

  • Generate a token-type/value stream, basically with a copy of the Pygments lexer
  • Apply a filter (in terms of a Pygments stream filter) to find statements
  • Serialize the token stream back to unicode
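
To make this pipeline a bit more concrete, here’s a rough sketch of the idea (not sqlparse’s actual implementation): lex the input with Pygments’ SqlLexer and cut the resulting token stream at statement-ending semicolons. naive_split is a made-up name for this illustration.

from pygments.lexers import SqlLexer
from pygments.token import Punctuation

def naive_split(sql):
    statements = []
    current = []
    for ttype, value in SqlLexer().get_tokens(sql):
        current.append(value)
        # A ';' inside a string literal arrives as part of a String token,
        # so only real statement separators end up in this branch.
        if ttype is Punctuation and value == ';':
            statements.append(''.join(current).strip())
            current = []
    rest = ''.join(current).strip()
    if rest:
        statements.append(rest)
    return statements

print naive_split('select * from foo; select * from bar;')
# [u'select * from foo;', u'select * from bar;']

The real module wraps this kind of logic in a proper stream filter, but the principle is the same.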

sqlparse.parse() does all the grouping work, and sqlparse.format() runs through those groups, modifies them according to the given formatting rules and finally converts them back to unicode.

Here’s an example session in a Python shell (object addresses and longer output abbreviated):

>>> import sqlparse
>>> # Splitting statements:
>>> sql = 'select * from foo; select * from bar;'
>>> sqlparse.split(sql)
[u'select * from foo;', u'select * from bar;']
>>> # Formatting statements:
>>> sql = 'select * from foo where id in (select id from bar);'
>>> print sqlparse.format(sql, reindent=True, keyword_case='upper')
SELECT *
FROM foo
WHERE id IN
  (SELECT id
   FROM bar);

>>> # Parsing
>>> sql = 'select * from "someschema"."mytable" where id = 1'
>>> res = sqlparse.parse(sql)
>>> res
(<Statement 'select...' at 0x...>,)
>>> stmt = res[0]
>>> stmt.to_unicode()  # converting it back to unicode
u'select * from "someschema"."mytable" where id = 1'
>>> # This is how the internal representation looks:
>>> stmt.tokens
[<DML 'select' at 0x...>, <Whitespace ' ' at 0x...>, ...]

Now, how does the grouping work? Grouping is done with a set of simple functions. Each function searches for a simple pattern, and if it finds one, a new group is built. Let’s have a look at the function that finds WHERE clauses.

def group_where(tlist):
    [group_where(sgroup) for sgroup in tlist.get_sublists()
     if not isinstance(sgroup, Where)]
    idx = 0
    token = tlist.token_next_match(idx, T.Keyword, 'WHERE')
    stopwords = ('ORDER', 'GROUP', 'LIMIT', 'UNION')
    while token:
        tidx = tlist.token_index(token)
        end = tlist.token_next_match(tidx+1, T.Keyword, stopwords)
        if end is None:  # WHERE is at the end of the statement
            end = tlist.tokens[-1]
        else:
            end = tlist.tokens[tlist.token_index(end)-1]
        group = tlist.group_tokens(Where, tlist.tokens_between(token, end))
        idx = tlist.token_index(group)
        token = tlist.token_next_match(idx, T.Keyword, 'WHERE')

tlist is a list of tokens; possible subgroups are handled first (bottom-up approach). Then the function grabs the first occurring WHERE and looks for the next matching stop word. I’m pretty unsure if the stop words approach is right here, but at least it works for now… If it finds a stop word, a group using the class Where is created, the tokens between WHERE and the stop word are attached to it and – still within the while loop – the next WHERE keyword is used as the next starting point.

So why not use a grammar here? First, this piece of code is pretty simple and easy to maintain. But it can also handle grammatically incorrect statements more leniently; e.g. it’s no problem to add an if clause so that, when an unexpected token occurs, the function just jumps to the next occurrence of WHERE without changing anything or raising an exception. To achieve this the token classes provide helper functions to inspect the surroundings of an occurrence (in fact, just simple list operations). There’s no limitation on what a grouping function can do with the given token list, so you could even “guess” a proper group with some nasty algorithm.
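
To make that more concrete, here’s a hypothetical companion to group_where written in the same style: it groups ORDER BY clauses, but when the surroundings look unexpected (no BY after ORDER) it simply moves on to the next match instead of raising an exception. The OrderBy class and the token_next() helper (returning the next non-whitespace token) are assumptions on my part; the other helpers are the ones used by group_where above.

def group_orderby(tlist):
    # Handle possible subgroups first (bottom-up), just like group_where.
    [group_orderby(sgroup) for sgroup in tlist.get_sublists()
     if not isinstance(sgroup, OrderBy)]
    stopwords = ('LIMIT', 'UNION')
    token = tlist.token_next_match(0, T.Keyword, 'ORDER')
    while token:
        tidx = tlist.token_index(token)
        nxt = tlist.token_next(tidx)  # assumed helper: next non-whitespace token
        if nxt is None or nxt.ttype is not T.Keyword or nxt.value.upper() != 'BY':
            # Unexpected surroundings: leave everything untouched and just
            # look for the next ORDER keyword.
            token = tlist.token_next_match(tidx + 1, T.Keyword, 'ORDER')
            continue
        end = tlist.token_next_match(tidx + 1, T.Keyword, stopwords)
        if end is None:  # ORDER BY is at the end of the statement
            end = tlist.tokens[-1]
        else:
            end = tlist.tokens[tlist.token_index(end) - 1]
        group = tlist.group_tokens(OrderBy, tlist.tokens_between(token, end))
        token = tlist.token_next_match(tlist.token_index(group) + 1,
                                       T.Keyword, 'ORDER')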

The current problem with this approach is performance. Here are some numbers:

Size                    split()      parse()
100KB (20600 tokens)    0.3 secs.    1.8 secs.
1.9MB (412500 tokens)   5.53 secs.   37 secs.

Most of the performance is lost when the stream-oriented approach is given up in the parsing phase. The numbers are based on the first unrevised working version; I expect performance improvements in upcoming versions, especially in how token lists are handled behind the scenes. For real-life statements the parser behaves quite well. BTW, the Pygments lexer takes about 6 seconds (compared to 5.5 secs. for splitting) for the 1.9MB of SQL.
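
If you want to take similar measurements on your own SQL files, something along these lines does the job ('big.sql' is just a placeholder name):

import time
import sqlparse

sql = open('big.sql').read()

for func in (sqlparse.split, sqlparse.parse):
    start = time.time()
    func(sql)
    print '%s: %.2f secs.' % (func.__name__, time.time() - start)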

The non-validating approach is a disadvantage, too: you’ll never know whether a statement is valid. You can even parse Middle High German phrases and receive a result:

>>> sqlparse.parse('swer an rehte güete wendet sîn gemüete')[0].tokens
[<Name 'swer' at 0x...>, <Whitespace ' ' at 0x...>, ...]

It’s up to the user of this module to provide suitable input and to interpret the output. Furthermore, the parser doesn’t support every nifty edge of every SQL dialect. Currently it’s mostly ANSI SQL with some PostgreSQL-specific stuff. But it should be easy to implement further grouping functions to cover more SQL varieties.

The splitting feature is currently used by CrunchyFrog and does a pretty good job there. I assume that SQL splitting works stably and reliably in most cases. Beautifying and parsing are very new in the module and their full functionality needs to prove itself over time. Luckily the top-level API with its three functions is damn simple and keeps the doors open for significant changes behind the scenes if they’re needed.

The sources of the sqlparse module are currently hosted on github.com but may move to Google Code anytime soon. Refer to the project page on Google Code for downloads and how to access the sources.

In addition there’s a simple AppEngine application that exposes the formatting features as an online service.
