March 30, 2006

Parsing C++ Source Code: An overview of available tools

C++ is a popular programming language, yet compared to Java it has few good programming tools. My favorite example is refactoring support. Why is that?

C++ was not designed with tooling in mind, and it carries a lot of legacy from its ancestor, C. Technically speaking, C++ cannot be accurately specified by a context-free grammar alone. Informally, a context-free grammar lets you look at one part of a document and give that part a name (declaration, expression, statement) without looking at the other parts. An unambiguous grammar is one in which each part gets exactly one such name. In C++, the right name for a piece of code often depends on declarations elsewhere, for example on whether an identifier names a type or a variable.
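A classic illustration is the statement `a * b;`. The sketch below shows the same tokens being read in two different ways depending on what `a` was declared as. The `Counter` type and its `operator*` are made up for this example, only so that the expression reading has a visible effect.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical type and operator, used only to make the expression observable.
struct Counter { int hits = 0; };
inline Counter& operator*(Counter& lhs, Counter&) { ++lhs.hits; return lhs; }

// Case 1: `a` names an object, so `a * b;` is an expression statement.
inline int parsed_as_expression() {
    Counter a, b;
    a * b;              // calls operator*(a, b)
    return a.hits;
}

// Case 2: `a` names a type, so the very same tokens declare a pointer `b`.
inline bool parsed_as_declaration() {
    struct a {};
    a * b;              // declares b with type a*
    return std::is_pointer<decltype(b)>::value;
}

// Both readings are valid C++; only the earlier declaration of `a` decides.
inline void check_both_readings() {
    assert(parsed_as_expression() == 1);
    assert(parsed_as_declaration());
}
```

A parser that sees only `a * b;` cannot choose between the two readings. It has to know what `a` is, which means keeping track of every declaration seen so far.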

Because of this, writing a C++ parser requires a lot of hand-tinkering. Many parsers take the approach of accepting a superset of the language, including some source code that is not valid C++, and sorting it out later in a second step called the semantic pass. C++ also has many dialects, owing to its evolution and to the various compiler vendors, and these dialects differ from each other significantly.
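To make the two-pass idea concrete, here is a minimal sketch, not taken from any of the tools mentioned in this post. The syntactic pass records a statement of the form `lhs * rhs;` without deciding what it means, and a later semantic pass resolves it against the set of known type names. All names here (`AmbiguousStatement`, `Kind`, `resolve`) are invented for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>

enum class Kind { Declaration, Expression };

// Output of the syntactic pass: the statement `lhs * rhs;`, meaning undecided.
struct AmbiguousStatement {
    std::string lhs;
    std::string rhs;
};

// Semantic pass: if `lhs` is a known type name, the statement declares a
// pointer named `rhs`; otherwise it is a multiplication expression.
inline Kind resolve(const AmbiguousStatement& stmt,
                    const std::set<std::string>& type_names) {
    return type_names.count(stmt.lhs) != 0 ? Kind::Declaration
                                           : Kind::Expression;
}

inline void check_resolution() {
    const AmbiguousStatement stmt{"a", "b"};
    const std::set<std::string> with_type{"a"};
    const std::set<std::string> without_type;
    assert(resolve(stmt, with_type) == Kind::Declaration);
    assert(resolve(stmt, without_type) == Kind::Expression);
}
```

A real front end has to do far more than this (scopes, templates, namespaces), but the division of labour is the same: parse loosely first, decide later.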

If you can spend some money, there is a respected commercial product that parses C++ very accurately and also helps you build tools on top of it. It can be found at: http://www.edg.com/

For free options, there are basically three approaches. The first is to start from scratch and write a grammar (or use someone else's). There is a C++ grammar written by Edward D. Willink for his FOG. The accompanying thesis also contains a good account of the issues in the C++ grammar. The popular ANTLR parser generator also offers a C++ grammar, updated from an older PCCTS-based grammar.

The second option is to use a parser that generates some kind of intermediate representation which is easy to process programmatically. The noteworthy one here is Elsa, a C++ parser built with a special parser generator called Elkhound. It can be found at the Elkhound and Elsa site.

The last approach is to let the compiler do the job! A prominent example is the modified C++ front end for the LLVM project, which translates C++ code to bytecode and provides infrastructure for writing tools that process this bytecode.

Apart from these, there are parsers written for IDEs such as Eclipse, KDevelop, Anjuta, etc. These are usually heuristics-driven, meaning that they make educated guesses where needed to keep the analysis efficient.

One often-forgotten practical difficulty is the preprocessing of the source. The first two approaches usually take it for granted that the code has already been preprocessed.
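A small example shows why this matters. The macros below (hypothetical names, but a pattern that does show up in real code bases) open and close a namespace. Before preprocessing, the lines between them do not form anything a C++ grammar would recognize as a namespace, so a tool that works on the raw text has nothing sensible to parse.

```cpp
#include <cassert>

// Hypothetical macros that hide the braces of a namespace.
#define BEGIN_NAMESPACE(name) namespace name {
#define END_NAMESPACE }

BEGIN_NAMESPACE(demo)
inline int answer() { return 42; }
END_NAMESPACE

// Only after macro expansion does `demo::answer` exist as a qualified name.
inline void check_macros_expanded() {
    assert(demo::answer() == 42);
}
```

Running the preprocessor first fixes the parsing problem but creates another: the tool now sees the expanded code, not the text the programmer actually wrote, which makes source-to-source tasks such as refactoring much harder.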

These issues make parsing C++ code difficult and hinder the development of tools for C++ source code.
