OpenToken

current version: 4.0

The OpenToken package is a facility for performing token analysis and parsing within the Ada language. It is designed to provide all the functionality of a traditional lexical analyzer/parser generator, such as lex/yacc. But due to the magic of inheritance and runtime polymorphism it is implemented entirely in Ada as withed-in code. No precompilation step is required, and no messy tool-generated source code is created.

For example, here is an OpenToken LR specification of a typical arithmetic grammar, taken from Example 5.10, section 5.3, page 295 of the red dragon book:

Grammar specification:

L -> E EOF
E -> E + T
E -> T
T -> T * F
T -> F
F -> ( E )
F -> integer

Ada code:

Grammar : constant Production_List.Instance :=
 L <= E & EOF                      + Print_Value'Access and
 E <= E & Plus & T                 + Add_Integers       and
 E <= T                                                 and
 T <= T & Times & F                + Multiply_Integers  and
 T <= F                                                 and
 F <= Left_Paren & E & Right_Paren + Synthesize_Second  and
 F <= Int_Literal;

This grammar is processed at runtime into a state machine data structure, then used to parse input text. It uses dynamic memory allocation at runtime to hold a stack of the tokens and productions.

And here is a specification of an equivalent grammar, rewritten to allow a recursive-descent implementation:

Grammar specification:

L -> E  EOF
E -> T {+ T}
T -> F {* F}
F -> ( E )
F -> integer

Ada code:

E : Operation_List.Handle    := new Operation_List.Instance;
F : Integer_Selection.Handle :=
 (Left_Paren & E & Right_Paren + Build_Parens'Access or Int) + Build_Selection'Access;
T : Operation_List.Handle    := F ** Times * Times_Element'Access + Init_Times'Access;
L : Integer_Sequence.Handle  := E & EOF + Build_Print'Access;

E.all := Operation_List.Get
 (Element     => T,
  Separator   => Plus,
  Initialize  => Init_Plus'Access,
  Add_Element => Plus_Element'Access);

This grammar is used directly at runtime to parse input text. It does not use dynamic memory allocation during parsing (it is used while building the grammar), but it does use more stack space than the LR version. Both parsers are reasonably efficient.

OpenToken also allows implementing horribly inefficient recursive descent grammars that still work; it supports infinite lookahead with backtracking. This can be useful when just getting started with parsers; first get it working, then optimize it.

The lexer phase of OpenToken parsers is implemented by classes of recognizers. Each recognizer runs in parallel, and the one matching the longest input lexeme wins. This is not as efficient as a tradational lexers, for example those generated by lex, since those use a finite state machine. But it is easier to use, because the recognizers are already written, well documented, and can be independently tested.

Some reusable parse tokens are also included in OpenToken, but they tend to be more domain-specific.

Ada's type safety features make misbehaving lexers and parsers easier to debug. In addition, OpenToken includes a trace feature, that outputs useful information as the parse progresses, also aiding debugging.

OpenToken is distributed under GPL version 3, with the GNAT modification that allows use in non-GPL projects.

OpenToken was originally written by Ted Dennison. Contributions were made by Christoph Grein and Stephen Leake.

OpenToken is currently maintained by Stephen Leake. Submit bugs via the Debian Bug system, against the opentoken source package. Discussion about OpenToken is on the newsgroup comp.lang.ada.

OpenToken may be obtained in several ways:

OpenToken has been tested on the following operating systems/compilers:

The simplest way to use the source distribution is to install it in the standard GNAT directories, where the compiler will find it by default. Then just put 'with "OpenToken";' in your project file. To install it on Windows, using a DOS shell, and assuming GNAT and 'make' are in your PATH:

cd opentoken/Build/windows_release
make -f Makefile.install install

On GNU/Linux, use 'linux_release' instead of 'windows_release'. This assumes you have write permission in the GNAT directory tree.

See developer instructions if you want to run the tests or work on OpenToken.

Ted's original OpenToken web page is no longer being maintained. The AdaPower mailing list and discussion board are no longer used.

There is a User's Guide that describes how to create a simple application using OpenToken. There are also several examples in the source; see the Examples directory. The Test directory also has useful examples, although oriented towards testing OpenToken itself.

OpenToken uses this test page to verify that the HTML lexer handles valid HTML 4.01.

History

Version 4.0 7 Feb 2010

Version 3.1 August 4, 2009

Version 3.0b 13 August 2000

This version contains another code reorganization to go with another new parsing facility. This time it is recursive decent parsing. The new method has the following advantages over table-driven parsers:

The disadvantages are:

Given the above balance, I do intend to make this the standard supported parsing facility for future versions of OpenToken. The "b" designation is there to indicate that some things might not be in quite their permanent form yet, and that there isn't yet the full set of reusable tokens to support it that I would like to see in a release. I'm hoping for feedback both in the form of criticisms/suggestions, and reusable tokens in order to help finalize this facility.

A general list of the changes is below:

Version 2.0 27 January 2000

This is the first version to include parsing capability. The existing packages underwent a major reorganization to accommodate the new functionality. As some of the restructuring that was done is incompatible with old code, the major revision has been bumped up to 2. A partial list of changes is below:

Version 1.3.6

This version fixes a rare bug in the Ada style based numeric recognizers. The SLOC counter can now successfully count all the source files in Gnat's adainclude directory.

Version 1.3.5

This version adds a simple Ada SLOC counting program into the examples. A bug with the Real token recognizer that caused constraint_errors has been fixed. Also bugs causing constraint errors in the ada-style based integer and real recognizers on long non-based numbers have been fixed.

Version 1.3

This version adds the default token capability to the Analyzer package. This allows a more flexible (if somewhat inefficient) means of error handling to the analyzer. The default token can be used as an error token, or it can be made into a non-reportable token to ignore unknown elements entirely.

Identifier tokens were generalized a bit to allow user-defined character sets for the first and subsequent characters. This not only gives it the ability to handle syntaxes that don't exacly match Ada's, but it allows one to define identifiers for languages that aren't latin-1 based. Also, the ability to turn off non-repeatable underscores was added.

Integer and Real tokens had an option added to support signed literals. This option is set on by default (which causes a minor backward incompatibility). Syntaxes that have addition or subtraction operators will need to turn this option off.

A test to verify proper handling of default parameters was added to the Test directory. A makefile was also added to the same directory to facilitate automatic compiling and running of the tests. This makefile will not work in a non-Gnat/NT environment without some modification.

New recognizers were added for enclosed comments (eg: C's /* */ comments)and  single character escape sequences. Also a "null" recognizer was added for use as a default token.

Version 1.2.1

This version adds the CSV field token recognizer that was inadvertently left out of 1.2. This recognizer was designed to match fields in comma-separated value (CSV) files, which is a somewhat standard file format for databases and spreadsheets. Also, the extraneous CVS directories in the zip version of the distribution were removed.

Version 1.2

The long-awaited string recognizer has been added. It is capable of recognizing both C and Ada-style strings. In addition, there are a great many submissions by Christoph Grein in this release. He contributed mostly complete lexical analyzers for both Java and Ada, along with all the extra token recognizers he needed to accomplish this feat. He didn't need as many extra recognizers as I would have thought he'd need. But even so, slightly less than 1/2 of the recognizers in this release were contributed by Chris (with a broken arm, no less!)

Version 1.1

The main code change to this version is a default text feeder function that has been added to the analyzer. It reads its input from Ada.Text_IO.Current_Input, so you can change the file to whatever you want fairly easily. The capability to create and use your own feeder function still exists, but it should not be necessary in most cases. If you already have code that does this, it should still compile and work properly.

The other addition is the first version of the OpenToken user's guide. All it contains right now is a user manual walking through the steps needed to make a simple token analyzer. Feedback and/or ideas on this are welcome.

Version 1.0

This is the very first publicly released version. This package is based on work I did while working on the JPATS trainer for FlightSafety International. The germ of this idea came while I was trying to port a fairly ambitious, but fatally buggy Ada 83 token recognition package written for a previous simulator. But once I was done, I was rather suprised at the flexibility of the final product. Seeing the possible benefit to the community, and to the company through user-submitted enhancement and debugging, I suggested that this code be released as Open Source. They were open-minded enough to agree. Bravo!


my home page Author : Stephen Leake Valid HTML 4.01! Created with Emacs powered by Ada Last modified: Mon Feb 15 02:27:16 EST 2010