Certification of Context-Free Grammar Algorithms

Denis Firsov

PhD thesis, Tallinn University of Technology, 2016.

Abstract

 

Context-free grammars are a widely used formalism in compiler construction for defining the syntactical structure of programming languages. In this thesis, our main goal is to implement and certify a parser generator for context-free languages. A parser generator takes a context-free grammar and returns a function that tries to find a parse tree for a given string. A certified parser generator delivers valid parse trees and will find one, if it exists, for the given context-free grammar.

 

There are different approaches to show that programs are correct: model checking, axiomatic semantics, expressive type systems. We are interested in constructive branches of logic. Proofs carried out within a constructive logic may be considered as programs in a functional language. It is important because of the possibility of extraction of the purported object from an existence proof. Our work is done in the Agda dependently typed programming language. Agda provides a single framework to write functional code and prove properties about it.

 

The main constituents of a context-free grammar are finite sets of terminals, nonterminals, and rules. Therefore, in the first part of the thesis, we investigate various encodings and properties of finiteness in constructive mathematics. A set is considered listable, if it can be completely enumerated in a list. We begin by showing that listable sets have decidable equality. This result allows us to conclude that certain basic variations of listability are logically equivalent to each other. We also develop a library of combinators to ease programming with finite sets. The library includes combinators for concise definition of functions on listable sets and a prover for quantified formulas over decidable properties on listable sets. Additionally, we propose that new listable sets can be defined by listing subsets of a base set with decidable equality.

 

The second part of the thesis is devoted to parsing. We implement and certify the Cocke-Younger-Kasami algorithm, as it has a simple and elegant structure. Moreover, the algorithm allows one to parse general context-free languages. The relative simplicity greatly contributes to certifying the implementation. The naive recursive encoding leads to excessive recomputations, but we recover the efficient algorithm by introducing memoization. The refinement to the memoized version is done in a provably correctness-preserving manner.

 

The downside of the CYK parsing algorithm is that it requires context-free grammars in Chomsky normal form. The last part of the thesis focuses on normalization of context-free grammars. We divide normalization into four independent transformations. For each transformation, we prove that it achieves progress towards normality and also preserves the language of the grammar. We also show that the composition of transformations in the appropriate order converts any context-free grammar into its Chomsky normal form. Moreover, the proof of soundness of the normalization procedure is a function for converting any parse tree for the normalized grammar back into a parse tree for the original grammar.

 

[pdf, slides, code (Agda 2.5.1; standard library 0.12)]