Title: Grammar induction  
Author: World Heritage Encyclopedia
Language: English
Subject: Inductive programming, Inference, Genetic programming, Data mining, Reinforcement learning
Collection: Computational Linguistics, Genetic Programming, Grammar, Inference, MacHine Learning, Natural Language Processing
Publisher: World Heritage Encyclopedia

Grammar induction

Grammar induction, also known as grammatical inference or syntactic pattern recognition, is the process in machine learning of learning a formal grammar (usually as a collection of rewrite rules or productions, or alternatively as a finite-state machine or automaton of some kind) from a set of observations, thus constructing a model that accounts for the characteristics of the observed objects. More generally, grammatical inference is the branch of machine learning in which the instance space consists of discrete combinatorial objects such as strings, trees, and graphs.

There is now a rich literature on learning different types of grammars and automata, under a variety of learning models and methodologies.


  • 1 Grammar Classes
  • 2 Learning Models
  • 3 Methodologies
    • 3.1 Grammatical inference by trial-and-error
    • 3.2 Grammatical inference by genetic algorithms
    • 3.3 Grammatical inference by greedy algorithms
    • 3.4 Distributional Learning
    • 3.5 Learning of Pattern languages
  • 4 Applications
  • 5 See also
  • 6 Notes
  • 7 References

Grammar Classes

Grammatical inference has often focused on the problem of learning finite-state machines of various types (see the article Induction of regular languages for details on these approaches), since efficient algorithms for this problem have existed since the 1980s.

More recently these approaches have been extended to the problem of inference of context-free grammars and richer formalisms, such as multiple context-free grammars and parallel multiple context-free grammars. Other classes of grammars for which grammatical inference has been studied are contextual grammars, and pattern languages.

Learning Models

The simplest form of learning is where the learning algorithm merely receives a set of examples drawn from the language in question, but other learning models have been studied. One frequently studied alternative is the case where the learner can ask membership queries as in the exact query learning model or minimally adequate teacher model introduced by Angluin.
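As a toy illustration of the exact query learning model, the sketch below (a hypothetical example, not taken from Angluin's papers) learns the unknown cutoff t of a target language { aⁱ : i ≤ t } purely by asking membership queries:

```python
def learn_threshold(member, upper=1024):
    """Learn the unknown cutoff t of the target language
    { 'a'*i : i <= t } by binary search over membership queries.
    `member(s)` is the oracle answering "is s in the language?"."""
    lo, hi = 0, upper  # invariant: 'a'*lo is in L, 'a'*hi is not
    assert member("a" * lo) and not member("a" * hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if member("a" * mid):
            lo = mid
        else:
            hi = mid
    return lo  # the learned threshold t
```

Only O(log upper) queries are needed here, which illustrates why query access can be far more powerful than passively receiving examples.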


Methodologies

There is a wide variety of methods for grammatical inference. Two of the classic sources are Fu (1977) and Fu (1982). Duda, Hart & Stork (2001) also devote a brief section to the problem and cite a number of references. The basic trial-and-error method they present is discussed below. For approaches to infer subclasses of regular languages in particular, see Induction of regular languages. A more recent textbook is de la Higuera (2010),[1] which covers the theory of grammatical inference of regular languages and finite-state automata. D'Ulizia, Ferri and Grifoni[2] provide a survey that explores grammatical inference methods for natural languages.

Grammatical inference by trial-and-error

The method proposed in Section 8.7 of Duda, Hart & Stork (2001) suggests successively guessing grammar rules (productions) and testing them against positive and negative observations. The rule set is expanded so as to be able to generate each positive example, but if a given rule set also generates a negative example, it must be discarded. This particular approach can be characterized as "hypothesis testing" and bears some similarity to Mitchell's version space algorithm. The Duda, Hart & Stork (2001) text provides a simple example which nicely illustrates the process, but the feasibility of such an unguided trial-and-error approach for more substantial problems is dubious.
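A minimal, unoptimized sketch of this guess-and-test loop, assuming the learner is handed a pool of candidate productions to try (all names here are illustrative, not from Duda, Hart & Stork):

```python
def generates(rules, s, start="S", depth=6):
    """Test whether the CFG `rules` (dict: nonterminal -> list of
    right-hand-side tuples) derives the string `s`, by bounded search."""
    def derive(form, d):
        if d == 0:
            return False
        if all(sym not in rules for sym in form):
            return "".join(form) == s  # all terminals: compare directly
        if len("".join(t for t in form if t not in rules)) > len(s):
            return False  # terminals already too long; prune
        # expand the leftmost nonterminal with every candidate RHS
        i = next(i for i, sym in enumerate(form) if sym in rules)
        return any(derive(form[:i] + rhs + form[i + 1:], d - 1)
                   for rhs in rules[form[i]])
    return derive((start,), depth)

def trial_and_error(positives, negatives, candidate_rhss):
    """Grow a rule set until it covers all positive examples,
    discarding any candidate rule that lets a negative through."""
    rules = {"S": []}
    for rhs in candidate_rhss:
        trial = {"S": rules["S"] + [rhs]}
        if any(generates(trial, n) for n in negatives):
            continue  # rule overgenerates: discard it
        rules = trial
        if all(generates(rules, p) for p in positives):
            break
    return rules
```

Note how a candidate rule is discarded the moment it lets any negative example through, which is exactly the "hypothesis testing" behaviour described above.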

Grammatical inference by genetic algorithms

Grammar induction using evolutionary algorithms is the process of evolving a representation of the grammar of a target language through some evolutionary process. Formal grammars can easily be represented as a tree structure of production rules that can be subjected to evolutionary operators. Algorithms of this sort stem from the genetic programming paradigm pioneered by John Koza. Other early work on simple formal languages used the binary string representation of genetic algorithms, but the inherently hierarchical structure of grammars couched in the EBNF language made trees a more flexible approach.

Koza represented Lisp programs as trees. He was able to find analogues to the genetic operators within the standard set of tree operators. For example, swapping sub-trees is equivalent to the corresponding process of genetic crossover, where sub-strings of a genetic code are transplanted into an individual of the next generation. Fitness is measured by scoring the output from the functions of the Lisp code. Similar analogues between the tree-structured Lisp representation and the representation of grammars as trees made the application of genetic programming techniques possible for grammar induction.

In the case of grammar induction, the transplantation of sub-trees corresponds to the swapping of production rules that enable the parsing of phrases from some language. The fitness operator for the grammar is based upon some measure of how well it performed in parsing some group of sentences from the target language. In a tree representation of a grammar, a terminal symbol of a production rule corresponds to a leaf node of the tree, while its parent node corresponds to a non-terminal symbol (e.g. a noun phrase or a verb phrase) in the rule set. Ultimately, the root node might correspond to a sentence non-terminal.
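The subtree transplantation described above can be sketched on trees encoded as nested tuples; the helper names are hypothetical:

```python
import random

# A tree is a tuple (symbol, child, child, ...); leaves are plain strings.
def subtrees(t, path=()):
    """Yield (path, subtree) for every node of the tree."""
    yield path, t
    if isinstance(t, tuple):
        for i, child in enumerate(t[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace_at(t, path, new):
    """Return a copy of `t` with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return t[:i] + (replace_at(t[i], path[1:], new),) + t[i + 1:]

def crossover(parent_a, parent_b, rng=random):
    """Swap one randomly chosen subtree of each parent -- the tree
    analogue of crossover on binary strings."""
    path_a, sub_a = rng.choice(list(subtrees(parent_a)))
    path_b, sub_b = rng.choice(list(subtrees(parent_b)))
    return (replace_at(parent_a, path_a, sub_b),
            replace_at(parent_b, path_b, sub_a))
```

Because the two offspring together contain exactly the leaves of the two parents, no genetic material is created or lost by the swap.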

Grammatical inference by greedy algorithms

Like all greedy algorithms, greedy grammar inference algorithms iteratively make the decisions that seem best at each stage. These decisions typically concern the creation of new rules, the removal of existing rules, the choice of which rule to apply, or the merging of existing rules. Because 'the stage' and 'the best' can be defined in several ways, there are also several greedy grammar inference algorithms.

These context-free grammar generating algorithms make the decision after every read symbol:

  • The Lempel–Ziv–Welch algorithm creates a context-free grammar deterministically, in such a way that only the start rule of the generated grammar needs to be stored.
  • Sequitur and its modifications.

These context-free grammar generating algorithms first read the whole given symbol sequence and then start to make decisions.
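As a concrete illustration of the offline style, the sketch below (in the spirit of byte pair encoding and the Re-Pair algorithm; the function names are illustrative) repeatedly replaces the most frequent adjacent symbol pair in the whole sequence with a fresh nonterminal:

```python
from collections import Counter

def greedy_grammar(seq):
    """Offline greedy grammar construction: repeatedly replace the
    most frequent adjacent pair of symbols with a fresh nonterminal
    until no pair occurs twice."""
    rules = {}
    seq = list(seq)
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats: stop
        nt = f"R{next_id}"
        next_id += 1
        rules[nt] = pair
        # rewrite the sequence, replacing non-overlapping occurrences
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    rules["S"] = tuple(seq)
    return rules

def expand(rules, sym="S"):
    """Expand a nonterminal back into the original string."""
    if sym not in rules:
        return sym
    return "".join(expand(rules, s) for s in rules[sym])
```

The resulting grammar generates exactly one string, the input; when the input is repetitive, its start rule is much shorter than the input, which is why such algorithms are useful for compression.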

Distributional Learning

A more recent approach is based on Distributional Learning. Algorithms using these approaches have been applied to learning context-free grammars and mildly context-sensitive languages and have been proven to be correct and efficient for large subclasses of these grammars.[3]

Learning of Pattern languages

Angluin defines a pattern to be a string of constant symbols from Σ and variable symbols from a disjoint set. The language of such a pattern is the set of all of its nonempty ground instances, i.e. all strings resulting from consistent replacement of its variable symbols by nonempty strings of constant symbols.[note 1] A pattern is called descriptive for a finite input set of strings if its language is minimal (with respect to set inclusion) among all pattern languages subsuming the input set.
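Membership of a string in a one-variable pattern language can be decided by brute force over the length of the string substituted for the variable, as in this illustrative sketch (the variable is hard-coded as 'x'; this is not Angluin's algorithm):

```python
def matches(pattern, s):
    """Decide whether `s` is a nonempty ground instance of a
    one-variable pattern such as "axxb": every occurrence of 'x'
    must be replaced by the same nonempty constant string."""
    n_vars = pattern.count("x")
    if n_vars == 0:
        return pattern == s
    const_len = len(pattern) - n_vars
    # try every candidate length k for the substitution of x
    for k in range(1, len(s) + 1):
        if const_len + n_vars * k != len(s):
            continue
        sub, i, ok = None, 0, True
        for c in pattern:
            if c == "x":
                piece = s[i:i + k]
                if sub is None:
                    sub = piece          # first occurrence fixes x
                elif piece != sub:
                    ok = False           # inconsistent substitution
                    break
                i += k
            else:
                if i >= len(s) or s[i] != c:
                    ok = False           # constant symbol mismatch
                    break
                i += 1
        if ok:
            return True
    return False
```

For example, "a1212b" is an instance of the pattern "axxb" (with x ↦ "12"), while "a12b" is not, because the two occurrences of x would need different substitutions.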

Angluin gives a polynomial algorithm to compute, for a given input string set, all descriptive patterns in one variable x.[note 2] To this end, she builds an automaton representing all possibly relevant patterns; using sophisticated arguments about word lengths, which rely on x being the only variable, the state count can be drastically reduced.[4]

Erlebach et al. give a more efficient version of Angluin's pattern learning algorithm, as well as a parallelized version.[5]

Arimura et al. show that a language class obtained from limited unions of patterns can be learned in polynomial time.[6]


Applications

The principle of grammar induction has been applied to other aspects of natural language processing, including (among many other problems) morpheme analysis and even place-name derivation. Grammar induction has also been used for lossless data compression and for statistical inference via minimum message length (MML) and minimum description length (MDL) principles.

See also


Notes

  1. ^ The language of a pattern with at least two occurrences of the same variable is not regular due to the pumping lemma.
  2. ^ x may occur several times, but no other variable y may occur.


References

  1. ^ de la Higuera, Colin (2010). Grammatical Inference: Learning Automata and Grammars. Cambridge: Cambridge University Press.
  2. ^ D'Ulizia, A.; Ferri, F.; Grifoni, P. (2011). "A Survey of Grammatical Inference Methods for Natural Language Learning". Artificial Intelligence Review, Vol. 36, No. 1, pp. 1–27.
  3. ^ Clark, A. and Eyraud, R. (2007), Journal of Machine Learning Research; Yoshinaka, R. (2011), Theoretical Computer Science.
  4. ^ Angluin, Dana (1980). "Finding Patterns Common to a Set of Strings". Journal of Computer and System Sciences 21: 46–62.
  5. ^ Erlebach, T.; Rossmanith, P.; Stadtherr, H.; Steger, A.; Zeugmann, T. (1997). "Learning One-Variable Pattern Languages Very Efficiently on Average, in Parallel, and by Asking Queries". In M. Li and A. Maruoka (eds.), Proc. 8th International Workshop on Algorithmic Learning Theory (ALT'97). LNAI 1316. Springer. pp. 260–276.
  6. ^ Arimura, Hiroki; Shinohara, Takeshi; Otsuki, Setsuko (1994). "Finding Minimal Generalizations for Unions of Pattern Languages and Its Application to Inductive Inference from Positive Data". Proc. STACS 11. LNCS 775. Springer. pp. 649–660.
  • Duda, Richard O.; Hart, Peter E.; Stork, David G. (2001). Pattern Classification (2nd ed.).
  • Fu, King Sun (1982). Syntactic Pattern Recognition and Applications.
  • Fu, King Sun (1977). Syntactic Pattern Recognition, Applications.
  • Horning, James Jay (1969). A Study of Grammatical Inference (Ph.D. thesis).
  • Gold, E. Mark (1967). "Language Identification in the Limit". Information and Control 10: 447–474.