Important:Pyodide takes time to initialize.
Initialization completion is indicated by a red border around Run all button.
We previously saw how to
produce a grammar that is an intersection of two regular grammars. One of the
interesting things about regular grammars is that, you can produce an
intersection between a regular grammar and a context-free grammar, and the
result will be context-free. The traditional technique for intersecting
between a CFL and an RL is to produce the PDA and the DFA equivalents of both,
and produce a product PDA. However, there is an even better way called
the Bar-Hiller construction^{1} that lets you compute an
intersection between a CFG and RG directly.
The essential idea is to first recognize that a right linear grammar that
encodes a regular language can be made to have a final nonterminal to expand.
The below are patterns in right linear grammars.
<s> := a <a>
| b <b>
<a> := c
| {empty}
<b> := b
Given such a grammar, one can convert it as below so that the last nonterminal
to be expanded is <f>
<s> := a <a>
| b <b>
<a> := c <f>
| <f>
<b> := b <f>
<f> := {empty}
Why do we do this? because this lets us represent the grammar as a
non-deterministic finite automation with a single start state (<S>) and
a single end state (<F>). This is required for the Bar-Hiller construction.
Next, we also convert the context-free grammar to Chomsky-Normal-Form
(actually we do not need as much restrictions, as we will see later).
The CNF looks like below.
<S> := <A><B>
<A> := a
| {empty}
<B> := b
The essential idea is to take all nonterminal symbols from the regular
grammar, and all nonterminal symbols from the context-free grammar, and
produce triplets which starts with a nonterminal from RG, and ends with
another nonterminal from the RG, and has a nonterminal from CFG in the
middle. E.g. <a,A,b>.
The intersection grammar is represented by the start symbol <s,S,f> where
<s> is the start symbol of the regular grammar, and <f> is the final
symbol as we discussed above. <S> is the start symbol of the context-free
grammar. The essential idea is that if we want to produce <s,S,f> then
it can only be produced if the rules can be written such that they start with
<s> and end with <f>. That is, the definition of <s,S,f> is as follows:
<s,S,f> := <s,A,x><x,B,f>
where <x> is a nonterminal symbol in the regular grammar such that it is
reachable from <s>, and <f> is reachable from <x>. Further, it means
that if we go from <s> to <x> by consuming a string, then that string must
also be parsable by <A>. In our example, this could be one rule.
<s,S,f> := <s,A,a><a,B,f>
If one of the tokens in the context-free rule is a terminal symbol, then we
get an opportunity to immediately verify our construction.
<s,A,a> := [<s>,a,<a>]
As you can see, <A> has one of the rules that contain a single terminal
symbol – a. So, we can immediately see that the requirement <s,A,a>
was satisfied. That is, <s> goes to <a> by consuming a, and this is
witnessed by [<s>,a,<a>]. So, we will keep this rule in the intersection
grammar as
<s,A,a> := a
What about the second rule?
<s,A,a> := [<s>,{empty},<a>]
This however, does not work because there is no epsilon transition from <s>
to <a>. Hence, this rule is skipped in the resulting grammar.
Let us see how to implement this technique.
We start by importing the prerequisites.
System Imports
These are available from Pyodide, but you may wish to make sure that they are
installed if you are attempting to run the program directly on the machine.
sympy
Available Packages
These are packages that refer either to my previous posts or to pure python
packages that I have compiled, and is available in the below locations. As
before, install them if you need to run the program directly on the machine.
Next, we want to limit the number of forms of production rules that we want to
handle. Hence, we convert the context-free grammar to a normal form such that
it conforms exclusively to the following pattern.
\[A \rightarrow a\]
\[A \rightarrow B\]
\[A \rightarrow aB\]
\[A \rightarrow Bb\]
\[A \rightarrow BC\]
\[S \rightarrow \epsilon\]
That is, each production rule has at most two tokens. The new normal form is
provided by binary_normal_form(). Note that this is not exactly the
Chomsky Normal Form. We
do not need the full restrictions of CNF. However, our algorithms will also
work if a grammar in CNF form is given.
Let us define a simple expression grammar to use as an example.
We verify that our BNF converter works.
Here is another, the grammar for JSON
We again verify that our BNF converter works.
Triplet rules
As we discussed previously, we transform the grammar such that we produce
every variations of triplets with nonterminals from regular grammar starting
and ending, and nonterminal from context-free grammar in the middle.
That is, given this,
<S> := <A><B>
and given <a>, <b>, <c>, <f> as the regular nonterminals, each
reachable from previous, we produce
We modify our grammar display so that it knows about our triples.
Remove invalid terminal transitions
Next, we remove invalid terminal transitions. That is, given
<s,A,a> := [<s>,a,<a>]
We check that <s> can reach <a> by consuming a. If not,
this is an invalid rule, and we remove it from the production rules.
Remove undefined nonterminals
At this point we may have multiple nonterminals with no rules defining them.
That is, such nonterminals can’t be expanded. Hence, these can be removed.
Bar-Hiller, M. Perles, and E. Shamir. On formal properties of simple phrase structure grammars. Zeitschrift fur Phonetik Sprachwissenschaft und Kommunikationforshung, 14(2):143–172, 1961. ↩