
Assignment 2: Parsing with PLY (Python Lex-Yacc)

ECEN 4553 & 5013, CSCI 4555 & 5525

Prof. Jeremy G. Siek

The main ideas covered this week are
  • Lexical analysis: regular expressions and using lex
  • Parsing: the LALR(1) shift-reduce algorithm, and using yacc, including shift-reduce errors, reduce-reduce errors, and precedence declarations

The process of transforming a text file (a sequence of characters) into an abstract syntax tree is traditionally subdivided into two parts: lexical analysis (often called scanning) and parsing. It is entirely feasible to implement a compiler without a separate lexical analysis phase, instead just parsing. However, scannerless parsers tend to be slower, which used to matter when computers were slow, and sometimes still matters for very large files.

The Python Lex-Yacc tool, abbreviated PLY [1], is an easy-to-use Python imitation of the original lex and yacc C programs. Lex was written by Eric Schmidt and Mike Lesk [3] at Bell Labs, and is the standard lexical analyzer generator on many Unix systems. The input to lex is a specification consisting of a list of the kinds of tokens and a regular expression for each. The output of lex is a program that analyzes a text file, turning it into a sequence of tokens. YACC stands for Yet Another Compiler Compiler and was written by Stephen C. Johnson at AT&T [2]. The input to yacc is a context-free grammar together with an action (chunk of code) for each production. The output of yacc is a program that parses a text file and fires the appropriate actions when a production is applied. The PLY tool combines the functionality of both lex and yacc. We will use the PLY tool to generate a lexer and parser for the P0 subset of Python, shown in Figure 1.

program ::= module
module ::= simple_statement+
simple_statement ::= "print" expression ("," expression)*
unary_op ::= "+" | "-"
binary_op ::= "+" | "-" | "*" | "/" | "%" | "**"
expression ::= integer
             | unary_op expression
             | expression binary_op expression
             | "(" expression ")"

Figure 1: Concrete syntax for the P0 subset of Python.
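
For example, the following two-line program (our own illustration, not part of the assignment) is generated by this grammar:

print 1 + 2, -3
print (4 + 5) * 6 % 7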

1 Lexical Analysis

The lexical analyzer turns a sequence of characters (a string) into a sequence of tokens. For example, the string

'print 1 + 3'

will be converted into the list of tokens

['print', '1', '+', '3']

Actually, to be more accurate, each token will contain the token type (like identifier or integer) and the token's value, which is the string from the input that matched the token. With the PLY tool, the types of the tokens must be specified by initializing the tokens variable. For example,

tokens = ('PRINT', 'INT', 'PLUS')

Next we must specify which sequences of characters will map to each type of token. We do this using regular expressions. The term "regular" comes from "regular languages", which are the (particularly simple) set of languages that can be recognized (parsed) by a finite automaton. A regular expression is a pattern formed of the following core elements (a short example follows the list):

  1. a character, e.g. a. The only string that matches this regular expression is a.
  2. two regular expressions, one followed by the other (concatenation), e.g. bc. The only string that matches this regular expression is bc.
  3. one regular expression or another (alternation), e.g. a|bc. Both the strings 'a' and 'bc' would be matched by this pattern.
  4. a regular expression repeated zero or more times (Kleene closure), e.g. (a|bc)*. The string 'bcabcbc' would match this pattern, but not 'bccba'.
  5. the empty sequence (epsilon)
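
As a quick check of the patterns above, they can be tried out with Python's re module. The following snippet is our own illustration, not part of the assignment code; the trailing $ anchors the match so the whole string must be consumed.

import re

# (a|bc)* matches zero or more repetitions of 'a' or 'bc'.
pattern = re.compile(r'(a|bc)*$')
print pattern.match('bcabcbc') is not None   # True
print pattern.match('bccba') is not None     # False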

In Python, there is special syntax and language support for specifying regular expressions. To create a regular expression, start with r' followed by the regular expression, and end with '. For example, the following specifies the regular expression for the 'PRINT' token.

t_PRINT = r'print'

The t_ prefix is a naming convention that PLY uses to know when you are defining the regular expression for a token. The Python support for regular expressions goes beyond the "core" elements and includes many other convenient short-hands; for example, + is for repetition one or more times. If you want to refer to the actual character +, use a backslash to escape it. Section 4.2.1 (Regular Expression Syntax) of the Python Library Reference gives an in-depth description of the extended regular expressions supported by Python.

t_PLUS = r'\+'

Sometimes you need to do some extra processing for certain kinds of tokens. For example, for the INT token it is nice to convert the matched input string into a Python integer. With PLY you can do this by defining a function for the token. The function must have the regular expression as its documentation string, and the body of the function should fill in the value field of the token. Here's how it would look for the INT token. The \d regular expression stands for any decimal digit (0-9).

def t_INT(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print "integer value too large", t.value
        t.value = 0
    return t

In addition to defining regular expressions for each of the tokens, you'll often want to perform special handling of newlines and whitespace. The following is the code for counting newlines and for telling the lexer to ignore whitespace. (We'll need to reconsider this later to handle Python's indentation rules.)

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

t_ignore = ' \t'

If a portion of the input string is not matched by any of the tokens, then the lexer calls the error function that you provide. The following is an example error function.

def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

Last but not least, you’ll need to instruct PLY to generate the lexer from your specification with the following code.

import ply.lex as lex
lex.lex()

Figure 2 shows the complete code for an example lexer.
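
To experiment with the lexer, you can bind the result of lex.lex(), feed it a string, and iterate over the tokens it produces. The following driver is a minimal sketch of ours, assuming the token definitions of Figure 2 are in scope:

lexer = lex.lex()
lexer.input("print 1 + 3")   # the example string from the start of this section
for tok in lexer:
    print tok.type, tok.value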

Exercise 1.1. Write a PLY lexer specification for P0 and test it on a few input programs, looking at the output list of tokens to see if they make sense.

2 Parsing

Figure 3 shows an example use of PLY to generate a parser. The code specifies a grammar and it specifies actions for each rule. For each grammar rule there is a function whose name must begin with p_. The documentation string of the function contains the specification of the grammar rule. PLY uses just a colon : instead of the usual ::= to separate the left and right-hand sides of a grammar production. The left-hand side symbol for the first function (as it appears in the Python file) is considered the start symbol. The body of these functions contains code that carries out the action for the production.

Typically, what you want to do in the actions is build an abstract syntax tree, as we do here. The parameter t of the function contains the results from the actions that were carried out to parse the right-hand side of the production. You can index into t to access these results, starting with t[1] for the first symbol of the right-hand side. To specify the result of the current action, assign the result into t[0]. So, for example, in the production expression : INT, we build a Const node containing an integer that we obtain from t[1], and we assign the Const node to t[0].

tokens = ('PRINT', 'INT', 'PLUS')

t_PRINT = r'print'
t_PLUS = r'\+'

def t_INT(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print "integer value too large", t.value
        t.value = 0
    return t

t_ignore = ' \t'

def t_newline(t):
    r'\n+'
    t.lexer.lineno += t.value.count("\n")

def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

import ply.lex as lex
lex.lex()

Figure 2: Example lexer implemented using the PLY lexer generator.

from compiler.ast import Printnl, Add, Const

def p_print_statement(t):
    'statement : PRINT expression'
    t[0] = Printnl([t[2]], None)

def p_plus_expression(t):
    'expression : expression PLUS expression'
    t[0] = Add((t[1], t[3]))

def p_int_expression(t):
    'expression : INT'
    t[0] = Const(t[1])

def p_error(t):
    print "Syntax error at '%s'" % t.value

import ply.yacc as yacc
yacc.yacc()
Figure 3: First attempt at writing a parser using PLY.
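
Once the definitions of Figures 2 and 3 have been loaded, parsing a string is a single call to yacc.parse. A minimal sketch of ours (the input string is our example):

ast = yacc.parse("print 1 + 3")
print ast   # prints something like Printnl([Add((Const(1), Const(3)))], None)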

The PLY parser generator takes your grammar and generates a parser that uses the LALR(1) shift-reduce algorithm, which is the most common parsing algorithm in use today. LALR(1) stands for Look-Ahead, Left-to-right, Rightmost-derivation, with 1 token of lookahead. Unfortunately, the LALR(1) algorithm cannot handle all context-free grammars, so sometimes you will get errors from PLY. To understand these errors and know how to avoid them, you have to know a little bit about the parsing algorithm.

2.1 The LALR(1) Algorithm

The LALR(1) algorithm uses a stack and a finite automaton. Each element of the stack is a pair: a state number and a symbol (terminal or non-terminal). Each state in the finite automaton represents where the parser stands in the parsing process with respect to certain grammar rules. Figure 4 shows an example LALR(1) parse table generated by PLY for the grammar specified in Figure 3. When PLY generates a parse table, it also outputs a textual representation of the parse table to the file parser.out, which is useful for debugging purposes.

Consider state 1 in Figure 4. The parser has just read in a PRINT token, so the top of the stack is (1, PRINT). The parser is part of the way through parsing the input according to grammar rule 1, which is signified by showing rule 1 with a dot after the PRINT token and before the expression non-terminal. A rule with a dot in it is called an item. There are several rules that could apply next, both rule 2 and rule 3, so state 1 also shows those rules with a dot at the beginning of their right-hand sides. The edges between states indicate which transitions the automaton should make depending on the next input token. So, for example, if the next input token is INT then the parser will push INT and the target state 4 on the stack and transition to state 4. Suppose we are now at the end of the input. In state 4 it says we should reduce by rule 3, so we pop from the stack the same number of items as the number of symbols in the right-hand side of the rule, in this case just one. We then momentarily jump to the state at the top of the stack (state 1) and then follow the goto edge that corresponds to the left-hand side of the rule we just reduced by, in this case expression, so we arrive at state 3. (A slightly longer example parse is shown in Figure 4.)

In general, the shift-reduce algorithm works as follows (a schematic sketch in Python appears after the list). Look at the next input token.

  • If there is a shift edge for the input token, push the edge's target state and the input token on the stack and proceed to the edge's target state.
  • If there is a reduce action for the input token, pop k elements from the stack, where k is the number of symbols in the right-hand side of the rule being reduced. Jump to the state at the top of the stack and then follow the goto edge for the non-terminal that matches the left-hand side of the rule we're reducing by. Push the edge's target state and the non-terminal on the stack.
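
The following Python sketch of ours shows that loop schematically. It is an illustration only, not PLY's actual implementation, and table, goto, and rules are hypothetical data structures:

def shift_reduce(tokens, table, goto, rules, start='start'):
    # table maps (state, token) to ('shift', state) or ('reduce', rule number);
    # rules maps a rule number to (left-hand side, length of right-hand side);
    # goto maps (state, non-terminal) to a state. All three are hypothetical.
    stack = [(0, None)]                         # pairs of (state, symbol)
    toks = list(tokens) + ['$end']
    i = 0
    while True:
        state = stack[-1][0]
        kind, target = table[(state, toks[i])]  # a KeyError here is a syntax error
        if kind == 'shift':
            stack.append((target, toks[i]))
            i += 1
        else:                                   # reduce
            lhs, k = rules[target]
            del stack[-k:]                      # pop k (state, symbol) pairs
            if lhs == start:
                return                          # accept
            state = stack[-1][0]
            stack.append((goto[(state, lhs)], lhs))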

Notice that in state 6 of Figure 4 there is both a shift and a reduce action for the token PLUS, so the algorithm does not know which action to take in this case. When a state has both a shift and a reduce action for the same token, we say there is a shift/reduce conflict. In this case, the conflict will arise, for example, when trying to parse the input print 1 + 2 + 3. After having consumed print 1 + 2 the parser will be in state 6, and it will not know whether to reduce to form an expression of 1 + 2, or whether it should proceed by shifting the next + from the input. A similar kind of problem, known as a reduce/reduce conflict, arises when there are two reduce actions in a state for the same token. To understand which grammars give rise to shift/reduce and reduce/reduce conflicts, it helps to know how the parse table is generated from the grammar.

State 0
    start ::= . statement
    statement ::= . PRINT expression
  PRINT: shift to state 1
  statement: goto state 2

State 1
    statement ::= PRINT . expression
    expression ::= . expression PLUS expression
    expression ::= . INT
  INT: shift to state 4
  expression: goto state 3

State 2
    start ::= statement .
  end: accept

State 3
    statement ::= PRINT expression .
    expression ::= expression . PLUS expression
  end: reduce by rule 1
  PLUS: shift to state 5

State 4
    expression ::= INT .
  end: reduce by rule 3
  PLUS: reduce by rule 3

State 5
    expression ::= expression PLUS . expression
    expression ::= . expression PLUS expression
    expression ::= . INT
  INT: shift to state 4
  expression: goto state 6

State 6
    expression ::= expression PLUS expression .
    expression ::= expression . PLUS expression
  end: reduce by rule 2
  PLUS: reduce by rule 2
  PLUS: shift to state 5 (shift/reduce conflict)

Grammar:

  0. start ::= statement
  1. statement ::= PRINT expression
  2. expression ::= expression PLUS expression
  3. expression ::= INT

(PLY numbers the implicit start production as rule 0, so the reduce actions above refer to rules 1-3.)
Example parse of 'print 1 + 2':

Stack                                            Input          Action
[]                                               'print 1 + 2'  shift to state 1
[(1,PRINT)]                                      '1 + 2'        shift to state 4
[(1,PRINT),(4,INT)]                              '+ 2'          reduce by rule 3 to state 1, goto 3
[(1,PRINT),(3,expression)]                       '+ 2'          shift to state 5
[(1,PRINT),(3,expression),(5,+)]                 '2'            shift to state 4
[(1,PRINT),(3,expression),(5,+),(4,INT)]         ''             reduce by rule 3 to state 5, goto 6
[(1,PRINT),(3,expression),(5,+),(6,expression)]  ''             reduce by rule 2 to state 1, goto 3
[(1,PRINT),(3,expression)]                       ''             reduce by rule 1 to state 0, goto 2
[(2,statement)]                                  ''             accept
Figure 4: An LALR(1) parse table and a trace of the algorithm for an example.


2.2 Parse Table Generation

The parse table is generated one state at a time. State 0 represents the start of the parser. We add the production for the start symbol to this state, with a dot at the beginning of the right-hand side. If the dot appears immediately before another non-terminal, we add all the productions with that non-terminal on the left-hand side. Again, we place a dot at the beginning of the right-hand side of each of the new productions. This process, called state closure, is continued until there are no more productions to add.

We then examine each item in the current state I. Suppose an item has the form A ::= α.Xβ, where A and X are symbols and α and β are sequences of symbols. We create a new state, call it J. If X is a terminal, we create a shift edge from I to J, whereas if X is a non-terminal, we create a goto edge from I to J. We then need to add some items to state J. We start by adding all items from state I that have the form B ::= γ.Xκ (where B is any symbol and γ and κ are arbitrary sequences of symbols), but with the dot moved past the X. We then perform state closure on J. This process repeats until there are no more states or edges to add.

We mark states as accepting states if they have an item that is the start production with a dot at the end. Also, to add in the reduce actions, we look for any state containing an item with a dot at the end. Let n be the rule number for this item. We then put a reduce-by-rule-n action into that state for every token Y. For example, in Figure 4 state 4 has an item with a dot at the end. We therefore put a reduce-by-rule-3 action into state 4 for every token. (Figure 4 does not show a reduce rule for INT in state 4 because this grammar does not allow two consecutive INT tokens in the input. We will not go into how this can be figured out, but in any event it does no harm to have a reduce rule for INT in state 4; it just means the input will be rejected at a later point in the parsing process.)
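
To make the closure step concrete, here is a small sketch of ours in Python. The grammar encoding, item representation, and function name are our own for illustration, not PLY's:

def state_closure(items, grammar):
    # grammar maps each non-terminal to a list of right-hand sides (tuples);
    # an item (lhs, rhs, dot) places the dot just before rhs[dot].
    result = set(items)
    worklist = list(items)
    while worklist:
        (lhs, rhs, dot) = worklist.pop()
        if dot < len(rhs) and rhs[dot] in grammar:  # dot sits before a non-terminal
            for production in grammar[rhs[dot]]:
                item = (rhs[dot], production, 0)    # new item, dot at the start
                if item not in result:
                    result.add(item)
                    worklist.append(item)
    return result

# Closing the start item reproduces state 0 of Figure 4:
grammar = {'start': [('statement',)],
           'statement': [('PRINT', 'expression')],
           'expression': [('expression', 'PLUS', 'expression'), ('INT',)]}
print state_closure([('start', ('statement',), 0)], grammar)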

Exercise 2.1. On a piece of paper, walk through the parse table generation process for the grammar in Figure 3 and check your results against Figure 4.

2.3 Resolving conflicts with precedence declarations

To resolve the shift/reduce conflict in state 6, we can add the following precedence rules, which say that addition associates to the left and takes precedence over printing. This will cause state 6 to choose reduce instead of shift.

precedence = (
    ('nonassoc', 'PRINT'),
    ('left', 'PLUS')
)

In general, the precedence variable should be assigned a tuple of tuples. The first element of each inner tuple should be an associativity (nonassoc, left, or right) and the rest of the elements should be tokens. Tokens that appear in the same inner tuple have the same precedence, whereas tokens that appear in later tuples have a higher precedence. Thus, for the typical precedence for arithmetic operations, we would specify the following:

precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE')
)

Figure 5 shows a complete Python program that generates a lexer and parser using PLY, parses an input file (the file specified on the command-line), and prints the AST.

Exercise 2.2. Write a PLY grammar specification for P0 and update your compiler so that it uses the generated lexer and parser instead of using the parser in the compiler module. Perform regression testing on your compiler to make sure that it still passes all of the tests that you created for assignment 1.

References

[1] D. Beazley. PLY (Python Lex-Yacc). http://www.dabeaz.com/ply/.

[2] S. C. Johnson. Yacc: Yet another compiler-compiler. In UNIX Programmer's Manual, volume 2, pages 353-387. Holt, Rinehart, and Winston, New York, NY, USA, 1979.

[3] M. E. Lesk and E. Schmidt. Lex - a lexical analyzer generator. Technical report, Bell Laboratories, July 1975.

###############################################################################
# Lexer

tokens = ('PRINT', 'INT', 'PLUS')

t_PRINT = r'print'
t_PLUS = r'\+'

def t_INT(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print "integer value too large", t.value
        t.value = 0
    return t

t_ignore = ' \t'

def t_newline(t):
    r'\n+'
    t.lexer.lineno += t.value.count("\n")

def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

import ply.lex as lex
lex.lex()

###############################################################################
# Parser

from compiler.ast import Printnl, Add, Const

precedence = (
    ('nonassoc', 'PRINT'),
    ('left', 'PLUS')
)

def p_print_statement(t):
    'statement : PRINT expression'
    t[0] = Printnl([t[2]], None)

def p_plus_expression(t):
    'expression : expression PLUS expression'
    t[0] = Add((t[1], t[3]))

def p_int_expression(t):
    'expression : INT'
    t[0] = Const(t[1])

def p_error(t):
    print "Syntax error at '%s'" % t.value

import ply.yacc as yacc
yacc.yacc()

###############################################################################
# Main

import sys
try:
    f = open(sys.argv[1])
    p = yacc.parse(f.read())
    print p
except IOError:   # raised when the file cannot be opened
    print "Could not open file %s." % sys.argv[1]

Figure 5: Parser with precedence declarations to resolve conflicts.