<a id="calculator7"></a>
### calculator7: `match` declarations and controlling the lexer/tokenizer
This section will describe how LALRPOP allows you to control the
generated lexer, which is the first phase of parsing, where we break
up the raw text by applying regular expressions. If you're already
familiar with what a lexer is, you may want to skip past the
introductory section. If you're more interested in using LALRPOP with
your own hand-written lexer (or you want to consume something other
than `&str` inputs), you should check out the
[tutorial on custom lexers][lexer tutorial].
#### What is a lexer?
Throughout all the examples we've seen, we've drawn a distinction
between two kinds of symbols:
- Terminals, written (thus far) as string literals or regular expressions,
like `"("` or `r"\d+"`.
- Nonterminals, written without quotes, that are defined in terms of productions,
like:
```
ExprOp: Opcode = { // (3)
    "+" => Opcode::Add,
    "-" => Opcode::Sub,
};
```
When LALRPOP generates a parser, it always works in a two-phase
process. The first phase is called the **lexer** or **tokenizer**. It
has the job of figuring out the sequence of **terminals**: so
basically it analyzes the raw characters of your text and breaks them
into a series of terminals. It does this without having any idea about
your grammar or where you are in your grammar. Next, the parser proper
is a bit of code that looks at this stream of tokens and figures out
which nonterminals apply:
```
        +-------------------+      +----------------------+
Text -> |       Lexer       |  ->  |        Parser        |
        |                   |      |                      |
        | Applies regex to  |      | Consumes terminals,  |
        | produce terminals |      | executes your code   |
        +-------------------+      | as it recognizes     |
                                   | nonterminals         |
                                   +----------------------+
```
#### How do lexers work in LALRPOP?
LALRPOP's default lexer is based on regular expressions. By default,
it works by extracting all the terminals (e.g., `"("` or `r"\d+"`)
from your grammar and compiling them into one big list. At runtime, it
will walk over the string and, at each point, find the longest match
from the literals and regular expressions in your grammar and produce
one of those. As an example, let's go back to the
[very first grammar we saw][calculator1], which just parsed
parenthesized numbers:
```
pub Term: i32 = {
    <n:Num> => n,
    "(" <t:Term> ")" => t,
};

Num: i32 = <s:r"[0-9]+"> => i32::from_str(s).unwrap();
```
This grammar in fact contains three terminals:
- `"("` -- a string literal, which must match exactly
- `")"` -- a string literal, which must match exactly
- `r"[0-9]+"` -- a regular expression
When the generated lexer runs, it simply scans the input and, at each
point, emits whichever of these terminals matches the longest prefix of
the remaining text.
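
Here is a rough sketch in plain Rust of what such a longest-match lexer
over those three terminals does. The `tokenize` helper and its use of the
`regex` crate are purely illustrative assumptions; this is not the code
LALRPOP actually generates.

```rust
use regex::Regex;

/// Illustrative only: a longest-match lexer over the three terminals
/// above. This is NOT the code LALRPOP generates, just the idea.
fn tokenize(mut input: &str) -> Vec<String> {
    // The terminals extracted from the grammar, anchored at the start
    // of the remaining input.
    let terminals = [
        Regex::new(r"^\(").unwrap(),
        Regex::new(r"^\)").unwrap(),
        Regex::new(r"^[0-9]+").unwrap(),
    ];
    let mut tokens = Vec::new();
    loop {
        // Skip whitespace between tokens.
        input = input.trim_start();
        if input.is_empty() {
            return tokens;
        }
        // "Longest match wins": try every terminal at the current
        // position and keep whichever one matched the most text.
        let longest = terminals
            .iter()
            .filter_map(|re| re.find(input))
            .max_by_key(|m| m.end())
            .expect("no terminal matches here: the lexer reports an error");
        tokens.push(longest.as_str().to_string());
        input = &input[longest.end()..];
    }
}

fn main() {
    // "(123)" is broken into the terminals "(", "123" (via r"[0-9]+"), ")".
    assert_eq!(tokenize("(123)"), vec!["(", "123", ")"]);
    println!("{:?}", tokenize("((22))"));
}
```

The parser phase then consumes that stream of terminals and recognizes
the `Num` and `Term` nonterminals.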
To help keep things concrete, let's use a simple calculator grammar.
We'll start from the grammar we saw before in
[the section on building the AST](#calculator4), extended with a simple
notion of function calls like `SIN(3)`. It looks roughly like this
(with some of the action code elided):
```
Expr: Box<Expr> = {
    Expr ExprOp Factor,
    Factor,
};

ExprOp: Opcode = {
    "+" => Opcode::Add,
    "-" => Opcode::Sub,
};

Factor: Box<Expr> = {
    Factor FactorOp Term,
    Term,
};

FactorOp: Opcode = {
    "*" => Opcode::Mul,
    "/" => Opcode::Div,
};

Term: Box<Expr> = {
    Num => Box::new(Expr::Number(<>)),
    FnCode Term => Box::new(Expr::Fn(<>)),
    "(" <Expr> ")",
};

FnCode: FnCode = {
    "SIN" => FnCode::Sin,
    "COS" => FnCode::Cos,
    ID => FnCode::Other(<>.to_string()),
};

Num: i32 = {
    r"[0-9]+" => i32::from_str(<>).unwrap()
};
```
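
One detail worth calling out: the grammar uses an `ID` terminal that is
never defined above, and its obvious definition (an identifier regular
expression) would overlap with the keyword literals `"SIN"` and `"COS"`.
The `match` declaration named in this section's title is the tool for
this: it lets you introduce named terminals like `ID` and state explicitly
which terminals take priority when they match the same text. As a rough
sketch (the identifier regex used for `ID` here is an assumption, not
something the grammar above pins down):

```
match {
    // Highest priority: these literals win over the group below,
    // so the input `SIN` lexes as the keyword "SIN"...
    "SIN",
    "COS",
} else {
    // ...while other words such as `TAN` fall through to `ID`.
    // `_` pulls in every remaining terminal used in the grammar.
    r"[a-zA-Z_][a-zA-Z0-9_]*" => ID,
    _
}
```

With this in place, `SIN` lexes as the keyword `"SIN"` rather than as an
`ID`, while a name like `TAN` still becomes an `ID`.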
If you've used other parser generators before -- particularly ones
that are not PEG or parser combinators -- you may have noticed that
LALRPOP doesn't require you to specify a **tokenizer**.
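
For example, here is a hedged sketch of how the finished grammar might be
exercised, assuming it is completed as above, saved as
`calculator7.lalrpop` with `Expr` marked `pub`, and built by a `build.rs`
that invokes LALRPOP as in the earlier calculator examples:

```rust
// Assumption: LALRPOP's build step synthesizes this module from
// `calculator7.lalrpop`, and the grammar declares `pub Expr`.
pub mod calculator7;

#[test]
fn calculator7_smoke_test() {
    let parser = calculator7::ExprParser::new();
    // The raw string goes straight to `parse`; the generated lexer
    // splits it into terminals ("SIN", "(", "3", ")", "+", ...) first.
    assert!(parser.parse("SIN(3) + 4 * 2").is_ok());
    assert!(parser.parse("22 / (1 + 1)").is_ok());
}
```

There is no tokenizer to configure anywhere: the generated `ExprParser`
accepts the raw `&str` and runs the regex-based lexer internally.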