mirror of
https://github.com/fluencelabs/lalrpop
synced 2025-03-31 07:21:04 +00:00
remove excess text
parent
e2ca68cdea
commit
8842bed1ab
120
doc/tutorial.md
@@ -1088,126 +1088,6 @@ fn calculator6() {
}
```

<a id="calculator7"></a>
### calculator7: `match` declarations and controlling the lexer/tokenizer

This section will describe how LALRPOP allows you to control the
generated lexer, which is the first phase of parsing, where we break
up the raw text by applying regular expressions. If you're already
familiar with what a lexer is, you may want to skip past the
introductory section. If you're more interested in using LALRPOP with
your own hand-written lexer (or you want to consume something other
than `&str` inputs), you should check out the
[tutorial on custom lexers][lexer tutorial].

#### What is a lexer?

Throughout all the examples we've seen, we've drawn a distinction
between two kinds of symbols:

- Terminals, written (thus far) as string literals or regular expressions,
  like `"("` or `r"\d+"`.
- Nonterminals, written without quotes, that are defined in terms of productions,
  like:

```
ExprOp: Opcode = { // (3)
    "+" => Opcode::Add,
    "-" => Opcode::Sub,
};
```

When LALRPOP generates a parser, it always works in a two-phase
process. The first phase is called the **lexer** or **tokenizer**. It
has the job of figuring out the sequence of **terminals**: it
analyzes the raw characters of your text and breaks them
into a series of terminals. It does this without having any idea about
your grammar or where you are in your grammar. Next, the parser proper
is a bit of code that looks at this stream of tokens and figures out
which nonterminals apply:

            +-------------------+    +----------------------+
    Text -> | Lexer             | -> | Parser               |
            |                   |    |                      |
            | Applies regex to  |    | Consumes terminals,  |
            | produce terminals |    | executes your code   |
            +-------------------+    | as it recognizes     |
                                     | nonterminals         |
                                     +----------------------+

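The two-phase split can be sketched in plain Rust. This is only an illustration under stated assumptions, not the code LALRPOP generates: the `Token` type and the `lex`, `parse_term`, and `parse` functions are hypothetical names, and the parser phase is a hand-written recursive descent over a grammar of parenthesized numbers rather than an LR parser.

```rust
/// Hypothetical token type: one variant per terminal.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    LParen,
    RParen,
    Num(i32),
}

/// Phase 1 (lexer): break raw text into terminals.
/// It knows nothing about the grammar or the parser's current state.
pub fn lex(input: &str) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
        } else if c == '(' {
            chars.next();
            tokens.push(Token::LParen);
        } else if c == ')' {
            chars.next();
            tokens.push(Token::RParen);
        } else if c.is_ascii_digit() {
            // Collect a maximal run of digits into one Num terminal.
            let mut digits = String::new();
            while let Some(&d) = chars.peek() {
                if !d.is_ascii_digit() {
                    break;
                }
                digits.push(d);
                chars.next();
            }
            let n = digits.parse::<i32>().map_err(|e| e.to_string())?;
            tokens.push(Token::Num(n));
        } else {
            return Err(format!("unexpected character {:?}", c));
        }
    }
    Ok(tokens)
}

/// Phase 2 (parser): consume terminals and recognize the nonterminal
/// `Term = Num | "(" Term ")"`. Returns the value and the unconsumed tokens.
pub fn parse_term(tokens: &[Token]) -> Result<(i32, &[Token]), String> {
    match tokens.split_first() {
        Some((Token::Num(n), rest)) => Ok((*n, rest)),
        Some((Token::LParen, rest)) => {
            let (value, rest) = parse_term(rest)?;
            match rest.split_first() {
                Some((Token::RParen, rest)) => Ok((value, rest)),
                _ => Err("expected `)`".to_string()),
            }
        }
        _ => Err("expected a number or `(`".to_string()),
    }
}

/// Run both phases and require that every token is consumed.
pub fn parse(input: &str) -> Result<i32, String> {
    let tokens = lex(input)?;
    let (value, rest) = parse_term(&tokens)?;
    if rest.is_empty() {
        Ok(value)
    } else {
        Err("unexpected trailing tokens".to_string())
    }
}
```

With these definitions, `parse("((22))")` would lex to five tokens and then evaluate to `Ok(22)`, while `parse("(22")` fails in the second phase because the closing parenthesis never arrives.
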
#### How do lexers work in LALRPOP?

LALRPOP's default lexer is based on regular expressions. By default,
it works by extracting all the terminals (e.g., `"("` or `r"\d+"`)
from your grammar and compiling them into one big list. At runtime, it
will walk over the string and, at each point, find the longest match
among the literals and regular expressions in your grammar and produce
the corresponding terminal. As an example, let's go back to the
[very first grammar we saw][calculator1], which just parsed
parenthesized numbers:

```
pub Term: i32 = {
    <n:Num> => n,
    "(" <t:Term> ")" => t,
};

Num: i32 = <s:r"[0-9]+"> => i32::from_str(s).unwrap();
```

This grammar in fact contains three terminals:

- `"("` -- a string literal, which must match exactly
- `")"` -- a string literal, which must match exactly
- `r"[0-9]+"` -- a regular expression

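The longest-match rule over these three terminals can be sketched as follows. This is a simplified stand-in, not LALRPOP's generated lexer: the `Terminal` enum and the `match_len` and `tokenize` functions are hypothetical, the regular expression is replaced by a hand-written digit scan so the example needs only the standard library, and whitespace skipping is an assumption of the sketch.

```rust
/// Hypothetical terminal kinds, one per terminal in the grammar.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Terminal {
    LParen, // "("
    RParen, // ")"
    Digits, // r"[0-9]+"
}

/// Length in bytes of the match of `terminal` at the start of `text`, if any.
/// Each arm stands in for a compiled literal or regular expression.
fn match_len(terminal: Terminal, text: &str) -> Option<usize> {
    let len = match terminal {
        Terminal::LParen => if text.starts_with('(') { 1 } else { 0 },
        Terminal::RParen => if text.starts_with(')') { 1 } else { 0 },
        Terminal::Digits => text.bytes().take_while(|b| b.is_ascii_digit()).count(),
    };
    if len > 0 { Some(len) } else { None }
}

/// Walk over `input`; at each position try every terminal and keep the longest match.
pub fn tokenize(input: &str) -> Result<Vec<(Terminal, &str)>, String> {
    const ALL: [Terminal; 3] = [Terminal::LParen, Terminal::RParen, Terminal::Digits];
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let rest = &input[pos..];
        // Skip whitespace between terminals (an assumption of this sketch).
        if let Some(c) = rest.chars().next().filter(|c| c.is_whitespace()) {
            pos += c.len_utf8();
            continue;
        }
        // Try every terminal at this position and pick the longest match.
        let best = ALL
            .iter()
            .filter_map(|&t| match_len(t, rest).map(|len| (t, len)))
            .max_by_key(|&(_, len)| len);
        match best {
            Some((t, len)) => {
                tokens.push((t, &rest[..len]));
                pos += len;
            }
            None => return Err(format!("no terminal matches at byte offset {}", pos)),
        }
    }
    Ok(tokens)
}
```

For the input `(22)`, `tokenize` would yield `LParen`, then `Digits` covering `22` as a single terminal (the longest match, rather than two one-digit matches), then `RParen`.
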
When

To help keep things concrete, let's use a simple calculator grammar.
We'll start with the grammar we saw before in
[the section on building the AST](#calculator4). It looks roughly like this
(excluding the actions for now):

```
Expr: Box<Expr> = {
    Expr ExprOp Factor,
    Factor,
};

ExprOp: Opcode = {
    "+",
    "-",
};

Factor: Box<Expr> = {
    Factor FactorOp Term,
    Term,
};

FactorOp: Opcode = {
    "*" => Opcode::Mul,
    "/" => Opcode::Div,
};

Term: Box<Expr> = {
    Num => Box::new(Expr::Number(<>)),
    FnCode Term => Box::new(Expr::Fn(<>)),
    "(" <Expr> ")"
};

FnCode: FnCode = {
    "SIN" => FnCode::Sin,
    "COS" => FnCode::Cos,
    ID => FnCode::Other(<>.to_string()),
};

Num: i32 = {
    r"[0-9]+" => i32::from_str(<>).unwrap()
};
```

If you've used other parser generators before -- particularly ones
that are not PEG or parser combinators -- you may have noticed that
LALRPOP doesn't require you to specify a **tokenizer**.

[main]: ./calculator/src/main.rs
[calculator]: ./calculator/