mirror of
https://github.com/fluencelabs/lalrpop
synced 2025-03-16 17:00:53 +00:00
remove excess text
parent e2ca68cdea
commit 8842bed1ab
120
doc/tutorial.md
@ -1088,126 +1088,6 @@ fn calculator6() {
}
```

<a id="calculator7"></a>
### calculator7: `match` declarations and controlling the lexer/tokenizer

This section describes how LALRPOP lets you control the generated
lexer, which is the first phase of parsing, where we break up the raw
text by applying regular expressions. If you're already familiar with
what a lexer is, you may want to skip past the introductory
section. If you're more interested in using LALRPOP with your own
hand-written lexer (or you want to consume something other than `&str`
inputs), you should check out the
[tutorial on custom lexers][lexer tutorial].

#### What is a lexer?

Throughout all the examples we've seen, we've drawn a distinction
between two kinds of symbols:

- Terminals, written (thus far) as string literals or regular expressions,
  like `"("` or `r"\d+"`.
- Nonterminals, written without quotes, that are defined in terms of productions,
  like:

```
ExprOp: Opcode = {
    "+" => Opcode::Add,
    "-" => Opcode::Sub,
};
```

When LALRPOP generates a parser, it always works in a two-phase
process. The first phase is called the **lexer** or **tokenizer**. Its
job is to figure out the sequence of **terminals**: it analyzes the
raw characters of your text and breaks them into a series of
terminals, without any knowledge of your grammar or of where you are
in it. The second phase, the parser proper, is a bit of code that
looks at this stream of tokens and figures out which nonterminals
apply:

            +-------------------+    +----------------------+
    Text -> | Lexer             | -> | Parser               |
            |                   |    |                      |
            | Applies regex to  |    | Consumes terminals,  |
            | produce terminals |    | executes your code   |
            +-------------------+    | as it recognizes     |
                                     | nonterminals         |
                                     +----------------------+
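
To make the two phases concrete, here is a minimal hand-written sketch in plain Rust. It is not what LALRPOP generates: the `Token` type and the `lex` and `parse` functions are hypothetical names invented for this illustration, and the tiny "grammar" (numbers joined by `+` and `-`) is hard-coded. The point is only that `lex` never looks at the grammar, while `parse` never looks at raw characters.

```rust
// Hypothetical token type for this sketch (not LALRPOP's internal one).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    Num(i32),
    Plus,
    Minus,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Opcode {
    Add,
    Sub,
}

// Phase 1: break raw characters into terminals, with no grammar knowledge.
fn lex(input: &str) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            c if c.is_whitespace() => {
                chars.next();
            }
            '0'..='9' => {
                // Consume the whole run of digits (overflow is not handled here).
                let mut n: i32 = 0;
                while let Some(d) = chars.peek().and_then(|c| c.to_digit(10)) {
                    n = n * 10 + d as i32;
                    chars.next();
                }
                tokens.push(Token::Num(n));
            }
            '+' => {
                chars.next();
                tokens.push(Token::Plus);
            }
            '-' => {
                chars.next();
                tokens.push(Token::Minus);
            }
            other => return Err(format!("unexpected character {:?}", other)),
        }
    }
    Ok(tokens)
}

// Phase 2: consume terminals and run an action for each recognized rule.
fn parse(tokens: &[Token]) -> Result<i32, String> {
    let mut iter = tokens.iter();
    let mut acc = match iter.next() {
        Some(Token::Num(n)) => *n,
        other => return Err(format!("expected a number, found {:?}", other)),
    };
    while let Some(tok) = iter.next() {
        let op = match tok {
            Token::Plus => Opcode::Add,
            Token::Minus => Opcode::Sub,
            other => return Err(format!("expected an operator, found {:?}", other)),
        };
        let rhs = match iter.next() {
            Some(Token::Num(n)) => *n,
            other => return Err(format!("expected a number, found {:?}", other)),
        };
        acc = match op {
            Opcode::Add => acc + rhs,
            Opcode::Sub => acc - rhs,
        };
    }
    Ok(acc)
}

fn main() {
    let tokens = lex("1 + 22 - 3").unwrap();
    println!("{:?}", tokens);
    println!("{:?}", parse(&tokens));
}
```

Because the two functions only communicate through the token list, either one could be swapped out independently, which is the property the rest of this section relies on.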

#### How do lexers work in LALRPOP?

LALRPOP's default lexer is based on regular expressions. It works by
extracting all the terminals (e.g., `"("` or `r"\d+"`) from your
grammar and compiling them into one big list. At runtime, it walks
over the string and, at each point, finds the longest match among the
literals and regular expressions in your grammar and produces the
corresponding terminal. As an example, let's go back to the
[very first grammar we saw][calculator1], which just parsed
parenthesized numbers:

```
pub Term: i32 = {
    <n:Num> => n,
    "(" <t:Term> ")" => t,
};

Num: i32 = <s:r"[0-9]+"> => i32::from_str(s).unwrap();
```

This grammar in fact contains three terminals:

- `"("` -- a string literal, which must match exactly
- `")"` -- a string literal, which must match exactly
- `r"[0-9]+"` -- a regular expression
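
The longest-match rule for these three terminals can be sketched in plain Rust as follows. This is an illustration under assumptions, not LALRPOP's actual implementation: the `Terminal` enum, the `Matcher` type, and the `tokenize` function are hypothetical names, the `digits` function is a hand-written stand-in for the regular expression `r"[0-9]+"`, and whitespace between tokens is simply skipped.

```rust
// Hypothetical names for the three terminals of the grammar above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Terminal {
    LParen,
    RParen,
    Num,
}

// A matcher reports how many bytes it matches at the start of the input.
type Matcher = fn(&str) -> Option<usize>;

fn lparen(s: &str) -> Option<usize> {
    s.starts_with('(').then_some(1)
}

fn rparen(s: &str) -> Option<usize> {
    s.starts_with(')').then_some(1)
}

// Stand-in for r"[0-9]+": the length of the leading run of ASCII digits.
fn digits(s: &str) -> Option<usize> {
    let n = s.bytes().take_while(|b| b.is_ascii_digit()).count();
    (n > 0).then_some(n)
}

fn tokenize(input: &str) -> Result<Vec<(Terminal, &str)>, String> {
    // The "one big list" of terminals extracted from the grammar.
    let terminals: [(Terminal, Matcher); 3] = [
        (Terminal::LParen, lparen as Matcher),
        (Terminal::RParen, rparen as Matcher),
        (Terminal::Num, digits as Matcher),
    ];

    let mut out = Vec::new();
    let mut rest = input.trim_start();
    while !rest.is_empty() {
        // Try every terminal at this position and keep the longest match.
        let best = terminals
            .iter()
            .filter_map(|(term, m)| m(rest).map(|len| (len, *term)))
            .max_by_key(|(len, _)| *len);
        match best {
            Some((len, term)) => {
                out.push((term, &rest[..len]));
                rest = rest[len..].trim_start();
            }
            None => return Err(format!("no terminal matches at {:?}", rest)),
        }
    }
    Ok(out)
}

fn main() {
    println!("{:?}", tokenize("((123))"));
}
```

Note that `123` comes out as a single `Num` token rather than three one-digit tokens, because the digit matcher reports the longest run it can consume at that position.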

When

To help keep things concrete, let's use a simple calculator grammar.
We'll start with the grammar we saw before in
[the section on building the AST](#calculator4). It looks roughly like this
(with most of the actions omitted for now):

```
Expr: Box<Expr> = {
    Expr ExprOp Factor,
    Factor,
};

ExprOp: Opcode = {
    "+",
    "-",
};

Factor: Box<Expr> = {
    Factor FactorOp Term,
    Term,
};

FactorOp: Opcode = {
    "*" => Opcode::Mul,
    "/" => Opcode::Div,
};

Term: Box<Expr> = {
    Num => Box::new(Expr::Number(<>)),
    FnCode Term => Box::new(Expr::Fn(<>)),
    "(" <Expr> ")"
};

FnCode: FnCode = {
    "SIN" => FnCode::Sin,
    "COS" => FnCode::Cos,
    ID => FnCode::Other(<>.to_string()),
};

Num: i32 = {
    r"[0-9]+" => i32::from_str(<>).unwrap()
};
```

If you've used other parser generators before -- particularly ones
that are not PEG or parser combinators -- you may have noticed that
LALRPOP doesn't require you to specify a **tokenizer**.

[main]: ./calculator/src/main.rs
[calculator]: ./calculator/