Merge pull request #214 from nikomatsakis/match-tutorial

extend the tutorial to explain lexing vs parsing
Niko Matsakis 2017-04-05 09:25:20 -04:00 committed by GitHub
commit 3f0bb90222
5 changed files with 1606 additions and 6 deletions

.gitignore

@@ -6,6 +6,7 @@ lalrpop/src/parser/lrgrammar.rs
doc/calculator/src/calculator1.rs
doc/calculator/src/calculator2.rs
doc/calculator/src/calculator2b.rs
doc/calculator/src/calculator3.rs
doc/calculator/src/calculator4.rs
doc/calculator/src/calculator5.rs

doc/calculator/src/calculator2b.lalrpop

@@ -0,0 +1,38 @@
grammar;

pub Term = {
    Num,
    "(" <Term> ")",
    "22" => format!("Twenty-two!"),
    ID => format!("Id({})", <>),
};

Num: String = r"[0-9]+" => <>.to_string();

// `match`: Declares the precedence of regular expressions
// relative to one another when synthesizing
// the lexer
match {
    // These items have highest precedence.

    r"[0-9]+",

    // Within a match level, fixed strings like `"22"` get
    // precedence over regular expressions.
    "22"
} else {
    // These items have next highest precedence.

    // Given an input like `123`, the number regex above
    // will match; but otherwise, given something like
    // `123foo` or `foo123`, this will match.
    //
    // Here, we also renamed the regex to the name `ID`, which we can
    // use in the grammar itself.
    r"\w+" => ID,

    // This `_` means "add in all the other strings and
    // regular expressions in the grammar here" (e.g.,
    // `"("`).
    _
} // you can have more `else` sections if you like

doc/calculator/src/main.rs

@@ -20,6 +20,16 @@ fn calculator2() {
    assert!(calculator2::parse_Term("((22)").is_err());
}

pub mod calculator2b;

#[test]
fn calculator2b() {
    assert_eq!(calculator2b::parse_Term("33").unwrap(), "33");
    assert_eq!(calculator2b::parse_Term("foo33").unwrap(), "Id(foo33)");
    assert_eq!(calculator2b::parse_Term("(22)").unwrap(), "Twenty-two!");
    assert_eq!(calculator2b::parse_Term("(222)").unwrap(), "222");
}

pub mod calculator3;

macro_rules! test3 {
@@ -74,7 +84,7 @@ fn calculator6() {
               "[((22 * 44) + 66), (error * 3)]");
    assert_eq!(&format!("{:?}", calculator6::parse_Exprs(&mut errors, "*").unwrap()),
               "[(error * error)]");
    assert_eq!(errors.len(), 4);
}

doc/tutorial.md

@@ -27,6 +27,8 @@ want to skip ahead, or just look at the LALRPOP sources:
  ([source][calculator1], [read](#calculator1))
- calculator2: LALRPOP shorthands and type inference
  ([source][calculator2], [read](#calculator2))
- calculator2b: Controlling the lexer with `match` declarations
  ([source][calculator2b], [read](#calculator2b))
- calculator3: Handling full-featured expressions like `22+44*66`
  ([source][calculator3], [read](#calculator3))
- calculator4: Building ASTs
@@ -35,6 +37,8 @@ want to skip ahead, or just look at the LALRPOP sources:
  ([source][calculator5], [read](#calculator5))
- calculator6: Error recovery
  ([source][calculator6], [read](#calculator6))
- calculator7: `match` declarations and controlling the lexer/tokenizer
  ([source][calculator7], [read](#calculator7))
This tutorial is still incomplete. Here are some topics that I aim to
cover when I get time to write about them:
@@ -169,11 +173,11 @@ case, just LALRPOP.
The `[dependencies]` section describes the dependencies that LALRPOP
needs at runtime. All LALRPOP parsers require at least the
`lalrpop-util` crate. In addition, if you don't want to write the
lexer by hand, you need to add a dependency on the regex crate. (If
you don't know what a lexer is, don't worry, it's not important just
now, though we will cover it in [section 7][calculator7]; if you *do*
know what a lexer is, and you want to know how to write a lexer by
hand and use it with LALRPOP, then check out the [lexer tutorial].)
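
For concreteness, here is a minimal sketch of such a `[dependencies]`
section (the version requirements are illustrative placeholders, not
versions pinned by this tutorial):

```toml
[dependencies]
# Required at runtime by every LALRPOP-generated parser.
lalrpop-util = "*"  # placeholder; pin a real version in your project
# Only needed if you use the built-in regex-based lexer.
regex = "*"         # placeholder; pin a real version in your project
```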
[lexer tutorial]: lexer_tutorial.md
@@ -408,6 +412,317 @@ give you the idea:
| `<A> B => bar(<>)` | `<a:A> B => bar(a)` |
| `<A> <B> => bar(<>)` | `<a:A> <b:B> => bar(a, b)` |
<a id="calculator2b"></a>
### calculator2b: Controlling the lexer with `match` declarations
This example digs a bit deeper into how LALRPOP works. In particular,
it explains the meaning of those strings and regular expressions that
we used in the previous tutorial, and how they are used to process the
input string (a process which you can control). This first step of
breaking up the input using regular expressions is often called
**lexing** or **tokenizing**.
If you're comfortable with the idea of a lexer or tokenizer, you may
wish to skip ahead to the [calculator3](#calculator3) example, which covers
parsing bigger expressions, and come back here only when you find you
want more control. You may also be interested in the
[tutorial on writing a custom lexer][lexer tutorial].
#### Terminals vs nonterminals
You may have noticed that our grammar included two distinct kinds of
symbols. There were the nonterminals, `Term` and `Num`, which we
defined by specifying a series of symbols that they must match, along
with some action code that should execute once they have matched:
```rust
Num: i32 = r"[0-9]+" => i32::from_str(<>).unwrap();
//~~ ~~~   ~~~~~~~~~    ~~~~~~~~~~~~~~~~~~~~~~~~~~
// |  |        |        Action code
// |  |        Symbol(s) that should match
// |  Return type
// Name of nonterminal
```
But there are also **terminals**, which consist of the string literals
and regular expressions sprinkled throughout the grammar. (Terminals
are also often called **tokens**, and I will use the terms
interchangeably.)
This distinction between terminals and nonterminals is very important
to how LALRPOP works. In fact, when LALRPOP generates a parser, it
always works in a two-phase process. The first phase is called the
**lexer** or **tokenizer**. It has the job of figuring out the
sequence of **terminals**: so basically it analyzes the raw characters
of your text and breaks them into a series of terminals. It does this
without having any idea about your grammar or where you are in your
grammar. Next, the parser proper is a bit of code that looks at this
stream of tokens and figures out which nonterminals apply:
```
        +-------------------+      +----------------------+
Text -> | Lexer             |  ->  | Parser               |
        |                   |      |                      |
        | Applies regex to  |      | Consumes terminals,  |
        | produce terminals |      | executes your code   |
        +-------------------+      | as it recognizes     |
                                   | nonterminals         |
                                   +----------------------+
```
LALRPOP's default lexer is based on regular expressions. By default,
it works by extracting all the terminals (e.g., `"("` or `r"\d+"`)
from your grammar and compiling them into one big list. At runtime, it
will walk over the string and, at each point, find the longest match
from the literals and regular expressions in your grammar and produce
the corresponding terminal. As an example, let's look again at our
example grammar:
```
pub Term: i32 = {
    <n:Num> => n,
    "(" <t:Term> ")" => t,
};

Num: i32 = <s:r"[0-9]+"> => i32::from_str(s).unwrap();
```
This grammar in fact contains three terminals:
- `"("` -- a string literal, which must match exactly
- `")"` -- a string literal, which must match exactly
- `r"[0-9]+"` -- a regular expression
When we generate a lexer, it is effectively going to be checking for
each of these three terminals in a loop, sort of like this pseudocode:
```
let mut i = 0; // index into string
loop {
    skip whitespace; // we do this implicitly, at least by default
    if (data at index i is "(") { produce "("; }
    else if (data at index i is ")") { produce ")"; }
    else if (data at index i matches regex "[0-9]+") { produce r"[0-9]+"; }
}
```
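
To make the longest-match idea concrete, here is a rough,
self-contained Rust sketch written directly against the `regex` crate.
It is an illustration of the algorithm only (the function name and
structure are invented for this example); it is not the code LALRPOP
actually generates:

```rust
use regex::Regex;

/// A toy longest-match lexer over the three terminals above.
/// Illustrative only: LALRPOP's generated lexer is more elaborate.
fn tokenize(mut input: &str) -> Vec<String> {
    // Anchored patterns, one per terminal; `^` forces each regex
    // to match only at the current position in the input.
    let patterns = [r"^\(", r"^\)", r"^[0-9]+"];
    let regexes: Vec<Regex> = patterns
        .iter()
        .map(|p| Regex::new(p).unwrap())
        .collect();

    let mut tokens = Vec::new();
    loop {
        input = input.trim_start(); // skip whitespace, as LALRPOP does by default
        if input.is_empty() {
            return tokens;
        }
        // Try every terminal at this position and keep the longest match.
        let longest = regexes
            .iter()
            .filter_map(|re| re.find(input))
            .max_by_key(|m| m.end())
            .expect("no terminal matches here: a lexer error");
        tokens.push(longest.as_str().to_string());
        input = &input[longest.end()..];
    }
}
```

With this sketch, `tokenize("( 22 44 ) )")` yields the five terminals
`(`, `22`, `44`, `)`, `)`, which is exactly the token stream discussed
next.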
Note that this has nothing to do with your grammar. For example, the tokenizer
would happily tokenize a string like this one, which doesn't fit our grammar:
```
( 22 44 ) )
^ ^^ ^^ ^ ^
| |  |  | ")" terminal
| |  |  |
| |  |  ")" terminal
| +--+
| |
| 2 r"[0-9]+" terminals
|
"(" terminal
```
When these tokens are fed into the **parser**, it would notice that we
have one left paren but then two numbers (`r"[0-9]+"` terminals), and
hence report an error.
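
To see this concretely, a test along the following lines (the test
name is invented for illustration; `parse_Term` is the same generated
entry point used throughout `main.rs`) would confirm that the input
lexes fine but fails to parse:

```rust
#[test]
fn lexes_but_does_not_parse() {
    // All five terminals tokenize successfully, but the resulting
    // token stream matches no production of `Term`, so parsing fails.
    assert!(calculator1::parse_Term("( 22 44 ) )").is_err());
}
```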
#### Precedence of fixed strings
Terminals in LALRPOP can be specified (by default) in two ways: as a
fixed string (like `"("`) or as a regular expression (like
`r"[0-9]+"`). There is actually an important difference: if, at some
point in the input, both a fixed string **and** a regular expression
could match, LALRPOP gives the fixed string precedence. To demonstrate
this, let's modify our parser. If you recall, the current parser
parses parenthesized numbers, producing an `i32`. We're going to modify
it to produce a **string**, and we'll add an "easter egg" so that `22`
(or `(22)`, `((22))`, etc.) produces the string `"Twenty-two!"`:
```
pub Term = {
    Num,
    "(" <Term> ")",
    "22" => format!("Twenty-two!"),
};

Num: String = r"[0-9]+" => <>.to_string();
```
If we write some simple unit tests, we can see that in fact an input
of `22` has matched the string literal. Interestingly, the input `222`
matches the regular expression instead; this is because LALRPOP
prefers to find the **longest** match first. After that, if there are
two matches of equal length, it prefers the fixed string:
```rust
#[test]
fn calculator2b() {
    assert_eq!(calculator2b::parse_Term("33").unwrap(), "33");
    assert_eq!(calculator2b::parse_Term("(22)").unwrap(), "Twenty-two!");
    assert_eq!(calculator2b::parse_Term("(222)").unwrap(), "222");
}
```
#### Ambiguities between regular expressions
In the previous section, we saw that fixed strings have precedence
over regular expressions. But what if we have two regular expressions
that can match the same input? Which one wins? For example, consider
this variant of the grammar above, where we also try to support
parenthesized **identifiers** like `((foo22))`:
```
pub Term = {
    Num,
    "(" <Term> ")",
    "22" => format!("Twenty-two!"),
    r"\w+" => format!("Id({})", <>), // <-- we added this
};

Num: String = r"[0-9]+" => <>.to_string();
```
Here I've written the regular expression `r"\w+"`. However, if you check
out the [docs for regex](https://docs.rs/regex), you'll see that `\w`
is defined to match alphabetic characters but also digits. So there
is actually an ambiguity here: if we have something like `123`, it
could be considered to match either `r"[0-9]+"` **or** `r"\w+"`. If
you try this grammar, you'll find that LALRPOP helpfully reports an
error:
```
error: ambiguity detected between the terminal `r#"\w+"#` and the terminal `r#"[0-9]+"#`
r"\w+" => <>.to_string(),
~~~~~~
```
There are various ways to fix this. We might try adjusting our regular
expression so that the first character cannot be a number, so perhaps
something like `r"[[:alpha:]]\w*"`. This will work, but it actually
matches something different than what we had before (e.g., `123foo`
will not be considered to match, for better or worse). And anyway it's
not always convenient to make your regular expressions completely
disjoint like that. Another option is to use a `match` declaration,
which lets you control the precedence between regular expressions.
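
For reference, that first approach would look something like the
following sketch (note that, as just discussed, this changes which
inputs count as identifiers):

```
pub Term = {
    Num,
    "(" <Term> ")",
    "22" => format!("Twenty-two!"),
    r"[[:alpha:]]\w*" => format!("Id({})", <>), // first char must not be a digit
};

Num: String = r"[0-9]+" => <>.to_string();
```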
#### Simple `match` declarations
A `match` declaration lets you explicitly give the precedence between
terminals. In its simplest form, it consists of just ordering regular
expressions and string literals into groups, with the higher
precedence items coming first. So, for example, we could resolve
our conflict above by giving `r"[0-9]+"` **precedence** over `r"\w+"`,
thus saying that if something can be lexed as a number, we'll do that,
and otherwise consider it to be an identifier.
```
match {
    r"[0-9]+"
} else {
    r"\w+",
    _
}
```
Here the match contains two levels; each level can have more than one
item in it. The top level contains only `r"[0-9]+"`, which means that this
regular expression is given highest priority. The next level contains
`r"\w+"`, so that will match afterwards.
The final `_` indicates that other string literals and regular
expressions that appear elsewhere in the grammar (e.g., `"("` or
`"22"`) should be added into that final level of precedence (without
an `_`, it is illegal to use a terminal that does not appear in the
match declaration).
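
To make that concrete, a declaration like the following sketch would
cause LALRPOP to reject our grammar, because `"("`, `")"`, and `"22"`
are used in the grammar but listed nowhere in the `match`:

```
match {
    r"[0-9]+"
} else {
    r"\w+"  // no `_` here, so *only* these two terminals may be used
}
```

Our actual declaration above keeps the `_`, so all the other terminals
remain available.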
If we add this `match` section into our example, we'll find that it
compiles, but it doesn't work exactly like we wanted. Let's update our
unit test a bit to include some identifier examples:
```rust
#[test]
fn calculator2b() {
    // These will all work:
    assert_eq!(calculator2b::parse_Term("33").unwrap(), "33");
    assert_eq!(calculator2b::parse_Term("foo33").unwrap(), "Id(foo33)");
    assert_eq!(calculator2b::parse_Term("(foo33)").unwrap(), "Id(foo33)");

    // This line will fail:
    assert_eq!(calculator2b::parse_Term("(22)").unwrap(), "Twenty-two!");
}
```
The problem comes about when we parse `22`. Before, the fixed string
`22` got precedence, but with the new match declaration, we've
explicitly stated that the regular expression `r"[0-9]+"` has the
highest precedence. Since the `22` is not listed explicitly, it gets
added at the last level, where the `_` appears. We can fix this by
adjusting our `match` to mention `22` explicitly:
```
match {
    r"[0-9]+",
    "22"
} else {
    r"\w+",
    _
}
```
This raises the interesting question of what the precedence is **within**
a match rung -- after all, both the regex and `"22"` can match the same
string. The answer is that within a match rung, fixed literals get precedence
over regular expressions, just as before, and regular expressions in
the same rung are not allowed to overlap with one another.
With this new `match` declaration, we will find that our tests all pass.
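
Concretely, the final test from the repository exercises all of these
cases at once:

```rust
#[test]
fn calculator2b() {
    assert_eq!(calculator2b::parse_Term("33").unwrap(), "33");
    assert_eq!(calculator2b::parse_Term("foo33").unwrap(), "Id(foo33)");
    assert_eq!(calculator2b::parse_Term("(22)").unwrap(), "Twenty-two!");
    assert_eq!(calculator2b::parse_Term("(222)").unwrap(), "222");
}
```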
#### Renaming `match` declarations
There is one final twist before we reach the
[final version of our example that you will find in the repository][calculator2b]. We
can also use `match` declarations to give names to regular
expressions, so that we don't have to type them directly in our
grammar. For example, maybe instead of writing `r"\w+"`, we would
prefer to write `ID`. We could do that by modifying the match declaration like
so:
```
match {
    r"[0-9]+",
    "22"
} else {
    r"\w+" => ID, // <-- give a name here
    _
}
```
And then adjusting the definition of `Term` to reference `ID` instead:
```
pub Term = {
    Num,
    "(" <Term> ")",
    "22" => format!("Twenty-two!"),
    ID => format!("Id({})", <>), // <-- changed this
};
```
In fact, the match declaration can map a regular expression to any
kind of symbol you want (i.e., you can also map to a string literal or
even a regular expression). Whatever symbol appears after the `=>` is
what you should use in your grammar. As an example, some languages
have case-insensitive keywords; if you wanted to write `"BEGIN"` in the
grammar itself, but have that map to a regular expression in the lexer, you might write:
```
match {
    r"(?i)begin" => "BEGIN",
    ...
}
```
And now any reference in your grammar to `"BEGIN"` will actually match
any capitalization.
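
For instance, a rule like the following sketch (the `Block` nonterminal
is invented here purely for illustration) would then accept `begin`,
`Begin`, `BEGIN`, and so on:

```
// Assumes the `match` above also maps, e.g., `r"(?i)end" => "END"`.
Block: () = {
    "BEGIN" "END" => (),
};
```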
<a id="calculator3"></a>
### calculator3: Full-featured expressions
@@ -768,18 +1083,141 @@ fn calculator6() {
               "[((22 * 44) + 66), (error * 3)]");
    assert_eq!(&format!("{:?}", calculator6::parse_Exprs(&mut errors, "*").unwrap()),
               "[(error * error)]");
    assert_eq!(errors.len(), 4);
}
```
<a id="calculator7"></a>
### calculator7: `match` declarations and controlling the lexer/tokenizer
This section will describe how LALRPOP allows you to control the
generated lexer, which implements the first phase of parsing, where we break
up the raw text by applying regular expressions. If you're already
familiar with what a lexer is, you may want to skip past the
introductory section. If you're more interested in using LALRPOP with
your own hand-written lexer (or you want to consume something other
than `&str` inputs), you should check out the
[tutorial on custom lexers][lexer tutorial].
#### What is a lexer?
Throughout all the examples we've seen, we've drawn a distinction
between two kinds of symbols:
- Terminals, written (thus far) as string literals or regular expressions,
like `"("` or `r"\d+"`.
- Nonterminals, written without quotes, that are defined in terms of productions,
like:
```
ExprOp: Opcode = { // (3)
    "+" => Opcode::Add,
    "-" => Opcode::Sub,
};
```
When LALRPOP generates a parser, it always works in a two-phase
process. The first phase is called the **lexer** or **tokenizer**. It
has the job of figuring out the sequence of **terminals**: so
basically it analyzes the raw characters of your text and breaks them
into a series of terminals. It does this without having any idea about
your grammar or where you are in your grammar. Next, the parser proper
is a bit of code that looks at this stream of tokens and figures out
which nonterminals apply:
```
        +-------------------+      +----------------------+
Text -> | Lexer             |  ->  | Parser               |
        |                   |      |                      |
        | Applies regex to  |      | Consumes terminals,  |
        | produce terminals |      | executes your code   |
        +-------------------+      | as it recognizes     |
                                   | nonterminals         |
                                   +----------------------+
```
#### How do lexers work in LALRPOP?
LALRPOP's default lexer is based on regular expressions. By default,
it works by extracting all the terminals (e.g., `"("` or `r"\d+"`)
from your grammar and compiling them into one big list. At runtime, it
will walk over the string and, at each point, find the longest match
from the literals and regular expressions in your grammar and produce
the corresponding terminal. As an example, let's go back to the
[very first grammar we saw][calculator1], which just parsed
parenthesized numbers:
```
pub Term: i32 = {
    <n:Num> => n,
    "(" <t:Term> ")" => t,
};

Num: i32 = <s:r"[0-9]+"> => i32::from_str(s).unwrap();
```
This grammar in fact contains three terminals:
- `"("` -- a string literal, which must match exactly
- `")"` -- a string literal, which must match exactly
- `r"[0-9]+"` -- a regular expression
To help keep things concrete, let's use a simple calculator grammar.
We'll start with the grammar we saw before in
[the section on building the AST](#calculator4). It looks roughly like this
(excluding the actions for now):
```
Expr: Box<Expr> = {
    Expr ExprOp Factor,
    Factor,
};

ExprOp: Opcode = {
    "+",
    "-",
};

Factor: Box<Expr> = {
    Factor FactorOp Term,
    Term,
};

FactorOp: Opcode = {
    "*" => Opcode::Mul,
    "/" => Opcode::Div,
};

Term: Box<Expr> = {
    Num => Box::new(Expr::Number(<>)),
    FnCode Term => Box::new(Expr::Fn(<>)),
    "(" <Expr> ")",
};

FnCode: FnCode = {
    "SIN" => FnCode::Sin,
    "COS" => FnCode::Cos,
    ID => FnCode::Other(<>.to_string()),
};

Num: i32 = {
    r"[0-9]+" => i32::from_str(<>).unwrap(),
};
```
If you've used other parser generators before -- particularly ones
that are not PEG or parser combinators -- you may have noticed that
LALRPOP doesn't require you to specify a **tokenizer**.
[main]: ./calculator/src/main.rs
[calculator]: ./calculator/
[cargotoml]: ./calculator/Cargo.toml
[calculator1]: ./calculator/src/calculator1.lalrpop
[calculator2]: ./calculator/src/calculator2.lalrpop
[calculator2b]: ./calculator/src/calculator2b.lalrpop
[calculator3]: ./calculator/src/calculator3.lalrpop
[calculator4]: ./calculator/src/calculator4.lalrpop
[calculator5]: ./calculator/src/calculator5.lalrpop
[calculator6]: ./calculator/src/calculator6.lalrpop
[calculator7]: ./calculator/src/calculator7.lalrpop
[astrs]: ./calculator/src/ast.rs

File diff suppressed because it is too large