Merge pull request #214 from nikomatsakis/match-tutorial

extend the tutorial to explain lexing vs parsing
Niko Matsakis 2017-04-05 09:25:20 -04:00 committed by GitHub
commit 3f0bb90222
5 changed files with 1606 additions and 6 deletions

.gitignore

@@ -6,6 +6,7 @@ lalrpop/src/parser/lrgrammar.rs
doc/calculator/src/calculator1.rs
doc/calculator/src/calculator2.rs
doc/calculator/src/calculator2b.rs
doc/calculator/src/calculator3.rs
doc/calculator/src/calculator4.rs
doc/calculator/src/calculator5.rs

doc/calculator/src/calculator2b.lalrpop

@@ -0,0 +1,38 @@
grammar;

pub Term = {
    Num,
    "(" <Term> ")",
    "22" => format!("Twenty-two!"),
    ID => format!("Id({})", <>),
};

Num: String = r"[0-9]+" => <>.to_string();

// `match`: Declares the precedence of regular expressions
// relative to one another when synthesizing
// the lexer
match {
    // These items have highest precedence.

    r"[0-9]+",

    // Within a match level, fixed strings like `"22"` get
    // precedence over regular expressions.
    "22"
} else {
    // These items have next highest precedence.

    // Given an input like `123`, the number regex above
    // will match; but otherwise, given something like
    // `123foo` or `foo123`, this will match.
    //
    // Here, we also renamed the regex to the name `ID`, which we can
    // use in the grammar itself.
    r"\w+" => ID,

    // This `_` means "add in all the other strings and
    // regular expressions in the grammar here" (e.g.,
    // `"("`).
    _
} // you can have more `else` sections if you like

doc/calculator/src/main.rs

@@ -20,6 +20,16 @@ fn calculator2() {
    assert!(calculator2::parse_Term("((22)").is_err());
}

pub mod calculator2b;

#[test]
fn calculator2b() {
    assert_eq!(calculator2b::parse_Term("33").unwrap(), "33");
    assert_eq!(calculator2b::parse_Term("foo33").unwrap(), "Id(foo33)");
    assert_eq!(calculator2b::parse_Term("(22)").unwrap(), "Twenty-two!");
    assert_eq!(calculator2b::parse_Term("(222)").unwrap(), "222");
}

pub mod calculator3;

macro_rules! test3 {
@@ -74,7 +84,7 @@ fn calculator6() {
               "[((22 * 44) + 66), (error * 3)]");
    assert_eq!(&format!("{:?}", calculator6::parse_Exprs(&mut errors, "*").unwrap()),
               "[(error * error)]");
    assert_eq!(errors.len(), 4);
}

doc/tutorial.md

@@ -27,6 +27,8 @@ want to skip ahead, or just look at the LALRPOP sources:
  ([source][calculator1], [read](#calculator1))
- calculator2: LALRPOP shorthands and type inference
  ([source][calculator2], [read](#calculator2))
- calculator2b: Controlling the lexer with `match` declarations
  ([source][calculator2b], [read](#calculator2b))
- calculator3: Handling full-featured expressions like `22+44*66`
  ([source][calculator3], [read](#calculator3))
- calculator4: Building ASTs
@@ -35,6 +37,8 @@ want to skip ahead, or just look at the LALRPOP sources:
  ([source][calculator5], [read](#calculator5))
- calculator6: Error recovery
  ([source][calculator6], [read](#calculator6))
- calculator7: `match` declarations and controlling the lexer/tokenizer
  ([source][calculator7], [read](#calculator7))
This tutorial is still incomplete. Here are some topics that I aim to
cover when I get time to write about them:
@@ -169,11 +173,11 @@ case, just LALRPOP.
The `[dependencies]` section describes the dependencies that LALRPOP
needs at runtime. All LALRPOP parsers require at least the
`lalrpop-util` crate. In addition, if you don't want to write the
lexer by hand, you need to add a dependency on the regex crate. (If
you don't know what a lexer is, don't worry, it's not important just
now, though we will cover it in [section 7][calculator7]; if you *do*
know what a lexer is, and you want to know how to write a lexer by
hand and use it with LALRPOP, then check out the [lexer tutorial].)
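
For concreteness, here is a minimal sketch of such a `[dependencies]`
section (the version requirements are illustrative placeholders, not
versions pinned by this tutorial):

```toml
[dependencies]
# Required at runtime by every LALRPOP-generated parser.
lalrpop-util = "*"  # placeholder; pin a real version in your project
# Only needed if you use the built-in regex-based lexer.
regex = "*"         # placeholder; pin a real version in your project
```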
[lexer tutorial]: lexer_tutorial.md
@@ -408,6 +412,317 @@ give you the idea:
| `<A> B => bar(<>)` | `<a:A> B => bar(a)` |
| `<A> <B> => bar(<>)` | `<a:A> <b:B> => bar(a, b)` |
<a id="calculator2b"></a>
### calculator2b: Controlling the lexer with `match` declarations
This example digs a bit deeper into how LALRPOP works. In particular,
it explains the meaning of those strings and regular expressions that
we used in the previous tutorial, and how they are used to process the
input string (a process which you can control). This first step of
breaking up the input using regular expressions is often called
**lexing** or **tokenizing**.
If you're comfortable with the idea of a lexer or tokenizer, you may
wish to skip ahead to the [calculator3](#calculator3) example, which covers
parsing bigger expressions, and come back here only when you find you
want more control. You may also be interested in the
[tutorial on writing a custom lexer][lexer tutorial].
#### Terminals vs nonterminals
You may have noticed that our grammar included two distinct kinds of
symbols. There were the nonterminals, `Term` and `Num`, which we
defined by specifying a series of symbols that they must match, along
with some action code that should execute once they have matched:
```rust
Num: i32 = r"[0-9]+" => i32::from_str(<>).unwrap();
//~~ ~~~   ~~~~~~~~~    ~~~~~~~~~~~~~~~~~~~~~~~~~~
// |  |        |        Action code
// |  |        Symbol(s) that should match
// |  Return type
// Name of nonterminal
```
But there are also **terminals**, which consist of the string literals
and regular expressions sprinkled throughout the grammar. (Terminals
are also often called **tokens**, and I will use the terms
interchangeably.)
This distinction between terminals and nonterminals is very important
to how LALRPOP works. In fact, when LALRPOP generates a parser, it
always works in a two-phase process. The first phase is called the
**lexer** or **tokenizer**. It has the job of figuring out the
sequence of **terminals**: so basically it analyzes the raw characters
of your text and breaks them into a series of terminals. It does this
without having any idea about your grammar or where you are in your
grammar. Next, the parser proper is a bit of code that looks at this
stream of tokens and figures out which nonterminals apply:
```
        +-------------------+      +----------------------+
Text -> | Lexer             |  ->  | Parser               |
        |                   |      |                      |
        | Applies regex to  |      | Consumes terminals,  |
        | produce terminals |      | executes your code   |
        +-------------------+      | as it recognizes     |
                                   | nonterminals         |
                                   +----------------------+
```
LALRPOP's default lexer is based on regular expressions. By default,
it works by extracting all the terminals (e.g., `"("` or `r"\d+"`)
from your grammar and compiling them into one big list. At runtime, it
will walk over the string and, at each point, find the longest match
from the literals and regular expressions in your grammar and produce
the corresponding terminal. As an example, let's look again at our
example grammar:
```
pub Term: i32 = {
    <n:Num> => n,
    "(" <t:Term> ")" => t,
};

Num: i32 = <s:r"[0-9]+"> => i32::from_str(s).unwrap();
```
This grammar in fact contains three terminals:
- `"("` -- a string literal, which must match exactly
- `")"` -- a string literal, which must match exactly
- `r"[0-9]+"` -- a regular expression
When we generate a lexer, it is effectively going to be checking for
each of these three terminals in a loop, sort of like this pseudocode:
```
let mut i = 0; // index into string
loop {
    skip whitespace; // we do this implicitly, at least by default
    if (data at index i is "(") { produce "("; }
    else if (data at index i is ")") { produce ")"; }
    else if (data at index i matches regex "[0-9]+") { produce r"[0-9]+"; }
}
```
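
To make the longest-match idea concrete, here is a rough,
self-contained Rust sketch written directly against the `regex` crate.
It is an illustration of the algorithm only (the function name and
structure are invented for this example); it is not the code LALRPOP
actually generates:

```rust
use regex::Regex;

/// A toy longest-match lexer over the three terminals above.
/// Illustrative only: LALRPOP's generated lexer is more elaborate.
fn tokenize(mut input: &str) -> Vec<String> {
    // Anchored patterns, one per terminal; `^` forces each regex
    // to match only at the current position in the input.
    let patterns = [r"^\(", r"^\)", r"^[0-9]+"];
    let regexes: Vec<Regex> = patterns
        .iter()
        .map(|p| Regex::new(p).unwrap())
        .collect();

    let mut tokens = Vec::new();
    loop {
        input = input.trim_start(); // skip whitespace, as LALRPOP does by default
        if input.is_empty() {
            return tokens;
        }
        // Try every terminal at this position and keep the longest match.
        let longest = regexes
            .iter()
            .filter_map(|re| re.find(input))
            .max_by_key(|m| m.end())
            .expect("no terminal matches here: a lexer error");
        tokens.push(longest.as_str().to_string());
        input = &input[longest.end()..];
    }
}
```

With this sketch, `tokenize("( 22 44 ) )")` yields the five terminals
`(`, `22`, `44`, `)`, `)`, which is exactly the token stream discussed
next.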
Note that this has nothing to do with your grammar. For example, the tokenizer
would happily tokenize a string like this one, which doesn't fit our grammar:
```
( 22 44 ) )
^ ^^ ^^ ^ ^
| |  |  | ")" terminal
| |  |  |
| |  |  ")" terminal
| +--+
| |
| 2 r"[0-9]+" terminals
|
"(" terminal
```
When these tokens are fed into the **parser**, it would notice that we
have one left paren but then two numbers (`r"[0-9]+"` terminals), and
hence report an error.
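
To see this concretely, a test along the following lines (the test
name is invented for illustration; `parse_Term` is the same generated
entry point used throughout `main.rs`) would confirm that the input
lexes fine but fails to parse:

```rust
#[test]
fn lexes_but_does_not_parse() {
    // All five terminals tokenize successfully, but the resulting
    // token stream matches no production of `Term`, so parsing fails.
    assert!(calculator1::parse_Term("( 22 44 ) )").is_err());
}
```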
#### Precedence of fixed strings
Terminals in LALRPOP can be specified (by default) in two ways: as a
fixed string (like `"("`) or as a regular expression (like
`r"[0-9]+"`). There is actually an important difference: if, at some
point in the input, both a fixed string **and** a regular expression
could match, LALRPOP gives the fixed string precedence. To demonstrate
this, let's modify our parser. If you recall, the current parser
parses parenthesized numbers, producing an `i32`. We're going to modify
it to produce a **string**, and we'll add an "easter egg" so that `22`
(or `(22)`, `((22))`, etc.) produces the string `"Twenty-two!"`:
```
pub Term = {
    Num,
    "(" <Term> ")",
    "22" => format!("Twenty-two!"),
};

Num: String = r"[0-9]+" => <>.to_string();
```
If we write some simple unit tests, we can see that in fact an input
of `22` has matched the string literal. Interestingly, the input `222`
matches the regular expression instead; this is because LALRPOP
prefers to find the **longest** match first. After that, if there are
two matches of equal length, it prefers the fixed string:
```rust
#[test]
fn calculator2b() {
    assert_eq!(calculator2b::parse_Term("33").unwrap(), "33");
    assert_eq!(calculator2b::parse_Term("(22)").unwrap(), "Twenty-two!");
    assert_eq!(calculator2b::parse_Term("(222)").unwrap(), "222");
}
```
#### Ambiguities between regular expressions
In the previous section, we saw that fixed strings have precedence
over regular expressions. But what if we have two regular expressions
that can match the same input? Which one wins? For example, consider
this variant of the grammar above, where we also try to support
parenthesized **identifiers** like `((foo22))`:
```
pub Term = {
    Num,
    "(" <Term> ")",
    "22" => format!("Twenty-two!"),
    r"\w+" => format!("Id({})", <>), // <-- we added this
};

Num: String = r"[0-9]+" => <>.to_string();
```
Here I've written the regular expression `r"\w+"`. However, if you check
out the [docs for regex](https://docs.rs/regex), you'll see that `\w`
is defined to match alphabetic characters but also digits. So there
is actually an ambiguity here: if we have something like `123`, it
could be considered to match either `r"[0-9]+"` **or** `r"\w+"`. If
you try this grammar, you'll find that LALRPOP helpfully reports an
error:
```
error: ambiguity detected between the terminal `r#"\w+"#` and the terminal `r#"[0-9]+"#`
r"\w+" => <>.to_string(),
~~~~~~
```
There are various ways to fix this. We might try adjusting our regular
expression so that the first character cannot be a number, so perhaps
something like `r"[[:alpha:]]\w*"`. This will work, but it actually
matches something different than what we had before (e.g., `123foo`
will not be considered to match, for better or worse). And anyway it's
not always convenient to make your regular expressions completely
disjoint like that. Another option is to use a `match` declaration,
which lets you control the precedence between regular expressions.
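
For reference, that first approach would look something like the
following sketch (note that, as just discussed, this changes which
inputs count as identifiers):

```
pub Term = {
    Num,
    "(" <Term> ")",
    "22" => format!("Twenty-two!"),
    r"[[:alpha:]]\w*" => format!("Id({})", <>), // first char must not be a digit
};

Num: String = r"[0-9]+" => <>.to_string();
```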
#### Simple `match` declarations
A `match` declaration lets you explicitly give the precedence between
terminals. In its simplest form, it consists of just ordering regular
expressions and string literals into groups, with the higher
precedence items coming first. So, for example, we could resolve
our conflict above by giving `r"[0-9]+"` **precedence** over `r"\w+"`,
thus saying that if something can be lexed as a number, we'll do that,
and otherwise consider it to be an identifier.
```
match {
    r"[0-9]+"
} else {
    r"\w+",
    _
}
```
Here the match contains two levels; each level can have more than one
item in it. The top level contains only `r"[0-9]+"`, which means that this
regular expression is given highest priority. The next level contains
`r"\w+"`, so that will match afterwards.
The final `_` indicates that other string literals and regular
expressions that appear elsewhere in the grammar (e.g., `"("` or
`"22"`) should be added into that final level of precedence (without
an `_`, it is illegal to use a terminal that does not appear in the
match declaration).
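
To make that concrete, a declaration like the following sketch would
cause LALRPOP to reject our grammar, because `"("`, `")"`, and `"22"`
are used in the grammar but listed nowhere in the `match`:

```
match {
    r"[0-9]+"
} else {
    r"\w+"  // no `_` here, so *only* these two terminals may be used
}
```

Our actual declaration above keeps the `_`, so all the other terminals
remain available.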
If we add this `match` section into our example, we'll find that it
compiles, but it doesn't work exactly like we wanted. Let's update our
unit test a bit to include some identifier examples:
```rust
#[test]
fn calculator2b() {
    // These will all work:
    assert_eq!(calculator2b::parse_Term("33").unwrap(), "33");
    assert_eq!(calculator2b::parse_Term("foo33").unwrap(), "Id(foo33)");
    assert_eq!(calculator2b::parse_Term("(foo33)").unwrap(), "Id(foo33)");

    // This line will fail:
    assert_eq!(calculator2b::parse_Term("(22)").unwrap(), "Twenty-two!");
}
```
The problem comes about when we parse `22`. Before, the fixed string
`22` got precedence, but with the new match declaration, we've
explicitly stated that the regular expression `r"[0-9]+"` has the
highest precedence. Since the `22` is not listed explicitly, it gets
added at the last level, where the `_` appears. We can fix this by
adjusting our `match` to mention `22` explicitly:
```
match {
    r"[0-9]+",
    "22"
} else {
    r"\w+",
    _
}
```
This raises the interesting question of what the precedence is **within**
a match rung -- after all, both the regex and `"22"` can match the same
string. The answer is that within a match rung, fixed literals get precedence
over regular expressions, just as before, and regular expressions in
the same rung are not allowed to overlap with one another.
With this new `match` declaration, we will find that our tests all pass.
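
Concretely, the final test from the repository exercises all of these
cases at once:

```rust
#[test]
fn calculator2b() {
    assert_eq!(calculator2b::parse_Term("33").unwrap(), "33");
    assert_eq!(calculator2b::parse_Term("foo33").unwrap(), "Id(foo33)");
    assert_eq!(calculator2b::parse_Term("(22)").unwrap(), "Twenty-two!");
    assert_eq!(calculator2b::parse_Term("(222)").unwrap(), "222");
}
```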
#### Renaming `match` declarations
There is one final twist before we reach the
[final version of our example that you will find in the repository][calculator2b]. We
can also use `match` declarations to give names to regular
expressions, so that we don't have to type them directly in our
grammar. For example, maybe instead of writing `r"\w+"`, we would
prefer to write `ID`. We could do that by modifying the match declaration like
so:
```
match {
    r"[0-9]+",
    "22"
} else {
    r"\w+" => ID, // <-- give a name here
    _
}
```
And then adjusting the definition of `Term` to reference `ID` instead:
```
pub Term = {
    Num,
    "(" <Term> ")",
    "22" => format!("Twenty-two!"),
    ID => format!("Id({})", <>), // <-- changed this
};
```
In fact, the match declaration can map a regular expression to any
kind of symbol you want (i.e., you can also map to a string literal or
even a regular expression). Whatever symbol appears after the `=>` is
what you should use in your grammar. As an example, some languages
have case-insensitive keywords; if you wanted to write `"BEGIN"` in the
grammar itself, but have that map to a regular expression in the lexer, you might write:
```
match {
    r"(?i)begin" => "BEGIN",
    ...
}
```
And now any reference in your grammar to `"BEGIN"` will actually match
any capitalization.
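
For instance, a rule like the following sketch (the `Block` nonterminal
is invented here purely for illustration) would then accept `begin`,
`Begin`, `BEGIN`, and so on:

```
// Assumes the `match` above also maps, e.g., `r"(?i)end" => "END"`.
Block: () = {
    "BEGIN" "END" => (),
};
```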
<a id="calculator3"></a>
### calculator3: Full-featured expressions
@@ -768,18 +1083,141 @@ fn calculator6() {
               "[((22 * 44) + 66), (error * 3)]");
    assert_eq!(&format!("{:?}", calculator6::parse_Exprs(&mut errors, "*").unwrap()),
               "[(error * error)]");
    assert_eq!(errors.len(), 4);
}
```
<a id="calculator7"></a>
### calculator7: `match` declarations and controlling the lexer/tokenizer
This section will describe how LALRPOP allows you to control the
generated lexer, which implements the first phase of parsing, where we break
up the raw text by applying regular expressions. If you're already
familiar with what a lexer is, you may want to skip past the
introductory section. If you're more interested in using LALRPOP with
your own hand-written lexer (or you want to consume something other
than `&str` inputs), you should check out the
[tutorial on custom lexers][lexer tutorial].
#### What is a lexer?
Throughout all the examples we've seen, we've drawn a distinction
between two kinds of symbols:
- Terminals, written (thus far) as string literals or regular expressions,
like `"("` or `r"\d+"`.
- Nonterminals, written without quotes, that are defined in terms of productions,
like:
```
ExprOp: Opcode = { // (3)
    "+" => Opcode::Add,
    "-" => Opcode::Sub,
};
```
When LALRPOP generates a parser, it always works in a two-phase
process. The first phase is called the **lexer** or **tokenizer**. It
has the job of figuring out the sequence of **terminals**: so
basically it analyzes the raw characters of your text and breaks them
into a series of terminals. It does this without having any idea about
your grammar or where you are in your grammar. Next, the parser proper
is a bit of code that looks at this stream of tokens and figures out
which nonterminals apply:
```
        +-------------------+      +----------------------+
Text -> | Lexer             |  ->  | Parser               |
        |                   |      |                      |
        | Applies regex to  |      | Consumes terminals,  |
        | produce terminals |      | executes your code   |
        +-------------------+      | as it recognizes     |
                                   | nonterminals         |
                                   +----------------------+
```
#### How do lexers work in LALRPOP?
LALRPOP's default lexer is based on regular expressions. By default,
it works by extracting all the terminals (e.g., `"("` or `r"\d+"`)
from your grammar and compiling them into one big list. At runtime, it
will walk over the string and, at each point, find the longest match
from the literals and regular expressions in your grammar and produce
the corresponding terminal. As an example, let's go back to the
[very first grammar we saw][calculator1], which just parsed
parenthesized numbers:
```
pub Term: i32 = {
    <n:Num> => n,
    "(" <t:Term> ")" => t,
};

Num: i32 = <s:r"[0-9]+"> => i32::from_str(s).unwrap();
```
This grammar in fact contains three terminals:
- `"("` -- a string literal, which must match exactly
- `")"` -- a string literal, which must match exactly
- `r"[0-9]+"` -- a regular expression
To help keep things concrete, let's use a simple calculator grammar.
We'll start with the grammar we saw before in
[the section on building the AST](#calculator4). It looks roughly like this
(excluding the actions for now):
```
Expr: Box<Expr> = {
    Expr ExprOp Factor,
    Factor,
};

ExprOp: Opcode = {
    "+",
    "-",
};

Factor: Box<Expr> = {
    Factor FactorOp Term,
    Term,
};

FactorOp: Opcode = {
    "*" => Opcode::Mul,
    "/" => Opcode::Div,
};

Term: Box<Expr> = {
    Num => Box::new(Expr::Number(<>)),
    FnCode Term => Box::new(Expr::Fn(<>)),
    "(" <Expr> ")",
};

FnCode: FnCode = {
    "SIN" => FnCode::Sin,
    "COS" => FnCode::Cos,
    ID => FnCode::Other(<>.to_string()),
};

Num: i32 = {
    r"[0-9]+" => i32::from_str(<>).unwrap(),
};
```
If you've used other parser generators before -- particularly ones
that are not PEG or parser combinators -- you may have noticed that
LALRPOP doesn't require you to specify a **tokenizer**.
[main]: ./calculator/src/main.rs
[calculator]: ./calculator/
[cargotoml]: ./calculator/Cargo.toml
[calculator1]: ./calculator/src/calculator1.lalrpop
[calculator2]: ./calculator/src/calculator2.lalrpop
[calculator2b]: ./calculator/src/calculator2b.lalrpop
[calculator3]: ./calculator/src/calculator3.lalrpop
[calculator4]: ./calculator/src/calculator4.lalrpop
[calculator5]: ./calculator/src/calculator5.lalrpop
[calculator6]: ./calculator/src/calculator6.lalrpop
[calculator7]: ./calculator/src/calculator7.lalrpop
[astrs]: ./calculator/src/ast.rs

File diff suppressed because it is too large