From: Eli Zaretskii Date: Sun, 31 Mar 2024 07:29:34 +0000 (+0300) Subject: Fix the new PEG library X-Git-Url: http://git.eshelyaron.com/gitweb/?a=commitdiff_plain;h=4fe250e5d47023adfe93db18cb2928b05d6a1496;p=emacs.git Fix the new PEG library * doc/lispref/peg.texi (Parsing Expression Grammars) (PEX Definitions, Parsing Actions, Writing PEG Rules): Fix markup, indexing, and wording. * etc/NEWS: Fix wording of PEG entry. * test/lisp/progmodes/peg-tests.el: Move from test/lisp/, to match the directory of peg.el. (cherry picked from commit 994bcc125b66397b455c8a7b70fb454b483df052) --- diff --git a/doc/lispref/peg.texi b/doc/lispref/peg.texi index ef4dfa7653e..fbf57852ee0 100644 --- a/doc/lispref/peg.texi +++ b/doc/lispref/peg.texi @@ -7,29 +7,34 @@ @chapter Parsing Expression Grammars @cindex text parsing @cindex parsing expression grammar +@cindex PEG Emacs Lisp provides several tools for parsing and matching text, from regular expressions (@pxref{Regular Expressions}) to full -@acronym{LL} grammar parsers (@pxref{Top,, Bovine parser -development,bovine}). @dfn{Parsing Expression Grammars} +left-to-right (a.k.a.@: @acronym{LL}) grammar parsers (@pxref{Top,, +Bovine parser development,bovine}). @dfn{Parsing Expression Grammars} (@acronym{PEG}) are another approach to text parsing that offer more structure and composibility than regular expressions, but less complexity than context-free grammars. -A @acronym{PEG} parser is defined as a list of named rules, each of -which matches text patterns, and/or contains references to other +A Parsing Expression Grammar (@acronym{PEG}) describes a formal language +in terms of a set of rules for recognizing strings in the language. In +Emacs, a @acronym{PEG} parser is defined as a list of named rules, each +of which matches text patterns and/or contains references to other rules. Parsing is initiated with the function @code{peg-run} or the macro @code{peg-parse} (see below), and parses text after point in the current buffer, using a given set of rules. @cindex parsing expression -The definition of each rule is referred to as a @dfn{parsing -expression} (@acronym{PEX}), and can consist of a literal string, a -regexp-like character range or set, a peg-specific construct -resembling an elisp function call, a reference to another rule, or a -combination of any of these. A grammar is expressed as a tree of -rules in which one rule is typically treated as a ``root'' or -``entry-point'' rule. For instance: +@cindex root, of parsing expression grammar +@cindex entry-point, of parsing expression grammar +Each rule in a @acronym{PEG} is referred to as a @dfn{parsing +expression} (@acronym{PEX}), and can be specified a a literal string, a +regexp-like character range or set, a peg-specific construct resembling +an Emacs Lisp function call, a reference to another rule, or a +combination of any of these. A grammar is expressed as a tree of rules +in which one rule is typically treated as a ``root'' or ``entry-point'' +rule. For instance: @example @group @@ -56,14 +61,17 @@ first rule is considered the ``entry-point'': @end group @end example -This macro represents the simplest use of the @acronym{PEG} library, -but also the least flexible, as the rules must be written directly -into the source code. A more flexible approach involves use of three -macros in conjunction: @code{with-peg-rules}, a @code{let}-like -construct that makes a set of rules available within the macro body; -@code{peg-run}, which initiates parsing given a single rule; and -@code{peg}, which is used to wrap the entry-point rule name. In fact, -a call to @code{peg-parse} expands to just this set of calls. The +@c FIXME: These two should be formally defined using @defmac and @defun. +@findex with-peg-rules +@findex peg-run +The @code{peg-parse} macro represents the simplest use of the +@acronym{PEG} library, but also the least flexible, as the rules must be +written directly into the source code. A more flexible approach +involves use of three macros in conjunction: @code{with-peg-rules}, a +@code{let}-like construct that makes a set of rules available within the +macro body; @code{peg-run}, which initiates parsing given a single rule; +and @code{peg}, which is used to wrap the entry-point rule name. In +fact, a call to @code{peg-parse} expands to just this set of calls. The above example could be written as: @example @@ -79,33 +87,43 @@ above example could be written as: This allows more explicit control over the ``entry-point'' of parsing, and allows the combination of rules from different sources. +@c FIXME: Use @defmac. +@findex define-peg-rule Individual rules can also be defined using a more @code{defun}-like syntax, using the macro @code{define-peg-rule}: @example +@group (define-peg-rule digit () [0-9]) +@end group @end example This also allows for rules that accept an argument (supplied by the -@code{funcall} PEG rule). +@code{funcall} PEG rule, @pxref{PEX Definitions}). +@c FIXME: Use @defmac. +@findex define-peg-ruleset Another possibility is to define a named set of rules with @code{define-peg-ruleset}: @example +@group (define-peg-ruleset number-grammar '((number sign digit (* digit)) digit ;; A reference to the definition above. (sign (or "+" "-" "")))) +@end group @end example Rules and rulesets defined this way can be referred to by name in later calls to @code{peg-run} or @code{with-peg-rules}: @example +@group (with-peg-rules number-grammar (peg-run (peg number))) +@end group @end example By default, calls to @code{peg-run} or @code{peg-parse} produce no @@ -125,11 +143,11 @@ act upon parsed strings, rules can include @dfn{actions}, see Parsing expressions can be defined using the following syntax: @table @code -@item (and E1 E2 ...) -A sequence of @acronym{PEX}s that must all be matched. The @code{and} form is -optional and implicit. +@item (and @var{e1} @var{e2}@dots{}) +A sequence of @acronym{PEX}s that must all be matched. The @code{and} +form is optional and implicit. -@item (or E1 E2 ...) +@item (or @var{e1} @var{e2}@dots{}) Prioritized choices, meaning that, as in Elisp, the choices are tried in order, and the first successful match is used. Note that this is distinct from context-free grammars, in which selection between @@ -141,43 +159,43 @@ Matches any single character, as the regexp ``.''. @item @var{string} A literal string. -@item (char @var{C}) -A single character @var{C}, as an Elisp character literal. +@item (char @var{c}) +A single character @var{c}, as an Elisp character literal. -@item (* @var{E}) -Zero or more instances of expression @var{E}, as the regexp @samp{*}. +@item (* @var{e}) +Zero or more instances of expression @var{e}, as the regexp @samp{*}. Matching is always ``greedy''. -@item (+ @var{E}) -One or more instances of expression @var{E}, as the regexp @samp{+}. +@item (+ @var{e}) +One or more instances of expression @var{e}, as the regexp @samp{+}. Matching is always ``greedy''. -@item (opt @var{E}) -Zero or one instance of expression @var{E}, as the regexp @samp{?}. +@item (opt @var{e}) +Zero or one instance of expression @var{e}, as the regexp @samp{?}. -@item SYMBOL +@item @var{symbol} A symbol representing a previously-defined PEG rule. -@item (range CH1 CH2) -The character range between CH1 and CH2, as the regexp @samp{[CH1-CH2]}. +@item (range @var{ch1} @var{ch2}) +The character range between @var{ch1} and @var{ch2}, as the regexp +@samp{[@var{ch1}-@var{ch2}]}. -@item [CH1-CH2 "+*" ?x] +@item [@var{ch1}-@var{ch2} "+*" ?x] A character set, which can include ranges, character literals, or strings of characters. @item [ascii cntrl] A list of named character classes. -@item (syntax-class @var{NAME}) +@item (syntax-class @var{name}) A single syntax class. -@item (funcall E ARGS...) -Call @acronym{PEX} E (previously defined with @code{define-peg-rule}) -with arguments @var{ARGS}. +@item (funcall @var{e} @var{args}@dots{}) +Call @acronym{PEX} @var{e} (previously defined with +@code{define-peg-rule}) with arguments @var{args}. @item (null) The empty string. - @end table The following expressions are used as anchors or tests -- they do not @@ -210,19 +228,19 @@ Beginning of symbol. @item (eos) End of symbol. -@item (if E) -Returns non-@code{nil} if parsing @acronym{PEX} E from point succeeds (point -is not moved). - -@item (not E) -Returns non-@code{nil} if parsing @acronym{PEX} E from point fails (point -is not moved). +@item (if @var{e}) +Returns non-@code{nil} if parsing @acronym{PEX} @var{e} from point +succeeds (point is not moved). -@item (guard EXP) -Treats the value of the Lisp expression EXP as a boolean. +@item (not @var{e}) +Returns non-@code{nil} if parsing @acronym{PEX} @var{e} from point fails +(point is not moved). +@item (guard @var{exp}) +Treats the value of the Lisp expression @var{exp} as a boolean. @end table +@c FIXME: peg-char-classes should be mentioned in the text below. @vindex peg-char-classes Character class matching can use the same named character classes as in regular expressions (@pxref{Top,, Character Classes,elisp}) @@ -234,12 +252,13 @@ in regular expressions (@pxref{Top,, Character Classes,elisp}) @cindex parsing stack By default the process of parsing simply moves point in the current buffer, ultimately returning @code{t} if the parsing succeeds, and -@code{nil} if it doesn't. It's also possible to define ``actions'' -that can run arbitrary Elisp at certain points in the parsed text. -These actions can optionally affect something called the @dfn{parsing -stack}, which is a list of values returned by the parsing process. -These actions only run (and only return values) if the parsing process -ultimately succeeds; if it fails the action code is not run at all. +@code{nil} if it doesn't. It's also possible to define @dfn{parsing +actions} that can run arbitrary Elisp at certain points in the parsed +text. These actions can optionally affect something called the +@dfn{parsing stack}, which is a list of values returned by the parsing +process. These actions only run (and only return values) if the parsing +process ultimately succeeds; if it fails the action code is not run at +all. Actions can be added anywhere in the definition of a rule. They are distinguished from parsing expressions by an initial backquote @@ -247,12 +266,13 @@ distinguished from parsing expressions by an initial backquote of hyphens (@samp{--}) somewhere within it. Symbols to the left of the hyphens are bound to values popped from the stack (they are somewhat analogous to the argument list of a lambda form). Values -produced by code to the right are pushed to the stack (analogous to -the return value of the lambda). For instance, the previous grammar -can be augmented with actions to return the parsed number as an actual -integer: +produced by code to the right of the hyphens are pushed onto the stack +(analogous to the return value of the lambda). For instance, the +previous grammar can be augmented with actions to return the parsed +number as an actual integer: @example +@group (with-peg-rules ((number sign digit (* digit `(a b -- (+ (* a 10) b))) `(sign val -- (* sign val))) @@ -261,6 +281,7 @@ integer: (and "" `(-- 1)))) (digit [0-9] `(-- (- (char-before) ?0)))) (peg-run (peg number))) +@end group @end example There must be values on the stack before they can be popped and @@ -271,43 +292,53 @@ only left-hand terms will consume (and discard) values from the stack. At the end of parsing, stack values are returned as a flat list. To return the string matched by a @acronym{PEX} (instead of simply -moving point over it), a rule like this can be used: +moving point over it), a grammar can use a rule like this: @example +@group (one-word `(-- (point)) (+ [word]) `(start -- (buffer-substring start (point)))) +@end group @end example -The first action pushes the initial value of point to the stack. The -intervening @acronym{PEX} moves point over the next word. The second -action pops the previous value from the stack (binding it to the -variable @code{start}), and uses that value to extract a substring -from the buffer and push it to the stack. This pattern is so common -that @acronym{PEG} provides a shorthand function that does exactly the -above, along with a few other shorthands for common scenarios: +@noindent +The first action above pushes the initial value of point to the stack. +The intervening @acronym{PEX} moves point over the next word. The +second action pops the previous value from the stack (binding it to the +variable @code{start}), then uses that value to extract a substring from +the buffer and push it to the stack. This pattern is so common that +@acronym{PEG} provides a shorthand function that does exactly the above, +along with a few other shorthands for common scenarios: @table @code -@item (substring @var{E}) -Match @acronym{PEX} @var{E} and push the matched string to the stack. - -@item (region @var{E}) -Match @var{E} and push the start and end positions of the matched -region to the stack. - -@item (replace @var{E} @var{replacement}) -Match @var{E} and replaced the matched region with the string @var{replacement}. - -@item (list @var{E}) -Match @var{E}, collect all values produced by @var{E} (and its -sub-expressions) into a list, and push that list to the stack. Stack +@findex substring (a PEG shorthand) +@item (substring @var{e}) +Match @acronym{PEX} @var{e} and push the matched string onto the stack. + +@findex region (a PEG shorthand) +@item (region @var{e}) +Match @var{e} and push the start and end positions of the matched +region onto the stack. + +@findex replace (a PEG shorthand) +@item (replace @var{e} @var{replacement}) +Match @var{e} and replaced the matched region with the string +@var{replacement}. + +@findex list (a PEG shorthand) +@item (list @var{e}) +Match @var{e}, collect all values produced by @var{e} (and its +sub-expressions) into a list, and push that list onto the stack. Stack values are typically returned as a flat list; this is a way of ``grouping'' values together. @end table @node Writing PEG Rules @section Writing PEG Rules +@cindex PEG rules, pitfalls +@cindex Parsing Expression Grammar, pitfalls in rules Something to be aware of when writing PEG rules is that they are greedy. Rules which can consume a variable amount of text will always @@ -319,9 +350,10 @@ backtracking. For instance, this rule will never succeed: (forest (+ "tree" (* [blank])) "tree" (eol)) @end example -The @acronym{PEX} @code{(+ "tree" (* [blank]))} will consume all -repetitions of the word ``tree'', leaving none to match the final -@code{"tree"}. +@noindent +The @acronym{PEX} @w{@code{(+ "tree" (* [blank]))}} will consume all +the repetitions of the word @samp{tree}, leaving none to match the final +@samp{tree}. In these situations, the desired result can be obtained by using predicates and guards -- namely the @code{not}, @code{if} and @@ -331,6 +363,7 @@ predicates and guards -- namely the @code{not}, @code{if} and (forest (+ "tree" (* [blank])) (not (eol)) "tree" (eol)) @end example +@noindent The @code{if} and @code{not} operators accept a parsing expression and interpret it as a boolean, without moving point. The contents of a @code{guard} operator are evaluated as regular Lisp (not a @@ -345,6 +378,7 @@ rule: (end-game "game" (eob)) @end example +@noindent when run in a buffer containing the text ``game over'' after point, will move point to just after ``game'' then halt parsing, returning @code{nil}. Successful parsing will always return @code{t}, or the diff --git a/etc/NEWS b/etc/NEWS index 2ba1268ddad..d168c63894b 100644 --- a/etc/NEWS +++ b/etc/NEWS @@ -1697,8 +1697,8 @@ preventing the installation of Compat if unnecessary. +++ ** New package PEG. -Emacs now includes a library for writing (P)arsing (E)xpression -(G)rammars, an approach to text parsing that provides more structure +Emacs now includes a library for writing Parsing Expression +Grammars (PEG), an approach to text parsing that provides more structure than regular expressions, but less complexity than context-free grammars. The Info manual "(elisp) Parsing Expression Grammars" has documentation and examples. diff --git a/test/lisp/peg-tests.el b/test/lisp/peg-tests.el deleted file mode 100644 index 864e09b4200..00000000000 --- a/test/lisp/peg-tests.el +++ /dev/null @@ -1,367 +0,0 @@ -;;; peg-tests.el --- Tests of PEG parsers -*- lexical-binding: t; -*- - -;; Copyright (C) 2008-2023 Free Software Foundation, Inc. - -;; This program is free software; you can redistribute it and/or modify -;; it under the terms of the GNU General Public License as published by -;; the Free Software Foundation, either version 3 of the License, or -;; (at your option) any later version. - -;; This program is distributed in the hope that it will be useful, -;; but WITHOUT ANY WARRANTY; without even the implied warranty of -;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -;; GNU General Public License for more details. - -;; You should have received a copy of the GNU General Public License -;; along with this program. If not, see . - -;;; Commentary: - -;; Tests and examples, that used to live in peg.el wrapped inside an `eval'. - -;;; Code: - -(require 'peg) -(require 'ert) - -;;; Tests: - -(defmacro peg-parse-string (pex string &optional noerror) - "Parse STRING according to PEX. -If NOERROR is non-nil, push nil resp. t if the parse failed -resp. succeeded instead of signaling an error." - (let ((oldstyle (consp (car-safe pex)))) ;PEX is really a list of rules. - `(with-temp-buffer - (insert ,string) - (goto-char (point-min)) - ,(if oldstyle - `(with-peg-rules ,pex - (peg-run (peg ,(caar pex)) - ,(unless noerror '#'peg-signal-failure))) - `(peg-run (peg ,pex) - ,(unless noerror '#'peg-signal-failure)))))) - -(define-peg-rule peg-test-natural () - [0-9] (* [0-9])) - -(ert-deftest peg-test () - (should (peg-parse-string peg-test-natural "99 bottles" t)) - (should (peg-parse-string ((s "a")) "a" t)) - (should (not (peg-parse-string ((s "a")) "b" t))) - (should (peg-parse-string ((s (not "a"))) "b" t)) - (should (not (peg-parse-string ((s (not "a"))) "a" t))) - (should (peg-parse-string ((s (if "a"))) "a" t)) - (should (not (peg-parse-string ((s (if "a"))) "b" t))) - (should (peg-parse-string ((s "ab")) "ab" t)) - (should (not (peg-parse-string ((s "ab")) "ba" t))) - (should (not (peg-parse-string ((s "ab")) "a" t))) - (should (peg-parse-string ((s (range ?0 ?9))) "0" t)) - (should (not (peg-parse-string ((s (range ?0 ?9))) "a" t))) - (should (peg-parse-string ((s [0-9])) "0" t)) - (should (not (peg-parse-string ((s [0-9])) "a" t))) - (should (not (peg-parse-string ((s [0-9])) "" t))) - (should (peg-parse-string ((s (any))) "0" t)) - (should (not (peg-parse-string ((s (any))) "" t))) - (should (peg-parse-string ((s (eob))) "" t)) - (should (peg-parse-string ((s (not (eob)))) "a" t)) - (should (peg-parse-string ((s (or "a" "b"))) "a" t)) - (should (peg-parse-string ((s (or "a" "b"))) "b" t)) - (should (not (peg-parse-string ((s (or "a" "b"))) "c" t))) - (should (peg-parse-string (and "a" "b") "ab" t)) - (should (peg-parse-string ((s (and "a" "b"))) "abc" t)) - (should (not (peg-parse-string (and "a" "b") "ba" t))) - (should (peg-parse-string ((s (and "a" "b" "c"))) "abc" t)) - (should (peg-parse-string ((s (* "a") "b" (eob))) "b" t)) - (should (peg-parse-string ((s (* "a") "b" (eob))) "ab" t)) - (should (peg-parse-string ((s (* "a") "b" (eob))) "aaab" t)) - (should (not (peg-parse-string ((s (* "a") "b" (eob))) "abc" t))) - (should (peg-parse-string ((s "")) "abc" t)) - (should (peg-parse-string ((s "" (eob))) "" t)) - (should (peg-parse-string ((s (opt "a") "b")) "abc" t)) - (should (peg-parse-string ((s (opt "a") "b")) "bc" t)) - (should (not (peg-parse-string ((s (or))) "ab" t))) - (should (peg-parse-string ((s (and))) "ab" t)) - (should (peg-parse-string ((s (and))) "" t)) - (should (peg-parse-string ((s ["^"])) "^" t)) - (should (peg-parse-string ((s ["^a"])) "a" t)) - (should (peg-parse-string ["-"] "-" t)) - (should (peg-parse-string ((s ["]-"])) "]" t)) - (should (peg-parse-string ((s ["^]"])) "^" t)) - (should (peg-parse-string ((s [alpha])) "z" t)) - (should (not (peg-parse-string ((s [alpha])) "0" t))) - (should (not (peg-parse-string ((s [alpha])) "" t))) - (should (not (peg-parse-string ((s ["][:alpha:]"])) "z" t))) - (should (peg-parse-string ((s (bob))) "" t)) - (should (peg-parse-string ((s (bos))) "x" t)) - (should (not (peg-parse-string ((s (bos))) " x" t))) - (should (peg-parse-string ((s "x" (eos))) "x" t)) - (should (peg-parse-string ((s (syntax-class whitespace))) " " t)) - (should (peg-parse-string ((s (= "foo"))) "foo" t)) - (should (let ((f "foo")) (peg-parse-string ((s (= f))) "foo" t))) - (should (not (peg-parse-string ((s (= "foo"))) "xfoo" t))) - (should (equal (peg-parse-string ((s `(-- 1 2))) "") '(2 1))) - (should (equal (peg-parse-string ((s `(-- 1 2) `(a b -- a b))) "") '(2 1))) - (should (equal (peg-parse-string ((s (or (and (any) s) - (substring [0-9])))) - "ab0cd1ef2gh") - '("2"))) - ;; The PEG rule `other' doesn't exist, which will cause a byte-compiler - ;; warning, but not an error at run time because the rule is not actually - ;; used in this particular case. - (should (equal (peg-parse-string ((s (substring (or "a" other))) - ;; Unused left-recursive rule, should - ;; cause a byte-compiler warning. - (r (* "a") r)) - "af") - '("a"))) - (should (equal (peg-parse-string ((s (list x y)) - (x `(-- 1)) - (y `(-- 2))) - "") - '((1 2)))) - (should (equal (peg-parse-string ((s (list (* x))) - (x "" `(-- 'x))) - "xxx") - ;; The empty loop body should be matched once! - '((x)))) - (should (equal (peg-parse-string ((s (list (* x))) - (x "x" `(-- 'x))) - "xxx") - '((x x x)))) - (should (equal (peg-parse-string ((s (region (* x))) - (x "x" `(-- 'x))) - "xxx") - ;; FIXME: Since string positions start at 0, this should - ;; really be '(3 x x x 0) !! - '(4 x x x 1))) - (should (equal (peg-parse-string ((s (region (list (* x)))) - (x "x" `(-- 'x 'y))) - "xxx") - '(4 (x y x y x y) 1))) - (should (equal (with-temp-buffer - (save-excursion (insert "abcdef")) - (list - (peg-run (peg "a" - (replace "bc" "x") - (replace "de" "y") - "f")) - (buffer-string))) - '(t "axyf"))) - (with-temp-buffer - (insert "toro") - (goto-char (point-min)) - (should (peg-run (peg "to"))) - (should-not (peg-run (peg "to"))) - (should (peg-run (peg "ro"))) - (should (eobp))) - (with-temp-buffer - (insert " ") - (goto-char (point-min)) - (peg-run (peg (+ (syntax-class whitespace)))) - (should (eobp))) - ) - -;;; Examples: - -;; peg-ex-recognize-int recognizes integers. An integer begins with a -;; optional sign, then follows one or more digits. Digits are all -;; characters from 0 to 9. -;; -;; Notes: -;; 1) "" matches the empty sequence, i.e. matches without consuming -;; input. -;; 2) [0-9] is the character range from 0 to 9. This can also be -;; written as (range ?0 ?9). Note that 0-9 is a symbol. -(defun peg-ex-recognize-int () - (with-peg-rules ((number sign digit (* digit)) - (sign (or "+" "-" "")) - (digit [0-9])) - (peg-run (peg number)))) - -;; peg-ex-parse-int recognizes integers and computes the corresponding -;; value. The grammar is the same as for `peg-ex-recognize-int' -;; augmented with parsing actions. Unfortunaletly, the actions add -;; quite a bit of clutter. -;; -;; The actions for the sign rule push -1 on the stack for a minus sign -;; and 1 for plus or no sign. -;; -;; The action for the digit rule pushes the value for a single digit. -;; -;; The action `(a b -- (+ (* a 10) b)), takes two items from the stack -;; and pushes the first digit times 10 added to the second digit. -;; -;; The action `(sign val -- (* sign val)), multiplies val with the -;; sign (1 or -1). -(defun peg-ex-parse-int () - (with-peg-rules ((number sign digit (* digit - `(a b -- (+ (* a 10) b))) - `(sign val -- (* sign val))) - (sign (or (and "+" `(-- 1)) - (and "-" `(-- -1)) - (and "" `(-- 1)))) - (digit [0-9] `(-- (- (char-before) ?0)))) - (peg-run (peg number)))) - -;; Put point after the ) and press C-x C-e -;; (peg-ex-parse-int)-234234 - -;; Parse arithmetic expressions and compute the result as side effect. -(defun peg-ex-arith () - (peg-parse - (expr _ sum eol) - (sum product (* (or (and "+" _ product `(a b -- (+ a b))) - (and "-" _ product `(a b -- (- a b)))))) - (product value (* (or (and "*" _ value `(a b -- (* a b))) - (and "/" _ value `(a b -- (/ a b)))))) - (value (or (and (substring number) `(string -- (string-to-number string))) - (and "(" _ sum ")" _))) - (number (+ [0-9]) _) - (_ (* [" \t"])) - (eol (or "\n" "\r\n" "\r")))) - -;; (peg-ex-arith) 1 + 2 * 3 * (4 + 5) -;; (peg-ex-arith) 1 + 2 ^ 3 * (4 + 5) ; fails to parse - -;; Parse URI according to RFC 2396. -(defun peg-ex-uri () - (peg-parse - (URI-reference (or absoluteURI relativeURI) - (or (and "#" (substring fragment)) - `(-- nil)) - `(scheme user host port path query fragment -- - (list :scheme scheme :user user - :host host :port port - :path path :query query - :fragment fragment))) - (absoluteURI (substring scheme) ":" (or hier-part opaque-part)) - (hier-part ;(-- user host port path query) - (or net-path - (and `(-- nil nil nil) - abs-path)) - (or (and "?" (substring query)) - `(-- nil))) - (net-path "//" authority (or abs-path `(-- nil))) - (abs-path "/" path-segments) - (path-segments segment (list (* "/" segment)) `(s l -- (cons s l))) - (segment (substring (* pchar) (* ";" param))) - (param (* pchar)) - (pchar (or unreserved escaped [":@&=+$,"])) - (query (* uric)) - (fragment (* uric)) - (relativeURI (or net-path abs-path rel-path) (opt "?" query)) - (rel-path rel-segment (opt abs-path)) - (rel-segment (+ unreserved escaped [";@&=+$,"])) - (authority (or server reg-name)) - (server (or (and (or (and (substring userinfo) "@") - `(-- nil)) - hostport) - `(-- nil nil nil))) - (userinfo (* (or unreserved escaped [";:&=+$,"]))) - (hostport (substring host) (or (and ":" (substring port)) - `(-- nil))) - (host (or hostname ipv4address)) - (hostname (* domainlabel ".") toplabel (opt ".")) - (domainlabel alphanum - (opt (* (or alphanum "-") (if alphanum)) - alphanum)) - (toplabel alpha - (* (or alphanum "-") (if alphanum)) - alphanum) - (ipv4address (+ digit) "." (+ digit) "." (+ digit) "." (+ digit)) - (port (* digit)) - (scheme alpha (* (or alpha digit ["+-."]))) - (reg-name (or unreserved escaped ["$,;:@&=+"])) - (opaque-part uric-no-slash (* uric)) - (uric (or reserved unreserved escaped)) - (uric-no-slash (or unreserved escaped [";?:@&=+$,"])) - (reserved (set ";/?:@&=+$,")) - (unreserved (or alphanum mark)) - (escaped "%" hex hex) - (hex (or digit [A-F] [a-f])) - (mark (set "-_.!~*'()")) - (alphanum (or alpha digit)) - (alpha (or lowalpha upalpha)) - (lowalpha [a-z]) - (upalpha [A-Z]) - (digit [0-9]))) - -;; (peg-ex-uri)http://luser@www.foo.com:8080/bar/baz.html?x=1#foo -;; (peg-ex-uri)file:/bar/baz.html?foo=df#x - -;; Split STRING where SEPARATOR occurs. -(defun peg-ex-split (string separator) - (peg-parse-string ((s (list (* (* sep) elt))) - (elt (substring (+ (not sep) (any)))) - (sep (= separator))) - string)) - -;; (peg-ex-split "-abc-cd-" "-") - -;; Parse a lisp style Sexp. -;; [To keep the example short, ' and . are handled as ordinary symbol.] -(defun peg-ex-lisp () - (peg-parse - (sexp _ (or string list number symbol)) - (_ (* (or [" \n\t"] comment))) - (comment ";" (* (not (or "\n" (eob))) (any))) - (string "\"" (substring (* (not "\"") (any))) "\"") - (number (substring (opt (set "+-")) (+ digit)) - (if terminating) - `(string -- (string-to-number string))) - (symbol (substring (and symchar (* (not terminating) symchar))) - `(s -- (intern s))) - (symchar [a-z A-Z 0-9 "-;!#%&'*+,./:;<=>?@[]^_`{|}~"]) - (list "(" `(-- (cons nil nil)) `(hd -- hd hd) - (* sexp `(tl e -- (setcdr tl (list e)))) - _ ")" `(hd _tl -- (cdr hd))) - (digit [0-9]) - (terminating (or (set " \n\t();\"'") (eob))))) - -;; (peg-ex-lisp) - -;; We try to detect left recursion and report it as error. -(defun peg-ex-left-recursion () - (eval '(peg-parse (exp (or term - (and exp "+" exp))) - (term (or digit - (and term "*" term))) - (digit [0-9])) - t)) - -(defun peg-ex-infinite-loop () - (eval '(peg-parse (exp (* (or "x" - "y" - (action (foo)))))) - t)) - -;; Some efficiency problems: - -;; Find the last digit in a string. -;; Recursive definition with excessive stack usage. -(defun peg-ex-last-digit (string) - (peg-parse-string ((s (or (and (any) s) - (substring [0-9])))) - string)) - -;; (peg-ex-last-digit "ab0cd1ef2gh") -;; (peg-ex-last-digit (make-string 50 ?-)) -;; (peg-ex-last-digit (make-string 1000 ?-)) - -;; Find the last digit without recursion. Doesn't run out of stack, -;; but probably still too inefficient for large inputs. -(defun peg-ex-last-digit2 (string) - (peg-parse-string ((s `(-- nil) - (+ (* (not digit) (any)) - (substring digit) - `(_d1 d2 -- d2))) - (digit [0-9])) - string)) - -;; (peg-ex-last-digit2 "ab0cd1ef2gh") -;; (peg-ex-last-digit2 (concat (make-string 500000 ?-) "8a9b")) -;; (peg-ex-last-digit2 (make-string 500000 ?-)) -;; (peg-ex-last-digit2 (make-string 500000 ?5)) - -(provide 'peg-tests) -;;; peg-tests.el ends here diff --git a/test/lisp/progmodes/peg-tests.el b/test/lisp/progmodes/peg-tests.el new file mode 100644 index 00000000000..864e09b4200 --- /dev/null +++ b/test/lisp/progmodes/peg-tests.el @@ -0,0 +1,367 @@ +;;; peg-tests.el --- Tests of PEG parsers -*- lexical-binding: t; -*- + +;; Copyright (C) 2008-2023 Free Software Foundation, Inc. + +;; This program is free software; you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. + +;; This program is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. + +;; You should have received a copy of the GNU General Public License +;; along with this program. If not, see . + +;;; Commentary: + +;; Tests and examples, that used to live in peg.el wrapped inside an `eval'. + +;;; Code: + +(require 'peg) +(require 'ert) + +;;; Tests: + +(defmacro peg-parse-string (pex string &optional noerror) + "Parse STRING according to PEX. +If NOERROR is non-nil, push nil resp. t if the parse failed +resp. succeeded instead of signaling an error." + (let ((oldstyle (consp (car-safe pex)))) ;PEX is really a list of rules. + `(with-temp-buffer + (insert ,string) + (goto-char (point-min)) + ,(if oldstyle + `(with-peg-rules ,pex + (peg-run (peg ,(caar pex)) + ,(unless noerror '#'peg-signal-failure))) + `(peg-run (peg ,pex) + ,(unless noerror '#'peg-signal-failure)))))) + +(define-peg-rule peg-test-natural () + [0-9] (* [0-9])) + +(ert-deftest peg-test () + (should (peg-parse-string peg-test-natural "99 bottles" t)) + (should (peg-parse-string ((s "a")) "a" t)) + (should (not (peg-parse-string ((s "a")) "b" t))) + (should (peg-parse-string ((s (not "a"))) "b" t)) + (should (not (peg-parse-string ((s (not "a"))) "a" t))) + (should (peg-parse-string ((s (if "a"))) "a" t)) + (should (not (peg-parse-string ((s (if "a"))) "b" t))) + (should (peg-parse-string ((s "ab")) "ab" t)) + (should (not (peg-parse-string ((s "ab")) "ba" t))) + (should (not (peg-parse-string ((s "ab")) "a" t))) + (should (peg-parse-string ((s (range ?0 ?9))) "0" t)) + (should (not (peg-parse-string ((s (range ?0 ?9))) "a" t))) + (should (peg-parse-string ((s [0-9])) "0" t)) + (should (not (peg-parse-string ((s [0-9])) "a" t))) + (should (not (peg-parse-string ((s [0-9])) "" t))) + (should (peg-parse-string ((s (any))) "0" t)) + (should (not (peg-parse-string ((s (any))) "" t))) + (should (peg-parse-string ((s (eob))) "" t)) + (should (peg-parse-string ((s (not (eob)))) "a" t)) + (should (peg-parse-string ((s (or "a" "b"))) "a" t)) + (should (peg-parse-string ((s (or "a" "b"))) "b" t)) + (should (not (peg-parse-string ((s (or "a" "b"))) "c" t))) + (should (peg-parse-string (and "a" "b") "ab" t)) + (should (peg-parse-string ((s (and "a" "b"))) "abc" t)) + (should (not (peg-parse-string (and "a" "b") "ba" t))) + (should (peg-parse-string ((s (and "a" "b" "c"))) "abc" t)) + (should (peg-parse-string ((s (* "a") "b" (eob))) "b" t)) + (should (peg-parse-string ((s (* "a") "b" (eob))) "ab" t)) + (should (peg-parse-string ((s (* "a") "b" (eob))) "aaab" t)) + (should (not (peg-parse-string ((s (* "a") "b" (eob))) "abc" t))) + (should (peg-parse-string ((s "")) "abc" t)) + (should (peg-parse-string ((s "" (eob))) "" t)) + (should (peg-parse-string ((s (opt "a") "b")) "abc" t)) + (should (peg-parse-string ((s (opt "a") "b")) "bc" t)) + (should (not (peg-parse-string ((s (or))) "ab" t))) + (should (peg-parse-string ((s (and))) "ab" t)) + (should (peg-parse-string ((s (and))) "" t)) + (should (peg-parse-string ((s ["^"])) "^" t)) + (should (peg-parse-string ((s ["^a"])) "a" t)) + (should (peg-parse-string ["-"] "-" t)) + (should (peg-parse-string ((s ["]-"])) "]" t)) + (should (peg-parse-string ((s ["^]"])) "^" t)) + (should (peg-parse-string ((s [alpha])) "z" t)) + (should (not (peg-parse-string ((s [alpha])) "0" t))) + (should (not (peg-parse-string ((s [alpha])) "" t))) + (should (not (peg-parse-string ((s ["][:alpha:]"])) "z" t))) + (should (peg-parse-string ((s (bob))) "" t)) + (should (peg-parse-string ((s (bos))) "x" t)) + (should (not (peg-parse-string ((s (bos))) " x" t))) + (should (peg-parse-string ((s "x" (eos))) "x" t)) + (should (peg-parse-string ((s (syntax-class whitespace))) " " t)) + (should (peg-parse-string ((s (= "foo"))) "foo" t)) + (should (let ((f "foo")) (peg-parse-string ((s (= f))) "foo" t))) + (should (not (peg-parse-string ((s (= "foo"))) "xfoo" t))) + (should (equal (peg-parse-string ((s `(-- 1 2))) "") '(2 1))) + (should (equal (peg-parse-string ((s `(-- 1 2) `(a b -- a b))) "") '(2 1))) + (should (equal (peg-parse-string ((s (or (and (any) s) + (substring [0-9])))) + "ab0cd1ef2gh") + '("2"))) + ;; The PEG rule `other' doesn't exist, which will cause a byte-compiler + ;; warning, but not an error at run time because the rule is not actually + ;; used in this particular case. + (should (equal (peg-parse-string ((s (substring (or "a" other))) + ;; Unused left-recursive rule, should + ;; cause a byte-compiler warning. + (r (* "a") r)) + "af") + '("a"))) + (should (equal (peg-parse-string ((s (list x y)) + (x `(-- 1)) + (y `(-- 2))) + "") + '((1 2)))) + (should (equal (peg-parse-string ((s (list (* x))) + (x "" `(-- 'x))) + "xxx") + ;; The empty loop body should be matched once! + '((x)))) + (should (equal (peg-parse-string ((s (list (* x))) + (x "x" `(-- 'x))) + "xxx") + '((x x x)))) + (should (equal (peg-parse-string ((s (region (* x))) + (x "x" `(-- 'x))) + "xxx") + ;; FIXME: Since string positions start at 0, this should + ;; really be '(3 x x x 0) !! + '(4 x x x 1))) + (should (equal (peg-parse-string ((s (region (list (* x)))) + (x "x" `(-- 'x 'y))) + "xxx") + '(4 (x y x y x y) 1))) + (should (equal (with-temp-buffer + (save-excursion (insert "abcdef")) + (list + (peg-run (peg "a" + (replace "bc" "x") + (replace "de" "y") + "f")) + (buffer-string))) + '(t "axyf"))) + (with-temp-buffer + (insert "toro") + (goto-char (point-min)) + (should (peg-run (peg "to"))) + (should-not (peg-run (peg "to"))) + (should (peg-run (peg "ro"))) + (should (eobp))) + (with-temp-buffer + (insert " ") + (goto-char (point-min)) + (peg-run (peg (+ (syntax-class whitespace)))) + (should (eobp))) + ) + +;;; Examples: + +;; peg-ex-recognize-int recognizes integers. An integer begins with a +;; optional sign, then follows one or more digits. Digits are all +;; characters from 0 to 9. +;; +;; Notes: +;; 1) "" matches the empty sequence, i.e. matches without consuming +;; input. +;; 2) [0-9] is the character range from 0 to 9. This can also be +;; written as (range ?0 ?9). Note that 0-9 is a symbol. +(defun peg-ex-recognize-int () + (with-peg-rules ((number sign digit (* digit)) + (sign (or "+" "-" "")) + (digit [0-9])) + (peg-run (peg number)))) + +;; peg-ex-parse-int recognizes integers and computes the corresponding +;; value. The grammar is the same as for `peg-ex-recognize-int' +;; augmented with parsing actions. Unfortunaletly, the actions add +;; quite a bit of clutter. +;; +;; The actions for the sign rule push -1 on the stack for a minus sign +;; and 1 for plus or no sign. +;; +;; The action for the digit rule pushes the value for a single digit. +;; +;; The action `(a b -- (+ (* a 10) b)), takes two items from the stack +;; and pushes the first digit times 10 added to the second digit. +;; +;; The action `(sign val -- (* sign val)), multiplies val with the +;; sign (1 or -1). +(defun peg-ex-parse-int () + (with-peg-rules ((number sign digit (* digit + `(a b -- (+ (* a 10) b))) + `(sign val -- (* sign val))) + (sign (or (and "+" `(-- 1)) + (and "-" `(-- -1)) + (and "" `(-- 1)))) + (digit [0-9] `(-- (- (char-before) ?0)))) + (peg-run (peg number)))) + +;; Put point after the ) and press C-x C-e +;; (peg-ex-parse-int)-234234 + +;; Parse arithmetic expressions and compute the result as side effect. +(defun peg-ex-arith () + (peg-parse + (expr _ sum eol) + (sum product (* (or (and "+" _ product `(a b -- (+ a b))) + (and "-" _ product `(a b -- (- a b)))))) + (product value (* (or (and "*" _ value `(a b -- (* a b))) + (and "/" _ value `(a b -- (/ a b)))))) + (value (or (and (substring number) `(string -- (string-to-number string))) + (and "(" _ sum ")" _))) + (number (+ [0-9]) _) + (_ (* [" \t"])) + (eol (or "\n" "\r\n" "\r")))) + +;; (peg-ex-arith) 1 + 2 * 3 * (4 + 5) +;; (peg-ex-arith) 1 + 2 ^ 3 * (4 + 5) ; fails to parse + +;; Parse URI according to RFC 2396. +(defun peg-ex-uri () + (peg-parse + (URI-reference (or absoluteURI relativeURI) + (or (and "#" (substring fragment)) + `(-- nil)) + `(scheme user host port path query fragment -- + (list :scheme scheme :user user + :host host :port port + :path path :query query + :fragment fragment))) + (absoluteURI (substring scheme) ":" (or hier-part opaque-part)) + (hier-part ;(-- user host port path query) + (or net-path + (and `(-- nil nil nil) + abs-path)) + (or (and "?" (substring query)) + `(-- nil))) + (net-path "//" authority (or abs-path `(-- nil))) + (abs-path "/" path-segments) + (path-segments segment (list (* "/" segment)) `(s l -- (cons s l))) + (segment (substring (* pchar) (* ";" param))) + (param (* pchar)) + (pchar (or unreserved escaped [":@&=+$,"])) + (query (* uric)) + (fragment (* uric)) + (relativeURI (or net-path abs-path rel-path) (opt "?" query)) + (rel-path rel-segment (opt abs-path)) + (rel-segment (+ unreserved escaped [";@&=+$,"])) + (authority (or server reg-name)) + (server (or (and (or (and (substring userinfo) "@") + `(-- nil)) + hostport) + `(-- nil nil nil))) + (userinfo (* (or unreserved escaped [";:&=+$,"]))) + (hostport (substring host) (or (and ":" (substring port)) + `(-- nil))) + (host (or hostname ipv4address)) + (hostname (* domainlabel ".") toplabel (opt ".")) + (domainlabel alphanum + (opt (* (or alphanum "-") (if alphanum)) + alphanum)) + (toplabel alpha + (* (or alphanum "-") (if alphanum)) + alphanum) + (ipv4address (+ digit) "." (+ digit) "." (+ digit) "." (+ digit)) + (port (* digit)) + (scheme alpha (* (or alpha digit ["+-."]))) + (reg-name (or unreserved escaped ["$,;:@&=+"])) + (opaque-part uric-no-slash (* uric)) + (uric (or reserved unreserved escaped)) + (uric-no-slash (or unreserved escaped [";?:@&=+$,"])) + (reserved (set ";/?:@&=+$,")) + (unreserved (or alphanum mark)) + (escaped "%" hex hex) + (hex (or digit [A-F] [a-f])) + (mark (set "-_.!~*'()")) + (alphanum (or alpha digit)) + (alpha (or lowalpha upalpha)) + (lowalpha [a-z]) + (upalpha [A-Z]) + (digit [0-9]))) + +;; (peg-ex-uri)http://luser@www.foo.com:8080/bar/baz.html?x=1#foo +;; (peg-ex-uri)file:/bar/baz.html?foo=df#x + +;; Split STRING where SEPARATOR occurs. +(defun peg-ex-split (string separator) + (peg-parse-string ((s (list (* (* sep) elt))) + (elt (substring (+ (not sep) (any)))) + (sep (= separator))) + string)) + +;; (peg-ex-split "-abc-cd-" "-") + +;; Parse a lisp style Sexp. +;; [To keep the example short, ' and . are handled as ordinary symbol.] +(defun peg-ex-lisp () + (peg-parse + (sexp _ (or string list number symbol)) + (_ (* (or [" \n\t"] comment))) + (comment ";" (* (not (or "\n" (eob))) (any))) + (string "\"" (substring (* (not "\"") (any))) "\"") + (number (substring (opt (set "+-")) (+ digit)) + (if terminating) + `(string -- (string-to-number string))) + (symbol (substring (and symchar (* (not terminating) symchar))) + `(s -- (intern s))) + (symchar [a-z A-Z 0-9 "-;!#%&'*+,./:;<=>?@[]^_`{|}~"]) + (list "(" `(-- (cons nil nil)) `(hd -- hd hd) + (* sexp `(tl e -- (setcdr tl (list e)))) + _ ")" `(hd _tl -- (cdr hd))) + (digit [0-9]) + (terminating (or (set " \n\t();\"'") (eob))))) + +;; (peg-ex-lisp) + +;; We try to detect left recursion and report it as error. +(defun peg-ex-left-recursion () + (eval '(peg-parse (exp (or term + (and exp "+" exp))) + (term (or digit + (and term "*" term))) + (digit [0-9])) + t)) + +(defun peg-ex-infinite-loop () + (eval '(peg-parse (exp (* (or "x" + "y" + (action (foo)))))) + t)) + +;; Some efficiency problems: + +;; Find the last digit in a string. +;; Recursive definition with excessive stack usage. +(defun peg-ex-last-digit (string) + (peg-parse-string ((s (or (and (any) s) + (substring [0-9])))) + string)) + +;; (peg-ex-last-digit "ab0cd1ef2gh") +;; (peg-ex-last-digit (make-string 50 ?-)) +;; (peg-ex-last-digit (make-string 1000 ?-)) + +;; Find the last digit without recursion. Doesn't run out of stack, +;; but probably still too inefficient for large inputs. +(defun peg-ex-last-digit2 (string) + (peg-parse-string ((s `(-- nil) + (+ (* (not digit) (any)) + (substring digit) + `(_d1 d2 -- d2))) + (digit [0-9])) + string)) + +;; (peg-ex-last-digit2 "ab0cd1ef2gh") +;; (peg-ex-last-digit2 (concat (make-string 500000 ?-) "8a9b")) +;; (peg-ex-last-digit2 (make-string 500000 ?-)) +;; (peg-ex-last-digit2 (make-string 500000 ?5)) + +(provide 'peg-tests) +;;; peg-tests.el ends here