Add manual section about how to avoid regexp problems

author Mattias Engdegård <mattiase@acm.org>

Wed, 3 Nov 2021 12:42:25 +0000 (13:42 +0100)

committer Mattias Engdegård <mattiase@acm.org>

Wed, 3 Nov 2021 12:50:12 +0000 (13:50 +0100)
author Mattias Engdegård <mattiase@acm.org>
Wed, 3 Nov 2021 12:42:25 +0000 (13:42 +0100)
committer Mattias Engdegård <mattiase@acm.org>
Wed, 3 Nov 2021 12:50:12 +0000 (13:50 +0100)
diff --git a/doc/lispref/elisp.texi b/doc/lispref/elisp.texi

index c4bd97bf81508e3e21cbebd1939fb224b7f1a186..6057691239f9414148cc84a98626bf4f0ff7393c 100644 (file)
--- a/doc/lispref/elisp.texi
+++ b/doc/lispref/elisp.texi
@@ -1316,6 +1316,7 @@ Regular Expressions
  * Rx Notation::             An alternative, structured regexp notation.
  @end ifnottex
  * Regexp Functions::        Functions for operating on regular expressions.
+* Regexp Problems::         Some problems and how they may be avoided.
  
  Syntax of Regular Expressions
  
diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi

index d27cfb8c0c7292f88f634b9051c6477c4a71a547..ce79765b7334a3c092564a09bd27b339234087ad 100644 (file)
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -263,6 +263,7 @@ case-sensitive.
  * Rx Notation::             An alternative, structured regexp notation.
  @end ifnottex
  * Regexp Functions::        Functions for operating on regular expressions.
+* Regexp Problems::         Some problems and how they may be avoided.
  @end menu
  
  @node Syntax of Regexps
@@ -343,15 +344,6 @@ first tries to match all three @samp{a}s; but the rest of the pattern is
  The next alternative is for @samp{a*} to match only two @samp{a}s.  With
  this choice, the rest of the regexp matches successfully.
  
-@strong{Warning:} Nested repetition operators can run for a very
-long time, if they lead to ambiguous matching.  For
-example, trying to match the regular expression @samp{\(x+y*\)*a}
-against the string @samp{xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxz} could
-take hours before it ultimately fails.  Emacs may try each way of
-grouping the @samp{x}s before concluding that none of them can work.
-In general, avoid expressions that can match the same string in
-multiple ways.
-
  @item @samp{+}
  @cindex @samp{+} in regexp
  is a postfix operator, similar to @samp{*} except that it must match
@@ -1884,6 +1876,73 @@ variables that may be set to a pattern that actually matches
  something.
  @end defvar
  
+@node Regexp Problems
+@subsection Problems with Regular Expressions
+@cindex regular expression problems
+@cindex regexp stack overflow
+@cindex stack overflow in regexp
+
+The Emacs regexp implementation, like many of its kind, is generally
+robust but occasionally causes trouble in either of two ways: matching
+may run out of internal stack space and signal an error, and it can
+take a long time to complete.  The advice below will make these
+symptoms less likely and help alleviate problems that do arise.
+
+@itemize
+@item
+Anchor regexps at the beginning of a line, string or buffer using
+zero-width assertions (@samp{^} and @code{\`}). This takes advantage
+of fast paths in the implementation and can avoid futile matching
+attempts.  Other zero-width assertions may also bring benefits by
+causing a match to fail early.
+
+@item
+Avoid or-patterns in favour of character alternatives: write
+@samp{[ab]} instead of @samp{a\|b}.  Recall that @samp{\s-} and @samp{\sw}
+are equivalent to @samp{[[:space:]]} and @samp{[[:word:]]}, respectively.
+
+@item
+Since the last branch of an or-pattern does not add a backtrack point
+on the stack, consider putting the most likely matched pattern last.
+For example, @samp{^\(?:a\|.b\)*c} will run out of stack if trying to
+match a very long string of @samp{a}s, but the equivalent
+@samp{^\(?:.b\|a\)*c} will not.
+
+(It is a trade-off: successfully matched or-patterns run faster with
+the most frequently matched pattern first.)
+
+@item
+Try to ensure that any part of the text can only match in a single
+way.  For example, @samp{a*a*} will match the same set of strings as
+@samp{a*}, but the former can do so in many ways and will therefore
+cause slow backtracking if the match fails later on.  Make or-pattern
+branches mutually exclusive if possible, so that matching will not go
+far into more than one branch before failing.
+
+Be especially careful with nested repetitions: they can easily result
+in very slow matching in the presence of ambiguities.  For example,
+@samp{\(?:a*b*\)+c} will take a long time attempting to match even a
+moderately long string of @samp{a}s before failing.  The equivalent
+@samp{\(?:a\|b\)*c} is much faster, and @samp{[ab]*c} better still.
+
+@item
+Don't use capturing groups unless they are really needed; that is, use
+@samp{\(?:@dots{}\)} instead of @samp{\(@dots{}\)} for bracketing
+purposes.
+
+@ifnottex
+@item
+Consider using @code{rx} (@pxref{Rx Notation}); it can optimise some
+or-patterns automatically and will never introduce capturing groups
+unless explicitly requested.
+@end ifnottex
+@end itemize
+
+If you run into regexp stack overflow despite following the above
+advice, don't be afraid of performing the matching in multiple
+function calls, each using a simpler regexp where backtracking can
+more easily be contained.
+
  @node Regexp Search
  @section Regular Expression Searching
  @cindex regular expression searching
diff --git a/etc/PROBLEMS b/etc/PROBLEMS

index daff102a0d37720fd51380f3dd54f8d8b483c1ca..ede83a6e7c0e09ff26301aada9c1819fb08c9260 100644 (file)
--- a/etc/PROBLEMS
+++ b/etc/PROBLEMS
@@ -742,18 +742,6 @@ completed" message that tls.el relies upon, causing affected Emacs
  functions to hang.  To work around the problem, use older or newer
  versions of gnutls-cli, or use Emacs's built-in gnutls support.
  
-*** Stack overflow in regexp matcher.
-Due to fundamental limitations in the way Emacs' regular expression
-engine is designed, you might run into combinatorial explosions in
-backtracking with certain regexps.
-
-Avoid "\(...\(...\)*...\)*" and "\(...\)*\(...\)*".  Look for a way to
-anchor your regular expression, to avoid matching the null string in
-infinite ways.  The latter is what creates backtrack points, and
-eventual overflow in practice.
-
-(Also prefer "\(?:...\)" to "\(...\)" unless you need the latter.)
-
  * Runtime problems related to font handling
  
  ** Characters are displayed as empty boxes or with wrong font under X.
author	Mattias Engdegård <mattiase@acm.org>
	Wed, 3 Nov 2021 12:42:25 +0000 (13:42 +0100)
committer	Mattias Engdegård <mattiase@acm.org>
	Wed, 3 Nov 2021 12:50:12 +0000 (13:50 +0100)
doc/lispref/elisp.texi		patch \| blob \| history
doc/lispref/searching.texi		patch \| blob \| history
etc/PROBLEMS		patch \| blob \| history