From 7137c96ea520d9414c320d24a3d59407c4854af8 Mon Sep 17 00:00:00 2001 From: Eshel Yaron Date: Sat, 10 Feb 2024 14:58:09 +0100 Subject: [PATCH] New post about 26 YO Emacs bug --- ...ixing-a-twenty-six-years-old-emacs-bug.org | 277 ++++++++++++++++++ 1 file changed, 277 insertions(+) create mode 100644 source/posts/2024-02-10-fixing-a-twenty-six-years-old-emacs-bug.org diff --git a/source/posts/2024-02-10-fixing-a-twenty-six-years-old-emacs-bug.org b/source/posts/2024-02-10-fixing-a-twenty-six-years-old-emacs-bug.org new file mode 100644 index 0000000..528d79c --- /dev/null +++ b/source/posts/2024-02-10-fixing-a-twenty-six-years-old-emacs-bug.org @@ -0,0 +1,277 @@ +#+TITLE: Fixing a Twenty Six Years Old Emacs Bug +#+SUBTITLE: They got it backward those days... +#+DESCRIPTION: Post by Eshel Yaron about Fixing a Twenty Six Years Old Emacs Bug +#+KEYWORDS: emacs +#+DATE: 2024-02-10 + +@@html:
@@Created on [{{{date}}}], last updated [{{{modification-time(%Y-%m-%d, t)}}}]@@html:
@@ + +A couple of weeks ago, an interesting bug report landed on the [[note:emacs][Emacs]] +bug tracker: + +#+begin_quote +Subject: bug#68762: 30.0.50; thing-at-point for an e-mail adress stops at "." + +When I use (thing-at-point 'email) and point is on an e-mail address +like my.name@website.com, thing-at-point only gets name@website.com +and loses the first part before the "." character. + +This happens after I upgraded from 29.1 to emacs 30.0.50. +#+end_quote + +The reason this is a great bug report is that it's easy to reproduce +right there in the buffer showing the bug report itself. Namely, +since I was reading this inside Emacs, this was as simple as placing +the cursor right before where it says ~(thing-at-point 'email)~, +hitting ~C-M-k~ to "kill" (copy) the whole expression, then ~C-s name~ +to place the cursor in the middle of ~my.name@website.com~, and lastly +~M-: C-y RET~ to evaluate the expression. That's it. You see the +return value ~name@website.com~, instead of the expected +~my.name@website.com~, and that makes the issue apparent on the spot. + +The ~thing-at-point~ function does just what it says on the tin: you +specify a "thing" you're looking for, and the function searches for +that thing around your cursor (point). This function has different +methods for finding different "things". For ~email~, it uses regular +expression matching. In particular, it uses this regular expression: + +#+begin_src emacs-lisp + (defvar thing-at-point-email-regexp + "?" + "A regular expression probably matching an email address. + This does not match the real name portion, only the address, optionally + with angle brackets.") +#+end_src + +This is slightly daunting at first look, but actually the regular +expression itself is fine. Well, in general, trying to match email +addresses with a regular expression is asking for troubles, really. +This regular expression ain't perfectly correct either, in the sense +that there are valid email addresses that it fails to match, but it +/does/ match the example ~my.name@example.com~ from the bug report. +So what gives? + +This regular expression did change since Emacs 29.1, as we can quickly +see by selecting the above variable definition with ~C-M-SPC~ and +hitting ~C-x v h~: + +#+begin_src diff + Date: Sun Dec 3 13:49:07 2023 +0100 + + Add slashes to 'thing-at-point-email-regexp' + + ,* lisp/thingatpt.el (thing-at-point-email-regexp): Allow for + a (thing-at-point 'email) query to match addresses with slashes, as + used by Sourcehut. (Bug#67600) + + diff --git a/lisp/thingatpt.el b/lisp/thingatpt.el + --- a/lisp/thingatpt.el + +++ b/lisp/thingatpt.el + @@ -654,5 +654,5 @@ + (defvar thing-at-point-email-regexp + - "?" + + "?" + "A regular expression probably matching an email address. + This does not match the real name portion, only the address, optionally + with angle brackets.") +#+end_src + +#+begin_src diff + Date: Wed Feb 15 12:16:11 2023 +0100 + + Improve thing-at-point email detection + + ,* lisp/thingatpt.el (thing-at-point-email-regexp): Allow numbers at + the start of the user portion, and disallow '.' at the start. Also + disallow '.' at the start of the domain portion. + ... + + diff --git a/lisp/thingatpt.el b/lisp/thingatpt.el + --- a/lisp/thingatpt.el + +++ b/lisp/thingatpt.el + @@ -647,5 +647,5 @@ + (defvar thing-at-point-email-regexp + - "?" + + "?" + "A regular expression probably matching an email address. + This does not match the real name portion, only the address, optionally + with angle brackets.") +#+end_src + +Neither of these changes should have directly caused +~thing-at-point-email-regexp~ to stop matching ~my.name@example.com~, +though. The problem is with the specific way in which +~thing-at-point~ /uses/ this regular expression, and others. Usually, +one applies a regular expression to strings starting from their +beginning, and indeed, the following form evaluates to ~0~, which +shows that the example email address matches the regular expression: + +#+begin_src emacs-lisp + (string-match thing-at-point-email-regexp "my.name@example.com") + => 0 +#+end_src + +However, ~thing-at-point~ doesn't know in advance where the "thing" +starts. It needs to look for a match /somewhere/ around point. To do +that, it uses a helper function called ~thing-at-point-looking-at~. +This much I knew more or less before stumbling upon this bug report. +So I decided to take a look at ~thing-at-point-looking-at~, and see +how it performs the fabled around-point regexp matching. And, what do +you know, there it was! Among the trees of parentheses, a twenty six +years old bug. Check it out: + +#+begin_src emacs-lisp + (defun thing-at-point-looking-at (regexp &optional distance) + "Return non-nil if point is in or just after a match for REGEXP. + Set the match data from the earliest such match ending at or after + point. + + Optional argument DISTANCE limits search for REGEXP forward and + back from point." + (save-excursion + (let ((old-point (point)) + (forward-bound (and distance (+ (point) distance))) + (backward-bound (and distance (- (point) distance))) + match prev-pos new-pos) + (and (looking-at regexp) + (>= (match-end 0) old-point) + (setq match (point))) + ;; Search back repeatedly from end of next match. + ;; This may fail if next match ends before this match does. + (re-search-forward regexp forward-bound 'limit) + (setq prev-pos (point)) + (while (and (setq new-pos (re-search-backward regexp backward-bound t)) + ;; Avoid inflooping with some regexps, such as "^", + ;; matching which never moves point. + (< new-pos prev-pos) + (or (> (match-beginning 0) old-point) + (and (looking-at regexp) ; Extend match-end past search start + (>= (match-end 0) old-point) + (setq match (point)))))) + (if (not match) nil + (goto-char match) + ;; Back up a char at a time in case search skipped + ;; intermediate match straddling search start pos. + (while (and (not (bobp)) + (progn (backward-char 1) (looking-at regexp)) + (>= (match-end 0) old-point) + (setq match (point)))) + (goto-char match) + (looking-at regexp))))) +#+end_src + +There are some details to unravel here, but before diving in I first +did the ~C-x v h~ trick once more, to see if anything obvious broke +since Emacs 29.1. Alas, ~thing-at-point-looking-at~ nearly haven't +changed since its introduction in the summer of 1997: + +#+begin_src diff + commit d9cc804bf8f440fa73f49abe7977518554126601 + Author: Richard M. Stallman + Date: Fri Jul 4 19:59:49 1997 +0000 + + ... + (thing-at-point-looking-at): New function, adapted from old + browse-url-looking-at. + ... + + diff --git a/lisp/thingatpt.el b/lisp/thingatpt.el + --- a/lisp/thingatpt.el + +++ b/lisp/thingatpt.el + @@ -200,0 +273,27 @@ + +(defun thing-at-point-looking-at (regexp) + + "Return non-nil if point is in or just after a match for REGEXP. + +Set the match data from the earliest such match ending at or after + +point." + + (save-excursion + + (let ((old-point (point)) match) + + (and (looking-at regexp) + + (>= (match-end 0) old-point) + + (setq match (point))) + + ;; Search back repeatedly from end of next match. + + ;; This may fail if next match ends before this match does. + + (re-search-forward regexp nil 'limit) + + (while (and (re-search-backward regexp nil t) + + (or (> (match-beginning 0) old-point) + + (and (looking-at regexp) ; Extend match-end past search start + + (>= (match-end 0) old-point) + + (setq match (point)))))) + + (if (not match) nil + + (goto-char match) + + ;; Back up a char at a time in case search skipped + + ;; intermediate match straddling search start pos. + + (while (and (not (bobp)) + + (progn (backward-char 1) (looking-at regexp)) + + (>= (match-end 0) old-point) + + (setq match (point)))) + + (goto-char match) + + (looking-at regexp))))) + +#+end_src + +I found it funny to think about how while RMS wrote this code, +I... Well, I certainly wasn't doing anything useful back then. +Probably eating paste or some such. + +Indeed the bug was hiding there all these years, until finally the +change from last February---namely, the "disallow '.' at the start" +part---brought it to light. It's a simple matter of orientation: just +like the comments suggests, ~thing-at-point-looking-at~ would search +forward from point, and then start searching backward. But since +Emacs doesn't really have a backward-regexp-matching capability, its +~re-search-backward~ works by /moving backward/ and /searching +forward/ until it finds a match. That means that backward regexp +matching is inherently non-greedy, it returns the shortest match that +ends before point. ~thing-at-point-looking-at~ ventures to return the +/earliest/ match, as its docstring proclaims, so it employs an +additional adjustment after calling ~re-search-backward~, where it +moves back one character at the time and checks whether it still +matches from the new position. And that's the crux of the issue. +Extending the match backward that way, character by character, fails +miserably when even one position between the starting point and the +beginning of the full match is /not/ a match. This is what happens +when ~thing-at-point-looking-at~ tried to extend the match backward +past the ~.~ in ~my.name~, since the new value of +~thing-at-point-email-regexp~ in Emacs 30 (correctly) rejects +addresses that start with a ~.~. + +To fix this issue, I took a stab at shaking up the decades old +~thing-at-point-looking-at~. The idea seemed clear: backward regexp +searching is, well, kinda backward. If we just start from the +beginning and only search forward, we should find the earliest match +without breaking a sweat. This leads to the following implementation: + +#+begin_src emacs-lisp + (defun thing-at-point-looking-at (regexp &optional distance) + "Return non-nil if point is in or just after a match for REGEXP. + Set the match data from the earliest such match ending at or after + point. + + Optional argument DISTANCE limits search for REGEXP forward and + back from point." + (let* ((old (point)) + (beg (if distance (max (point-min) (- old distance)) (point-min))) + (end (if distance (min (point-max) (+ old distance)))) + prev match) + (save-excursion + (goto-char beg) + (while (and (setq prev (point) + match (re-search-forward regexp end t)) + (< (match-end 0) old)) + ;; Avoid inflooping when `regexp' matches the empty string. + (unless (< prev (point)) (forward-char)))) + (and match (<= (match-beginning 0) old (match-end 0))))) +#+end_src + +It even looks simpler, right? Less room for bugs to lurk. So I've +proposed this change as a solution for Bug#68762, and a few days ago +it landed on the Emacs master branch. + +I really hope this isn't /too/ simplistic. All existing tests, and +some new ones, still pass, but the tests don't cover everything, and a +lot of code could have come to rely on this function during the years. +If you're running Emacs 30 and see anything strange that you think +might be related, please report it via ~M-x report-emacs-bug~. And if +you ever need to write some code to match email addresses, my tip for +you is to stay away from regular expressions, and to use a proper +parser instead. -- 2.39.2