--- /dev/null
+#+TITLE: Fixing a Twenty Six Years Old Emacs Bug
+#+SUBTITLE: They got it backward those days...
+#+DESCRIPTION: Post by Eshel Yaron about Fixing a Twenty Six Years Old Emacs Bug
+#+KEYWORDS: emacs
+#+DATE: 2024-02-10
+
+@@html:<div class="metadata">@@Created on [{{{date}}}], last updated [{{{modification-time(%Y-%m-%d, t)}}}]@@html:</div>@@
+
+A couple of weeks ago, an interesting bug report landed on the [[note:emacs][Emacs]]
+bug tracker:
+
+#+begin_quote
+Subject: bug#68762: 30.0.50; thing-at-point for an e-mail adress stops at "."
+
+When I use (thing-at-point 'email) and point is on an e-mail address
+like my.name@website.com, thing-at-point only gets name@website.com
+and loses the first part before the "." character.
+
+This happens after I upgraded from 29.1 to emacs 30.0.50.
+#+end_quote
+
+The reason this is a great bug report is that it's easy to reproduce
+right there in the buffer showing the bug report itself. Namely,
+since I was reading this inside Emacs, this was as simple as placing
+the cursor right before where it says ~(thing-at-point 'email)~,
+hitting ~C-M-k~ to "kill" (copy) the whole expression, then ~C-s name~
+to place the cursor in the middle of ~my.name@website.com~, and lastly
+~M-: C-y RET~ to evaluate the expression. That's it. You see the
+return value ~name@website.com~, instead of the expected
+~my.name@website.com~, and that makes the issue apparent on the spot.
+
+The ~thing-at-point~ function does just what it says on the tin: you
+specify a "thing" you're looking for, and the function searches for
+that thing around your cursor (point). This function has different
+methods for finding different "things". For ~email~, it uses regular
+expression matching. In particular, it uses this regular expression:
+
+#+begin_src emacs-lisp
+ (defvar thing-at-point-email-regexp
+ "<?[-+_~a-zA-Z0-9/][-+_.~:a-zA-Z0-9/]*@[-a-zA-Z0-9]+[-.a-zA-Z0-9]*>?"
+ "A regular expression probably matching an email address.
+ This does not match the real name portion, only the address, optionally
+ with angle brackets.")
+#+end_src
+
+This is slightly daunting at first look, but actually the regular
+expression itself is fine. Well, in general, trying to match email
+addresses with a regular expression is asking for troubles, really.
+This regular expression ain't perfectly correct either, in the sense
+that there are valid email addresses that it fails to match, but it
+/does/ match the example ~my.name@example.com~ from the bug report.
+So what gives?
+
+This regular expression did change since Emacs 29.1, as we can quickly
+see by selecting the above variable definition with ~C-M-SPC~ and
+hitting ~C-x v h~:
+
+#+begin_src diff
+ Date: Sun Dec 3 13:49:07 2023 +0100
+
+ Add slashes to 'thing-at-point-email-regexp'
+
+ ,* lisp/thingatpt.el (thing-at-point-email-regexp): Allow for
+ a (thing-at-point 'email) query to match addresses with slashes, as
+ used by Sourcehut. (Bug#67600)
+
+ diff --git a/lisp/thingatpt.el b/lisp/thingatpt.el
+ --- a/lisp/thingatpt.el
+ +++ b/lisp/thingatpt.el
+ @@ -654,5 +654,5 @@
+ (defvar thing-at-point-email-regexp
+ - "<?[-+_~a-zA-Z0-9][-+_.~:a-zA-Z0-9]*@[-a-zA-Z0-9]+[-.a-zA-Z0-9]*>?"
+ + "<?[-+_~a-zA-Z0-9/][-+_.~:a-zA-Z0-9/]*@[-a-zA-Z0-9]+[-.a-zA-Z0-9]*>?"
+ "A regular expression probably matching an email address.
+ This does not match the real name portion, only the address, optionally
+ with angle brackets.")
+#+end_src
+
+#+begin_src diff
+ Date: Wed Feb 15 12:16:11 2023 +0100
+
+ Improve thing-at-point email detection
+
+ ,* lisp/thingatpt.el (thing-at-point-email-regexp): Allow numbers at
+ the start of the user portion, and disallow '.' at the start. Also
+ disallow '.' at the start of the domain portion.
+ ...
+
+ diff --git a/lisp/thingatpt.el b/lisp/thingatpt.el
+ --- a/lisp/thingatpt.el
+ +++ b/lisp/thingatpt.el
+ @@ -647,5 +647,5 @@
+ (defvar thing-at-point-email-regexp
+ - "<?[-+_.~a-zA-Z][-+_.~:a-zA-Z0-9]*@[-.a-zA-Z0-9]+>?"
+ + "<?[-+_~a-zA-Z0-9][-+_.~:a-zA-Z0-9]*@[-a-zA-Z0-9]+[-.a-zA-Z0-9]*>?"
+ "A regular expression probably matching an email address.
+ This does not match the real name portion, only the address, optionally
+ with angle brackets.")
+#+end_src
+
+Neither of these changes should have directly caused
+~thing-at-point-email-regexp~ to stop matching ~my.name@example.com~,
+though. The problem is with the specific way in which
+~thing-at-point~ /uses/ this regular expression, and others. Usually,
+one applies a regular expression to strings starting from their
+beginning, and indeed, the following form evaluates to ~0~, which
+shows that the example email address matches the regular expression:
+
+#+begin_src emacs-lisp
+ (string-match thing-at-point-email-regexp "my.name@example.com")
+ => 0
+#+end_src
+
+However, ~thing-at-point~ doesn't know in advance where the "thing"
+starts. It needs to look for a match /somewhere/ around point. To do
+that, it uses a helper function called ~thing-at-point-looking-at~.
+This much I knew more or less before stumbling upon this bug report.
+So I decided to take a look at ~thing-at-point-looking-at~, and see
+how it performs the fabled around-point regexp matching. And, what do
+you know, there it was! Among the trees of parentheses, a twenty six
+years old bug. Check it out:
+
+#+begin_src emacs-lisp
+ (defun thing-at-point-looking-at (regexp &optional distance)
+ "Return non-nil if point is in or just after a match for REGEXP.
+ Set the match data from the earliest such match ending at or after
+ point.
+
+ Optional argument DISTANCE limits search for REGEXP forward and
+ back from point."
+ (save-excursion
+ (let ((old-point (point))
+ (forward-bound (and distance (+ (point) distance)))
+ (backward-bound (and distance (- (point) distance)))
+ match prev-pos new-pos)
+ (and (looking-at regexp)
+ (>= (match-end 0) old-point)
+ (setq match (point)))
+ ;; Search back repeatedly from end of next match.
+ ;; This may fail if next match ends before this match does.
+ (re-search-forward regexp forward-bound 'limit)
+ (setq prev-pos (point))
+ (while (and (setq new-pos (re-search-backward regexp backward-bound t))
+ ;; Avoid inflooping with some regexps, such as "^",
+ ;; matching which never moves point.
+ (< new-pos prev-pos)
+ (or (> (match-beginning 0) old-point)
+ (and (looking-at regexp) ; Extend match-end past search start
+ (>= (match-end 0) old-point)
+ (setq match (point))))))
+ (if (not match) nil
+ (goto-char match)
+ ;; Back up a char at a time in case search skipped
+ ;; intermediate match straddling search start pos.
+ (while (and (not (bobp))
+ (progn (backward-char 1) (looking-at regexp))
+ (>= (match-end 0) old-point)
+ (setq match (point))))
+ (goto-char match)
+ (looking-at regexp)))))
+#+end_src
+
+There are some details to unravel here, but before diving in I first
+did the ~C-x v h~ trick once more, to see if anything obvious broke
+since Emacs 29.1. Alas, ~thing-at-point-looking-at~ nearly haven't
+changed since its introduction in the summer of 1997:
+
+#+begin_src diff
+ commit d9cc804bf8f440fa73f49abe7977518554126601
+ Author: Richard M. Stallman <rms@gnu.org>
+ Date: Fri Jul 4 19:59:49 1997 +0000
+
+ ...
+ (thing-at-point-looking-at): New function, adapted from old
+ browse-url-looking-at.
+ ...
+
+ diff --git a/lisp/thingatpt.el b/lisp/thingatpt.el
+ --- a/lisp/thingatpt.el
+ +++ b/lisp/thingatpt.el
+ @@ -200,0 +273,27 @@
+ +(defun thing-at-point-looking-at (regexp)
+ + "Return non-nil if point is in or just after a match for REGEXP.
+ +Set the match data from the earliest such match ending at or after
+ +point."
+ + (save-excursion
+ + (let ((old-point (point)) match)
+ + (and (looking-at regexp)
+ + (>= (match-end 0) old-point)
+ + (setq match (point)))
+ + ;; Search back repeatedly from end of next match.
+ + ;; This may fail if next match ends before this match does.
+ + (re-search-forward regexp nil 'limit)
+ + (while (and (re-search-backward regexp nil t)
+ + (or (> (match-beginning 0) old-point)
+ + (and (looking-at regexp) ; Extend match-end past search start
+ + (>= (match-end 0) old-point)
+ + (setq match (point))))))
+ + (if (not match) nil
+ + (goto-char match)
+ + ;; Back up a char at a time in case search skipped
+ + ;; intermediate match straddling search start pos.
+ + (while (and (not (bobp))
+ + (progn (backward-char 1) (looking-at regexp))
+ + (>= (match-end 0) old-point)
+ + (setq match (point))))
+ + (goto-char match)
+ + (looking-at regexp)))))
+
+#+end_src
+
+I found it funny to think about how while RMS wrote this code,
+I... Well, I certainly wasn't doing anything useful back then.
+Probably eating paste or some such.
+
+Indeed the bug was hiding there all these years, until finally the
+change from last February---namely, the "disallow '.' at the start"
+part---brought it to light. It's a simple matter of orientation: just
+like the comments suggests, ~thing-at-point-looking-at~ would search
+forward from point, and then start searching backward. But since
+Emacs doesn't really have a backward-regexp-matching capability, its
+~re-search-backward~ works by /moving backward/ and /searching
+forward/ until it finds a match. That means that backward regexp
+matching is inherently non-greedy, it returns the shortest match that
+ends before point. ~thing-at-point-looking-at~ ventures to return the
+/earliest/ match, as its docstring proclaims, so it employs an
+additional adjustment after calling ~re-search-backward~, where it
+moves back one character at the time and checks whether it still
+matches from the new position. And that's the crux of the issue.
+Extending the match backward that way, character by character, fails
+miserably when even one position between the starting point and the
+beginning of the full match is /not/ a match. This is what happens
+when ~thing-at-point-looking-at~ tried to extend the match backward
+past the ~.~ in ~my.name~, since the new value of
+~thing-at-point-email-regexp~ in Emacs 30 (correctly) rejects
+addresses that start with a ~.~.
+
+To fix this issue, I took a stab at shaking up the decades old
+~thing-at-point-looking-at~. The idea seemed clear: backward regexp
+searching is, well, kinda backward. If we just start from the
+beginning and only search forward, we should find the earliest match
+without breaking a sweat. This leads to the following implementation:
+
+#+begin_src emacs-lisp
+ (defun thing-at-point-looking-at (regexp &optional distance)
+ "Return non-nil if point is in or just after a match for REGEXP.
+ Set the match data from the earliest such match ending at or after
+ point.
+
+ Optional argument DISTANCE limits search for REGEXP forward and
+ back from point."
+ (let* ((old (point))
+ (beg (if distance (max (point-min) (- old distance)) (point-min)))
+ (end (if distance (min (point-max) (+ old distance))))
+ prev match)
+ (save-excursion
+ (goto-char beg)
+ (while (and (setq prev (point)
+ match (re-search-forward regexp end t))
+ (< (match-end 0) old))
+ ;; Avoid inflooping when `regexp' matches the empty string.
+ (unless (< prev (point)) (forward-char))))
+ (and match (<= (match-beginning 0) old (match-end 0)))))
+#+end_src
+
+It even looks simpler, right? Less room for bugs to lurk. So I've
+proposed this change as a solution for Bug#68762, and a few days ago
+it landed on the Emacs master branch.
+
+I really hope this isn't /too/ simplistic. All existing tests, and
+some new ones, still pass, but the tests don't cover everything, and a
+lot of code could have come to rely on this function during the years.
+If you're running Emacs 30 and see anything strange that you think
+might be related, please report it via ~M-x report-emacs-bug~. And if
+you ever need to write some code to match email addresses, my tip for
+you is to stay away from regular expressions, and to use a proper
+parser instead.