From c4526e933cdf0e55387767b32b2f18c0abbdae70 Mon Sep 17 00:00:00 2001
From: Eli Zaretskii <eliz@gnu.org>
Date: Sat, 1 Nov 2008 16:36:10 +0000
Subject: [PATCH] (Text Representations): Rewrite to make consistent with Emacs
 23 internal representation of characters.  Document `unibyte-string'.

---
 doc/lispref/ChangeLog     |   6 ++
 doc/lispref/nonascii.texi | 112 ++++++++++++++++++++++++--------------
 etc/NEWS                  |   2 +
 3 files changed, 78 insertions(+), 42 deletions(-)

diff --git a/doc/lispref/ChangeLog b/doc/lispref/ChangeLog
index 68d4996a39b..0037eccc6b5 100644
--- a/doc/lispref/ChangeLog
+++ b/doc/lispref/ChangeLog
@@ -1,3 +1,9 @@
+2008-11-01  Eli Zaretskii  <eliz@gnu.org>
+
+	* nonascii.texi (Text Representations): Rewrite to make consistent
+	with Emacs 23 internal representation of characters.  Document
+	`unibyte-string'.
+
 2008-10-28  Chong Yidong  <cyd@stupidchicken.com>
 
 	* processes.texi (Process Information): Note that process-status
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi
index 4a8205c178d..c70f8e56973 100644
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -10,11 +10,11 @@
 @cindex characters, multi-byte
 @cindex non-@acronym{ASCII} characters
 
-  This chapter covers the special issues relating to non-@acronym{ASCII}
-characters and how they are stored in strings and buffers.
+  This chapter covers the special issues relating to characters and
+how they are stored in strings and buffers.
 
 @menu
-* Text Representations::    Unibyte and multibyte representations
+* Text Representations::    How Emacs represents text.
 * Converting Representations::  Converting unibyte to multibyte and vice versa.
 * Selecting a Representation::  Treating a byte sequence as unibyte or multi.
 * Character Codes::         How unibyte and multibyte relate to
@@ -33,41 +33,62 @@ characters and how they are stored in strings and buffers.
 
 @node Text Representations
 @section Text Representations
-@cindex text representations
-
-  Emacs has two @dfn{text representations}---two ways to represent text
-in a string or buffer.  These are called @dfn{unibyte} and
-@dfn{multibyte}.  Each string, and each buffer, uses one of these two
-representations.  For most purposes, you can ignore the issue of
-representations, because Emacs converts text between them as
-appropriate.  Occasionally in Lisp programming you will need to pay
-attention to the difference.
+@cindex text representation
+
+  Emacs buffers and strings support a large repertoire of characters
+from many different scripts.  This allows users to type and display
+text in almost any known written language.
+
+@cindex character codepoint
+@cindex codespace
+@cindex Unicode
+  To support this multitude of characters and scripts, Emacs closely
+follows the @dfn{Unicode Standard}.  The Unicode Standard assigns a
+unique number, called a @dfn{codepoint}, to each and every character.
+The range of codepoints defined by Unicode, or the Unicode
+@dfn{codespace}, is @code{0..10FFFF} (in hex) inclusive.  Emacs
+extends this range with codepoints in the range @code{3FFF80..3FFFFF},
+which it uses for representing raw 8-bit bytes that cannot be
+interpreted as characters.  Thus, a character codepoint in Emacs is a
+22-bit integer number.
+
+@cindex internal representation of characters
+@cindex characters, representation in buffers and strings
+@cindex multibyte text
+  To conserve memory, Emacs does not hold fixed-length 22-bit numbers
+that are codepoints of text characters within buffers and strings.
+Rather, Emacs uses a variable-length internal representation of
+characters that stores each character as a sequence of 1 to 5 8-bit
+bytes, depending on the magnitude of its codepoint@footnote{
+This internal representation is based on one of the encodings defined
+by the Unicode Standard, called @dfn{UTF-8}, for representing any
+Unicode codepoint, but Emacs extends UTF-8 to represent the additional
+codepoints it uses for raw 8-bit bytes.}.
+For example, any @acronym{ASCII} character takes up only 1 byte, a
+Latin-1 character takes up 2 bytes, etc.  We call this representation
+of text @dfn{multibyte}, because a character can occupy more than one
+byte.
+
+  Outside Emacs, characters can be represented in many different
+encodings, such as ISO-8859-1, GB-2312, Big-5, etc.  Emacs converts
+between these external encodings and the internal representation, as
+appropriate, when it reads text into a buffer or a string, or when it
+writes text to a disk file or passes it to some other process.
+
+  Occasionally, Emacs needs to hold and manipulate encoded text or
+binary non-text data in its buffer or string.  For example, when Emacs
+visits a file, it first reads the file's text verbatim into a buffer,
+and only then converts it to the internal representation.  Before the
+conversion, the buffer holds encoded text.
 
 @cindex unibyte text
-  In unibyte representation, each character occupies one byte and
-therefore the possible character codes range from 0 to 255.  Codes 0
-through 127 are @acronym{ASCII} characters; the codes from 128 through 255
-are used for one non-@acronym{ASCII} character set (you can choose which
-character set by setting the variable @code{nonascii-insert-offset}).
-
-@cindex leading code
-@cindex multibyte text
-@cindex trailing codes
-  In multibyte representation, a character may occupy more than one
-byte, and as a result, the full range of Emacs character codes can be
-stored.  The first byte of a multibyte character is always in the range
-128 through 159 (octal 0200 through 0237).  These values are called
-@dfn{leading codes}.  The second and subsequent bytes of a multibyte
-character are always in the range 160 through 255 (octal 0240 through
-0377); these values are @dfn{trailing codes}.
-
-  Some sequences of bytes are not valid in multibyte text: for example,
-a single isolated byte in the range 128 through 159 is not allowed.  But
-character codes 128 through 159 can appear in multibyte text,
-represented as two-byte sequences.  All the character codes 128 through
-255 are possible (though slightly abnormal) in multibyte text; they
-appear in multibyte buffers and strings when you do explicit encoding
-and decoding (@pxref{Explicit Encoding}).
+  Encoded text is not really text, as far as Emacs is concerned, but
+rather a sequence of raw 8-bit bytes.  We call buffers and strings
+that hold encoded text @dfn{unibyte} buffers and strings, because
+Emacs treats them as a sequence of individual bytes.  In particular,
+Emacs usually displays non-@acronym{ASCII} bytes in them as octal codes
+such as @code{\237}.  We recommend that you never use unibyte buffers
+and strings except for manipulating encoded text or binary non-text data.
 
   In a buffer, the buffer-local value of the variable
 @code{enable-multibyte-characters} specifies the representation used.
@@ -77,7 +98,7 @@ when the string is constructed.
 @defvar enable-multibyte-characters
 This variable specifies the current buffer's text representation.
 If it is non-@code{nil}, the buffer contains multibyte text; otherwise,
-it contains unibyte text.
+it contains unibyte encoded text or binary non-text data.
 
 You cannot set this variable directly; instead, use the function
 @code{set-buffer-multibyte} to change a buffer's representation.
@@ -96,20 +117,22 @@ default value to @code{nil} early in startup.
 @end defvar
 
 @defun position-bytes position
-Return the byte-position corresponding to buffer position
+Buffer positions are measured in character units.  This function
+returns the byte-position corresponding to buffer position
 @var{position} in the current buffer.  This is 1 at the start of the
 buffer, and counts upward in bytes.  If @var{position} is out of
 range, the value is @code{nil}.
 @end defun
 
 @defun byte-to-position byte-position
-Return the buffer position corresponding to byte-position
-@var{byte-position} in the current buffer.  If @var{byte-position} is
-out of range, the value is @code{nil}.
+Return the buffer position, in character units, corresponding to
+byte-position @var{byte-position} in the current buffer.  If
+@var{byte-position} is out of range, the value is @code{nil}.
 @end defun
 
 @defun multibyte-string-p string
-Return @code{t} if @var{string} is a multibyte string.
+Return @code{t} if @var{string} is a multibyte string, @code{nil}
+otherwise.
 @end defun
 
 @defun string-bytes string
@@ -119,6 +142,11 @@ If @var{string} is a multibyte string, this can be greater than
 @code{(length @var{string})}.
 @end defun
 
+@defun unibyte-string &rest bytes
+This function concatenates all of its arguments @var{bytes} and makes
+the result a unibyte string.
+@end defun
+
 @node Converting Representations
 @section Converting Text Representations
 
diff --git a/etc/NEWS b/etc/NEWS
index 6e4273cae42..b0f2177e547 100644
--- a/etc/NEWS
+++ b/etc/NEWS
@@ -1347,6 +1347,7 @@ returns its output as a list of lines.
 
 ** Character code, representation, and charset changes.
 
++++
 The character code space is now 0x0..0x3FFFFF with no gap.
 Characters of code 0x0..0x10FFFF are Unicode characters of the same code points.
 Characters of code 0x3FFF80..0x3FFFFF are raw 8-bit bytes.
@@ -1354,6 +1355,7 @@ Characters of code 0x3FFF80..0x3FFFFF are raw 8-bit bytes.
 +++
 Generic characters no longer exist.
 
++++
 In buffers and strings, characters are represented by UTF-8 byte
 sequences in a multibyte buffer/string.
 
-- 
2.39.5