Characters are objects that represent printed characters, such as letters and digits.
Characters are written using the notation #\character or #\character-name. For example:
    #\a          ; lowercase letter
    #\A          ; uppercase letter
    #\(          ; left parenthesis
    #\space      ; the space character
    #\newline    ; the newline character
Case is significant in #\character, but not in #\character-name. If character in #\character is a letter, character must be followed by a delimiter character such as a space or parenthesis. Characters written in the #\ notation are self-evaluating; you don't need to quote them.
In addition to the standard character syntax, MIT Scheme also supports a general syntax that denotes any Unicode character by its code point. This notation is #\U+code-point, where code-point is a sequence of hexadecimal digits for a valid code point. So the above examples could also be written like this:
    #\U+61       ; lowercase letter
    #\U+41       ; uppercase letter
    #\U+28       ; left parenthesis
    #\U+20       ; the space character
    #\U+0A       ; the newline character
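As a quick check of the correspondence between characters and code points, the following sketch uses only the standard procedures char->integer, integer->char, and number->string. It does not exercise the #\U+ reader syntax itself.

```scheme
;; #\a has code point #x61, so both expressions denote the same character.
(char=? #\a (integer->char #x61))            ; => #t

;; Print a character's code point in hexadecimal.
(number->string (char->integer #\() 16)      ; => "28"
(number->string (char->integer #\space) 16)  ; => "20"
```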
A character name may include one or more bucky bit prefixes to indicate that the character includes one or more of the keyboard shift keys Control, Meta, Super, or Hyper (note that the Control bucky bit prefix is not the same as the ASCII control key). The bucky bit prefixes and their meanings are as follows (case is not significant):
    Key        Bucky bit prefix    Bucky bit
    ---        ----------------    ---------
    Meta       M- or Meta-         1
    Control    C- or Control-      2
    Super      S- or Super-        4
    Hyper      H- or Hyper-        8
For example,
    #\c-a          ; Control-a
    #\meta-b       ; Meta-b
    #\c-s-m-h-a    ; Control-Meta-Super-Hyper-A
The following character-names are supported, shown here with their ASCII equivalents:
    Character Name    ASCII Name
    --------------    ----------
    altmode           ESC
    backnext          US
    backspace         BS
    call              SUB
    linefeed          LF
    page              FF
    return            CR
    rubout            DEL
    space
    tab               HT
In addition, #\newline is the same as #\linefeed (but this may change in the future, so you should not depend on it). All of the standard ASCII names for non-printing characters are supported:
    NUL    SOH    STX    ETX    EOT    ENQ    ACK    BEL
    BS     HT     LF     VT     FF     CR     SO     SI
    DLE    DC1    DC2    DC3    DC4    NAK    SYN    ETB
    CAN    EM     SUB    ESC    FS     GS     RS     US
    DEL
char->name returns a string naming char, including any bucky bit prefixes:

    (char->name #\a)            =>  "a"
    (char->name #\space)        =>  "Space"
    (char->name #\c-a)          =>  "C-a"
    (char->name #\control-a)    =>  "C-a"
Slashify?, if specified and true, says to insert the necessary backslash characters in the result so that read will parse it correctly. In other words, the following generates the external representation of char:

    (string-append "#\\" (char->name char #t))

If slashify? is not specified, it defaults to #f.
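name->char converts a string that names a character into that character. If the string does not name a character, name->char signals an error.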
    (name->char "a")            =>  #\a
    (name->char "space")        =>  #\Space
    (name->char "c-a")          =>  #\C-a
    (name->char "control-a")    =>  #\C-a
Each of the character comparison procedures (such as char<? and char-ci=?) returns #t if the specified characters have the appropriate order relationship to one another; otherwise it returns #f. The -ci procedures don't distinguish uppercase and lowercase letters.
Character ordering follows these portability rules:
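The digits are in order; for example, (char<? #\0 #\9) returns #t.
The uppercase characters are in order; for example, (char<? #\A #\B) returns #t.
The lowercase characters are in order; for example, (char<? #\a #\b) returns #t.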
MIT/GNU Scheme uses a specific character ordering, in which characters have the same order as their corresponding integers. See the documentation for char->integer for further details.
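The following sketch separates the comparisons that the portability rules guarantee from one that depends on the integer ordering described for MIT/GNU Scheme. The last two results assume that char->integer yields ASCII/Unicode code points, so they should not be relied on in portable code.

```scheme
;; Guaranteed by the portability rules.
(char<? #\0 #\9)        ; => #t
(char<? #\a #\b)        ; => #t

;; Case-insensitive comparison treats #\a and #\A as equal.
(char-ci=? #\a #\A)     ; => #t
(char=? #\a #\A)        ; => #f

;; Implementation-specific: uppercase sorts before lowercase
;; because 66 < 97 in the underlying integer ordering.
(< (char->integer #\B) (char->integer #\a))   ; => #t
(char<? #\B #\a)                              ; => #t
```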
Note: Although character objects can represent all of Unicode, the model of alphabetic case used covers only ASCII letters, which means that case-insensitive comparisons and case conversions are incorrect for non-ASCII letters. This will eventually be fixed.
char? returns #t if object is a character; otherwise returns #f.
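char-upcase and char-downcase return the uppercase or lowercase equivalent of char if char is a letter, and otherwise return char. The result char2 always satisfies (char-ci=? char char2).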
Note: Although character objects can represent all of Unicode, the model of alphabetic case used covers only ASCII letters, which means that case-insensitive comparisons and case conversions are incorrect for non-ASCII letters. This will eventually be fixed.
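char->digit returns the integer value of char if char represents a digit in the given radix, which defaults to 10. If char doesn't represent a digit in that radix, char->digit returns #f.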
Note that this procedure is insensitive to the alphabetic case of char.
    (char->digit #\8)       =>  8
    (char->digit #\e 16)    =>  14
    (char->digit #\e)       =>  #f
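digit->char returns the character that represents digit in the given radix, which defaults to 10. If digit is not a valid digit in that radix, digit->char returns #f.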
    (digit->char 8)        =>  #\8
    (digit->char 14 16)    =>  #\E
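As an illustration of how char->digit might be used, here is a minimal sketch that parses a string of digits in a given radix. The helper name digits->integer is hypothetical and not part of MIT/GNU Scheme; the sketch assumes char->digit behaves as shown in the examples above.

```scheme
;; Hypothetical helper: convert a string of digits in RADIX to an integer.
;; Returns #f as soon as a character is not a valid digit in RADIX.
(define (digits->integer str radix)
  (let loop ((i 0) (acc 0))
    (if (= i (string-length str))
        acc
        (let ((d (char->digit (string-ref str i) radix)))
          (and d
               (loop (+ i 1) (+ (* acc radix) d)))))))

(digits->integer "ff" 16)   ; => 255
(digits->integer "FF" 16)   ; => 255 (char->digit ignores case)
(digits->integer "1z" 10)   ; => #f
```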
An MIT/GNU Scheme character consists of a code part and a bucky bits part. The MIT/GNU Scheme set of characters can represent more characters than ASCII can; it includes characters with Super and Hyper bucky bits, as well as Control and Meta. Every ASCII character corresponds to some MIT/GNU Scheme character, but not vice versa.
MIT/GNU Scheme uses a 21-bit character code with 4 bucky bits. The character code contains the Unicode code point for the character. This is a change from earlier versions of the system, which used the ISO-8859-1 code point, but it is upwards compatible with previous usage, since ISO-8859-1 is a proper subset of Unicode.
make-char builds a character from code and bucky-bits. Use char-code and char-bits to extract the code and bucky bits from the character. If 0 is specified for bucky-bits, make-char produces an ordinary character; otherwise, the appropriate bits are turned on as follows:
    1    Meta
    2    Control
    4    Super
    8    Hyper
For example,
    (make-char 97 0)    =>  #\a
    (make-char 97 1)    =>  #\M-a
    (make-char 97 2)    =>  #\C-a
    (make-char 97 3)    =>  #\C-M-a
char-bits returns the bucky bits of a character as an exact integer:

    (char-bits #\a)        =>  0
    (char-bits #\m-a)      =>  1
    (char-bits #\c-a)      =>  2
    (char-bits #\c-m-a)    =>  3
char-code returns the character code of a character:

    (char-code #\a)      =>  97
    (char-code #\c-a)    =>  97

Note that in MIT/GNU Scheme, the value of char-code is the Unicode code point for char.
char->integer returns the character code representation for char. integer->char returns the character whose character code representation is k.
In MIT/GNU Scheme, if (char-ascii? char) is true, then the following expression is also true:

    (eqv? (char->ascii char) (char->integer char))
However, this behavior is not required by the Scheme standard, and code that depends on it is not portable to other implementations.
These procedures implement order isomorphisms between the set of characters under the char<=? ordering and some subset of the integers under the <= ordering. That is, if

    (char<=? a b)  =>  #t    and    (<= x y)  =>  #t

and x and y are in the range of char->integer, then

    (<= (char->integer a) (char->integer b))         =>  #t
    (char<=? (integer->char x) (integer->char y))    =>  #t
In MIT/GNU Scheme, the specific relationship implemented by these procedures is as follows:
    (define (char->integer c)
      (+ (* (char-bits c) #x200000)
         (char-code c)))

    (define (integer->char n)
      (make-char (remainder n #x200000)
                 (quotient n #x200000)))
This implies that char->integer and char-code produce identical results for characters that have no bucky bits set, and that characters are ordered according to their Unicode code points.
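To make the packing arithmetic concrete, the sketch below reproduces it on plain integers, without calling char->integer or make-char. The names pack-char-integer and unpack-char-integer are hypothetical helpers introduced only for this illustration.

```scheme
;; Hypothetical helpers mirroring the relationship given above:
;; the bucky bits occupy the positions above the 21-bit character code.
(define (pack-char-integer code bits)
  (+ (* bits #x200000) code))

(define (unpack-char-integer n)
  ;; Returns a list (code bits).
  (list (remainder n #x200000)
        (quotient n #x200000)))

(pack-char-integer 97 0)        ; => 97       (#\a, no bucky bits)
(pack-char-integer 97 1)        ; => 2097249  (Meta bit set)
(pack-char-integer 97 3)        ; => 6291553  (Control and Meta bits set)
(unpack-char-integer 6291553)   ; => (97 3)
```

With no bucky bits the packed value equals the code itself, which is why char->integer and char-code agree for such characters.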
Note: If the argument to char->integer or integer->char is a constant, the compiler will constant-fold the call, replacing it with the corresponding result. This is a very useful way to denote unusual character constants or ASCII codes.
The range of char->integer is defined to be the exact non-negative integers that are less than the value of the variable char-integer-limit (exclusive). Note, however, that there are some holes in this range, because the character code must be a valid Unicode code point.
MIT/GNU Scheme internally uses ISO-8859-1 codes for I/O, and stores character objects in a fashion that makes it convenient to convert between ISO-8859-1 codes and characters. Also, character strings are implemented as byte vectors whose elements are ISO-8859-1 codes; these codes are converted to character objects when accessed. For these reasons it is sometimes desirable to be able to convert between ISO-8859-1 codes and characters.
Not all characters can be represented as ISO-8859-1 codes. A character that has an equivalent ISO-8859-1 representation is called an ISO-8859-1 character.
For historical reasons, the procedures that manipulate ISO-8859-1 characters use the word "ASCII" rather than "ISO-8859-1".
char-ascii? returns a true value if char has an ISO-8859-1 representation; otherwise it returns #f.
In the current implementation, the characters that satisfy this predicate are those in which the bucky bits are turned off, and for which the character code is less than 256.
char->ascii returns the ISO-8859-1 code for char. An error of type condition-type:bad-range-argument is signalled if char doesn't have an ISO-8859-1 representation.
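The condition described for ISO-8859-1 characters (no bucky bits, character code below 256) can be written directly in terms of char-bits and char-code. The predicate name latin-1-char? is hypothetical; this is a sketch of the stated rule, not the implementation of char-ascii?.

```scheme
;; Hypothetical predicate: true when CHAR has no bucky bits
;; and its character code fits in 8 bits.
(define (latin-1-char? char)
  (and (= 0 (char-bits char))
       (< (char-code char) 256)))

(latin-1-char? #\a)                   ; => #t
(latin-1-char? (make-char 97 1))      ; => #f  (Meta bit set)
(latin-1-char? (integer->char 955))   ; => #f  (code 955 is above 255)
```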
MIT/GNU Scheme's character-set abstraction is used to represent groups of characters, such as the letters or digits. Character sets may contain only ISO-8859-1 characters; use the alphabet abstraction (see section 5.7 Unicode) if you need to cover the entire Unicode range.
char-set? returns #t if object is a character set; otherwise returns #f.
To see the contents of one of the predefined character sets, use char-set-members.
Alphabetic characters are the 52 upper and lower case letters. Numeric characters are the 10 decimal digits. Alphanumeric characters are those in the union of these two sets. Whitespace characters are #\space, #\tab, #\page, #\linefeed, and #\return. Graphic characters are the printing characters and #\space. Standard characters are the printing characters, #\space, and #\newline.
These are the printing characters:
    ! " # $ % & ' ( ) * + , - . /
    0 1 2 3 4 5 6 7 8 9
    : ; < = > ? @
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    [ \ ] ^ _ `
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    { | } ~
Returns #t if char is in char-set; otherwise returns #f.
Returns #t if char-set-1 and char-set-2 contain exactly the same characters; otherwise returns #f.
With no arguments, char-set returns an empty character set.
Returns a character set consisting of the characters in the list chars. This is equivalent to (apply char-set chars).
For historical reasons, the name of this procedure refers to "ASCII" rather than "ISO-8859-1".
predicate->char-set creates and returns a character set consisting of the ISO-8859-1 characters for which predicate is true.
MIT/GNU Scheme provides rudimentary support for Unicode characters. In an ideal world, Unicode would be the base character set for MIT/GNU Scheme. But MIT/GNU Scheme predates the invention of Unicode, and converting an application of this size is a considerable undertaking. So for the time being, the base character set for I/O and strings is ISO-8859-1, and Unicode support is grafted on.
This Unicode support was implemented as a part of the XML parser (see section 14.12 XML Parser) implementation. XML uses Unicode as its base character set, and any XML implementation must support Unicode.
The basic unit in a Unicode implementation is the code point. The character equivalent of a code point is a wide character.
unicode-code-point? returns #t if object is a Unicode code point. Code points are implemented as exact non-negative integers. They are further limited, by the Unicode standard, to be strictly less than #x110000, with the values #xD800 through #xDFFF, #xFFFE, and #xFFFF excluded.
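The range rules stated for code points can be expressed as a small predicate using only standard numeric procedures. The name valid-code-point? is hypothetical; this sketch models the stated rules and is not the definition of unicode-code-point?.

```scheme
;; Hypothetical model of the code-point rules stated above.
(define (valid-code-point? n)
  (and (integer? n)
       (exact? n)
       (<= 0 n)
       (< n #x110000)
       (not (<= #xD800 n #xDFFF))   ; surrogate range excluded
       (not (= n #xFFFE))
       (not (= n #xFFFF))))

(valid-code-point? 97)         ; => #t
(valid-code-point? #x10FFFF)   ; => #t
(valid-code-point? #xD800)     ; => #f
(valid-code-point? #x110000)   ; => #f
(valid-code-point? 97.0)       ; => #f  (not exact)
```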
Returns #t if object is a wide character, specifically if object is a character with no bucky bits and whose code satisfies unicode-code-point?.
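The Unicode implementation consists of three parts: wide strings, conversions between the internal representation and several standard external representations, and alphabets, which implement sets of Unicode code points (comparable to the char-set abstraction).

5.7.1 Wide Strings
5.7.2 Unicode Representations
5.7.3 Alphabets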
Wide characters can be combined into wide strings, which are similar to strings but can contain any Unicode character sequence. The implementation used for wide strings is guaranteed to provide constant-time access to each character in the string.
wide-string? returns #t if object is a wide string.
Converts wide-string, which must satisfy wide-string?, to a string. If start and end are supplied, they specify a substring of wide-string that is to be converted. Start defaults to `0', and end defaults to `(wide-string-length wide-string)'. It is an error if any character in wide-string fails to satisfy char-ascii?.
open-wide-output-string returns an output port that accumulates the characters written to it. Call get-output-string on the returned port to get a wide string containing the accumulated characters.
call-with-wide-output-string is equivalent to:

    (define (call-with-wide-output-string procedure)
      (let ((port (open-wide-output-string)))
        (procedure port)
        (get-output-string port)))
The procedures in this section implement transformations that convert between the internal representation of Unicode characters and several standard external representations. These external representations are all implemented as sequences of bytes, but they differ in their intended usage.
The UTF-16 and UTF-32 representations may be serialized to and from a byte stream in either big-endian or little-endian order. In big-endian order, the most significant byte is first, the next most significant byte is second, etc. In little-endian order, the least significant byte is first, etc. All of the UTF-16 and UTF-32 representation procedures are available in both orders, which are indicated by names containing `utfNN-be' and `utfNN-le', respectively. There are also procedures that implement host-endian order, which is either big-endian or little-endian depending on the underlying computer architecture.
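The difference between the two byte orders can be seen by splitting a single 16-bit code unit into its two bytes. This sketch uses only integer arithmetic; code-unit->bytes is a hypothetical helper for illustration and is not one of the representation procedures in this section.

```scheme
;; Hypothetical helper: split a 16-bit code unit into two bytes,
;; in big-endian or little-endian order.
(define (code-unit->bytes unit big-endian?)
  (let ((high (quotient unit 256))     ; most significant byte
        (low (remainder unit 256)))    ; least significant byte
    (if big-endian?
        (list high low)
        (list low high))))

;; The code unit #x20AC has high byte #x20 (32) and low byte #xAC (172).
(code-unit->bytes #x20AC #t)   ; => (32 172)
(code-unit->bytes #x20AC #f)   ; => (172 32)
```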
Applications often need to manipulate sets of characters, such as the set of alphabetic characters or the set of whitespace characters. The alphabet abstraction provides an efficient implementation of sets of Unicode code points.
Returns #t if object is a Unicode alphabet, otherwise returns #f.
well-formed-code-points-list? returns #t if object is a well-formed code-points list, otherwise returns #f. A well-formed code-points list is a proper list, each element of which is either a code point or a pair of code points. A pair of code points represents a contiguous range of code points. The CAR of the pair is the lower limit, and the CDR is the upper limit. Both limits are inclusive, and the lower limit must be strictly less than the upper limit.
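The definition of a well-formed code-points list translates into a short recursive check. The sketch below is a model of the rules as stated, using hypothetical names (valid-code-point?, code-points-item?, code-points-list?); it is not the implementation of well-formed-code-points-list?.

```scheme
;; Hypothetical model of a single code point (see the Unicode rules:
;; below #x110000, excluding #xD800 through #xDFFF, #xFFFE, and #xFFFF).
(define (valid-code-point? n)
  (and (integer? n) (exact? n)
       (<= 0 n) (< n #x110000)
       (not (<= #xD800 n #xDFFF))
       (not (= n #xFFFE))
       (not (= n #xFFFF))))

;; An element is a code point, or a pair (lower . upper) with lower < upper.
(define (code-points-item? x)
  (or (valid-code-point? x)
      (and (pair? x)
           (valid-code-point? (car x))
           (valid-code-point? (cdr x))
           (< (car x) (cdr x)))))

;; The whole object must be a proper list of such elements.
(define (code-points-list? x)
  (and (list? x)
       (let loop ((items x))
         (or (null? items)
             (and (code-points-item? (car items))
                  (loop (cdr items)))))))

(code-points-list? '(65 (97 . 122)))   ; => #t
(code-points-list? '((122 . 97)))      ; => #f  (limits reversed)
(code-points-list? '((97 . 97)))       ; => #f  (lower must be strictly less)
```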
Returns #t if char is a member of alphabet, otherwise returns #f.
Character sets and alphabets can be converted to one another, provided that the alphabet contains only 8-bit code points. This is true because 8-bit code points in Unicode map directly to ISO-8859-1 characters, which is what character sets contain.
An alphabet containing the characters in string can be constructed with:

    (char-set->alphabet (string->char-set string))
Returns #t if alphabet contains only 8-bit code points, otherwise returns #f.