13.2 Encoding and Conversion

0.45+9.1

Rhombus

The rhombus/bytes module supports conversions between string encodings at the byte-string level. A bytes.Converter object maintains conversion state, and its bytes.Converter.convert method converts from one encoding to another. The set of available encodings and combinations varies by platform, depending on available system libraries such as iconv.

See Encodings and Locales for general information about encodings.

class

class bytes.Converter():
  constructor (~from: from :: String,
               ~to: to :: String)

A Converter object maintains state for converting a stream of bytes from the from encoding to the to encoding. The name "" can be used as equivalent to the current locale’s default encoding, which is also reported by system.locale_string_encoding.

An exception is thrown if a from–to converter is not available. The from–to converters available mostly depend on system facilities, but a number of pairs are always supported:

"UTF-8" to "UTF-8": The identity conversion, except that encoding errors in the input lead to a decoding failure.
"UTF-8-permissive" to "UTF-8": The identity conversion, except that any input byte that is not part of a valid encoding sequence is effectively replaced by the UTF-8 encoding sequence for Char"\uFFFD". This handling of invalid sequences is consistent with the interpretation of port bytes streams into characters (see Encodings and Locales).
"" to "UTF-8": Converts from the current locale’s default encoding (see Encodings and Locales) to UTF-8.
"UTF-8" to "": Converts from UTF-8 to the current locale’s default encoding (see Encodings and Locales).
"platform-UTF-8" to "platform-UTF-16": Converts UTF-8 to UTF-16 on Unix and Mac OS, where each UTF-16 code unit is a sequence of two bytes ordered by the current platform’s endianness. On Windows, the conversion is the same as "WTF-8" to "WTF-16" to support unpaired surrogate code units.
"platform-UTF-8-permissive" to "platform-UTF-16": Like "platform-UTF-8" to "platform-UTF-16", but an input byte that is not part of a valid UTF-8 encoding sequence (or valid for the unpaired-surrogate extension on Windows) is effectively replaced with Char"\uFFFD".
"platform-UTF-16" to "platform-UTF-8": Converts UTF-16 (bytes ordered by the current platform’s endianness) to UTF-8 on Unix and Mac OS. On Windows, the conversion is the same as "WTF-16" to "WTF-8" to support unpaired surrogates. On Unix and Mac OS, surrogates are assumed to be paired: a two-byte code unit with the 0xD800 bits set starts a surrogate pair, and the 0x03FF bits are taken from that code unit and from the following code unit (independent of the value of the 0xDC00 bits). On all platforms, performance may be poor when decoding from an odd offset within an input byte string.
"WTF-8" to "WTF-16": Converts the WTF-8 superset of UTF-8 to a superset of UTF-16 to support unpaired surrogate code units, where each UTF-16 code unit is a sequence of two bytes ordered by the current platform’s endianness.
"WTF-8-permissive" to "WTF-16": Like "WTF-8" to "WTF-16", but an input byte that is not part of a valid WTF-8 encoding sequence is effectively replaced with Char"\uFFFD".
"WTF-16" to "WTF-8": Converts the WTF-16 superset of UTF-16 to the WTF-8 superset of UTF-8. The input can include UTF-16 code units that are unpaired surrogates, and the corresponding output includes an encoding of each surrogate in a natural extension of UTF-8.

Use Converter.convert or Converter.convert_to with the result to convert byte strings.

A newly opened byte converter is registered with the current custodian, so that the converter is closed as if by Converter.close when the custodian is shut down. A converter is not registered with a custodian—and does not need to be closed—if it is one of the guaranteed combinations not involving "" on Unix, or if it is any of the guaranteed combinations (including "") on Windows and Mac OS.

> def cvt = bytes.Converter(~from: "UTF-8", ~to: "UTF-16")
> cvt.convert(#"ABCD")
#"\377\376A\0B\0C\0D\0".copy()
4
#'complete
> cvt.convert(#"\303\247\303\260\303\266\302\243")
#"\347\0\360\0\366\0\243\0".copy()
8
#'complete
> cvt.close()
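
As a further sketch of the guaranteed pairs listed above, the following interaction passes a byte string containing a stray 0o377 byte, which is never valid in UTF-8, through a "UTF-8-permissive" converter. Going by the description of that pair, the stray byte should come back as the three-byte UTF-8 encoding of Char"\uFFFD" (that is, #"\357\277\275"), with all three input bytes consumed. Results are omitted because this interaction has not been checked against a running system, and the name permissive is chosen only for this example.

> def permissive = bytes.Converter(~from: "UTF-8-permissive", ~to: "UTF-8")
> permissive.convert(#"A\377B")
> permissive.close()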

method

method (cvt :: bytes.Converter).close() :: Void

Closes the converter and releases its resources. A closed converter can no longer be used with Converter.convert, Converter.convert_to, Converter.end, or Converter.end_to.
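
One way to make sure a converter is closed even when conversion throws is to pair it with a try form. The following is a minimal sketch, assuming the standard try form with a ~finally clause; the function name to_utf16 is hypothetical, and the "UTF-8" to "UTF-16" pair is not among the guaranteed combinations, so it depends on system facilities such as iconv.

> fun to_utf16(src :: Bytes):
    def cvt = bytes.Converter(~from: "UTF-8", ~to: "UTF-16")
    try:
      // Returns the same three results as Converter.convert
      cvt.convert(src)
      ~finally:
        // Runs whether or not the conversion throws
        cvt.close()
> to_utf16(#"ABCD")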

method

method (cvt :: bytes.Converter).convert(
  src :: Bytes,
  ~start: src_start :: Nat = 0,
  ~end: src_end :: Nat = src.length()
) :: values(Bytes, Nat, Converter.Status)

method

method (cvt :: bytes.Converter).convert_to(
  src :: Bytes,
  ~start: src_start :: Nat = 0,
  ~end: src_end :: Nat = src.length(),
  ~dest: dest :: MutableBytes,
  ~dest_start: dest_start :: Nat = 0,
  ~dest_end: dest_end :: Nat = dest.length()
) :: values(Nat, Nat, Converter.Status)

enumeration

enum bytes.Converter.Status
| complete
| continues
| aborts
| error

Converts bytes in src from offset src_start (inclusive) to src_end (exclusive). In the case of Converter.convert, the first result is a freshly allocated byte string. In the case of Converter.convert_to, the converted bytes are written to the given dest (between dest_start and dest_end) and the number of written bytes is the first returned result.

Conversion might not consume all of the src bytes. The second result is the number of converted bytes, which can be no more than src_end-src_start. If it is less, the Converter.Status as the third result indicates why.

The Converter.Status result is one of the following:

#'complete: All supplied src bytes (in the src_start to src_end range) were consumed for conversion.
#'continues: Conversion stopped due to the limit on the result size or the space in dest. In this case, fewer than dest_end-dest_start bytes may be written if more space is needed to process the next complete encoding sequence in src.
#'aborts: The input stopped part-way through an encoding sequence, and more input bytes are necessary to continue. For example, if the last byte of input is 0o303 for a "UTF-8-permissive" decoding, the result is #'aborts, because another byte is needed to determine how to use the 0o303 byte.
#'error: The src contains bytes that do not form a legal encoding sequence. Specifically, the bytes starting at src_start plus the second result are ill-formed. This result is never produced for some encodings, where all byte sequences are valid encodings. For example, since "UTF-8-permissive" handles an invalid UTF-8 sequence by substituting the encoding of Char"\uFFFD", every byte sequence is effectively valid.

Applying a converter accumulates state in the converter (even when the third result of Converter.convert or Converter.convert_to is #'complete). This state can affect both further processing of input and further generation of output, but only for conversions that involve “shift sequences” to change modes within a stream. To terminate an input sequence and reset the converter, use Converter.end or Converter.end_to.
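
The statuses above can be sketched with the guaranteed "UTF-8" to "UTF-8" identity conversion. The interaction below is illustrative, and results are omitted because it has not been checked against a running system; the names cvt and dest are chosen for this example, and Bytes.make is assumed to allocate a mutable byte string of the given length. Going by the descriptions above, the first call should report #'aborts with zero bytes consumed, since 0o303 begins a two-byte sequence; the second should report #'complete once the full sequence is supplied; the third should report #'error, since 0o377 cannot start a UTF-8 sequence; and the fourth should report #'continues, since dest has room for only two of the four input bytes.

> def cvt = bytes.Converter(~from: "UTF-8", ~to: "UTF-8")
> cvt.convert(#"\303")
> cvt.convert(#"\303\247")
> cvt.convert(#"\377")
> def dest = Bytes.make(2)
> cvt.convert_to(#"ABCD", ~dest: dest)

Because the second result counts only the bytes that were consumed, a caller that reads input in chunks would typically keep any unconsumed trailing bytes after #'aborts and resubmit them together with the next chunk.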

method

method (cvt :: bytes.Converter).end()
  :: values(Bytes, Status)

method

method (cvt :: bytes.Converter).end_to(
  ~dest: dest :: MutableBytes,
  ~dest_start: dest_start :: Nat = 0,
  ~dest_end: dest_end :: Nat = dest.length()
) :: values(Nat, Status)

Like Converter.convert and Converter.convert_to, but instead of converting bytes, these methods generate an ending sequence for the conversion (sometimes called a “shift sequence”), if any. Few encodings use shift sequences, so these methods succeed with no output for most encodings. In any case, successful output of a (possibly empty) shift sequence resets the converter to its initial state.

In the case of Converter.end, the first result is a freshly allocated byte string. In the case of Converter.end_to, the converted bytes are written to the given dest (between dest_start and dest_end) and the number of written bytes is the first returned result.

The second result is #'complete or #'continues to indicate whether conversion completed. If #'complete, then an entire ending sequence was produced. If #'continues, then the conversion could not complete due to the limit on the result size or the space in dest, and the first result is either an empty byte string or 0.
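
A short sketch of terminating a conversion, using the guaranteed "UTF-8" to "UTF-8" pair and a name, cvt, chosen for this example. UTF-8 has no shift sequences, so by the description above Converter.end should produce an empty byte string together with #'complete; results are omitted because the interaction has not been checked against a running system.

> def cvt = bytes.Converter(~from: "UTF-8", ~to: "UTF-8")
> cvt.convert(#"ABCD")
> cvt.end()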

function

fun bytes.Converter.maybe_open(~from: from :: String, ~to: to :: String)
  :: maybe(Converter)

Like the Converter constructor, but returning #false instead of throwing an exception if a from–to converter is not available.
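
A minimal sketch of falling back when a converter is unavailable. The encoding name "no-such-encoding" is made up for this example and is assumed not to be recognized by any system library, so maybe_open is expected to produce #false and the second branch to be taken; the fallback string is likewise only illustrative.

> def cvt = bytes.Converter.maybe_open(~from: "no-such-encoding", ~to: "UTF-8")
> if cvt
  | cvt.convert(#"ABCD")
  | "conversion unavailable"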

context parameter

Parameter.def bytes.current_locale :: maybe(String)

A context parameter for the current locale. See Encodings and Locales for more information.

function

fun bytes.hex_string(bstr :: Bytes) :: String

Returns a string with a 2-digit hexadecimal number for each of the byte values in bstr. That is, the result string has twice as many characters as bstr has bytes, and every character in the result string is a hexadecimal digit in the range Char"0" to Char"9" or Char"a" to Char"f".

> bytes.hex_string(#"A\000\377")
"4100ff"
