13.2 Encoding and Conversion
| import: rhombus/bytes | package: rhombus-lib |
The rhombus/bytes module supports conversions between string encodings at the byte-string level. A bytes.Converter object maintains conversion state, and its bytes.Converter.convert method converts from one encoding to another. The set of available encodings and combinations varies by platform, depending on available system libraries such as iconv.
See Encodings and Locales for general information about encodings.
An exception is thrown if a from–to converter is not available. The available from–to converters mostly depend on system facilities, but a number of pairs are always supported:
"UTF-8" to "UTF-8": The identity conversion, except that encoding errors in the input lead to a decoding failure.
"UTF-8-permissive" to "UTF-8": The identity conversion, except that any input byte that is not part of a valid encoding sequence is effectively replaced by the UTF-8 encoding sequence for Char"\uFFFD". This handling of invalid sequences is consistent with the interpretation of port bytes streams into characters (see Encodings and Locales).
"" to "UTF-8": Converts from the current locale’s default encoding (see Encodings and Locales) to UTF-8.
"UTF-8" to "": Converts from UTF-8 to the current locale’s default encoding (see Encodings and Locales).
"platform-UTF-8" to "platform-UTF-16": Converts UTF-8 to UTF-16 on Unix and Mac OS, where each UTF-16 code unit is a sequence of two bytes ordered by the current platform’s endianness. On Windows, the conversion is the same as "WTF-8" to "WTF-16" to support unpaired surrogate code units.
"platform-UTF-8-permissive" to "platform-UTF-16": Like "platform-UTF-8" to "platform-UTF-16", but an input byte that is not part of a valid UTF-8 encoding sequence (or valid for the unpaired-surrogate extension on Windows) is effectively replaced with Char"\uFFFD".
"platform-UTF-16" to "platform-UTF-8": Converts UTF-16 (bytes ordered by the current platform’s endianness) to UTF-8 on Unix and Mac OS. On Windows, the conversion is the same as "WTF-16" to "WTF-8" to support unpaired surrogates. On Unix and Mac OS, surrogates are assumed to be paired: a pair of bytes with the bits 0xD800 starts a surrogate pair, and the 0x03FF bits are used from the pair and following pair (independent of the value of the 0xDC00 bits). On all platforms, performance may be poor when decoding from an odd offset within an input byte string.
"WTF-8" to "WTF-16": Converts the WTF-8 superset of UTF-8 to a superset of UTF-16 to support unpaired surrogate code units, where each UTF-16 code unit is a sequence of two bytes ordered by the current platform’s endianness.
"WTF-8-permissive" to "WTF-16": Like "WTF-8" to "WTF-16", but an input byte that is not part of a valid WTF-8 encoding sequence is effectively replaced with Char"\uFFFD".
"WTF-16" to "WTF-8": Converts the WTF-8 superset of UTF-16 to the WTF-8 superset of UTF-8. The input can include UTF-16 code units that are unpaired surrogates, and the corresponding output includes an encoding of each surrogate in a natural extension of UTF-8.
Use Converter.convert or Converter.convert_to with the result to convert byte strings.
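The behavior of the guaranteed conversions listed above can be tried on concrete bytes outside Rhombus. The following is a minimal sketch in Python, not the rhombus/bytes API. All function names are hypothetical. Python's errors="replace" and "surrogatepass" handlers stand in for the "-permissive" and WTF variants, and details may differ: for example, the number of replacement characters produced for a longer run of invalid bytes, and the fact that "surrogatepass" also accepts separately encoded surrogate pairs.

```python
import sys


def utf8_strict(data: bytes) -> str:
    # Analogous to "UTF-8" to "UTF-8": invalid input is a decoding failure.
    return data.decode("utf-8")


def utf8_permissive(data: bytes) -> str:
    # Analogous to "UTF-8-permissive": an invalid byte becomes U+FFFD.
    return data.decode("utf-8", errors="replace")


def wtf8_to_wtf16(data: bytes) -> bytes:
    # Analogous to "WTF-8" to "WTF-16": unpaired surrogates pass through,
    # and each 16-bit code unit uses the platform's byte order.
    codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    text = data.decode("utf-8", errors="surrogatepass")
    return text.encode(codec, errors="surrogatepass")


def combine_surrogates(high: int, low: int) -> int:
    # The 0x03FF bits of each code unit supply 10 bits of the code point.
    return 0x10000 + ((high & 0x03FF) << 10) + (low & 0x03FF)
```

For instance, utf8_permissive(b"a\xffb") yields "a\uFFFDb", while utf8_strict raises on the same input. combine_surrogates(0xD83D, 0xDE00) yields 0x1F600, which shows the surrogate-pair arithmetic described for "platform-UTF-16" to "platform-UTF-8".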
A newly opened byte converter is registered with the current
custodian, so that the converter is closed as if by
Converter.close when the custodian is shut down. A converter
is not registered with a custodian—
#"\377\376A\0B\0C\0D\0".copy()
4
#'complete
#"\347\0\360\0\366\0\243\0".copy()
8
#'complete
Conversion might not consume all of the src bytes. The second result is the number of converted bytes, which can be no more than src_end-src_start. If it is less, the Converter.Status as the third result indicates why.
The Converter.Status result is one of the following:
#'complete: All supplied src bytes (in the src_start to src_end range) were consumed for conversion.
#'continues: Conversion stopped due to the limit on the result size or the space in dest. In this case, fewer than dest_end-dest_start bytes may be written if more space is needed to process the next complete encoding sequence in src.
#'aborts: The input stopped part-way through an encoding sequence, and more input bytes are necessary to continue. For example, if the last byte of input is 0o303 for a "UTF-8-permissive" decoding, the result is #'aborts, because another byte is needed to determine how to use the 0o303 byte.
#'error: The src contains bytes that do not form a legal encoding sequence. Specifically, the bytes starting at src_start plus the second result are ill-formed. This result is never produced for some encodings, where all byte sequences are valid encodings. For example, since "UTF-8-permissive" handles an invalid UTF-8 sequence by dropping characters or generating ?, every byte sequence is effectively valid.
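The difference between #'complete, #'aborts, and #'error can be illustrated with any incremental decoder. The sketch below uses Python's standard codecs module rather than bytes.Converter. The function name classify_utf8 and the plain-string status values are hypothetical stand-ins for the Converter.Status symbols, and #'continues is not modeled because the sketch has no output size limit.

```python
import codecs


def classify_utf8(data: bytes):
    """Decode strict UTF-8 and return (text, consumed_bytes, status)."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False lets a truncated trailing sequence wait for more input.
        text = decoder.decode(data, final=False)
    except UnicodeDecodeError as err:
        # err.start is the offset of the ill-formed bytes, which matches the
        # role of the second result in the 'error case described above.
        return data[:err.start].decode("utf-8"), err.start, "error"
    pending, _flags = decoder.getstate()  # bytes buffered but not yet decoded
    consumed = len(data) - len(pending)
    return text, consumed, "aborts" if pending else "complete"


print(classify_utf8(b"caf\xc3\xa9"))  # whole input consumed
print(classify_utf8(b"caf\xc3"))      # 0o303 (0xC3) needs another byte
print(classify_utf8(b"ab\xffcd"))     # 0xFF can never start a sequence
```

In the second call, the trailing 0o303 byte is left unconsumed, so a caller would supply it again together with the following input, in the same way that the second result of a conversion tells the caller where to resume.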
Applying a converter accumulates state in the converter (even when the third result of Converter.convert or Converter.convert_to is #'complete). This state can affect both further processing of input and further generation of output, but only for conversions that involve “shift sequences” to change modes within a stream. To terminate an input sequence and reset the converter, use Converter.end or Converter.end_to.
In the case of Converter.end, the first result is a freshly allocated byte string. In the case of Converter.end_to, the converted bytes are written to the given dest (between dest_start and dest_end) and the number of written bytes is the first returned result.
The second result is #'complete or #'continues to indicate whether conversion completed. If #'complete, then an entire ending sequence was produced. If #'continues, then the conversion could not complete due to the limit on the result size or the space in dest, and the first result is either an empty byte string or 0.
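To see why an explicit ending step matters for encodings with shift sequences, consider ISO-2022-JP, which switches character sets with escape sequences. The sketch below uses Python's incremental encoder as an analogy, not the rhombus/bytes API. The function name encode_with_end is hypothetical, and the final flush plays roughly the role that Converter.end plays for a converter.

```python
import codecs


def encode_with_end(text: str):
    """Encode text as ISO-2022-JP and return (converted, ending_sequence)."""
    encoder = codecs.getincrementalencoder("iso2022_jp")()
    # Encoding may leave the stream in a shifted (non-ASCII) mode.
    converted = encoder.encode(text)
    # Flushing with final=True emits whatever is needed to return to the
    # initial mode and resets the encoder's state.
    ending = encoder.encode("", final=True)
    return converted, ending


converted, ending = encode_with_end("あ")
print(converted, ending)
```

For text that never leaves the initial mode, such as plain ASCII, the ending sequence is empty. This matches the note above that accumulated state only matters for conversions that involve shift sequences.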