encoding-rs-kotlin

Version 0.1.1

Line-by-line transliteration port of a mature Encoding Standard implementation. Streams and buffers convert between legacy encodings, UTF-8 and UTF-16, with mem utilities, label resolution and optional SIMD acceleration.

Platforms: Android, JS, JVM, Native, Wasm · KotlinMania/encoding-rs-kotlin

encoding-rs-kotlin in Kotlin

This is a Kotlin Multiplatform line-by-line transliteration port of hsivonen/encoding_rs.

Original Project: This port is based on hsivonen/encoding_rs. All design credit and project intent belong to the upstream authors; this repository is a faithful port to Kotlin Multiplatform with no behavioural changes intended.

Porting status

This is an in-progress port. The goal is feature parity with the upstream Rust crate while providing a native Kotlin Multiplatform API. Every Kotlin file carries a // port-lint: source <path> header naming its upstream Rust counterpart so the AST-distance tool can track provenance.
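As a rough illustration of how a tool might read the provenance header described above, here is a minimal sketch in Rust. The function name `port_lint_source` and the whitespace tolerance are assumptions made for this example; the actual AST-distance tool may parse the header differently.

```rust
/// Returns the upstream path named by a `// port-lint: source <path>` header
/// line, or `None` if the line is not such a header.
/// Hypothetical helper: the name and the exact whitespace rules are assumptions.
fn port_lint_source(line: &str) -> Option<&str> {
    // Strip the comment marker, then the `port-lint:` tag, then the `source` keyword.
    let rest = line.trim().strip_prefix("//")?.trim_start();
    let rest = rest.strip_prefix("port-lint:")?.trim_start();
    let path = rest.strip_prefix("source")?;
    // Require at least one whitespace character between `source` and the path.
    if !path.starts_with(char::is_whitespace) {
        return None;
    }
    let path = path.trim();
    if path.is_empty() {
        None
    } else {
        Some(path)
    }
}
```

Under these assumptions, a header line such as `// port-lint: source src/mem.rs` yields `Some("src/mem.rs")`, while an ordinary comment yields `None`.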

Upstream README — `hsivonen/encoding_rs`

The text below is reproduced and lightly edited from https://github.com/hsivonen/encoding_rs. It is the upstream project's own description and remains under the upstream authors' authorship; links have been rewritten to absolute upstream URLs so they continue to resolve from this repository.

encoding_rs

encoding_rs is an implementation of (the non-JavaScript parts of) the Encoding Standard, written in Rust.

The Encoding Standard defines the Web-compatible set of character encodings, which means this crate can be used to decode Web content. encoding_rs is used in Gecko starting with Firefox 56. Due to the notable overlap between the legacy encodings on the Web and the legacy encodings used on Windows, this crate may be of use for non-Web-related situations as well; see below for links to adjacent crates.

Additionally, the mem module provides various operations for dealing with in-RAM text (as opposed to data that's coming from or going to an IO boundary). The mem module is a module instead of a separate crate due to internal implementation detail efficiencies.

Functionality

Due to the Gecko use case, encoding_rs supports decoding to and encoding from UTF-16 in addition to supporting the usual Rust use case of decoding to and encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly to accommodate the C++ side of Gecko.

Specifically, encoding_rs does the following:

Additionally, encoding_rs::mem does the following:

Integration with `std::io`

Notably, the above feature list doesn't include the capability to wrap a std::io::Read, decode it into UTF-8 and present the result via std::io::Read. The encoding_rs_io crate provides that capability.
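To make the shape of such an adapter concrete, the following is a toy sketch that uses only the Rust standard library. It is not the encoding_rs_io API: the type `Latin1ToUtf8` is hypothetical, and it handles only the trivial case where every input byte maps to the code point of the same value.

```rust
use std::io::{self, Read};

/// Hypothetical adapter: wraps a reader of Latin1 bytes and presents UTF-8 bytes.
struct Latin1ToUtf8<R> {
    inner: R,
    /// Second byte of a two-byte UTF-8 sequence that did not fit in the last call.
    pending: Option<u8>,
}

impl<R: Read> Latin1ToUtf8<R> {
    fn new(inner: R) -> Self {
        Latin1ToUtf8 { inner, pending: None }
    }
}

impl<R: Read> Read for Latin1ToUtf8<R> {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        if out.is_empty() {
            return Ok(0);
        }
        let mut written = 0;
        // Flush a byte left over from the previous call first.
        if let Some(b) = self.pending.take() {
            out[0] = b;
            written = 1;
        }
        // Each input byte becomes at most two output bytes, so request about
        // half of the remaining space; at most one byte can then spill over.
        let want = (out.len() - written + 1) / 2;
        if want == 0 {
            return Ok(written);
        }
        let mut raw = vec![0u8; want];
        let n = self.inner.read(&mut raw)?;
        for &b in &raw[..n] {
            let mut buf = [0u8; 2];
            for &unit in char::from(b).encode_utf8(&mut buf).as_bytes() {
                if written < out.len() {
                    out[written] = unit;
                    written += 1;
                } else {
                    self.pending = Some(unit);
                }
            }
        }
        Ok(written)
    }
}
```

Because the adapter itself implements `Read`, callers can chain it with the usual helpers such as `read_to_string`. A real adapter for the Encoding Standard also has to carry multi-byte decoder state across reads, which this sketch does not attempt.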

`no_std` Environment

The crate works in a no_std environment. By default, the alloc feature, which assumes that an allocator is present, is enabled. For a no-allocator environment, the default features (i.e. alloc) can be turned off. This makes the parts of the API that return Vec/String/Cow unavailable.

Decoding Email

For decoding character encodings that occur in email, use the charset crate instead of using this one directly. (It wraps this crate and adds UTF-7 decoding.)

Windows Code Page Identifier Mappings

For mappings to and from Windows code page identifiers, use the codepage crate.

DOS Encodings

This crate does not support single-byte DOS encodings that aren't required by the Web Platform, but the oem_cp crate does.

Preparing Text for the Encoders

Normalizing text into Unicode Normalization Form C prior to encoding text into a legacy encoding minimizes unmappable characters. Text can be normalized to Unicode Normalization Form C using the icu_normalizer crate.

The exception is windows-1258, which after normalizing to Unicode Normalization Form C requires tone marks to be decomposed in order to minimize unmappable characters. Vietnamese tone marks can be decomposed using the detone crate.
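The idea of decomposing tone marks can be sketched with a deliberately tiny table. This is not the detone crate's API or algorithm: the function names and the three-entry table are hypothetical, and the sketch assumes that windows-1258 represents the base letter and the combining tone mark as separate characters.

```rust
/// Hypothetical three-entry table mapping a precomposed Vietnamese letter to
/// (base letter, combining tone mark). A real decomposer covers far more letters.
fn decompose_tone(c: char) -> Option<(char, char)> {
    match c {
        '\u{1EBF}' => Some(('\u{EA}', '\u{301}')), // e circumflex acute -> e circumflex + combining acute
        '\u{1EA3}' => Some(('a', '\u{309}')),      // a hook above -> a + combining hook above
        '\u{1EA1}' => Some(('a', '\u{323}')),      // a dot below -> a + combining dot below
        _ => None,
    }
}

/// Decomposes every character found in the toy table and copies the rest unchanged.
fn decompose_tones(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match decompose_tone(c) {
            Some((base, mark)) => {
                out.push(base);
                out.push(mark);
            }
            None => out.push(c),
        }
    }
    out
}
```

In a pipeline following the paragraphs above, a step like this would run after normalization to Unicode Normalization Form C and before the text is handed to the windows-1258 encoder.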

Licensing

TL;DR: (Apache-2.0 OR MIT) AND BSD-3-Clause for the code and data combination.

Please see the file named COPYRIGHT.

The non-test code that isn't generated from the WHATWG data in this crate is under Apache-2.0 OR MIT. Test code is under CC0.

This crate contains code/data generated from WHATWG-supplied data. The WHATWG upstream changed its license for portions of specs incorporated into source code from CC0 to BSD-3-Clause between the initial release of this crate and the present version of this crate. The in-source licensing legends have been updated for the parts of the generated code that have changed since the upstream license change.

Documentation

Generated API documentation is available online.

There is a long-form write-up about the design and internals of the crate.

C and C++ bindings

An FFI layer for encoding_rs is available as a separate crate. The crate comes with a demo C++ wrapper using the C++ standard library and GSL types.

The bindings for the mem module are in the encoding_c_mem crate.

For the Gecko context, there's a C++ wrapper using the MFBT/XPCOM types.

There's a write-up about the C++ wrappers.

Sample programs

Optional features

There are currently these optional cargo features:

`simd-accel`

Enables SIMD acceleration using the nightly-dependent portable_simd standard library feature.

This is an opt-in feature because enabling it opts out of Rust's guarantee that future compilers will compile old code (a.k.a. the "stability story").

Currently, this has been tested to be an improvement only on the following targets, and enabling the simd-accel feature is expected to break the build on other targets:

x86_64
i686
aarch64
thumbv7neon

If you use nightly Rust, you use targets whose first component is one of the above, and you are prepared to have to revise your configuration when updating Rust, you should enable this feature. Otherwise, please do not enable this feature.
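The "first component" rule above can be expressed as a small check on a target triple. The helper below is a hypothetical sketch for illustration, not part of the crate, and it covers only the target condition. Whether you use nightly Rust and are prepared to revise your configuration remains your own call.

```rust
/// First components of the target triples listed above.
const SIMD_ACCEL_ARCHES: [&str; 4] = ["x86_64", "i686", "aarch64", "thumbv7neon"];

/// Returns true if the first `-`-separated component of `target_triple`
/// is one of the listed architectures. Hypothetical helper for illustration.
fn simd_accel_listed(target_triple: &str) -> bool {
    target_triple
        .split('-')
        .next()
        .map_or(false, |arch| SIMD_ACCEL_ARCHES.contains(&arch))
}
```

With this sketch, `x86_64-unknown-linux-gnu` and `aarch64-apple-darwin` pass the check, while `wasm32-unknown-unknown` does not.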

Used by Firefox.

`serde`

Enables support for serializing and deserializing &'static Encoding-typed struct fields using Serde.

Not used by Firefox.

`fast-legacy-encode`

A catch-all option for enabling the fastest legacy encode options. Does not affect decode speed or UTF-8 encode speed.

At present, this option is equivalent to enabling the following options:

fast-hangul-encode
fast-hanja-encode
fast-kanji-encode
fast-gb-hanzi-encode
fast-big5-hanzi-encode

Adds 176 KB to the binary size.

Not used by Firefox.

`fast-hangul-encode`

Changes the encoding of precomposed Hangul syllables into EUC-KR from a binary search over the decode-optimized tables to a lookup by index, making Korean plain-text encode about 4 times as fast as without this option.

Adds 20 KB to the binary size.
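The trade-off between the two lookup strategies can be sketched on a toy table. Everything here is an assumption made for illustration: the five-entry table, the names, and the `u16` pointer type are not the crate's actual data or code.

```rust
/// Toy decode-oriented table: the position is the pointer and the value is the
/// code point. The values are sorted, so an encoder can binary-search them.
const DECODE_TABLE: [u16; 5] = [0xAC00, 0xAC01, 0xAC04, 0xAC07, 0xAC08];

const FIRST: u16 = 0xAC00; // lowest code point covered by the index table
const SPAN: usize = 9; // number of code points covered (0xAC00..=0xAC08)
const UNMAPPED: u16 = u16::MAX; // sentinel for code points with no pointer

/// Encode by binary search over the decode table: no extra memory, O(log n) per lookup.
fn encode_by_search(code_point: u16) -> Option<u16> {
    DECODE_TABLE.binary_search(&code_point).ok().map(|i| i as u16)
}

/// Builds an encode-oriented table indexed by `code_point - FIRST`.
fn build_index_table() -> [u16; SPAN] {
    let mut table = [UNMAPPED; SPAN];
    for (pointer, &cp) in DECODE_TABLE.iter().enumerate() {
        table[(cp - FIRST) as usize] = pointer as u16;
    }
    table
}

/// Encode by direct index: one subtraction and one array access per lookup,
/// paid for with an additional table.
fn encode_by_index(table: &[u16; SPAN], code_point: u16) -> Option<u16> {
    let offset = code_point.checked_sub(FIRST)? as usize;
    match table.get(offset) {
        Some(&pointer) if pointer != UNMAPPED => Some(pointer),
        _ => None,
    }
}
```

Both functions return the same pointer for every input. The index variant avoids the repeated comparisons of the search, and the additional table is the kind of data that would account for a larger binary.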