tango.text.convert.UnicodeBom

License:

BSD style: see license.txt

Version:

Initial release: December 2005

Author:

Kris
enum Encoding
see http://icu.sourceforge.net/docs/papers/forms_of_unicode/#t2
class UnicodeBom(T) : BomSniffer
Convert unicode content
Unicode is an encoding of textual material. The purpose of this module is to interface an external encoding with a programmer-defined internal encoding. The internal encoding is declared via the template argument T, whilst the external encoding is either specified or derived.

Three internal encodings are supported: char, wchar, and dchar. The methods herein operate upon arrays of this type. That is, decode() returns an array of the type, while encode() expects an array of said type.

Supported external encodings are as follows:

Encoding.Unknown, Encoding.UTF_8N, Encoding.UTF_8, Encoding.UTF_16, Encoding.UTF_16BE, Encoding.UTF_16LE, Encoding.UTF_32, Encoding.UTF_32BE, Encoding.UTF_32LE

These can be divided into non-explicit and explicit encodings:

Non-explicit: Encoding.Unknown, Encoding.UTF_8, Encoding.UTF_16, Encoding.UTF_32

Explicit: Encoding.UTF_8N, Encoding.UTF_16BE, Encoding.UTF_16LE, Encoding.UTF_32BE, Encoding.UTF_32LE

The former group of non-explicit encodings may be used to 'discover' an unknown encoding, by examining the first few bytes of the content for a signature. This signature is optional, but is often written such that the content is self-describing. When an encoding is unknown, using one of the non-explicit encodings will cause the decode() method to look for a signature and adjust itself accordingly. It is possible that a ZWNBSP character might be confused with the signature; today's Unicode content is supposed to use the WORD-JOINER character instead.

The group of explicit encodings is for use when the content encoding is known. These *must* be used when converting back to external encoding, since written content must be in a known format. Note that, during a decode() operation, the presence of a signature conflicts with these explicit varieties.
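
As a rough illustration of signature discovery, the sketch below decodes a small buffer that begins with the UTF-16LE signature (FF FE). It uses only the constructor, decode(), encoding and encoded members documented on this page. The byte values and variable names are made up for the example, the import path is assumed from the module name, and the asserted outcome is what the description above suggests rather than verified output.

```d
import tango.text.convert.UnicodeBom;

void main()
{
    // Illustrative input: UTF-16LE signature (FF FE), then "hi" as
    // two little-endian UTF-16 code units.
    ubyte[] raw = [0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00];

    // Encoding.Unknown is non-explicit, so decode() looks for a signature.
    auto bom = new UnicodeBom!(char)(Encoding.Unknown);
    char[] text = bom.decode(raw);

    // Expected per the description above: the signature is stripped,
    // a signature is reported as found, and the derived encoding is
    // the explicit little-endian form.
    assert(text == "hi");
    assert(bom.encoded);
    assert(bom.encoding == Encoding.UTF_16LE);
}
```

Passing one of the explicit encodings to the constructor instead would, by the rules above, make the same input an error, because a signature is present where none is expected.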

See
http://www.utf-8.com/
http://www.hackcraft.net/xmlUnicode/
http://www.unicode.org/faq/utf_bom.html/
http://www.azillionmonkeys.com/qed/unicode.html/
http://icu.sourceforge.net/docs/papers/forms_of_unicode/
this(Encoding encoding)
Construct an instance using the given external encoding, one of the Encoding.xx types.
T[] decode(void[] content, T[] dst = null, size_t* ate = null) [final]
Convert the provided content. The content is inspected for a BOM signature, which is stripped. An exception is thrown if a signature is present when, according to the encoding type, it should not be. Conversely, an exception is thrown if there is no known signature where the current encoding expects one to be present.
Where 'ate' is provided, it will be set to the number of elements consumed from the input and the decoder operates in streaming mode. In that mode 'dst' should be supplied, since it is neither resized nor allocated.
void[] encode(T[] content, void[] dst = null) [final]
Perform encoding of content. Note that the encoding must be of the explicit variety by the time this method is called.
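
A minimal sketch of converting content back to an external form, assuming the instance is constructed with an explicit encoding as the note above requires. Whether encode() itself emits a signature is not stated here, so the sketch fetches signature() separately and leaves it to the caller to decide what to write. The sample text and variable names are illustrative, and the import path is assumed from the module name.

```d
import tango.text.convert.UnicodeBom;

void main()
{
    // Explicit external encoding, as encode() requires.
    auto bom = new UnicodeBom!(char)(Encoding.UTF_16LE);

    // The internal encoding is char (UTF-8); the result is raw bytes.
    char[] text = "hi".dup;
    void[] bytes = bom.encode(text);

    // The signature (BOM) for the current encoding, available to a
    // caller that wants the written content to be self-describing.
    const(void)[] sig = bom.signature;

    assert(bytes.length >= 4);   // at least two UTF-16 code units
    assert(sig.length == 2);     // FF FE for UTF-16LE
}
```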
T[] into(void[] x, uint type, T[] dst = null, size_t* ate = null) [static]
Convert from 'type' into the given T.
Where 'ate' is provided, it will be set to the number of elements consumed from the input and the decoder operates in streaming mode. In that mode 'dst' should be supplied, since it is neither resized nor allocated.
void[] from(T[] x, uint type, void[] dst = null, size_t* ate = null) [static]
Convert from T into the given 'type'.
Where 'ate' is provided, it will be set to the number of elements consumed from the input and the decoder operates in streaming mode. In that mode 'dst' should be supplied, since it is neither resized nor allocated.
class BomSniffer
Handle byte-order-mark prefixes
Encoding encoding() [@property, final]
Return the current encoding. This is either the originally specified encoding, or a derived one obtained by inspecting the content for a BOM. The latter is performed as part of the decode() method.
bool encoded() [@property, final]
Return whether an encoding was located in the text (configured via setup).
const(void)[] signature() [@property, final]
Return the signature (BOM) of the current encoding.
void setup(Encoding encoding, bool found = false) [final]
Configure this instance with Unicode converters.
const(Info)* test(void[] content) [static, final]
Scan the BOM signatures looking for a match. We scan in reverse order to get the longest match first.
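
The longest-match rule matters because the UTF-32LE signature (FF FE 00 00) begins with the UTF-16LE signature (FF FE). The self-contained sketch below shows the idea in plain D. It is not Tango's implementation: the Kind enum, the Signature struct, the signatures table and the sniff() function are hypothetical names introduced only for this illustration.

```d
// Hypothetical names throughout; a sketch of the technique only.
enum Kind { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE }

struct Signature
{
    Kind kind;
    immutable(ubyte)[] bytes;
}

// The 4-byte UTF-32 signatures sit after the 2-byte UTF-16
// signatures that they overlap with.
immutable Signature[] signatures = [
    Signature(Kind.Utf8,    [0xEF, 0xBB, 0xBF]),
    Signature(Kind.Utf16BE, [0xFE, 0xFF]),
    Signature(Kind.Utf16LE, [0xFF, 0xFE]),
    Signature(Kind.Utf32BE, [0x00, 0x00, 0xFE, 0xFF]),
    Signature(Kind.Utf32LE, [0xFF, 0xFE, 0x00, 0x00]),
];

Kind sniff(const(ubyte)[] content)
{
    // Walk the table backwards so the longer signatures are tried first.
    foreach_reverse (ref sig; signatures)
    {
        if (content.length >= sig.bytes.length
            && content[0 .. sig.bytes.length] == sig.bytes)
            return sig.kind;
    }
    return Kind.None;
}

void main()
{
    ubyte[] utf32le = [0xFF, 0xFE, 0x00, 0x00, 0x68, 0x00, 0x00, 0x00];
    ubyte[] utf16le = [0xFF, 0xFE, 0x68, 0x00];
    ubyte[] plain   = [0x68, 0x69];

    assert(sniff(utf32le) == Kind.Utf32LE);  // not mistaken for UTF-16LE
    assert(sniff(utf16le) == Kind.Utf16LE);
    assert(sniff(plain)   == Kind.None);
}
```

Scanning the same table front to back would report UTF-16LE for the first buffer, since FF FE matches before the 4-byte entry is ever tried.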