
Combining diacritics:
 * http://en.wikipedia.org/wiki/Unicode_equivalence
   * Canonically equivalent: n + ◌̃ = ñ (same display, printing, meaning)
   * Compatible: ligatures, ﬀ = ff (same in some apps: sorting, indexing)
   * Unicode normalization replaces equivalent sequences with a single chosen representation.
   * There are some equivalent characters (angstrom = 00C5 and 212B)
   * Combining vs precomposed characters (ligatures, combining)
   * Typographical conventions: ① is compatible with 1
   * There are 4 normal forms to compare/search for strings:
     * Canonical (NF) / Compatibility (NFK) equivalence (chosen semantically; canonical = strict, compatibility = relaxed)
     * Composed/Decomposed - doesn't matter, just choose one.
   * Data for the forms is in http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt (http://en.wikipedia.org/wiki/Character_property_(Unicode))
 * http://unicode.org/reports/tr15/ - Normalization & Equivalence.
   * http://www.icu-project.org/docs/papers/optimized_unicode_composition_and_decomposition.html
   * Algorithms: http://www.unicode.org/versions/Unicode6.3.0/ch03.pdf
   * TR15 Part 8: Legacy encodings - about how to convert from/to other encodings.
 * There's a Node.js Unicode normalization library, 'unorm'
 * http://en.wikipedia.org/wiki/Combining_diacritical_mark
   * If there are several combining codes, in canonical form they should be stable-sorted in order of combining class.
 * There are `quick check` flags: http://unicode.org/faq/normalization.html
   * We can check before encoding/decoding that the char is in the needed form.
 * There's also a complex combining algorithm for Korean 'Hangul' 'Jamo', through 3 tables.
 * Combining diacritical: 0x300-0x36F, 0x591-0x5C7, 0x610-0x61A, 0x64B-0x65F, some others.
 * Encodings containing combining characters: 864, 874, 1046, 1129, 1133, 1161-1163, 1255, 1256, 1258, 8859-6, 8859-11, TCVN, MacThai (mostly TCVN, 1258, 1255)
 * Even for single-byte encodings I need (when there are combining chars):
   * Composing when decoding.
   * Decomposing when encoding.
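To make the canonical vs. compatibility distinction concrete, here is a minimal sketch using Python's standard `unicodedata` module (not the 'unorm' library mentioned above). The helper names `canonically_equal` and `compatibility_equal` are made up for this illustration.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Strict comparison: equal after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equal(a: str, b: str) -> bool:
    """Relaxed comparison: equal after compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

# n + combining tilde (U+0303) vs. precomposed U+00F1
print(canonically_equal("n\u0303", "\u00F1"))    # True
# ANGSTROM SIGN (U+212B) vs. A WITH RING ABOVE (U+00C5)
print(canonically_equal("\u212B", "\u00C5"))     # True
# The ff ligature (U+FB00) is only compatibility-equivalent to "ff"
print(canonically_equal("\uFB00", "ff"))         # False
print(compatibility_equal("\uFB00", "ff"))       # True
# CIRCLED DIGIT ONE (U+2460) is compatible with "1"
print(compatibility_equal("\u2460", "1"))        # True
```

Comparing in a decomposed form is an arbitrary choice here; comparing after NFC/NFKC should give the same answers, which is the "just choose one" point above.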
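The canonical-ordering rule and the quick-check idea can be sketched as follows, again with Python's `unicodedata`. `canonical_order` and `needs_normalization` are hypothetical helper names, the reordering assumes its input is already decomposed, and `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata

def canonical_order(s: str) -> str:
    """Stable-sort every run of combining marks by combining class.

    Characters with combining class 0 (starters) end a run and stay put.
    Assumes s is already decomposed; this only reorders, it does not decompose.
    """
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            # sorted() is stable, so marks of equal class keep their order.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def needs_normalization(s: str, form: str = "NFC") -> bool:
    """Cheap pre-check before running a full normalization pass."""
    return not unicodedata.is_normalized(form, s)

# Acute (class 230) typed before dot-below (class 220): dot-below sorts first.
print(canonical_order("a\u0301\u0323") == "a\u0323\u0301")   # True
print(needs_normalization("\u00F1"))     # False, already NFC
print(needs_normalization("n\u0303"))    # True, NFC would compose it
```

Running the pre-check first means strings that are already in the target form can skip the normalization step entirely.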
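For the Hangul case, the Unicode Standard (chapter 3, linked above) also describes an arithmetic formulation over the three Jamo ranges (leading consonants, vowels, trailing consonants). The sketch below follows that formulation. The constants are the ones given in the standard; the function name `decompose_hangul` is an assumption for this example.

```python
# Constants from the Unicode Standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT      # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT      # 11172 precomposed syllables in total

def decompose_hangul(ch: str) -> str:
    """Split one precomposed Hangul syllable into its Jamo (L, V, optional T)."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch                            # not a precomposed syllable
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trail = T_BASE + s_index % T_COUNT       # equals T_BASE when there is no trailing Jamo
    result = chr(lead) + chr(vowel)
    if trail != T_BASE:
        result += chr(trail)
    return result

# U+D55C decomposes to U+1112, U+1161, U+11AB
print([f"U+{ord(c):04X}" for c in decompose_hangul("\uD55C")])
```

Composition runs the same arithmetic in reverse, so no per-syllable lookup table is needed for this block.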
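The "compose when decoding, decompose when encoding" idea can be sketched for one of the encodings listed above, windows-1258, using Python's built-in `cp1258` codec. This is an illustration under stated assumptions and not how iconv-lite implements it; `decode_composed` and `encode_decomposed` are hypothetical helpers.

```python
import unicodedata

def decode_composed(data: bytes, encoding: str) -> str:
    """Decode, then compose base + combining mark pairs into precomposed chars."""
    return unicodedata.normalize("NFC", data.decode(encoding))

def encode_decomposed(text: str, encoding: str) -> bytes:
    """Encode directly if possible; otherwise retry on the decomposed form."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        # Falls back to base letter + combining marks; this still raises
        # if the encoding lacks one of the resulting code points.
        return unicodedata.normalize("NFD", text).encode(encoding)

# U+1EBB (e with hook above) has no single byte in cp1258, but
# e + COMBINING HOOK ABOVE (U+0309) can be written as two bytes.
raw = encode_decomposed("\u1EBB", "cp1258")
print(len(raw))                                        # 2
print(decode_composed(raw, "cp1258") == "\u1EBB")      # True
```

A full NFD pass may decompose further than the encoding can represent (for example, splitting a letter into several marks when only some of them exist in the code page), so a real implementation would probably need a per-encoding partial decomposition.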

=================================================
SBCS fast algorithm fails (see http://www.icu-project.org/docs/papers/optimized_unicode_composition_and_decomposition.html as inspiration):
 * If combined char is in encoding, then un-combined is also there:
   * CP866: Її
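
Reading the CP866 entry as a counterexample to the assumption above, the sketch below checks which decomposition parts of a character an encoding cannot represent. It uses Python's built-in `cp866` codec; `missing_decomposition_parts` is a hypothetical helper written for this note.

```python
import unicodedata

def missing_decomposition_parts(ch: str, encoding: str) -> list:
    """Return the NFD code points of ch that the encoding cannot represent."""
    missing = []
    for part in unicodedata.normalize("NFD", ch):
        try:
            part.encode(encoding)
        except UnicodeEncodeError:
            missing.append(f"U+{ord(part):04X}")
    return missing

# U+0407 (Ї) encodes to a single CP866 byte, yet its base letter
# U+0406 (І) is reported as missing from the encoding.
print(len("\u0407".encode("cp866")))                  # 1
print(missing_decomposition_parts("\u0407", "cp866"))
```

An empty result would mean that every part of the decomposition is available in the encoding, which is the case the fast algorithm assumes.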