Better Typography

creating the html template for my site

12 Oct 2019 • Berlin

Purpose of this document

Landjaeger bresaola salami, rump drumstick fatback brisket boudin pastrami doner shankle capicola biltong. Ribeye bacon sirloin rump, beef picanha shank pork loin buffalo pastrami prosciutto hamburger. Burgdoggen ham hock bresaola pork loin shoulder strip steak meatball. Filet mignon beef pork pork chop ground round boudin pork belly tenderloin leberkas prosciutto tri-tip.

\[
f_\text{hash}=
\begin{cases}
h\left(\left[\begin{array}{c} \text{read}(0,\min(B, f_\text{size})) \end{array}\right], f_\text{size}\right),& \text{if } f_\text{size}\leq B&&(1)\\[4pt]
h\left(\left[\begin{array}{c} \text{read}(f_\text{size} - B, B) \end{array}\right], f_\text{size}\right),& \text{if } f_\text{size}\leq 2B&&(2)\\[4pt]
h\left(\left[\begin{array}{c} \text{read}(0, B) \\ \text{read}(\left \lfloor{\frac{f_\text{size}}{2}}\right \rfloor, B) \\ \text{read}(f_\text{size} - B, B) \end{array}\right], f_\text{size}\right),& \text{otherwise}&&(3)
\end{cases}
\]

An Equation
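The three cases of the equation can be read as a rule for choosing which parts of a file get hashed. The following is a minimal sketch of that rule only, assuming B is a positive block size; the hash function h is not defined here, and the names Range and sample_ranges are hypothetical helpers introduced for illustration. Because the cases are tested in order, case (2) effectively applies when B < f_size ≤ 2B.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// One read request: (offset, length), both in bytes.
using Range = std::pair<std::uint64_t, std::uint64_t>;

// Hypothetical helper: lists the reads that cases (1)-(3) pass to h,
// in the order they appear in the equation. Assumes B > 0.
std::vector<Range> sample_ranges(std::uint64_t f_size, std::uint64_t B)
{
    if (f_size <= B) {
        // Case (1): the whole file fits into a single block.
        return { Range{0, std::min(B, f_size)} };
    }
    if (f_size <= 2 * B) {
        // Case (2): only the trailing block is read.
        return { Range{f_size - B, B} };
    }
    // Case (3): one block each from the head, the middle and the tail.
    return { Range{0, B}, Range{f_size / 2, B}, Range{f_size - B, B} };
}
```

With B = 16, a 10-byte file yields the single read (0, 10), a 20-byte file yields (4, 16), and a 100-byte file yields (0, 16), (50, 16) and (84, 16).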

Bresaola pancetta tail tenderloin, capicola buffalo jerky swine cupim bacon shoulder boudin. Rump leberkas sausage spare ribs shank. Ham hock shankle burgdoggen frankfurter chuck. Ground round kielbasa bresaola, pork loin jerky shoulder pork.

Qui picanha pancetta, prosciutto pork lorem ball tip ham hock ut flank laboris esse dolor. Pastrami porchetta meatloaf ipsum. Laboris cillum laborum ea sed et bacon mollit turkey rump burgdoggen. Pancetta ut sint aute irure qui. Tail picanha hamburger officia strip steak rump ullamco proident turducken capicola. Pariatur corned beef veniam pastrami laborum. Doner commodo shoulder magna, pig ut elit shank chicken cillum pariatur ham.

std::fstream fout("abc.txt");

In ea chuck, reprehenderit swine consequat kielbasa deserunt magna cillum tongue. Shank dolore dolore deserunt. Tongue tempor fugiat in swine pig capicola frankfurter nisi irure laborum proident. Meatloaf in brisket officia enim prosciutto short ribs buffalo culpa est dolore pork loin. Ullamco ground round jowl shankle duis beef sausage meatball deserunt eiusmod. Alcatra filet mignon biltong in nisi anim pig fugiat ut cow. Cow strip steak short loin veniam tempor chicken. Sunt et fatback esse buffalo culpa voluptate nisi excepteur sirloin eiusmod pork belly beef est pork. Mollit spare ribs pancetta ipsum quis irure.

Chuck consectetur cillum pork belly jowl prosciutto pig, sausage officia frankfurter cupidatat. Landjaeger picanha nisi kevin ham reprehenderit spare ribs chuck ribeye drumstick kielbasa ut. Dolore labore adipisicing, officia shoulder ball tip in fugiat burgdoggen beef ribs sed aute. Cupim meatball flank, lorem proident shankle in leberkas magna swine ball tip. Quis do boudin landjaeger deserunt pork irure voluptate. Exercitation sunt aliquip aute do landjaeger frankfurter aliqua bacon pork belly picanha. Eiusmod pig aute flank ut chuck esse tenderloin landjaeger tail pancetta bresaola elit voluptate cupim. Pork nostrud brisket cupidatat. Mollit pork loin culpa doner pork belly flank andouille brisket filet mignon anim magna excepteur picanha t-bone.

Background

Tenderloin sed cupidatat meatloaf rump quis. Turducken pork belly consectetur kielbasa. Aute jerky excepteur in sed. Shankle burgdoggen officia hamburger, fugiat salami pancetta pastrami cupidatat culpa venison pork shoulder reprehenderit. Tongue consequat id meatloaf doner do boudin commodo corned beef sausage jerky exercitation alcatra nulla laborum. Chuck ball tip pariatur biltong.

Fugiat eiusmod consectetur veniam, shankle pancetta tempor anim capicola non proident exercitation. Ipsum velit dolor leberkas. Beef ball tip capicola occaecat nisi reprehenderit. Do burgdoggen deserunt magna tri-tip jowl, adipisicing eu duis cow id. Pastrami shankle occaecat mollit sed ut. Officia magna voluptate shankle cillum proident kevin, ipsum turkey.

Me, when debugging.

Beef ribs pastrami filet mignon short ribs id tail, et capicola t-bone aliquip culpa sunt pariatur ex. Biltong ribeye nulla doner t-bone eiusmod labore ham enim jowl boudin. Pork belly ribeye shoulder, shank tempor pancetta pariatur meatloaf. Beef ribs dolore aute ground round mollit excepteur spare ribs.

The facts

Glyphs, graphemes and other Unicode species

Voluptate venison rump pork laborum, meatloaf ex. Laborum turducken chuck id kielbasa buffalo cupidatat veniam t-bone.

Code point
Any numerical value in the Unicode codespace.[§3.4, D10] For instance: U+3243F.
Code unit
The minimal bit combination that can represent a unit of encoded text.[§3.9, D77] For example, UTF-8, UTF-16 and UTF-32 use 8-bit, 16-bit and 32-bit code units respectively. The above code point will be encoded as four code units f0 b2 90 bf in UTF-8, two code units d889 dc3f in UTF-16 and as a single code unit 0003243f in UTF-32. Note that these are just sequences of groups of bits; how they are stored on an octet-oriented medium depends on the endianness of the particular encoding. When the above UTF-16 code units are stored, they will be converted to d8 89 dc 3f in UTF-16BE and to 89 d8 3f dc in UTF-16LE.
Abstract character

A unit of information used for the organization, control, or representation of textual data.[§3.4, D7] The standard further says in §3.1:

For the Unicode Standard, [...] the repertoire is inherently open. Because Unicode is a universal encoding, any abstract character that could ever be encoded is a potential candidate to be encoded, regardless of whether the character is currently known.

Encoded character
Coded character

A mapping between a code point and an abstract character.[§3.4, D11] For example, U+1F428 is a coded character which represents the abstract character 🐨 KOALA.

This mapping is neither total, nor injective, nor surjective:

  • Surrogates, noncharacters and unassigned code points do not correspond to abstract characters at all.
  • Some abstract characters can be encoded by different code points; U+03A9 GREEK CAPITAL LETTER OMEGA and U+2126 OHM SIGN both correspond to the same abstract character ‘Ω’, and must be treated identically.
  • Some abstract characters cannot be encoded by a single code point. These are represented by sequences of coded characters. For example, the only way to represent the abstract character ю́ cyrillic small letter yu with acute is by the sequence U+044E CYRILLIC SMALL LETTER YU followed by U+0301 COMBINING ACUTE ACCENT.

Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 LATIN SMALL LETTER G WITH ACUTE, or by the sequence <U+0067 LATIN SMALL LETTER G, U+0301 COMBINING ACUTE ACCENT>.
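The code unit values quoted in the "Code unit" definition above can be reproduced with a few lines of bit arithmetic. This is a minimal sketch rather than a reference encoder: it assumes its input is a valid Unicode scalar value, and the function names encode_utf8, encode_utf16 and utf16_bytes are hypothetical.

```cpp
#include <cassert>
#include <string>

// Encodes one code point as UTF-8 code units.
// Assumes cp is a Unicode scalar value (not a surrogate, <= U+10FFFF).
std::string encode_utf8(char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encodes one code point as UTF-16 code units (same assumption as above).
std::u16string encode_utf16(char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        const char32_t v = cp - 0x10000;            // 20 payload bits
        out += static_cast<char16_t>(0xD800 | (v >> 10));   // high surrogate
        out += static_cast<char16_t>(0xDC00 | (v & 0x3FF)); // low surrogate
    }
    return out;
}

// Serializes UTF-16 code units to octets in the requested byte order.
std::string utf16_bytes(const std::u16string& units, bool big_endian)
{
    std::string out;
    for (char16_t u : units) {
        const char hi = static_cast<char>(u >> 8);
        const char lo = static_cast<char>(u & 0xFF);
        out += big_endian ? hi : lo;
        out += big_endian ? lo : hi;
    }
    return out;
}
```

For U+3243F, encode_utf8 produces the four code units f0 b2 90 bf and encode_utf16 produces d889 dc3f; utf16_bytes then gives d8 89 dc 3f for big-endian and 89 d8 3f dc for little-endian storage, matching the values in the definition.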

Adipisicing biltong beef esse do corned beef, cupidatat in ipsum veniam pariatur. Ball tip eiusmod nisi reprehenderit capicola strip steak in in et culpa venison. Anim swine occaecat ad prosciutto, in consectetur sirloin reprehenderit turducken pig ribeye qui ut jowl. Aliquip short loin ham hock doner ut et culpa tongue. Ad leberkas ‘🐨’ rump sunt.

Encoding           HTML Source (Δ UTF-8)   Dense text (Δ UTF-8)
UTF-8              767 KB (0%)             222 KB (0%)
UTF-16             1186 KB (+55%)          176 KB (−21%)
UTF-8 zipped       179 KB (−77%)           83 KB (−63%)
UTF-16LE zipped    192 KB (−75%)           76 KB (−66%)
UTF-16BE zipped    194 KB (−75%)           77 KB (−65%)

Tail venison in, deserunt aliquip eu exercitation landjaeger quis ipsum. Short loin velit aute et.

Further myths on counting characters

As we already noted, there is a popular idea that counting, splitting, indexing or otherwise iterating over code points in a Unicode string should be considered a frequent and important operation. In this section, we review this in further detail.

1. Counting characters can be done in constant time with UTF-16.

This is a common mistake by those who think that UTF-16 is a fixed-width encoding. It is not. In fact, UTF-16 is a variable-length encoding. Refer to this FAQ if you deny the existence of non-BMP characters.

2. Counting characters can be done in constant time with UTF-32.

This depends on the meaning of the misused word ‘character’. It is true that we can count code units and code points in constant time in UTF-32. However, code points do not correspond to user-perceived characters. Even in the Unicode formalism, some code points correspond to coded characters and some to noncharacters.

3. Counting coded characters or code points is important.

We think that the importance of code points is frequently overstated. This is due to a common misunderstanding of the complexity of Unicode, which merely reflects the complexity of human languages. It is easy to tell how many characters there are in ‘Abracadabra’, but let’s go back to the following string:

Приве́т नमस्ते שָׁלוֹם

It consists of 22 (!) code points, but only 16 grapheme clusters. It may be reduced to 20 code points if converted to NFC. Yet, the number of code points in it is irrelevant to almost any software engineering task, with perhaps the only exception of converting the string to UTF-32. For example:
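To make the difference between the counts concrete, here is a minimal sketch of counting code points in UTF-8 and UTF-16. It assumes well-formed input and does no validation, and the overloaded name count_code_points is a hypothetical helper. Both overloads have to walk the whole string, and neither result says anything about grapheme clusters.

```cpp
#include <cstddef>
#include <string>

// Counts code points in UTF-8 by skipping continuation bytes (10xxxxxx).
// Assumes well-formed UTF-8; linear in the number of code units.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Counts code points in UTF-16 by skipping low (trailing) surrogates.
// Assumes well-formed UTF-16; also linear, since UTF-16 is variable-length.
std::size_t count_code_points(const std::u16string& utf16)
{
    std::size_t n = 0;
    for (char16_t u : utf16) {
        if (u < 0xDC00 || u > 0xDFFF) {
            ++n;
        }
    }
    return n;
}
```

Taking just the first word of the string, written with the accent as U+0301 COMBINING ACUTE ACCENT, the UTF-8 form has 14 code units and 7 code points, while a reader sees 6 characters. Likewise, the UTF-16 form of U+3243F has 2 code units but is a single code point.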

Note that our guidelines differ significantly from Microsoft’s original guide to Unicode conversion. Our approach is based on performing the wide string conversion as close to API calls as possible, and never holding wide string data. In the previous sections we explained that this will typically result in better performance, stability, code simplicity and interoperability with other software.

FAQ

  1. Q: Are you a linuxer? Is this a concealed religious fight against Windows?

    A: No, I grew up on Windows, and I am primarily a Windows developer. I believe Microsoft made a wrong design choice in the text domain, because they did it earlier than others.

  2. Q: Are you an Anglophile? Do you secretly think the English alphabet and culture are superior to any other?

    A: No, and my country is non-ASCII speaking. I do not think that using a format which encodes ASCII characters in a single byte is Anglo-centrism, or has anything to do with human interaction. Even though one can argue that source code of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed, as long as they do exist, text is not always composed for human audiences.

  3. Q: Why do you guys care? I program in C# and/or Java and I don’t need to care about encodings at all.

    A: This is false. Both C# and Java offer a 16-bit char type, which is less than a Unicode character, congratulations. The .NET indexer str[i] works in units of the internal representation, hence a leaky abstraction once again. Substring methods will happily return an invalid string, cutting a non-BMP character in parts.

    Furthermore, you have to mind encodings when you are writing your text to files on disk, network communications, external devices, or any place for other programs to read from. Please be kind and use System.Text.Encoding.UTF8 (.NET) in these cases, never Encoding.ASCII, UTF-16 or cellphone PDU, regardless of the assumptions about the contents.

    Web frameworks like ASP.NET do suffer from the poor choice of internal string representation in the underlying framework: the expected string output (and input) of a web application is nearly always UTF-8, resulting in significant conversion overhead in high-throughput web applications and web services.

  4. Q: Won’t the conversions between UTF-8 and UTF-16 when passing strings to Windows slow down my application?

    A: First, you will do some conversion either way. It’s either when calling the system, or when interacting with the rest of the world, e.g. when sending a text string over TCP. Also, those OS APIs which accept strings often perform tasks which are inherently slow, such as UI or file system operations. If your interaction with the system APIs dominates your application, here is a little experiment.

    One typical use of the OS APIs is to open files. This function executes in (184±3)μs on my machine:

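    // Baseline: the caller already holds a UTF-16 file name.
    // CreateFileW is what CreateFile expands to when UNICODE is defined.
    void f(const wchar_t* name)
    {
        HANDLE file = CreateFileW(name, GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
        DWORD written;
        WriteFile(file, "Hello world!\n", 13, &written, 0);
        CloseHandle(file);
    }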

    While this runs in (186±0.7)μs:

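    // Same work, but the UTF-8 name is widened right at the API call.
    // CreateFileW is what CreateFile expands to when UNICODE is defined.
    void f(const char* name)
    {
        HANDLE file = CreateFileW(widen(name).c_str(), GENERIC_WRITE, FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
        DWORD written;
        WriteFile(file, "Hello world!\n", 13, &written, 0);
        CloseHandle(file);
    }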

    (Run with name="D:\a\test\subdir\subsubdir\this is the sub dir\a.txt" in both cases and averaged over 5 runs. We used an optimized widen that relies on the contiguous storage guarantee that C++11 gives std::string.)

    This is just (1±2)% overhead. Also, MultiByteToWideChar is not the fastest UTF-8↔UTF-16 conversion function.
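The optimized widen used for the measurement is not shown here. As a rough, portable illustration of the work such a helper does, the following sketch decodes UTF-8 and re-encodes it as UTF-16. It makes several assumptions: the name widen_sketch is hypothetical, it returns std::u16string instead of std::wstring so that it behaves the same on every platform, and it expects well-formed UTF-8 without validating it. On Windows, a real helper would more likely wrap MultiByteToWideChar.

```cpp
#include <cstddef>
#include <string>

// Hypothetical, portable stand-in for a widen-style helper.
// Assumes well-formed UTF-8; no validation or error reporting is done.
std::u16string widen_sketch(const std::string& utf8)
{
    std::u16string out;
    out.reserve(utf8.size());  // UTF-16 never needs more code units than UTF-8

    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);

        // The lead byte gives the sequence length and the first payload bits.
        std::size_t len = 1;
        char32_t cp = lead;
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }

        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        i += len;

        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;  // non-BMP: emit a surrogate pair
            out += static_cast<char16_t>(0xD800 | (cp >> 10));
            out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

The conversion is a single linear pass over a short string, which is consistent with it adding little to a call that goes on to touch the file system.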

External Links & Further Reading