PetWay progress: revised strings and generics

2025-07-18

Changes

Low-level string iterators

When strings can have multiple memory layouts and character sizes, picking characters by index is a slow operation, because every access has to resolve the layout and the character size again. Low-level iterators expose the memory layout, so the inner code depends only on the character size.

This opens the way to variable character size, i.e. UTF-8.
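A minimal sketch of the idea, not PetWay's actual API: the `DemoStrView` structure and the `demo_*` functions are hypothetical names made up for illustration, and only character sizes 1, 2 and 4 are handled. The layout is resolved once, then each inner loop depends only on the character size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of a string's character data; an assumption for this sketch. */
typedef struct {
    const uint8_t *start;   /* first byte of character data */
    const uint8_t *end;     /* one past the last byte */
    unsigned char_size;     /* bytes per character: 1, 2 or 4 here */
} DemoStrView;

/* Inner loops: each one knows its character size at compile time. */
static size_t demo_count_1(const uint8_t *p, const uint8_t *end, uint32_t c)
{
    size_t n = 0;
    for (; p < end; p++) {
        if (*p == c) n++;
    }
    return n;
}

static size_t demo_count_2(const uint8_t *p, const uint8_t *end, uint32_t c)
{
    size_t n = 0;
    for (; p < end; p += 2) {
        uint16_t v;
        memcpy(&v, p, sizeof v);    /* avoids unaligned access */
        if (v == c) n++;
    }
    return n;
}

static size_t demo_count_4(const uint8_t *p, const uint8_t *end, uint32_t c)
{
    size_t n = 0;
    for (; p < end; p += 4) {
        uint32_t v;
        memcpy(&v, p, sizeof v);
        if (v == c) n++;
    }
    return n;
}

/* The character size is examined once per call, not once per character. */
static size_t demo_count_char(const DemoStrView *s, uint32_t c)
{
    assert(s != NULL && s->start <= s->end);
    switch (s->char_size) {
    case 1: return demo_count_1(s->start, s->end, c);
    case 2: return demo_count_2(s->start, s->end, c);
    case 4: return demo_count_4(s->start, s->end, c);
    default: return 0;  /* other sizes are not covered by this sketch */
    }
}
```

With index-based access the `switch` would effectively run for every character; here it runs once and the loops stay tight.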

It would be great if strings supported UTF-8 natively, but there are a couple of issues:

The C compiler does not provide the length of a UTF-8 string in codepoints, so it's impossible to initialize a PwValue structure without a runtime hack. Because of this, PwStringUtf8 is now a function that returns PwValue and panics on OOM.
Search and substring functions operate on indexes. In parsing algorithms, scanning is followed by substring extraction, and these functions need low-level counterparts to avoid re-scanning UTF-8 strings.
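To illustrate the first issue: the codepoint count has to be computed at runtime by walking the bytes. A minimal sketch, assuming well-formed, NUL-terminated UTF-8 input; `demo_utf8_codepoints` is a hypothetical name, not a PetWay function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Counts codepoints by counting every byte that is not a
 * continuation byte (10xxxxxx). Assumes well-formed UTF-8;
 * no validation is performed.
 */
static size_t demo_utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        if ((*p & 0xC0) != 0x80) n++;
    }
    return n;
}
```

This is a full scan of the string, which is also why index-based search followed by index-based substring extraction would walk the same bytes more than once.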

Merged CharPtr type with String

This change resulted in static strings. This does not mean that the string is constant. Like the other two kinds of strings, embedded and allocated, static strings can be modified thanks to copy-on-write (COW).

Static string initializers are PW_STATIC_STRING and PW_STATIC_STRING_UTF32. Rvalues are created with a single generic PwStaticString which never fails. It's a function, and it has to call strlen to initialize the length. There are no rvalue macros similar to PwString. They are easy to implement, but they would increase the entropy. Later, maybe, depending on use cases.

Increased bit width for char_size

This allowed storing the character size as is, 1-based instead of 0-based. It is also another step towards variable-size strings. But maybe it should be stored as a shift count to eliminate multiplications. This needs to be evaluated on ARM.
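The two options can be sketched as follows. Both function names are hypothetical, and this says nothing about which one is faster on a given CPU. Note that a shift count can only encode power-of-two sizes (1, 2, 4), so any other character size would need separate handling.

```c
#include <assert.h>
#include <stddef.h>

/* char_size stored as is: the byte offset needs a multiplication. */
static size_t demo_offset_by_size(size_t index, unsigned char_size)
{
    return index * char_size;
}

/* log2(char_size) stored instead: the byte offset needs only a shift. */
static size_t demo_offset_by_shift(size_t index, unsigned char_shift)
{
    assert(char_shift < sizeof(size_t) * 8);  /* larger shifts are undefined */
    return index << char_shift;
}
```

For character size 4 the shift is 2, so `demo_offset_by_size(10, 4)` and `demo_offset_by_shift(10, 2)` both give a byte offset of 40.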

Initializer and rvalue for UTF-32 embedded strings

That's PW_STRING_UTF32 and PwStringUtf32 respectively.

There's no support for wide characters, to avoid stepping into the surrogate pairs mess.

Replaced UTF-8 wrappers with `_ascii` versions in generics

Clang 16 did not correctly handle char8_t* and char* in generics. Clang 19 handles them correctly.
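For reference, this is the kind of dispatch involved. It is only a sketch with hypothetical names (`demo_len`, `demo_len_ascii`, `demo_len_utf8`), not PetWay's generics. C23 defines char8_t in <uchar.h> as unsigned char; the sketch spells that type out as `demo_char8_t` so that it also compiles in older language modes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for C23 char8_t, which is unsigned char; an assumption for portability. */
typedef unsigned char demo_char8_t;

/* char* arguments are treated as ASCII: length in bytes. */
static size_t demo_len_ascii(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* char8_t* arguments are treated as UTF-8: length in codepoints. */
static size_t demo_len_utf8(const demo_char8_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != 0; s++) {
        if ((*s & 0xC0) != 0x80) n++;   /* skip continuation bytes */
    }
    return n;
}

/* Selects the function by the static type of the argument. */
#define demo_len(s) _Generic((s),               \
        char *:               demo_len_ascii,   \
        const char *:         demo_len_ascii,   \
        demo_char8_t *:       demo_len_utf8,    \
        const demo_char8_t *: demo_len_utf8)(s)
```

A plain string literal selects the `_ascii` version, while an unsigned char buffer selects the UTF-8 one.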

Future work

Iterators: evaluate a branchless approach.
Add UTF-8 variable character size support.
Refactor MYAW and JSON parsers to use iterators instead of picking characters by index.