PetWay progress: revised strings and generics
2025-07-18
Changes
Low-level string iterators
When strings can have multiple memory layouts and character sizes, picking characters by index is a slow operation. Low-level iterators expose the memory layout so the inner code depends only on character size.
This opens the way to variable character size: UTF-8.
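A minimal sketch of the idea, not PetWay's actual API: StrIter, str_iter_init, str_iter_next and count_char are hypothetical names, and only character sizes 1, 2 and 4 are assumed. The memory layout is resolved once when the iterator is created, so the inner loop never asks how the string is stored, only how wide a character is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical iterator over fixed-width character data. */
typedef struct {
    const uint8_t *cur;   /* current position in the character buffer */
    const uint8_t *end;   /* one past the last character */
    uint8_t char_size;    /* bytes per character: 1, 2 or 4 in this sketch */
} StrIter;

/* Resolve the memory layout once: the caller hands over the raw buffer. */
StrIter str_iter_init(const void *data, size_t length, uint8_t char_size)
{
    assert(char_size == 1 || char_size == 2 || char_size == 4);
    const uint8_t *p = data;
    return (StrIter){ p, p + length * char_size, char_size };
}

/* Fetch the next character; returns 0 when exhausted, 1 otherwise. */
int str_iter_next(StrIter *it, uint32_t *out)
{
    if (it->cur >= it->end) return 0;
    switch (it->char_size) {
        case 1:  *out = *it->cur; break;
        case 2:  { uint16_t c; memcpy(&c, it->cur, 2); *out = c; break; }
        default: { uint32_t c; memcpy(&c, it->cur, 4); *out = c; break; }
    }
    it->cur += it->char_size;
    return 1;
}

/* Example inner loop: count occurrences of a character. */
size_t count_char(const void *data, size_t length, uint8_t char_size, uint32_t wanted)
{
    StrIter it = str_iter_init(data, length, char_size);
    size_t n = 0;
    uint32_t c;
    while (str_iter_next(&it, &c)) {
        if (c == wanted) n++;
    }
    return n;
}
```

A UTF-8 variant would only change str_iter_next: the step would become variable, while count_char would stay as it is.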
It would be great if strings supported UTF-8 natively, but there are a couple of issues:
- The C compiler does not provide the length of a UTF-8 string in code points, so it's impossible to initialize a PwValue structure without a runtime hack. Because of this, PwStringUtf8 is now a function that returns a PwValue and panics on OOM.
- Search and substring functions operate on indexes. In parsing algorithms, scanning is followed by substring extraction, and these functions need low-level counterparts to avoid re-scanning UTF-8 strings.
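Both issues come down to the same scan. A rough sketch, assuming valid NUL-terminated UTF-8 (utf8_codepoint_count and utf8_index_to_offset are hypothetical helpers, not PetWay functions):

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string.
   Assumes valid UTF-8: every byte that is not a continuation byte
   (10xxxxxx) starts a new code point. */
size_t utf8_codepoint_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *) s; *p; p++) {
        if ((*p & 0xC0) != 0x80) n++;
    }
    return n;
}

/* Convert a code point index to a byte offset. This is the linear scan
   that an index-based search followed by an index-based substring
   would have to repeat. */
size_t utf8_index_to_offset(const char *s, size_t index)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *) s;
    while (*p) {
        if ((*p & 0xC0) != 0x80) {   /* start of a code point */
            if (index == 0) break;
            index--;
        }
        p++;
    }
    return (size_t) (p - (const unsigned char *) s);
}
```

The compiler knows the size of a literal in bytes via sizeof, but the code point count exists only after a loop like the first one has run. A low-level search that returns the byte offset directly would let the following substring call skip the second function entirely.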
Merged CharPtr type with String
This change resulted in static strings. Static does not mean constant: like the other two kinds of strings, embedded and allocated, static strings can be modified thanks to COW.
Static string initializers are PW_STATIC_STRING and PW_STATIC_STRING_UTF32.
Rvalues are created with a single generic PwStaticString, which never fails.
It's a function, and it has to call strlen to initialize the length.
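To illustrate the difference between an initializer and a function here, a small sketch with hypothetical names (StaticStr, STATIC_STR_INIT and static_str are stand-ins, and the real PetWay macros may work differently): a macro that sees the literal can take its length from sizeof at compile time, while a function that receives only a pointer has to measure it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a static string value: it borrows the
   character data instead of owning it. */
typedef struct {
    const char *data;
    size_t length;
} StaticStr;

/* Initializer for string literals only: sizeof is a compile-time
   constant, so this also works for objects with static storage. */
#define STATIC_STR_INIT(literal) { (literal), sizeof(literal) - 1 }

/* Rvalue constructor: only a pointer is available here, so the length
   has to be measured with strlen at run time. Nothing is allocated,
   so there is no failure path. */
StaticStr static_str(const char *s)
{
    assert(s != NULL);
    return (StaticStr){ s, strlen(s) };
}
```

With these definitions, `static StaticStr greeting = STATIC_STR_INIT("hello");` is a constant initializer, while `static_str("hello")` can appear anywhere an expression is allowed.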
There are no rvalue macros similar to PwString.
They are easy to implement, but they would increase the entropy.
Later, maybe. Depending on use cases.
Increased bit width for char_size
This allowed storing the character size as is, 1-based instead of 0-based. It is also another step towards strings with variable character size. But maybe it should be stored as a shift counter to eliminate multiplications. Need to evaluate this on ARM.
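The two encodings under consideration, side by side. This is only a sketch with hypothetical function names, and it assumes character sizes are powers of two, which is the only case a shift counter can express:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset with the character size stored as is (1-based):
   needs a multiplication. */
size_t offset_by_size(size_t index, uint8_t char_size)
{
    assert(char_size >= 1);
    return index * char_size;
}

/* Byte offset with the character size stored as a shift counter:
   0 -> 1 byte, 1 -> 2 bytes, 2 -> 4 bytes. */
size_t offset_by_shift(size_t index, uint8_t char_shift)
{
    assert(char_shift <= 2);
    return index << char_shift;
}
```

The trade-off is that a shift counter cannot represent a 3-byte character, and whether the multiplication is actually slower depends on the target, hence the need to measure.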
Initializer and rvalue for UTF-32 embedded strings
These are PW_STRING_UTF32 and PwStringUtf32, respectively.
There's no support for wide characters, to avoid stepping into surrogate pairs shit.
Replaced UTF-8 wrappers with _ascii versions in generics
Clang 16 did not correctly handle char8_t* and char* in generics. Clang 19 handles them correctly.
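For readers unfamiliar with C generics, here is roughly what such a dispatch looks like. The names STR_LEN, len_ascii and len_utf32 are made up for this sketch, and it deliberately leaves char8_t out: plain char pointers go to the _ascii variant and char32_t pointers go to the UTF-32 one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Variant for plain char strings. */
size_t len_ascii(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Variant for UTF-32 strings: one array element per code point. */
size_t len_utf32(const char32_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n]) n++;
    return n;
}

/* Select the implementation from the static type of the argument.
   String literals decay to pointers in the controlling expression. */
#define STR_LEN(s) _Generic((s),      \
    char *:           len_ascii,      \
    const char *:     len_ascii,      \
    char32_t *:       len_utf32,      \
    const char32_t *: len_utf32)(s)
```

The selection happens at compile time, so `STR_LEN("abc")` and `STR_LEN(U"abc")` compile to direct calls of different functions. Adding a separate char8_t* association next to char* is the part that depends on how the compiler distinguishes the two types.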
Future work
- Iterators: evaluate branchless approach.
- Add UTF-8 variable character size support.
- Refactor MYAW and JSON parsers to use iterators instead of picking characters by index.