Ziglyph and Zigstr

A progress report on Ziglyph:

  • Code point type detection (punctuation, symbols, letters, etc.)
  • Unicode case (lower, upper, title) detection and conversion.
  • Unicode case folding.
  • Unicode normalization via Canonical (D) and Compatibility (KD) decomposition.
  • Grapheme Cluster segmentation and iteration.
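
For readers new to these terms, here is a minimal sketch of what decomposition and case folding do to a string. It uses Python's standard `unicodedata` module purely as a stand-in; none of the names below are Ziglyph's API, and the sample strings are my own.

```python
import unicodedata

# "é" as a single precomposed code point (U+00E9).
composed = "\u00e9"

# Canonical decomposition (NFD) splits it into "e" + COMBINING ACUTE ACCENT.
nfd = unicodedata.normalize("NFD", composed)
nfd_points = [f"U+{ord(c):04X}" for c in nfd]

# Compatibility decomposition (NFKD) additionally replaces compatibility
# characters: the "fi" ligature (U+FB01) becomes the two letters "f" and "i".
nfkd = unicodedata.normalize("NFKD", "\ufb01")

# Case folding maps text to a form meant for caseless matching;
# the German sharp s (U+00DF) folds to "ss".
folded = "Stra\u00dfe".casefold()

print(nfd_points)  # ['U+0065', 'U+0301']
print(nfkd)        # fi
print(folded)      # strasse
```

Note that case folding is not the same as lowercasing: it exists for comparison, not for display.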

In addition, the library now includes a UTF-8 string type, Zigstr, which uses much of the Ziglyph functionality to provide:

  • Byte level methods.
  • Code point level methods.
  • Grapheme cluster level methods.
  • Equality comparisons (exact, ignore case, normalized).
  • Concatenation.
  • Case detection / conversion.
  • and much more!
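
To illustrate why a string type might expose separate byte, code point, and comparison levels, here is a small sketch. It is written in Python with the standard library only; the helper names `eq_ignore_case` and `eq_normalized` are hypothetical and are not Zigstr's method names.

```python
import unicodedata

s = "H\u00e9llo"            # "Héllo" with a precomposed é (U+00E9)
decomposed = "He\u0301llo"  # same visible text: "e" + combining acute accent

byte_len = len(s.encode("utf-8"))  # 6: the é takes two bytes in UTF-8
code_point_len = len(s)            # 5 code points
decomposed_len = len(decomposed)   # 6 code points, same text on screen

def eq_ignore_case(a: str, b: str) -> bool:
    """Hypothetical helper: compare after Unicode case folding."""
    return a.casefold() == b.casefold()

def eq_normalized(a: str, b: str) -> bool:
    """Hypothetical helper: compare after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(byte_len, code_point_len, decomposed_len)  # 6 5 6
print(s == decomposed)                           # False (exact comparison)
print(eq_normalized(s, decomposed))              # True
print(eq_ignore_case("H\u00c9LLO", s))           # True
```

Python's standard library has no grapheme cluster segmentation, so that level is not shown here. It is the level at which both strings above would count as five user-perceived characters.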

This is the v0.1.0 release, or as I see it, the “get it to work” release. :smiley: I would now like to do some refactoring and optimizations before moving on to more functionality. I hope it’s already useful to anybody needing some Unicode tools in Zig. Feedback is welcome and greatly appreciated!

11 Likes

Holy cow, so much stuff, that’s amazing!

With this I could implement correct word wrapping for bork. I think there is only one thing missing for my use case: the ability to know how wide a given cluster should be when displayed. I believe this information is part of the Unicode standard, but I’d be happy to be corrected if that’s not the case.

Let me know, I’m super looking forward to streaming some bork development to integrate with your library!

2 Likes

@kristoff , when you refer to cluster width when represented, I think you mean like when an application has to draw it, right? I believe there are some areas of Unicode that deal with widths but I didn’t get to those. I’ll take a look and see what I find. There’s also a specific area on word boundaries and segmentation, which I plan on implementing next, but I left it for later because it’s a pretty complex set of rules.

Yes, I think we’re referring to the same thing. Clearly, in a more generalized context, it also depends on the font used, but I believe there is a concept of zero, normal, double width, or something along those lines.

In my case I’m printing text in a terminal so I just need to know when a cluster is an emoji (normal width), an ideogram (double width I believe), or a weird zero width character (not sure if those are even supposed to be standalone things or not), for example.

With that I can do perfect word wrapping in the terminal, or close to it. At the moment I’m naively tokenizing on ASCII space, so I am definitely interested in word boundaries and segmentation, but right now getting widths right is a more concrete problem.
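
The zero / normal / double width idea can be sketched roughly as follows. This is an approximation built on Python's standard `unicodedata` module (the East Asian Width property plus general categories), not Ziglyph's implementation; `char_width` and `display_width` are hypothetical names, and real terminals can disagree on edge cases.

```python
import unicodedata

def char_width(ch: str) -> int:
    """Approximate terminal column width of a single code point."""
    # Combining marks and format characters (e.g. zero width joiner)
    # do not occupy a column of their own.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide (W) and Fullwidth (F) characters take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def display_width(text: str) -> int:
    """Sum the per-code-point widths; a rough estimate, not a full algorithm."""
    return sum(char_width(ch) for ch in text)

print(display_width("abc"))               # 3
print(display_width("\u65e5\u672c\u8a9e")) # 6 (three CJK ideographs)
print(display_width("e\u0301"))           # 1 (combining accent adds nothing)
print(display_width("\U0001F600"))        # 2
```

One thing the sketch surfaces: U+1F600 (a common emoji) is classified as Wide in the Unicode data, so it counts as two columns here. Summing per code point is also only a first approximation, since multi-code-point emoji sequences need grapheme cluster segmentation to be measured as one unit.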

Yes, with just a single search on Google I see this is an important use-case, especially for fixed-width terminal style output. Very interesting indeed. I saw some links to implementations in other languages so I’ll be checking those out to see how deep the rabbit hole goes, but I suspect this should be fairly simple to implement once grapheme cluster segmentation is available, and that’s already in the bag. I’ll keep you posted.

1 Like

That’s awesome progress @dude_the_builder - you’re truly living up to your username! :heart_eyes:

4 Likes

@kristoff , check it out and let me know if this works. GitHub - jecolon/ziglyph: Unicode processing with Zig.

1 Like

Neat, I’ll try to find a moment to stream some bork development soon :tm:

1 Like

Unicode processing in Zig is badly needed, and Ziglyph looks awesome!
Thanks a ton for starting this! This is going to be super useful to many projects.

2 Likes

Just as an FYI, I added Ziglyph to bork and now I’m able to print emojis and other wide characters correctly. Thank you very much!

4 Likes

That’s really awesome to know! I’m making a boatload of optimizations all over the place, primarily to reduce binary sizes as much as possible without sacrificing speed. I’ve also been down to the depths of Unicode Collation Algorithm Hell, lol, but just a while ago I got all tests to pass on that. Hope to be pushing to GitHub soon. Thanks for the feedback!

4 Likes

Some more progress on Ziglyph:

  • Zigstr now a separate repo.
  • All Normalization forms supported (NFC, NFKC, NFD, NFKD).
  • Case detection / conversion for strings.
  • Unicode Text Segmentation: Grapheme Clusters, Words, and Sentences via iterators. Non-allocating comptime versions for comptime strings also available.
  • Display Width calculations and operations such as padLeft, padRight, center, and wrap for strings.

4 Likes

Can someone write a “Teach my ASCII brain Unicode through Zig vision” article on zig.news, please?

Ok, so it doesn’t have to be given that title. It’d be helpful to learn how to think about string processing and how to do it with Zig. I assume these libraries would be part of the article.

Yeah, I’ve been putting off preparing a write-up on these topics for too long. I’ll give it a shot. :slight_smile:

1 Like

Part I: Unicode Basics in Zig - Zig NEWS

7 Likes