Strings in Zig: What am I missing?

Hi

I cannot find a string type or a string namespace in Zig’s standard library. There is a section in the documentation about the nature of string literals, Unicode, etc.: Documentation - The Zig Programming Language. But how do you do all the things you normally do with strings: split, join, reverse, capitalize, etc.?
Am I looking in the wrong places? If not, and this is a deliberate decision, could someone talk a little more about the reasons behind it?

Ben


Strings are just slices of u8, so most of the “string methods” are actually “slice methods”, e.g. see: Documentation - Zig
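To make that concrete, here is a minimal sketch of a string literal treated as a plain byte slice. It assumes a reasonably recent Zig release; the variable names are only for illustration.

```zig
const std = @import("std");

pub fn main() void {
    // A string literal coerces to a slice of constant bytes.
    const greeting: []const u8 = "hello, zig";

    // Ordinary slice operations apply: length, indexing, sub-slicing.
    const first: u8 = greeting[0]; // 'h'
    const tail = greeting[7..]; // "zig"

    // Comparison is a function from std.mem, not an operator.
    const same = std.mem.eql(u8, tail, "zig");

    std.debug.print("len={d} first={c} tail={s} same={}\n", .{ greeting.len, first, tail, same });
}
```

Note that `greeting.len` counts bytes, not characters, which only coincide for ASCII text.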


See jmc’s answer above; however, there are also a few extra facilities relating specifically to slices of u8, namely std.ascii and std.unicode.


As the previous answers point out, you can go a long way in a language by handling just arrays of bytes, without a native “string” in the type system. Given how extremely complex the “proper” handling of human languages as text in computers is, it’s not a trivial question whether a string type should be built in or provided as a library. I’ve been reading up a lot on Unicode lately (must be the quarantine :smiley: ) and have been surprised at how many string implementations in major programming languages are sub-optimal, incomplete, or plain wrong.

If the string type in all those cases were a library, you could just swap it out for a better one and move on. But if it’s built in, you have to wait for a new language version, and that’s only if the problem can be fixed without breaking the whole ecosystem. I know how convenient it is to handle strings in many languages using the builtin type and standard library utilities, and how easy it is when you can use operators like + and ==; but if you get it wrong, or leave out parts that may not matter to an English-speaking audience but are crucial for, say, Asian or Western European users, it becomes a pretty big flaw in the language.

So in conclusion, if you’re gonna have a builtin string type, you’d better get it right, and that’s no trivial task. Not impossible, but it requires careful attention. On the upside, seeing the incredible things Zig has already accomplished so early in its evolution, I’m pretty sure that if there’s a way, they’ll find it!


Here’s a simple example in Go, and these are the guys that also created UTF-8 (Thompson and Pike), so getting it right is definitely not easy! The Go Playground


This is one of my top desires as well, @ben . I love that Zig doesn’t try to hide the difficulty of handling “strings” from the developer, but some guidance is sorely needed.

I come from a number of dynamic languages where these details are papered over (sometimes very badly), but it’s also super clear how to perform the basic actions you’ve described (until something goes wrong, of course :sweat_smile: ).

(@dude_the_builder - you summarized the problems I’ve encountered with various languages perfectly!)

@jmc and @truemedian Thanks for the hints on where to find what is needed from the Zig Standard Library!

std.mem

The “mem” functions for treating strings as “just bytes” do a lot:

https://ziglang.org/documentation/master/std/#std;mem

  • concat()
  • endsWith()
  • eql()
  • indexOfPos()
  • join()
  • …and many more.

Also mem.SplitIterator and mem.TokenIterator look very interesting.
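As a rough sketch of how the std.mem pieces fit together for split, join, and reverse: the helper names `rejoin` and `reversedBytes` and the fixed limit of 8 fields are made up for this example. It also assumes Zig 0.11 or later, where the splitting function is called `std.mem.splitScalar`; older releases spell it differently.

```zig
const std = @import("std");

/// Splits `csv` on commas and re-joins the fields with " | ".
/// Caller owns the returned memory.
fn rejoin(allocator: std.mem.Allocator, csv: []const u8) ![]u8 {
    var fields: [8][]const u8 = undefined; // hypothetical fixed upper bound
    var count: usize = 0;

    var it = std.mem.splitScalar(u8, csv, ',');
    while (it.next()) |field| {
        if (count == fields.len) return error.TooManyFields;
        fields[count] = field;
        count += 1;
    }
    return std.mem.join(allocator, " | ", fields[0..count]);
}

/// Returns a byte-wise reversed copy. Only meaningful for ASCII input:
/// reversing bytes scrambles multi-byte UTF-8 sequences.
fn reversedBytes(allocator: std.mem.Allocator, s: []const u8) ![]u8 {
    const copy = try allocator.dupe(u8, s);
    std.mem.reverse(u8, copy);
    return copy;
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    const joined = try rejoin(allocator, "a,b,c");
    defer allocator.free(joined);

    const rev = try reversedBytes(allocator, "zig");
    defer allocator.free(rev);

    std.debug.print("{s}\n{s}\n{}\n", .{ joined, rev, std.mem.endsWith(u8, joined, "c") });
}
```

The split iterator hands back sub-slices of the input without allocating; only `join` and `dupe` need an allocator, which is why the caller passes one in and frees the results.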

std.ascii

It looks like ASCII operations are mostly for case changes and classifying individual characters:

https://ziglang.org/documentation/master/std/#std;ascii

  • allocLowerString()
  • allocUpperString()
  • isAlpha()
  • isDigit()
  • etc…
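For the “capitalize” case, something along these lines should work for ASCII text. `capitalizeAscii` is a hypothetical helper, not a standard library function; `std.ascii.toUpper`, `std.ascii.allocUpperString`, and `std.ascii.isDigit` are the library calls it leans on. Non-ASCII bytes are left untouched, so this is not a Unicode-aware capitalization.

```zig
const std = @import("std");

/// Returns a copy of `s` with the first byte upper-cased if it is an
/// ASCII letter. Caller owns the returned memory.
fn capitalizeAscii(allocator: std.mem.Allocator, s: []const u8) ![]u8 {
    const out = try allocator.dupe(u8, s);
    if (out.len > 0) out[0] = std.ascii.toUpper(out[0]);
    return out;
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    const cap = try capitalizeAscii(allocator, "zig");
    defer allocator.free(cap);

    // Upper-case the whole string into freshly allocated memory.
    const upper = try std.ascii.allocUpperString(allocator, "zig");
    defer allocator.free(upper);

    std.debug.print("{s} {s} {}\n", .{ cap, upper, std.ascii.isDigit('7') });
}
```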

std.unicode

The Unicode operations are specific to encoding and decoding 21-bit code points (u21) as UTF-8 and UTF-16:

https://ziglang.org/documentation/master/std/#std;unicode

  • utf8CodepointSequenceLength()
  • utf8Decode()
  • utf8Encode()
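A small sketch of what working at the code point level can look like, using `std.unicode.Utf8View` to iterate and `utf8Encode` to go the other way. `countCodepoints` is a name invented for this example, and the code assumes a recent Zig release.

```zig
const std = @import("std");

/// Counts code points (not bytes, and not grapheme clusters) in a UTF-8 string.
fn countCodepoints(s: []const u8) !usize {
    const view = try std.unicode.Utf8View.init(s); // fails on invalid UTF-8
    var it = view.iterator();
    var n: usize = 0;
    while (it.nextCodepoint()) |_| n += 1;
    return n;
}

pub fn main() !void {
    const s = "h\u{e9}llo"; // U+00E9 takes two bytes in UTF-8
    std.debug.print("bytes={d} codepoints={d}\n", .{ s.len, try countCodepoints(s) });

    // Encode a single code point into a caller-provided buffer.
    var buf: [4]u8 = undefined;
    const len = try std.unicode.utf8Encode(0x20AC, &buf); // U+20AC EURO SIGN
    std.debug.print("{s} takes {d} bytes\n", .{ buf[0..len], len });
}
```

Code points are still not “characters” as a user sees them; grouping them into grapheme clusters is beyond what these functions do.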

So…

Between these three parts of the library, you can probably accomplish just about anything, but knowing how to put it all together remains a puzzle for the developer to sort out.

A “Zig String Cookbook” would be enormously valuable.

The True Nature of a String

On a related note, I recently attempted to summarize all of the things the standard library’s format() "{s}" formatting option will accept as a string. The documentation says:

for pointer-to-many and C pointers of u8, print as a C-string using zero-termination
for slices of u8, print the entire slice as a string without zero-termination

and here are the things that work:

    const std = @import("std");
    const print = std.debug.print;

    pub fn main() void {
        const real: *const [3:0]u8 = "baz"; // pointer to null-terminated array
        const a: []const u8 = "foo"; // slice (pointer with length)
        const b: [*]const u8 = "bar"; // many-item pointer (unknown length!)
        const b2: []const u8 = b[0..3]; // ...sliced, so it has a length
        const c = [3]u8{ 'f', 'o', 'o' }; // array
        const d = [_]u8{ 'b', 'a', 'r' }; // array with inferred size
        const e: [3]u8 = .{ 'b', 'a', 'z' }; // array from anonymous list literal
        print("{s} {s} {s} {s} {s} {s}\n", .{ real, a, b2, c, d, e });
    }

There may be more?


Hey @ben , but you’re a Perl guy, so to you any other language’s string processing will be a joke. :smiley: Just kidding man, Perl is awesome too!


Anybody interested in implementing normalization functions in the std.unicode namespace?
:stuck_out_tongue:


The recently added std.meta.trait.isZigString() tries to capture all allowed cases, so that you can more easily answer the question “is this thing usable as a string” when doing metaprogramming.


Would be awesome to work on that, but still need a lot more reading on Unicode and practicing Zig skills before I would dare. But hey, C++ hasn’t got it yet in pretty much 30 years so we have time. :laughing:


I see, and that all makes perfect sense, but what about the concept of “strings” (vague as it may be)? I love doing things with “strings”, but maybe that’s just an old habit.


From my perspective, in line with what @dude_the_builder said, with unicode you either go big or go home and I don’t think it would make sense for Zig to have a builtin type that tries to be “unicode”.

It’s not just codepoints and normalization, but also grapheme clusters and all the related operations. On top of that, such a type would have to decode everything up front, and it would trick programmers who are not used to handling this stuff into very inefficient code paths.

Strings should be byte sequences unless they need to be something else, and in those cases you want full support, maybe even with support for different unicode versions depending on the context.

When writing bork I was faced with all these problems, and for now the best tool I’ve found for properly implementing Unicode-aware word wrapping is utf8proc (GitHub - JuliaStrings/utf8proc: a clean C library for processing UTF-8 Unicode data), which I hope to start using as soon as I get back into bork development.

This is bork btw, it’s a TUI twitch chat client.


I also have text processing needs.

I’m trying to extend the ChordPro format (details for the musically inclined). I need to parse a custom text format and generate HTML (and then PDF, but that can happen externally).

Mecha seems to get by without extensive Unicode support. Koino, on the other hand, relies on zunicode, which is undocumented and seems unmaintained.

(Still looking for an HTML template library. Maybe template.zig or etch…)


@teknico : I forked zunicode to update it to Unicode 13 and Zig 0.8. All tests are passing. Made a PR so let’s see if I get a response.

It’s interesting because it’s a translation of the Go unicode package, and that’s exactly what I had done this past week in my own fiddling with Unicode and Zig. I made a little benchmark processing about 3.9 MB of UTF-8 text in English, Spanish, and Chinese. When using Go’s native string type and a range loop to get the code points, Go is fastest (13ms) versus zunicode (25ms). If I instead use Go’s utf8.DecodeRune function (from the unicode/utf8 package) on a slice of bytes, Go takes 35ms, so zunicode wins there. I wonder how the Go native string can process code points so fast.


That’s very interesting, no idea about the speed difference.

Thank you so much for your efforts! I hope your PR gets merged; regardless, it sounds very good. I wonder if the Go stdlib authors would mind too much us borrowing some of their docs too… :slight_smile:


Can I ask if these sources (Go or Zig) are generated from some Unicode data tables, or is the code maintained by hand? Just curious.

The data tables in the Go package are auto-generated from the official Unicode database. From that, the original zunicode author modified the Go program so that it would generate a Zig file with the tables. I’m looking at the code for this to see if I can port it to Zig, and thus have the data update functionality native in the zunicode lib.


Great, makes sense to do it this way. Do you need help?


@gonzo : Right now I’m studying the code to see how it works in Go to then start the porting process to Zig. It seems it won’t be too hard, but there are some parts that have some “clever” bits of code that may need more analysis. I’ll keep you posted if I get stuck! Thanks.


There should be no problem as long as we give proper attribution.