Introducing Utf8StringInterpolation: Optimizing UTF8 String Generation in C#
We are excited to announce the release of a new library, Utf8StringInterpolation! This library specializes in generating and writing UTF8 strings. It leverages the C# compiler's support for customizing string interpolation, offering both direct string generation and continuous writing akin to StringBuilder.
The basic flow goes like this, allowing you to generate/write in UTF8 just like you would with a regular String:
using Utf8StringInterpolation;
// Create UTF8 encoded string directly(without encoding).
byte[] utf8 = Utf8String.Format($"Hello, {name}, Your id is {id}!");
// write to IBufferWriter<byte>(for example ASP.NET HttpResponse.BodyWriter)
// support custom format(yyyy-MM-dd and others)
Utf8String.Format(bufferWriter, $"Today is {DateTime.Now:yyyy-MM-dd}");
// like a StringBuilder
var writer = Utf8String.CreateWriter(bufferWriter);
writer.Append("My Name...");
writer.AppendFormat($"is...? {name}");
writer.AppendLine();
writer.Flush();
// Join, Concat methods
var seq = Enumerable.Range(1, 10);
byte[] utf8seq = Utf8String.Join(", ", seq);
Though closely related to Cysharp's ZString, it differs in scope: ZString supported both String (UTF16) and UTF8, whereas Utf8StringInterpolation focuses solely on UTF8. The motivation for this shift is the performance improvement brought by C# 10.0's Improved Interpolated Strings. Specifically, the compiler now breaks down the structure of an interpolated string (like $"foo{bar}baz") and passes each embedded value through a generic method, thereby eliminating boxing. That was essentially the goal of ZString, but ZString was designed before C# 10.0, which means that half of its functionality has become obsolete with these improvements.
Still, standard support for the UTF8 side remains thin. Improved Interpolated Strings, however, also permit custom behavior for string interpolation. With this in mind, we designed Utf8StringInterpolation to use string interpolation for building UTF8, positioning it as a successor to ZString.
// Passing such a string interpolation as:
// Utf8String.Format(ref Utf8StringWriter format)
Utf8String.Format($"Hello, {name}, Your id is {id}!");
// The compiler will "at compile-time" expand it to:
var writer = new Utf8StringWriter(literalLength: 21, formattedCount: 2);
writer.AppendLiteral("Hello, ");
writer.AppendFormatted<string>(name);
writer.AppendLiteral(", Your id is ");
writer.AppendFormatted<int>(id);
writer.AppendLiteral("!");
The ability to expand at compile-time is crucial for performance. This means we no longer need to parse string expressions at runtime like String.Format. Moreover, all values can be written without boxing.
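The lowering described above can be reproduced with a small handler of your own. The following is a minimal sketch, not the library's implementation: DemoHandler and Utf8Demo are hypothetical names, and AppendFormatted takes a shortcut through ToString() (which allocates a string) purely to keep the example short.

```csharp
using System;
using System.Buffers;
using System.Runtime.CompilerServices;
using System.Text;

string name = "foo";
int id = 42;

// The compiler turns this call into: new DemoHandler(21, 2), a series of
// AppendLiteral/AppendFormatted<T> calls, and finally Utf8Demo.Format(ref handler).
byte[] utf8 = Utf8Demo.Format($"Hello, {name}, Your id is {id}!");
Console.WriteLine(Encoding.UTF8.GetString(utf8)); // Hello, foo, Your id is 42!

public static class Utf8Demo
{
    // Hypothetical entry point: accepting the handler type is what triggers the lowering.
    public static byte[] Format(ref DemoHandler handler) => handler.ToArray();
}

[InterpolatedStringHandler]
public ref struct DemoHandler
{
    private readonly ArrayBufferWriter<byte> writer;

    // literalLength: total chars in the literal parts; formattedCount: number of holes.
    public DemoHandler(int literalLength, int formattedCount)
    {
        writer = new ArrayBufferWriter<byte>(Math.Max(literalLength, 16));
    }

    public void AppendLiteral(string value)
    {
        // Request a span big enough for the worst case, encode, then commit.
        Span<byte> span = writer.GetSpan(Encoding.UTF8.GetMaxByteCount(value.Length));
        writer.Advance(Encoding.UTF8.GetBytes(value, span));
    }

    public void AppendFormatted<T>(T value)
    {
        // Simplification (assumption of this sketch): format via ToString().
        // The generic parameter still means the value itself is never boxed at the call site.
        AppendLiteral(value?.ToString() ?? "");
    }

    public byte[] ToArray() => writer.WrittenSpan.ToArray();
}
```

Because the method parameter is the handler type, no string for the whole message is ever created; the literal parts and the values arrive as separate calls.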
Utf8StringWriter
InterpolatedStringHandler
The Utf8StringWriter struct provides a direct way to write values in UTF8 format without resorting to boxing. It is marked with the [InterpolatedStringHandler] attribute, which means that when you pass an interpolated string (e.g., $"Hello, {name}!"), the compiler automatically expands it into a series of append calls, simplifying the UTF8 encoding process.
// The structure of the Utf8StringWriter struct (simplified)
[InterpolatedStringHandler]
public ref struct Utf8StringWriter<TBufferWriter> where TBufferWriter : IBufferWriter<byte>
{
    TBufferWriter bufferWriter; // when buffer is full, advance and get more buffer
    Span<byte> buffer; // current write buffer
    public void AppendLiteral(string value)
    {
        // encode string literal to Utf8 buffer directly
        var bytesWritten = Encoding.UTF8.GetBytes(value, buffer);
        buffer = buffer.Slice(bytesWritten);
    }
    public void AppendFormatted<T>(T value, int alignment = 0, string? format = null)
        where T : IUtf8SpanFormattable
    {
        // write value to Utf8 buffer directly
        // (bytesWritten is declared outside the loop so it stays in scope afterwards)
        int bytesWritten;
        while (!value.TryFormat(buffer, out bytesWritten, format, null))
        {
            Grow(); // not shown here: advance the bufferWriter and get more buffer
        }
        buffer = buffer.Slice(bytesWritten);
    }
}
In .NET 8, if the value passed implements the IUtf8SpanFormattable interface (as most standard primitives such as int do), you can write it directly to a Span<byte> as UTF8 using the TryFormat method.
However, supporting only .NET 8 would be rather restrictive. Therefore, for backward compatibility with .NET Standard 2.1, .NET 6, and .NET 7, Utf8Formatter.TryFormat is used on those targets instead, which offers comparable performance.
Builder vs. Writer
Previously, with ZString, the approach was closer to StringBuilder, with an internal buffer. However, for UTF8 use cases, this didn't seem optimal. A recent presentation, Modern High-Performance C# 2023 Edition at CEDEC 2023, highlighted that the .NET standard leans towards IBufferWriter<byte>.
It became clear that the approach should be more writer-oriented than builder-oriented. Thus, the Utf8StringWriter struct was designed to accept an IBufferWriter<byte> and write to it directly.
public ref partial struct Utf8StringWriter<TBufferWriter>
    where TBufferWriter : IBufferWriter<byte>
{
    Span<byte> destination;
    TBufferWriter bufferWriter;
    int currentWritten;
    public Utf8StringWriter(TBufferWriter bufferWriter)
    {
        this.bufferWriter = bufferWriter;
        this.destination = bufferWriter.GetSpan();
    }
    public void Flush()
    {
        if (currentWritten != 0)
        {
            // commit the bytes written so far to the underlying IBufferWriter<byte>
            bufferWriter.Advance(currentWritten);
            currentWritten = 0;
        }
    }
    // (other members omitted)
}
Instead of enlarging the buffer when it runs out, the design calls the Advance method and fetches a new buffer through the GetSpan method. This approach differs from StringBuilder in that it requires the concept of "flushing", but it offers significant performance improvements.
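To make the role of flushing concrete, here is a small sketch of the GetSpan/Advance pattern against the standard ArrayBufferWriter<byte>. SketchWriter is a hypothetical type written for this illustration only; it mirrors the idea described above (write into a span, commit with Advance on flush) but is not the library's code.

```csharp
using System;
using System.Buffers;
using System.Text;

var output = new ArrayBufferWriter<byte>();
var writer = new SketchWriter(output);
writer.Append("foo");
writer.Append("bar");
Console.WriteLine(output.WrittenCount); // 0 (nothing is committed until Flush)
writer.Flush();
Console.WriteLine(Encoding.UTF8.GetString(output.WrittenSpan)); // foobar

// Hypothetical writer used only for this illustration.
public ref struct SketchWriter
{
    private readonly IBufferWriter<byte> bufferWriter;
    private Span<byte> destination;   // remaining part of the span obtained from GetSpan
    private int currentWritten;       // bytes written but not yet committed

    public SketchWriter(IBufferWriter<byte> bufferWriter)
    {
        this.bufferWriter = bufferWriter;
        this.destination = bufferWriter.GetSpan();
        this.currentWritten = 0;
    }

    public void Append(string text)
    {
        int max = Encoding.UTF8.GetMaxByteCount(text.Length);
        if (destination.Length < max)
        {
            // Not enough room: commit what we have, then ask for a fresh span.
            Flush();
            destination = bufferWriter.GetSpan(max);
        }
        int written = Encoding.UTF8.GetBytes(text, destination);
        destination = destination.Slice(written);
        currentWritten += written;
    }

    public void Flush()
    {
        if (currentWritten != 0)
        {
            bufferWriter.Advance(currentWritten);
            currentWritten = 0;
        }
        // A span must not be reused after Advance, so drop it;
        // the next Append will request a new one.
        destination = default;
    }
}
```

The writer never copies into a larger private buffer; it hands finished bytes to the IBufferWriter<byte> and asks for more space, which is why forgetting to flush leaves the output empty.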
Apart from the need for a flush, it can be handled similarly to StringBuilder.
var writer = Utf8String.CreateWriter(bufferWriter);
// Call each append method.
writer.Append("foo");
writer.AppendFormat($"bar {Guid.NewGuid()}");
writer.AppendLine();
// Finally, call Flush(or Dispose)
writer.Flush();
Furthermore, for times when you simply want to use it like StringBuilder and preparing an IBufferWriter<byte> is cumbersome, an overload using an internally pooled buffer is available. The return value acts as a buffer controller, enabling you to convert it to an array, copy it to another IBufferWriter<byte>, or retrieve a ReadOnlySpan<byte>.
// The buffer must be disposed after use (recommend using the 'using' statement)
using var buffer = Utf8String.CreateWriter(out var writer);
// Call each append method.
writer.Append("foo");
writer.AppendFormat($"bar {Guid.NewGuid()}");
writer.AppendLine();
// Finally, call Flush(no need to call Dispose for writer)
writer.Flush();
// Copy to the written byte array
var bytes = buffer.ToArray();
// Or copy to another IBufferWriter<byte>, or get ReadOnlySpan<byte>
buffer.CopyTo(otherBufferWriter);
var writtenData = buffer.WrittenSpan;
Additionally, methods like Format, Join, and Concat have overloads that either accept an IBufferWriter<byte> or return a byte[].
.NET 8 and StandardFormat
You're likely familiar with using format strings for values, especially for DateTime. But many formats are available for numeric types and others as well. The standard and custom format specification strings that .NET provides for numerics, dates, enums, and other types are incredibly handy.
However, the traditional means for directly writing values into UTF8, namely Utf8Formatter.TryFormat, did not support these standard format specification strings! Instead, we had StandardFormat, which was incredibly restrictive (for example, you could only specify single-character formats like 'G', 'D', or 'X') and not very useful.
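For reference, this is roughly what working within those limits looks like. The snippet is a small sketch using only the existing System.Buffers.Text API; the values and buffer size are arbitrary choices for illustration.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;

Span<byte> destination = stackalloc byte[32];

// Default format: plain decimal digits written straight into the UTF8 buffer.
if (Utf8Formatter.TryFormat(12345, destination, out int bytesWritten))
{
    Console.WriteLine(Encoding.UTF8.GetString(destination.Slice(0, bytesWritten))); // 12345
}

// A StandardFormat is one symbol character plus an optional precision.
if (Utf8Formatter.TryFormat(255, destination, out bytesWritten, new StandardFormat('X', 4)))
{
    Console.WriteLine(Encoding.UTF8.GetString(destination.Slice(0, bytesWritten))); // 00FF
}
```

Anything beyond a symbol and a precision, such as a custom pattern, cannot be expressed as a StandardFormat at all.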
However, with .NET 8, the usual format specification strings have made a comeback with the introduction of IUtf8SpanFormattable.TryFormat!
// Utf8Formatter.TryFormat
static bool TryFormat (int value, Span<byte> destination, out int bytesWritten, System.Buffers.StandardFormat format = default);
// .NET 8 IUtf8SpanFormattable.TryFormat
bool TryFormat (Span<byte> utf8Destination, out int bytesWritten, ReadOnlySpan<char> format, IFormatProvider? provider);
Although the parameters look very similar, the new method accepts a string format. Here’s a comparison:
Span<byte> dest = stackalloc byte[16];
int written = 0;
// Can't represent as it throws an Exception on parsing
Utf8Formatter.TryFormat(123.456789, dest, out written, StandardFormat.Parse(".###"));
// Outputs 123.456
123.456123.TryFormat(dest, out written, ".###");
// Can't specify custom format strings, thus an exception! Only supports 'G', 'R', 'l', 'O'!
Utf8Formatter.TryFormat(DateTime.Now, dest, out written, StandardFormat.Parse("yyyy-MM-dd"));
// This, of course, works just fine
DateTime.Now.TryFormat(dest, out written, "yyyy-MM-dd");
Console.WriteLine(Encoding.UTF8.GetString(dest.Slice(0, written)));
Thank goodness, we’re finally back to a normal world! This was a point of frustration in ZString, and for those using ZLogger which internally leveraged ZString.
In .NET 8, Utf8StringInterpolation converts everything via IUtf8SpanFormattable. However, in .NET Standard 2.1, .NET 6, and .NET 7, it sadly still uses Utf8Formatter, so there are limitations on format specification. Depending on the target platform, you might experience inconsistent behavior for numerics.
// .NET 8 supports all numeric custom format strings but .NET Standard 2.1, .NET 6(.NET 7) does not.
Utf8String.Format($"Double value is {123.456789:.###}");
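One way to picture why the behavior differs by target is a helper that switches implementation with conditional compilation. The sketch below is an assumption about how such branching could look, not the library's actual source; TryFormatInt32 is a hypothetical name. The format "X4" is chosen because it is valid on both paths.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;

Span<byte> destination = stackalloc byte[16];
if (TryFormatInt32(255, destination, out int bytesWritten, "X4"))
{
    Console.WriteLine(Encoding.UTF8.GetString(destination.Slice(0, bytesWritten))); // 00FF
}

// Hypothetical helper: picks the UTF8 formatting API available on the current target.
static bool TryFormatInt32(int value, Span<byte> destination, out int bytesWritten, string? format)
{
#if NET8_0_OR_GREATER
    // .NET 8: full standard and custom numeric format strings are accepted.
    return value.TryFormat(destination, out bytesWritten, format, null);
#else
    // Older targets: only what StandardFormat can express (symbol + precision).
    // StandardFormat.Parse throws FormatException for custom patterns such as ".###".
    StandardFormat standardFormat = format is null ? default : StandardFormat.Parse(format);
    return Utf8Formatter.TryFormat(value, destination, out bytesWritten, standardFormat);
#endif
}
```

With a format that only the .NET 8 path understands, the same call site would succeed on one target and throw on another, which is the inconsistency to watch for.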
However, for DateTime, DateTimeOffset, and TimeSpan, custom processing is done that doesn't use Utf8Formatter. Thus, custom format specifications are available across all target platforms!
// DateTime, DateTimeOffset, TimeSpan support custom format string on all target platforms.
// https://learn.microsoft.com/en-us/dotnet/standard/base-types/custom-date-and-time-format-strings
Utf8String.Format($"Today is {DateTime.Now:yyyy-MM-dd}");
In any case, not being able to properly format DateTime was one of the most frustrating aspects of using ZString/ZLogger, so it's a relief that this has been addressed. However, this support comes at a cost: DateTime conversion is slower than before, and you'll get the best performance with .NET 8.
Unity
Unity is not supported! I always want my libraries to support both .NET and Unity wherever possible, and I have managed that up to now, but this time it's not possible. After all, Improved Interpolated Strings are a C# 10.0 feature, and Unity's current C# version is C# 9.0...! There's simply no way around that.
Unity has been stuck at C# 9.0 for a while now. Even if they can't upgrade the runtime, I sincerely wish they would at least upgrade the compiler. That said, if they moved to C# 10.0, they would probably run into issues such as the missing DefaultInterpolatedStringHandler, so they would eventually need to upgrade the runtime as well.
Once Unity supports C# 10.0, I do intend to adapt to it immediately! I’m waiting!
Next
Now, that being said, the cases where you have to deal directly with UTF8 strings might not be that frequent. In fact, my main focus is on using it for the major version upgrade of ZLogger. Until now, ZLogger was based on ZString, but we’re developing a new version that has been redesigned from the ground up. Within it, we use Utf8StringInterpolation for string processing. We are diligently developing with the aim of creating the ultimate logger for .NET, featuring zero-allocation for top performance, flat JSON structured logging, scope support, source generators, and more.
It is most effective in scenarios where it is built into an application's foundational layer, such as a logger. Of course, it's also fine for people to use it directly...!?