Thomas Buss blog

Rust Slice Types (and Strings), explained

Recently, I have started to do some katas in Rust. Rust is an exciting new(-ish) language that positions itself as a competitor to C/C++ as a low-level language for performance-critical use cases. What’s unique about Rust is its ownership system, which enables memory management without the hassle of malloc/free/delete calls or a garbage collector. We’re going to see a brief explanation of the ownership system soon. This post, however, is not intended as a complete introduction to it. Rather, I will try to explain the slice types (such as &[T] and Vec<T>) in simpler terms, because many newcomers (including myself) have problems with them. At the end, we’ll have a look at how this relates to the string types &str and String.

A brief explanation of the purpose of the Ownership System

Many bugs in low-level code arise from memory management… or rather, the lack thereof. If you as a programmer do not handle memory properly, the program can crash (at best), leak memory or behave unexpectedly. Even worse, it might have a security vulnerability.

Rust aims to make memory management bulletproof with the ownership system. However, in doing so, it complicates things a lot, which raises the barrier to entry for newcomers.

Here’s a small example:

fn main() {
    let s = String::from("Hello World");
    do_something(s);
    println!("{}", s); // Won't compile; ownership has been moved
}

fn do_something(s: String) { /* ... */ }

This code does not compile because ownership of the string behind s is moved from the function main to do_something. A value can only have one owner at a time, and once the owner goes out of scope, the data behind the variable can be deallocated. The rule also protects the caller: nobody else can do anything to the value that the caller does not expect, like changing it. Imagine you’re keeping an index to some part of the string, like a delimiter, and someone inserts text at the front of the string. The index is not updated and bugs occur. While functional languages try to outlaw this by making everything immutable (and immutability is awesome!), this can sometimes be inefficient in terms of memory consumption and performance. Rust tries to squeeze the last bit of performance out of the program, so the immutability approach alone does not cut it.
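
One way to make the example above compile is to lend the string to the function instead of giving it away. The following is only a minimal sketch; the helper names shout and consume are made up for illustration:

```rust
// Hypothetical helper: it borrows the string, so the caller keeps ownership.
fn shout(s: &str) -> String {
    format!("{}!", s)
}

// Hypothetical helper: it takes ownership, so the caller's variable is moved.
fn consume(s: String) -> usize {
    s.len()
}

fn main() {
    let s = String::from("Hello World");

    // Borrowing: `s` is still usable afterwards.
    let loud = shout(&s);
    println!("{} / {}", s, loud);

    // Moving a clone: the original `s` stays with `main`.
    let n = consume(s.clone());
    println!("{} has {} bytes", s, n);

    // Moving the original: after this line, `s` can no longer be used.
    consume(s);
}
```

Borrowing is usually the cheaper choice; cloning copies the whole string and is only worth it when the function really needs its own copy.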

Before we look any further at the string types (which are basically just special kinds of slices), we first look at slices in general.

[T]: The Basic Slice Type

The “basic slice type”, as I like to call it, is simply written as [T]. A value of this type is some number of T’s laid out next to each other in memory. A variable of this type, however, cannot be created directly: Rust needs to know at compile time how much space it must allocate for a variable in order to handle memory for us, and [T] does not say how many elements there are. That is why you only ever meet [T] behind some kind of pointer, such as &[T] or Box<[T]>. If you’re familiar with C or another language that uses pointer arithmetic, you can think of such a pointer as a tuple of a pointer to the first element and the number of elements in the slice. Also, the type does not tell you where the data is stored in memory, on the heap or on the stack. Finally, from the ownership perspective, the current scope would own this data (but as a variable of this type cannot be created, it does not really matter).

[T; 5]: The Fixed-sized Array

The type [T; 5] is the array type. When you define a variable like this:

let arr = [1, 2, 3];

then the type of the variable arr is [i32; 3]. Note there is no & or anything like that. This is because the Rust compiler knows exactly how large the memory for this variable must be. Note, however, that it cannot be passed to functions that accept an array with a different number of elements, such as this function:

fn do_something(a: [i32; 2]) { /* ... */ }

Only arrays with a size of 2 can be passed to the do_something function. This makes sense, as all function arguments need to be copied to the call stack and the size of the function’s call frame must be known at compile time.
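
To make the point about the length being part of the type concrete, here is a small sketch. The function name sum_pair is made up for this example:

```rust
use std::mem::size_of;

// Hypothetical function: accepts arrays of exactly two i32 values.
fn sum_pair(a: [i32; 2]) -> i32 {
    a[0] + a[1]
}

fn main() {
    let pair = [1, 2];
    let triple = [1, 2, 3];

    println!("{}", sum_pair(pair));
    // sum_pair(triple); // Won't compile: expected [i32; 2], found [i32; 3]

    // The size is known at compile time: N elements of 4 bytes each.
    println!("{}", size_of::<[i32; 2]>()); // 8
    println!("{}", size_of::<[i32; 3]>()); // 12
    println!("{}", triple.len());
}
```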

But it would not be helpful to write a function that can only operate on a type of array with a fixed length. If we want to operate on arrays of arbitrary length, we have two alternatives that we can pass to a function:

  1. A reference
  2. An owning type

These alternatives differ in how they handle ownership.

&[T]: The Slice Reference

This type describes some number of T’s that this scope does not own. As the scope does not own the data, the data behind the reference is not copied to the call stack. This type does not give you any information about where the data is kept, and it does not have to. The reference itself consists of the pointer to the first element of a [T] and the number of elements. Here’s some sample code:

fn main() {
    let arr = [1, 2, 3];
    do_something(&arr);
    do_something(&arr); // This works; the ownership is not moved
}

fn do_something(a: &[i32]) { /* ... */ }
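
As a rough sketch of what this buys us: the same function can take a whole array, a part of an array, or the contents of a Vec. The function name total is an assumption for this example. The last two lines check that a slice reference is twice as large as a plain reference, because it carries a length next to the pointer.

```rust
use std::mem::size_of;

// Hypothetical function: works on any number of i32 values it does not own.
fn total(a: &[i32]) -> i32 {
    a.iter().sum()
}

fn main() {
    let arr = [1, 2, 3, 4];
    let v = vec![10, 20];

    println!("{}", total(&arr));       // the whole array
    println!("{}", total(&arr[1..3])); // only the elements 2 and 3
    println!("{}", total(&v));         // the contents of a Vec work as well

    // A slice reference carries a pointer and a length,
    // so it is twice as large as a plain reference.
    println!("{}", size_of::<&[i32]>() == 2 * size_of::<usize>());
    println!("{}", size_of::<&i32>() == size_of::<usize>());
}
```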

Box<[T]>: The Owning Type

The type Box<[T]> describes some number of T’s that this scope owns. Ownership, in this case, is indicated by the Box struct, which keeps the data on the heap.

We can use this type to move the ownership of the variable’s data to the function, like so:

fn main() {
    let arr: Box<[i32]> = Box::from([1, 2, 3]);
    do_something(arr);
    do_something(arr); // Won't compile; ownership has been moved
}

fn do_something(a: Box<[i32]>) { /* ... */ }
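
For comparison, here is a sketch of how an owned boxed slice can be borrowed without giving it away, and how it relates to a Vec. The function names take and peek are invented for this example:

```rust
// Hypothetical function: takes ownership of the boxed slice and returns its length.
fn take(a: Box<[i32]>) -> usize {
    a.len()
}

// Hypothetical function: only borrows the elements.
fn peek(a: &[i32]) -> i32 {
    a[0]
}

fn main() {
    let boxed: Box<[i32]> = Box::from([1, 2, 3]);

    // Borrowing the contents does not move the box.
    println!("{}", peek(&boxed));
    println!("{}", peek(&boxed));

    // A Vec can be turned into a boxed slice and back.
    let from_vec: Box<[i32]> = vec![4, 5].into_boxed_slice();
    let back: Vec<i32> = from_vec.into_vec();
    println!("{:?}", back);

    // Moving the box ends its life in `main`.
    println!("{}", take(boxed));
}
```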

Vec<T>: The Vector type

We’re getting closer to what the String and str types actually are. Just as Box<[T]> denotes some number of T’s that this scope owns, so does Vec<T>. The difference is that the vector type has additional functionality that allows it to shrink and grow dynamically. Because of this, the data of a vector is always stored on the heap rather than on the stack. Here’s an example, using the vec! macro for some easy-to-read code:

fn main() {
    let arr = vec![1, 2, 3];
    do_something(arr);
}

fn do_something(a: Vec<i32>) { /* ... */ }

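If you want the benefits of the vector type but still keep ownership, you can, of course, pass &arr to do_something, which then accepts the argument as a: &Vec<i32>. If the function only reads the elements, a: &[i32] works just as well and is the more common choice, as it also accepts arrays.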
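
A short sketch of both aspects, growing and shrinking on the one hand and lending the elements on the other. The function name largest is an assumption for this example:

```rust
// Hypothetical function: borrows the elements, works for Vec and arrays alike.
fn largest(a: &[i32]) -> Option<i32> {
    a.iter().copied().max()
}

fn main() {
    let mut v = vec![1, 2, 3];

    v.push(4); // grow
    println!("{:?}", v); // [1, 2, 3, 4]

    let last = v.pop(); // shrink
    println!("{:?} {:?}", last, v);

    // `&v` is a `&Vec<i32>`, which Rust converts to `&[i32]` automatically.
    println!("{:?}", largest(&v));

    // `v` is still owned by `main` and can be used again.
    println!("{}", v.len());
}
```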

Intermezzo: UTF-8 in Rust

Before we have a look at the string types, we need to understand why Rust behaves the way it does when it comes to strings. In languages like C, a single character is equivalent to a single byte, so a value from 0 to 255. Naturally, you cannot represent all letters of all alphabets of all languages around the world with just 256 values. Therefore, Rust uses UTF-8 to encode all strings. The problem here is that in UTF-8, not all characters have the same length in bytes. An example is my last name, Buß:

fn main() {
    let a = "Buß";
    println!("{}", a.len()); // prints 4
}

This is the reason why we need to pay special attention when it comes to strings. We cannot split a string between bytes that belong to the same logical character in UTF-8. Here’s an adapted example from the documentation of the method char_indices:

fn main() {
    let a = "Buß";
    println!("{}", a.len()); // prints 4

    let char_indices = a.char_indices();
    for c in char_indices {
        // type of c is (usize, char)
        println!("At position {} is char {}", c.0, c.1);
    }
}

This code will print the following:

At position 0 is char B
At position 1 is char u
At position 2 is char ß

This is what you would normally expect. It is easy to fall into a trap, however, when we change the order of the letters from Buß to Bßu:

At position 0 is char B
At position 1 is char ß
At position 3 is char u

See how the index jumped from 1 to 3? The iterator knows about the UTF-8 encoding and skips index 2, as this byte belongs to the char that starts at index 1. Thus, we need to keep in mind that not every byte index is the start of a character.
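
The jump in the index can be checked directly. This is a small sketch; the helper name char_starts is made up, while char_indices and is_char_boundary are methods of the standard str type:

```rust
// Hypothetical helper: collects the byte offset at which every character starts.
fn char_starts(s: &str) -> Vec<usize> {
    s.char_indices().map(|(i, _)| i).collect()
}

fn main() {
    let a = "Bßu";

    println!("{:?}", char_starts(a)); // [0, 1, 3]

    // Bytes and characters are counted differently.
    println!("{}", a.len());           // 4 bytes
    println!("{}", a.chars().count()); // 3 characters

    // Index 2 lies in the middle of the two-byte 'ß'.
    println!("{}", a.is_char_boundary(1)); // true
    println!("{}", a.is_char_boundary(2)); // false
    println!("{}", a.is_char_boundary(3)); // true
}
```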

Here’s a little bonus, since we’re already on the subject: Unicode’s standard uppercase mapping for the German letter ß is not a single letter. Instead, the letters SS are used, so Buß becomes BUSS. When we call to_uppercase before we create the iterator, we get this output:

At position 0 is char B
At position 1 is char U
At position 2 is char S
At position 3 is char S

So, be careful when uppercasing strings, as it might change the number of characters for languages other than English.
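
Here is a quick way to see the effect, as a sketch with an invented helper called sizes that reports the number of bytes and the number of characters:

```rust
// Hypothetical helper: returns (number of bytes, number of characters).
fn sizes(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let name = "Buß";
    let upper = name.to_uppercase(); // a new String, "BUSS"

    println!("{} -> {}", name, upper);
    println!("{:?}", sizes(name));   // (4, 3)
    println!("{:?}", sizes(&upper)); // (4, 4)

    // The characters no longer line up one to one, so a character
    // position stored for `name` may point at something else in `upper`.
}
```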

&str: The Primitive String Type

Now, let’s continue to talk about types. The type of the variable a from the example above is &str. You can think of it as some amount of characters that this scope does not own. Under the hood it is not a &[char], though (a char always takes four bytes), but rather a &[u8] that is guaranteed to contain valid UTF-8. For the reasons discussed earlier, Rust enforces that a &str is always a valid UTF-8 string. Therefore, if we try to split the variable at an invalid index, we will get a panic:

fn main() {
    let a: &str = "Hallöle";
    let a1 = &a[..4];
    let a2 = &a[5..];

    println!("{}", a1);
    println!("{}", a2);
}

This compiles, but at runtime it results in the following error:

thread 'main' panicked at 'byte index 5 is not a char boundary; it is inside 'ö' (bytes 4..6) of `Hallöle`', src\main.rs:4:15
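
If a panic is not what you want, the get method of str returns an Option instead. The following sketch wraps it in a made-up helper called tail:

```rust
// Hypothetical helper: returns the part of `s` starting at byte `from`,
// or None if `from` is out of range or not on a character boundary.
fn tail(s: &str, from: usize) -> Option<&str> {
    s.get(from..)
}

fn main() {
    let a = "Hallöle";

    println!("{:?}", a.get(..4)); // Some("Hall")
    println!("{:?}", tail(a, 5)); // None: byte 5 is inside 'ö'
    println!("{:?}", tail(a, 6)); // Some("le")

    // Stepping by characters instead of bytes avoids the problem altogether.
    let rest: String = a.chars().skip(5).collect();
    println!("{}", rest); // le
}
```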

String: The Owned String type

The final type we’ll be looking at in this post is String. You can think of this type as some amount of characters that this scope owns. Internally it is a Vec<u8> (not a Vec<char>) whose content is guaranteed to be valid UTF-8. Just like a Vec, it allows shrinking and growing. Just like a &str, it enforces UTF-8 encoding. What’s great about this type is that we can get the underlying &str pretty easily:

fn main() {
    let a: &str = "Hallöle";
    do_something(a);
    let b: String = String::from(a);
    do_something(&b[..]); // &str created from String
}

fn do_something(s: &str) { /* ... */ }

This way, we can just use &str as a function parameter and pass both static &str and String variables into it.
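
There is more than one way to get from a String to a &str. The sketch below lists the ones I am aware of; the function name count_chars is an assumption for this example:

```rust
// Hypothetical function: only reads the text, so `&str` is enough.
fn count_chars(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let literal: &str = "Hallöle";
    let owned: String = String::from(literal);

    println!("{}", count_chars(literal));        // string literal
    println!("{}", count_chars(&owned[..]));     // explicit full-range slice
    println!("{}", count_chars(owned.as_str())); // the same thing, spelled out
    println!("{}", count_chars(&owned));         // `&String` converts to `&str`

    // `owned` was only borrowed, so it can still grow afterwards.
    let mut owned = owned;
    owned.push('!');
    println!("{}", owned);
}
```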

Now, just for fun, let’s look at the signature of the read_line method of the Stdin struct:

pub fn read_line(&self, buf: &mut String) -> Result<usize>

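We can now clearly see that this definition makes perfect sense. The second parameter acts as our buffer. The function does not know in advance how long the line will be, so the buffer must be able to grow. The caller wants to read the buffer afterwards, so the caller must keep ownership of it while the function is still allowed to modify it. And as we are dealing with text, the content must be valid UTF-8.

The type &mut String is the only one that matches these criteria.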
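
To try this out without typing into a terminal, the sketch below uses the read_line method of the BufRead trait, which takes the same kind of buffer, and feeds it from an in-memory byte slice instead of stdin. The helper name read_one_line and the sample input are assumptions for this example:

```rust
use std::io::{self, BufRead};

// Hypothetical helper: reads one line from any buffered reader into `buf`
// and returns the number of bytes that were read.
fn read_one_line<R: BufRead>(reader: &mut R, buf: &mut String) -> io::Result<usize> {
    reader.read_line(buf)
}

fn main() -> io::Result<()> {
    // An in-memory byte slice stands in for stdin here (assumption for the demo).
    let mut input: &[u8] = "Buß\nnext line\n".as_bytes();
    let mut buf = String::new();

    let n = read_one_line(&mut input, &mut buf)?;
    println!("{} bytes: {:?}", n, buf); // 5 bytes: "Buß\n"

    // read_line appends to the buffer; clear it to reuse it for the next line.
    buf.clear();
    read_one_line(&mut input, &mut buf)?;
    println!("{:?}", buf); // "next line\n"

    Ok(())
}
```

Note that the buffer grows as needed and still belongs to the caller after each call, which is what the &mut String parameter expresses.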

Conclusion

Let’s reiterate the types we have talked about in this blog post:

| Type | Description |
|---|---|
| [T] | Some amount of T’s (cannot be instantiated) |
| [T; 5] | Exactly 5 T’s that this scope owns |
| &[T] | Some amount of T’s that this scope does not own |
| Box<[T]> | Some amount of T’s that this scope owns |
| Vec<T> | Some amount of T’s that this scope owns, which can shrink and grow dynamically (therefore, stored on the heap) |
| &str | Some amount of characters that this scope does not own and is guaranteed to be UTF-8 encoded |
| String | Some amount of characters that this scope owns, can shrink and grow dynamically, and is guaranteed to be UTF-8 encoded |

I hope this blog post was helpful to those who are learning Rust. If you want to know more, you can have a look at the resources below or leave a comment with a question.

Resources
