Rust Slice Types (and Strings), explained
Recently, I have started to do some Katas with Rust.
Rust is an exciting new(-ish) language that positions itself as a competitor to C/C++ as a low-level language for performance-critical use cases.
What’s unique about Rust is its Ownership system, which enables memory management without the hassle of `malloc`/`free`/`delete` calls or a Garbage Collector.
We’re going to see a brief explanation of the Ownership system soon.
This post, however, is not intended as a complete introduction to the Ownership system.
Rather, I will try to explain the slice types (such as `&[T]` and `Vec<T>`) in easier terms, because many newcomers (including myself) have problems with them.
At the end, we’ll have a look at how this relates to the String types `&str` and `String`.
A brief explanation of the purpose of the Ownership System
Many bugs in low-level code arise from memory management… or rather, the lack thereof. If you as a programmer do not handle memory properly, the program can crash (at best), leak memory or behave unexpectedly. Even worse, it might have a security vulnerability.
Rust aims to make memory management bulletproof with the Ownership system. However, in doing so, it complicates things a lot, thus raising the barrier to entry for newcomers.
Here’s a small example:
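A minimal sketch of such an example, assuming a hypothetical `do_something` that takes a `String` by value. The line that triggers the compile error is commented out so that the sketch itself builds:

```rust
// Hypothetical function: taking `String` by value means it takes ownership.
fn do_something(s: String) -> usize {
    s.len()
    // `s` goes out of scope here, so its heap buffer is freed.
}

fn main() {
    let s = String::from("hello");
    let len = do_something(s); // ownership of `s` moves into `do_something`
    println!("length: {}", len);
    // Uncommenting the next line fails with error[E0382]: borrow of moved value: `s`
    // println!("{}", s);
}
```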
This code does not compile as soon as `main` uses `s` again after the call, because the ownership of the variable `s` is moved from the function `main` to `do_something`.
Only one scope can have ownership of a variable.
After the scope of the variable ends, the data behind the variable can be deallocated.
This makes sure that `do_something` does nothing to the variable that the caller does not expect, like changing it.
Imagine you’re keeping an index to some part of the string, like a delimiter, and someone appends to the front of the string.
The index is not updated and bugs occur.
While functional languages try to outlaw this by making everything immutable (and immutability is awesome!), this can sometimes be inefficient in terms of memory consumption and performance.
Rust tries to squeeze the last gram of performance out of the program, so the immutability approach alone does not cut it.
Before we look any further at the string types (which are basically just special kinds of slices), we first look at slices in general.
`[T]`: The Basic Slice Type
The “Basic Slice Type”, as I like to call it, is just denoted as `[T]`.
This means that a variable with that type holds some amount of T’s.
A variable with this type, however, cannot be instantiated directly.
Rust needs to know at compile time how much space it must allocate for a variable in order to handle memory for us, and the size of a `[T]` is only known at runtime.
If you’re familiar with C or another language that uses pointer arithmetic, you can think of a reference to a slice as a tuple of a pointer to the first element and the number of elements in the slice.
Also, the type does not tell you where that data is stored in memory, on the heap or on the stack.
Finally, from the ownership perspective, the current scope would own this data (but as this type cannot be instantiated, it does not really matter).
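To make the "pointer plus length" picture concrete, here is a small sketch. The helper `count` is a name I made up for illustration, and the two-machine-word size is what current compilers produce rather than something you should rely on in your own code:

```rust
use std::mem::size_of;

// Hypothetical helper: the number of elements travels with the reference.
fn count(values: &[i32]) -> usize {
    values.len()
}

fn main() {
    // A bare `[i32]` variable is rejected, because its size is not known
    // at compile time. Behind a reference it works fine:
    let data = [10, 20, 30, 40];
    let part: &[i32] = &data[1..3]; // borrows the elements 20 and 30
    println!("{} elements", count(part));

    // A slice reference holds a pointer and a length,
    // a plain reference only holds a pointer.
    println!("&[i32]: {} bytes", size_of::<&[i32]>());
    println!("&i32:   {} bytes", size_of::<&i32>());
}
```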
`[T; 5]`: The Fixed-sized Array
The type `[T; 5]` is the array type.
When you define a variable like this:
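A sketch of such a definition (the concrete values are my own choice for illustration):

```rust
use std::mem::size_of_val;

fn main() {
    let arr = [1, 2, 3]; // inferred as [i32; 3]

    // The length is part of the type, so this annotation must say 3.
    // Arrays of `Copy` elements are copied, so `arr` stays usable.
    let same: [i32; 3] = arr;
    println!("{:?} {:?}", arr, same);

    // The compiler knows the exact size: three 4-byte integers.
    println!("{} bytes", size_of_val(&arr));
}
```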
then the type of the variable `arr` is `[i32; 3]`.
Note there is no `&` or anything like that.
This is because the Rust compiler knows exactly how large the memory for this variable must be.
Note, however, that it cannot be passed to functions that accept an array with a different amount of elements, such as this function:
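A minimal sketch of such a function, assuming it takes an `[i32; 2]` (the body is made up for illustration). The rejected call is commented out so the sketch builds:

```rust
// Only arrays of exactly two `i32` values are accepted.
fn do_something(arr: [i32; 2]) -> i32 {
    arr[0] + arr[1]
}

fn main() {
    let pair = [1, 2];
    println!("{}", do_something(pair));

    let triple = [1, 2, 3];
    // The next line is rejected with error[E0308]: mismatched types,
    // because `[i32; 3]` and `[i32; 2]` are different types.
    // do_something(triple);
    println!("{:?}", triple);
}
```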
Only arrays with a size of 2 can be passed to the `do_something` function.
This makes sense, as all function arguments need to be copied to the call stack and the size of the function’s call frame must be known at compile time.
But it would not be helpful to write a function that can only operate on a type of array with a fixed length. If we want to operate on arrays of arbitrary length, we have two alternatives that we can pass to a function:
- A reference
- An owning type
These alternatives differ in how they handle ownership.
`&[T]`: The Slice Reference
This type describes some amount of T’s that this scope does not own.
As the scope does not own it, the data behind the reference is not copied to the call stack.
This type does not give you any information where the data is being kept, and it does not have to.
The reference refers to a `[T]` and carries the pointer to the first element and the number of elements.
Here’s some sample code:
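A sketch of what such sample code might look like. The function body is my own assumption; the point is that one `&[i32]` parameter accepts arrays of any length, vectors, and parts of either:

```rust
// Hypothetical function: borrows any number of `i32` values
// without taking ownership of them.
fn do_something(values: &[i32]) -> i32 {
    values.iter().sum()
}

fn main() {
    let arr = [1, 2, 3];
    let vec = vec![4, 5, 6, 7];

    println!("{}", do_something(&arr)); // a whole array
    println!("{}", do_something(&vec)); // a Vec, via deref coercion
    println!("{}", do_something(&arr[1..])); // only a part of the array

    // Both are still usable here because they were only borrowed.
    println!("{:?} {:?}", arr, vec);
}
```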
`Box<[T]>`: The Owning Type
The type `Box<[T]>` describes some amount of T’s that this scope owns.
Ownership, in this case, is indicated by the `Box` struct, which keeps the data on the heap.
We can use this type to move the ownership of the variable’s data to the function, like so:
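A minimal sketch, assuming a hypothetical `do_something` that takes a `Box<[i32]>` by value:

```rust
// Hypothetical function: takes ownership of a heap-allocated slice.
fn do_something(values: Box<[i32]>) -> usize {
    values.len()
    // `values` is dropped here and its heap allocation is freed.
}

fn main() {
    // A `Box<[i32; 3]>` coerces to `Box<[i32]>` thanks to the annotation.
    let arr: Box<[i32]> = Box::new([1, 2, 3]);
    let n = do_something(arr); // ownership moves into the function
    println!("{} elements", n);
    // Using `arr` after this point would not compile: it was moved.
}
```

Another way to get a `Box<[T]>` is `Vec::into_boxed_slice`, which is handy when the length is only known at runtime.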
`Vec<T>`: The Vector type
We’re getting closer to what the `String` and `str` types actually are.
Just as `Box<[T]>` denotes some amount of T’s that this scope owns, so does `Vec<T>`.
The difference is that the Vector type has some additional functionality that allows it to shrink and grow dynamically.
Because of this, the data of a Vector variable is stored on the heap rather than on the stack.
Here’s an example, using the `vec!` macro for some easy-to-read code:
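A sketch of such an example. Both functions are hypothetical; `do_something_borrowed` is a name I introduced to show the borrowing variant next to the owning one:

```rust
// Hypothetical function that takes ownership of the vector.
fn do_something(a: Vec<i32>) -> usize {
    a.len()
}

// Hypothetical variant that only borrows the vector.
fn do_something_borrowed(a: &Vec<i32>) -> usize {
    a.len()
}

fn main() {
    let mut arr = vec![1, 2, 3];
    arr.push(4); // grow
    arr.pop(); // shrink

    println!("{}", do_something_borrowed(&arr)); // `arr` is still ours
    println!("{}", do_something(arr)); // `arr` is moved away here
}
```

In everyday code a `&[i32]` parameter is usually preferred over `&Vec<i32>`, since it accepts arrays and vectors alike.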
If you want the benefits of the Vector type, but still keep ownership, you can, of course, pass `&arr` to `do_something`, which then accepts the argument as `a: &Vec<i32>`.
Intermezzo: UTF-8 in Rust
Before we have a look at the string types, we need to understand why Rust behaves the way it does when it comes to strings. In languages like C, a single character is equivalent to a single byte, so a value from 0 to 255. Naturally, you cannot represent all letters of all alphabets of all languages around the world with just 256 symbols. Therefore, Rust uses UTF-8 to encode all strings. The problem here is that in UTF-8, not all characters have the same length in bytes. An example is my last name, `Buß`: it has three characters but takes up four bytes, because `ß` is encoded as two bytes.
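This can be checked with a few lines of code. The helper `byte_and_char_len` is a made-up name for this sketch:

```rust
// Hypothetical helper: returns (length in bytes, number of characters).
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let name = "Buß";
    let (bytes, chars) = byte_and_char_len(name);
    println!("{} bytes, {} chars", bytes, chars);

    // The raw UTF-8 bytes: one each for `B` and `u`, two for `ß`.
    println!("{:?}", name.as_bytes());
}
```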
This is the reason why we need to pay special attention when it comes to strings.
We cannot split a string between bytes that belong to the same logical character in UTF-8.
Here’s an adapted example from the documentation of the method `char_indices`:
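A sketch of what such an example might look like. The helper `indexed_chars` and the output format are my own assumptions:

```rust
// Hypothetical helper: collects every (byte index, char) pair of a string.
fn indexed_chars(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}

fn main() {
    let a = "Buß";
    for (index, c) in indexed_chars(a) {
        println!("{}: {}", index, c);
    }
}
```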
This code will print the byte indices 0, 1 and 2 for the characters `B`, `u` and `ß`.
This is what you would normally expect.
You can easily fall into traps, however, when we change the order of the letters from `Buß` to `Bßu`. Now the iterator reports index 0 for `B`, index 1 for `ß` and index 3 for `u`.
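The same check for the reordered string, as a small sketch (`char_starts` is a hypothetical helper that keeps only the byte indices):

```rust
// Hypothetical helper: the byte index at which each character starts.
fn char_starts(s: &str) -> Vec<usize> {
    s.char_indices().map(|(index, _)| index).collect()
}

fn main() {
    println!("{:?}", char_starts("Buß")); // every index is one byte apart
    println!("{:?}", char_starts("Bßu")); // `ß` takes up two bytes in the middle
}
```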
See how the index jumped from 1 to 3?
The iterator knows about UTF-8 encoding and will skip the index 2, as this byte belongs to the `char` with index 1.
Thus, we need to keep in mind that not every byte index marks the start of a valid character.
Here’s a little bonus: Since we’re already talking about it, there is something else we need to keep in mind:
The German letter `ß` traditionally does not have an uppercase form.
Instead, the letters `SS` are used, so `Buß` becomes `BUSS`.
When we call `to_uppercase` on `Buß` before we create the iterator, it yields four characters (`B`, `U`, `S`, `S`) instead of three.
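A short sketch of this effect (the helper `upper_indexed` is a made-up name):

```rust
// Hypothetical helper: uppercases first, then collects (byte index, char) pairs.
fn upper_indexed(s: &str) -> Vec<(usize, char)> {
    s.to_uppercase().char_indices().collect()
}

fn main() {
    let a = "Buß";
    println!("{}", a.to_uppercase());
    for (index, c) in upper_indexed(a) {
        println!("{}: {}", index, c);
    }
    println!(
        "{} chars before, {} chars after",
        a.chars().count(),
        a.to_uppercase().chars().count()
    );
}
```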
So, be careful when uppercasing strings, as it might change the number of characters in languages other than English.
`&str`: The Primitive String Type
Now, let’s continue to talk about types:
The type of the variable `a` from the example above is `&str`, and you can think of it as a `&[char]`, so some amount of characters that this scope does not own. Strictly speaking, the characters are stored as UTF-8 bytes, so the memory layout is that of a `&[u8]` rather than a sequence of 4-byte `char` values.
But for the reasons discussed earlier, Rust will enforce that the `&str` is always a valid UTF-8 string.
Therefore, if we try to split the variable at an invalid index, we will get a panic:
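A sketch of the situation. The panicking call is commented out so the program runs to the end, and `try_split` is a hypothetical helper showing one way to check the index first:

```rust
// Hypothetical helper: splits only when `index` falls on a character boundary.
fn try_split(s: &str, index: usize) -> Option<(&str, &str)> {
    if s.is_char_boundary(index) {
        Some(s.split_at(index))
    } else {
        None
    }
}

fn main() {
    let a = "Bßu";

    // This call panics at runtime, because byte 2 is in the middle of `ß`:
    // let (left, right) = a.split_at(2);

    println!("{:?}", try_split(a, 2)); // no valid split at this index
    println!("{:?}", try_split(a, 3)); // splits between `ß` and `u`
}
```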
This results in a runtime error telling us that the byte index is not a char boundary.
`String`: The Owned String type
The final type we’ll be looking at in this post is `String`.
You can think of this type as a `Vec<char>`, so some amount of characters that this scope owns (again, stored as UTF-8 bytes under the hood).
Just like a `Vec`, it allows shrinking and growing.
Just like a `&str`, it will also enforce UTF-8 encoding.
What’s great about this type is that we can get the underlying `&str` pretty easily by borrowing it:
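A minimal sketch, assuming a hypothetical function `shout` that only needs to read the text:

```rust
// Hypothetical function: accepts any borrowed string data.
fn shout(text: &str) -> String {
    format!("{}!", text)
}

fn main() {
    let literal: &'static str = "hello";
    let mut owned = String::from("hello");
    owned.push_str(", world"); // a String can grow

    println!("{}", shout(literal)); // a static &str
    println!("{}", shout(&owned)); // &String coerces to &str
    println!("{}", shout(owned.as_str())); // the explicit way
}
```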
This way, we can just use `&str` as a parameter type and pass both static `&str` and `String` variables into it.
Now, just for fun, let’s look at the signature of the method `read_line` from the `Stdin` struct: `fn read_line(&self, buf: &mut String) -> io::Result<usize>`.
We can now clearly see that this definition makes perfect sense. The second parameter acts as our buffer.
- It needs to be owned by the caller of the method to retrieve the data
- It needs to be mutable so the method can alter the buffer’s content
- It needs to be able to grow as the length of the line is unknown
The type `&mut String` is the only one that matches these criteria.
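The three criteria above can be seen in a short sketch. To keep it runnable without keyboard input, it reads from an in-memory byte slice through `BufRead::read_line`, which takes the same `&mut String` buffer; the helper `read_into` is a name I made up:

```rust
use std::io::{self, BufRead};

// Hypothetical helper: appends one line from `reader` to the caller's buffer
// and returns the number of bytes read.
fn read_into<R: BufRead>(reader: &mut R, buf: &mut String) -> io::Result<usize> {
    reader.read_line(buf)
}

fn main() -> io::Result<()> {
    // In-memory input stands in for standard input in this sketch.
    let mut input: &[u8] = b"first line\nsecond line\n";
    let mut buf = String::new(); // owned by the caller

    let n = read_into(&mut input, &mut buf)?;
    println!("read {} bytes: {:?}", n, buf);

    // `read_line` appends, so the same buffer grows with the next line.
    read_into(&mut input, &mut buf)?;
    println!("buffer now holds {} bytes", buf.len());
    Ok(())
}
```

Note that the line ending is kept in the buffer, so you will often want to call `trim_end` on the result.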
Conclusion
Let’s reiterate the types we have talked about in this blog post:
Type | Description |
---|---|
`[T]` | Some amount of T’s (cannot be instantiated) |
`[T; 5]` | Exactly 5 T’s that this scope owns |
`&[T]` | Some amount of T’s that this scope does not own |
`Box<[T]>` | Some amount of T’s that this scope owns |
`Vec<T>` | Some amount of T’s that this scope owns, which can shrink and grow dynamically (therefore, stored on the heap) |
`&str` | Some amount of characters that this scope does not own and is guaranteed to be UTF-8 encoded |
`String` | Some amount of characters that this scope owns, can shrink and grow dynamically, and is guaranteed to be UTF-8 encoded |
I hope this blog post was helpful to those who are learning Rust. If you want to know more, you can have a look at the resources below or leave a comment with a question.
Resources
- The Rust Book, chapter 4.3: The Slice Type
- The Rust Book, chapter 8.2: Storing UTF-8 Encoded Text with Strings
- A very helpful forum thread which I found while I was looking for an explanation on the difference between the types `String` and `str`. Especially useful is this comment.
- The documentation of `char_indices`
- Not so helpful, but worth mentioning: The Language Reference on Slices