RustBrock/String.md
2025-01-20 17:14:33 -07:00


String

Strings are implemented as a collection of bytes, plus some methods that provide useful functionality when those bytes are interpreted as text.

Strings can hold any text that can be encoded as UTF-8.

These attached methods include creating, updating and reading.

Strings differ from other collections in one important way: indexing into a String is complicated by the differences between how people and computers interpret String data.

What is a String?

The core language has only one string type: the string slice str, which is usually seen in its borrowed form, &str.

String literals are a special case: their data is stored in the program's binary, so a literal is a string slice that refers to that data.

When Rust programmers refer to "strings", they mean either the String type from the std library or the string slice &str.

Both String and string slices are UTF-8 encoded.
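As a rough illustration of the two types, the sketch below passes both a string literal and a borrowed String to the same function. The helper name describe is made up for this example.

```rust
// Hypothetical helper: accepts any borrowed string data.
fn describe(text: &str) -> String {
    format!("{text} ({} bytes)", text.len())
}

fn main() {
    // A string literal lives in the binary, so its type is &'static str.
    let literal: &'static str = "hello";
    // String is the owned, growable type from the standard library.
    let owned: String = String::from("hello");

    println!("{}", describe(literal));
    // A &String is accepted where a &str is expected.
    println!("{}", describe(&owned));
}
```

Taking &str as the parameter type is usually the more flexible choice, since it works for both kinds of string.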

Creating a New String

Many of the same operations available with Vec<T> are available with String as well because it is implemented as a wrapper around a vector of bytes with some guarantees, restrictions and capabilities.

Creating one works the same way as creating a Vec<T>:

let mut s = String::new();

This creates a new, empty string into which we can then load data.

Often we already have some initial data. For that we can use the to_string method, which is available on any type that implements the Display trait, as string literals do.

let data = "initial contents";

let s = data.to_string();

// the same as the two lines above
let s = "initial contents".to_string();

These both create a String containing initial contents.

You can also use String::from to create a String from a string literal:

let s = String::from("initial contents");

Because strings are used for so many things we can use many different generic APIs for strings, providing us with a lot of options.

Some of them can seem redundant, but they all have their place.

String::from and to_string do the same thing, so which one you choose is a matter of style and readability.
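To make the equivalence concrete, here is a small sketch that builds the same String in three ways and checks that the results are equal. The function name build_all is only illustrative.

```rust
// Builds the same owned String three different ways.
fn build_all() -> (String, String, String) {
    // Start empty, then load data into it.
    let mut pushed = String::new();
    pushed.push_str("initial contents");

    // to_string is available because string literals implement Display.
    let displayed = "initial contents".to_string();

    // String::from converts the literal directly.
    let converted = String::from("initial contents");

    (pushed, displayed, converted)
}

fn main() {
    let (a, b, c) = build_all();
    assert_eq!(a, b);
    assert_eq!(b, c);
    println!("{a}");
}
```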

Because strings are UTF-8 encoded, they can hold any properly encoded text in any language. All of the following are valid:

    let hello = String::from("السلام عليكم");
    let hello = String::from("Dobrý den");
    let hello = String::from("Hello");
    let hello = String::from("שלום");
    let hello = String::from("नमस्ते");
    let hello = String::from("こんにちは");
    let hello = String::from("안녕하세요");
    let hello = String::from("你好");
    let hello = String::from("Olá");
    let hello = String::from("Здравствуйте");
    let hello = String::from("Hola");

Updating a String

A String can grow in size and its contents can change, just like the contents of a Vec<T> if you push more data into it.

In addition you can use the + operator or the format! macro to concatenate String values.

Appending to a String with push_str and push

A String can be grown by using the push_str method to append a string slice:

let mut s = String::from("foo");
s.push_str("bar");

After these two lines, the string will contain foobar.

The push_str method takes a string slice because we don't necessarily want it to take ownership of the parameter. The value we pass in is only borrowed, so we can still use it after appending.

The push method takes a single character as a parameter and adds it to the String. After the two lines below, s will contain lol.

let mut s = String::from("lo");
s.push('l');
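The borrowing behaviour of push_str can be checked with a short sketch like the one below, where the appended value is still usable afterwards. The function name append_demo is an assumption for illustration.

```rust
// Appends to a String and returns both values to show that
// the appended string was only borrowed.
fn append_demo() -> (String, String) {
    let mut s1 = String::from("foo");
    let s2 = String::from("bar");

    s1.push_str(&s2); // borrows s2 as a string slice
    s1.push('!');     // push appends a single char

    (s1, s2) // s2 was never moved, so it can still be returned
}

fn main() {
    let (s1, s2) = append_demo();
    println!("s1 is {s1}, s2 is {s2}");
}
```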

Concatenation with the + Operator or the format! Macro

If you want to combine two existing strings one way is to use the + operator

    let s1 = String::from("Hello, ");
    let s2 = String::from("world!");
    let s3 = s1 + &s2; // note s1 has been moved here and can no longer be used

s3 will contain the string Hello, world!

s1 is no longer valid after the addition because of the signature of the add method, which the + operator calls:

fn add(self, s: &str) -> String {

In this signature, the first parameter is self with no &, so add takes ownership of the String on the left. The second parameter is a reference, which is why we add &s2 rather than s2.

In the standard library, add is defined using generics and associated types. Here they are replaced with concrete types to illustrate what happens when add is called on a String.

Even though a &String is not a &str, the program still compiles because the compiler can coerce the &String argument into a &str.

You cannot add two String values together; you can only add a &str to a String.

Rust uses a deref coercion, which turns &s2 into &s2[..]. Because add only takes a reference to s2, ownership of s2 does not transfer, and s2 is still valid afterwards.

The definition takes ownership of s1 (a move, not a copy), appends a copy of the contents of s2, and returns ownership of the result.

It may look as though this makes a lot of copies, but it doesn't: the implementation is more efficient than copying both strings.
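The ownership behaviour of + can be sketched as follows. The pointer comparison relies on the current standard library implementation, which appends into the left-hand string's existing buffer; the with_capacity call is an assumption added here so that no reallocation is needed. The function name concat_demo is made up for this example.

```rust
// Returns the concatenated string, the still-valid right operand,
// and whether the result reuses the left operand's heap buffer.
fn concat_demo() -> (String, String, bool) {
    // Reserve enough room up front so appending does not reallocate.
    let mut s1 = String::with_capacity(32);
    s1.push_str("Hello, ");
    let s2 = String::from("world!");

    let before = s1.as_ptr();
    let s3 = s1 + &s2; // s1 is moved here; &s2 is coerced to &str
    let reused = before == s3.as_ptr();

    // s1 can no longer be used, but s2 is still valid.
    (s3, s2, reused)
}

fn main() {
    let (s3, s2, reused) = concat_demo();
    println!("{s3} / {s2} / buffer reused: {reused}");
}
```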

If we need to concatenate multiple strings, the behavior of the + operator gets unwieldy.

For combining strings in more complex ways, we can use the format! macro instead:

    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    let s = s1 + "-" + &s2 + "-" + &s3;
    
    // the same result, using the format! macro
	
    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    let s = format!("{s1}-{s2}-{s3}");

The format! macro works like println!, but instead of printing the output to the screen it returns a String with the contents.

The version using format! is much easier to read, and the code generated by format! uses references, so it doesn't take ownership of any of its parameters.
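A quick way to confirm that format! only borrows its arguments is to use them again afterwards, as in this sketch. The function name tic_tac_toe is illustrative.

```rust
// Formats the same three strings twice to show that format!
// does not take ownership of its arguments.
fn tic_tac_toe() -> (String, String) {
    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    let forward = format!("{s1}-{s2}-{s3}");
    // s1, s2 and s3 are all still valid here.
    let backward = format!("{s3}-{s2}-{s1}");

    (forward, backward)
}

fn main() {
    let (forward, backward) = tic_tac_toe();
    println!("{forward} {backward}");
}
```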

Indexing into Strings

You cannot access individual characters of a String using index syntax. The following code is invalid:

let s1 = String::from("hello");
let h = s1[0];

This code fails to compile. The reason lies in how Rust stores strings in memory.

Internal Representation

A String is a wrapper over a Vec<u8>.

Let's look at some UTF-8 encoded example strings:

let hello = String::from("Hola");

In this case len will be 4, which means the vector storing the string "Hola" is 4 bytes long. Each of these letters takes one byte when encoded in UTF-8, but that is not always true.

For example:

    let hello = String::from("Здравствуйте");

If you were asked how long this string is, you might say 12, but Rust's answer is 24: that is the number of bytes it takes to encode the string in UTF-8.

The String type, which is provided by Rust's standard library rather than coded into the core language, is a growable, mutable, owned, UTF-8 encoded string type.

The length is 24 because every Unicode scalar value in that string takes 2 bytes of storage. Therefore an index into the string's bytes will not always correspond to a valid Unicode scalar value.
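The difference between the byte length and the number of scalar values can be checked directly. This is a minimal sketch; the helper name lengths is an assumption.

```rust
// Returns (length in bytes, number of Unicode scalar values).
fn lengths(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

fn main() {
    for word in ["Hola", "Здравствуйте"] {
        let (bytes, scalars) = lengths(word);
        println!("{word}: {bytes} bytes, {scalars} scalar values");
    }
}
```

Note that len is a constant-time lookup, while chars().count() has to walk the whole string.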

To see why a byte index is a problem, consider this invalid Rust code:

let hello = "Здравствуйте"; 
let answer = &hello[0];

The value of answer would not be З, the first letter.

When encoded in UTF-8, the first byte of З is 208 and the second is 151, so it would seem that answer should contain 208. But 208 is not a valid character on its own, and hardly anyone wants just the byte at a given index of a string.

To avoid returning an unexpected value and causing bugs that might not be discovered immediately, Rust refuses to compile this code at all. This prevents misunderstandings early in the development process.

Bytes and Scalar Values and Grapheme Clusters

There are three ways to look at a UTF-8 string from Rust's perspective:

- Bytes
- Scalar Values
- Grapheme Clusters (the closest thing to what we would call letters)

Consider the Hindi word “नमस्ते”, written in the Devanagari script. The vector of u8 values that stores that string looks like this:

[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135]

This is 18 bytes and is how computers store the data

If we look at them as Unicode scalar values, which is what the Rust char type is, those bytes would look like this

['न', 'म', 'स', '्', 'त', 'े']

The problem with this is that the 4th and 6th values are not letters: they are diacritics that don't make sense on their own.

If we look at them as Grapheme Clusters then we would get the four characters that make up the Hindi word

["न", "म", "स्", "ते"]

Rust provides these ways of interpreting the raw string so that each program can choose the interpretation it needs, no matter what human language the data is in.

The final reason Rust doesn't allow us to index into a String to get a character is that indexing operations are expected to always take constant time, O(1). That performance cannot be guaranteed with a String, because Rust would have to walk through the contents from the beginning to the index to determine how many valid characters there are.
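If you do need the character at a given position, one option is to go through the chars iterator, which makes the linear cost explicit. The helper nth_char below is a hypothetical name for this sketch.

```rust
// Returns the n-th Unicode scalar value (zero-based), if there is one.
// This walks the string from the start, so it is O(n), not O(1).
fn nth_char(text: &str, n: usize) -> Option<char> {
    text.chars().nth(n)
}

fn main() {
    let word = "नमस्ते";
    // 18 bytes, but only 6 scalar values.
    println!("{} bytes", word.len());
    println!("{} scalar values", word.chars().count());
    // The 4th value is a combining diacritic, not a letter on its own.
    println!("{:?}", nth_char(word, 3));
    // Asking past the end returns None instead of panicking.
    println!("{:?}", nth_char(word, 10));
}
```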

Slicing Strings

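Indexing into a string is often a bad idea because it is not clear what the return type of the indexing operation should be: a byte value, a character, a grapheme cluster, or a string slice.

If you really need to use indices to create string slices, Rust asks you to be more specific.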

Rather than indexing using a single number, you can use [] with a range to create a string slice that contains particular bytes

let hello = "Здравствуйте";

let s = &hello[0..4];

s is a &str that contains the first 4 bytes of the string. Since each of these characters is two bytes long, s contains Зд.

If we were to slice only part of a character's bytes, with something like &hello[0..1], Rust would panic at runtime in the same way as if an invalid index were accessed in a vector.

Be careful when creating string slices with ranges, because doing so can crash your program.
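One way to avoid the panic is the get method on str, which returns None when a range is out of bounds or does not fall on character boundaries. The wrapper safe_slice below is a hypothetical helper for this sketch.

```rust
// Returns the requested byte range as a slice, or None if the range
// is out of bounds or would split a character.
fn safe_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}

fn main() {
    let hello = "Здравствуйте";

    // Bytes 0..4 cover exactly two 2-byte characters.
    println!("{:?}", safe_slice(hello, 0, 4));
    // Bytes 0..1 would cut the first character in half.
    println!("{:?}", safe_slice(hello, 0, 1));
    // is_char_boundary can be used to check an index in advance.
    println!("{}", hello.is_char_boundary(1));
    println!("{}", hello.is_char_boundary(2));
}
```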

Methods for Iterating over Strings

The best way to operate on pieces of strings is to be explicit about whether you want characters or bytes.

For individual Unicode scalar values, use the chars method.

Calling chars on “Зд” separates out and returns two values of type char. You can iterate over the result to access each element:

for c in "Зд".chars() {
    println!("{c}");
}

This will output:

З
д

You can also use the bytes method to return each raw byte

for b in "Зд".bytes() {
    println!("{b}");
}

This will output:

208
151
208
180

This may be appropriate for your use case.

But remember that valid Unicode scalar values may be made up of more than one byte.
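To see how bytes and scalar values line up, the char_indices method pairs each char with the byte offset where it starts. The sketch below also records each char's encoded width; the function name char_widths is made up for this example.

```rust
// For each scalar value: (starting byte offset, the char, its UTF-8 width).
fn char_widths(text: &str) -> Vec<(usize, char, usize)> {
    text.char_indices()
        .map(|(offset, c)| (offset, c, c.len_utf8()))
        .collect()
}

fn main() {
    for (offset, c, width) in char_widths("aЗन") {
        println!("{c} starts at byte {offset} and is {width} byte(s) wide");
    }
}
```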

Getting grapheme clusters from strings, as with the Devanagari script, is complex, so this functionality is not provided by the std library.

Crates are available on crates.io if you need this functionality.

Strings Are Not So Simple

Rust chooses to make correct handling of String data the default behavior for all Rust programs, which means that programmers have to put more thought into handling UTF-8 data up front.

Whilst this exposes more of the complexity of non-ASCII characters, it prevents you from having to handle errors involving those characters later in development.

The std library offers a lot of functionality built on the String and &str types.

Some other useful methods include contains, for searching in a string, and replace, for substituting parts of a string with another string.
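As a small sketch of these two methods working together, the hypothetical helper swap_word below replaces a word only if it is actually present.

```rust
// Replaces every occurrence of `from` with `to`, or returns None
// when `from` does not occur in the text at all.
fn swap_word(text: &str, from: &str, to: &str) -> Option<String> {
    if text.contains(from) {
        // replace returns a new String and leaves the input untouched.
        Some(text.replace(from, to))
    } else {
        None
    }
}

fn main() {
    println!("{:?}", swap_word("Hello, world!", "world", "Rust"));
    println!("{:?}", swap_word("Hello, world!", "moon", "Rust"));
}
```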

Check the documentation for other useful methods.