RustBrock/String.md
2025-01-20 17:14:33 -07:00

# String
``String``s are implemented as a collection of bytes, plus some methods to provide useful functionality when those bytes are interpreted as text.
A ``String`` can hold any text that can be encoded as UTF-8
These attached methods include creating, updating and reading.
Strings are different to other collections, namely in how indexing into a ``String`` is complicated by the differences between how people and computers interpret ``String`` data
## What is a String?
There is only one string type in the core language: the string slice ``str``, which is usually seen in its borrowed form ``&str``.
String literals are a special case: their data is written into the program's binary, and the literal itself is a string slice pointing at that data.
When "strings" are referred to in Rust, this means either ``&str`` (string slice) or the ``String`` type that is included in the std library
Both ``String`` and string slices are UTF-8 encoded
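As a rough sketch of how the two types relate: a function that takes ``&str`` accepts both a string literal and a borrowed ``String``. The ``byte_len`` helper is hypothetical and only here for illustration.

```rust
// Hypothetical helper (not from the notes): works on any borrowed string data.
fn byte_len(text: &str) -> usize {
    text.len()
}

fn main() {
    // A string literal is a &str pointing at data stored in the binary.
    let literal: &str = "hello";
    // to_string copies that data into an owned, growable String.
    let owned: String = literal.to_string();

    // Both can be passed where a &str is expected; &owned is coerced from &String.
    println!("{}", byte_len(literal));
    println!("{}", byte_len(&owned));
}
```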
## Creating a New String
Many of the same operations available with [``Vec<T>``](Vector.md) are available with ``String`` as well because it is implemented as a wrapper around a vector of bytes with some guarantees, restrictions and capabilities.
Creating one works the same way as creating a ``Vec<T>``
```rust
let mut s = String::new();
```
This creates a new empty string, which can then be loaded with data
Often we have some initial data to start the string with. For that we can use the ``to_string`` method, which is available on any type that implements the ``Display`` trait, as string literals do
```rust
let data = "initial contents";
let s = data.to_string();
// the same as the two lines above
let s = "initial contents".to_string();
```
These both create a string containing ``initial contents``
You can also use ``String::from`` to create a ``String`` from a string literal
```rust
let s = String::from("initial contents");
```
Because strings are used for so many things, there are many different generic APIs for them, which gives us a lot of options.
Some of them can seem redundant, but they all have their place.
``String::from`` and ``to_string`` do the same thing, so which one you choose is a matter of style and readability
Because strings are UTF-8 encoded, they can hold properly encoded data in any language; all of the following are valid ``String`` values
```rust
let hello = String::from("السلام عليكم");
let hello = String::from("Dobrý den");
let hello = String::from("Hello");
let hello = String::from("שלום");
let hello = String::from("नमस्ते");
let hello = String::from("こんにちは");
let hello = String::from("안녕하세요");
let hello = String::from("你好");
let hello = String::from("Olá");
let hello = String::from("Здравствуйте");
let hello = String::from("Hola");
```
## Updating a String
A ``String`` can grow in size and its contents can change, just like the contents of a ``Vec<T>`` if you push more data into it.
In addition you can use the ``+`` operator or the ``format!`` macro to concatenate ``String`` values.
### Appending to a String with push_str and push
A ``String`` can be grown by using the ``push_str`` method to append a string slice
```rust
let mut s = String::from("foo");
s.push_str("bar");
```
After these two lines the string will contain ``foobar``
``push_str`` takes a string slice because it doesn't necessarily want to take ownership of the parameter, so a value we only borrowed for appending can still be used afterwards
The ``push`` method takes a single character as a parameter and adds it to the ``String``
```rust
let mut s = String::from("lo");
s.push('l');
```
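A minimal sketch of the ownership point above, assuming a hypothetical ``append`` helper that is not part of the std library: the appended slice is only borrowed, so the caller can keep using it.

```rust
// Hypothetical helper: appends a borrowed slice and then one character.
fn append(mut base: String, suffix: &str, last: char) -> String {
    base.push_str(suffix); // borrows `suffix`, does not take ownership
    base.push(last); // adds a single char
    base
}

fn main() {
    let suffix = String::from("bar");
    let s = append(String::from("foo"), &suffix, '!');
    // `suffix` was only borrowed, so it can still be used here.
    println!("{s} (suffix is still {suffix})");
}
```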
### Concatenation with the ``+`` Operator or the ``format!`` Macro
If you want to combine two existing strings one way is to use the ``+`` operator
```rust
let s1 = String::from("Hello, ");
let s2 = String::from("world!");
let s3 = s1 + &s2; // note s1 has been moved here and can no longer be used
```
``s3`` will contain the string ``Hello, world!``
``s1`` is no longer valid because of the signature of the ``add`` method that the ``+`` operator uses
```rust
fn add(self, s: &str) -> String {
```
In this definition ``add`` takes ownership of ``self`` (the first string) and requires a second argument that is a reference
In the std library ``add`` is defined with generics and associated types; here concrete types are substituted to illustrate what happens when it is used for a ``String``
You cannot add two ``String`` values together directly; you can only add a ``&str`` to a ``String``
Even though ``&s2`` is a ``&String`` and not a ``&str``, the program still compiles because the compiler can coerce the ``&String`` argument into a ``&str``.
Rust uses a *deref coercion*, which turns ``&s2`` into ``&s2[..]``. Because this is a reference, ownership of ``s2`` does not transfer and ``s2`` stays valid
The statement takes ownership of ``s1``, appends a copy of the contents of ``s2``, and then returns ownership of the result
It may appear to be creating a lot of copies, but it isn't: the implementation is more efficient than copying
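The sketch below illustrates which values survive a ``+`` concatenation. The ``concat_keep`` helper is hypothetical; it is one possible way to keep both inputs, at the cost of copying the first one.

```rust
// Hypothetical helper: concatenates two borrowed strings into a new String,
// so neither argument is moved.
fn concat_keep(a: &str, b: &str) -> String {
    a.to_owned() + b
}

fn main() {
    let s1 = String::from("Hello, ");
    let s2 = String::from("world!");

    // Both s1 and s2 are only borrowed here.
    let kept = concat_keep(&s1, &s2);

    // Here s1 is moved into the `+` expression; s2 is only borrowed.
    let s3 = s1 + &s2;

    println!("{kept}");
    println!("{s3}");
    println!("s2 is still usable: {s2}");
    // println!("{s1}"); // would not compile: s1 was moved
}
```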
If we need to concatenate multiple strings, the behavior of the ``+`` operator gets unwieldy
For combining strings in more complex ways we can instead use the ``format!`` macro
```rust
let s1 = String::from("tic");
let s2 = String::from("tac");
let s3 = String::from("toe");
let s = s1 + "-" + &s2 + "-" + &s3;
// the same result using format!, which is easier to read
let s1 = String::from("tic");
let s2 = String::from("tac");
let s3 = String::from("toe");
let s = format!("{s1}-{s2}-{s3}");
```
The ``format!`` macro works like ``println!``, but instead of printing the output to the screen it returns a ``String`` with the contents.
Using ``format!`` is much easier to read, and the code generated by ``format!`` uses references so it doesn't take ownership of any of its parameters
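To make the ownership claim concrete, here is a small sketch in which all three inputs are still usable after ``format!``. The ``dashed`` function is a hypothetical wrapper added for illustration.

```rust
// Hypothetical helper: joins three borrowed strings with dashes.
fn dashed(a: &str, b: &str, c: &str) -> String {
    format!("{a}-{b}-{c}")
}

fn main() {
    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    let s = format!("{s1}-{s2}-{s3}");

    // format! only borrowed its arguments, so all three are still usable.
    println!("{s} was built from {s1}, {s2} and {s3}");
    println!("{}", dashed(&s1, &s2, &s3));
}
```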
## Indexing into Strings
You cannot access individual characters in a ``String`` using normal index syntax; the following code does not compile
```rust
let s1 = String::from("hello");
let h = s1[0]; // compile error: a String cannot be indexed with an integer
```
The reason is how strings are stored in memory in Rust
### Internal Representation
A ``String`` is a wrapper over a ``Vec<u8>``
Let's take some UTF-8 example strings
```rust
let hello = String::from("Hola");
```
In this case ``len`` will be `4`, which means the vector storing the string `"Hola"` is 4 bytes long. Each of these letters takes one byte when encoded in UTF-8, but this is not always true
For example
```rust
let hello = String::from("Здравствуйте");
```
If you were asked how long the string is, you may say 12, but in fact the answer in Rust is 24, which is the number of bytes it takes to encode the string in UTF-8
This is because every Unicode scalar value in that string takes 2 bytes of storage. Therefore an index into the string's bytes will not always correlate to a valid Unicode scalar value
Remember that the ``String`` type, which is provided by Rust's standard library rather than coded into the core language, is a growable, mutable, owned, UTF-8 encoded string type, so its length is always measured in bytes
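The difference between byte length and the number of scalar values can be checked directly. This is a minimal sketch; the ``byte_and_char_counts`` helper is hypothetical.

```rust
// Hypothetical helper: returns (number of bytes, number of Unicode scalar values).
fn byte_and_char_counts(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

fn main() {
    for word in ["Hola", "Здравствуйте"] {
        let (bytes, chars) = byte_and_char_counts(word);
        println!("{word}: {bytes} bytes, {chars} scalar values");
    }
}
```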
To break this down, consider this invalid code
```rust
let hello = "Здравствуйте";
let answer = &hello[0];
```
The value of `answer` would not be `З`, the first letter
When encoded in UTF-8, the first byte of `З` is `208` and the second is `151`, so it would seem that `answer` should contain `208`, but `208` by itself is not a valid character.
Returning `208` is rarely what anyone wants when asking for the first letter of a string, yet it is the only data Rust has at byte index 0
To avoid returning an unexpected value and causing bugs that might not be discovered immediately, Rust does not compile this code at all. This prevents misunderstandings early in the dev process.
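If the first byte or the first character is really what you want, you can ask for it explicitly. A small sketch, with hypothetical helper names `first_byte` and `first_char`:

```rust
// Hypothetical helper: the first raw byte of the string, if any.
fn first_byte(text: &str) -> Option<u8> {
    text.bytes().next()
}

// Hypothetical helper: the first Unicode scalar value of the string, if any.
fn first_char(text: &str) -> Option<char> {
    text.chars().next()
}

fn main() {
    let hello = "Здравствуйте";
    println!("{:?}", first_byte(hello));
    println!("{:?}", first_char(hello));
}
```

Both helpers return an `Option` because the string may be empty.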
### Bytes and Scalar Values and Grapheme Clusters
There are three ways to look at a UTF-8 string from Rust's perspective:
- Bytes
- Scalar Values
- Grapheme Clusters (the closest thing to what we would call letters)
Consider the Hindi word “नमस्ते” written in the Devanagari script. The vector of `u8` values that stores that string looks like this
```
[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135]
```
This is 18 bytes and is how computers store the data
If we look at them as Unicode scalar values, which is what the Rust `char` type is, those bytes would look like this
```
['न', 'म', 'स', '्', 'त', 'े']
```
The problem with this is that the 4th and 6th values are not letters: they are diacritics that don't make sense on their own
If we look at them as grapheme clusters, we get the four letters that make up the Hindi word
```
["न", "म", "स्", "ते"]
```
Rust provides these ways of interpreting the raw string so that each program can choose the interpretation it needs, no matter what human language the data is in.
The final reason Rust doesn't allow us to index into a `String` to get a character is that indexing operations are expected to always take constant time O(1), but that cannot be guaranteed for a `String`, because Rust would have to walk through the contents from the beginning up to the index to determine how many valid characters there were.
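The byte and scalar value views, and the cost of finding the n-th character, can be sketched with std alone. The `nth_char` helper is hypothetical, and grapheme clusters are left out because the std library does not provide them.

```rust
// Hypothetical helper: finds the n-th Unicode scalar value by walking the
// string from the start, which takes time proportional to n, not O(1).
fn nth_char(text: &str, n: usize) -> Option<char> {
    text.chars().nth(n)
}

fn main() {
    let word = "नमस्ते";
    println!("bytes: {}", word.len());
    println!("scalar values: {}", word.chars().count());
    // The value at position 3 is a diacritic, not a letter.
    println!("{:?}", nth_char(word, 3));
}
```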
## Slicing Strings
Indexing into a string is often a bad idea because it is not clear what the return type of the operation should be: a byte value, a character, a grapheme cluster, or a string slice.
If you really need to use indices to create string slices, Rust asks you to be more specific
Rather than indexing using a single number, you can use `[]` with a range to create a string slice that contains particular bytes
```rust
let hello = "Здравствуйте";
let s = &hello[0..4];
```
`s` is a `&str` that contains the first 4 bytes of the string
Since each of these characters is two bytes long, this means that `s` contains `Зд`
If we were to slice only part of a character's bytes, for example with `&hello[0..1]`, Rust would panic at runtime in the same way as if an invalid index were accessed in a vector
Be careful when creating string slices with ranges, because doing so can crash your program
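One way to avoid the panic is the `get` method on `str`, which returns `None` instead of panicking when the range is out of bounds or cuts through a character. A minimal sketch, with a hypothetical `safe_prefix` helper:

```rust
// Hypothetical helper: returns the first `end` bytes as a slice, or None if
// `end` is out of range or falls inside a character.
fn safe_prefix(text: &str, end: usize) -> Option<&str> {
    text.get(0..end)
}

fn main() {
    let hello = "Здравствуйте";
    println!("{:?}", safe_prefix(hello, 4));
    println!("{:?}", safe_prefix(hello, 1));
    // is_char_boundary reports whether a byte index is a safe place to cut.
    println!("{}", hello.is_char_boundary(1));
}
```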
## Methods for Iterating over Strings
The best way to operate on pieces of strings is to be explicit about whether you want characters or bytes
For individual Unicode scalar values use the `chars` method
Calling `chars` on “Зд” separates out and returns two values of type `char`
You can iterate over the result to access each element
```rust
for c in "Зд".chars() {
    println!("{c}");
}
```
this will output
```
З
д
```
You can also use the `bytes` method to return each raw byte
```rust
for b in "Зд".bytes() {
    println!("{b}");
}
```
this will output
```
208
151
208
180
```
Working with raw bytes may be appropriate for your use case
Remember that valid Unicode scalar values may be made up of more than one byte
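If you need both views at once, the `char_indices` method pairs each scalar value with the byte offset where it starts. A short sketch; the `char_offsets` helper is hypothetical:

```rust
// Hypothetical helper: collects (byte offset, char) pairs for a string.
fn char_offsets(text: &str) -> Vec<(usize, char)> {
    text.char_indices().collect()
}

fn main() {
    for (offset, c) in char_offsets("Зд") {
        println!("{c} starts at byte {offset}");
    }
}
```

The offsets returned this way always fall on character boundaries, so they are safe to use in slice ranges.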
Getting grapheme clusters from strings, as with the Devanagari script, is complex
Therefore this functionality is not provided by the std library
Crates are available on [crates.io](https://crates.io) if you need this functionality
## Strings Are Not So Simple
Rust chooses to make correct handling of `String` data the default behavior for all Rust programs, which means that programmers have to put more thought into handling UTF-8 data up front
Whilst this exposes more of the complexity of strings, it prevents errors involving non-ASCII characters later in development
The std library offers a lot of functionality built on the `String` and `&str` types
Some useful methods include `contains` for searching in a string and `replace` for substituting parts of a string with another string
Check the documentation for other useful methods
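As a quick illustration of `contains` and `replace`, here is a sketch with a hypothetical `redact` helper; the `***` placeholder is an arbitrary choice.

```rust
// Hypothetical helper: replaces every occurrence of `word` with "***",
// or returns the text unchanged if `word` does not occur.
fn redact(text: &str, word: &str) -> String {
    if text.contains(word) {
        text.replace(word, "***")
    } else {
        text.to_string()
    }
}

fn main() {
    println!("{}", redact("tic-tac-toe", "tac"));
    println!("{}", redact("hello", "tac"));
}
```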