Part 2: Streams

Published 3 months ago Updated 3 months ago


“What is a stream?” I hear you ask.

Put very simply, a stream is akin to an array but with a crucial difference: it can represent an endless sequence of data. Unlike arrays, which have a fixed size and must fit entirely in available memory, streams allow data to be loaded and processed incrementally, making them suitable for handling arbitrarily large amounts of data.

An array of 1000 int values, for example, will occupy exactly[1] 4000 bytes (4 KB) in memory. Because of this, arrays have a capacity that is dictated by the amount of available RAM. Frameworks such as .NET or the JVM will go a step further and prevent you from allocating too much in a bid to save you from yourself. If we attempt to allocate a 3 GiB array:

var array = new byte[3L * 1024 * 1024 * 1024]; // 3 GiB
Console.WriteLine(array.Length);

We'll actually hit an OverflowException:

Unhandled exception. System.OverflowException: Array dimensions exceeded supported range.
   at Program.<Main>$(String[] args) in Program.cs:line 1

But sometimes you want to load large amounts of data. It's not uncommon for audio and video files to well exceed this size, so how do VLC and other media players actually load large videos into memory? This is where streams come in.

Streams allow a program to load large amounts of data in more manageable “chunks”, and we move around what is essentially a pointer. When we first create a stream, this pointer is at the start of the stream ready for data to be read or written. As you read from or write to the stream, the pointer moves forward by the specified number of bytes, allowing you to process data incrementally.

For example, we could choose to read 4 bytes at a time. When we do this, we get back just those 4 bytes we asked for, while the stream pointer moves ahead 4 bytes ready to read the next chunk of data. This essentially allows us to create a “window” into the data without allocating the entire size at once.

I'll take a byte out of this data!

Let's see how this might look in code. We're going to use the MemoryStream class which represents - quite obviously - a stream in memory. This means no actual data gets written to disk or anywhere special. Let's write a single byte, 0x45, then try to read it back.

using var stream = new MemoryStream();
stream.WriteByte(0x45); // 69 nice

var b = (byte)stream.ReadByte();
Console.WriteLine(b);

Note

Notice the use of a using declaration when creating the stream. The Stream class implements the IDisposable interface due to the native allocations it potentially creates, and therefore it's critical that we call the Dispose method when we're done using it to avoid leaks. The using declaration does that for us when the variable goes out of scope. Alternatively, you can call the Close method but for reasons I won't get into here, you shouldn't. Use Dispose instead.

Except when we run this, we don't see 69 printed, we see 255. Why?

This is due to the behaviour of ReadByte which, interestingly enough, doesn't return a byte. This method actually returns an int because a potential return value is -1. In this case, we are getting -1 which is being cast to byte. Since a byte cannot hold a negative value, the cast wraps around and gives us 255. If you remove the cast, you can see the value for what it truly is.
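One way to avoid being caught out by this is to keep the result as an int and check for -1 before narrowing it. Here's a minimal sketch of that idea; the variable names are mine and purely for illustration:

```csharp
using System;
using System.IO;

using var stream = new MemoryStream();
stream.WriteByte(0x45);

// Keep the raw int so the -1 "end of stream" marker is not lost.
int raw = stream.ReadByte();

// Only narrow to byte once we know a real byte came back.
string message = raw == -1
    ? "end of stream"
    : $"read 0x{(byte)raw:X2}";
Console.WriteLine(message);

// The narrowing cast is what turns -1 into 255.
byte wrapped = unchecked((byte)raw);
Console.WriteLine(wrapped);
```

This prints "end of stream" followed by 255, which is the same surprise as above, just with the cause made visible.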

However the question remains, why -1 and not 69?

Remember that a stream acts as a pointer. When we perform read and write operations, the pointer moves ahead by exactly how many bytes were read or written. Since we wrote a byte, the stream is no longer pointing at the start. It actually moved ahead 1 byte, and when we now try to read from the current position, we get -1 because this is the value used to represent the “end of the stream”.

[Animation caption: 30fps animations make me cry.]
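If you'd rather watch the pointer move than take my word for it, the Position and Length properties on the stream expose it. A small sketch, again using MemoryStream:

```csharp
using System;
using System.IO;

using var stream = new MemoryStream();

long before = stream.Position;   // nothing written yet
stream.WriteByte(0x45);
long after = stream.Position;    // the write moved the pointer ahead

// When Position equals Length there is nothing left to read.
bool atEnd = stream.Position == stream.Length;

Console.WriteLine($"before={before}, after={after}, atEnd={atEnd}");
```

On a fresh MemoryStream this prints before=0, after=1, atEnd=True, which is why the ReadByte call had nothing to give us.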

What we need to do is to move the pointer back to the start, so that when we call ReadByte, it reads the byte we just wrote. We do this with a technique called “seeking” and all that does is move the pointer to exactly where we want it to go. In .NET, we use the Seek method. It accepts an offset and an “origin”. The docs are a tad obscure on this part, but the SeekOrigin parameter simply specifies what our offset is relative to. An offset of -2 with an origin of SeekOrigin.End, for example, means that the stream will seek to 2 bytes before the end of the stream, skipping all the data before it. (A positive offset from End would point past the end of the stream.)

In our case, we want to seek from Begin with an offset of 0. This tells the stream to move back to the start.

using var stream = new MemoryStream();
stream.WriteByte(0x45); // 69 nice

// move to the start of the stream
stream.Seek(0, SeekOrigin.Begin);

var b = (byte)stream.ReadByte();
Console.WriteLine(b);

Now when we run the code and call ReadByte, we see 69 printed to the console as we expected.
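The other origins work the same way. As a rough sketch (the six sample bytes are arbitrary values I picked for illustration), a negative offset from SeekOrigin.End lands before the end of the stream, and SeekOrigin.Current is relative to wherever the pointer happens to be:

```csharp
using System;
using System.IO;

// Six bytes of sample data; the values are arbitrary.
using var stream = new MemoryStream(new byte[] { 0x10, 0x20, 0x30, 0x40, 0x50, 0x60 });

// Seek returns the new absolute position.
long fromEnd = stream.Seek(-2, SeekOrigin.End);          // 6 - 2 = 4
int first = stream.ReadByte();                           // reads 0x50, pointer is now at 5

long fromCurrent = stream.Seek(-3, SeekOrigin.Current);  // 5 - 3 = 2
int second = stream.ReadByte();                          // reads 0x30

Console.WriteLine($"{fromEnd}: {first:X2}, {fromCurrent}: {second:X2}");

// Setting Position directly has the same effect as seeking from Begin.
stream.Position = 0;
```

This prints "4: 50, 2: 30".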

Important

Not all streams support seeking! Streams can point to any arbitrary data including data that are streamed in from a remote server. It's possible that the server just sends the data and forgets about it, meaning it would be impossible to “go back” and re-read again. A few streams do support seeking, including MemoryStream because the underlying data are in RAM. By definition this means we can randomly access it.

You're probably already familiar with streams

Chances are you've dealt with streams in the past and not even realised it. Whenever you call Console.ReadLine or Console.WriteLine, you are reading from and writing to the standard input and output streams respectively. After all, it's not as if you write the entirety of the program's output in one fell swoop[2]. Instead you write to the standard output stream one line at a time, perhaps reading some input from the user along the way, before continuing on and writing more lines.

This is streams-in-action! It's just abstracted in a way that makes it easier to read and write strings, because the primary purpose of the standard input/output streams is to request user input and display messages to the user.

However, because standard output is a stream, we do have the ability to read and write raw byte data too. In .NET this is achieved by calling the Console.OpenStandardOutput method. This method gives us back a Stream which we can then use to perform the same test as before:

using Stream stream = Console.OpenStandardOutput();
stream.WriteByte(0x45);

If we run this code, we see the letter E printed to the console. This is because although we are sending raw bytes, the console is doing what it's designed to do and displays it as text. The character E corresponds to the ASCII value 69 (45 in hex), and so writing the byte 0x45 printed an E.

But remember how I said not all streams are “seekable”? The standard input and output streams are two such examples. If we try to call the Seek method on the standard output stream:

using Stream stream = Console.OpenStandardOutput();
stream.Seek(0, SeekOrigin.Begin);

We'll hit NotSupportedException, because once data are written to the console the pointer simply moves ahead - there is no way to override this behaviour.

Unhandled exception. System.NotSupportedException: Stream does not support seeking.
   at System.IO.ConsoleStream.Seek(Int64 offset, SeekOrigin origin)
   at Program.<Main>$(String[] args) in Program.cs:line 2
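Rather than finding this out from an exception, you can ask the stream up front. Every Stream exposes a CanSeek property, so a defensive sketch might look like this:

```csharp
using System;
using System.IO;

using Stream stdout = Console.OpenStandardOutput();
using var memory = new MemoryStream();

bool stdoutSeekable = stdout.CanSeek;
bool memorySeekable = memory.CanSeek;
Console.WriteLine($"stdout: {stdoutSeekable}, memory: {memorySeekable}");

// Guard the call instead of catching NotSupportedException.
if (stdout.CanSeek)
{
    stdout.Seek(0, SeekOrigin.Begin);
}
```

This prints "stdout: False, memory: True", and the Seek call is simply skipped for the console stream.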

A more sane way to read and write

Reading and writing one byte at a time, while possible, is not very practical. We ideally want to read and write in bulk, and this is where the most important part of the Stream API comes in. I'm going to create an array of bytes which contains some data:

byte[] data = [0x45, 0x7A, 0xB2, 0xFF, 0x81, 0x90];

For the sake of simplicity and demonstration, it's only 6 elements. But let's now write this data to the stream using the Write(byte[], int, int) method. This method accepts the array of bytes we want to write, the index in that array at which to start copying, and a count of how many bytes to actually write. The first argument will be our data array. Since we want to copy every element, we start at index 0 and pass a count of 6.

byte[] data = [0x45, 0x7A, 0xB2, 0xFF, 0x81, 0x90];

using var stream = new MemoryStream();
stream.Write(data, 0, 6);

Alternatively, since we're in managed .NET land, we could also just pass the Length of the array:

byte[] data = [0x45, 0x7A, 0xB2, 0xFF, 0x81, 0x90];

using var stream = new MemoryStream();
stream.Write(data, 0, data.Length);

So now how might we read this data, 2 bytes at a time? Remember the stream could contain arbitrarily large amounts of data. Sure it's 6 elements in this example, but it could be much much larger. For this we use the Read(byte[], int, int) method. This method might seem a bit confusing at first because it doesn't return the data. It actually returns an integer which represents how many bytes were read. The data are actually written to a pre-existing block of memory, which is the array that we pass to it.

We want to read 2 bytes at a time, so we're going to need to allocate enough room for 2 bytes. We can then tell the Stream to read 2 bytes and place them into the array. Similar to the Write method, the Read method wants an index at which to start writing into the array, as well as a count of how many bytes to read from the stream.

byte[] data = [0x45, 0x7A, 0xB2, 0xFF, 0x81, 0x90];

using var stream = new MemoryStream();
stream.Write(data, 0, data.Length);

var destination = new byte[2];
stream.Read(destination, 0, destination.Length);

After that, we'll print each byte out to the console:

foreach (byte b in destination)
{
    Console.Write($"{b:X2} ");
}

Console.WriteLine();

Run this code, and you'll see we don't actually get the bytes 0x45 and 0x7A printed as we expected. We get the bytes 00 00. Perhaps it didn't read all the bytes. Remember how I mentioned the Read method returns the number of bytes read? Let's confirm how many bytes were read by printing the result of that method:

var destination = new byte[2];
int bytesRead = stream.Read(destination, 0, destination.Length);
Console.WriteLine($"{bytesRead} bytes were read");

Run it, and you'll indeed see “0 bytes were read”. We forgot something, do you remember what it was?

Hint: It's to do with the stream pointer.

We forgot to seek to the start of the stream after writing the data. After we wrote our array into the stream, the pointer moved to position 6, so our attempt at reading started at the end of the stream and found nothing to read. After writing, let's seek to position 0 and try again:

byte[] data = [0x45, 0x7A, 0xB2, 0xFF, 0x81, 0x90];

using var stream = new MemoryStream();
stream.Write(data, 0, data.Length);
stream.Seek(0, SeekOrigin.Begin);

var destination = new byte[2];
int bytesRead = stream.Read(destination, 0, destination.Length);
Console.WriteLine($"{bytesRead} bytes were read");

foreach (byte b in destination)
{
    Console.Write($"{b:X2} ");
}

Console.WriteLine();

Run the program again and now it works as expected! The output becomes:

2 bytes were read
45 7A 

Now it becomes trivial to iterate over the entire stream, by simply reading in a loop until bytesRead becomes 0:

byte[] data = [0x45, 0x7A, 0xB2, 0xFF, 0x81, 0x90];

using var stream = new MemoryStream();
stream.Write(data, 0, data.Length);
stream.Seek(0, SeekOrigin.Begin);

var destination = new byte[2];
int bytesRead;
do
{
    bytesRead = stream.Read(destination, 0, destination.Length);
    Console.WriteLine($"{bytesRead} bytes were read");

    foreach (byte b in destination)
    {
        Console.Write($"{b:X2} ");
    }

    Console.WriteLine();
} while (bytesRead > 0);

Now the code reads the stream in chunks of 2, printing the bytes as they come in:

2 bytes were read
45 7A 
2 bytes were read
B2 FF 
2 bytes were read
81 90 
0 bytes were read
81 90 

In the last iteration, we see “0 bytes were read” and the loop terminates. It does however print the same two bytes since the call to Read didn't replace the existing data. We'll just call this a feature and I leave that as an exercise to you to “fix”.

Files as a stream

For this example, you're going to need a large file. Specifically a file that exceeds 2 GiB in size, since .NET limits a byte[] to just under int.MaxValue elements. I'm going to use an OBS recording that actually occupies close to 12 GiB, but feel free to use whatever file you want.

Let's try to load this large file into memory all at once, using the File.ReadAllBytes method which returns an array of bytes. Be sure to replace the value of the path variable with the path to your large file:

string path = @"/path/to/large/file";
byte[] data = File.ReadAllBytes(path);
Console.WriteLine(data.Length);

When we run this code, we'll hit IOException because the file is simply too large to be contained within a byte[] value:

Unhandled exception. System.IO.IOException: The file is too long. This operation is currently limited to supporting files less than 2 gigabytes in size.
   at System.IO.File.ReadAllBytes(String path)
   at Program.<Main>$(String[] args) in Program.cs:line 2

Not to fear though! We can treat the file as a stream instead, and read its data using the same API as before.

We can actually grab a handle to a file stream quite easily using the File.Open method. This method returns a FileStream and it accepts the path, a FileMode, and a FileAccess parameter.

We want to open an existing file for reading only, so the call should look like this:

string path = @"/path/to/large/file";
using FileStream stream = File.Open(path, FileMode.Open, FileAccess.Read);

However, there is also a helper method which can shorten this line. “Open” mode with “Read” access is such a common combination that there exists the File.OpenRead method, which just wants the path. So we can call that instead:

string path = @"/path/to/large/file";
using FileStream stream = File.OpenRead(path);

And just like before, we can allocate a small array to hold some of the data (let's say 10 bytes for now), and read just 10 bytes into it:

string path = @"/path/to/large/file";
using FileStream stream = File.OpenRead(path);

var buffer = new byte[10];
int bytesRead = stream.Read(buffer, 0, buffer.Length);
Console.WriteLine($"{bytesRead} bytes were read");

foreach (byte b in buffer)
{
    Console.Write($"{b:X2} ");
}

Console.WriteLine();

Your data will differ of course, but I see:

10 bytes were read
1A 45 DF A3 A3 42 86 81 01 42 

And just like before, we can print the bytes out for the entire large file. We'll also make the buffer 1 MiB in size, instead of 10 bytes, just so it can read a bit faster:

var buffer = new byte[1024 * 1024]; // 1 MiB buffer
while (true)
{
    int bytesRead = stream.Read(buffer);
    if (bytesRead == 0)
    {
        break; // did you do your homework? this is how you "fix" it
    }

    // only print the bytes this call actually read; the final chunk
    // is usually smaller than the buffer
    for (int i = 0; i < bytesRead; i++)
    {
        Console.Write($"{buffer[i]:X2} ");
    }
}

Console.WriteLine();
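Printing gigabytes of hex to the console is slow, so if you just want to check that the loop visits every byte, counting works as a quieter test. The sketch below is my own variation rather than part of the walkthrough: it uses a small temporary file as a stand-in for the large one (the 2,500-byte size and 1 KiB buffer are arbitrary), and relies on the return value of Read to account for the final chunk being shorter than the buffer.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the large file: a 2,500-byte temporary file.
string path = Path.GetTempFileName();
File.WriteAllBytes(path, new byte[2500]);

long total = 0;
int chunks = 0;
var buffer = new byte[1024]; // deliberately not a divisor of 2,500

using (FileStream stream = File.OpenRead(path))
{
    int bytesRead;
    while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        // bytesRead can be smaller than the buffer, so always add
        // what was actually read rather than buffer.Length.
        total += bytesRead;
        chunks++;
    }
}

File.Delete(path);
Console.WriteLine($"{total} bytes in {chunks} chunks");
```

The total should match the file size no matter how the reads happen to be split up.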

Now when running the application we can see the data get printed to the console. Depending on how large the file is, this may take an extremely long time. We may be able to speed it up slightly by utilising something called a BufferedStream. This stream wraps another existing stream and keeps its own buffer, so that many small reads or writes turn into fewer, larger operations on the underlying stream. It pays off most when you read or write in small pieces; FileStream already does some buffering of its own, so with 1 MiB reads the gain here is likely to be modest. We don't actually need to change the loop at all. All we need to do is change what the stream variable points to:

using FileStream fileStream = File.OpenRead(path);
using var stream = new BufferedStream(fileStream, 1024 * 1024);

var buffer = new byte[1024 * 1024]; // 1 MiB buffer
// ...

Now the Read method inside the loop no longer calls Read on the FileStream, but rather the BufferedStream which wraps it!
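If you want to convince yourself that the wrapper really is transparent without waiting on a 12 GiB file, here's a small sketch that uses a MemoryStream as a stand-in for the FileStream. The 4096-byte buffer size is an arbitrary choice of mine, and the read loop is the same 2-bytes-at-a-time pattern from earlier:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

byte[] data = { 0x45, 0x7A, 0xB2, 0xFF, 0x81, 0x90 };

// MemoryStream stands in for the FileStream; 4096 is an arbitrary buffer size.
using var inner = new MemoryStream(data);
using var stream = new BufferedStream(inner, 4096);

var seen = new List<byte>();
var destination = new byte[2];
int bytesRead;
while ((bytesRead = stream.Read(destination, 0, destination.Length)) > 0)
{
    // Copy only the bytes this call actually produced.
    for (int i = 0; i < bytesRead; i++)
    {
        seen.Add(destination[i]);
    }
}

Console.WriteLine(BitConverter.ToString(seen.ToArray()));
```

The loop neither knows nor cares that a BufferedStream sits in between, and the output is 45-7A-B2-FF-81-90, the same bytes in the same order.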

I understand that this might have been a lot to take in. Do feel free to drop me a line if there's anything about this section that can be improved or clarified.

In the coming parts, we'll look at just what all of this has to do with sockets and why understanding how to work with streams is an important aspect of it.


  1. Okay maybe not “exactly” depending on the language and runtime you're using due to overhead and metadata. “At least” is better phrasing.

  2. Though it is technically possible to do this on both Windows and Linux.