jmcph4

Not So Direct I/O

[2025-08-12 11:10:00 +1000]

Recently, the CEO of TigerBeetle, Joran Dirk Greef, posted an interview question for a (presumably hypothetical) DBMS engineering role. This nerdsniped me and led me on a bit of a wild goose chase trying to find a satisfactory answer. In the end, Tanel Poder answered the question with a fairly succinct demonstration. In an effort to do more, and also to share my thought process on the problem, this post will act as both a solution and an exploration of how filesystems work and some of the Linux kernel source code.

Understanding the Question

DBMS Interview Challenge

A system performs Direct I/O:

- using O_DIRECT,
- aligning to 4096 byte Advanced Format sector size, and
- reading exactly 4096 bytes at a time.

What numbers other than 0 or 4096 will read(2) return, as the number of bytes read? Why?

— Joran Dirk Greef (@jorandirkgreef) August 5, 2025

Firstly, let's unpack the constraints provided to us in the question.

using O_DIRECT

This refers to a certain flag provided to open(2) and suggests that we'll be performing direct I/O (there's a big caveat here; more on this shortly).

aligning to 4096 byte Advanced Format sector size

I am interpreting this as referring to the memory alignment of the userland buffers that we'll be performing I/O to and from. Finally, from the last constraint, we know that we'll only ever be reading data in 4096 byte increments. Joran made several follow-up posts which provide some additional facts:

(Assume all writes are also aligned to 4096 bytes, and that the file size is a multiple of 4096)

— Joran Dirk Greef (@jorandirkgreef) August 5, 2025

Ah! You can assume the file size is aligned to 4096 bytes as well (but good clarification!).

— Joran Dirk Greef (@jorandirkgreef) August 5, 2025

Assume the read is never interrupted, and there's no error, are you sure you'll only get 0 or 4096? Thinking about the hardware some more, what's plausible to expect?

— Joran Dirk Greef (@jorandirkgreef) August 5, 2025

So we know that the file size is a multiple of 4096 bytes (i.e., we'll never request to read a partial block!) and that we will never perform a write to an address that is not 4096-byte-aligned.
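To make the setup concrete, here's a minimal sketch of the regime the question describes (the path data.bin is a placeholder of my own, and I'm assuming the file already exists with a size that's a multiple of 4096): open the file with O_DIRECT, allocate a 4096-byte-aligned buffer with posix_memalign, and issue 4096-byte reads, printing whatever read(2) returns each time.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

int main(void)
{
    /* "data.bin" is a placeholder path; assume its size is a multiple of 4096 */
    int fd = open("data.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* O_DIRECT typically requires the userspace buffer itself to be aligned */
    void *buf = NULL;
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return EXIT_FAILURE;
    }

    /* read exactly 4096 bytes at a time and report what read(2) returns */
    for (;;) {
        ssize_t n = read(fd, buf, BLOCK_SIZE);
        if (n < 0) {
            perror("read");
            break;
        }
        printf("read(2) returned %zd\n", n);
        if (n == 0)             /* 0 indicates end of file */
            break;
    }

    free(buf);
    close(fd);
    return EXIT_SUCCESS;
}

The question, then, is what values other than 0 and 4096 that printf could ever emit.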

Background Knowledge

Kernel Buffering of File I/O

During an ordinary read, when data is requested from a block device, the kernel will first attempt to serve the request from an in-memory cache (called the page cache, somewhat confusingly). This potentially avoids an expensive seek on the physical drive.[1] Under this buffered I/O regime, whenever a read misses this page cache the kernel will perform read-ahead: it will retrieve more data than requested from the disk in order to populate this cache in an attempt to avoid future misses. This page cache structure is a mapping from $\left(\text{inode}, \text{offset}\right)$ tuples to $\text{page}$.[2] The relevant part of the kernel source tree is mm, with readahead.c and filemap.c being particularly notable.
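You can actually observe the page cache from userspace. The following is a small sketch of my own (not something from the question): it maps a file and uses mincore(2) to count how many of its pages are currently resident in the page cache. Read the file with ordinary buffered I/O first and the resident count goes up, thanks in part to read-ahead.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return EXIT_FAILURE;
    }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) {
        perror("open/fstat");
        return EXIT_FAILURE;
    }

    /* map the file so we can ask the kernel which of its pages are cached */
    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    long page_size = sysconf(_SC_PAGESIZE);
    size_t pages = (st.st_size + page_size - 1) / page_size;
    unsigned char *vec = malloc(pages);

    /* mincore(2) sets the low bit of vec[i] if page i is resident in memory */
    if (mincore(map, st.st_size, vec) != 0) {
        perror("mincore");
        return EXIT_FAILURE;
    }

    size_t resident = 0;
    for (size_t i = 0; i < pages; i++)
        resident += vec[i] & 1;

    printf("%zu of %zu pages resident in the page cache\n", resident, pages);
    return EXIT_SUCCESS;
}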

With O_DIRECT, this caching is (ideally) skipped. This means that if you request data from disk, the kernel won't bother consulting the page cache nor will it perform read-ahead. The caveat is that O_DIRECT is essentially meaningless in that it provides zero guarantees of anything. In fact, any sane filesystem will resort to regular buffered I/O when it cannot reliably service the direct access request.

More on O_DIRECT

For those who don't know, O_DIRECT is a flag that can be passed to the open(2) system call in POSIX-compliant systems. From the man page for open(2):

Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.

O_DIRECT requests that kernel-side I/O caching be skipped. Looking at the relevant subsection in the notes (emphasis my own):

The O_DIRECT flag may impose alignment restrictions on the length and address of user-space buffers and the file offset of I/Os. In Linux alignment restrictions vary by filesystem and kernel version and might be absent entirely. The handling of misaligned O_DIRECT I/Os also varies; they can either fail with EINVAL or fall back to buffered I/O.

Since Linux 6.1, O_DIRECT support and alignment restrictions for a file can be queried using statx(2), using the STATX_DIOALIGN flag. Support for STATX_DIOALIGN varies by filesystem; see statx(2).

Some filesystems provide their own interfaces for querying O_DIRECT alignment restrictions, for example the XFS_IOC_DIOINFO operation in xfsctl(3). STATX_DIOALIGN should be used instead when it is available.

If none of the above is available, then direct I/O support and alignment restrictions can only be assumed from known characteristics of the filesystem, the individual file, the underlying storage device(s), and the kernel version. In Linux 2.4, most filesystems based on block devices require that the file offset and the length and memory address of all I/O segments be multiples of the filesystem block size (typically 4096 bytes). In Linux 2.6.0, this was relaxed to the logical block size of the block device (typically 512 bytes). A block device's logical block size can be determined using the ioctl(2) BLKSSZGET operation or from the shell using the command:

blockdev --getss

O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes. This restriction does not apply when the memory buffer for the O_DIRECT I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag. Nor does this restriction apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will not be available to the child after fork(2).

The O_DIRECT flag was introduced in SGI IRIX, where it has alignment restrictions similar to those of Linux 2.4. IRIX has also a fcntl(2) call to query appropriate alignments, and sizes. FreeBSD 4.x introduced a flag of the same name, but without alignment restrictions.

O_DIRECT support was added in Linux 2.4.10. Older Linux kernels simply ignore this flag. Some filesystems may not implement the flag, in which case open() fails with the error EINVAL if it is used.

Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the filesystem correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone. Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files.

The behavior of O_DIRECT with NFS will differ from local filesystems. Older kernels, or kernels configured in certain ways, may not support this combination. The NFS protocol does not support passing the flag to the server, so O_DIRECT I/O will bypass the page cache only on the client; the server may still cache the I/O. The client asks the server to make the I/O synchronous to preserve the synchronous semantics of O_DIRECT. Some servers will perform poorly under these circumstances, especially if the I/O size is small. Some servers may also be configured to lie to clients about the I/O having reached stable storage; this will avoid the performance penalty at some risk to data integrity in the event of server power failure. The Linux NFS client places no alignment restrictions on O_DIRECT I/O.

In summary, O_DIRECT is a potentially powerful tool that should be used with caution. It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default.
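Given the man page's advice, here's a rough sketch of how one might query those alignment restrictions with statx(2) and STATX_DIOALIGN. This is my own illustration, not anything from the question: it needs Linux 6.1+ and reasonably recent glibc/kernel headers, support still varies by filesystem, and a reported alignment of zero means the filesystem didn't provide any information.

#define _GNU_SOURCE
#include <fcntl.h>      /* AT_FDCWD */
#include <stdio.h>
#include <sys/stat.h>   /* statx(), struct statx, STATX_DIOALIGN */

int main(int argc, char **argv)
{
    struct statx stx;

    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    if (statx(AT_FDCWD, argv[1], 0, STATX_DIOALIGN, &stx) != 0) {
        perror("statx");
        return 1;
    }

    if (stx.stx_dio_mem_align == 0) {
        /* zero means no O_DIRECT alignment information was reported */
        puts("STATX_DIOALIGN not supported for this file");
    } else {
        printf("required memory buffer alignment: %u bytes\n",
               stx.stx_dio_mem_align);
        printf("required offset/length alignment: %u bytes\n",
               stx.stx_dio_offset_align);
    }
    return 0;
}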

Reading

Let's now consult the man page for read(2):

On success, the number of bytes read is returned (zero indicates end of file), and the file position is advanced by this number. It is not an error if this number is smaller than the number of bytes requested; this may happen for example because fewer bytes are actually available right now (maybe because we were close to end-of-file, or because we are reading from a pipe, or from a terminal), or because read() was interrupted by a signal. See also NOTES.

and the notes section:

The types size_t and ssize_t are, respectively, unsigned and signed integer data types specified by POSIX.1.

On Linux, read() (and similar system calls) will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.)

On NFS filesystems, reading small amounts of data will update the timestamp only the first time, subsequent calls may not do so. This is caused by client side attribute caching, because most if not all NFS clients leave st_atime (last file access time) updates to the server, and client side reads satisfied from the client's cache will not cause st_atime updates on the server as there are no server-side reads. UNIX semantics can be obtained by disabling client-side attribute caching, but in most situations this will substantially increase server load and decrease performance.
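The key point in the excerpt above is that a short count is not an error. That's why careful code typically wraps read(2) in a loop; a small sketch of such a helper (my own, not part of the question) might look like this:

#include <errno.h>
#include <unistd.h>

/* Keep calling read(2) until `count` bytes have arrived, EOF is hit, or an
 * error occurs. Returns the number of bytes actually read, or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t done = 0;

    while (done < count) {
        ssize_t n = read(fd, (char *)buf + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: just retry */
            return -1;
        }
        if (n == 0)
            break;          /* end of file */
        done += n;
    }
    return (ssize_t)done;
}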

Now let's read the actual source code for read(2).[3] As of Linux 6.1:

ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
{
    struct fd f = fdget_pos(fd);
    ssize_t ret = -EBADF;

    if (f.file) {
        loff_t pos, *ppos = file_ppos(f.file);
        if (ppos) {
            pos = *ppos;
            ppos = &pos;
        }
        ret = vfs_read(f.file, buf, count, ppos);
        if (ret >= 0 && ppos)
            f.file->f_pos = pos;
        fdput_pos(f);
    }
    return ret;
}

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    return ksys_read(fd, buf, count);
}

-- fs/read_write.c:602-624

vfs_read stands for "virtual filesystem read" and essentially dispatches the actual read operation to the underlying filesystem. It does this by calling a function pointer held in the file_operations table associated with the provided struct file parameter, which has the following signature:

    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);

-- include/linux/fs.h:2106
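To make the dispatch a bit more concrete: each filesystem fills in its own struct file_operations, and vfs_read() calls ->read() if it's present, otherwise it falls back to the ->read_iter() path. A simplified, purely illustrative table (not lifted from any real filesystem) might look like:

#include <linux/fs.h>
#include <linux/module.h>

/* Illustrative only: a filesystem advertises its read entry points via
 * struct file_operations. Most modern filesystems only implement the
 * iterator path and lean on the generic helpers. */
static const struct file_operations examplefs_file_operations = {
    .owner     = THIS_MODULE,
    .llseek    = generic_file_llseek,
    .read_iter = generic_file_read_iter,
    .mmap      = generic_file_mmap,
    .open      = generic_file_open,
};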

At this point during my search I realised that we'd need to make assumptions about the actual filesystem (e.g., ext4, XFS, btrfs, etc.) and I didn't think that this was directly relevant to the interview question itself.

Actually Answering the Question

At this stage, we can rule out different subsets of the domain for the return value of read(2):

- A return value of -1 indicates an error, and we're told to assume the read never fails and is never interrupted.
- 0 and 4096 are excluded by the question itself (0 would indicate end-of-file, and 4096 is the full requested count).
- Anything greater than 4096 is impossible, as read(2) never transfers more bytes than were requested.

Thus, the answer must be within the $\left(0, 4096\right)$ interval.

The crux of the problem is that read(2) deals with logical files whereas the filesystem internals and underlying block device ultimately deal with physical sectors. It's the responsibility of the kernel to shield our naive userspace program from the horrors of how hardware actually works. So if the mapping of logical files to physical sectors should violate certain assumptions about their relative sizes, we might get an "unexpected" result from read(2).

Tanel Poder provides a very succinct demonstration of this in his answer.

Physical allocation vs. logical file size... sparse files, truncated files, files with "punched holes" in them, etc... pic.twitter.com/0MxTt8znvw

— Tanel Poder 🇺🇦 (@TanelPoder) August 6, 2025

Here's a diagrammatic explanation:

Figure 1: One instance that can induce a "short" read under O_DIRECT assumptions.

Conclusion

So my final answer is essentially Tanel's: if the file occupies fewer physical sectors than its logical size would suggest, then we may get a short read on the final call to read(2), as EOF (a property of logical files) would occur prior to the end of the final physical sector.
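As a sanity check for that condition, here's a small sketch of my own (with a path supplied on the command line) that compares a file's logical size (st_size) against its physical allocation (st_blocks, which is always counted in 512-byte units regardless of the device's sector size):

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat st;

    if (argc < 2 || stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }

    /* st_blocks is reported in 512-byte units, independent of block size */
    long long physical = (long long)st.st_blocks * 512;

    printf("logical size : %lld bytes\n", (long long)st.st_size);
    printf("allocated    : %lld bytes\n", physical);

    if (physical < (long long)st.st_size)
        puts("sparse/hole-punched: fewer physical sectors than the size suggests");

    return 0;
}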

Bibliography

  1. P. L. aka phs, “How to use O_DIRECT in Linux?,” Stack Overflow, Mar. 2018. [Online]. Available: https://stackoverflow.com/a/49462406/8028639. [Accessed: Aug. 6, 2025].
  2. N. J. Leonoff, “O_DIRECT,” The Yarchive. [Online]. Available: https://yarchive.net/comp/linux/o_direct.html. [Accessed: Aug. 6, 2025].
  3. chb, “Why does Linux limit buffered file reads to ~128kB/sec?,” Server Fault, Jul. 2023. [Online]. Available: https://serverfault.com/a/1141440. [Accessed: Aug. 6, 2025].
  4. Chris Down, “Use of O_DIRECT on Linux,” Unix & Linux Stack Exchange. [Online]. Available: https://unix.stackexchange.com/questions/6467/use-of-o-direct-on-linux. [Accessed: Aug. 6, 2025].
  5. “du vs df in XFS,” Oracle Blogs, Jun. 20, 2025. [Online]. Available: https://blogs.oracle.com/linux/post/du-vs-df-in-xfs. [Accessed: Aug. 8, 2025].
[1] It's 2025 and obviously there are block devices that are not spinning magnetic platters, but just humour me: it's not as if reading from one would somehow be faster than reading from memory (or even CPU cache!).

[2] Modern versions of the kernel use a type called struct folio rather than struct page for this.

[3] When you call read from your own code, you're technically hitting glibc first, but all glibc does is wrap the direct syscall. Feel free to read the glibc source if you're a masochist!