Recently, the CEO of TigerBeetle, Joran Dirk Greef, posted an interview question for a (presumably hypothetical) DBMS engineering role. This nerdsniped me and led me on a bit of a wild goose chase to find a satisfactory answer. In the end, Tanel Poder answered the question with a fairly succinct demonstration. In an effort to do more, and also to share my thought process on the problem, this post will act as both a solution and an exploration of how filesystems work and some of the Linux kernel source code.
DBMS Interview Challenge

A system performs Direct I/O:
- using O_DIRECT,
- aligning to 4096 byte Advanced Format sector size, and
- reading exactly 4096 bytes at a time.

What numbers other than 0 or 4096 will read(2) return, as the number of bytes read? Why?

— Joran Dirk Greef (@jorandirkgreef) August 5, 2025
Firstly, let's unpack the constraints provided to us in the question.
The first, "using O_DIRECT", refers to a flag provided to open(2) and suggests that we'll be performing direct I/O (there's a big caveat here; more on this shortly). The second, "aligning to 4096 byte Advanced Format sector size", I am interpreting as the memory alignment of the userland buffers that we'll be performing I/O to and from. Finally, from the last constraint, we know that we'll only ever be reading data in 4096 byte increments. Joran made several follow-up posts which provide some additional facts:
(Assume all writes are also aligned to 4096 bytes, and that the file size is a multiple of 4096)
— Joran Dirk Greef (@jorandirkgreef) August 5, 2025
Ah! You can assume the file size is aligned to 4096 bytes as well (but good clarification!).
— Joran Dirk Greef (@jorandirkgreef) August 5, 2025
Assume the read is never interrupted, and there's no error, are you sure you'll only get 0 or 4096? Thinking about the hardware some more, what's plausible to expect?
— Joran Dirk Greef (@jorandirkgreef) August 5, 2025
So we know that the file size is a multiple of 4096 bytes (i.e., we'll never request to read a partial block!) and that we will never perform a write to an address that is not 4096-byte-aligned.
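To make the setup concrete, here's a minimal sketch of a read that satisfies these constraints (the file name and error handling are my own, not part of the question):

#define _GNU_SOURCE /* O_DIRECT is Linux-specific */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* "data.bin" is a hypothetical file whose size is a multiple of 4096. */
    int fd = open("data.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* O_DIRECT requires the userspace buffer itself to be aligned;
     * posix_memalign hands back a 4096-byte-aligned allocation. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    ssize_t n = read(fd, buf, 4096); /* read exactly one block */
    printf("read(2) returned %zd\n", n);

    free(buf);
    close(fd);
    return 0;
}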
During an ordinary read, when data is requested from a block device, the kernel will first attempt to serve the request from an in-memory cache (called the page cache, somewhat confusingly). This potentially avoids an expensive seek on the physical drive1. Under this buffered I/O regime, whenever a read misses the page cache the kernel will perform read-ahead: it will retrieve more data than requested from the disk in order to populate the cache in an attempt to avoid future misses. This page cache structure is a mapping from $\left(\text{inode}, \text{offset}\right)$ tuples to $\text{page}$2. The relevant part of the kernel source tree is mm, with readahead.c and filemap.c being particularly notable.
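As an aside, buffered readers can nudge the page cache and read-ahead machinery via posix_fadvise(2). A small sketch of the relevant hints, which the kernel is free to ignore:

#include <fcntl.h>

/* Hint the expected access pattern for fd; these are advisory only,
 * and each hint below is an alternative, not a sequence to apply. */
static void tune_page_cache(int fd)
{
    /* Encourage aggressive read-ahead for sequential scans... */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* ...or curb read-ahead for random access patterns... */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    /* ...or ask for this file's cached pages to be dropped. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}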
With O_DIRECT, this caching is (ideally) skipped. This means that if you request data from disk, the kernel won't bother consulting the page cache, nor will it perform read-ahead. The caveat is that O_DIRECT is essentially meaningless in that it provides zero guarantees of anything. In fact, any sane filesystem will resort to regular buffered I/O when it cannot reliably service the direct access request.
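Since support varies, userspace sometimes has to arrange its own fallback. A hedged sketch (the fallback policy here is my own, not something the kernel does for you at open(2) time):

#define _GNU_SOURCE /* O_DIRECT is Linux-specific */
#include <errno.h>
#include <fcntl.h>

/* Try direct I/O first; if the filesystem rejects the flag outright
 * with EINVAL, reopen for ordinary buffered I/O instead. */
static int open_read_maybe_direct(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0 && errno == EINVAL)
        fd = open(path, O_RDONLY);
    return fd;
}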
O_DIRECT
For those who don't know, O_DIRECT is a flag that can be passed to the open(2) system call on Linux and several other Unix-like systems (notably, it isn't actually specified by POSIX). From the man page for open(2):
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.
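In other words, direct and synchronous are orthogonal: if we also want durability guarantees, the man page says to pair the two flags. A minimal sketch (the path and function name are hypothetical):

#define _GNU_SOURCE /* O_DIRECT is Linux-specific */
#include <fcntl.h>

/* Open for writes that are both direct (bypassing the page cache)
 * and synchronous (data and metadata reach stable storage before
 * write(2) returns). */
static int open_durable_direct(const char *path)
{
    return open(path, O_WRONLY | O_DIRECT | O_SYNC);
}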
O_DIRECT requests that kernelland I/O caching be skipped. Looking at the relevant subsection in the notes (emphasis my own):
The O_DIRECT flag may impose alignment restrictions on the length and address of user-space buffers and the file offset of I/Os. In Linux alignment restrictions vary by filesystem and kernel version and might be absent entirely. The handling of misaligned O_DIRECT I/Os also varies; they can either fail with EINVAL or fall back to buffered I/O.

Since Linux 6.1, O_DIRECT support and alignment restrictions for a file can be queried using statx(2), using the STATX_DIOALIGN flag. Support for STATX_DIOALIGN varies by filesystem; see statx(2).

Some filesystems provide their own interfaces for querying O_DIRECT alignment restrictions, for example the XFS_IOC_DIOINFO operation in xfsctl(3). STATX_DIOALIGN should be used instead when it is available.

If none of the above is available, then direct I/O support and alignment restrictions can only be assumed from known characteristics of the filesystem, the individual file, the underlying storage device(s), and the kernel version. In Linux 2.4, most filesystems based on block devices require that the file offset and the length and memory address of all I/O segments be multiples of the filesystem block size (typically 4096 bytes). In Linux 2.6.0, this was relaxed to the logical block size of the block device (typically 512 bytes). A block device's logical block size can be determined using the ioctl(2) BLKSSZGET operation or from the shell using the command:

blockdev --getss

O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes. This restriction does not apply when the memory buffer for the O_DIRECT I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag. Nor does this restriction apply when the memory buffer has been advised as MADV_DONTFORK with madvise(2), ensuring that it will not be available to the child after fork(2).

The O_DIRECT flag was introduced in SGI IRIX, where it has alignment restrictions similar to those of Linux 2.4. IRIX has also a fcntl(2) call to query appropriate alignments, and sizes. FreeBSD 4.x introduced a flag of the same name, but without alignment restrictions.

O_DIRECT support was added in Linux 2.4.10. Older Linux kernels simply ignore this flag. Some filesystems may not implement the flag, in which case open() fails with the error EINVAL if it is used.

Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the filesystem correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone. Likewise, applications should avoid mixing mmap(2) of files with direct I/O to the same files.

The behavior of O_DIRECT with NFS will differ from local filesystems. Older kernels, or kernels configured in certain ways, may not support this combination. The NFS protocol does not support passing the flag to the server, so O_DIRECT I/O will bypass the page cache only on the client; the server may still cache the I/O. The client asks the server to make the I/O synchronous to preserve the synchronous semantics of O_DIRECT. Some servers will perform poorly under these circumstances, especially if the I/O size is small. Some servers may also be configured to lie to clients about the I/O having reached stable storage; this will avoid the performance penalty at some risk to data integrity in the event of server power failure. The Linux NFS client places no alignment restrictions on O_DIRECT I/O.

In summary, O_DIRECT is a potentially powerful tool that should be used with caution. It is recommended that applications treat use of O_DIRECT as a performance option which is disabled by default.
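Since the man page mentions it, here's a minimal sketch of querying those alignment restrictions with statx(2) (this assumes Linux 6.1+ and sufficiently recent glibc headers for STATX_DIOALIGN):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return 1;
    }

    /* Ask the kernel to fill in the direct I/O alignment fields. */
    struct statx stx;
    if (statx(AT_FDCWD, argv[1], 0, STATX_DIOALIGN, &stx) != 0) {
        perror("statx");
        return 1;
    }

    if (stx.stx_mask & STATX_DIOALIGN) {
        printf("memory alignment: %u bytes\n", stx.stx_dio_mem_align);
        printf("offset alignment: %u bytes\n", stx.stx_dio_offset_align);
    } else {
        puts("this filesystem does not report STATX_DIOALIGN");
    }
    return 0;
}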
Let's now consult the man page for read(2):
On success, the number of bytes read is returned (zero indicates end of file), and the file position is advanced by this number. It is not an error if this number is smaller than the number of bytes requested; this may happen for example because fewer bytes are actually available right now (maybe because we were close to end-of-file, or because we are reading from a pipe, or from a terminal), or because read() was interrupted by a signal. See also NOTES.
and the notes section:
The types size_t and ssize_t are, respectively, unsigned and signed integer data types specified by POSIX.1.

On Linux, read() (and similar system calls) will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.)

On NFS filesystems, reading small amounts of data will update the timestamp only the first time, subsequent calls may not do so. This is caused by client side attribute caching, because most if not all NFS clients leave st_atime (last file access time) updates to the server, and client side reads satisfied from the client's cache will not cause st_atime updates on the server as there are no server-side reads. UNIX semantics can be obtained by disabling client-side attribute caching, but in most situations this will substantially increase server load and decrease performance.
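This short-read caveat is why robust code traditionally wraps read(2) in a retry loop. A minimal sketch; note that under our O_DIRECT constraints a short read is extra awkward, since the resumed read's buffer address and file offset may no longer satisfy the alignment requirements:

#include <unistd.h>

/* Keep calling read(2) until we have count bytes, hit EOF, or fail.
 * Returns the number of bytes read, or -1 on error (check errno). */
static ssize_t read_full(int fd, char *buf, size_t count)
{
    size_t total = 0;
    while (total < count) {
        ssize_t n = read(fd, buf + total, count - total);
        if (n < 0)
            return -1; /* error; a real caller might retry on EINTR */
        if (n == 0)
            break;     /* EOF: return what we managed to read */
        total += n;
    }
    return (ssize_t)total;
}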
Now let's read the actual source code for read(2)3. As of Linux 6.1:
ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
{
        struct fd f = fdget_pos(fd);
        ssize_t ret = -EBADF;

        if (f.file) {
                loff_t pos, *ppos = file_ppos(f.file);
                if (ppos) {
                        pos = *ppos;
                        ppos = &pos;
                }
                ret = vfs_read(f.file, buf, count, ppos);
                if (ret >= 0 && ppos)
                        f.file->f_pos = pos;
                fdput_pos(f);
        }
        return ret;
}

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
        return ksys_read(fd, buf, count);
}
vfs_read stands for "virtual filesystem read" and essentially dispatches the actual read operation to the underlying filesystem. It does this by calling a function pointer associated with the provided struct file parameter, whose signature looks like this:
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
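For a rough picture of how a filesystem plugs into this dispatch, here's a sketch of a hypothetical filesystem's file_operations table (the myfs_ name is made up; the generic_file_* helpers are real kernel symbols). Modern filesystems generally provide .read_iter rather than .read, and vfs_read falls back to the iterator path when .read is absent:

#include <linux/fs.h>
#include <linux/module.h>

/* Hypothetical filesystem: wire the VFS entry points up to the
 * kernel's generic implementations. generic_file_read_iter handles
 * both buffered and O_DIRECT reads. */
static const struct file_operations myfs_file_operations = {
        .owner      = THIS_MODULE,
        .llseek     = generic_file_llseek,
        .read_iter  = generic_file_read_iter,
        .write_iter = generic_file_write_iter,
        .mmap       = generic_file_mmap,
        .open       = generic_file_open,
};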
At this point during my search I realised that we'd need to make assumptions about the actual filesystem (ext4, XFS, btrfs, and so on), and I didn't think that this was directly relevant to the interview question itself.
At this stage, we can rule out different subsets of the domain for the return value of read(2):

- -1, as we're allowed to assume no error conditions during reads
- anything greater than 4096, as read(2) never returns more than the requested count
- 0 and 4096 themselves, as the question explicitly excludes them

Thus, the answer must be within the $\left(0, 4096\right)$ interval.
The crux of the problem is that read(2) deals with logical files, whereas the filesystem internals and underlying block device ultimately deal with physical sectors. It's the responsibility of the kernel to shield our naive userspace program from the horrors of how hardware actually works. So if the mapping of logical files to physical sectors violates certain assumptions about their relative sizes, we might get an "unexpected" result from read(2).
Tanel Poder provides a very succinct demonstration of this in his answer.
Physical allocation vs. logical file size... sparse files, truncated files, files with "punched holes" in them, etc... pic.twitter.com/0MxTt8znvw
— Tanel Poder 🇺🇦 (@TanelPoder) August 6, 2025
Here's a diagrammatic explanation:

[Diagram: a file whose physical allocation is smaller than its logical size, violating our O_DIRECT assumptions.]

So my final answer is essentially that of Tanel's: if the file uses fewer physical sectors than its logical size would suggest, then we may get a short read on the final call to read(2), as EOF (a property of logical files) would occur prior to the end of the final physical sector.
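To see the logical/physical split for yourself, here's a small sketch in the spirit of Tanel's demonstration (the file name is my own; st_blocks is always counted in 512-byte units):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("sparse.dat", O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Extend the logical size to 1 MiB without writing any data;
     * the filesystem allocates no physical blocks for the hole. */
    if (ftruncate(fd, 1 << 20) != 0) {
        perror("ftruncate");
        return 1;
    }

    struct stat st;
    if (fstat(fd, &st) != 0) {
        perror("fstat");
        return 1;
    }
    printf("logical size: %lld bytes\n", (long long)st.st_size);
    printf("allocated:    %lld bytes\n", (long long)st.st_blocks * 512);

    close(fd);
    return 0;
}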
1: It's 2025 and obviously there are block devices that are not spinning magnetic platters, but humour me: it's not as if reading from them would somehow be faster than reading from memory (or even CPU cache!).
2: Modern versions of the kernel use a type called struct folio rather than struct page for this.
3: When you call read from your own code, you're technically hitting glibc first, but all glibc does is wrap the direct syscall. Feel free to read the glibc source if you're a masochist!