A brief introduction to the SMGR and FD layers in Postgres:

  • src/backend/storage/smgr/smgr.c
  • src/backend/storage/file/fd.c

The SMGR layer (Storage Manager) provides the upper layers with a relation-oriented disk storage interface, while the FD layer (File Descriptor) encapsulates the actual opening, closing, caching, and management of files, interacting directly with OS system calls.

bufferpool──►[smgr]───┐                     
                      │                     
slru──────────────────┼────►[fd]───►syscall 
                      │                     
wal ──────────────────┘                     

The SMGR interface

The SMGR interface is defined by smgrsw, an array of f_smgr structs (each a set of function pointers). The Postgres tree currently contains only one implementation, md (magnetic disk, src/backend/storage/smgr/md.c):

static const f_smgr smgrsw[] = {
	/* magnetic disk */
	{
		.smgr_init = mdinit,
		.smgr_shutdown = NULL,
		...
		.smgr_read = mdread,
		.smgr_write = mdwrite,
		...
		.smgr_immedsync = mdimmedsync,
	}
};

These interface functions are easy to understand; each does exactly what its name says. Note that they all operate on an SMgrRelation, the structure that gathers the file handles backing a "relation" and manages them.
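
For reference, here is an abridged sketch of SMgrRelationData (fields elided; based on the PG 14-era src/include/storage/smgr.h, so check your version's header for the exact layout):

typedef struct SMgrRelationData
{
	/* rnode is the hash-table lookup key, so it must be first */
	RelFileNodeBackend smgr_rnode;	/* relation physical identifier */
	...
	int			smgr_which;		/* storage manager selector (index into smgrsw[]) */

	/* used by md.c: per-fork arrays of open segment VFDs */
	int			md_num_open_segs[MAX_FORKNUM + 1];
	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
	...
} SMgrRelationData;

typedef SMgrRelationData *SMgrRelation;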

Take the write path as an example:

void smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync)
    -> mdwrite(...);    // same signature as smgrwrite
        -> FileWrite(v->mdfd_vfd, buffer, BLCKSZ, ...);

Its job is direct: write the data in buffer into block blocknum of fork forknum of relation reln. Internally it forwards the request to FileWrite() in the FD layer, which in turn talks to the OS via a syscall to complete the final write.
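
The dispatch itself is a thin indirection through the smgrsw function-pointer table; simplified from smgr.c (PG 14 era, before the readv/writev rework in PG 17):

void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
		  char *buffer, bool skipFsync)
{
	/* pick the storage manager for this relation and forward the call */
	smgrsw[reln->smgr_which].smgr_write(reln, forknum, blocknum,
										buffer, skipFsync);
}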

The FD interface

A more accurate name for the FD layer is VFD (virtual file descriptor): Postgres implements its own file-descriptor cache, using the VFD mechanism to flexibly manage and reuse a limited pool of OS file descriptors. The managed descriptors are in turn wrapped inside the SMgrRelation structure.
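
Each VFD entry records enough information to transparently reopen the file if its kernel descriptor has been reclaimed; an abridged sketch of the Vfd struct in fd.c (PG 14 era):

typedef struct vfd
{
	int			fd;				/* current FD, or VFD_CLOSED if none */
	unsigned short fdstate;		/* bitflags for VFD's state */
	...
	File		lruMoreRecently;	/* doubly linked recency-of-use list */
	File		lruLessRecently;
	char	   *fileName;		/* name of file, or NULL for unused VFD */
	int			fileFlags;		/* open(2) flags for (re)opening the file */
	mode_t		fileMode;		/* mode to pass to open(2) */
} Vfd;

When the process hits its limit on open files, fd.c closes the least recently used kernel descriptor; a later access reopens it on demand using fileName/fileFlags/fileMode.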

Again, take the write path:

int FileWrite(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info)
    -> pg_pwrite();
        -> pwrite();    // syscall

This writes the contents of buffer to the corresponding offset of the file via the pwrite() system call.
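
Internally, FileWrite() first makes sure the VFD still has an open kernel descriptor, then issues the write. A simplified sketch of the fd.c logic (PG 14 era; temp-file accounting and ENOSPC retry handling omitted):

int
FileWrite(File file, char *buffer, int amount, off_t offset,
		  uint32 wait_event_info)
{
	int			returnCode;

	/* reopen the file via the LRU pool if its kernel fd was closed */
	returnCode = FileAccess(file);
	if (returnCode < 0)
		return returnCode;

	pgstat_report_wait_start(wait_event_info);
	returnCode = pg_pwrite(VfdCache[file].fd, buffer, amount, offset);
	pgstat_report_wait_end();

	return returnCode;
}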

The correspondence between a few commonly used FD-layer functions and their underlying system calls:

PathNameOpenFile() → open()
FileRead()         → pread()
FileWrite()        → pwrite()
FileSync()         → fsync()
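
Strictly speaking, fd.c calls pg_pread()/pg_pwrite() rather than the syscalls directly; on platforms that provide pread/pwrite these are plain aliases, otherwise src/port supplies a fallback (a sketch based on the PG 14-era src/include/port.h; check your version):

#ifdef HAVE_PREAD
#define pg_pread pread
#else
extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, off_t offset);
#endif

#ifdef HAVE_PWRITE
#define pg_pwrite pwrite
#else
extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, off_t offset);
#endif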

System calls are usually exposed to applications as glibc wrapper functions, so the names may differ slightly; on 64-bit systems, for example, the wrapper is pwrite64(). If a gdb breakpoint (say, b pwrite) refuses to trigger, check whether the symbol name is the problem.

Call stack examples

The following two examples show the overall call-stack trace:

★ Read path: SELECT

#0  pread (__offset=0, __nbytes=8192, __buf=0xed65230f8f80, __fd=27) at /usr/include/aarch64-linux-gnu/bits/unistd.h:38
#1  FileRead (file=<optimized out>, buffer=buffer@entry=0xed65230f8f80 "", amount=amount@entry=8192, offset=offset@entry=0, wait_event_info=167772175) at fd.c:2053
#2  mdread (reln=<optimized out>, forknum=<optimized out>, blocknum=0, buffer=0xed65230f8f80 "") at md.c:656
#3  smgrread (reln=reln@entry=0xada30a82fd80, forknum=forknum@entry=MAIN_FORKNUM, blocknum=blocknum@entry=0, buffer=buffer@entry=0xed65230f8f80 "") at smgr.c:504
#4  ReadBuffer_common (smgr=0xada30a82fd80, relpersistence=<optimized out>, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=0, mode=mode@entry=RBM_NORMAL, strategy=strategy@entry=0x0, hit=hit@entry=0xffffd2e655d7) at bufmgr.c:1013
#6  heapgetpage (sscan=sscan@entry=0xada30a867b50, page=page@entry=0) at heapam.c:413
#9  table_scan_getnextslot (sscan=<optimized out>, direction=direction@entry=ForwardScanDirection, slot=slot@entry=0xada30a8674b0) at ../../../src/include/access/tableam.h:1045
#13 ExecSeqScan (pstate=<optimized out>) at nodeSeqscan.c:112

The flow is very clear: Scan (Executor) → heap (tableAM) → ReadBuffer (bufferpool) → smgrread (SMGR) → FileRead (FD) → pread (syscall)

★ Write path: INSERT

#0  __libc_pwrite64 (fd=7, buf=buf@entry=0xf10990b22f80, count=count@entry=8192, offset=offset@entry=655360) at ../sysdeps/unix/sysv/linux/pwrite64.c:24
#1  FileWrite (file=<optimized out>, buffer=buffer@entry=0xf10990b22f80 "", amount=amount@entry=8192, offset=offset@entry=655360, wait_event_info=167772171) at fd.c:2135
#2  mdextend (reln=0xab1ec04c0230, forknum=MAIN_FORKNUM, blocknum=80, buffer=0xf10990b22f80 "", skipFsync=false) at md.c:448
#3  smgrextend (reln=reln@entry=0xab1ec04c0230, forknum=forknum@entry=MAIN_FORKNUM, blocknum=blocknum@entry=80, buffer=buffer@entry=0xf10990b22f80 "", skipFsync=skipFsync@entry=false) at smgr.c:465
#4  ReadBuffer_common (smgr=0xab1ec04c0230, relpersistence=<optimized out>, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=80, blockNum@entry=4294967295, mode=mode@entry=RBM_ZERO_AND_LOCK, strategy=strategy@entry=0x0, hit=hit@entry=0xffffe15fa987) at bufmgr.c:988
#7  RelationGetBufferForTuple (relation=relation@entry=0xf109993af760, len=32, otherBuffer=otherBuffer@entry=0, options=options@entry=0, bistate=bistate@entry=0x0, vmbuffer=vmbuffer@entry=0xffffe15faa8c, vmbuffer_other=vmbuffer_other@entry=0x0) at hio.c:624
#8  heap_insert (relation=relation@entry=0xf109993af760, tup=tup@entry=0xab1ec04e5140, cid=cid@entry=0, options=options@entry=0, bistate=bistate@entry=0x0) at heapam.c:2124
#10 table_tuple_insert (bistate=0x0, options=0, cid=<optimized out>, slot=0xab1ec04e5028, rel=0xf109993af760) at ../../../src/include/access/tableam.h:1375
#11 ExecInsert (mtstate=mtstate@entry=0xab1ec04e57a8, resultRelInfo=resultRelInfo@entry=0xab1ec04e59c0, slot=0xab1ec04e5028, planSlot=planSlot@entry=0xab1ec04e6b60, estate=estate@entry=0xab1ec04e5530, canSetTag=<optimized out>) at nodeModifyTable.c:1031
#12 ExecModifyTable (pstate=0xab1ec04e57a8) at nodeModifyTable.c:2719

The flow mirrors the read path, but note one thing: the INSERT itself only triggers a synchronous extend (adding a new block to the file); the actual page data stays in the buffer pool and waits for the bgwriter/checkpointer to write it out and flush it to disk asynchronously.
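
For reference, here is a rough sketch of the asynchronous flush path that later writes those dirty pages, in the same chain notation as above (assuming the PG 14 sources; function names from bufmgr.c):

CheckPointBuffers()    // checkpointer
    -> BufferSync()
        -> SyncOneBuffer()
            -> FlushBuffer()
                -> smgrwrite()    // same SMGR → FD → pwrite() path as above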

Discussion

Postgres's SMGR and FD layers are quite plain: before PG 17 they made essentially no use of the advanced I/O facilities offered by the OS, so there is clearly plenty of room for improvement:

  • PG 17 introduced vectored I/O: multiple I/O requests completed in a single system call
  • PG 18 introduced AIO and DIO, further improving performance
    • The path to using AIO in postgres (PGConf.EU 2023)

      For a few years we (Andres Freund, Thomas Munro, Melanie Plageman, David Rowley) have been working towards using asynchronous IO (AIO) and direct IO in Postgres. The goal of using AIO and DIO in postgres is to improve throughput, decrease latency, reduce jitter, reduce double buffering and more.

    • The main use case today is still data prefetching (prefetch); the community is continuing to optimize this, and it is worth keeping an eye on.

In addition, the SMGR layer can be hooked by extensions, so whenever you need to change the ultimate read/write behavior, this is the place to do it. NeonDB, for example:

static const struct f_smgr neon_smgr =
{
...
#if PG_MAJORVERSION_NUM >= 17
	.smgr_prefetch = neon_prefetch,
	.smgr_readv = neon_readv,   // read the local cache first; on a miss, fetch from the storage service
	.smgr_writev = neon_writev, // do not write any local data
#else
	.smgr_prefetch = neon_prefetch,
	.smgr_read = neon_read,
	.smgr_write = neon_write,
#endif
...

Questions:

  1. We mentioned earlier that writing buffer-pool data to disk is asynchronous; what are the specific scenarios where it happens?

    Hint: page eviction is one example; try tracing it with gdb.

  2. MySQL's InnoDB also has a buffer pool, and the usual advice there is that more memory is better. Is the same true for Postgres?

    If the two differ, what causes the difference?