mergerfs/mkdocs/docs/pages/documentation/caching.md

10 KiB

CACHING

page caching

https://en.wikipedia.org/wiki/Page_cache

  • cache.files=off: Disables page caching. Underlying files cached, mergerfs files are not.
  • cache.files=partial: Enables page caching. Underlying files cached, mergerfs files cached while open.
  • cache.files=full: Enables page caching. Underlying files cached, mergerfs files cached across opens.
  • cache.files=auto-full: Enables page caching. Underlying files cached, mergerfs files cached across opens if mtime and size are unchanged since previous open.
  • cache.files=libfuse: follow traditional libfuse direct_io, kernel_cache, and auto_cache arguments.
  • cache.files=per-process: Enable page caching (equivalent to cache.files=partial) only for processes whose 'comm' name matches one of the values defined in cache.files.process-names. If the name does not match the file open is equivalent to cache.files=off.

FUSE, which mergerfs uses, offers a number of page caching modes. mergerfs tries to simplify their use via the cache.files option. It can and should replace usage of direct_io, kernel_cache, and auto_cache.

Due to mergerfs using FUSE and therefore being a userland process proxying existing filesystems the kernel will double cache the content being read and written through mergerfs. Once from the underlying filesystem and once from mergerfs (it sees them as two separate entities). Using cache.files=off will keep the double caching from happening by disabling caching of mergerfs but this has the side effect that all read and write calls will be passed to mergerfs which may be slower than enabling caching, you lose shared mmap support which can affect apps such as rtorrent, and no read-ahead will take place. The kernel will still cache the underlying filesystem data but that only helps so much given mergerfs will still process all requests.

If you do enable file page caching, cache.files=partial|full|auto-full, you should also enable dropcacheonclose which will cause mergerfs to instruct the kernel to flush the underlying file's page cache when the file is closed. This behavior is the same as the rsync fadvise / drop cache patch and Feh's nocache project.

If most files are read once through and closed (like media) it is best to enable dropcacheonclose regardless of caching mode in order to minimize buffer bloat.

It is difficult to balance memory usage, cache bloat & duplication, and performance. Ideally, mergerfs would be able to disable caching for the files it reads/writes but allow page caching for itself. That would limit the FUSE overhead. However, there isn't a good way to achieve this. It would need to open all files with O_DIRECT which places limitations on what the underlying filesystems would be supported and complicates the code.

kernel documentation: https://www.kernel.org/doc/Documentation/filesystems/fuse-io.txt

entry & attribute caching

Given the relatively high cost of FUSE due to the kernel <-> userspace round trips there are kernel side caches for file entries and attributes. The entry cache limits the lookup calls to mergerfs which ask if a file exists. The attribute cache limits the need to make getattr calls to mergerfs which provide file attributes (mode, size, type, etc.). As with the page cache these should not be used if the underlying filesystems are being manipulated at the same time as it could lead to odd behavior or data corruption. The options for setting these are cache.entry and cache.negative_entry for the entry cache and cache.attr for the attributes cache. cache.negative_entry refers to the timeout for negative responses to lookups (non-existent files).

writeback caching

When cache.files is enabled the default is for it to perform writethrough caching. This behavior won't help improve performance as each write still goes one for one through the filesystem. By enabling the FUSE writeback cache small writes may be aggregated by the kernel and then sent to mergerfs as one larger request. This can greatly improve the throughput for apps which write to files inefficiently. The amount the kernel can aggregate is limited by the size of a FUSE message. Read the fuse_msg_size section for more details.

There is a small side effect as a result of enabling writeback caching. Underlying files won't ever be opened with O_APPEND or O_WRONLY. The former because the kernel then manages append mode and the latter because the kernel may request file data from mergerfs to populate the write cache. The O_APPEND change means that if a file is changed outside of mergerfs it could lead to corruption as the kernel won't know the end of the file has changed. That said any time you use caching you should keep from using the same file outside of mergerfs at the same time.

Note that if an application is properly sizing writes then writeback caching will have little or no effect. It will only help with writes of sizes below the FUSE message size (128K on older kernels, 1M on newer).

statfs caching

Of the syscalls used by mergerfs in policies the statfs / statvfs call is perhaps the most expensive. It's used to find out the available space of a filesystem and whether it is mounted read-only. Depending on the setup and usage pattern these queries can be relatively costly. When cache.statfs is enabled all calls to statfs by a policy will be cached for the number of seconds its set to.

Example: If the create policy is mfs and the timeout is 60 then for that 60 seconds the same filesystem will be returned as the target for creates because the available space won't be updated for that time.

As of version 4.20 Linux supports symlink caching. Significant performance increases can be had in workloads which use a lot of symlinks. Setting cache.symlinks=true will result in requesting symlink caching from the kernel only if supported. As a result it's safe to enable it on systems prior to 4.20. That said it is disabled by default for now. You can see if caching is enabled by querying the xattr user.mergerfs.cache.symlinks but given it must be requested at startup you can not change it at runtime.

readdir caching

As of version 4.20 Linux supports readdir caching. This can have a significant impact on directory traversal. Especially when combined with entry (cache.entry) and attribute (cache.attr) caching. Setting cache.readdir=true will result in requesting readdir caching from the kernel on each opendir. If the kernel doesn't support readdir caching setting the option to true has no effect. This option is configurable at runtime via xattr user.mergerfs.cache.readdir.

tiered caching

Some storage technologies support what some call "tiered" caching. The placing of usually smaller, faster storage as a transparent cache to larger, slower storage. NVMe, SSD, Optane in front of traditional HDDs for instance.

mergerfs does not natively support any sort of tiered caching. Most users have no use for such a feature and its inclusion would complicate the code. However, there are a few situations where a cache filesystem could help with a typical mergerfs setup.

  1. Fast network, slow filesystems, many readers: You've a 10+Gbps network with many readers and your regular filesystems can't keep up.
  2. Fast network, slow filesystems, small'ish bursty writes: You have a 10+Gbps network and wish to transfer amounts of data less than your cache filesystem but wish to do so quickly.

With #1 it's arguable if you should be using mergerfs at all. RAID would probably be the better solution. If you're going to use mergerfs there are other tactics that may help: spreading the data across filesystems (see the mergerfs.dup tool) and setting func.open=rand, using symlinkify, or using dm-cache or a similar technology to add tiered cache to the underlying device.

With #2 one could use dm-cache as well but there is another solution which requires only mergerfs and a cronjob.

  1. Create 2 mergerfs pools. One which includes just the slow devices and one which has both the fast devices (SSD,NVME,etc.) and slow devices.
  2. The 'cache' pool should have the cache filesystems listed first.
  3. The best create policies to use for the 'cache' pool would probably be ff, epff, lfs, or eplfs. The latter two under the assumption that the cache filesystem(s) are far smaller than the backing filesystems. If using path preserving policies remember that you'll need to manually create the core directories of those paths you wish to be cached. Be sure the permissions are in sync. Use mergerfs.fsck to check / correct them. You could also set the slow filesystems mode to NC though that'd mean if the cache filesystems fill you'd get "out of space" errors.
  4. Enable moveonenospc and set minfreespace appropriately. To make sure there is enough room on the "slow" pool you might want to set minfreespace to at least as large as the size of the largest cache filesystem if not larger. This way in the worst case the whole of the cache filesystem(s) can be moved to the other drives.
  5. Set your programs to use the cache pool.
  6. Save one of the below scripts or create you're own.
  7. Use cron (as root) to schedule the command at whatever frequency is appropriate for your workflow.
time based expiring

Move files from cache to backing pool based only on the last time the file was accessed. Replace -atime with -amin if you want minutes rather than days. May want to use the fadvise / --drop-cache version of rsync or run rsync with the tool "nocache".

NOTE: The arguments to these scripts include the cache filesystem itself. Not the pool with the cache filesystem. You could have data loss if the source is the cache pool.

mergerfs.time-based-mover

percentage full expiring

Move the oldest file from the cache to the backing pool. Continue till below percentage threshold.

NOTE: The arguments to these scripts include the cache filesystem itself. Not the pool with the cache filesystem. You could have data loss if the source is the cache pool.

mergerfs.percent-full-mover