Steven Siloti
2016-04-13 03:11:53 UTC
See https://github.com/arvidn/libtorrent/wiki/memory-mapped-I-O for
background.
So here are my thoughts on a grand mmapped IO refactoring, starting with
a revised threading model:
1. Have a single thread dedicated to write operations. This would
include writes, flushes, and file modifications[1] (delete, rename,
move). With buffered IO there's no point in having multiple threads for
writing because they're just creating dirty pages for the OS to flush at
its leisure. Operating systems signal that the page cache is under
pressure by blocking the threads that are dirtying pages, so we must
detect this blocking to know when to apply back-pressure to the network.
Having a single writer thread makes this task simpler. What we should
not do is try to control writeback ourselves by issuing explicit
syncs[2]. The OS is in a much better position than libtorrent to
determine the optimal writeback strategy; we should rely on the
extensive work OS developers have done in this area. We also want write
jobs separated from reads so that the former do not block the latter.
2. Make the reader thread pool scalable. A new read job should be
handled in a new thread if there are none available and the reader
thread count is below some threshold, say 16 by default. The idea is
that the maximum number of reader threads should correspond to a
reasonable upper bound on the number of commands to have queued to the
underlying storage device. For a typical desktop/router/NAS this is
somewhere in the range of 16-32; servers are probably going to want more.
Of course idle threads should be killed after some timeout.
3. Eliminate dedicated hasher threads. With a scalable reader pool I
don't think they are necessary.
Other thoughts:
I'm not convinced using mincore() to detect cache hits is a big enough
win to justify the added complexity. A big queue of blocked reads means
the cache hit rate is probably low, so the bigger the potential gain the
less chance there is of actually realizing it. If read jobs are
unconditionally queued to the thread pool then the whole concept of
cache hits can be more-or-less eliminated from libtorrent[3]. A cache
hit simply becomes the happy case where all required mmapped pages
happen to already be in the page cache. It would also mean all buffer
stitching would be done in the disk threads. Plus, mincore() is
inherently racy: if the pages in question get evicted, the network
thread will block on re-reading them. While this wouldn't happen often,
the impact of even a small rate of occurrence could be significant.
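For reference, this is the kind of residency check being argued against;
a Linux-specific sketch (on some BSDs mincore() takes a char* vector).
Note that even a true result can be stale by the time the caller copies
from the buffer, which is exactly the race described above:

```cpp
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>
#include <vector>

// returns true if every page backing [addr, addr+len) is currently resident
bool pages_resident(void const* addr, std::size_t len)
{
    long const page = sysconf(_SC_PAGESIZE);
    std::uintptr_t const base
        = reinterpret_cast<std::uintptr_t>(addr) & ~std::uintptr_t(page - 1);
    std::size_t const span = reinterpret_cast<std::uintptr_t>(addr) + len - base;
    std::vector<unsigned char> vec((span + page - 1) / page);
    if (mincore(reinterpret_cast<void*>(base), span, vec.data()) != 0)
        return false; // e.g. addr is not part of any mapping
    for (unsigned char const v : vec)
        if (!(v & 1)) return false; // low bit set means resident
    // racy: the kernel is free to evict any of these pages immediately
    return true;
}
```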
I certainly wouldn't bother with zero copy for the initial
implementation. As is said on the wiki, attempting zero copy for
receiving data is likely not worth it due to the extra syscall overhead,
and zero copy in general only works for unencrypted connections which is
not the typical mode of operation these days.
One challenge that hasn't been mentioned is that on 32-bit systems
libtorrent would need to actively manage mappings due to limited address
space. This needs some empirical work to determine the optimum mapping
size which minimizes churn.
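Managing mappings on 32-bit systems would amount to something like the
following window calculation, where window_size is the tunable that
needs the empirical work mentioned above (the function name and
signature are made up for illustration):

```cpp
#include <cstdint>
#include <utility>

// returns {aligned file offset, mapping length} for a window covering
// the requested range [offset, offset + len), clamped to the file size
std::pair<std::int64_t, std::int64_t>
mapping_window(std::int64_t offset, std::int64_t len,
    std::int64_t file_size, std::int64_t window_size)
{
    // round the start down to a window boundary
    std::int64_t const start = offset - (offset % window_size);
    // round the end up to the next window boundary, clamped to the file
    std::int64_t end = ((offset + len + window_size - 1) / window_size)
        * window_size;
    if (end > file_size) end = file_size;
    return {start, end - start};
}
```

A smaller window_size wastes less address space per mapping but forces
more remapping (churn) as requests move around the file; a larger one
does the opposite, hence the need to measure.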
[1] I don't have a strong opinion about putting these on the writer
thread or the reader threads. They usually have to be serialized anyway,
so putting them on the writer thread seems natural.
[2] Syncing may still be necessary for consistency, e.g. when saving
resume data. It just shouldn't be used to drive flow-control.
[3] Ideally I'd like to see libtorrent drop explicit caching entirely,
such that all *_cache_* settings would be deprecated.