Discussion:
[libtorrent] memory mapped I/O
Steven Siloti
2016-04-13 03:11:53 UTC
See https://github.com/arvidn/libtorrent/wiki/memory-mapped-I-O for
background.

So here are my thoughts on a grand mmapped IO refactoring, starting with a
revised threading model:

1. Have a single thread dedicated to write operations. This would
include writes, flushes, and file modifications[1] (delete, rename,
move). With buffered IO there's no point in having multiple threads for
writing because they're just creating dirty pages for the OS to flush at
its leisure. Operating systems signal that the page cache is under
pressure by blocking threads that are dirtying pages, so we must detect
this blocking to know when to apply back-pressure to the network. Having
a single writer thread makes this task simpler. What we should not do is
try to control writeback ourselves by issuing explicit syncs[2]. The OS
is in a much better position than libtorrent to determine the optimal
writeback strategy; we should rely on the extensive work OS developers
have done in this area. We also want write jobs to be separated
from reads so that the former do not block the latter.

2. Make the reader thread pool scalable. A new read job should be
handled in a new thread if there are none available and the reader
thread count is below some threshold, say 16 by default. The idea is
that the maximum number of reader threads should correspond to a
reasonable upper bound on the number of commands to have queued to the
underlying storage device. For a typical desktop/router/NAS this is
somewhere in the range of 16-32; servers are probably going to want more.
Of course idle threads should be killed after some timeout (a rough
sketch of such a pool follows after point 3).

3. Eliminate dedicated hasher threads. With a scalable reader pool I
don't think they are necessary.
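
To make point 2 a bit more concrete, here is a rough sketch of such a
scalable reader pool (hypothetical names, not libtorrent's actual API;
lifetime and shutdown handling omitted). A job spawns a new thread only
when no reader is idle and the pool is below its limit, and idle readers
exit after a timeout:

#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

struct read_pool
{
    using read_job = std::function<void()>;

    explicit read_pool(int max_threads = 16) : m_max_threads(max_threads) {}

    void submit(read_job j)
    {
        std::unique_lock<std::mutex> l(m_mutex);
        m_queue.push_back(std::move(j));
        if (m_idle == 0 && m_threads < m_max_threads)
        {
            // no idle reader and room to grow: spawn a new thread
            ++m_threads;
            std::thread([this] { run(); }).detach();
        }
        else
        {
            m_cond.notify_one();
        }
    }

private:
    void run()
    {
        std::unique_lock<std::mutex> l(m_mutex);
        for (;;)
        {
            while (m_queue.empty())
            {
                ++m_idle;
                // idle readers die after a timeout instead of lingering
                bool const have_work = m_cond.wait_for(l
                    , std::chrono::seconds(30)
                    , [this] { return !m_queue.empty(); });
                --m_idle;
                if (!have_work) { --m_threads; return; }
            }
            read_job j = std::move(m_queue.front());
            m_queue.pop_front();
            l.unlock();
            j(); // perform the (potentially blocking) read
            l.lock();
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::deque<read_job> m_queue;
    int m_max_threads;
    int m_threads = 0;
    int m_idle = 0;
};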

Other thoughts:

I'm not convinced using mincore() to detect cache hits is a big enough
win to justify the added complexity. A big queue of blocked reads means
the cache hit rate is probably low, so the bigger the potential gain the
less chance there is of actually realizing it. If read jobs are
unconditionally queued to the thread pool then the whole concept of
cache hits can be more-or-less eliminated from libtorrent[3]. A cache
hit simply becomes the happy case where all required mmapped pages
happen to already be in the page cache. It would also mean all buffer
stitching would be done in the disk threads. Plus mincore() is
inherently racy: if the pages in question get evicted, the network thread
will block on re-reading them. While this wouldn't happen often, the
impact of even a small rate of occurrence could be significant.
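
For reference, the mincore() probe being discussed would look roughly like
this (a sketch only, not libtorrent code); note that the answer can already
be stale by the time the block is actually read, which is the race
described above:

#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <vector>

// returns true if every page backing [mapping + offset, + len) is resident
bool block_is_resident(void* mapping, std::size_t offset, std::size_t len)
{
    long const page = sysconf(_SC_PAGESIZE);

    // mincore() wants a page-aligned start address and one vec byte per page
    std::uintptr_t const start
        = reinterpret_cast<std::uintptr_t>(mapping) + offset;
    std::uintptr_t const aligned
        = start & ~static_cast<std::uintptr_t>(page - 1);
    std::size_t const span = (start + len) - aligned;
    std::size_t const npages = (span + page - 1) / page;

    std::vector<unsigned char> vec(npages);
    if (mincore(reinterpret_cast<void*>(aligned), span, vec.data()) != 0)
        return false;

    for (unsigned char const v : vec)
        if ((v & 1) == 0) return false; // at least one page is not in core
    return true;
}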

I certainly wouldn't bother with zero copy for the initial
implementation. As is said on the wiki, attempting zero copy for
receiving data is likely not worth it due to the extra syscall overhead,
and zero copy in general only works for unencrypted connections which is
not the typical mode of operation these days.

One challenge that hasn't been mentioned is that on 32-bit systems
libtorrent would need to actively manage mappings due to limited address
space. This needs some empirical work to determine the optimum mapping
size which minimizes churn.
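
Very roughly, the active management could look something like the sketch
below: a bounded set of fixed-size file windows, with the least recently
used window unmapped when the budget is exceeded. The window size is the
parameter that would need the empirical tuning mentioned above, and all
names here are made up for illustration:

#include <sys/mman.h>
#include <cstddef>
#include <cstdint>
#include <list>
#include <map>
#include <utility>

struct mapping_cache
{
    using key = std::pair<int, std::int64_t>; // (fd, window start offset)

    // window_size must be a multiple of the page size
    mapping_cache(std::size_t window_size, std::size_t max_windows)
        : m_window(window_size), m_max(max_windows) {}

    // returns the base address of the window containing 'offset' in 'fd'
    // (the caller adds offset - window_start to reach its data)
    void* map_window(int fd, std::int64_t offset)
    {
        std::int64_t const start = offset - (offset % std::int64_t(m_window));
        key const k{fd, start};

        auto it = m_windows.find(k);
        if (it != m_windows.end())
        {
            // already mapped: move it to the front of the LRU list
            m_lru.splice(m_lru.begin(), m_lru, it->second.second);
            return it->second.first;
        }

        if (m_windows.size() >= m_max) evict_one();

        void* addr = mmap(nullptr, m_window, PROT_READ | PROT_WRITE
            , MAP_SHARED, fd, start);
        if (addr == MAP_FAILED) return nullptr;

        m_lru.push_front(k);
        m_windows[k] = std::make_pair(addr, m_lru.begin());
        return addr;
    }

private:
    void evict_one()
    {
        key const victim = m_lru.back();
        m_lru.pop_back();
        auto it = m_windows.find(victim);
        munmap(it->second.first, m_window);
        m_windows.erase(it);
    }

    std::size_t m_window;
    std::size_t m_max;
    std::list<key> m_lru;
    std::map<key, std::pair<void*, std::list<key>::iterator>> m_windows;
};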

[1] I don't have a strong opinion about putting these on the writer
thread or the reader threads. They usually have to be serialized anyways
so putting them on the writer thread seems natural.
[2] Syncing may still be necessary for consistency, e.g. when saving
resume data. It just shouldn't be used to drive flow-control.
[3] Ideally I'd like to see libtorrent drop explicit caching entirely,
such that all *_cache_* settings would be deprecated.
Arvid Norberg
2016-04-13 05:18:29 UTC
Thanks for responding to this Steven!
Post by Steven Siloti
See https://github.com/arvidn/libtorrent/wiki/memory-mapped-I-O for
background.
So here are my thoughts on a grand mmapped IO refactoring, starting with a
revised threading model:
1. Have a single thread dedicated to write operations. This would
include writes, flushes, and file modifications[1] (delete, rename,
move).
The file operations (at least delete, likely rename on windows, as well as
re-opening files in read-only mode once they're fully written and clearing
the sparse file-flag) would most likely have to be synchronized with reader
threads too, wouldn't they?

I'm not super happy about the complexity of the current mechanism of
raising fences for a torrent, to synchronize, but I can't think of any
simpler way of doing it. It could definitely be done at a finer
granularity, and be more complicated.
Post by Steven Siloti
With buffered IO there's no point in having multiple threads for
writing because they're just creating dirty pages for the OS to flush at
its leisure. Operating systems signal that the page cache is under
pressure by blocking threads that are dirtying pages, so we must detect
this blocking to know when to apply back-pressure to the network.
Meaning, the page fault will block while it's trying to allocate a fresh
page to write the new data to, so the writing thread will pause. This could
be detected by the growing queue of write jobs being piled onto the thread.

Post by Steven Siloti
Having a single writer thread makes this task simpler.
With a simple striping scheme, I would imagine multiple threads could be
used fairly simply too. Say, every odd piece on thread 0 and every even on
1 (probably not quite that simple). The network would just have to be
sensitive to more than one queue.
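
In its simplest form, the striping would just be a function of the piece
index (purely illustrative):

// route each write job to a writer thread based on its piece index, so each
// thread owns a disjoint set of pieces and they share no state
int writer_for_piece(int const piece_index, int const num_writer_threads)
{
    // with two threads this is exactly "odd pieces on one, even on the other"
    return piece_index % num_writer_threads;
}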

Even though dirtying pages and copying memory may not seem like a
bottleneck, I'm a bit hesitant to rely too much on that. Most of my (high
fidelity) experience comes from somewhat contrived benchmarks running over
loopback, but also from some fairly beefy seedboxes. In local benchmarking
(off the top of my head) a significant amount of time is spent copying
memory from the disk cache (that is, the current user-level disk cache)
when you hit the top upload rate (this is about download rate though, and
perhaps an argument can be made that the marginal effect of improving
download rate performance is a lot lower than for the upload rate, since you
only download once).

The writing thread would potentially have to sit around waiting in munmap()
and mmap() calls as well, at least every time a new file was created, and
even more often on 32 bit systems.

In fact, one thing I've been thinking about for quite a while is whether it
would make sense to stripe the entire disk cache. If each thread has its
own, private, part of the torrent it's responsible for, there's no shared
state between threads and the cache data structure could be a lot simpler.
On the other hand, it would likely not make it possible to split reads and
writes onto separate threads. They would be lumped together based on stripe
instead. The idea of no synchronization is a bit appealing though :)

Post by Steven Siloti
What we should not do is try to control writeback ourselves by issuing
explicit syncs[2]. The OS is in a much better position than libtorrent to
determine the optimal writeback strategy; we should rely on the extensive
work OS developers have done in this area. We also want write jobs to be
separated from reads so that the former do not block the latter.
That makes sense.

Post by Steven Siloti
2. Make the reader thread pool scalable. A new read job should be
handled in a new thread if there are none available and the reader
thread count is below some threshold, say 16 by default. The idea is
that the maximum number of reader threads should correspond to a
reasonable upper bound on the number of commands to have queued to the
underlying storage device. For a typical desktop/router/NAS this is
somewhere in the range of 16-32; servers are probably going to want more.
Of course idle threads should be killed after some timeout.
There's actually a TODO in the disk_io_thread about this. :)
Post by Steven Siloti
3. Eliminate dedicated hasher threads. With a scalable reader pool I
don't think they are necessary.
What worries me a bit is Amdahl's law. It's a little bit like complexity
analysis; it's probably a good idea to expect problems to be thrown at your
code that are a lot larger than you would imagine. Likewise, I think it
makes sense to be careful about designing in serial portions of a program.
Now, the current hashing thread affinity isn't exactly all that clean. It
basically kicks in once you have enough threads configured; hashing jobs
are concentrated on one of them (or about a third of the threads, iirc).

Although, it may make sense to re-discover the performance gains of having
SHA-1 hashing not interfere with the disk I/O, or discover that it is
indeed not necessary to separate them. (SHA-1 hashing is easily the single
most expensive operation in bittorrent, only rivaled by the Diffie-Hellman
exchange when connecting to encrypted peers, so I suppose it depends on what
you do most, downloading or connecting to peers).

Since the hashing is associated with downloading, I imagine it would fall
on the writer thread. Which may be another reason to consider multiple
writers.
Post by Steven Siloti
I'm not convinced using mincore() to detect cache hits is a big enough
win to justify the added complexity. A big queue of blocked reads means
the cache hit rate is probably low, so the bigger the potential gain the
less chance there is of actually realizing it. If read jobs are
unconditionally queued to the thread pool then the whole concept of
cache hits can be more-or-less eliminated from libtorrent[3]. A cache
hit simply becomes the happy case where all required mmapped pages
happen to already be in the page cache. It would also mean all buffer
stitching would be done in the disk threads. Plus mincore() is
inherently racy, if the pages in question get evicted the network thread
will block on re-reading them. While this wouldn't happen often, the
impact of even a small rate of occurrence could be significant.
Yes, it's definitely hairy. The cost of the system call, the questionable
utility, having to do buffer stitching in both network thread and disk
threads, and the risk of having the network thread block because of a page
that was just evicted.

The mincore() logic could be done only for blocks that entirely fall into a
single file, but that may be questionable as it would cause different
behaviors depending on the file layout in the torrent (although, I wouldn't
be ashamed to perform better for piece-aligned files)

The race could be mitigated by having a dedicated cache-hit-thread.

Unfortunately, my experience is that this logic provides significant
performance improvements. Granted, it could be that the current thread-pool
is so shitty that being able to circumvent it offers huge benefits.
However, I believe it's more fundamental than that. Once you start to take
the system towards its extreme, the latency savings could drastically
reduce the bandwidth delay product and open up for a much higher request
rate from peers. One thing I've found is that when tuning for seed
performance, the limiting factor is often the rate and number of
outstanding requests from peers. (as you may have guessed, on a loopback
benchmark, the "network latency" is very low, and the disk I/O latency
dominates).

The use case I think is important to keep in mind is the one where the
entire torrent fits in RAM. I believe this is a common case for most
high-performance bittorrent machines. Now, I say torrent in singular,
meaning that one torrent is being downloaded at a time, and a fairly small
number of torrents are being seeded at high rates, and they probably fit in
RAM as well. In these cases, the cache hit ratio is extremely high. And if
the peers being uploaded to have hundreds of MiB/s of capacity, fairly small
latency increases can have drastic effects on the bandwidth-delay product.

Actually, the cache-hit-thread made me think. Currently, every block
(16kiB) in the cache has a reference count, and references to cache-buffers
are passed back directly to the network thread and hung onto the peer
connection send buffer chain. The refcount prevents the cache from evicting
the block while it's being sent over a socket. It is true that, in general, any
memory access may block and potentially take a very long time (it could
have been swapped out to disk, for instance). Are mmapped pages more likely
to be evicted than other (say, anonymous memory) pages? Probably not, or
rather, probably only because they're likely not to have been accessed for
a while.

I would imagine with an mmapped based disk cache, we would still pass back
pointers straight into the mmapped region to the sockets to send, and the
kernel is still free to evict those pages while the data is queued for a socket
to send. Presumably though, the read thread would read the first byte of
each requested page, to make sure they're all fetched (possibly by first
issuing an MADV_WILLNEED for all of the pages). These pages will only be
*likely* to still be in RAM by the time the socket starts consuming them.
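
As a sketch, that pre-faulting could look like this (assuming the address
is page-aligned, as the start of a block inside an mmapped range would be):

#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

void prefault_range(void* addr, std::size_t len)
{
    long const page = sysconf(_SC_PAGESIZE);

    // hint the kernel to start reading the whole range in the background
    madvise(addr, len, MADV_WILLNEED);

    // then touch one byte per page so the data is faulted in synchronously;
    // volatile keeps the compiler from optimizing the dummy reads away
    volatile unsigned char sink = 0;
    unsigned char const* p = static_cast<unsigned char const*>(addr);
    for (std::size_t i = 0; i < len; i += std::size_t(page))
        sink = p[i];
    (void)sink;
}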

How does the kernel decide which pages should be evicted? My first thought
was that perhaps the kernel keeps track of when pages are evicted and
reloaded into the TLB, as a proxy for _unused_ pages (because obviously it
has no way of detecting use of a page that is mapped by the TLB and is in
DRAM). However, on Intel the kernel isn't involved in loading TLB entries,
so I don't think it could do that.

Anyway, the question is: does it even make sense to copy data from a memory
mapped file into a malloc'ed buffer just as a way to "make sure" it isn't
evicted? (it sounds questionable).


Post by Steven Siloti
I certainly wouldn't bother with zero copy for the initial
implementation. As is said on the wiki, attempting zero copy for
receiving data is likely not worth it due to the extra syscall overhead,
and zero copy in general only works for unencrypted connections which is
not the typical mode of operation these days.
agreed.

An important consideration (at least for high performance bittorrent
machines) is the latency between completing a piece and offering it up to
other peers.

One aspect of this is to hash blocks as they arrive; once the last
block is received, there's no reading anything back, we already have the SHA-1
digest. In fact, the current caching mechanism tries really hard to only
evict blocks that the SHA-1 cursor has passed by already, to minimize
read-back. Another aspect is the latency. When the last block is hashed
and it clears the piece hash check, we should immediately send out a HAVE
(or maybe even sooner, predicting that it will pass, but that's a different
story). A peer can then request any block from this piece, potentially
before it's hit the disk. Now, if we do the hashing in the writer thread,
it probably _will_ have hit the page cache. But it might make sense to
think for a moment whether it really should need to.

The current disk cache allocates blocks up-front, and when the cache is
full, peers are throttled, waiting for a new block to become available. I
think there may be a latency benefit of doing that, since once the bytes
have been copied into that buffer, it can be hung into the disk cache
instantly (and in fact, iirc, dirty blocks are inserted into the cache
directly from the network thread, and a flush job is issued to the disk
queue). This lets any other peer request this block immediately, no waiting
around in a queue to be picked up by another thread first. (I suppose the
hash job has to come back from the disk thread though, and maybe that's why
it makes sense to have dedicated hasher threads).

This part of bittorrent, the time from having completely downloaded a piece
until it's being uploaded to another peer, can be likened to the sequential
portion of a program in Amdahl's law. When taken to its extreme, this
latency will be the limiting factor.
Post by Steven Siloti
One challenge that hasn't been mentioned is that on 32-bit systems
libtorrent would need to actively manage mappings due to limited address
space. This needs some empirical work to determine the optimum mapping
size which minimizes churn.
Another challenge is the TLB pressure on a 64-bit system with a huge
virtual address space footprint. Would it make sense to map (large) files
using huge pages? I imagine that would affect the granularity with which
data is flushed back to and read from disk.
Post by Steven Siloti
[1] I don't have a strong opinion about putting these on the writer
thread or the reader threads. They usually have to be serialized anyways
so putting them on the writer thread seems natural.
[2] Syncing may still be necessary for consistency, e.g. when saving
resume data. It just shouldn't be used to drive flow-control.
[3] Ideally I'd like to see libtorrent drop explicit caching entirely,
such that all *_cache_* settings would be deprecated.
--
Arvid Norberg
Arvid Norberg
2016-04-13 21:21:17 UTC
After some more thought and research.

It would probably make sense to (at first) still copy data from the memory
mapped files into buffers passed back to the network thread, given that
buffers likely must be mutable anyway (for encryption) and that the Linux
VM appears to prioritize anonymous memory (as in, making it less likely to
be evicted when under pressure). Also, given that buffer stitching is
required (since an mmapped cache is in file-space, not piece-space),
copying would be required there as well (technically not, but it's
significantly simpler).
--
Arvid Norberg
Steven Siloti
2016-04-15 03:36:02 UTC
Post by Arvid Norberg
The file operations (at least delete, likely rename on windows, as well as
re-opening files in read-only mode once they're fully written and clearing
the sparse file-flag) would most likely have to be synchronized with reader
threads too, wouldn't they?
Yes, file modification will require synchronization with all disk IO
threads regardless, which is why I don't feel that strongly about which
thread(s) they run on.
Post by Arvid Norberg
I'm not super happy about the complexity of the current mechanism of
raising fences for a torrent, to synchronize, but I can't think of any
simpler way of doing it. It could definitely be done at a finer
granularity, and be more complicated.
I dealt with many of the same issues in a previous life; that code was
not pretty in the least. I'm pretty sure job ordering dependencies are
an inherently hard problem.
Post by Arvid Norberg
Post by Steven Siloti
With buffered IO there's no point in having multiple threads for
writing because they're just creating dirty pages for the OS to flush at
its leisure. Operating systems use blocking threads which are dirtying
pages to signal that the page cache is under pressure, so we must detect
this blocking to know when to apply back-presure to the network.
Meaning, the page fault will block while it's trying to allocate a fresh
page to write the new data to, so the writing thread will pause. This could
be detected by the growing queue of write jobs being piled onto the thread.
I know Linux at least actually preemptively blocks the thread if it sees
that there are already more dirty pages than the backing device can
flush in a reasonable amount of time[1]. This goes back to my previous
argument for using job timing to determine the blocking threshold. The
optimum size of the job queue varies depending on the speed of the
underlying storage, but the appropriate time for a job to sit on the queue
is more-or-less constant, so it's much easier to use that as the metric
to watch.
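
Concretely, the metric could be as simple as time-stamping each write job
when it's queued and treating an old head-of-queue as the signal to
throttle the network (hypothetical names, just to illustrate the idea):

#include <chrono>
#include <deque>
#include <mutex>

struct write_job
{
    std::chrono::steady_clock::time_point queued_at;
    // piece, block, buffer etc. omitted
};

struct write_queue
{
    void push(write_job j)
    {
        std::lock_guard<std::mutex> l(m_mutex);
        j.queued_at = std::chrono::steady_clock::now();
        m_jobs.push_back(std::move(j));
    }

    // true if the oldest pending job has waited longer than the threshold.
    // this holds regardless of how fast or slow the underlying device is,
    // which is the point of using time rather than queue length
    bool over_pressure(std::chrono::milliseconds const threshold) const
    {
        std::lock_guard<std::mutex> l(m_mutex);
        if (m_jobs.empty()) return false;
        auto const waited
            = std::chrono::steady_clock::now() - m_jobs.front().queued_at;
        return waited > threshold;
    }

    mutable std::mutex m_mutex;
    std::deque<write_job> m_jobs;
};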
Post by Arvid Norberg
Post by Steven Siloti
Having a single writer thread makes this task simpler.
With a simple striping scheme, I would imagine multiple threads could be
used fairly simply too. Say, every odd piece on thread 0 and every even on
1 (probably not quite that simple). The network would just have to be
sensitive to more than one queue.
Even though dirtying pages and copying memory may not seem like a
bottleneck, I'm a bit hesitant to rely too much on that. Most of my (high
fidelity) experience comes from somewhat contrived benchmarks running over
loopback, but also from some fairly beefy seedboxes. In local benchmarking
(off the top of my head) a significant amount of time is spent copying
memory from the disk cache (that is, the current user-level disk cache)
when you hit the top upload rate (this is about download rate though, and
perhaps an argument can be made that the marginal effects of improving
download rate performance is a lot lower than the upload rate, since you
only download once).
I did some quick micro-benchmarks of copying data from anonymous memory
to an mmapped file (not including flush time) and found that using 2
threads was 1.5x faster than using one and 4 threads was the same speed
as 2. So it seems there are limited gains to be had with parallelism here,
at least outside of big NUMA systems, although with the extra overhead
of a full client there may be a bigger difference. Also, one core of an
A10-7870K is already able to copy over 1GB/s which seems like it should
be "enough for anyone" :)
Post by Arvid Norberg
The writing thread would potentially have to sit around waiting in munmap()
and mmap() calls as well, at least every time a new file was created, and
even more often on 32 bit systems.
I would say that if you have a fast enough network for it to matter, use
64-bit :) Throughput to disk with many small files is going to suck
anyways so I doubt mmap() would be a bottleneck in that case.
Post by Arvid Norberg
In fact, one thing I've been thinking about for quite a while is whether it
would make sense to stripe the entire disk cache. If each thread has it's
own, private, part of the torrent it's responsible for, there's no shared
state between threads and the cache data structure could be a lot simpler.
On the other hand, it would likely not make it possible to split reads and
writes onto separate threads. They would be lumped together based on stripe
instead. The idea of no synchronization is a bit appealing though :)
Can't say I'm a fan of this. It seems like trading one complexity for
another, and for a scheme which precludes the optimum way of interacting
with the OS.
Post by Arvid Norberg
Post by Steven Siloti
I'm not convinced using mincore() to detect cache hits is a big enough
win to justify the added complexity. A big queue of blocked reads means
the cache hit rate is probably low, so the bigger the potential gain the
less chance there is of actually realizing it. If read jobs are
unconditionally queued to the thread pool then the whole concept of
cache hits can be more-or-less eliminated from libtorrent[3]. A cache
hit simply becomes the happy case where all required mmapped pages
happen to already be in the page cache. It would also mean all buffer
stitching would be done in the disk threads. Plus mincore() is
inherently racy, if the pages in question get evicted the network thread
will block on re-reading them. While this wouldn't happen often, the
impact of even a small rate of occurrence could be significant.
Yes, it's definitely hairy. The cost of the system call, the questionable
utility, having to do buffer stitching in both network thread and disk
threads, and the risk of having the network thread block because of a page
that was just evicted.
The mincore() logic could be done only for blocks that entirely fall into a
single file, but that may be questionable as it would cause different
behaviors depending on the file layout in the torrent (although, I wouldn't
be ashamed to perform better for piece-aligned files)
The race could be mitigated by having a dedicated cache-hit-thread.
Unfortunately, my experience is that this logic provides significant
performance improvements. Granted, it could be that the current thread-pool
is so shitty that being able to circumvent it offers huge benefits.
However, I believe it's more fundamental than that. Once you start to take
the system towards its extreme, the latency savings could drastically
reduce the bandwidth delay product and open up for a much higher request
rate from peers. One thing I've found is that when tuning for seed
performance, the limiting factor is often the rate and number of
outstanding requests from peers. (as you may have guessed, on a loopback
benchmark, the "network latency" is very low, and the disk I/O latency
dominates).
Hm, so is the problem cache hits being blocked by misses or just the
base overhead of passing a job to the thread pool? If cache hits are
being blocked by misses then I would say: If in doubt, add more threads
:) For beefy servers with fast connections it would be perfectly
appropriate to run with a 128 or even 256 reader thread limit.
Supporting this efficiently may require some care when it comes to
things like lock contention, but I don't think it would be a big problem
since we're talking about read jobs which shouldn't be modifying global
state in the fast path.

Reducing the overhead of the thread pool is tougher. One thing which
might help a lot is batching job submission. Right now jobs are being
passed in one-at-a-time and there is a lot of stuff like locks being
acquired/released, resource allocations, and container insertions which
could be amortized.
Post by Arvid Norberg
An important consideration (at least for high performance bittorrent
machines) is the latency between completing a piece and offering it up to
other peers.
One aspect of this is to hash blocks as they arrive, and once the last
block is received, there's no reading anything back, we have the SHA-1
digest. In fact, the current caching mechanism tries really hard to only
evict blocks that the SHA-1 cursor has passed by already, to minimize
read-back. Another aspect is the latency. A when the last block is hashed
and it clears the piece hash check, we should immediately send out a HAVE
(or maybe even sooner, predicting that it will pass, but that's a different
story). A peer can then request any block from this piece, potentially
before it's hit the disk. Now, if we do the hashing in the writer thread,
it probably _will_ have hit the page cache. But it might make sense to
think for a moment whether it really should need to.
The current disk cache allocates blocks up-front, and when the cache is
full, peers are throttled, waiting for new block to become available. I
think there may be a latency benefit of doing that, since once the bytes
have been copied into that buffer, it can be hung into the disk cache
instantly (and in fact, iirc, dirty blocks are inserted into the cache
directly from the network thread, and a flush job is issued to the disk
queue). This lets any other peer request this block immediately, no waiting
around in a queue to be picked up by another thread first. (I suppose the
hash job has to come back from the disk thread though, and maybe that's why
it makes sense to have dedicated hasher threads).
This part of bittorrent, the time from having completely downloaded a piece
until it's being uploaded to another peer, can be likened to the sequential
portion of a program in Amdahl's law. When taken to its extreme, this
latency will be the limiting factor.
Hm, I hadn't thought about time-to-HAVE as an important metric. I see
your points about the need for dedicated hasher threads. A separate
thread pool with a default limit set to the number of CPU cores seems
appropriate since these jobs are expected to be CPU bound.
Post by Arvid Norberg
Post by Steven Siloti
One challenge that hasn't been mentioned is that on 32-bit systems
libtorrent would need to actively manage mappings due to limited address
space. This needs some empirical work to determine the optimum mapping
size which minimizes churn.
Another challenge is the TLB pressure on a 64 bit system having a huge
virtual address space footprint. Would it make sense to map (large) files
using huge pages? I imagine that would affect the granularity with which
data is flushed back to and read from disk.
Unfortunately, due to complications with the page cache, Linux currently
does not support mapping files with huge pages. I don't think we'd want
to use them anyways. They would impose a significant read-modify-write
penalty, especially if the piece size is less than the huge page size (2 MB).

[1] https://lwn.net/Articles/405076/
Arvid Norberg
2016-04-15 04:42:42 UTC
[...]
Post by Arvid Norberg
Post by Arvid Norberg
Post by Steven Siloti
With buffered IO there's no point in having multiple threads for
writing because they're just creating dirty pages for the OS to flush at
its leisure. Operating systems use blocking threads which are dirtying
pages to signal that the page cache is under pressure, so we must detect
this blocking to know when to apply back-presure to the network.
Meaning, the page fault will block while it's trying to allocate a fresh
page to write the new data to, so the writing thread will pause. This could
be detected by the growing queue of write jobs being piled onto the thread.
I know Linux at least actually preemptively blocks the thread if it sees
that there are already more dirty pages than the backing device can
flush in a reasonable amount of time[1]. This goes back to my previous
argument for using job timing to determine the blocking threshold. The
optimum size of the job queue varies depending on the speed of the
underlying storage, the appropriate time for a job to sit on the queue
is more-or-less constant so it's much easier to use that as the metric
to watch.
That makes sense.
Post by Arvid Norberg
Post by Arvid Norberg
[...]
Even though dirtying pages and copying memory may not seem like a
bottleneck, I'm a bit hesitant to rely too much on that. Most of my (high
fidelity) experience comes from somewhat contrived benchmarks running over
loopback, but also from some fairly beefy seedboxes. In local benchmarking
(off the top of my head) a significant amount of time is spent copying
memory from the disk cache (that is, the current user-level disk cache)
when you hit the top upload rate (this is about download rate though, and
perhaps an argument can be made that the marginal effect of improving
download rate performance is a lot lower than for the upload rate, since you
only download once).
I did some quick micro-benchmarks of copying data from anonymous memory
to an mmapped file (not including flush time) and found that using 2
threads was 1.5x faster than using one and 4 threads was the same speed
as 2. So it seems there's limited gains to be had with parallelism here,
at least outside of big NUMA systems, although with the extra overhead
of a full client there may be a bigger difference. Also, one core of an
A10-7870K is already able to copy over 1GB/s which seems like it should
be "enough for anyone" :)
I'm not primarily worried about the time to copy memory, but to map new
pages and set up new TLB entries. I suppose updating the page table, the
whole process may suffer a TLB flush.

In your benchmark, did you copy into previously uncommitted pages?
Post by Arvid Norberg
Post by Arvid Norberg
The writing thread would potentially have to sit around waiting in munmap()
and mmap() calls as well, at least every time a new file was created, and
even more often on 32 bit systems.
I would say that if you have a fast enough network for it to matter, use
64-bit :) Throughput to disk with many small files is going to suck
anyways so I doubt mmap() would be a bottleneck in that case.
actually, I get the impression that for small files, using regular read()
and write() may actually be quicker. Also, there's a chance the current
cache mechanism would be quicker, since it defers all file operations
(including creating and opening files) to the disk I/O thread, and allows
hashing and seeding independently.

[...]
Post by Arvid Norberg
Post by Arvid Norberg
In fact, one thing I've been thinking about for quite a while is whether it
would make sense to stripe the entire disk cache. If each thread has its
own, private, part of the torrent it's responsible for, there's no shared
state between threads and the cache data structure could be a lot simpler.
On the other hand, it would likely not make it possible to split reads and
writes onto separate threads. They would be lumped together based on stripe
instead. The idea of no synchronization is a bit appealing though :)
Can't say I'm a fan of this. It seems like trading one complexity for
another, and for a scheme which precludes the optimum way of interacting
with the OS.
Post by Arvid Norberg
[...]
Unfortunately, my experience is that this logic provides significant
performance improvements. Granted, it could be that the current thread-pool
is so shitty that being able to circumvent it offers huge benefits.
However, I believe it's more fundamental than that. Once you start to take
the system towards its extreme, the latency savings could drastically
reduce the bandwidth delay product and open up for a much higher request
rate from peers. One thing I've found is that when tuning for seed
performance, the limiting factor is often the rate and number of
outstanding requests from peers. (as you may have guessed, on a loopback
benchmark, the "network latency" is very low, and the disk I/O latency
dominates).
Hm, so is the problem cache hits being blocked by misses or just the
base overhead of passing a job to the thread pool? If cache hits are
being blocked by misses then I would say: If in doubt, add more threads
:) For beefy servers with fast connections it would be perfectly
appropriate to run with a 128 or even 256 reader thread limit.
Supporting this efficiently may require some care when it comes to
things like lock contention, but I don't think it would be a big problem
since we're talking about read jobs which shouldn't be modifying global
state in the fast path.
That's actually a good question (and point). I suppose the main cause of
latency would be other jobs ahead of it.
Post by Arvid Norberg
Reducing the overhead of the thread pool is tougher. One thing which
might help a lot is batching job submission. Right now jobs are being
passed in one-at-a-time and there is a lot of stuff like locks being
acquired/released, resource allocations, and container insertions which
could be amortized.
Actually, they are batched right now. The mutex may actually be acquired
once per job, but the condition variable is signaled once per batch.
(In my profiling, the mutex is cheap compared to signaling the condition
variable, since a mutex not under contention is user-space only).

But it's a good point, there may be some more opportunities for batching.
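
In sketch form, the batching described above amounts to something like this
(illustrative only, not the actual disk_io_thread code): the mutex is taken
per job, but only one wake-up is issued per batch:

#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

struct disk_job; // opaque job type for the sake of the sketch

struct job_queue
{
    void submit_batch(std::vector<disk_job*> const& batch)
    {
        for (disk_job* j : batch)
        {
            // an uncontended mutex stays in user space, so this is cheap
            std::lock_guard<std::mutex> l(m_mutex);
            m_queue.push_back(j);
        }
        // the expensive part, signaling the condition variable, happens once
        m_cond.notify_one();
    }

    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::deque<disk_job*> m_queue;
};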
Post by Arvid Norberg
Post by Arvid Norberg
An important consideration (at least for high performance bittorrent
machines) is the latency between completing a piece and offering it up to
other peers.
One aspect of this is to hash blocks as they arrive; once the last
block is received, there's no reading anything back, we already have the SHA-1
digest. In fact, the current caching mechanism tries really hard to only
evict blocks that the SHA-1 cursor has passed by already, to minimize
read-back. Another aspect is the latency. When the last block is hashed
and it clears the piece hash check, we should immediately send out a HAVE
(or maybe even sooner, predicting that it will pass, but that's a different
story). A peer can then request any block from this piece, potentially
before it's hit the disk. Now, if we do the hashing in the writer thread,
it probably _will_ have hit the page cache. But it might make sense to
think for a moment whether it really should need to.
The current disk cache allocates blocks up-front, and when the cache is
full, peers are throttled, waiting for a new block to become available. I
think there may be a latency benefit of doing that, since once the bytes
have been copied into that buffer, it can be hung into the disk cache
instantly (and in fact, iirc, dirty blocks are inserted into the cache
directly from the network thread, and a flush job is issued to the disk
queue). This lets any other peer request this block immediately, no waiting
around in a queue to be picked up by another thread first. (I suppose the
hash job has to come back from the disk thread though, and maybe that's why
it makes sense to have dedicated hasher threads).
This part of bittorrent, the time from having completely downloaded a piece
until it's being uploaded to another peer, can be likened to the sequential
portion of a program in Amdahl's law. When taken to its extreme, this
latency will be the limiting factor.
Hm, I hadn't thought about time-to-HAVE as an important metric. I see
your points about the need for dedicated hasher threads. A separate
thread pool with a default limit set to the number of CPU cores seems
appropriate since these jobs are expected to be CPU bound.
I guess it makes sense to keep in consideration, but probably not implement in
a first pass.
Post by Arvid Norberg
Post by Arvid Norberg
Post by Steven Siloti
One challenge that hasn't been mentioned is that on 32-bit systems
libtorrent would need to actively manage mappings due to limited address
space. This needs some empirical work to determine the optimum mapping
size which minimizes churn.
Another challenge is the TLB pressure on a 64 bit system having a huge
virtual address space footprint. Would it make sense to map (large) files
using huge pages? I imagine that would affect the granularity with which
data is flushed back to and read from disk.
Unfortunately, due to complications with the page cache, Linux currently
does not support mapping files with huge pages. I don't think we'd want
to use them anyways. They would impose significant read-modify-write
penalty, especially if the piece size is less than the huge page size (2MB).
yeah, I realized this is only supported for anonymous memory.
--
Arvid Norberg
Steven Siloti
2016-04-16 02:52:06 UTC
Post by Arvid Norberg
[...]
Post by Steven Siloti
I did some quick micro-benchmarks of copying data from anonymous memory
to an mmapped file (not including flush time) and found that using 2
threads was 1.5x faster than using one and 4 threads was the same speed
as 2. So it seems there's limited gains to be had with parallelism here,
at least outside of big NUMA systems, although with the extra overhead
of a full client there may be a bigger difference. Also, one core of an
A10-7870K is already able to copy over 1GB/s which seems like it should
be "enough for anyone" :)
I'm not primarily worried about the time to copy memory, but to map new
pages and set up new TLB entries. I suppose updating the page table, the
whole process may suffer a TLB flush.
In your benchmark, did you copy into previously uncommitted pages?
My test procedure was:

- allocate 1GB anonymous source buffer
- start timing
- Create a new test file
- fallocate 1GB
- mmap entire file
- memcpy source buffer to file in 16KB chunks
- stop timing

So the mapped pages are getting faulted in during the copy.
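
In code, the single-threaded version of that procedure is roughly the
following (Linux-flavored, error handling abbreviated; the multi-threaded
runs just split the copy range across threads):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    std::size_t const total = std::size_t(1) << 30; // 1 GiB
    std::size_t const chunk = 16 * 1024;            // 16 kiB blocks

    std::vector<char> src(total, 'x'); // anonymous source buffer

    auto const start = std::chrono::steady_clock::now();

    int const fd = open("bench.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    // reserve the file's blocks up front, as in the procedure above
    if (fd < 0 || posix_fallocate(fd, 0, total) != 0) return 1;

    char* const dst = static_cast<char*>(mmap(nullptr, total
        , PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (dst == MAP_FAILED) return 1;

    // the mapped pages are faulted in as they are written
    for (std::size_t off = 0; off < total; off += chunk)
        std::memcpy(dst + off, src.data() + off, chunk);

    auto const elapsed = std::chrono::steady_clock::now() - start;
    std::printf("copied 1 GiB in %.2f s\n"
        , std::chrono::duration<double>(elapsed).count());

    munmap(dst, total);
    close(fd);
    return 0;
}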
Post by Arvid Norberg
Also, there's a chance the current
cache mechanism would be quicker, since it defers all file operations
(including creating and opening files) to the disk I/O thread, and allows
hashing and seeding independently.
I'm a bit confused by this sentence; I didn't think anyone was
proposing to do file operations outside of the disk I/O thread.
Post by Arvid Norberg
[...]
Post by Steven Siloti
Reducing the overhead of the thread pool is tougher. One thing which
might help a lot is batching job submission. Right now jobs are being
passed in one-at-a-time and there is a lot of stuff like locks being
acquired/released, resource allocations, and container insertions which
could be amortized.
Actually, they are batched right now. The mutex may actually be acquired
once per job, but the condition variable is signaled once per batch.
(In my profiling, the mutex is cheap compared to signaling the condition
variable, since a mutex not under contention is user-space only).
But it's a good point, there may be some more opportunities for batching.
Ah I see. There are definitely savings to be had, e.g. consecutive jobs
are likely to reference the same piece so we could save on calls to
block_cache::find_piece, but I'd defer this work until there's some data
showing it to be worthwhile.
Arvid Norberg
2016-04-16 04:20:14 UTC
Post by Steven Siloti
Post by Arvid Norberg
[...]
Post by Steven Siloti
I did some quick micro-benchmarks of copying data from anonymous memory
to an mmapped file (not including flush time) and found that using 2
threads was 1.5x faster than using one and 4 threads was the same speed
as 2. So it seems there's limited gains to be had with parallelism here,
at least outside of big NUMA systems, although with the extra overhead
of a full client there may be a bigger difference. Also, one core of an
A10-7870K is already able to copy over 1GB/s which seems like it should
be "enough for anyone" :)
I'm not primarily worried about the time to copy memory, but to map new
pages and set up new TLB entries. I suppose updating the page table, the
whole process may suffer a TLB flush.
In your benchmark, did you copy into previously uncommitted pages?
- allocate 1GB anonymous source buffer
- start timing
- Create a new test file
- fallocate 1GB
- mmap entire file
- memcpy source buffer to file in 16KB chunks
- stop timing
So the mapped pages are getting faulted in during the copy.
Cool, that's good.
Post by Steven Siloti
Post by Arvid Norberg
Also, there's a chance the current
cache mechanism would be quicker, since it defers all file operations
(including creating and opening files) to the disk I/O thread, and allows
hashing and seeding independently.
I'm a bit confused by this sentence, I didn't think anyone was
proposing to do file operations outside of the disk I/O thread.
Yeah, that wasn't very clear. What I meant was that while a buffer is
waiting to have its file opened, it should be available to be hashed and
uploaded to other peers. That is the case right now because the cache is
organized at the piece-level, and not at the file level. However, what
occurred to me is that the disk job queue can be seen as a store buffer (in
a CPU). While the data is in the queue, it's still accessible by the CPU as
a kind of cache hit.

If the write job queue can be indexed by piece/block, the same effect could
be achieved; basically the best of both worlds.
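
A sketch of what that index might look like (names are illustrative, not
libtorrent's actual structures): the read path consults the pending-write
index before queuing a disk job, much like a CPU's load path checks its
store buffer:

#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

using block_key = std::pair<std::int64_t, int>; // (piece index, block index)
using block_buf = std::shared_ptr<std::vector<char>>;

struct write_store_buffer
{
    // called by the network thread when it queues a write job
    void insert(block_key const k, block_buf b)
    {
        std::lock_guard<std::mutex> l(m_mutex);
        m_pending[k] = std::move(b);
    }

    // called on the read path before queuing a disk job; a hit means the
    // block is still waiting to be written and can be served immediately
    block_buf lookup(block_key const k) const
    {
        std::lock_guard<std::mutex> l(m_mutex);
        auto const it = m_pending.find(k);
        if (it == m_pending.end()) return nullptr;
        return it->second;
    }

    // called by the writer thread once the block has hit the mmapped file
    void erase(block_key const k)
    {
        std::lock_guard<std::mutex> l(m_mutex);
        m_pending.erase(k);
    }

    mutable std::mutex m_mutex;
    std::map<block_key, block_buf> m_pending;
};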
Post by Steven Siloti
Post by Arvid Norberg
[...]
Post by Steven Siloti
Reducing the overhead of the thread pool is tougher. One thing which
might help a lot is batching job submission. Right now jobs are being
passed in one-at-a-time and there is a lot of stuff like locks being
acquired/released, resource allocations, and container insertions which
could be amortized.
Actually, they are batched right now. The mutex may actually be acquired
once per job, but the condition variable is signaled once per batch.
(In my profiling, the mutex is cheap compared to signaling the condition
variable, since a mutex not under contention is user-space only).
But it's a good point, there may be some more opportunities for batching.
Ah I see. There are definitely savings to be had, e.g. consecutive jobs
are likely to reference the same piece so we could save on calls to
block_cache::find_piece, but I'd defer this work until there's some data
showing it to be worthwhile.
That's true (I don't recall it showing up in my profiles though).
--
Arvid Norberg
Arvid Norberg
2016-04-16 19:04:00 UTC
I've updated the wiki page to document our progress so far. Please feel
free to propose updates to it.

https://github.com/arvidn/libtorrent/wiki/memory-mapped-I-O

I'm quite happy with how this is shaping up. It seems quite doable and
surprisingly self-contained.
--
Arvid Norberg