Discussion:
[libtorrent] Disk cache
Arvid Norberg
2006-04-09 14:25:51 UTC
There has been some discussion on the IRC channel about benefits and
possible cache algorithms for a disk cache in libtorrent. I'm posting
this hoping that somebody could comment on some of the techniques or
assumptions.

The way I see it, there are two problems that a cache aims to solve:

1. seeding on a high speed connection will make the hard drive seek a
lot back and forth, just to do short reads.

2. downloading on a high speed connection will make the hard drive
seek a lot back and forth.


For 1, one assumption that can be made is that if block 0 is
requested from a piece, it's very likely that many more blocks from
that piece will be requested soon. So caching the entire piece at
once would probably be a good idea.

This is exactly what a normal OS disk cache does: it reads ahead
into the cache. So I don't really see any benefit in caching this at
the application level. One possible optimization would be to bypass
the OS disk cache so that only one piece was cached (since the OS
cache may cache more than the piece we're reading from). This would
be hard to do, and especially hard to do in a platform-independent
manner.

Is this a reasonable assumption?

I believe some clients also order the requested blocks in increasing
index order, and then read them in that order. The idea is that this
will minimize the seek distance for the hard drive, since it's only
reading forward. Then of course it will jump back to the low index
again when it's time to read the next batch of pieces.

Would this improve the disk performance?
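
For illustration, a rough sketch of how such an ordering could be done
(hypothetical code, not how any client actually implements it):

#include <cstdint>
#include <map>

struct read_request { int piece; int offset; int length; /* peer, ... */ };

// keyed by absolute offset in the torrent's storage, so iterating the map
// visits the requests in ascending (forward-seeking) order
std::multimap<std::int64_t, read_request> pending;

void queue_request(std::int64_t piece_size, read_request const& r)
{
    pending.insert({ std::int64_t(r.piece) * piece_size + r.offset, r });
}

void service_batch()
{
    for (auto const& p : pending)
    {
        // read p.second.length bytes at offset p.first and send to the peer
    }
    pending.clear();
}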


Number 2 is a little bit more interesting though. Since every block
that is downloaded has to be written to disk, and since every block
is only downloaded once, you cannot really optimize away disk writes.
What you can do though, is to optimize the disk seeking.

uTorrent used to implement a write priority queue, into which write
requests were put, ordered by increasing index. Once the write cache
was full, the pieces were written to disk in that order. This was also
done every 30 seconds, if the cache wasn't full yet.
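
For illustration, such a queue might look roughly like this (a sketch of
the idea only, not uTorrent's actual code):

#include <cstddef>
#include <ctime>
#include <map>
#include <vector>

std::map<int, std::vector<char>> write_queue;  // piece index -> data, kept sorted
std::size_t queued_bytes = 0;
std::time_t last_flush = std::time(nullptr);

void flush_to_disk()
{
    // write each queued piece at its offset, in increasing index order
    for (auto const& e : write_queue) { /* write e.second for piece e.first */ }
    write_queue.clear();
    queued_bytes = 0;
    last_flush = std::time(nullptr);
}

void queue_write(int piece, std::vector<char> data, std::size_t cache_limit)
{
    queued_bytes += data.size();
    write_queue[piece] = std::move(data);
    if (queued_bytes >= cache_limit || std::time(nullptr) - last_flush >= 30)
        flush_to_disk();
}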

From version 1.5 of uTorrent, this algorithm changed into one that
caches blocks, and once all the blocks for one piece have been
downloaded, they are written to disk immediately. Ludvig (the author
of uTorrent) claims that the new algorithm works better, and that the
hard drive doesn't make as much noise anymore.

So, what do you think, should the write cache and read cache be
separated? Of course, read operations would still look in the write
queue for up-to-date data, but a write operation wouldn't be able to
evict anything from the read cache.

Should a cache be global or per torrent? I don't know how to minimize
the seeks between torrents. The assumption is that large parts of
files are unfragmented, so that most of the time, seeking forward is
pretty cheap. Different torrents are guaranteed to store their data in
different files, whose position on the media is unknown. It would be
desirable to have a global cache size, though, I guess.

Are there any other better caching algorithms?

Is there any point in having a cache at all?

thanks.
--
Arvid Norberg
Matt Sicker
2006-04-09 14:43:33 UTC
Post by Arvid Norberg
There has been some discussion on the IRC channel about benefits and
possible cache algorithms for a disk cache in libtorrent. I'm posting
this hoping that somebody could comment on some of the techniques or
assumptions.
1. seeding on a high speed connection will make the hard drive seek a
lot back and forth, just to do short reads.
2. downloading on a high speed connection will make the hard drive seek
a lot back and forth.
For 1, one assumption that can be made is that if block 0 is requested
from a piece, it's very likely that many more blocks from that piece
will be requested soon. So caching the entire piece at once would
probably be a good idea.
This is exactly what a normal OS' disk cache does. It reads ahead into
the cache. So I don't really see any benefit if caching this at
application level. One possible optimization would be if the disk cache
could be bypassed so that only one piece was cached (since the disk
cache in the OS may cache more than the piece we're reading from). This
would be hard to do, and especially hard to do in a platform independent
manner.
Is this a reasonable assumption?
I believe some clients also order the requested blocks in increasing
index order, and then reads them in that order. The idea is that it will
minimize the seek distance for the hard drive, since it's only reading
forward. Then of course it will jump back to the low index again when
it's time to read next batch of pieces.
Would this improve the disk performance?
Number 2 is a little bit more interesting though. Since every block that
is downloaded has to be written to disk, and since every block is only
downloaded once, you cannot really optimize away disk writes. What you
can do though, is to optimize the disk seeking.
uTorrent used to implement a write priority queue, where write requests
were put, and ordered in increasing index order. Once the write cache
was full, the pieces were written to disk in that order. This was also
done every 30 seconds, it the cache wasn't full yet.
from version 1.5 of uTorrent, this algorithm changed, into one that
caches blocks, and once all the blocks for one piece has been
downloaded, they are written to disk immediately. Ludvig (the author of
uTorrent) claims that the new algorithm worked better, and that the hard
drive didn't make as much noise anymore.
So, what do you think, should a write cache and read cache be separated?
Of course read operations would still look in the write queue for up to
date data, but a write operation wouldn't be able to remove something
from the read cache.
Should a cache be global or per torrent? I don't know how to minimize
the seeks between torrent. The assumption is that large parts of files
are unfragmented, so that most of the time, a seeking forward is pretty
cheap. In different torrents the storages are guaranteed to be in
different files, whose position on the media is unknown. It would be
desirable to have a global cache size though I guess.
Are there any other better caching algorithms?
Is there any points having a cache at all?
thanks.
--
Arvid Norberg
Would it be possible to try this in a feature branch of libtorrent? If
you know what to do already, it probably shouldn't take long to create
POCs for both methods.

Also, I do know that Azureus generally requests the blocks of a piece in
order, but it usually doesn't receive them in that order. Watching some
sort of debug mode for BT programs should help demonstrate that with
experimental results (rather than just assuming that what the program
wants to do according to the source code is what actually happens).
Arvid Norberg
2006-04-09 21:33:10 UTC
Post by Matt Sicker
Would it be possible to try this in a feature-branch of
libtorrent? If
you know what to do already, it probably shouldn't take long to create
POC's for both methods.
Well. I don't know what I want to do. And I'm not sure how to measure
and compare performance either. Any suggestions are welcome. For
example, having a write cache at piece level would definitely reduce
the number of system calls to read() and write(), but the question
is, would it help performance? Or would it just introduce another
layer on top of the OS' cache?

I can't think of a good way to measure that.

I do intend to gather some statistics once I have something
implemented though.
Post by Matt Sicker
Also, I do know that Azureus generally requests blocks to a piece in
order, but it usually doesn't receive them in that order. Watching some
sort of debug mode for BT programs should help show that for
experimental
results (rather than just assuming what the program wants to do according
to the source code will happen).
libtorrent also requests blocks in order. And receives them out of
order from some clients.


--
Arvid Norberg
Luboš Doležel
2006-04-09 14:56:25 UTC
Post by Arvid Norberg
Is there any points having a cache at all?
My personal opinion is that caching should be (and is) performed by
the operating system and not by the application, because it's more
effective:

When some other app needs more memory and there is no free memory
left, the OS simply decides to free some cache - but when caching is
done directly by the application, swapping may occur, and this can
IMHO result in an even bigger slowdown.

Lubos Dolezel
Arvid Norberg
2006-04-09 15:27:59 UTC
Post by Luboš Doležel
Post by Arvid Norberg
Is there any points having a cache at all?
My personal opinion is that caching should be (and is) performed by
When some more memory is needed by some other app and there is no free
mem left, OS simply decides to free some cache - but when caching is
done directly by applicaton, swapping may occur and this can IMHO result
in an even bigger slowdown.
Any caching would of course be optional.


--
Arvid Norberg
Felipe Magno de Almeida
2006-04-09 15:21:05 UTC
Hi, I've been on this mailing list for a long time but never had much
time to investigate libtorrent's internals, so sorry if I say anything
stupid.

On 4/9/06, Arvid Norberg <***@cs.umu.se> wrote:

[snipped]
Post by Arvid Norberg
1. seeding on a high speed connection will make the hard drive seek a
lot back and forth, just to do short reads.
2. downloading on a high speed connection will make the hard drive
seek a lot back and forth.
For 1, one assumption that can be made is that if block 0 is
requested from a piece, it's very likely that many more blocks from
that piece will be requested soon. So caching the entire piece at
once would probably be a good idea.
This is exactly what a normal OS' disk cache does. It reads ahead
into the cache. So I don't really see any benefit if caching this at
application level. One possible optimization would be if the disk
cache could be bypassed so that only one piece was cached (since the
disk cache in the OS may cache more than the piece we're reading
from). This would be hard to do, and especially hard to do in a
platform independent manner.
Although I believe it isn't worth it just for this, I don't think that
doing it in a platform-independent manner would be very hard. POSIX
and Windows have flags to inhibit OS caching, at least.
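
As a sketch of what those flags look like (assumptions: O_DIRECT on Linux
and some other Unix-like systems, FILE_FLAG_NO_BUFFERING on Windows; both
impose alignment requirements on buffers, offsets and transfer sizes):

#ifdef _WIN32
#include <windows.h>
HANDLE open_uncached(char const* path)
{
    return CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
        OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, nullptr);
}
#else
#include <fcntl.h>
int open_uncached(char const* path)
{
    return open(path, O_RDONLY | O_DIRECT);
}
#endif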

[snipped]
Post by Arvid Norberg
Is there any points having a cache at all?
IMHO, it is only possible to answer this with tests, to see how much
better a cache could make the I/O operations.
Another assumption could be that the rarest blocks should stay in the
cache longer, since they should be requested more often by peers.
FWIW, Boost reviewed and accepted asio, a library for asynchronous
I/O. It supports only socket operations for now, but it will probably
support files as well soon, which could have some impact on ease of
use and CPU usage.

BitComet has this statistics in one torrent I'm downloading:

Disk Read Statistics: Request: 38982 (freq: 2.1/s), Actual Disk Read:
2542 (freq: 0.1/s), Hit Ratio: 93.4%
Disk Write Statistics: Request: 9986 (freq: 1.2/s), Actual Disk Write:
184 (freq: 0.0/s), Hit Ratio: 98.1%
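
(Assuming Hit Ratio here means 1 - actual disk operations / requests,
the numbers roughly check out: 1 - 2542/38982 is about 93.5%, and
1 - 184/9986 is about 98.2%.)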

It seems like pretty good numbers, although we can't really know what
"Disk Write" means here...
Post by Arvid Norberg
thanks.
--
Arvid Norberg
best regards,
--
Felipe Magno de Almeida
Arvid Norberg
2006-04-09 21:41:54 UTC
Post by Felipe Magno de Almeida
Post by Arvid Norberg
Is there any points having a cache at all?
IMHO, it is only possible to answer this with tests and see how better
a cache could perform the I/O operations.
I believe that other assumption could be that rarest blocks could stay
in cache longer too, since they should be requested more often from
peers.
Yes, but usually there isn't a single rarest piece; there's a whole
bunch of them. A read cache should probably prioritize them, though,
assuming all peers implement rarest-first.
Post by Felipe Magno de Almeida
FWIW, Boost reviewed and accepted asio, a library for asynchronous
I/O. It supports only sockets operations for now, but probably soon it
will support files as well, which could have some impact in easy of
use and CPU usage.
Yes, I'm following this progress. I'm already using asio for
networking (not in the main branch yet, but soon).
Post by Felipe Magno de Almeida
2542 (freq: 0.1/s), Hit Ratio: 93.4%
184 (freq: 0.0/s), Hit Ratio: 98.1%
It seems like pretty good numbers. Although we cant really know what
Disk Write means here...
Yeah, I'm a bit interested in how that cache works as well. The stats
for the read cache seem to suggest that either almost the entire
torrent fits in the cache, or the swarm is practically requesting the
same pieces all the time.

--
Arvid Norberg
M***@massaroddel.de
2006-04-09 17:02:29 UTC
Post by Arvid Norberg
There has been some discussion on the IRC channel about benefits and
possible cache algorithms for a disk cache in libtorrent. I'm posting
this hoping that somebody could comment on some of the techniques or
assumptions.
One thing that we did not discuss in the IRC channel is the open file
limit. I can imagine that it has a bad effect on the OS's caching if
files are opened and closed often because the torrent(s) have too many
files. (E.g. 40 isn't much for scanlations.)

MassaRoddel
Arvid Norberg
2006-04-09 21:44:51 UTC
Post by M***@massaroddel.de
Post by Arvid Norberg
There has been some discussion on the IRC channel about benefits and
possible cache algorithms for a disk cache in libtorrent. I'm posting
this hoping that somebody could comment on some of the techniques or
assumptions.
One thing that we did not discuss in irc channel is the open file
limit. I can imagine that it
has a bad effect on the os's caching if the files get often opened and closed because the
torrent(s) have to many files. (i.e. 40 isn't much for scanlations)
I don't think the OS's cache is flushed just because a file is
closed. At least my experience on Windows XP and OS X 10.4 suggests
that there is no measurable speed difference between opening and
closing the file for each piece that is read, and just leaving the
file open. Except when you have (crappy?) anti-virus software that
hooks file close and scans the file's contents.

--
Arvid Norberg
Radu Hociung
2006-04-10 18:53:21 UTC
Hello Arvid, and all,

I think caching at the application level can be far more efficient than
at the OS level, as the application can typically exploit the
characteristics of the dataflow, whereas the OS cannot. Please see below.

I will use "cheap data" and "expensive data" to refer to in-buffer and
on-disk data respectively.

The theory of Nash equilibrium applies very much to torrents, I think.
Downloading torrents is a non-cooperative game, in that each client is
concerned only with ensuring its own success, and the strategy it must
employ is to maximize the chance of success while minimizing the cost
of achieving it.

Please see below for my comments on caching.
Post by Arvid Norberg
There has been some discussion on the IRC channel about benefits and
possible cache algorithms for a disk cache in libtorrent. I'm posting
this hoping that somebody could comment on some of the techniques or
assumptions.
1. seeding on a high speed connection will make the hard drive seek a
lot back and forth, just to do short reads.
2. downloading on a high speed connection will make the hard drive seek
a lot back and forth.
For 1, one assumption that can be made is that if block 0 is requested
from a piece, it's very likely that many more blocks from that piece
will be requested soon. So caching the entire piece at once would
probably be a good idea.
This is exactly what a normal OS' disk cache does. It reads ahead into
the cache. So I don't really see any benefit if caching this at
application level. One possible optimization would be if the disk cache
could be bypassed so that only one piece was cached (since the disk
cache in the OS may cache more than the piece we're reading from). This
would be hard to do, and especially hard to do in a platform
independent manner.
Is this a reasonable assumption?
If the application is able to fill the available outgoing bandwidth with
cheap data, there is no benefit to reading more/other data from the
disk, is there?

I think it would be a good idea for the library to prefer to respond to
requests for cheap data, and postpone responses to requests for
expensive data until the utilization of the available outgoing bandwidth
drops. Clients that request expensive data should be choked until cheap
data can no longer fill the outgoing pipe, and it becomes necessary to
read expensive data to fill the pipe.
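
A rough sketch of that policy (hypothetical code, not libtorrent's request
handling; in_buffer(), upload_link_saturated() and send_block() are
placeholders):

#include <deque>

struct peer_request { int piece; int offset; int length; };

bool in_buffer(peer_request const&) { return false; }  // placeholder: data still in RAM?
bool upload_link_saturated() { return false; }         // placeholder
void send_block(peer_request const&) {}                // placeholder

void service(std::deque<peer_request>& queue)
{
    // cheap data first: answer every request that hits the in-memory buffers
    for (auto it = queue.begin(); it != queue.end();)
    {
        if (in_buffer(*it)) { send_block(*it); it = queue.erase(it); }
        else ++it;                          // postpone expensive (on-disk) data
    }
    // only go to the disk when cheap data could not fill the pipe
    while (!upload_link_saturated() && !queue.empty())
    {
        send_block(queue.front());
        queue.pop_front();
    }
}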
Post by Arvid Norberg
Number 2 is a little bit more interesting though. Since every block
that is downloaded has to be written to disk, and since every block is
only downloaded once, you cannot really optimize away disk writes. What
you can do though, is to optimize the disk seeking.
Absolutely. Also, since the client also uploads while downloading, it
would be more disk-efficient to seek only for disk writes, and never for
disk reads. I.e., the client should prefer to respond to requests for
data it has freshly downloaded, instead of reading other data from disk.
Assuming that the client downloads least-common data first, it's a
reasonable assumption that the data it holds in memory buffers is still
rare to other clients.

In this case, the library should write to disk when downloaded pieces
are complete, but do not free the buffers until other clients have
downloaded them enough to cause the uplink utilization to drop.

I have thought about optimizing the dataflow a lot. I believe there are
at least two good reasons for improving caching of torrents:

1. Fewer seeks == less noise == drives lasting longer.
2. Fewer seeks == the drive is more responsive to other apps that run
concurrently.

I believe the on-drive cache is better than nothing, O/S cache is better
than the on-drive cache, and application cache is better than O/S cache.
The closer the cache is to the data generation/consumption algorithm,
the more information is available for optimizing it.

I think this approach to caching gives an answer to multiple-torrent
caching as well.


One more thing:

I have considered working on a test-bench that is able to run a swarm.
The testbench would be able to report disk-access efficiency by
measuring some metrics, like how far apart the written pieces are on
disk (i.e., seek throw), the total number of accesses, etc. The idea is
that the minimum number of accesses and long seeks needed to fully
download a file, while saturating the output link, can be calculated
empirically. The metrics given by the testbench could be compared to
that ideal, and a measure of client efficiency would thus be obtained.
(Naturally, min-long-seeks = 0, and min-accesses = file-size/buffer-size.)
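
A sketch of the seek-throw metric I have in mind (hypothetical, not the
actual testbench code): given the sequence of (offset, size) accesses a
client makes to a file, sum up the distance the head has to travel.

#include <cstdint>
#include <vector>

struct file_access { std::int64_t offset; std::int64_t size; };

std::int64_t total_seek_throw(std::vector<file_access> const& trace)
{
    std::int64_t distance = 0;
    std::int64_t head = 0;
    for (file_access const& a : trace)
    {
        distance += (a.offset > head) ? a.offset - head : head - a.offset;
        head = a.offset + a.size;   // where a sequential access leaves the head
    }
    return distance;                // 0 for a perfectly sequential trace
}

Efficiency could then be reported as min-accesses / trace.size(), with
min-accesses = file-size / buffer-size as above.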

The test-bench would hook O/S calls in a manner similar to strace, and
would be able to run multiple makes/brands/versions of clients, such
that swarm behaviour can be observed. This would be very useful to study
potential protocol changes and their effect on older-version clients.

I am very interested in your thoughts and comments on caching,

Regards,
Radu.
Arvid Norberg
2006-04-11 10:09:12 UTC
Post by Radu Hociung
Hello Arvid, and all,
I think caching at the application level can be far more efficient than
at the O/S level, as the application typically can manage
characteristics of the dataflow, whereas the O/S cannot. Please see below.
Yes, this is true in general. My concern is that the characteristics
of the dataflow in the bittorrent case are very close to completely
random. Talking about the request pattern, the only exception to
randomness is the rarest-first algorithm. But this algorithm will
also make sure that the distribution among the peers is as uniform as
possible at all times, keeping the request pattern as random as possible.

It is true though, that in the case where the swarm has tendencies of
being "serialized", it is very common to request pieces which were
just made available from a peer. But I'm not sure that is a very
common case. Obviously some data is needed here.
Post by Radu Hociung
I will use "cheap data" and "expensive data" to refer to in-buffer and
on-disk data respectively.
The theory of Nash equilibrium applies very much to torrents, I think.
Downloading torrents is a non-cooperative game, in that each client is
concerned with ensuring only its own success, and the strategy it must
employ is of maximizing the chance for success while minimizing its cost
of achieving success.
true.
Post by Radu Hociung
Please see below for my comments on caching.
Post by Arvid Norberg
There has been some discussion on the IRC channel about benefits and
possible cache algorithms for a disk cache in libtorrent. I'm posting
this hoping that somebody could comment on some of the techniques or
assumptions.
1. seeding on a high speed connection will make the hard drive seek a
lot back and forth, just to do short reads.
2. downloading on a high speed connection will make the hard
drive seek
a lot back and forth.
For 1, one assumption that can be made is that if block 0 is
requested
from a piece, it's very likely that many more blocks from that piece
will be requested soon. So caching the entire piece at once would
probably be a good idea.
This is exactly what a normal OS' disk cache does. It reads ahead
into
the cache. So I don't really see any benefit if caching this at
application level. One possible optimization would be if the disk
cache
could be bypassed so that only one piece was cached (since the disk
cache in the OS may cache more than the piece we're reading
from). This
would be hard to do, and especially hard to do in a platform
independent manner.
Is this a reasonable assumption?
If the application is able to fill the available outgoing bandwidth with
cheap data, there is no benefit to reading more/other data from the
disk, is there?
I think it would be a good idea for the library to prefer to
respond to
requests for cheap data, and postpone responses to requests for
expensive data until the utilization of the available outgoing
bandwidth
drops. Clients that request expensive data should be choked until cheap
data can no longer fill the outgoing pipe, and it becomes necessary to
read expensive data to fill the pipe.
The client isn't completely free to choose who to upload to or whose
requests to respond to. In order to maximize download speed, it has
to upload to those that it can download from. I also think that the
most common case is that the output pipe is very seldom saturated.
So I'm not sure that weighting peers depending on the pieces they
request would work very well; it would affect the behavior of the
client in other ways.

Your distinction of cheap and expensive data doesn't take seek
distance into account. I think that may be the most important thing
to concentrate on, at least when it comes to writing.
Post by Radu Hociung
Post by Arvid Norberg
Number 2 is a little bit more interesting though. Since every block
that is downloaded has to be written to disk, and since every
block is
only downloaded once, you cannot really optimize away disk
writes. What
you can do though, is to optimize the disk seeking.
Absolutely. Also, since while downloading, the client also uploads, it
would be more disk efficient to only seek for disk writes, but
never for
disk reads. Ie, the client should prefer to respond to requests for data
is has freshly downloaded, instead of reading other data from disk.
Assuming that the client downloads least-common data first, it's a
reasonable assumtion that the data it holds in memory buffers is still
rare to other clients.
Yes, at least to a certain degree. One thing to keep in mind though,
is that the number of pieces in a torrent is generally much greater
than the number of peers that a client can see. How rare a piece is is
quantized to the number of peers that have it. This means that the
number of equally rare pieces is usually very high, far higher than a
write cache would hold.
Post by Radu Hociung
In this case, the library should write to disk when downloaded pieces
are complete, but do not free the buffers until other clients have
downloaded them enough to cause the uplink utilization to drop.
I have thought about optimizing the dataflow a lot. I believe there are
1. Fewer seeks == less noise == drives lasting longer.
2. Fewer seeks == drive is more responsive to other apps that run
concurently.
Definitely, I think the disk performance is very important.

I might add that there's a huge difference for me (OS X 10.4) between
downloading in full allocation mode and in compact allocation mode.
Compact allocation mode has the effect that most reads and writes are
kept relatively close to each other (they will spread out over time,
of course). Just downloading with client_test will saturate my
connection at slightly above 1 MB/s if I'm in compact mode. In full
allocation mode though, I will not get above about 600 kB/s, and my
drive really makes a noise and slows the computer down until it is
virtually useless.
Post by Radu Hociung
I believe the on-drive cache is better than nothing, O/S cache is better
than the on-drive cache, and application cache is better than O/S cache.
The closer the cache is to the data generation/consumption algorithm,
the more information is available for optimizing it.
I think this approach to caching gives an answer to multiple-torrent
caching as well.
What do you mean? Do you think there would be a point in having a
shared cache for all torrents?

One benefit would be that if you have 100 torrents but only 2 of them
are active, they will get all the cache for themselves.
Post by Radu Hociung
I have considered working on a test-bench that is able to run a swarm.
The testbench would be able to report disk-access efficiency, by
measuring some metrics, like how far apart the written pieces are in
memory (ie, seek throw), total number of accesses, etc. The idea is that
it can be calculated empirically what the minimum number of accesses,
and long seeks is to fully download a file, while saturating the output
link. The metrics given by the testbench could be compared to the ideal,
and a measure of client efficiency would thus be obtained. (Naturally,
min-long-seeks = 0, and min-accesses = file-size/buffer-size)
The test-bench would hook O/S calls in a manner similar to strace, and
would be able to run multiple makes/brands/versions of clients, such
that swarm behaviour can be observed. This would be very useful to study
potential protocol changes and their effect on older-version clients.
That sounds like it could be very useful!
Post by Radu Hociung
I am very interested in your thoughts and comments on caching,
Ditto :)

So, I think my first attempt will be to cache pieces, and only write
whole pieces to disk instead of blocks, to see if the performance is
improved. I don't know how to measure it though; maybe I could try
saturating my download link in full allocation mode and see if it
increases.
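
Roughly what I have in mind (a sketch, not the final implementation):
keep the 16 kB blocks of each incomplete piece in RAM and issue a single
contiguous write once the whole piece has arrived and passed the hash
check.

#include <algorithm>
#include <map>
#include <vector>

struct piece_buffer
{
    std::vector<char> data;      // piece_size bytes
    int blocks_received = 0;
};

std::map<int, piece_buffer> incomplete;    // piece index -> buffered blocks

void on_block(int piece, int offset, char const* buf, int len,
    int piece_size, int blocks_in_piece)
{
    piece_buffer& p = incomplete[piece];
    if (p.data.empty()) p.data.resize(piece_size);
    std::copy(buf, buf + len, p.data.begin() + offset);
    if (++p.blocks_received == blocks_in_piece)
    {
        // hash-check p.data, then one write() at offset piece * piece_size
        incomplete.erase(piece);
    }
}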

Do you have any other suggestions that are reasonably easy to try out?

--
Arvid Norberg
Radu Hociung
2006-04-11 19:56:33 UTC
Post by Radu Hociung
Hello Arvid, and all,
I think caching at the application level can be far more efficient than
at the O/S level, as the application typically can manage
characteristics of the dataflow, whereas the O/S cannot. Please see
below.
Yes, this is true in general, my concern is that the characteristics of
the dataflow in the bittorrent case is very close to completely random.
Talking about the request pattern, the only exception to randomness is
the rarest first-algorithm. But this algorithm will also make sure that
the distribution among the peers is as uniform as possible at all time,
keeping the request pattern as random as possible.
It is true though, that in the case where the swarm has tendencies of
being "serialized", it is very common to request pieces which were just
made available from a peer. But I'm not sure that is a very common
case. Obviously some data is needed here.
The serialized downloads make sense in at least two scenarios:

a. well-seeded torrents, and/or
b. applications where it is desirable to start playback as soon as
possible after the download starts. Obviously this makes sense mainly
for media-type torrents.

You're right; without more data, I find it hard to agree that randomness
is a good thing. In the case of a well-established torrent (i.e.,
long-lived, with aggregate upload capacity balanced against aggregate
download capacity), the arrival time of new clients is random, and even
though their individual download patterns could be serialized, the
request pattern seen by individual seeds would be random.

But I think this is a topic big enough to deserve its own thread/research.

I will use the good/bad/perfect torrent taxonomy defined at
http://azureus.aelitis.com/wiki/index.php/Good_Torrents and probably
elsewhere too.
Post by Radu Hociung
I will use "cheap data" and "expensive data" to refer to in-buffer and
on-disk data respectively.
The theory of Nash equilibrium applies very much to torrents, I think.
Downloading torrents is a non-cooperative game, in that each client is
concerned with ensuring only its own success, and the strategy it must
employ is of maximizing the chance for success while minimizing its cost
of achieving success.
true.
Post by Radu Hociung
Please see below for my comments on caching.
Post by Arvid Norberg
There has been some discussion on the IRC channel about benefits and
possible cache algorithms for a disk cache in libtorrent. I'm posting
this hoping that somebody could comment on some of the techniques or
assumptions.
1. seeding on a high speed connection will make the hard drive seek a
lot back and forth, just to do short reads.
2. downloading on a high speed connection will make the hard drive
seek
a lot back and forth.
For 1, one assumption that can be made is that if block 0 is requested
from a piece, it's very likely that many more blocks from that piece
will be requested soon. So caching the entire piece at once would
probably be a good idea.
This is exactly what a normal OS' disk cache does. It reads ahead into
the cache. So I don't really see any benefit if caching this at
application level. One possible optimization would be if the disk
cache
could be bypassed so that only one piece was cached (since the disk
cache in the OS may cache more than the piece we're reading from).
This
would be hard to do, and especially hard to do in a platform
independent manner.
Is this a reasonable assumption?
If the application is able to fill the available outgoing bandwidth with
cheap data, there is no benefit to reading more/other data from the
disk, is there?
I think it would be a good idea for the library to prefer to respond to
requests for cheap data, and postpone responses to requests for
expensive data until the utilization of the available outgoing bandwidth
drops. Clients that request expensive data should be choked until cheap
data can no longer fill the outgoing pipe, and it becomes necessary to
read expensive data to fill the pipe.
The client isn't completely free of choosing who to upload to/whose
requests to respond do. In order to maximize download speed, it has to
upload to those that it can download from. I also think that the most
common case is that the output pipe very seldom is saturated. So, I'm
not sure weighting peers depending on the pieces they request would
work very well, it would affect the behavior of the client in other ways.
Your distinction of cheap and expensive data doesn't take seek distance
into account. I think that may be the most important thing to
concentrate on, at least when it comes to writing.
I thought your number 1 aim deals with seeds on high-speed links, hence
no downloading.

Since there are different scenarios with distinct properties, maybe the
caching behaviour could be altered as necessary. The 4 scenarios I'm
thinking of are:

a. seeding with the outgoing pipe saturated
b. seeding with the outgoing pipe underutilized
c. downloading from a torrent rated "good" or better
d. downloading from a torrent rated "acceptable" or worse

The key difference between a-b is that the data being requested is
likely available cheaper from another seed.

The key difference between c-d is that a random dl pattern is
unnecessary in c.
Post by Radu Hociung
Post by Arvid Norberg
Number 2 is a little bit more interesting though. Since every block
that is downloaded has to be written to disk, and since every block is
only downloaded once, you cannot really optimize away disk writes.
What
you can do though, is to optimize the disk seeking.
Absolutely. Also, since while downloading, the client also uploads, it
would be more disk efficient to only seek for disk writes, but never for
disk reads. Ie, the client should prefer to respond to requests for data
is has freshly downloaded, instead of reading other data from disk.
Assuming that the client downloads least-common data first, it's a
reasonable assumtion that the data it holds in memory buffers is still
rare to other clients.
Yes, at least to a certain degree. One thing to keep in mind though, is
that the number of pieces in a torrent is generally much more than the
number of peers that a client can see. How rare a piece is is quantized
to the number of peers that has it. This means that the number of
equally rare pieces is usually very high, far higher than a write cache
would be.
I think you're making the (healthy) assumption that a client would cache
only one, or very few, pieces for upload? This is very good for lowering
the memory footprint of the client.

What I am thinking is that as soon as a client advertises a completed
piece, many swarm members would request it immediately, instead of
requesting other pieces. The just-downloaded piece is cheap data, and
the downloading client would prioritize uploading it.

In an ideal swarm, each piece in a torrent would be read from a disk
exactly once, and written to disk exactly N-1 times (for N clients in
the swarm). A torrent that is well seeded ("good" or better) should be
able to approach this ideal, without risking the availability of any
piece to be so low as to endanger the torrent. For a torrent that is
"acceptable" or worse, the rarest-first approach may be given a higher
preference than the ideal.
Post by Radu Hociung
In this case, the library should write to disk when downloaded pieces
are complete, but do not free the buffers until other clients have
downloaded them enough to cause the uplink utilization to drop.
I have thought about optimizing the dataflow a lot. I believe there are
1. Fewer seeks == less noise == drives lasting longer.
2. Fewer seeks == drive is more responsive to other apps that run
concurently.
Definitely, I think the disk perfomance is very important.
I might add that there's a huge difference for me (OSX 10.4) when
downloading in full allocation mode and compact allocation mode. The
compact allocation mode will have the effect that most reads and writes
are kept relatively close to each other. They will spread out over time
of course though. But just downloading with client_test will saturate
my connection at slightly above 1 MB/s if I'm in compact mode. In full
allocation mode though, I will not get above about 600 kB/s, and my
drive is really making a noise and slowing down the computer to be
virtually useless.
Fascinating! This would seem to support compact/serialized downloading.
Post by Radu Hociung
I believe the on-drive cache is better than nothing, O/S cache is better
than the on-drive cache, and application cache is better than O/S cache.
The closer the cache is to the data generation/consumption algorithm,
the more information is available for optimizing it.
I think this approach to caching gives an answer to multiple-torrent
caching as well.
What do you mean? Do you think there would be a point of having a
shared cache for all torrents?
One benefit would be that if you have 100 torrents but only 2 of them
are active, they will get all the cache for themselves.
This, and more. Each client could quantify the quality of the torrents
it serves, and even choose to (temporarily) not service well-seeded
torrents, while allocating upload bandwidth and cache to the
less-healthy torrents. In this case, a torrent client serving a large
collection of torrents (50+) could do so by dynamically choosing which
torrents to serve now, and which are safe to download. A possible
criterion for serving a torrent would be "always serve only the worst
half of the torrents". It is very likely that the client would have no
trouble maxing its upload pipe on bad torrents (thus healing them the
fastest). This approach somewhat ignores the download needs.

A slightly better approach would be "serve the single worst torrent,
with max cache, plus the worst torrent which allows me to max my
download capacity". I.e., if we have for instance 50 torrents, numbered
in order of their quality, with 1 being seeds-only and 50 being mostly
leeches, the client would pick torrent 50 to dedicate its cache to, and
torrent 21 for maxing its download. (Say that on torrent 50 we get a 1/6
dl/ul speed ratio, while on torrent 21, which has many seeds, we get
4/1, and the two add up such that both the upload and download
bandwidths are maximized.) In this case, there are many disk writes
caused by torrent 21, but no disk reads, and very few disk writes from
torrent 50, and no reads. The heads will rarely seek to the torrent 50
file, but stay on the torrent 21 file most of the time. Given that most
of the cache is allocated to torrent 50 and that it is downloading
serialized, when it needs to write to disk, it writes infrequently and
in big chunks. Seek time is thus minimized as well.
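
A loose sketch of that selection rule (hypothetical, no real API implied):
rank the torrents from least to most healthy, dedicate the cache and
upload to the least healthy one, and download from the least healthy
torrent that can still saturate the download link.

#include <algorithm>
#include <vector>

struct torrent_state { int id; double health; double expected_dl_rate; };

// assumes t is non-empty
void pick_torrents(std::vector<torrent_state> t, double dl_capacity,
    int& serve_id, int& download_id)
{
    std::sort(t.begin(), t.end(),
        [](torrent_state const& a, torrent_state const& b)
        { return a.health < b.health; });

    serve_id = t.front().id;       // worst torrent gets the cache and upload
    download_id = t.front().id;
    for (torrent_state const& x : t)
        if (x.expected_dl_rate >= dl_capacity) { download_id = x.id; break; }
}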
Post by Radu Hociung
I have considered working on a test-bench that is able to run a swarm.
The testbench would be able to report disk-access efficiency, by
measuring some metrics, like how far apart the written pieces are in
memory (ie, seek throw), total number of accesses, etc. The idea is that
it can be calculated empirically what the minimum number of accesses,
and long seeks is to fully download a file, while saturating the output
link. The metrics given by the testbench could be compared to the ideal,
and a measure of client efficiency would thus be obtained. (Naturally,
min-long-seeks = 0, and min-accesses = file-size/buffer-size)
The test-bench would hook O/S calls in a manner similar to strace, and
would be able to run multiple makes/brands/versions of clients, such
that swarm behaviour can be observed. This would be very useful to study
potential protocol changes and their effect on older-version clients.
That sounds like it could be very useful!
I wonder if there would be any volunteers to donate some time to help
with the cause.
Post by Radu Hociung
I am very interested in your thoughts and comments on caching,
Ditto :)
So, I think my first attempt will be to cache pieces, and only write
whole pieces instead of blocks to disk. To see if the performance is
improved. I don't know how to measure it though, maybe I could try
saturating my download link in full allocation mode and see if it
increases.
But would this not increase the memory footprint of the application,
seeing as you need as many piece buffers as there are pieces being
downloaded in parallel, while none of these buffers can be used to
upload from, since their data cannot be checksummed until the piece has
finished downloading?
Do you have any other suggestions that are reasonably easy to try out?
I think the first thing to do in studying the effects of libtorrent
caching would be to disable the O/S cache, in order to make comparisons
between versions relevant.

It would be interesting to find out how much riskier it is for a client
to serialize its downloads on poor torrents. Obviously it is safest to
do rarest-first on such torrents, but I am curious as to the difference
between rarest-first and serialized.

Also, we should not forget of the relationship between memory footprint
and caching effectiveness. It's easy to get more effective by allowing
increased memory use, but we should resist increased memory use when
possible.

How are you able to get 1 MB/s? Do you have torrents running on a test
network? Are the experiments reproducible? Would you mind describing
your test setup?

Greetings,
Radu.
Radu Hociung
2006-04-13 23:53:18 UTC
Greetings all,

I have done some experiments on libtorrent and Transmission, comparing
the effectiveness of their client-side caching and download scheduling
algorithms.

I put the results (with graphs) on Wikipedia at:

http://en.wikipedia.org/wiki/BitTorrent_performance

I hope that you'll find it a worthwhile read.

Regards,
Radu.
Post by Radu Hociung
Hello Arvid, and all,
I think caching at the application level can be far more efficient than
at the O/S level, as the application typically can manage
characteristics of the dataflow, whereas the O/S cannot. Please see
below.
Arvid Norberg
2006-04-14 02:06:11 UTC
Post by Radu Hociung
Greetings all,
I have done some experiments on libtorrent and Transmission, comparing
the effectiveness of their client-side caching and download scheduling
algorithms.
http://en.wikipedia.org/wiki/BitTorrent_performance
I hope that you'll find it a worthwhile read.
Very nice graphs :)

Caching blocks and only writing whole pieces at once would definitely
improve that measurement. But I'm not sure it would improve things in
reality (it probably would, a bit at least), since the OS's caching is
very likely done in the kernel, which means that you don't know how
many of those write calls actually resulted in a write to the disk.

--
Arvid Norberg
Radu Hociung
2006-04-14 04:27:18 UTC
Post by Arvid Norberg
Post by Radu Hociung
Greetings all,
I have done some experiments on libtorrent and Transmission, comparing
the effectiveness of their client-side caching and download scheduling
algorithms.
http://en.wikipedia.org/wiki/BitTorrent_performance
I hope that you'll find it a worthwhile read.
Very nice graphs :)
Caching blocks and only write whole pieces at once would definitely
improve that measurement. But I'm not sure it would improve in reality
(it probably would, a bit at least), since the OS's cache very likely is
done in the kernel, which means that you don't know how many of those
write-calls actually resulting in a write to the disk.
I doubt that writing whole pieces would make much difference. Looking at
libtorrent's current download pattern, and at the strace, it looks like
it writes the blocks of a piece in the proper order. So they probably
get assembled into the respective pieces in the OS's cache. If they were
being downloaded out of order, I would see a case for caching whole pieces.

However, you're right about not knowing when the writes to disk
actually occur:

Say that the kernel starts off with an empty cache and 100MB of free RAM.
When downloading a 5GB file, after the first 100MB are received, the OS
will have to start writing. It's reasonable that it will start flushing
the oldest cache slots first, so it is likely that after the first 100MB
are received, it will start writing the data it received first.

In the case of some advanced filesystems (such as XFS, which buffers
data as long as possible and allocates storage at the last possible
moment, when flushing buffers becomes inevitable), it is possible that
the cache won't always flush the first-received data, but that it will
try to group buffers based on their relative closeness.


But in any case, when the file size is much larger than the available
cache memory, it is inevitable that the disk heads will jump around.

In my test setup, I only had 1 seed and 1 client, so to the client, all
the pieces it didn't already have were equally rare in the 'swarm', and
the random download pattern really did not have any purpose.


I was thinking of another way of scheduling downloads. With 100MB of
cache, the client could "focus" on downloading 100MB chunks of the
torrent, and select the rarest pieces first within that 100MB chunk
before moving on to another chunk. The chunk could be a sliding window
or some such. Before moving the sliding window, the client could do a
'sync' to force the data to be flushed. I would assume that on sync,
the data would be written in the most contiguous way possible.
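
A sketch of that picker (hypothetical, not libtorrent's piece picker):
restrict the rarest-first choice to a window of pieces that fits in the
cache, and sync and slide the window forward once every piece in it is
done.

#include <algorithm>
#include <vector>

// availability[i] = number of peers that have piece i, or -1 once we have it
int pick_piece(std::vector<int> const& availability,
    int window_start, int window_size)
{
    int best = -1;
    int window_end = std::min(window_start + window_size,
        int(availability.size()));
    for (int i = window_start; i < window_end; ++i)
    {
        if (availability[i] < 0) continue;                  // already have it
        if (best == -1 || availability[i] < availability[best]) best = i;
    }
    return best;  // -1: the window is complete, sync() and slide it forward
}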

I plan on experimenting with the scheduling routines of either
libtorrent or transmission to test out the various hypotheses.

The more I play around, the more convinced I become that it's the
scheduling algorithm that can bring the most benefit, not the caching
algorithm. After all, in a case where most pieces are written to the
disk once and never read back, the cache acts as little more than a
simple delayed-write buffer, whether it is an application cache or an
OS cache.

BTW: Why are there 2 duplicate seeks before every write and 3 before
reads? Here's an example:

21561 write(6, /* head at 4161536 effectiveness 61.077 */0x8083e31,
16384) = 16384
21561 _llseek(6, 4177920, [4177920], /* BT efficiency 61.077% */
SEEK_SET) = 0
21561 _llseek(6, 0, [4177920], /* BT efficiency 61.077% */ SEEK_CUR) = 0
21561 write(6, /* head at 4177920 effectiveness 61.111 */0x8083e31,
16384) = 16384
21561 _llseek(6, 3145728, [3145728], /* BT efficiency 61.111% */
SEEK_SET) = 0
21561 _llseek(6, 0, [3145728], /* BT efficiency 61.111% */ SEEK_CUR) = 0
21561 _llseek(6, 0, [3145728], /* BT efficiency 61.111% */ SEEK_CUR) = 0
21561 read(6, /* head at 4194304 effectiveness 60.000 */0x404f4008,
1048576) = 1048576
21561 _llseek(6, 4194304, [4194304], /* BT efficiency 60.000% */
SEEK_SET) = 0
21561 _llseek(6, 0, [4194304], /* BT efficiency 60.000% */ SEEK_CUR) = 0
21561 write(6, /* head at 4194304 effectiveness 61.905 */0x404f4008,
1048576) = 1048576



Regards,
Radu.
Cory Nelson
2006-04-14 08:58:24 UTC
Post by Radu Hociung
Post by Arvid Norberg
Post by Radu Hociung
Greetings all,
I have done some experiments on libtorrent and Transmission, comparing
the effectiveness of their client-side caching and download scheduling
algorithms.
http://en.wikipedia.org/wiki/BitTorrent_performance
I hope that you'll find it a worthwhile read.
Very nice graphs :)
Caching blocks and only write whole pieces at once would definitely
improve that measurement. But I'm not sure it would improve in reality
(it probably would, a bit at least), since the OS's cache very likely is
done in the kernel, which means that you don't know how many of those
write-calls actually resulting in a write to the disk.
I doubt that writing whole pieces would make much difference. Looking at
libtorrent's current download pattern, and at the strace, it looks like
it writes the blocks of a piece in the proper order. So they probably
get assembled into the respective pieces in the OS's cache. If they were
being downloaded out of order, I would see a case for caching whole pieces.
However, you're right, about not knowing when the writes to disk
Say that the kernel starts off with an empty cache and 100MB free RAM.
When downloading a 5GB file, after the 1st 100MB are received, the OS
will have to start writing. It's reasonable that it will start flushing
the oldest cache slots first, so it is likely that after the fist 100MB
received, it will start writing the data it received first.
In the case of some advanced filesystems (such as XFS, which buffers
data as long as possible and allocates storage at the last possible
moment, when flushing buffers becomes innevitable), it is possible that
the cache won't always flush the first-received data, but it may try to
group buffers based on their relative closeness.
But in any case, when the file size is much larger than the available
cache memory, it is inevitable that the disk heads will jump around.
In my test setup, I only had 1 seed and 1 client, so to the client, all
the pieces it didn't already have were equally rare in the 'swarm', and
the random download pattern really did not have any purpose.
I was thinking of another way of scheduling downloads. With 100MB of
cache, the client could "focus" on downloading 100MB chunks of the
torrent, and select the rarest first within that 100MB chunk before
moving on to another chunk. the chunk could be a sliding window or such.
Before moving the sliding window, the client could do a 'sync' to force
data to be flushed. I would assume that on sync, the data would be
written in the most contiguous way possible.
I plan on experimenting with the scheduling routines of either
libtorrent or transmission to test out the various hypotheses.
The more I play around, the more I convince myself that it's the
scheduling algorithm that can bring the most benefits, not the caching
algorithm. After all, in a case where most pieces are written to the
disk once, and never read back, the cache acts as little more than a
simple delayed-write buffer, whether it is an application cache or an OS
cache.
BTW: Why are there 2 duplicate seeks before every write and 3 before
21561 write(6, /* head at 4161536 effectiveness 61.077 */0x8083e31,
16384) = 16384
21561 _llseek(6, 4177920, [4177920], /* BT efficiency 61.077% */
SEEK_SET) = 0
21561 _llseek(6, 0, [4177920], /* BT efficiency 61.077% */ SEEK_CUR) = 0
21561 write(6, /* head at 4177920 effectiveness 61.111 */0x8083e31,
16384) = 16384
21561 _llseek(6, 3145728, [3145728], /* BT efficiency 61.111% */
SEEK_SET) = 0
21561 _llseek(6, 0, [3145728], /* BT efficiency 61.111% */ SEEK_CUR) = 0
21561 _llseek(6, 0, [3145728], /* BT efficiency 61.111% */ SEEK_CUR) = 0
21561 read(6, /* head at 4194304 effectiveness 60.000 */0x404f4008,
1048576) = 1048576
21561 _llseek(6, 4194304, [4194304], /* BT efficiency 60.000% */
SEEK_SET) = 0
21561 _llseek(6, 0, [4194304], /* BT efficiency 60.000% */ SEEK_CUR) = 0
21561 write(6, /* head at 4194304 effectiveness 61.905 */0x404f4008,
1048576) = 1048576
Regards,
Radu.
Windows has a way to control how much the OS should cache. This
amount has a very conservative default on client versions - I'm
curious to see how good an application-level cache would be if the
OS-level cache was bumped up. Having libtorrent use gobs of memory in
a useless attempt to out-cache the kernel is certainly undesirable.

I imagine all modern operating systems optimize outstanding
asynchronous disk I/O calls so that they are completed with the least
amount of seeking - something which libtorrent doesn't use, last I
checked, but could certainly benefit from, even if only for the CPU
usage benefits.
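
For reference, a minimal POSIX aio write, one of the asynchronous I/O
interfaces meant here (Windows would use overlapped I/O instead). This is
a sketch only, not how libtorrent issues its writes today:

#include <aio.h>
#include <sys/types.h>
#include <cstring>

int async_write(int fd, void const* buf, std::size_t len, off_t offset,
    aiocb& cb)
{
    std::memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = const_cast<void*>(buf);
    cb.aio_nbytes = len;
    cb.aio_offset = offset;
    // returns immediately; the kernel is free to reorder the outstanding
    // requests, e.g. by offset, to reduce seeking
    return aio_write(&cb);
}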

I'm also curious as to how much an app-level cache really benefits the
user. Either you have a low amount of traffic that can be easily
cached by the OS, or you have a high amount of traffic which neither
cache can keep up with without using a lot of memory.

In the worst case an application-level cache might be paged out to the
swap file and cause even more disk thrashing, something that would
never happen with an OS cache.

IMHO this definitely needs more real-world testing!

--
Cory Nelson
http://www.int64.org
Arvid Norberg
2006-04-14 10:32:13 UTC
Post by Radu Hociung
Post by Arvid Norberg
Post by Radu Hociung
Greetings all,
I have done some experiments on libtorrent and Transmission,
comparing
the effectiveness of their client-side caching and download
scheduling
algorithms.
http://en.wikipedia.org/wiki/BitTorrent_performance
I hope that you'll find it a worthwhile read.
Very nice graphs :)
Caching blocks and only write whole pieces at once would definitely
improve that measurement. But I'm not sure it would improve in reality
(it probably would, a bit at least), since the OS's cache very likely is
done in the kernel, which means that you don't know how many of those
write-calls actually resulting in a write to the disk.
I doubt that writing whole pieces would make much difference.
Looking at
libtorrent's current download pattern, and at the strace, it looks like
it writes the blocks of a piece in the proper order. So they probably
get assembled into the respective pieces in the OS's cache. If they were
being downloaded out of order, I would see a case for caching whole pieces.
I thought you downloaded a "wild" torrent, with more than one peer.
If you had done that, you would have noticed that the write operations
were scattered more. Because libtorrent writes each block to disk, and
even though blocks within a piece are downloaded sequentially, there
would still be more seeking.

At high speeds libtorrent prefers to download whole pieces from
single peers, which means block writes to different pieces would be
interleaved.
Post by Radu Hociung
However, you're right, about not knowing when the writes to disk
Say that the kernel starts off with an empty cache and 100MB free RAM.
When downloading a 5GB file, after the 1st 100MB are received, the OS
will have to start writing. It's reasonable that it will start
flushing
the oldest cache slots first, so it is likely that after the fist 100MB
received, it will start writing the data it received first.
In the case of some advanced filesystems (such as XFS, which buffers
data as long as possible and allocates storage at the last possible
moment, when flushing buffers becomes innevitable), it is possible that
the cache won't always flush the first-received data, but it may try to
group buffers based on their relative closeness.
Maybe it's possible to get timestamps in the trace log. Then it would
be possible to estimate when actual writes were done.
Post by Radu Hociung
But in any case, when the file size is much larger than the available
cache memory, it is inevitable that the disk heads will jump around.
In my test setup, I only had 1 seed and 1 client, so to the client, all
the pieces it didn't already have were equally rare in the 'swarm', and
the random download pattern really did not have any purpose.
I was thinking of another way of scheduling downloads. With 100MB of
cache, the client could "focus" on downloading 100MB chunks of the
torrent, and select the rarest first within that 100MB chunk before
moving on to another chunk. the chunk could be a sliding window or such.
Before moving the sliding window, the client could do a 'sync' to force
data to be flushed. I would assume that on sync, the data would be
written in the most contiguous way possible.
I plan on experimenting with the scheduling routines of either
libtorrent or transmission to test out the various hypotheses.
The more I play around, the more I convince myself that it's the
scheduling algorithm that can bring the most benefits, not the caching
I'm a bit reluctant to change the random distribution + rarest-first
algorithm, because it is much more efficient at keeping a torrent
alive than if chunks were downloaded. (By alive I mean that the number
of distributed copies is >= 1.)
Post by Radu Hociung
algorithm. After all, in a case where most pieces are written to the
disk once, and never read back, the cache acts as little more than a
simple delayed-write buffer, whether it is an application cache or an OS
cache.
That is why I haven't made any real effort in implementing a cache in
libtorrent so far.
Post by Radu Hociung
BTW: Why are there 2 duplicate seeks before every write and 3 before
21561 write(6, /* head at 4161536 effectiveness 61.077 */0x8083e31,
16384) = 16384
21561 _llseek(6, 4177920, [4177920], /* BT efficiency 61.077% */
SEEK_SET) = 0
21561 _llseek(6, 0, [4177920], /* BT efficiency 61.077% */
SEEK_CUR) = 0
21561 write(6, /* head at 4177920 effectiveness 61.111 */0x8083e31,
16384) = 16384
21561 _llseek(6, 3145728, [3145728], /* BT efficiency 61.111% */
SEEK_SET) = 0
21561 _llseek(6, 0, [3145728], /* BT efficiency 61.111% */
SEEK_CUR) = 0
21561 _llseek(6, 0, [3145728], /* BT efficiency 61.111% */
SEEK_CUR) = 0
21561 read(6, /* head at 4194304 effectiveness 60.000 */0x404f4008,
1048576) = 1048576
21561 _llseek(6, 4194304, [4194304], /* BT efficiency 60.000% */
SEEK_SET) = 0
21561 _llseek(6, 0, [4194304], /* BT efficiency 60.000% */
SEEK_CUR) = 0
21561 write(6, /* head at 4194304 effectiveness 61.905 */0x404f4008,
1048576) = 1048576
Does this happen only with libtorrent, or with transmission too? If
it's in both, I would guess that it's an implementation detail of the
POSIX layer, that the POSIX calls don't translate directly to system
calls. Since libtorrent has a pool of open files, each read or write
operation has to seek before reading/writing, because there may have
been other such operations since last time, which moved the file
position. This means that even when reading sequentially, libtorrent
will seek to the current position because it doesn't know that the
last read/write operation left the file position at the place where
it wants it for the next operation.

I can't imagine that there's any cost involved in doing that extra
seek; the alternative would be to call tell(), compare, and
then seek. But the common case is that the seek is necessary anyway;
it's just that in your case you only had one peer.
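For reference, here is a minimal sketch of the tell()-and-compare idea, assuming a simple wrapper that caches the last known file offset in user space (hypothetical names; this is not libtorrent's actual storage code):

#include <sys/types.h>
#include <unistd.h>

// Hypothetical wrapper: remembers where the file cursor was left, and only
// seeks when the next operation starts somewhere else. The plain approach
// discussed above would simply lseek(SEEK_SET) before every read/write.
struct tracked_file
{
    explicit tracked_file(int f) : fd(f), pos(0) {}

    ssize_t read_at(off_t offset, void* buf, size_t len)
    {
        if (pos != offset)
        {
            if (lseek(fd, offset, SEEK_SET) < 0) return -1;
            pos = offset;
        }
        ssize_t ret = ::read(fd, buf, len);
        if (ret > 0) pos += ret; // keep the cached cursor in sync
        return ret;
    }

    ssize_t write_at(off_t offset, void const* buf, size_t len)
    {
        if (pos != offset)
        {
            if (lseek(fd, offset, SEEK_SET) < 0) return -1;
            pos = offset;
        }
        ssize_t ret = ::write(fd, buf, len);
        if (ret > 0) pos += ret;
        return ret;
    }

    int fd;
    off_t pos; // last known position of the file cursor
};

Whether skipping the syscall is worth the bookkeeping is exactly the question raised above; with a pool of shared file handles the cached position would also have to be invalidated whenever anything else touches the descriptor.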


--
Arvid Norberg
Radu Hociung
2006-04-14 15:05:25 UTC
Permalink
Post by Radu Hociung
Post by Arvid Norberg
Post by Radu Hociung
Greetings all,
I have done some experiments on libtorrent and Transmission, comparing
the effectiveness of their client-side caching and download scheduling
algorithms.
http://en.wikipedia.org/wiki/BitTorrent_performance
I hope that you'll find it a worthwhile read.
Very nice graphs :)
Caching blocks and only write whole pieces at once would definitely
improve that measurement. But I'm not sure it would improve in reality
(it probably would, a bit at least), since the OS's cache very likely is
done in the kernel, which means that you don't know how many of those
write-calls actually resulting in a write to the disk.
I doubt that writing whole pieces would make much difference. Looking at
libtorrent's current download pattern, and at the strace, it looks like
it writes the blocks of a piece in the proper order. So they probably
get assembled into the respective pieces in the OS's cache. If they were
being downloaded out of order, I would see a case for caching whole pieces.
I thought you downloaded a "wild" torrent, with more than one peer. If
you had done that, you would have noticed that the write operations
would have been scattered more. Because libtorrent is writing each block
to disk, and even though blocks within a piece are downloaded
sequentially, it would still be more seeking.
In high speeds libtorrent prefers to download whole pieces from single
peers, which means block writes to different pieces would be interleaved.
Ah, of course. I should have thought of this.

In order to answer questions regarding the behaviour of OS caches, there
is a library available to study physical disk access, written at the
Renaissance Computing Institute. It requires a patch to the kernel, as
well as some calls in the user-level application to start/stop tracing.

If the user-level calls were put in strace, then physical analysis of an
application would be easier to apply to any client, without modifying
the client itself.

Here's the URL to the (open source) library:

http://www.renci.org/research/cadre/CADRETraceLibraries/CADREPhysicalIOLibrary.htm

The testbench sw I am using is essentially strace, with modifications I
made. I plan to release this testbench so that you can run it yourselves
also, which I think will allow you to easily experiment and draw
conclusions, without requiring guessing.
Post by Radu Hociung
However, you're right, about not knowing when the writes to disk
Say that the kernel starts off with an empty cache and 100MB free RAM.
When downloading a 5GB file, after the 1st 100MB are received, the OS
will have to start writing. It's reasonable that it will start flushing
the oldest cache slots first, so it is likely that after the first 100MB
received, it will start writing the data it received first.
In the case of some advanced filesystems (such as XFS, which buffers
data as long as possible and allocates storage at the last possible
moment, when flushing buffers becomes inevitable), it is possible that
the cache won't always flush the first-received data, but it may try to
group buffers based on their relative closeness.
Maybe it's possible to get timestamps in the trace log. Then it would be
possible to estimate when actual writes were done.
It would be very easy to add timestamps, you can use the same command
line as I show in wikipedia, but add "-tt" (timestamp) and/or -T (call
duration) to the strace options.

You can profile any torrent this way.
Post by Radu Hociung
But in any case, when the file size is much larger than the available
cache memory, it is inevitable that the disk heads will jump around.
In my test setup, I only had 1 seed and 1 client, so to the client, all
the pieces it didn't already have were equally rare in the 'swarm', and
the random download pattern really did not have any purpose.
I was thinking of another way of scheduling downloads. With 100MB of
cache, the client could "focus" on downloading 100MB chunks of the
torrent, and select the rarest first within that 100MB chunk before
moving on to another chunk. the chunk could be a sliding window or such.
Before moving the sliding window, the client could do a 'sync' to force
data to be flushed. I would assume that on sync, the data would be
written in the most contiguous way possible.
I plan on experimenting with the scheduling routines of either
libtorrent or transmission to test out the various hypotheses.
The more I play around, the more I convince myself that it's the
scheduling algorithm that can bring the most benefits, not the caching
I'm a bit reluctant to change the random distribution + rarest first
algorithm, because it is much more efficient in keeping a torrent alive
than if chunks would be downloaded. (with alive I mean that the
distributed copies >= 1).
I understand, but in some cases the random distribution does not do
anything. In the case I show of 1:1, it's no safer to download data out
of order than it is to download sequentially I think.

Also, in the case of a perfect torrent (more seeds than downloaders, say
20:5), say that two pieces have availability of 20 and 24 respectively.
Does it really make a difference if you download the more available
piece first? Is it likely that the 20 seeds will leave the torrent,
rendering it dead? The client should constantly evaluate risk, and
switch to "survival" mode only when necessary, but it's overkill to
always run in "survival" mode.
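To make the idea concrete, a rough sketch of such a mode switch (invented names and thresholds; not a proposal for libtorrent's actual picker):

#include <cstddef>
#include <vector>

enum pick_mode { pick_sequential, pick_rarest_first };

// "Survival" check: if every missing piece is seen on at least
// safe_threshold peers, sequential (disk-friendly) picking is considered
// safe; otherwise fall back to rarest-first. The threshold is invented.
pick_mode choose_mode(std::vector<int> const& availability,
                      std::vector<bool> const& have,
                      int safe_threshold = 3)
{
    for (std::size_t i = 0; i < have.size(); ++i)
        if (!have[i] && availability[i] < safe_threshold)
            return pick_rarest_first;
    return pick_sequential;
}

// Returns the next piece index to request, or -1 if we have everything.
int next_piece(std::vector<int> const& availability,
               std::vector<bool> const& have)
{
    if (choose_mode(availability, have) == pick_sequential)
    {
        for (std::size_t i = 0; i < have.size(); ++i)
            if (!have[i]) return static_cast<int>(i); // lowest index first
        return -1;
    }
    int best = -1;
    for (std::size_t i = 0; i < have.size(); ++i)
    {
        if (have[i]) continue;
        if (best == -1 || availability[i] < availability[best])
            best = static_cast<int>(i);
    }
    return best; // rarest missing piece
}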
Post by Radu Hociung
algorithm. After all, in a case where most pieces are written to the
disk once, and never read back, the cache acts as little more than a
simple delayed-write buffer, whether it is an application cache or an OS
cache.
That is why I haven't made any real effort in implementing a cache in
libtorrent so far.
Post by Radu Hociung
BTW: Why are there 2 duplicate seeks before every write and 3 before
21561 write(6, /* head at 4161536 effectiveness 61.077 */0x8083e31,
16384) = 16384
21561 _llseek(6, 4177920, [4177920], /* BT efficiency 61.077% */
SEEK_SET) = 0
21561 _llseek(6, 0, [4177920], /* BT efficiency 61.077% */ SEEK_CUR) = 0
21561 write(6, /* head at 4177920 effectiveness 61.111 */0x8083e31,
16384) = 16384
21561 _llseek(6, 3145728, [3145728], /* BT efficiency 61.111% */
SEEK_SET) = 0
21561 _llseek(6, 0, [3145728], /* BT efficiency 61.111% */ SEEK_CUR) = 0
21561 _llseek(6, 0, [3145728], /* BT efficiency 61.111% */ SEEK_CUR) = 0
21561 read(6, /* head at 4194304 effectiveness 60.000 */0x404f4008,
1048576) = 1048576
21561 _llseek(6, 4194304, [4194304], /* BT efficiency 60.000% */
SEEK_SET) = 0
21561 _llseek(6, 0, [4194304], /* BT efficiency 60.000% */ SEEK_CUR) = 0
21561 write(6, /* head at 4194304 effectiveness 61.905 */0x404f4008,
1048576) = 1048576
Does this happen only with libtorrent, or with transmission too? If it's
in both, I would guess that it's an implementation detail of the POSIX
layer, that the POSIX calls don't translate directly to system calls.
Since libtorrent has a pool of open files, each read or write operation
has to seek before reading/writing, because there may have been other
such operations since last time, which moved the file position. This
means that even when reading sequentially, libtorrent will seek to the
current position because it doesn't know that the last read/write
operation left the file position at the place where it wants it for the
next operation.
I can't imagine that there's any cost involved in doing that extra seek,
the alternative would be to call tell(), compare, and then seek.
But the common case is that that seek is necessary anyway, it's just
that in your case you only had one peer.
This only happens in libtorrent; transmission seeks exactly once for
both reads and writes.

So if you do the SEEK_CUR in order to ensure the cursor is where you
expect it, it means you don't have an atomic lock between the SEEK_SET
and the read/write? If you don't have such a lock and your app is
multi-threaded, what prevents another thread from seeking between your
SEEK_CUR and the read/write?

Radu.
Arvid Norberg
2006-04-19 18:14:42 UTC
Permalink
Post by Radu Hociung
Post by Arvid Norberg
Post by Radu Hociung
I doubt that writing whole pieces would make much difference. Looking at
libtorrent's current download pattern, and at the strace, it
looks like
it writes the blocks of a piece in the proper order. So they
probably
get assembled into the respective pieces in the OS's cache. If they were
being downloaded out of order, I would see a case for caching whole pieces.
I thought you downloaded a "wild" torrent, with more than one
peer. If
you had done that, you would have noticed that the write operations
would have been scattered more. Because libtorrent is writing each block
to disk, and even though blocks within a piece are downloaded
sequentially, it would still be more seeking.
In high speeds libtorrent prefers to download whole pieces from single
peers, which means block writes to different pieces would be
interleaved.
Ah, of course. I should have thought of this.
In order to answer questions regarding the behaviour of OS caches, there
is a library available to study physical disk access, written at the
Renaissance Computing Institute. It requires a patch to the kernel, as
well as some calls in the user-level application to start/stop
tracing.
If the user-level calls were put in strace, then physical analysis of an
application would be easier to apply to any client, without modifying
the client itself.
http://www.renci.org/research/cadre/CADRETraceLibraries/
CADREPhysicalIOLibrary.htm
The testbench sw I am using is essentially strace, with
modifications I
made. I plan to release this testbench so that you can run it
yourselves
also, which I think will allow you to easily experiment and draw
conclusions, without requiring guessing.
Do you think it would be possible to use it on a clean strace log
(without the modifications?). I'm asking because in that case maybe
it would be possible to use it on a ktrace log (which is used on
darwin).
Post by Radu Hociung
Post by Arvid Norberg
Post by Radu Hociung
However, you're right, about not knowing when the writes to disk
Say that the kernel starts off with an empty cache and 100MB free RAM.
When downloading a 5GB file, after the 1st 100MB are received, the OS
will have to start writing. It's reasonable that it will start flushing
the oldest cache slots first, so it is likely that after the first 100MB
received, it will start writing the data it received first.
In the case of some advanced filesystems (such as XFS, which buffers
data as long as possible and allocates storage at the last possible
moment, when flushing buffers becomes inevitable), it is
possible that
the cache won't always flush the first-received data, but it may try to
group buffers based on their relative closeness.
Maybe it's possible to get timestamps in the trace log. Then it would be
possible to estimate when actual writes were done.
It would be very easy to add timestamps, you can use the same command
line as I show in wikipedia, but add "-tt" (timestamp) and/or -T (call
duration) to the strace options.
You can this way profile any torrent.
Post by Arvid Norberg
Post by Radu Hociung
But in any case, when the file size is much larger than the
available
cache memory, it is inevitable that the disk heads will jump around.
In my test setup, I only had 1 seed and 1 client, so to the
client, all
the pieces it didn't already have were equally rare in the
'swarm', and
the random download pattern really did not have any purpose.
I was thinking of another way of scheduling downloads. With 100MB of
cache, the client could "focus" on downloading 100MB chunks of the
torrent, and select the rarest first within that 100MB chunk before
moving on to another chunk. the chunk could be a sliding window or such.
Before moving the sliding window, the client could do a 'sync' to force
data to be flushed. I would assume that on sync, the data would be
written in the most contiguous way possible.
I plan on experimenting with the scheduling routines of either
libtorrent or transmission to test out the various hypotheses.
The more I play around, the more I convince myself that it's the
scheduling algorithm that can bring the most benefits, not the caching
I'm a bit reluctant to change the random distribution + rarest first
algorithm, because it is much more efficient in keeping a torrent alive
than if chunks would be downloaded. (with alive I mean that the
distributed copies >= 1).
I understand, but in some cases the random distribution does not do
anything. In the case I show of 1:1, it's no safer to download data out
of order than it is to download sequentially I think.
Not in the case where you're guaranteed that the seed will stay
online, but it's very uncommon to have such a guarantee. Usually, the
seed could go offline and a half-finished peer could come online. The
optimal case would be that the two peers together hold 1 distributed copy
and are able to complete the file. And the closest one can get to that
(without having any global knowledge) is to download pieces at random.
Post by Radu Hociung
Also, in the case of a perfect torrent (more seeds than
downloaders, say
20:5), say that two pieces have availability of 20 and 24
respectively.
Does it really make a difference if you download the more available
piece first? Is it likely that the 20 seeds will leave the torrent,
rendering it dead? The client should constantly evaluate risk, and
switch to "survival" mode only when necessary, but it's overkill to
always run in "survival" mode.
This is very true (as mentioned in another branch of this thread). I
don't think the ratio is that important though. As long as a piece
has enough peers, it could be downloaded sequentially.
Post by Radu Hociung
Post by Arvid Norberg
Post by Radu Hociung
algorithm. After all, in a case where most pieces are written to the
disk once, and never read back, the cache acts as little more than a
simple delayed-write buffer, whether it is an application cache or an OS
cache.
That is why I haven't made any real effort in implementing a cache in
libtorrent so far.
Post by Radu Hociung
BTW: Why are there 2 duplicate seeks before every write and 3 before
21561 write(6, /* head at 4161536 effectiveness 61.077 */0x8083e31,
16384) = 16384
21561 _llseek(6, 4177920, [4177920], /* BT efficiency 61.077% */
SEEK_SET) = 0
21561 _llseek(6, 0, [4177920], /* BT efficiency 61.077% */
SEEK_CUR) = 0
21561 write(6, /* head at 4177920 effectiveness 61.111 */0x8083e31,
16384) = 16384
21561 _llseek(6, 3145728, [3145728], /* BT efficiency 61.111% */
SEEK_SET) = 0
21561 _llseek(6, 0, [3145728], /* BT efficiency 61.111% */
SEEK_CUR) = 0
21561 _llseek(6, 0, [3145728], /* BT efficiency 61.111% */
SEEK_CUR) = 0
21561 read(6, /* head at 4194304 effectiveness 60.000 */0x404f4008,
1048576) = 1048576
21561 _llseek(6, 4194304, [4194304], /* BT efficiency 60.000% */
SEEK_SET) = 0
21561 _llseek(6, 0, [4194304], /* BT efficiency 60.000% */
SEEK_CUR) = 0
21561 write(6, /* head at 4194304 effectiveness 61.905 */0x404f4008,
1048576) = 1048576
Does this happen only with libtorrent, or with transmission too? If it's
in both, I would guess that it's an implementation detail of the POSIX
layer, that the POSIX calls don't translate directly to system calls.
Since libtorrent has a pool of open files, each read or write
operation
has to seek before reading/writing, because there may have been other
such operations since last time, which moved the file position. This
means that even when reading sequentially, libtorrent will seek to the
current position because it doesn't know that the last read/write
operation left the file position at the place where it wants it for the
next operation.
I can't imagine that there's any cost involved in doing that extra seek,
the alternative would be to call tell(), compare, and then seek.
But the common case is that that seek is necessary anyway, it's just
that in your case you only had one peer.
This only happens in libtorrent; transmission seeks exactly once for
both reads and writes.
That is very strange, I'll have to investigate this.
Post by Radu Hociung
So if you do the SEEK_CUR in order to ensure the cursor is where you
expect it, it means you don't have an atomic lock between the SEEK_SET
and the read/write? If you don't have such a lock and your app is
multi-threaded, what prevents another thread from seeking between your
SEEK_CUR and the read/write?
No, that's not why I do a seek before every read and write. All IO
operations are done in the same thread. But every read operation is
done without any context. The current file position is ignored. The
storage abstraction basically gives random access to the files. And
the queue for sending blocks may be interleaved with blocks from
different pieces.



--
Arvid Norberg
Radu Hociung
2006-04-19 21:36:16 UTC
Permalink
Post by Arvid Norberg
Post by Radu Hociung
Post by Radu Hociung
I doubt that writing whole pieces would make much difference. Looking at
libtorrent's current download pattern, and at the strace, it looks
like
it writes the blocks of a piece in the proper order. So they probably
get assembled into the respective pieces in the OS's cache. If they
were
being downloaded out of order, I would see a case for caching whole pieces.
I thought you downloaded a "wild" torrent, with more than one peer. If
you had done that, you would have noticed that the write operations
would have been scattered more. Because libtorrent is writing each block
to disk, and even though blocks within a piece are downloaded
sequentially, it would still be more seeking.
In high speeds libtorrent prefers to download whole pieces from single
peers, which means block writes to different pieces would be
interleaved.
Ah, of course. I should have thought of this.
In order to answer questions regarding the behaviour of OS caches, there
is a library available to study physical disk access, written at the
Renaissance Computing Institute. It requires a patch to the kernel, as
well as some calls in the user-level application to start/stop tracing.
If the user-level calls were put in strace, then physical analysis of an
application would be easier to apply to any client, without modifying
the client itself.
http://www.renci.org/research/cadre/CADRETraceLibraries/
CADREPhysicalIOLibrary.htm
The testbench sw I am using is essentially strace, with modifications I
made. I plan to release this testbench so that you can run it yourselves
also, which I think will allow you to easily experiment and draw
conclusions, without requiring guessing.
Do you think it would be possible to use it on a clean strace log
(without the modifications?). I'm asking because in that case maybe it
would be possible to use it on a ktrace log (which is used on darwin).
The CADRE library is not used on a log. Instead, it must be
activated/deactivated somehow. The information it captures when
activated is stored in a separate location, not in the strace log.

Further, in order to use it, a patch must be applied to a linux kernel.
This patch adds instrumentation to the device drivers. It's that
instrumentation that is then started/stopped by making a call from the
user level application.

So, I don't think this library will run on darwin; I am not familiar
with ktrace, but it looks to be an strace-like utility. Does the "-ti"
option not log interactions of the kernel with the hardware? If it
does, then you don't need CADRE in order to monitor the behaviour of the
disk cache.
Post by Arvid Norberg
Post by Radu Hociung
Post by Radu Hociung
However, you're right, about not knowing when the writes to disk
Say that the kernel starts off with an empty cache and 100MB free RAM.
When downloading a 5GB file, after the 1st 100MB are received, the OS
will have to start writing. It's reasonable that it will start flushing
the oldest cache slots first, so it is likely that after the first 100MB
received, it will start writing the data it received first.
In the case of some advanced filesystems (such as XFS, which buffers
data as long as possible and allocates storage at the last possible
moment, when flushing buffers becomes inevitable), it is possible
that
the cache won't always flush the first-received data, but it may try to
group buffers based on their relative closeness.
Maybe it's possible to get timestamps in the trace log. Then it would be
possible to estimate when actual writes were done.
It would be very easy to add timestamps, you can use the same command
line as I show in wikipedia, but add "-tt" (timestamp) and/or -T (call
duration) to the strace options.
You can this way profile any torrent.
Post by Radu Hociung
But in any case, when the file size is much larger than the available
cache memory, it is inevitable that the disk heads will jump around.
In my test setup, I only had 1 seed and 1 client, so to the client,
all
the pieces it didn't already have were equally rare in the 'swarm',
and
the random download pattern really did not have any purpose.
I was thinking of another way of scheduling downloads. With 100MB of
cache, the client could "focus" on downloading 100MB chunks of the
torrent, and select the rarest first within that 100MB chunk before
moving on to another chunk. the chunk could be a sliding window or
such.
Before moving the sliding window, the client could do a 'sync' to force
data to be flushed. I would assume that on sync, the data would be
written in the most contiguous way possible.
I plan on experimenting with the scheduling routines of either
libtorrent or transmission to test out the various hypotheses.
The more I play around, the more I convince myself that it's the
scheduling algorithm that can bring the most benefits, not the caching
I'm a bit reluctant to change the random distribution + rarest first
algorithm, because it is much more efficient in keeping a torrent alive
than if chunks would be downloaded. (with alive I mean that the
distributed copies >= 1).
I understand, but in some cases the random distribution does not do
anything. In the case I show of 1:1, it's no safer to download data out
of order than it is to download sequentially I think.
Not in the case where you're guaranteed that the seed will stay online,
but it's very uncommon to have such a guarantee. Usually, the seed
could go offline and a half-finished peer will come online. The optimal
case would be that both peers would have 1 distributed copy and be able
to complete the file. And closest one can get to that (without having
any global knowledge) is to download pieces in random.
Ah, I see... I had not thought of such a scenario. Very interesting.

So, in a torrent like you described (1 seed, 1 hidden/offline peer, 1
blind peer), assuming downloaders get 1/2 of the file each, they would
be able to finish only if they have complementary halves of the file.

With randomization at the piece level and, say, 1,000 pieces in
total, each of the downloaders needs the right 500 pieces to make the
torrent healthy.

The probability of the blind client downloading the right 500 pieces,
not knowing which 500 the hidden peer has, is about 1/(2^1000), since
each piece has a 1/2 chance of being the right piece, and there are
1000 guesses to be made.

I still think sequential downloading gives the torrent a better chance
for healing:

Say that each client downloads quarters of the file in random order. The
pieces in each quarter are downloaded sequentially. (ie, all the pieces
in quarter 1 are downloaded in sequential order, but the file may be
downloaded as Q2,Q4,Q3,Q1)

In this case, if the torrent behaves like you described above, the
chances for the torrent remaining healthy with the blind and hidden
peers present, but seed absent, are about 1/(2^4), since there are 4
guesses to be made. If each of the clients has 1/2 of the file when the
seed goes away, they have a 1/16 chance of finishing.

So comparing a download strategy of piece-by-piece-randomly (PBPR) or
quarter-by-quarter-randomly (QBQR), the QBQR strategy has a 6.25% chance
of finishing, while PBPR has practically no (1/(2^1000)) chance of finishing.

I think even the 6.25% probability can be improved:

If the file is downloaded in halves instead, there is a 25% chance of
finishing this download. Perhaps each client can decide whether to
download halves, quarters, eighths or whatever, depending on how many
peers it encounters:

1 seed, 0 peers = download halves
1 seed, 1 peer = download quarters
2 seeds, 2 peers = download eighths

What do you think?
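(As an aside, treating the selection combinatorially rather than as independent coin flips, under the same assumptions as above, and assuming the blind peer's choice is uniform and independent of the hidden peer's, gives somewhat higher figures, though the conclusion is the same:

\[
P_{\text{pieces}} = \binom{1000}{500}^{-1} \approx 3.7\times10^{-300},\qquad
P_{\text{quarters}} = \binom{4}{2}^{-1} = \tfrac{1}{6} \approx 16.7\%,\qquad
P_{\text{halves}} = \binom{2}{1}^{-1} = 50\%.
\]

So downloading in larger units still helps enormously in this scenario, even with the corrected numbers.)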
Post by Arvid Norberg
Post by Radu Hociung
Also, in the case of a perfect torrent (more seeds than downloaders, say
20:5), say that two pieces have availability of 20 and 24 respectively.
Does it really make a difference if you download the more available
piece first? Is it likely that the 20 seeds will leave the torrent,
rendering it dead? The client should constantly evaluate risk, and
switch to "survival" mode only when necessary, but it's overkill to
always run in "survival" mode.
This is very true (as mentioned in another branch of this thread). I
don't think the ratio is that important though. As long as a piece has
enough peers, it could be downloaded sequentially.
Post by Radu Hociung
Post by Radu Hociung
algorithm. After all, in a case where most pieces are written to the
disk once, and never read back, the cache acts as little more than a
simple delayed-write buffer, whether it is an application cache or
an OS
cache.
That is why I haven't made any real effort in implementing a cache in
libtorrent so far.
Post by Radu Hociung
BTW: Why are there 2 duplicate seeks before every write and 3 before
21561 write(6, /* head at 4161536 effectiveness 61.077 */0x8083e31,
16384) = 16384
21561 _llseek(6, 4177920, [4177920], /* BT efficiency 61.077% */
SEEK_SET) = 0
21561 _llseek(6, 0, [4177920], /* BT efficiency 61.077% */
SEEK_CUR) = 0
21561 write(6, /* head at 4177920 effectiveness 61.111 */0x8083e31,
16384) = 16384
21561 _llseek(6, 3145728, [3145728], /* BT efficiency 61.111% */
SEEK_SET) = 0
21561 _llseek(6, 0, [3145728], /* BT efficiency 61.111% */
SEEK_CUR) = 0
21561 _llseek(6, 0, [3145728], /* BT efficiency 61.111% */
SEEK_CUR) = 0
21561 read(6, /* head at 4194304 effectiveness 60.000 */0x404f4008,
1048576) = 1048576
21561 _llseek(6, 4194304, [4194304], /* BT efficiency 60.000% */
SEEK_SET) = 0
21561 _llseek(6, 0, [4194304], /* BT efficiency 60.000% */
SEEK_CUR) = 0
21561 write(6, /* head at 4194304 effectiveness 61.905 */0x404f4008,
1048576) = 1048576
Does this happen only with libtorrent, or with transmission too? If
it's
in both, I would guess that it's an implementation detail of the POSIX
layer, that the POSIX calls don't translate directly to system calls.
Since libtorrent has a pool of open files, each read or write operation
has to seek before reading/writing, because there may have been other
such operations since last time, which moved the file position. This
means that even when reading sequentially, libtorrent will seek to the
current position because it doesn't know that the last read/write
operation left the file position at the place where it wants it for the
next operation.
I can't imagine that there's any cost involved in doing that extra seek,
the alternative would be to call tell(), compare, and then seek.
But the common case is that that seek is necessary anyway, it's just
that in your case you only had one peer.
This only happens in libtorrent; transmission seeks exactly once for
both reads and writes.
That is very strange, I'll have to investigate this.
Post by Radu Hociung
So if you do the SEEK_CUR in order to ensure the cursor is where you
expect it, it means you don't have an atomic lock between the SEEK_SET
and the read/write? If you don't have such a lock and your app is
multi-threaded, what prevents another thread from seeking between your
SEEK_CUR and the read/write?
No, that's not why I do a seek before every read and write. All IO
operations are done in the same thread. But every read operation is
done without any context. The current file position is ignored. The
storage abstraction basically gives random access to the files. And the
queue for sending blocks may be interleaved with blocks from different
pieces.
Sounds like the SEEK_CUR is not needed then.

Cheers,
Radu.
Arvid Norberg
2006-04-19 23:23:45 UTC
Permalink
Post by Radu Hociung
Post by Arvid Norberg
Post by Radu Hociung
Post by Radu Hociung
I doubt that writing whole pieces would make much difference. Looking at
libtorrent's current download pattern, and at the strace, it
looks
like
it writes the blocks of a piece in the proper order. So they probably
get assembled into the respective pieces in the OS's cache. If
they
were
being downloaded out of order, I would see a case for caching
whole
pieces.
I thought you downloaded a "wild" torrent, with more than one peer. If
you had done that, you would have noticed that the write operations
would have been scattered more. Because libtorrent is writing each block
to disk, and even though blocks within a piece are downloaded
sequentially, it would still be more seeking.
In high speeds libtorrent prefers to download whole pieces from single
peers, which means block writes to different pieces would be interleaved.
Ah, of course. I should have thought of this.
In order to answer questions regarding the behaviour of OS
caches, there
is a library available to study physical disk access, written at the
Renaissance Computing Institute. It requires a patch to the
kernel, as
well as some calls in the user-level application to start/stop tracing.
If the user-level calls were put in strace, then physical
analysis of an
application would be easier to apply to any client, without
modifying
the client itself.
http://www.renci.org/research/cadre/CADRETraceLibraries/
CADREPhysicalIOLibrary.htm
The testbench sw I am using is essentially strace, with
modifications I
made. I plan to release this testbench so that you can run it yourselves
also, which I think will allow you to easily experiment and draw
conclusions, without requiring guessing.
Do you think it would be possible to use it on a clean strace log
(without the modifications?). I'm asking because in that case
maybe it
would be possible to use it on a ktrace log (which is used on
darwin).
The CADRE library is not used on a log. Instead, it must be
activated/deactivated somehow. The information it captures when
activated is stored in a separate location, not in the strace log.
Further, in order to use it, a patch must be applied to a linux kernel.
This patch adds instrumentation to the device drivers. It's that
instrumentation that is then started/stopped by making a call from the
user level application.
So, I don't think this library will run on darwin; I am not familiar
with ktrace, but it looks to be an strace-like utility. Does the "-ti"
option not log interactions of the kernel with the hardware? If it
does, then you don't need CADRE in order to monitor the behaviour of the
disk cache.
I was primarily thinking about your testbench.
Post by Radu Hociung
Post by Arvid Norberg
Not in the case where you're guaranteed that the seed will stay
online,
but it's very uncommon to have such a guarantee. Usually, the seed
could go offline and a half-finished peer will come online. The
optimal
case would be that both peers would have 1 distributed copy and be able
to complete the file. And closest one can get to that (without having
any global knowledge) is to download pieces in random.
Ah, I see... I had not thought of such a scenario. Very interesting.
So, in a torrent like you described (1 seed, 1 hidden/offline peer, 1
blind peer), assuming downloaders get 1/2 of the file each, they would
be able to finish only if they have complementary halves of the file.
With a randomization at the piece level, with say, 1,000 pieces in
total, each of the downloaders need the right 500 pieces to make the
torrent healthy.
The probability for the blind client downloading the right 500 pieces,
not knowing which 500 the hidden peer has, are about 1/(2^1000), since
each piece has 1/2 chance of being the right piece, and that there are
1000 guesses to be made.
I still think sequential downloading gives the torrent a better chance
Say that each client downloads quarters of the file in random
order. The
pieces in each quarter are downloaded sequentially. (ie, all the pieces
in quarter 1 are downloaded in sequential order, but the file may be
downloaded as Q2,Q4,Q3,Q1)
In this case, if the torrent behaves like you described above, the
chances for the torrent remaining healthy with the blind and hidden
peers present, but seed absent, are about 1/(2^4), since there are 4
guesses to be made. If each of the clients has 1/2 of the file when the
seed goes away, they have a 1/16 chance of finishing.
So comparing a download strategy of piece-by-piece-randomly (PBPR) or
quarter-by-quarter-randomly (QBQR), the QBQR strategy has a 6.25% chance
of finishing, while PBPR has no (1/(2^1000)) chance of finishing.
If the file is downloaded in halves instead, there is a 25% chance of
finishing this download. Perhaps each client can decide whether to
download halves, quarters, eights or whatever, depending on how many
1 seed, 0 peers = download halves
1 seed, 1 peer = download quarters
2 seeds, 2 peers = download eights
What do you think?
I would most respectfully encourage you to think again :)

To be honest, I'm not sure I follow all of your reasoning. But what
you do in practice is that you increase the piece sizes. This has the
effect of increasing overlap in the downloads. The chances of
finishing are not the only measurement; you have to consider the
overlap in the downloads. The goal is that any two peers should have
as high a probability as possible of exchanging data, two-way. The higher
the granularity (number of pieces), the closer one gets to the
optimal case when downloading random pieces.

Also keep in mind that the peer has no global knowledge, so the
number of peers and seeds would have to be counted only among
neighbours.

You can probably get a very good explanation why this is the case in
either #bittorrent on freenode or on the bittorrent mailing list:

http://lists.ibiblio.org/mailman/listinfo/bittorrent
Post by Radu Hociung
Post by Arvid Norberg
No, that's not why I do a seek before every read and write. All IO
operations are done in the same thread. But every read operation is
done without any context. The current file position is ignored. The
storage abstraction basically gives random access to the files. And the
queue for sending blocks may be interleaved with blocks from
different
pieces.
Sounds like the SEEK_CUR is not needed then.
Exactly, the extra seek is not necessary; that's why I will look into
why it's there.


--
Arvid Norberg
Radu Hociung
2006-04-20 16:42:58 UTC
Permalink
Post by Arvid Norberg
Post by Radu Hociung
Post by Arvid Norberg
Post by Radu Hociung
Post by Radu Hociung
I doubt that writing whole pieces would make much difference. Looking at
libtorrent's current download pattern, and at the strace, it looks
like
it writes the blocks of a piece in the proper order. So they probably
get assembled into the respective pieces in the OS's cache. If they
were
being downloaded out of order, I would see a case for caching whole
pieces.
I thought you downloaded a "wild" torrent, with more than one peer. If
you had done that, you would have noticed that the write operations
would have been scattered more. Because libtorrent is writing each block
to disk, and even though blocks within a piece are downloaded
sequentially, it would still be more seeking.
In high speeds libtorrent prefers to download whole pieces from single
peers, which means block writes to different pieces would be interleaved.
Ah, of course. I should have thought of this.
In order to answer questions regarding the behaviour of OS caches,
there
is a library available to study physical disk access, written at the
Renaissance Computing Institute. It requires a patch to the kernel, as
well as some calls in the user-level application to start/stop tracing.
If the user-level calls were put in strace, then physical analysis
of an
application would be easier to apply to any client, without modifying
the client itself.
http://www.renci.org/research/cadre/CADRETraceLibraries/
CADREPhysicalIOLibrary.htm
The testbench sw I am using is essentially strace, with
modifications I
made. I plan to release this testbench so that you can run it yourselves
also, which I think will allow you to easily experiment and draw
conclusions, without requiring guessing.
Do you think it would be possible to use it on a clean strace log
(without the modifications?). I'm asking because in that case maybe it
would be possible to use it on a ktrace log (which is used on darwin).
The CADRE library is not used on a log. Instead, it must be
activated/deactivated somehow. The information it captures when
activated is stored in a separate location, not in the strace log.
Further, in order to use it, a patch must be applied to a linux kernel.
This patch adds instrumentation to the device drivers. It's that
instrumentation that is then started/stopped by making a call from the
user level application.
So, I don't think this library will run on darwin; I am not familiar
with ktrace, but it looks to be an strace-like utility. Does the "-ti"
option not log interractions of the kernel with the hardware? If it
does, then you don't need CADRE in order to monitor the behaviour of the
disk cache.
I was primarily thinking about your testbench.
Perhaps I don't understand what you're asking. What would you like to
use on a clean strace log?

The only thing that my testbench does currently is to add a comment on
every _lseek, read and write line showing cumulative efficiency on that
file descriptor, along with the position of the drive heads at the time
the read/write call was made. These two pieces of information are only
used to plot the fancy graphs I've shown.

You can certainly derive the same plots from a clean strace log, which
contains all the necessary information.
Post by Arvid Norberg
Post by Radu Hociung
Post by Arvid Norberg
Not in the case where you're guaranteed that the seed will stay
online,
but it's very uncommon to have such a guarantee. Usually, the seed
could go offline and a half-finished peer will come online. The
optimal
case would be that both peers would have 1 distributed copy and be
able
to complete the file. And closest one can get to that (without having
any global knowledge) is to download pieces in random.
Ah, I see... I had not thought of such a scenario. Very interesting.
So, in a torrent like you described (1 seed, 1 hidden/offline peer, 1
blind peer), assuming downloaders get 1/2 of the file each, they would
be able to finish only if they have complementary halves of the file.
With a randomization at the piece level, with say, 1,000 pieces in
total, each of the downloaders need the right 500 pieces to make the
torrent healthy.
The probability for the blind client downloading the right 500 pieces,
not knowing which 500 the hidden peer has, are about 1/(2^1000), since
each piece has 1/2 chance of being the right piece, and that there are
1000 guesses to be made.
I still think sequential downloading gives the torrent a better chance
Say that each client downloads quarters of the file in random order. The
pieces in each quarter are downloaded sequentially. (ie, all the pieces
in quarter 1 are downloaded in sequential order, but the file may be
downloaded as Q2,Q4,Q3,Q1)
In this case, if the torrent behaves like you described above, the
chances for the torrent remaining healthy with the blind and hidden
peers present, but seed absent, are about 1/(2^4), since there are 4
guesses to be made. If each of the clients has 1/2 of the file when the
seed goes away, they have a 1/16 chance of finishing.
So comparing a download strategy of piece-by-piece-randomly (PBPR) or
quarter-by-quarter-randomly (QBQR), the QBQR strategy has a 6.25% chance
of finishing, while PBPR has no (1/(2^1000)) chance of finishing.
If the file is downloaded in halves instead, there is a 25% chance of
finishing this download. Perhaps each client can decide whether to
download halves, quarters, eights or whatever, depending on how many
1 seed, 0 peers = download halves
1 seed, 1 peer = download quarters
2 seeds, 2 peers = download eights
What do you think?
I would most respectfully encourage you to think again :)
To be honest, I'm not sure I follow all of your reasoning. But what you
do in practice is that you increase the piece sizes. This has the
effect of increasing overlap in the downloads. the chances of finishing
is not the only measurement, you have to consider the overlap in the
downloads. The goal is that any two peers should have as high
probability as possible to exchange data, two-way. The higher
granularity (number of pieces) the closer one would get to the optimal
case, when downloading random pieces.
Also keep in mind that the peer has no global knowledge, so the number
of peers and seeds would have to be counted only among neighbours.
You can probably get a very good explanation why this is the case in
http://lists.ibiblio.org/mailman/listinfo/bittorrent
I will do more reading, but what I meant was for the "super-pieces" to
not overlap. With a 1000-piece torrent, the first quarter would always
be pieces 0-249, the second would be 250-499, and so on.
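A rough sketch of such a picker (invented names; not libtorrent code): finish the super-piece we are in, sequentially, and pick the next super-piece at random.

#include <algorithm>
#include <cstdlib>
#include <vector>

// Returns the next piece to request, or -1 when all pieces are downloaded.
// have[i] is true once piece i is complete; num_super_pieces would be chosen
// from how many peers/seeds are visible, as suggested above.
int next_piece_qbqr(std::vector<bool> const& have, int num_super_pieces = 4)
{
    int const n = static_cast<int>(have.size());
    int const super_size = (n + num_super_pieces - 1) / num_super_pieces;

    std::vector<int> started;   // super-pieces partially downloaded
    std::vector<int> untouched; // super-pieces not started yet
    for (int s = 0; s < num_super_pieces; ++s)
    {
        int const begin = s * super_size;
        int const end = std::min(n, begin + super_size);
        bool any_have = false, any_missing = false;
        for (int i = begin; i < end; ++i)
        {
            if (have[i]) any_have = true;
            else any_missing = true;
        }
        if (!any_missing) continue; // this super-piece is done
        if (any_have) started.push_back(s);
        else untouched.push_back(s);
    }

    // finish a started super-piece first; otherwise pick a fresh one at random
    std::vector<int> const& pool = started.empty() ? untouched : started;
    if (pool.empty()) return -1; // torrent complete
    int const s = pool[std::rand() % pool.size()];

    // sequential order within the chosen super-piece
    for (int i = s * super_size; i < std::min(n, (s + 1) * super_size); ++i)
        if (!have[i]) return i;
    return -1;
}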

Thanks for the pointers, I'll read the recommended info.

Cheers,
Radu.
Arvid Norberg
2006-04-23 09:27:59 UTC
Permalink
Post by Radu Hociung
Post by Arvid Norberg
I was primarily thinking about your testbench.
Perhaps I don't understand what you're asking. What would you like to
use on a clean strace log?
The only thing that my testbench does currently is to add a comment on
every _lseek, read and write line showing cumulative efficiency on that
file descriptor, along with the position of the drive heads at the time
the read/write call was made. These two pieces of information are only
used to plot the fancy graphs I've shown.
You can certainly derive the same plots from a clean strace log, which
contains all the necessary information.
Well, basically I would like to run the kinds of tests that you did,
but on darwin, and generate the graphs to see if any changes I've
made improve performance.


--
Arvid Norberg
Radu Hociung
2006-04-24 18:53:36 UTC
Permalink
Post by Radu Hociung
Post by Arvid Norberg
I was primarily thinking about your testbench.
Perhaps I don't understand what you're asking. What would you like to
use on a clean strace log?
The only thing that my testbench does currently is to add a comment on
every _lseek, read and write line showing cumulative efficiency on that
file descriptor, along with the position of the drive heads at the time
the read/write call was made. These two pieces of information are only
used to plot the fancy graphs I've shown.
You can certainly derive the same plots from a clean strace log, which
contains all the necessary information.
Well, basically I would like to run the kinds of tests that you did, but
on darwin, and generate the graphs to see if any changes I've made
improve performance.
That's cool.

The test setup I used initially is very simple, as you know (1 seed + 1
downloader).

I think the ktrace log file will be easy to use to plot the behaviour of
the client. You need to log lseeks, reads and writes.

If you plot the lseek location versus time or lseek number, you'll get
the access pattern. I plot write seeks and read seeks in different colours.

In order to plot the effectiveness curve, you need to calculate from the
logs the distance travelled while seeking, and the distance travelled
while read/writing. I calculate the effectiveness of accesses as the
ratio between distance travelled while read/writing and total distance
travelled. The Wikipedia page I published earlier also contains details
about the calculations and measurements.
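For anyone who wants to reproduce the numbers, here is roughly the calculation, reconstructed from the description above (not the actual testbench code):

#include <cstdio>

// Tracks head movement from a sequence of seek and read/write events taken
// from an strace/ktrace log, and reports effectiveness as the ratio of
// distance travelled while reading/writing to total distance travelled.
struct effectiveness_meter
{
    long long head;      // current file offset ("head position")
    long long seek_dist; // distance covered by seeks
    long long rw_dist;   // distance covered by reads/writes

    effectiveness_meter() : head(0), seek_dist(0), rw_dist(0) {}

    void on_seek(long long target)
    {
        seek_dist += target > head ? target - head : head - target;
        head = target;
    }

    void on_read_write(long long length)
    {
        rw_dist += length;
        head += length;
    }

    double effectiveness_percent() const
    {
        long long const total = seek_dist + rw_dist;
        return total == 0 ? 100.0 : 100.0 * rw_dist / total;
    }
};

int main()
{
    effectiveness_meter m;
    // a couple of events in the spirit of the trace excerpts earlier in the thread
    m.on_seek(4177920); m.on_read_write(16384);
    m.on_seek(3145728); m.on_read_write(1048576);
    std::printf("effectiveness: %.3f%%\n", m.effectiveness_percent());
    return 0;
}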

Let me know if I can help more.

Radu.
M***@massaroddel.de
2006-04-20 05:42:20 UTC
Permalink
Post by Radu Hociung
Ah, I see... I had not thought of such a scenario. Very interesting.
So, in a torrent like you described (1 seed, 1 hidden/offline peer, 1
blind peer), assuming downloaders get 1/2 of the file each, they would
be able to finish only if they have complementary halves of the file.
With a randomization at the piece level, with say, 1,000 pieces in
total, each of the downloaders need the right 500 pieces to make the
torrent healthy.
The probability for the blind client downloading the right 500 pieces,
not knowing which 500 the hidden peer has, are about 1/(2^1000), since
each piece has 1/2 chance of being the right piece, and that there are
1000 guesses to be made.
I still think sequential downloading gives the torrent a better chance
Say that each client downloads quarters of the file in random order. The
pieces in each quarter are downloaded sequentially. (ie, all the pieces
in quarter 1 are downloaded in sequential order, but the file may be
downloaded as Q2,Q4,Q3,Q1)
In this case, if the torrent behaves like you described above, the
chances for the torrent remaining healthy with the blind and hidden
peers present, but seed absent, are about 1/(2^4), since there are 4
guesses to be made. If each of the clients has 1/2 of the file when the
seed goes away, they have a 1/16 chance of finishing.
So comparing a download strategy of piece-by-piece-randomly (PBPR) or
quarter-by-quarter-randomly (QBQR), the QBQR strategy has a 6.25% chance
of finishing, while PBPR has no (1/(2^1000)) chance of finishing.
If the file is downloaded in halves instead, there is a 25% chance of
finishing this download. Perhaps each client can decide whether to
download halves, quarters, eights or whatever, depending on how many
1 seed, 0 peers = download halves
1 seed, 1 peer = download quarters
2 seeds, 2 peers = download eights
What do you think?
There is superseeding ... which defeats this approach totally.

MassaRoddel
Radu Hociung
2006-04-20 19:41:17 UTC
Permalink
Post by M***@massaroddel.de
Post by Radu Hociung
Ah, I see... I had not thought of such a scenario. Very interesting.
So, in a torrent like you described (1 seed, 1 hidden/offline peer, 1
blind peer), assuming downloaders get 1/2 of the file each, they would
be able to finish only if they have complementary halves of the file.
With a randomization at the piece level, with say, 1,000 pieces in
total, each of the downloaders need the right 500 pieces to make the
torrent healthy.
The probability for the blind client downloading the right 500 pieces,
not knowing which 500 the hidden peer has, are about 1/(2^1000), since
each piece has 1/2 chance of being the right piece, and that there are
1000 guesses to be made.
I still think sequential downloading gives the torrent a better chance
Say that each client downloads quarters of the file in random
order. The
pieces in each quarter are downloaded sequentially. (ie, all the pieces
in quarter 1 are downloaded in sequential order, but the file may be
downloaded as Q2,Q4,Q3,Q1)
In this case, if the torrent behaves like you described above, the
chances for the torrent remaining healthy with the blind and hidden
peers present, but seed absent, are about 1/(2^4), since there are 4
guesses to be made. If each of the clients has 1/2 of the file when the
seed goes away, they have a 1/16 chance of finishing.
So comparing a download strategy of piece-by-piece-randomly (PBPR) or
quarter-by-quarter-randomly (QBQR), the QBQR strategy has a 6.25% chance
of finishing, while PBPR has no (1/(2^1000)) chance of finishing.
If the file is downloaded in halves instead, there is a 25% chance of
finishing this download. Perhaps each client can decide whether to
download halves, quarters, eights or whatever, depending on how many
1 seed, 0 peers = download halves
1 seed, 1 peer = download quarters
2 seeds, 2 peers = download eights
What do you think?
There is superseeding ... which defeats this approach totally.
MassaRoddel
What does superseeding have to do with improving lifespan of hard drives?

The above strategy seeks to minimize the number of seeks, while not
risking creating a bad torrent. All in all, this thread is about how to
best cache a torrent download without violating other protocol design goals.

Radu
M***@massaroddel.de
2006-04-21 05:39:35 UTC
Permalink
Post by Radu Hociung
Post by M***@massaroddel.de
Post by Radu Hociung
Ah, I see... I had not thought of such a scenario. Very interesting.
So, in a torrent like you described (1 seed, 1 hidden/offline peer, 1
blind peer), assuming downloaders get 1/2 of the file each, they would
be able to finish only if they have complementary halves of the file.
With a randomization at the piece level, with say, 1,000 pieces in
total, each of the downloaders need the right 500 pieces to make the
torrent healthy.
The probability for the blind client downloading the right 500 pieces,
not knowing which 500 the hidden peer has, are about 1/(2^1000), since
each piece has 1/2 chance of being the right piece, and that there are
1000 guesses to be made.
I still think sequential downloading gives the torrent a better chance
Say that each client downloads quarters of the file in random order. The
pieces in each quarter are downloaded sequentially. (ie, all the pieces
in quarter 1 are downloaded in sequential order, but the file may be
downloaded as Q2,Q4,Q3,Q1)
In this case, if the torrent behaves like you described above, the
chances for the torrent remaining healthy with the blind and hidden
peers present, but seed absent, are about 1/(2^4), since there are 4
guesses to be made. If each of the clients has 1/2 of the file when the
seed goes away, they have a 1/16 chance of finishing.
So comparing a download strategy of piece-by-piece-randomly (PBPR) or
quarter-by-quarter-randomly (QBQR), the QBQR strategy has a 6.25% chance
of finishing, while PBPR has no (1/(2^1000)) chance of finishing.
If the file is downloaded in halves instead, there is a 25% chance of
finishing this download. Perhaps each client can decide whether to
download halves, quarters, eights or whatever, depending on how many
1 seed, 0 peers = download halves
1 seed, 1 peer = download quarters
2 seeds, 2 peers = download eights
What do you think?
There is superseeding ... which defeats this approach totally.
MassaRoddel
What does superseeding have to do with improving lifespan of hard drives?
You assume that you can see the seed and then make sequential downloads,
but superseeding was made to prevent that. In that case you are forced by the
peer to download random pieces from it.
Arvid Norberg
2006-04-20 09:40:45 UTC
Permalink
Post by Radu Hociung
If the file is downloaded in halves instead, there is a 25% chance of
finishing this download. Perhaps each client can decide whether to
download halves, quarters, eights or whatever, depending on how many
1 seed, 0 peers = download halves
1 seed, 1 peer = download quarters
2 seeds, 2 peers = download eights
What do you think?
Maybe this paper has a good description.

http://www.eurecom.fr/~michiard/pubs/
bt_experiments_techRepINRIA-00001111_VERSION1_13FEBRUARY2006.pdf


--
Arvid Norberg
M***@massaroddel.de
2006-04-14 08:08:27 UTC
Permalink
Post by Radu Hociung
Greetings all,
I have done some experiments on libtorrent and Transmission, comparing
the effectiveness of their client-side caching and download scheduling
algorithms.
http://en.wikipedia.org/wiki/BitTorrent_performance
I hope that you'll find it a worthwhile read.
Regards,
Radu.
Nice work.

But I think a 1:1 (peer:seed) is the most unrealistic test environment.
I would say that you need at least 100 peers to make such tests
including a few seeds.

The only reason my client switches to full allocation mode is when
a file selection is made (meaning there is more than one file inside the torrent).
And because there are several files which will be written to
and read from, more seeking occurs naturally.

MassaRoddel
Radu Hociung
2006-04-14 15:28:25 UTC
Permalink
Post by M***@massaroddel.de
Post by Radu Hociung
Greetings all,
I have done some experiments on libtorrent and Transmission, comparing
the effectiveness of their client-side caching and download scheduling
algorithms.
http://en.wikipedia.org/wiki/BitTorrent_performance
I hope that you'll find it a worthwhile read.
Regards,
Radu.
Nice work.
Thank you.
Post by M***@massaroddel.de
But I think a 1:1 (peer:seed) is the most unrealistic test enviroment.
I would say that you need at least 100 peers to make such tests
including a few seeds.
The only reason my client switches to full allocation mode is then
a file selection is made (means there is more than one file inside the torrent).
And because there are several files which will be written to
and read from more seeking occurs naturally.
I chose this very simple test case, with two expectations:

1. It should be easiest to compare expected behaviour vs. measured
behaviour. In a torrent with 100 peers, and possibly many disjoint
swarms, it's much more difficult to formulate an expected behaviour.

2. It makes it easy to study the upper-bound of a swarm's efficiency. A
bigger swarm would be less efficient than the 1:1 swarm I am using.

I think it's worthwhile improving the ideal case as much as possible,
while not breaking the protocol, of course. I think allowing the torrent
to be more susceptible to failure is unacceptable. I would aim to at
least maintain the torrent's reliability, while improving the local disk
access efficiency.

If the upper bound were showing very good efficiency, there would be
cause to look into more realistic scenarios, but when the upper bound
shows such poor performance, I don't see what benefit you see to
studying more realistic swarms.

Depending on when you looked at wikipedia, you may have missed a late
addition I made, which is the graph of the unix utility "strings"
running on the same 585MB file. Since "strings" does no seeks, and reads
data sequentially, it is as effective as can be. I included this as a
reference for how I would expect the effectiveness of a torrent client
in a perfect swarm to behave.

There's about 99.5% room for improvement in the ideal 1:1 torrent, but
less than 0.8% in more realistic torrents. Would you not agree that the
biggest bang for the buck is to improve the ideal case first?

I am not familiar with the inner design of the various clients, so maybe
you can help me here: what swarm configuration will allow your clients
to shine (brighter than the 1:1 case :) ) in terms of disk-access
efficiency?

Any comments are welcome.

Radu.
Arvid Norberg
2006-04-19 17:54:54 UTC
Permalink
Post by Radu Hociung
Post by Arvid Norberg
[...]
It is true though, that in the case where the swarm has tendencies of
being "serialized", it is very common to request pieces which
were just
made available from a peer. But I'm not sure that is a very common
case. Obviously some data is needed here.
a. well-seeded torrents, and/or
b. applications where it is desirable to start playback as soon as
possible after the download starts. Obviously this makes sense mainly
for media-type torrents.
With serialized, I meant that the pieces downloaded are distributed
in a chain, rather than in a star. Possibly a poorly chosen word.

I think that in the case where peers join at different times, and
have different download speeds, the rarest-first algorithm might
actually make it less likely that a newly downloaded piece is
requested from a peer, since that piece just moved to a lower
priority level (having one extra peer).
Post by Radu Hociung
[...]
Post by Arvid Norberg
The client isn't completely free of choosing who to upload to/whose
requests to respond do. In order to maximize download speed, it has to
upload to those that it can download from. I also think that the
most
common case is that the output pipe very seldom is saturated. So, I'm
not sure weighting peers depending on the pieces they request would
work very well, it would affect the behavior of the client in
other ways.
Your distinction of cheap and expensive data doesn't take seek
distance
into account. I think that may be the most important thing to
concentrate on, at least when it comes to writing.
I thought your number 1 aim deals with seeds on high-speed links, hence
no downloading.
I think the cases of downloading, uploading and seeding are a bit
mixed up.
Post by Radu Hociung
Since there are different scenarios which have distinct properties,
maybe caching behaviour could be altered as necessary. The 4 scenarios are:
a. seeding with out pipe saturated
b. seeding with out pipe underutilized
c. downloading from torrent "good" or better
d. downloading from torrent "acceptable" or worse
The key difference between a-b is that the data being requested is
likely available cheaper from another seed.
The key difference between c-d is that a random dl pattern is
unnecessary in c.
True. The seeding case seems to be more difficult to optimize, since
it would require picking the "best" piece requests to respond to,
while still saturating the out pipe. It is difficult partly because
the limit of the out pipe is unknown and hard to estimate unless the
user sets it (and I think the likelihood of it being set accurately
is probably not too high), and partly because it involves picking
piece requests that may belong to different files and estimating the
seek distance for the drive's head. The selection would also have to
make sure that requests are answered sooner or later, to avoid
creating deadlocks.
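One way that selection could be sketched (hypothetical types, not a real
libtorrent interface): serve pending reads roughly in increasing byte
offset so the head sweeps forward, but let requests that have waited past
some deadline jump the queue so no peer is starved.

  #include <algorithm>
  #include <chrono>
  #include <cstdint>
  #include <vector>

  struct read_request
  {
      std::int64_t offset;                           // offset within the storage
      std::chrono::steady_clock::time_point queued;  // when the request arrived
  };

  // Overdue requests go first (in arrival order); the rest are served in
  // increasing offset order, approximating one forward sweep of the head.
  std::vector<read_request> order_reads(std::vector<read_request> reqs,
                                        std::chrono::milliseconds max_age)
  {
      auto const now = std::chrono::steady_clock::now();
      auto const overdue_end = std::stable_partition(reqs.begin(), reqs.end(),
          [&](read_request const& r) { return now - r.queued > max_age; });
      std::sort(overdue_end, reqs.end(),
          [](read_request const& a, read_request const& b)
          { return a.offset < b.offset; });
      return reqs;
  }
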
Post by Radu Hociung
Post by Arvid Norberg
Post by Radu Hociung
Post by Arvid Norberg
Number 2 is a little bit more interesting though. Since every block
that is downloaded has to be written to disk, and since every
block is
only downloaded once, you cannot really optimize away disk writes.
What
you can do though, is to optimize the disk seeking.
Absolutely. Also, since while downloading, the client also
uploads, it
would be more disk efficient to only seek for disk writes, but
never for
disk reads. I.e., the client should prefer to respond to requests for data
it has freshly downloaded, instead of reading other data from disk.
Assuming that the client downloads least-common data first, it's a
reasonable assumption that the data it holds in memory buffers is still
rare to other clients.
Yes, at least to a certain degree. One thing to keep in mind though,
is that the number of pieces in a torrent is generally much greater
than the number of peers that a client can see. How rare a piece is
is quantized to the number of peers that have it. This means that the
number of equally rare pieces is usually very high, far higher than a
write cache would hold.
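To put numbers on that quantization (illustrative figures only): with
around 50 visible peers there are at most 51 distinct availability values,
so a torrent of a couple of thousand pieces necessarily has large buckets
of equally rare pieces. A quick histogram makes that visible:

  #include <map>
  #include <vector>

  // availability -> how many pieces have exactly that availability.
  // With 2000 pieces and 50 peers there are at most 51 buckets, so the
  // "rarest" level typically contains dozens of interchangeable pieces.
  std::map<int, int> rarity_histogram(std::vector<int> const& availability)
  {
      std::map<int, int> pieces_per_level;
      for (int a : availability) ++pieces_per_level[a];
      return pieces_per_level;
  }
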
I think you're making the (healthy) assumption that a client would cache
one, or very few pieces for upload? This is very good for lowering
memory footprints of the client.
What I am thinking is that as soon as a client advertises a completed
piece, many swarm members would request it immediately, instead of
requesting other pieces. The just-downloaded piece is cheap data, and
the downloading client would prioritize its upload.
I'm not sure that is very common. Some measurement should be done
here; parsing libtorrent logs would be enough.
Post by Radu Hociung
In an ideal swarm, each piece in a torrent would be read from a disk
exactly once, and written to disk exactly N-1 times (for N clients in
the swarm). A torrent that is well seeded ("good" or better) should be
able to approach this ideal, without risking the availability of any
piece to be so low as to endanger the torrent. For a torrent that is
"acceptable" or worse, the rarest-first approach may be given a higher
preference than the ideal.
Yes, this is probably a very good idea.
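For reference, the ideal's disk traffic is simple to count (the piece size
and swarm size below are made-up illustration figures): every piece is read
from disk once and written N-1 times across the whole swarm.

  #include <cstdint>
  #include <cstdio>

  // Lower bound on disk I/O in the ideal swarm described above.
  int main()
  {
      std::int64_t const num_pieces = 2340; // e.g. a 585 MB torrent with 256 kB pieces
      std::int64_t const peers = 2;         // the 1:1 swarm from the test
      std::printf("ideal total: %lld piece reads, %lld piece writes\n",
                  (long long)num_pieces,
                  (long long)(num_pieces * (peers - 1)));
  }
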
Post by Radu Hociung
Post by Arvid Norberg
Post by Radu Hociung
In this case, the library should write to disk when downloaded pieces
are complete, but do not free the buffers until other clients have
downloaded them enough to cause the uplink utilization to drop.
I have thought about optimizing the dataflow a lot. I believe there are two main benefits:
1. Fewer seeks == less noise == drives lasting longer.
2. Fewer seeks == drive is more responsive to other apps that run
concurrently.
Definitely, I think the disk performance is very important.
I might add that there's a huge difference for me (OSX 10.4) when
downloading in full allocation mode and compact allocation mode. The
compact allocation mode will have the effect that most reads and
writes
are kept relatively close to each other. They will spread out
over time
of course though. But just downloading with client_test will
saturate
my connection at slightly above 1 MB/s if I'm in compact mode. In full
allocation mode though, I will not get above about 600 kB/s, and my
drive is really making a noise and slowing down the computer to be
virtually useless.
Fascinating! This would seem to support compact/serialized
downloading.
[...]
Post by Arvid Norberg
So, I think my first attempt will be to cache pieces, and only write
whole pieces instead of blocks to disk. To see if the performance is
improved. I don't know how to measure it though, maybe I could try
saturating my download link in full allocation mode and see if it
increases.
But would this not increase the memory footprint of the application,
seeing how as many download buffers are needed as there are parallel
downloads, while none of these buffers can be used to upload from, as
their data has not been checksummed until they have completed
downloading?
If the download cache has a fixed size, finished pieces can be kept
in there as long as possible, allowing it to be used as a read cache
as well.
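A minimal sketch of that idea (hypothetical names, not libtorrent's disk
I/O code): a fixed-size cache accumulates completed pieces, flushes each
one to disk in a single contiguous write, and keeps the buffer around to
serve uploads until the space is reclaimed.

  #include <cstddef>
  #include <list>
  #include <map>
  #include <vector>

  struct piece_cache_sketch
  {
      std::size_t max_pieces;                    // fixed cache budget, in pieces
      std::map<int, std::vector<char>> pieces;   // piece index -> full piece buffer
      std::list<int> retire_order;               // finished pieces, oldest first

      explicit piece_cache_sketch(std::size_t n) : max_pieces(n) {}

      // called after all blocks of `piece` have arrived and passed the hash check
      void piece_complete(int piece, std::vector<char> buf)
      {
          if (pieces.size() >= max_pieces && !retire_order.empty())
          {
              pieces.erase(retire_order.front()); // make room: drop the oldest piece
              retire_order.pop_front();
          }
          write_whole_piece(piece, buf);          // one contiguous write, no per-block seeks
          pieces[piece] = std::move(buf);         // keep it in memory as a read cache
          retire_order.push_back(piece);
      }

      // upload path: serve from memory if the piece is still cached
      char const* try_read(int piece) const
      {
          auto it = pieces.find(piece);
          return it == pieces.end() ? nullptr : it->second.data();
      }

      void write_whole_piece(int /*piece*/, std::vector<char> const& /*buf*/)
      {
          // platform file I/O would go here
      }
  };
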
Post by Radu Hociung
Post by Arvid Norberg
Do you have any other suggestions that are reasonably easy to try out?
I think the first thing to do in studying the effects of libtorrent
caching would be to disable the O/S cache, in order to make
comparisons
between versions relevant.
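One plausible way to take the O/S cache out of the measurement on POSIX
systems (a sketch only; the flags are platform-specific, and O_DIRECT
imposes alignment requirements on buffers and file offsets):

  #include <fcntl.h>

  // Sketch only: open a file so that reads/writes bypass the OS cache.
  // On Linux, O_DIRECT skips the page cache (and requires aligned I/O);
  // on Mac OS X, F_NOCACHE asks the kernel not to cache this descriptor.
  int open_uncached(char const* path)
  {
  #if defined(O_DIRECT)
      return open(path, O_RDWR | O_DIRECT);
  #else
      int fd = open(path, O_RDWR);
  #if defined(F_NOCACHE)
      if (fd >= 0) fcntl(fd, F_NOCACHE, 1);
  #endif
      return fd;
  #endif
  }
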
It would be interesting to find out how much riskier it is for a client
to serialize its downloads on poor torrents. Obviously it is safest to
do rarest-first on such torrents, but I am curious as to the
difference
between rarest-first and serialized.
I would say it's a huge difference. This is not too hard to calculate
analytically though, no real need to test it.
Post by Radu Hociung
Also, we should not forget of the relationship between memory
footprint
and caching effectiveness. It's easy to get more effective by allowing
increased memory use, but we should resist increased memory use when
possible.
Yes. It has to be configurable and possible to disable.
Post by Radu Hociung
How are you able to get 1MB/s ? Do you have torrents running on a test
network? Are the experiments reproducible? Would you mind describing
your test setup?
I have a 10 Mb/s symmetric connection, and so do most other people on
this tracker I use.


--
Arvid Norberg
Radu Hociung
2006-04-19 22:26:37 UTC
Permalink
(adding very little, but keeping most old text for context)
Post by Radu Hociung
Post by Arvid Norberg
[...]
It is true though, that in the case where the swarm has tendencies of
being "serialized", it is very common to request pieces which were
just
made available from a peer. But I'm not sure that is a very common
case. Obviously some data is needed here.
a. well-seeded torrents, and/or
b. applications where it is desirable to start playback as soon as
possible after the download starts. Obviously this makes sense mainly
for media-type torrents.
With serialized, I meant that the pieces downloaded are distributed in
a chain, rather than in a star. Possibly a poorly chosen word.
I think that in the case where peers join at different times, and have
different download speeds, the rarest-first algorithm might actually
make it less likely that a newly downloaded piece is requested from a
peer, since that piece just got bumped down to a lower priority level
(having one extra peer).
Post by Radu Hociung
[...]
Post by Arvid Norberg
The client isn't completely free of choosing who to upload to/whose
requests to respond to. In order to maximize download speed, it has to
upload to those that it can download from. I also think that the most
common case is that the output pipe is very seldom saturated. So, I'm
not sure weighting peers depending on the pieces they request would
work very well, it would affect the behavior of the client in other
ways.
Your distinction of cheap and expensive data doesn't take seek
distance
into account. I think that may be the most important thing to
concentrate on, at least when it comes to writing.
I thought your number 1 aim deals with seeds on high-speed links, hence
no downloading.
I think the cases of downloading, uploading and seeding are a bit mixed
up.
Post by Radu Hociung
Since there are different scenarios which have distinct properties,
maybe caching behaviour could be altered as necessary. The 4 scenarios are:
a. seeding with out pipe saturated
b. seeding with out pipe underutilized
c. downloading from torrent "good" or better
d. downloading from torrent "acceptable" or worse
The key difference between a-b is that the data being requested is
likely available cheaper from another seed.
The key difference between c-d is that a random dl pattern is
unnecessary in c.
True. The seeding case seems to be more difficult to optimize, since it
would require picking the "best" piece requests to respond to, while
still saturating the out pipe. It is difficult partly because the
limit of the out pipe is unknown and hard to estimate unless the user
sets it (and I think the likelihood of it being set accurately is
probably not too high), and partly because it involves picking piece
requests that may belong to different files and estimating the seek
distance for the drive's head. The selection would also have to make
sure that requests are answered sooner or later, to avoid creating
deadlocks.
Post by Radu Hociung
Post by Arvid Norberg
Post by Arvid Norberg
Number 2 is a little bit more interesting though. Since every block
that is downloaded has to be written to disk, and since every
block is
only downloaded once, you cannot really optimize away disk writes.
What
you can do though, is to optimize the disk seeking.
Absolutely. Also, since while downloading, the client also uploads, it
would be more disk efficient to only seek for disk writes, but
never for
disk reads. I.e., the client should prefer to respond to requests for data
it has freshly downloaded, instead of reading other data from disk.
Assuming that the client downloads least-common data first, it's a
reasonable assumption that the data it holds in memory buffers is still
rare to other clients.
Yes, at least to a certain degree. One thing to keep in mind though,
is that the number of pieces in a torrent is generally much greater
than the number of peers that a client can see. How rare a piece is
is quantized to the number of peers that have it. This means that the
number of equally rare pieces is usually very high, far higher than a
write cache would hold.
I think you're making the (healthy) assumption that a client would cache
one, or very few pieces for upload? This is very good for lowering
memory footprints of the client.
What I am thinking is that as soon as a client advertises a completed
piece, many swarm members would request it immediately, instead of
requesting other pieces. The just-downloaded piece is cheap data, and
the downloading client would prioritize its upload.
I'm not sure that is very common. Some measurement should be done here;
parsing libtorrent logs would be enough.
Post by Radu Hociung
In an ideal swarm, each piece in a torrent would be read from a disk
exactly once, and written to disk exactly N-1 times (for N clients in
the swarm). A torrent that is well seeded ("good" or better) should be
able to approach this ideal, without risking the availability of any
piece to be so low as to endanger the torrent. For a torrent that is
"acceptable" or worse, the rarest-first approach may be given a higher
preference than the ideal.
Yes, this is probably a very good idea.
Post by Radu Hociung
Post by Arvid Norberg
In this case, the library should write to disk when downloaded pieces
are complete, but do not free the buffers until other clients have
downloaded them enough to cause the uplink utilization to drop.
I have thought about optimizing the dataflow a lot. I believe there are two main benefits:
1. Fewer seeks == less noise == drives lasting longer.
2. Fewer seeks == drive is more responsive to other apps that run
concurrently.
Definitely, I think the disk performance is very important.
I might add that there's a huge difference for me (OSX 10.4) when
downloading in full allocation mode and compact allocation mode. The
compact allocation mode will have the effect that most reads and
writes
are kept relatively close to each other. They will spread out over
time
of course though. But just downloading with client_test will saturate
my connection at slightly above 1 MB/s if I'm in compact mode. In full
allocation mode though, I will not get above about 600 kB/s, and my
drive is really making a noise and slowing down the computer to be
virtually useless.
Fascinating! This would seem to support compact/serialized downloading.
[...]
Post by Arvid Norberg
So, I think my first attempt will be to cache pieces, and only write
whole pieces instead of blocks to disk. To see if the performance is
improved. I don't know how to measure it though, maybe I could try
saturating my download link in full allocation mode and see if it
increases.
But would this not increase the memory footprint of the application,
seeing how as many download buffers are needed as there are parallel
downloads, while none of these buffers can be used to upload from, as
their data has not been checksummed until they have completed
downloading?
If the download cache has a fixed size, finished pieces can be kept in
there as long as possible, allowing it to be used as a read cache as well.
I would just like to suggest a possible definition for "as long as
possible": the algorithm for adding/replacing cached data should be "add
rarest pieces, retire most plentiful pieces, and replace only when the
new piece is rarer than the replaced piece", since a rare piece is more
likely to be requested again.
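A small sketch of that replacement rule (hypothetical names): the cache
tracks each entry's availability, and a new piece only displaces the most
plentiful cached piece when it is strictly rarer.

  #include <cstddef>
  #include <iterator>
  #include <map>

  struct rarity_cache_policy
  {
      std::multimap<int, int> by_availability;  // availability -> cached piece index
      std::size_t capacity;

      explicit rarity_cache_policy(std::size_t cap) : capacity(cap) {}

      // Returns the piece evicted to make room, or -1 if nothing was evicted
      // (either there was free space, or the new piece was not rare enough).
      int admit(int piece, int availability)
      {
          if (by_availability.size() < capacity)
          {
              by_availability.emplace(availability, piece);
              return -1;
          }
          auto most_plentiful = std::prev(by_availability.end());
          if (availability >= most_plentiful->first)
              return -1;                         // new piece is not rarer; don't cache it
          int const evicted = most_plentiful->second;
          by_availability.erase(most_plentiful);
          by_availability.emplace(availability, piece);
          return evicted;
      }
  };
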
Post by Radu Hociung
Post by Arvid Norberg
Do you have any other suggestions that are reasonably easy to try out?
I think the first thing to do in studying the effects of libtorrent
caching would be to disable the O/S cache, in order to make comparisons
between versions relevant.
It would be interesting to find out how much riskier it is for a client
to serialize its downloads on poor torrents. Obviously it is safest to
do rarest-first on such torrents, but I am curious as to the difference
between rarest-first and serialized.
I would say it's a huge difference. This is not too hard to calculate
analytically though, no real need to test it.
Right. I gave it a shot in the other subthread with the
quarter-by-quarter-random vs. piece-by-piece-random strategy comparison.
Is that what you had in mind?
Post by Radu Hociung
Also, we should not forget of the relationship between memory footprint
and caching effectiveness. It's easy to get more effective by allowing
increased memory use, but we should resist increased memory use when
possible.
Yes. It has to be configurable and possible to disable.
Post by Radu Hociung
How are you able to get 1MB/s ? Do you have torrents running on a test
network? Are the experiments reproducible? Would you mind describing
your test setup?
I have a 10 Mb/s symmetric connection, and so do most other people on
this tracker I use.
Wow... I'm feeling envy and an itch to immigrate :)

Cheers,
Radu.
Arvid Norberg
2006-04-25 19:10:02 UTC
Permalink
Post by Radu Hociung
Post by Arvid Norberg
I would say it's a huge difference. This is not too hard to calculate
analytically though, no real need to test it.
Right. I gave it a shot in the other subthread with the
quarter-by-quarter-random vs. piece-by-piece-random strategy
comparison.
Is that what you had in mind?
No. Saying that it would be easy to show analytically was not very
accurate though. Just to formulate how to measure performance is
quite difficult. My best attempt would be something like:

the goal is to have as high a probability as possible that the overlap
between two randomly selected peers' piece sets is as small as possible.

Striving for that would result in peers being able to exchange data
as much as possible.
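One way to put a number on that goal (illustrative figures only): if two
peers each hold k of the P pieces, chosen independently and uniformly at
random, their expected overlap is k*k/P pieces; a picker that meets the
goal above would keep the measured overlap below that baseline, leaving
the pair more to trade.

  #include <cstdio>

  int main()
  {
      double const P = 2340;  // pieces in the torrent (made-up figure)
      double const k = 1170;  // pieces held by each of the two peers (50% done)
      // expected number of pieces both peers already have, under uniform
      // random piece selection
      std::printf("expected overlap: %.0f of %.0f pieces\n", k * k / P, k);
  }
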

Your suggestion is basically just a way to increase the piece sizes,
which will decrease the granularity of the current algorithm.


--
Arvid Norberg
