[gPXE-devel] Sleep entropy at boot time

Eric W. Biederman ebiederm at xmission.com
Mon Aug 23 09:53:33 EDT 2010


Erwan Velu <erwanaliasr1 at gmail.com> writes:

> [...]
>
> Hey Eric,
>
> Thanks a lot for all your input.
>
> I'll try to answer all your questions.
>
> 1°) Why we need this particular behaviour
> My "cluster" is in a very particular network & power industrial environment.
> Those low-power nodes are, yes, booted at the same time, and they do have to
> load some content from a central server.

I was thinking from my experience with high performance computing.  Your
case is very similar, just with lower-end hardware.  That changes the
assumptions about what is available a little bit, but I don't think it
should change the on-the-wire behavior.

> The network bandwidth is really low (100mbit) and each system has to load some
> content to boot, as we are running diskless at that point.

You speak of 100mbit and you speak of collisions.  Is this 100mbit going
through a switch?  I don't recall many ways 100mbit can actually be a
shared fabric; I think all of that was back in the 10mbit days.  There
is a bit of a buffering advantage to going through a real switch, but I
don't expect it makes much of a real-world difference when every machine
is talking to the same server.

> The last point is that our "main" server is very, very light, so I'd like not
> to load it too much.

At least for dhcp the load should not be much.  But let me play with some
numbers.  A 1500 byte frame takes roughly 1538 bytes on the wire, once
you add the preamble, ethernet header, checksum and interframe gap.  At
100mbit that gives you roughly 8127 full-sized packets per second, and in
practice you should be sending noticeably smaller packets.

A dhcp transaction needs four packets: discover, offer, request and ack.
For 720 clients that is 2880 packets, well within the roughly 8k frames
per second the link can carry.

To keep server load down (as in load average) you might want multicast
or an event-oriented server instead of a process/thread-per-client model,
but otherwise I really don't see a problem.  A 100mbit stream is tiny.

> 2°) Why I'm using a 30sec delay
> While computing a random sleep to avoid collisions, I have to ensure that
> several systems don't end up with the same time-stamp.

Having systems with exactly the same time-stamp is a problem if that is
your only input to a random number generator, but beyond that I don't
see why having the same time-stamp is a concern.  Any reasonable
protocol should have additional differentiators besides time.

> Let's say a random 15s delay with 720 systems booting at the same time: if the
> random generator produces values that are too close together (let's imagine an
> average distance of 7sec between systems), we'll face collision problems.

Collision problems?  I would think a few systems doing exponential
backoff in the face of collisions or dropped packets will give you the
delay you are looking for, naturally taking things out of lockstep
without assuming that everyone is in lockstep at the beginning of
time.

> So my first guess is that 30sec is enough to avoid too many systems trying to
> download stuff at the same time.
> This value will surely be changed as we gain experience in our
> environment.

Honestly I think that is a silly way to look at things.

> 3°) Adding more randomization
> Agree with you, we have some improvements to make in the random() call, and I
> think using part of the MAC address to generate the seed is pretty efficient.
> What do you think about using the cmos time too?

In the situations I have dealt with, it is the cmos time that is in
lockstep, because the clocks are synchronized with ntp.  So I don't
think mixing the cmos time in is wrong; I also don't think it is
particularly interesting.

> 4°) GPXE integration
> I perfectly understand that my need isn't common at all and doesn't have to be
> integrated as-is in gPXE.
> That said, this patch would have been submitted with a default value of
> MAX_RANDOM_SLEEP_TIME set to 0 to disable the behavior.

The initial delay is not common, and I would argue that the initial
delay is almost certainly unnecessary and a little bit wrong.  It
is just a case of inserting a magic delay somewhere and hoping
that makes things work.  That is almost always the sign of a bug,
and of impending trouble when the world behaves differently than
your delay assumes.

The problem of congestion is common, and all of the protocols specify
what happens in a congested network.  This looks to me like a case
where the congestion handling simply needs bug fixing, rather than any
special handling.  With that bug fixing everyone will benefit.

It looks to me like there is a bug in src/net/retry.c: it does not add
any jitter when performing exponential backoff.  This leaves the
possibility that two or more machines could cause packet drops and
collisions by backing off exactly in step.

Other than not introducing jitter in the backoff it looks like the
current gpxe implementation should handle what you are doing just fine.

Eric

