[gPXE-devel] Sleep entropy at boot time

Mon Aug 23 02:46:43 EDT 2010

Erwan Velu <erwanaliasr1 at gmail.com> writes:

> Hey fellows,
>
> When booting hundreds of similar systems at the same time, we need to add some sleep entropy before the DHCP & PXE stuff starts.
> That avoids a massive incast problem. To illustrate, consider that I'm facing up to 720 similar machines that boot at exactly the same time.
>
> Adding some sleeps before the DHCP start is a good thing for me.
>
> I've been working on a prototype and faced one big issue. The current random() implementation uses currentick() as seed.
> But as you can guess, the time needed to reach DHCP is nearly the same on all my systems, so I get mostly the same results everywhere.
>
> So I used the RTC to grab the time and used the last digit of the MAC address to increase the entropy.
>
> I know the patch isn't perfect, and the CMOS code might be moved to the random() thing ... but I preferred submitting a first prototype to raise issues & comments about this strategy.
>
>
> I just have to say this trick worked great on my hosts.
>
> Please find below my git commit in my personal gpxe repo.

Hmm.  If you need to do this, something feels wrong.

Nearly a decade ago I was doing this with a larger cluster (MCR, a
little over a thousand nodes) with a much shorter randomization.
The nodes were powered on in 10 batches with a tenth of a second delay
in between.  Do you really have enough independence of power to power
on all 720 nodes at precisely the same time?  Grumble.  This case
not working is a regression from Etherboot 5.4.  Grumble.

Looking at the code, we have exponential backoff in retry.c
when one of our timers expires, but we don't add random jitter to that.
I expect the lack of random jitter after a collision is your problem.
RFC 1531 describes what needs to happen in that case reasonably well.
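
To make the jitter idea concrete, here is a minimal sketch of exponential
backoff with a random offset added to each timeout.  It is not the retry.c
code.  The helper names are hypothetical, and the constants (4 second start,
64 second cap, plus or minus 1 second of jitter) are assumptions based on my
reading of the DHCP retransmission guidance, so check them against the RFC
text before relying on them.

```c
#include <stdint.h>

/* Assumed values for illustration only; these are not gPXE constants. */
#define BACKOFF_START_MS   4000u
#define BACKOFF_MAX_MS    64000u
#define BACKOFF_JITTER_MS  1000u

/* Hypothetical helper: double the base timeout, clamped to
 * the range [BACKOFF_START_MS, BACKOFF_MAX_MS]. */
static uint32_t backoff_next(uint32_t base_ms)
{
	if (base_ms < BACKOFF_START_MS)
		return BACKOFF_START_MS;
	if (base_ms >= BACKOFF_MAX_MS / 2)
		return BACKOFF_MAX_MS;
	return base_ms * 2;
}

/* Hypothetical helper: shift the base timeout by a roughly uniform
 * offset in [-BACKOFF_JITTER_MS, +BACKOFF_JITTER_MS].  rnd is any
 * random value supplied by the caller, e.g. the result of random(). */
static uint32_t backoff_with_jitter(uint32_t base_ms, uint32_t rnd)
{
	uint32_t span = 2 * BACKOFF_JITTER_MS + 1;

	/* Too small to subtract the jitter from; leave it alone. */
	if (base_ms < BACKOFF_JITTER_MS)
		return base_ms;
	return base_ms - BACKOFF_JITTER_MS + (rnd % span);
}
```

The base timeout keeps doubling on its own, and the jitter is applied only
to the value actually armed in the timer, so clients that collided once do
not stay in lockstep on later retries.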

Improving the random number generator seems like a good idea as well,
especially mixing in the MAC address.
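
As a rough illustration of mixing in the MAC address, the sketch below folds
a tick count and all the address bytes into one 32-bit seed using the FNV-1a
hash.  The function name is hypothetical and FNV-1a is my choice for the
example, not what the submitted patch or gPXE does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine a tick count and a MAC address into a
 * 32-bit seed with FNV-1a.  Two clients with the same tick count but
 * different MAC addresses always get different seeds, because each
 * hashing step is a one-to-one mapping of the 32-bit state. */
static uint32_t seed_from_mac(uint32_t ticks, const uint8_t *mac,
			      size_t mac_len)
{
	uint32_t h = 2166136261u;	/* FNV-1a 32-bit offset basis */
	size_t i;

	/* Mix in the four bytes of the tick count, low byte first. */
	for (i = 0; i < 4; i++) {
		h ^= (ticks >> (8 * i)) & 0xff;
		h *= 16777619u;		/* FNV 32-bit prime */
	}

	/* Mix in every byte of the MAC address, not just the last one. */
	for (i = 0; i < mac_len; i++) {
		h ^= mac[i];
		h *= 16777619u;
	}

	return h;
}
```

The result would be handed to whatever routine seeds random().  Using the
whole address rather than only its last digit avoids identical seeds on
machines whose addresses happen to end the same way.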

Generally the gPXE code has been organized so it doesn't need special
cases and the same code can be used for all clients.  Having seen
this done without special cases, I don't see why we would need a special
case now.

Why are you waiting a random time of up to 30 seconds?  Is that just an
arbitrary number to spread things out, or is it meant to
spread out later unicast TFTP downloads?  30 seconds is a lot of time,
and even the average 15 seconds is not something I would really like
to wait through unless I had to.  There is a reason most managed
Ethernet switches have a port fast option.  So is jitter enough to
sort things out for you?

Given that I could install all of the MCR nodes in about 5 minutes from
power on, on the servers of the day, I can't imagine that our much
faster and more modern servers would really have a problem with the
load.

Eric