[gPXE] [Etherboot-discuss] SRP timeout

M Lowe mlowe at shaw.ca
Sun Jul 11 20:18:11 EDT 2010


I have now been able to log the debug messages; however, I see no errors
that would indicate where the problem is.

Just to recap quickly: SAN booting over InfiniBand using SRP doesn't
work and just times out. The timeout occurs while waiting for a response
to the SRP login request. I'm fairly certain the problem lies within
gPXE because I can access the SRP target just fine through a local
installation of Windows. In addition, on the SRP target side I have
traced through the ib_srpt module and found that a login response is
generated and sent (or at least posted to the mthca module work queue).

On the gPXE side I've found that I'm not receiving the SRP_LOGIN_RSP
packet even at the InfiniBand protocol level (net/infiniband.c). So far
I have been able to determine that the packet is lost at some point in
the Arbel driver (drivers/infiniband/arbel.c) before arbel_complete().
This would indicate that the problem lies within the Arbel driver, and
would explain why SRP sanboot worked with the Hermon driver. Despite
compiling with DEBUG=arbel:3, I get no errors indicating any problems or
dropped packets.

Here is the output from autoboot with
DEBUG=srp,ipoib,arp,infiniband,ib_cm,ib_cmrc,ib_mcast,ib_mi,ib_packet,ib_pathrec,ib_sma,ib_smc,ib_srp

Note: I have added some debug messages to help illustrate the flow of
packets. At the beginning of ipoib_complete_recv, ib_complete_recv, and
ib_mi_complete_recv I have added "RX" debug messages. 

Booting from root path "ib_srp::::fe800000000000000002c9020022e5e5::0002c9020022e5e4::0002c9020022e5e4:0002c9020022e5e4"
SRP 0xbb134 using ib_srp::::fe800000000000000002c9020022e5e5::0002c9020022e5e4::0002c9020022e5e4:0002c9020022e5e4
SRP attached successfully
IBDEV 0xb9a84 creating completion queue
IBDEV 0xb9a84 created 8-entry completion queue 0xbb4c4 (0xbb214) with CQN 0x83
IBDEV 0xb9a84 creating queue pair
IBDEV 0xb9a84 created queue pair 0xbb4f4 (0xbb5c4) with QPN 0x550403
IBDEV 0xb9a84 QPN 0x550403 has 4 send entries at [0xbb5a0,0xbb5b0)
IBDEV 0xb9a84 QPN 0x550403 has 2 receive entries at [0xbb5b0,0xbb5b8)
CMRC 0xbb1b4 using QPN 550403
SRP 0xbb134 TX login request tag 0000000000000001
CM 0xbbb64 created for IBDEV 0xb9a84 QPN 550403
CM 0xbbb64 connecting to fe800000:00000000:0002c902:0022e5e5 0002c902:0022e5e4
MI 0xba564 TX TID 6750584500000003 (03,02,01,0035) status 0000
infiniband RX
MI 0xba564 RX
MI 0xba564 RX TID 6750584500000003 (03,02,81,0035) status 0000
IBDEV 0xb9a84 path to fe800000:00000000:0002c902:0022e5e5 is 0007 sl 0 rate 6
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
infiniband RX
IPoIB 0xb9ccc RX
ARP cache add: IP 10.20.76.1 => IPoIB 80000404:fe800000:00000000:0002c902:0022e5e5
ARP reply: IP 10.20.76.45 => IPoIB 00550402:fe800000:00000000:0002c902:00243035
IPoIB peer 4 has MAC 80000404:fe800000:00000000:0002c902:0022e5e5
MI 0xba564 TX TID 6750584500000005 (03,02,01,0035) status 0000
infiniband RX
MI 0xba564 RX
MI 0xba564 RX TID 6750584500000005 (03,02,81,0035) status 0000
MI 0xba564 RX TID 6750584500000005 handling via transaction handler
IBDEV 0xb9a84 path to fe800000:00000000:0002c902:0022e5e5 is 0007 sl 0 rate 6
infiniband RX
IPoIB 0xb9ccc RX
ARP cache update: IP 10.20.76.1 => IPoIB 80000404:fe800000:00000000:0002c902:0022e5e5
ARP reply: IP 10.20.76.45 => IPoIB 00550402:fe800000:00000000:0002c902:00243035
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 abandoning TID 6750584500000004
CM 0xbbb64 connection request failed: Connection timed out (0x4c206035)
CMRC 0xbb1b4 disconnected: Connection timed out (0x4c206035)
SRP 0xbb134 socket closed: Connection timed out (0x4c206035)
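
The MI lines in the log above can be cross-checked mechanically. What
follows is a minimal Python sketch (not part of gPXE) that pairs TX and
RX management datagrams by TID and lists those that never got a reply.
It assumes the bracketed tuple in each MI line is (mgmt_class,
class_version, method, attr_id). The small name tables reflect the
InfiniBand values as understood here (class 03 = SubnAdm, 07 = CM;
method 01 = Get, 03 = Send, 81 = GetResp; attribute 0035 = PathRecord,
CM attribute 0010 = ConnectRequest) and should be treated as
assumptions. The function names are made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
# Assumption: the tuple is (mgmt_class, class_version, method, attr_id).
MAD_LINE = re.compile(
    r"MI 0x[0-9a-f]+ (TX|RX) TID ([0-9a-f]{16}) "
    r"\(([0-9a-f]{2}),([0-9a-f]{2}),([0-9a-f]{2}),([0-9a-f]{4})\)"
)

# Partial name tables (assumed values, only what this log needs).
MGMT_CLASSES = {0x03: "SubnAdm", 0x07: "CM"}
METHODS = {0x01: "Get", 0x03: "Send", 0x81: "GetResp"}
ATTRIBUTES = {(0x03, 0x0035): "PathRecord", (0x07, 0x0010): "ConnectRequest"}


def describe(mgmt_class, method, attr_id):
    """Return a readable name for a MAD header, falling back to hex."""
    return " ".join([
        MGMT_CLASSES.get(mgmt_class, f"class {mgmt_class:02x}"),
        METHODS.get(method, f"method {method:02x}"),
        ATTRIBUTES.get((mgmt_class, attr_id), f"attr {attr_id:04x}"),
    ])


def unanswered_tids(log_text):
    """List (tid, tx_count, description) for TIDs sent but never received."""
    sent = defaultdict(int)   # TID -> number of transmissions
    header = {}               # TID -> (mgmt_class, method, attr_id) of the TX
    answered = set()          # TIDs seen in an RX line
    for line in log_text.splitlines():
        match = MAD_LINE.search(line)
        if not match:
            continue          # not an MI line carrying a header tuple
        direction, tid = match.group(1), match.group(2)
        mgmt_class, _version, method, attr_id = (
            int(field, 16) for field in match.group(3, 4, 5, 6)
        )
        if direction == "TX":
            sent[tid] += 1
            header[tid] = (mgmt_class, method, attr_id)
        else:
            answered.add(tid)
    return [
        (tid, count, describe(*header[tid]))
        for tid, count in sent.items()
        if tid not in answered
    ]


# The MI TX/RX lines from the log above, in order.
SAMPLE_LOG = """\
MI 0xba564 TX TID 6750584500000003 (03,02,01,0035) status 0000
MI 0xba564 RX TID 6750584500000003 (03,02,81,0035) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
MI 0xba564 TX TID 6750584500000005 (03,02,01,0035) status 0000
MI 0xba564 RX TID 6750584500000005 (03,02,81,0035) status 0000
MI 0xba564 TX TID 6750584500000004 (07,02,03,0010) status 0000
"""

if __name__ == "__main__":
    for tid, count, what in unanswered_tids(SAMPLE_LOG):
        print(f"TID {tid}: {what}, sent {count} times, no reply")
```

On this excerpt the sketch reports only TID 6750584500000004, sent five
times with no reply, which matches the "abandoning TID" line. Both
PathRecord queries (TIDs ending in 03 and 05) are answered.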



From: Itay Gazit [mailto:itaygazit at gmail.com] 
Sent: Friday, June 25, 2010 11:47 AM
To: Stefan Hajnoczi; M Lowe
Cc: etherboot-discuss at lists.sourceforge.net; gpxe; Michael Brown
Subject: Re: [Etherboot-discuss] SRP timeout

Hi Matthew,
Stefan is right, you should reduce the depth of the DEBUG messages to
find the cause of the failure.
I have tried SRP boot only with Hermon driver (ConnectX) and it worked
for me.
Regards,
Itay

