Michael Decker: Driver Development
Week 7
9 July
A new branch, drivers6, was created and merged with the mainline via git pull origin master. This brought the GDB code into my tree. While I was experimenting with GDB, a segfault was reported just after the point where gPXE was freezing during the second NIC boot. I ran a backtrace:
Program received signal SIGSEGV, Segmentation fault.
alloc_memblock (size=96, align=<value optimized out>) at include/gpxe/list.h:64
64      __list_add ( new, head, head->next );
(gdb) backtrace
#0  alloc_memblock (size=96, align=<value optimized out>) at include/gpxe/list.h:64
#1  0x00007cd1 in realloc (old_ptr=0x0, new_size=80) at core/malloc.c:265
#2  0x00007d2f in zalloc (size=96) at core/malloc.c:332
#3  0x0000814b in resolv (resolv=0x78a8, name=0xf "Ãë\t\017¾CÿèR", sa=0x33ad8) at core/resolv.c:260
#4  0x0000823b in xfer_open_named_socket (xfer=0x784c, semantics=208084, peer=0x33ad8, name=0x13356 "192.168.2.8", local=0x0) at core/resolv.c:389
#5  0x00005f64 in http_open_filter (xfer=0x12de8, uri=0x13324, default_port=80, filter=0) at net/tcp/http.c:501
#6  0x00012a70 in mtftp_uri_opener ()
#7  0x00012de8 in heap ()
#8  0x00005fc9 in http_open (xfer=0x60, uri=0xf) at net/tcp/http.c:527
#9  0x00000000 in ?? ()
I'm not sure why the segfault occurred, although I do see that the name parameter passed to resolv() in frame #3 (name=0xf) is not valid.
Marty recommended I install wireshark and take a look at what's happening. Additionally, testing at his end showed two different NICs failing iSCSI booting, but passing HTTP booting. I haven't tried iSCSI booting yet, so I'll need to set this up to recreate the errors he's seeing.
In the meantime, analyzing wireshark output should show any problems with rx & tx during HTTP booting. I may also play with GDB a bit more to figure out what's going on, but currently I need to nail down the bug to something more specific.
10 July
This morning I installed wireshark and have been inspecting HTTP boot packet communications. I found a number of duplicate transmissions (including duplicated TCP sequence numbers). It seemed something was wrong with the transmission path.
I added a few debug lines to ifec_tx_wake():
void ifec_tx_wake ( struct net_device *netdev )
{
    struct ifec_private *priv = netdev->priv;
    unsigned long ioaddr = priv->ioaddr;
    struct ifec_active *a = priv->active;
    struct ifec_tcb *tcb = a->tcb_head->next;

    /* For the special case of the first transmit, we issue a START. The
     * card won't RESUME after the configure command. */
    if ( a->configured ) {
        a->configured = 0;
        ifec_scb_cmd ( netdev, virt_to_bus ( tcb ), CUStart );
        ifec_scb_cmd_wait ( netdev );
        return;
    }

    /* if not suspended, and all other tcbs have suspend flag clear, do NOT clear
     * the suspend flag. if you do, it will enter a bad state. we need a tcb with
     * a suspend flag set in the tx ring at all times. */

    /* Resume if suspended. */
    switch ( ( inw ( ioaddr + SCBStatus ) >> 6 ) & 0x3 ) {
    case 0:  /* Idle - We should not reach this state. */
        DBG ( "ifec_net_transmit: tx idle!\n" );
        ifec_scb_cmd ( netdev, virt_to_bus ( tcb ), CUStart );
        ifec_scb_cmd_wait ( netdev );
        break;
    case 1:  /* Suspended */
        DBG ( "s" ); //ifec_net_transmit: tx suspended : resume issued\n" );
        ifec_scb_cmd_wait ( netdev );
        outl ( 0, ioaddr + SCBPointer );
        a->tcb_head->command &= ~CmdSuspend;
        /* Immediately issue Resume command */
        outb ( CUResume, ioaddr + SCBCmd );
        ifec_scb_cmd_wait ( netdev );
        break;
    default:
        DBG ( "a" );
        a->tcb_head->command &= ~CmdSuspend;
    }
}
This way I could see what state the Command Unit was in prior to each tx. Comparing this debug output with the wireshark output, I found that every instance of an 'a' coincided with a duplicate packet transmission.
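As a reading aid for the switch above, here is a minimal sketch of how the CU state is pulled out of the SCB status word. The shift by 6 and the mask of 0x3 come from ifec_tx_wake(); the enum, the function names, and the 'i' tag for idle are hypothetical and exist only for this illustration (the driver prints a full message for the idle case).

```c
#include <stdint.h>

/* Hypothetical names for illustration; not taken from the driver. */
enum cu_state { CU_IDLE = 0, CU_SUSPENDED = 1, CU_ACTIVE = 2 };

/* Extract the CU state from bits 7:6 of an SCB status word,
 * mirroring the switch in ifec_tx_wake(): 0 is idle, 1 is suspended,
 * and any other value is treated as active. */
static enum cu_state cu_state_from_status ( uint16_t status )
{
    switch ( ( status >> 6 ) & 0x3 ) {
    case 0:  return CU_IDLE;
    case 1:  return CU_SUSPENDED;
    default: return CU_ACTIVE;
    }
}

/* Single-character tag for a state. 's' and 'a' match the debug
 * output described in this entry; 'i' is an assumption made here. */
static char cu_state_tag ( enum cu_state state )
{
    switch ( state ) {
    case CU_IDLE:      return 'i';
    case CU_SUSPENDED: return 's';
    default:           return 'a';
    }
}
```

With these helpers, a status word of 0x0040 decodes to the suspended state and 0x0080 to the active state, which is the distinction the 's' and 'a' markers make in the debug output.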
Now, the same packet being transmitted twice is odd. The driver is set up to write into the next TCB in the tx ring for each transmit call. I added a debug line in ifec_net_transmit():
static int ifec_net_transmit ( struct net_device *netdev,
                               struct io_buffer *iobuf )
{
    struct ifec_private *priv = netdev->priv;
    unsigned long ioaddr = priv->ioaddr;
    struct ifec_active *a = priv->active;
    struct ifec_tcb *tcb = a->tcb_head->next;
    unsigned short status;

    /* Wait for TCB to become available. */
    if ( tcb->status || tcb->iob ) {
        DBGP ( "TX overflow\n" );
        return -ENOBUFS;
    }

    status = inw ( ioaddr + SCBStatus );
    /* Acknowledge all of the current interrupt sources ASAP. */
    outw ( status & 0xfc00, ioaddr + SCBStatus );

    DBGIO ( "transmitting packet (%d bytes). status = %hX, cmd=%hX\n",
            iob_len ( iobuf ), status, inw ( ioaddr + SCBCmd ) );
    DBGIO_HD ( iobuf->data, iob_len ( iobuf ) );

    tcb->command = CmdSuspend | CmdTx | CmdTxFlex;
    tcb->count = 0x01208000;
    tcb->tbd_addr0 = virt_to_bus ( iobuf->data );
    tcb->tbd_size0 = 0x3FFF & iob_len ( iobuf );
    tcb->iob = iobuf;

    DBG ( "%i", tcb - a->tcbs );
    DBGIO ( "tcb: \n" );
    DBGIO_HD ( tcb, sizeof ( *tcb ) );

    ifec_tx_wake ( netdev );

    /* Append to end of ring. */
    a->tcb_head = tcb;

    return 0;
}
The line DBG ( "%i", tcb - a->tcbs ); prints out the index of the current TCB in the tx ring. The debug output showed proper circulation from 0 through 3 and back to 0 repeatedly. However, with this line in place, wireshark showed no duplicates!
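The index calculation relies on C pointer subtraction, which yields an element count rather than a byte offset. The following is a minimal sketch under stated assumptions: struct toy_tcb is a stripped-down stand-in for struct ifec_tcb, the ring size of 4 is inferred from the 0 through 3 indices seen in the debug output, and all names are hypothetical.

```c
#include <stddef.h>

#define TCB_COUNT 4  /* ring size inferred from the 0..3 indices observed */

/* Stripped-down stand-in for struct ifec_tcb; only the link is kept. */
struct toy_tcb {
    struct toy_tcb *next;
};

/* Link the array into a circular ring: each entry points at the
 * following one, and the last entry points back at the first. */
static void toy_ring_init ( struct toy_tcb *tcbs, size_t count )
{
    size_t i;

    for ( i = 0 ; i < count ; i++ )
        tcbs[i].next = &tcbs[ ( i + 1 ) % count ];
}

/* Subtracting the array base from an element pointer gives the
 * element's index, which is what the debug line prints. */
static ptrdiff_t toy_ring_index ( const struct toy_tcb *tcbs,
                                  const struct toy_tcb *tcb )
{
    return tcb - tcbs;
}
```

Starting with the head at the last entry and following next on each transmit gives the indices 0, 1, 2, 3, 0, and so on. When printing a ptrdiff_t portably, %td is the matching printf format specifier.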
From this behavior, I assumed that the time spent printing the debug output at that point prevents the 'a' condition from ever occurring, which in turn prevents the duplication bug. The 'a' condition is the CU being in the active state, which occurs when a transmit request arrives before the card has finished processing the previous tx.
Thus, I have now narrowed at least one bug down to a specific condition, and I can work out what's going wrong.
The end of ifec_tx_wake() performs different operations depending on whether the CU is active or suspended. After some consideration, it seems that if the CU is active, a RESUME should still be issued; this will cause the CU to re-read the current TCB's S-bit. Thus, after clearing that bit, the CU will continue on and process the newly appended transmit command.
Without that RESUME, if the card was active before the tx, it would suspend before processing the new TCB. This means the card is suspended at a TCB prior to tcb_head. This could happen multiple times, moving the TCB at which the card is actually suspended closer to tcb_tail. I think eventually the tail would pass the suspended TCB, and the head may write into the next TCB, which is then transmitted at the next ifec_net_transmit(). This is speculation, as there may be some other way this corruption was occurring.
The bottom of ifec_tx_wake() was changed as follows:
    /* Resume if suspended. */
    switch ( ( inw ( ioaddr + SCBStatus ) >> 6 ) & 0x3 ) {
    case 0:  /* Idle - We should not reach this state. */
        DBG ( "\nifec_net_transmit: tx idle!\n" );
        ifec_scb_cmd ( netdev, virt_to_bus ( tcb ), CUStart );
        ifec_scb_cmd_wait ( netdev );
        return;
    case 1:  /* Suspended */
        DBG ( "s" );
        break;
    default: /* Active */
        DBG ( "a" );
    }
    ifec_scb_cmd_wait ( netdev );
    outl ( 0, ioaddr + SCBPointer );
    a->tcb_head->command &= ~CmdSuspend;
    /* Immediately issue Resume command */
    outb ( CUResume, ioaddr + SCBCmd );
    ifec_scb_cmd_wait ( netdev );
}
As you can see, the RESUME is issued even if the card is active.
Additionally, I removed a line from ifec_tx_process() (shown commented out):
static void ifec_tx_process ( struct net_device *netdev )
{
    struct ifec_private *priv = netdev->priv;
    struct ifec_tcb *tcb = priv->active->tcb_tail;
    s16 status;

    /* Check status of transmitted packets */
    while ( ( status = tcb->status ) && tcb->iob ) {
        if ( status & TCB_U ) {
            DBG ( "ifec_tx_process : tx error!\n " );
            netdev_tx_complete_err ( netdev, tcb->iob, -ENOMEM );
        } else {
            netdev_tx_complete ( netdev, tcb->iob );
        }
        DBGIO ( "tx completion\n" );

        tcb->iob = NULL;
        tcb->status = 0;
        // tcb->command &= ~CmdSuspend; /* Allow controller to resume. */

        priv->active->tcb_tail = tcb->next; /* Next TCB */
        tcb = tcb->next;
    }
}
This ensures the suspend bit is cleared only in the ifec_tx_wake() routine. The removed line was redundant.
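To make the interplay between the transmit path and the completion path easier to follow, here is a toy model of the ring bookkeeping only. It is a sketch under stated assumptions, not the driver: there is no hardware access and no suspend bit, struct sim_tcb and struct sim_ring are hypothetical stand-ins, and sim_card_done() merely imitates the card writing a non-zero status. The overflow test and the reclaim loop follow the conditions used in ifec_net_transmit() and ifec_tx_process().

```c
#include <stddef.h>
#include <errno.h>

#define SIM_RING_SIZE 4

/* Hypothetical stand-in for struct ifec_tcb. */
struct sim_tcb {
    struct sim_tcb *next;
    int status;       /* non-zero once the simulated card has finished */
    const void *iob;  /* non-NULL while a buffer occupies this slot */
};

struct sim_ring {
    struct sim_tcb tcbs[SIM_RING_SIZE];
    struct sim_tcb *head;  /* last slot handed to the card */
    struct sim_tcb *tail;  /* oldest slot not yet reclaimed */
};

static void sim_ring_init ( struct sim_ring *ring )
{
    size_t i;

    for ( i = 0 ; i < SIM_RING_SIZE ; i++ ) {
        ring->tcbs[i].next = &ring->tcbs[ ( i + 1 ) % SIM_RING_SIZE ];
        ring->tcbs[i].status = 0;
        ring->tcbs[i].iob = NULL;
    }
    ring->head = &ring->tcbs[SIM_RING_SIZE - 1];
    ring->tail = &ring->tcbs[0];
}

/* Queue a buffer in the slot after head; refuse if it is still in use. */
static int sim_transmit ( struct sim_ring *ring, const void *iob )
{
    struct sim_tcb *tcb = ring->head->next;

    if ( tcb->status || tcb->iob )
        return -ENOBUFS;
    tcb->iob = iob;
    ring->head = tcb;
    return 0;
}

/* Imitate the card completing the slot at the given index. */
static void sim_card_done ( struct sim_ring *ring, size_t index )
{
    if ( ring->tcbs[index].iob )
        ring->tcbs[index].status = 1;
}

/* Reclaim finished slots starting at tail; returns how many were freed. */
static int sim_tx_process ( struct sim_ring *ring )
{
    struct sim_tcb *tcb = ring->tail;
    int reclaimed = 0;

    while ( tcb->status && tcb->iob ) {
        tcb->iob = NULL;
        tcb->status = 0;
        tcb = tcb->next;
        reclaimed++;
    }
    ring->tail = tcb;
    return reclaimed;
}
```

In this model, four transmits fill the ring and a fifth returns -ENOBUFS until sim_tx_process() has reclaimed at least the slot after head. The model deliberately leaves out the suspend bit, so it says nothing about the duplicate-packet behavior itself.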
13 July
Since I don't yet have iSCSI packet captures to look at, I decided to try booting over AoE. This involves enough driver activity that I hope it will expose a bug.
Booting a Windows image over AoE got stuck at the Windows splash screen. I then tried booting this image using Safe Mode. Every .sys driver loads until it gets to aoe32.sys. The system freezes at this line. I don't know enough about the AoE driver to determine what could be causing this.
I then compiled the legacy eepro100 driver and attempted the same AoE boot with it. The boot sequence was exactly the same, with the machine freezing while loading aoe32.sys. I'll need to get a working AoE image to test this properly.