
====== Michael Decker: Driver Development ======

==== Week 7 ====

----

=== 9 July ===

A new branch, ''drivers6'', was created and merged with the mainline via ''git pull origin master''. This brought the GDB code into my tree. While experimenting with GDB, I got a segfault just after the point where gPXE had been freezing during the second NIC boot. I ran a backtrace:

<file>
Program received signal SIGSEGV, Segmentation fault.
alloc_memblock (size=96, align=<value optimized out>) at include/gpxe/list.h:64
64              __list_add ( new, head, head->next );
(gdb) backtrace
#0  alloc_memblock (size=96, align=<value optimized out>) at include/gpxe/list.h:64
#1  0x00007cd1 in realloc (old_ptr=0x0, new_size=80) at core/malloc.c:265
#2  0x00007d2f in zalloc (size=96) at core/malloc.c:332
#3  0x0000814b in resolv (resolv=0x78a8, name=0xf "Ãë\t\017¾CÿèR", sa=0x33ad8) at core/resolv.c:260
#4  0x0000823b in xfer_open_named_socket (xfer=0x784c, semantics=208084, peer=0x33ad8, name=0x13356 "192.168.2.8", local=0x0) at core/resolv.c:389
#5  0x00005f64 in http_open_filter (xfer=0x12de8, uri=0x13324, default_port=80, filter=0) at net/tcp/http.c:501
#6  0x00012a70 in mtftp_uri_opener ()
#7  0x00012de8 in heap ()
#8  0x00005fc9 in http_open (xfer=0x60, uri=0xf) at net/tcp/http.c:527
#9  0x00000000 in ?? ()
</file>

I'm not sure why the segfault occurred, although I do see that the ''name'' argument passed to ''resolv()'' in frame #3 is not valid. Marty recommended I install Wireshark and take a look at what's happening. Additionally, testing at his end showed two different NICs failing iSCSI booting but passing HTTP booting. I haven't tried iSCSI booting yet, so I'll need to set this up to recreate the errors he's seeing. In the meantime, analyzing the Wireshark output should show any problems with rx and tx during HTTP booting. I may also play with GDB a bit more to figure out what's going on, but for now I need to narrow the bug down to something more specific.
=== 10 July ===

This morning I installed Wireshark and have been inspecting the HTTP boot packet traffic. I found a number of duplicate transmissions (including duplicated TCP sequence numbers), so something seemed to be wrong in the transmit path. I added a few debug lines to ''ifec_tx_wake()'':

<file>
void ifec_tx_wake ( struct net_device *netdev )
{
	struct ifec_private *priv = netdev->priv;
	unsigned long ioaddr = priv->ioaddr;
	struct ifec_active *a = priv->active;
	struct ifec_tcb *tcb = a->tcb_head->next;

	/* For the special case of the first transmit, we issue a START. The
	 * card won't RESUME after the configure command. */
	if ( a->configured ) {
		a->configured = 0;
		ifec_scb_cmd ( netdev, virt_to_bus ( tcb ), CUStart );
		ifec_scb_cmd_wait ( netdev );
		return;
	}

	/* if not suspended, and all other tcbs have suspend flag clear, do NOT clear
	 * the suspend flag. if you do, it will enter a bad state. we need a tcb with
	 * a suspend flag set in the tx ring at all times. */

	/* Resume if suspended. */
	switch ( ( inw ( ioaddr + SCBStatus ) >> 6 ) & 0x3 ) {
	case 0:  /* Idle - We should not reach this state. */
		DBG ( "ifec_net_transmit: tx idle!\n" );
		ifec_scb_cmd ( netdev, virt_to_bus ( tcb ), CUStart );
		ifec_scb_cmd_wait ( netdev );
		break;
	case 1:  /* Suspended */
		DBG ( "s" ); //ifec_net_transmit: tx suspended : resume issued\n" );
		ifec_scb_cmd_wait ( netdev );
		outl ( 0, ioaddr + SCBPointer );
		a->tcb_head->command &= ~CmdSuspend;
		/* Immediately issue Resume command */
		outb ( CUResume, ioaddr + SCBCmd );
		ifec_scb_cmd_wait ( netdev );
		break;
	default:
		DBG ( "a" );
		a->tcb_head->command &= ~CmdSuspend;
	}
}
</file>

This way I could see what state the Command Unit (CU) was in prior to each tx. Comparing this debug output with the Wireshark capture, I found that every instance of an 'a' coincided with a duplicate packet transmission. Now, the same packet being transmitted twice is odd: the driver is set up to write into the next TCB in the tx ring on each transmit call.
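The state check in the ''switch'' above can be isolated into a small helper. This is only a sketch: the shift and mask are taken from the driver code quoted in this entry, while the ''enum'' names, the helper names, and the ''!'' tag for the idle case are hypothetical and not part of the gPXE driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the three cases the driver's switch distinguishes. */
enum cu_state { CU_IDLE, CU_SUSPENDED, CU_ACTIVE };

/* Decode the Command Unit state from a raw SCB status word, using the
 * same shift and mask as the switch in ifec_tx_wake(). */
static enum cu_state cu_state_from_status ( uint16_t status ) {
	switch ( ( status >> 6 ) & 0x3 ) {
	case 0:  return CU_IDLE;
	case 1:  return CU_SUSPENDED;
	default: return CU_ACTIVE;   /* field values 2 and 3 */
	}
}

/* One-character tag in the style of the debug output: 's' and 'a' match
 * the DBG() calls in the driver; '!' for idle is made up for this sketch. */
static char cu_state_tag ( enum cu_state state ) {
	assert ( state == CU_IDLE || state == CU_SUSPENDED ||
		 state == CU_ACTIVE );
	switch ( state ) {
	case CU_IDLE:      return '!';
	case CU_SUSPENDED: return 's';
	default:           return 'a';
	}
}
```

With a helper like this, a status word read via ''inw ( ioaddr + SCBStatus )'' could be logged as a single character per transmit, which is the kind of trace that was compared against the packet capture.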
I added a debug line in ''ifec_net_transmit()'':

<file>
static int ifec_net_transmit ( struct net_device *netdev,
                               struct io_buffer *iobuf )
{
	struct ifec_private *priv = netdev->priv;
	unsigned long ioaddr = priv->ioaddr;
	struct ifec_active *a = priv->active;
	struct ifec_tcb *tcb = a->tcb_head->next;
	unsigned short status;

	/* Wait for TCB to become available. */
	if ( tcb->status || tcb->iob ) {
		DBGP ( "TX overflow\n" );
		return -ENOBUFS;
	}

	status = inw ( ioaddr + SCBStatus );
	/* Acknowledge all of the current interrupt sources ASAP. */
	outw ( status & 0xfc00, ioaddr + SCBStatus );

	DBGIO ( "transmitting packet (%d bytes). status = %hX, cmd=%hX\n",
	        iob_len ( iobuf ), status, inw ( ioaddr + SCBCmd ) );
	DBGIO_HD ( iobuf->data, iob_len ( iobuf ) );

	tcb->command = CmdSuspend | CmdTx | CmdTxFlex;
	tcb->count = 0x01208000;
	tcb->tbd_addr0 = virt_to_bus ( iobuf->data );
	tcb->tbd_size0 = 0x3FFF & iob_len ( iobuf );
	tcb->iob = iobuf;

	DBG ( "%i", tcb - a->tcbs );
	DBGIO ( "tcb: \n" );
	DBGIO_HD ( tcb, sizeof ( *tcb ) );

	ifec_tx_wake ( netdev );

	/* Append to end of ring. */
	a->tcb_head = tcb;

	return 0;
}
</file>

The line ''DBG ( "%i", tcb - a->tcbs );'' prints the index of the current TCB in the tx ring. The debug output showed proper circulation from 0 through 3 and back to 0, repeatedly. However, with this line in place Wireshark showed no duplicates at all! From this behavior, I assume that the time taken to print the debug output at that point prevents the 'a' condition from ever occurring, which in turn prevents the duplication bug. The 'a' condition is the CU being in the active state, which happens when a transmit request arrives before the card has finished processing the previous tx. So I have now nailed down at least //one// bug, and can work out what's going wrong.
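The ring circulation and the pointer-subtraction index described above can be reproduced outside the driver. This is a minimal sketch: ''struct demo_tcb'', the ''demo_ring_*'' helpers, and the ring size of 4 (inferred from the 0 through 3 output) are assumptions made for illustration, not gPXE definitions.

```c
#include <assert.h>
#include <stddef.h>

#define TX_RING_CNT 4   /* assumed ring size, matching the 0..3 debug output */

/* Stripped-down stand-in for struct ifec_tcb: only the ring link. */
struct demo_tcb {
	struct demo_tcb *next;
};

/* Link the array into a circular ring: 0 -> 1 -> 2 -> 3 -> 0. */
static void demo_ring_init ( struct demo_tcb *tcbs ) {
	size_t i;

	for ( i = 0 ; i < TX_RING_CNT ; i++ )
		tcbs[i].next = &tcbs[ ( i + 1 ) % TX_RING_CNT ];
}

/* Index of a TCB within the ring, by pointer subtraction from the base. */
static int demo_ring_index ( const struct demo_tcb *tcbs,
			     const struct demo_tcb *tcb ) {
	assert ( tcb >= tcbs && tcb < tcbs + TX_RING_CNT );
	return ( int ) ( tcb - tcbs );
}

/* Record the indices used by n successive transmits and return the new
 * head. Each step takes head->next and then makes it the head, mirroring
 * the order of operations in ifec_net_transmit(). */
static struct demo_tcb * demo_ring_walk ( struct demo_tcb *tcbs,
					  struct demo_tcb *head,
					  int *out, size_t n ) {
	size_t i;

	for ( i = 0 ; i < n ; i++ ) {
		struct demo_tcb *tcb = head->next;

		out[i] = demo_ring_index ( tcbs, tcb );
		head = tcb;   /* append to end of ring */
	}
	return head;
}
```

Starting with the head at the last entry, six transmits use indices 0, 1, 2, 3, 0, 1. The explicit cast to ''int'' in ''demo_ring_index()'' makes the value match a ''%i'' format specifier, since the raw pointer difference has type ''ptrdiff_t''.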
  * [[http://git.etherboot.org/?p=people/mdeck/gpxe.git;a=commit;h=9f561a19282078cc0346487d2a2b34060e1a3f62|[Drivers-eepro100] Bug fixes]]

The end of ''ifec_tx_wake()'' performs different operations depending on whether the CU is active or suspended. After some consideration, it seems that if the CU is active, a RESUME should still be issued: this will cause the CU to re-read the current TCB's S-bit. Thus, once that bit has been cleared, the CU will continue on and process the newly appended transmit command.

Otherwise, if the card was active before the tx, it would suspend before processing the new TCB, leaving the card suspended at a TCB prior to ''tcb_head''. This could happen multiple times, moving the TCB at which the card is actually suspended closer to ''tcb_tail''. I think eventually the tail would pass the suspended TCB, and the head might write into the next TCB, which is then transmitted at the next ''ifec_net_transmit()''. This is speculation; there may be some other way this corruption was occurring.

The bottom of ''ifec_tx_wake()'' was changed as follows:

<file>
	/* Resume if suspended. */
	switch ( ( inw ( ioaddr + SCBStatus ) >> 6 ) & 0x3 ) {
	case 0:  /* Idle - We should not reach this state. */
		DBG ( "\nifec_net_transmit: tx idle!\n" );
		ifec_scb_cmd ( netdev, virt_to_bus ( tcb ), CUStart );
		ifec_scb_cmd_wait ( netdev );
		return;
	case 1:  /* Suspended */
		DBG ( "s" );
		break;
	default: /* Active */
		DBG ( "a" );
	}
	ifec_scb_cmd_wait ( netdev );
	outl ( 0, ioaddr + SCBPointer );
	a->tcb_head->command &= ~CmdSuspend;
	/* Immediately issue Resume command */
	outb ( CUResume, ioaddr + SCBCmd );
	ifec_scb_cmd_wait ( netdev );
}
</file>

As you can see, the RESUME is now issued even if the card is active.
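The reasoning behind this change can be played through in a toy model. Everything here is an assumption made for illustration: the model treats the CU as reading a TCB's S-bit once when it fetches the TCB and again only on a RESUME, which is the hypothesis stated above rather than documented hardware behaviour, and all names (''cu_model'', ''drv_transmit'', ''run_scenario'') are invented. It only demonstrates the stalled TCB, not the duplicate transmissions themselves.

```c
#include <assert.h>

#define RING 4

/* Toy Command Unit plus driver state. Not hardware-accurate. */
struct cu_model {
	int s_bit[RING];     /* suspend bit of each TCB */
	int sent[RING];      /* times each TCB was transmitted */
	int cur;             /* TCB the CU is executing or suspended at */
	int active;          /* non-zero while the CU is executing cur */
	int latched_s;       /* S-bit as read when cur was fetched (assumed) */
	int resume_pending;  /* RESUME arrived while the CU was active */
	int head;            /* driver's tcb_head index */
};

/* The CU starts executing a TCB and samples its S-bit. */
static void cu_fetch ( struct cu_model *cu, int idx ) {
	cu->cur = idx;
	cu->active = 1;
	cu->latched_s = cu->s_bit[idx];
}

/* RESUME: re-read the current TCB's S-bit, move on if it is clear. */
static void cu_resume ( struct cu_model *cu ) {
	if ( cu->active ) {
		cu->resume_pending = 1;   /* honoured when the TCB completes */
		return;
	}
	if ( ! cu->s_bit[cu->cur] )
		cu_fetch ( cu, ( cu->cur + 1 ) % RING );
}

/* The card finishes transmitting the current TCB. */
static void cu_finish ( struct cu_model *cu ) {
	assert ( cu->active );
	cu->sent[cu->cur]++;
	cu->active = 0;
	if ( cu->resume_pending ) {
		cu->resume_pending = 0;
		cu->latched_s = cu->s_bit[cu->cur];
	}
	if ( ! cu->latched_s )
		cu_fetch ( cu, ( cu->cur + 1 ) % RING );
}

/* Driver side: queue one packet, then wake the CU. With always_resume
 * clear this follows the old logic (RESUME only if suspended); with it
 * set, the new logic (RESUME in both states). */
static void drv_transmit ( struct cu_model *cu, int always_resume ) {
	int next = ( cu->head + 1 ) % RING;
	int was_active = cu->active;

	cu->s_bit[next] = 1;       /* new TCB carries the suspend bit */
	cu->s_bit[cu->head] = 0;   /* clear it on the previous head */
	if ( always_resume || ! was_active )
		cu_resume ( cu );
	cu->head = next;
}

/* Two back-to-back transmits; returns how many packets the card sent. */
static int run_scenario ( int always_resume ) {
	struct cu_model cu = { .s_bit = { 1, 0, 0, 0 } };
	int i, total = 0;

	drv_transmit ( &cu, always_resume );   /* CU was suspended */
	drv_transmit ( &cu, always_resume );   /* CU is still active */
	while ( cu.active )
		cu_finish ( &cu );
	for ( i = 0 ; i < RING ; i++ )
		total += cu.sent[i];
	return total;
}
```

Under these assumptions, ''run_scenario ( 0 )'' leaves the second packet stuck behind a suspended CU (one packet sent), while ''run_scenario ( 1 )'' sends both. The real card may behave differently; the model just makes the hypothesis concrete enough to check its internal logic.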
Additionally, I removed a line from ''ifec_tx_process()'':

<file>
static void ifec_tx_process ( struct net_device *netdev )
{
	struct ifec_private *priv = netdev->priv;
	struct ifec_tcb *tcb = priv->active->tcb_tail;
	s16 status;

	/* Check status of transmitted packets */
	while ( ( status = tcb->status ) && tcb->iob ) {
		if ( status & TCB_U ) {
			DBG ( "ifec_tx_process : tx error!\n " );
			netdev_tx_complete_err ( netdev, tcb->iob, -ENOMEM );
		} else {
			netdev_tx_complete ( netdev, tcb->iob );
		}
		DBGIO ( "tx completion\n" );

		tcb->iob = NULL;
		tcb->status = 0;
		// tcb->command &= ~CmdSuspend; /* Allow controller to resume. */

		priv->active->tcb_tail = tcb->next;   /* Next TCB */
		tcb = tcb->next;
	}
}
</file>

The removed line (shown commented out above) was redundant. Dropping it ensures that the suspend bit is cleared only in the ''ifec_tx_wake()'' routine.
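The effect of that removal can be checked on a cut-down version of the completion loop. This is a sketch only: ''struct demo_tx_tcb'', ''demo_tx_reclaim()'', and the ''DEMO_CMD_SUSPEND'' value are hypothetical stand-ins, and the ''netdev_tx_complete()'' calls are replaced by a simple counter.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CMD_SUSPEND 0x4000   /* hypothetical bit value, illustration only */

/* Cut-down stand-in for struct ifec_tcb. */
struct demo_tx_tcb {
	unsigned short status;     /* non-zero once the card has completed it */
	unsigned short command;    /* command word, including the suspend bit */
	void *iob;                 /* buffer owned by the TCB, NULL when free */
	struct demo_tx_tcb *next;  /* next TCB in the ring */
};

/* Reclaim completed TCBs starting at *tail and return how many were
 * reclaimed. The command word, and with it the suspend bit, is
 * deliberately left untouched. */
static int demo_tx_reclaim ( struct demo_tx_tcb **tail ) {
	struct demo_tx_tcb *tcb = *tail;
	int done = 0;

	assert ( tcb != NULL );
	while ( tcb->status && tcb->iob ) {
		tcb->iob = NULL;     /* buffer handed back to the stack */
		tcb->status = 0;     /* TCB is free for reuse */
		tcb = tcb->next;
		done++;
	}
	*tail = tcb;
	return done;
}
```

Because the loop resets ''status'' and ''iob'' as it goes, it stops after at most one full lap of the ring, and any TCB that still carries the suspend bit keeps it until the wake routine decides otherwise.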

