This is an old revision of the document!
====== Michael Decker: Driver Development ====== ==== Week 5 ==== ---- === 23 & 24 June === OK, so I rewrote the tx path again, using the suspend bit as necessary. To give you an idea of what I'm working with, here is a **short overview**. The 8255x processes //command blocks// to perform actions, such as configuration and transmit. {{:soc:2008:mdeck:journal:memarch.png|}} The gPXE driver executes an //individual address setup// command and then a //configure// command in ''ifec_net_open()''. In addition, a ring of //transmit control blocks// (TCB) are initialized. The configure command's ''link'' member points to the first TCB in the list, and the suspend bit is set. When ''ifec_net_transmit()'' is called, the next TCB in the ring is configured to transmit and suspend, then the card is issued a //resume// command. The card will fetch the next command block, the address of which was cached from the ''link'' field of the previous command block. In our case, the ''link'' fields never change, and they form the ring of TCBs. The transmit command will be processed, and the card will suspend. If multiple ''ifec_net_transmit()''s are issued quickly, they may form a chain of TCBs without intermediate suspends occuring. This is enabled by clearing the suspend bit in the previous TCB when preparing a TCB for transmission. The code to do this is rather straight-forward, so why did it take two days to rewrite this? In a word, **bugs**. There were a few mistakes in my code, such as calls ''ifec_scb_cmd_wait ( ioaddr )'' instead of ''ifec_scb_cmd_wait ( ioaddr + SCBCmd )'', or assigning ''tcb->link = ptr'' instead of ''tcb->link = virt_to_bus ( ptr )''. However, there was a momma-jomma bug that eluded me for some time. This bug caused the machine to triple fault or simply freeze at seemingly random moments, although the moment was the same for any given compile. The bug was located in a loop in ''ifec_net_poll()'': <file> /* Check status of transmitted packets */ while ( ( status = tcb->status ) && tcb->iob ) { if ( status & TCB_U ) { netdev_tx_complete_err ( netdev, tcb->iob, -ENOMEM ); } else { netdev_tx_complete ( netdev, tcb->iob ); } DBGIO ( "tx completion\n" ); tcb->iob = NULL; tcb->status = 0; tcb->command &= ~CmdSuspend; /* Allow controller to resume. */ tcb = a->tcb_tail = tcb->next; /* Next TCB */ } </file> The above is the proper code. The bug was one line, [[http://git.etherboot.org/?p=people/mdeck/gpxe.git;a=blob;f=src/drivers/net/eepro100.c;h=5a3fa83c81a550f2d4624401c3909c356350c2be;hb=1a2f25aa714f8f4c0adbaf2c186d69f986000dc5#l453|line 453 in this commit]]. The line ''free_iob ( tcb->iob );'' is not only unnecessary, it is wrong. The ''netdev_tx_complete()'' functions free this io_buffer. After finally fixing this bug, I tested it out more and it was working just fine. //Except//, a few times after it successfully downloaded the kernel and initrd over http, the system froze. I haven't been able to duplicate this behavior since; perhaps it's the same [[:soc:2008:dverkamp:journal:week5|bug that DrV encountered]]. === 25 June === Marty spared a few hours today and performed some code inspection on my driver. The bug mentioned at the end of the previous journal entry continued to intermittently occur, so we prepared to debug the beast. We scrutinized ''ifec_net_close()'', as the bug occurs after downloading everything, just prior to booting the kernel. This would be when this function is called. <file> /* * Close a network device. * * @v netdev Device to close. * * This is a gPXE Network Device Driver API function. */ static void ifec_net_close ( struct net_device *netdev ) { struct ifec_private *priv = netdev->priv; struct ifec_active *a = priv->active; unsigned long ioaddr = priv->ioaddr; unsigned short intr_status; int i; ifec_net_irq ( netdev, 0 ); /* Mask ints */ intr_status = inw ( ioaddr + SCBStatus ); /* Ack & clear ints */ outw ( intr_status, ioaddr + SCBStatus ); inw ( ioaddr + SCBStatus ); ifec_reset ( ioaddr ); /* Free any necessary resources */ for ( i = 0; i < RFD_COUNT; i++ ) if ( priv->rx_iobs[i] ) free_iob ( priv->rx_iobs[i] ); free ( a ); } </file> This is the correct code. The offending lines were [[http://git.etherboot.org/?p=people/mdeck/gpxe.git;a=blob;f=src/drivers/net/eepro100.c;h=5a3fa83c81a550f2d4624401c3909c356350c2be;hb=1a2f25aa714f8f4c0adbaf2c186d69f986000dc5#l241|lines 241 & 242 in this commit]]. <file> 241 for ( i = 0; i < TCB_COUNT; i++ ) 242 if ( a->tcbs[i].iob ) free_iob ( a->tcbs[i].iob ); </file> Look at that, more ''free_iob'' calls on tx io_buffers! I erroneously assumed the driver //owns// the io_buffer received in ''ifec_net_transmit()''. I reasoned, //if the driver has not relinquished the io_buffers via netdev_tx_complete() in ifec_net_poll(), then the driver is responsible for freeing these buffers.// You know what they say about assumptions. Marty showed that ''netdevice.c'' tracks all issued io_buffers, and will free those not-yet-completed after closing a net_device. The removal of the offending lines from my driver placed it into a bug-free state. During code inspection, Marty suggested key points for more comment documentation. Additionally, I mentioned I may modify the rx ring traversal code slightly, in an effort to clean it up and slightly reduce code size. These changes should be completed by tomorrow. === 26 June === The tx io_buffer free bug was removed. Additional comments were added. Some straggling ''printf''s were converted to ''DBG''s at Marty's suggestion. Some extraneous ''DBG''s were removed. Some testing was performed. A commit was made. * [[http://git.etherboot.org/?p=people/mdeck/gpxe.git;a=commit;h=2c2322662e619ac6817f1822d6f185c7d39f8612|[Drivers-eepro100] Free iob in ifec_net_close bug removed. More comments.]] More on the way. 8-)