====== Michael Decker: Driver Development ====== ==== Week 5 ==== ---- === 23 & 24 June === OK, so I rewrote the tx path again, using the suspend bit as necessary. To give you an idea of what I'm working with, here is a **short overview**. The 8255x processes //command blocks// to perform actions, such as configuration and transmit. {{:soc:2008:mdeck:journal:memarch.png|}} The gPXE driver executes an //individual address setup// command and then a //configure// command in ''ifec_net_open()''. In addition, a ring of //transmit control blocks// (TCB) are initialized. The configure command's ''link'' member points to the first TCB in the list, and the suspend bit is set. When ''ifec_net_transmit()'' is called, the next TCB in the ring is configured to transmit and suspend, then the card is issued a //resume// command. The card will fetch the next command block, the address of which was cached from the ''link'' field of the previous command block. In our case, the ''link'' fields never change, and they form the ring of TCBs. The transmit command will be processed, and the card will suspend. If multiple ''ifec_net_transmit()''s are issued quickly, they may form a chain of TCBs without intermediate suspends occuring. This is enabled by clearing the suspend bit in the previous TCB when preparing a TCB for transmission. The code to do this is rather straight-forward, so why did it take two days to rewrite this? In a word, **bugs**. There were a few mistakes in my code, such as calls ''ifec_scb_cmd_wait ( ioaddr )'' instead of ''ifec_scb_cmd_wait ( ioaddr + SCBCmd )'', or assigning ''tcb->link = ptr'' instead of ''tcb->link = virt_to_bus ( ptr )''. However, there was a momma-jomma bug that eluded me for some time. This bug caused the machine to triple fault or simply freeze at seemingly random moments, although the moment was the same for any given compile. The bug was located in a loop in ''ifec_net_poll()'': /* Check status of transmitted packets */ while ( ( status = tcb->status ) && tcb->iob ) { if ( status & TCB_U ) { netdev_tx_complete_err ( netdev, tcb->iob, -ENOMEM ); } else { netdev_tx_complete ( netdev, tcb->iob ); } DBGIO ( "tx completion\n" ); tcb->iob = NULL; tcb->status = 0; tcb->command &= ~CmdSuspend; /* Allow controller to resume. */ tcb = a->tcb_tail = tcb->next; /* Next TCB */ } The above is the proper code. The bug was one line, [[http://git.etherboot.org/?p=people/mdeck/gpxe.git;a=blob;f=src/drivers/net/eepro100.c;h=5a3fa83c81a550f2d4624401c3909c356350c2be;hb=1a2f25aa714f8f4c0adbaf2c186d69f986000dc5#l453|line 453 in this commit]]. The line ''free_iob ( tcb->iob );'' is not only unnecessary, it is wrong. The ''netdev_tx_complete()'' functions free this io_buffer. After finally fixing this bug, I tested it out more and it was working just fine. //Except//, a few times after it successfully downloaded the kernel and initrd over http, the system froze. I haven't been able to duplicate this behavior since; perhaps it's the same [[:soc:2008:dverkamp:journal:week5|bug that DrV encountered]]. === 25 June === Marty spared a few hours today and performed some code inspection on my driver. The bug mentioned at the end of the previous journal entry continued to intermittently occur, so we prepared to debug the beast. We scrutinized ''ifec_net_close()'', as the bug occurs after downloading everything, just prior to booting the kernel. This would be when this function is called. /* * Close a network device. * * @v netdev Device to close. * * This is a gPXE Network Device Driver API function. */ static void ifec_net_close ( struct net_device *netdev ) { struct ifec_private *priv = netdev->priv; struct ifec_active *a = priv->active; unsigned long ioaddr = priv->ioaddr; unsigned short intr_status; int i; ifec_net_irq ( netdev, 0 ); /* Mask ints */ intr_status = inw ( ioaddr + SCBStatus ); /* Ack & clear ints */ outw ( intr_status, ioaddr + SCBStatus ); inw ( ioaddr + SCBStatus ); ifec_reset ( ioaddr ); /* Free any necessary resources */ for ( i = 0; i < RFD_COUNT; i++ ) if ( priv->rx_iobs[i] ) free_iob ( priv->rx_iobs[i] ); free ( a ); } This is the correct code. The offending lines were [[http://git.etherboot.org/?p=people/mdeck/gpxe.git;a=blob;f=src/drivers/net/eepro100.c;h=5a3fa83c81a550f2d4624401c3909c356350c2be;hb=1a2f25aa714f8f4c0adbaf2c186d69f986000dc5#l241|lines 241 & 242 in this commit]]. 241 for ( i = 0; i < TCB_COUNT; i++ ) 242 if ( a->tcbs[i].iob ) free_iob ( a->tcbs[i].iob ); Look at that, more ''free_iob'' calls on tx io_buffers! I erroneously assumed the driver //owns// the io_buffer received in ''ifec_net_transmit()''. I reasoned, //if the driver has not relinquished the io_buffers via netdev_tx_complete() in ifec_net_poll(), then the driver is responsible for freeing these buffers.// You know what they say about assumptions. Marty showed that ''netdevice.c'' tracks all issued io_buffers, and will free those not-yet-completed after closing a net_device. The removal of the offending lines from my driver placed it into a bug-free state. During code inspection, Marty suggested key points for more comment documentation. Additionally, I mentioned I may modify the rx ring traversal code slightly, in an effort to clean it up and slightly reduce code size. These changes should be completed by tomorrow. === 26 June === The tx io_buffer free bug was removed. Additional comments were added. Some straggling ''printf''s were converted to ''DBG''s at Marty's suggestion. Some extraneous ''DBG''s were removed. Some testing was performed. A commit was made. * [[http://git.etherboot.org/?p=people/mdeck/gpxe.git;a=commit;h=2c2322662e619ac6817f1822d6f185c7d39f8612|[Drivers-eepro100] Free iob in ifec_net_close bug removed. More comments.]] More on the way. 8-) * [[http://git.etherboot.org/?p=people/mdeck/gpxe.git;a=commitdiff;h=27c70247fdc7d8d250731cc7c077cee19b96044a|[Drivers-eepro100] Moved rx data into struct ifec_active]] * [[http://git.etherboot.org/?p=people/mdeck/gpxe.git;a=commitdiff;h=fac57320f5b6a0df8e9064c02e03333c0f7d2f19|[Drivers-eepro100] Fixed ias bug introduced when moving rx to ifec_active]] All data members of ''ifec_private'' supporting the rx ring were moved into ''ifec_active''. ''ifec_active'' is allocated in ''ifec_net_open()'' and freed in ''ifec_net_close()'', so the driver's footprint is minimized while not open. I also converted the IAS command block allocation in ''ifec_net_open()'' to use ''malloc_dma()'' instead of ''malloc()''. The card requires command blocks to be aligned on even physical addresses. Any alignment provided via gcc will affect virtual addresses. Although the functionality of page-mapping hardware dictates the lower bits of a virtual address will correspond to a physical address, I'm unsure of how much of an assumption the driver should make as to the underlying hardware. ''malloc_dma()'' provides an interface that guarantees aligned physical addresses, so this is the best method of allocation. The second commit corrects a small typo in ''ifec_net_open()'', caused by some temporary changes to the ias allocation. The destination of the ''link'' field shouldn't be followed, since the card is ordered to suspend, and is later followed by the issuance of a start command. However, the IAS command is a little touchy in how it is performed, so to be safe it links to itself. * [[http://git.etherboot.org/?p=people/mdeck/gpxe.git;a=commitdiff;h=c2b105c2be6313fea5305aa925f97fbe74e09f82|[Drivers-eepro100] Moved EEPROM caching into fn, made dynamic allocation.]] The EEPROM caching sequence in ''ifec_pci_probe()'' was moved into a separate routine, ''ifec_eeprom_cache()''. Additionally ''eeprom'' was changed from a static array member of ''ifec_private'' to a pointer. The cache is allocated in ''ifec_eeprom_cache()'' and is later freed after its usage in ''ifec_net_open()'''s call to ''ifec_mdio_setup()''. If the pointer is not freed, e.g. an error occurs or ''ifec_net_open()'' is never called, then ''ifec_pci_remove()'' will free the pointer. I also changed ''ifec_scb_cmd_wait()'' to //not// be inline. No reason for it. * [[http://git.etherboot.org/?p=people/mdeck/gpxe.git;a=commitdiff;h=ee2c4b1990d204daeaa3a9dff40558bb20f01451|[Drivers-eepro100] Converted all fns to take net_device rather than ioaddr.]] Previously, most of the non-API functions took ''ioaddr'' as a parameter. Marty suggested I change them to take a ''net_device *netdev'' as did the API functions, so all the function prototypes will be consistent. This should make it easier for newbies browsing the code to figure things out. Only one function, ''ifec_rfd_init()'', doesn't take a netdev. This function only encapsulates a data structure filling sequence. Tomorrow we shall see what's next! === 27 June === Had the weekly meeting today. Marty suggested reworking my error handling technique. He also pointed me to this [[http://lxr.linux.no/linux/Documentation/CodingStyle|Linux coding style guide]]. So the code will be getting a style upgrade ;) === 28 & 29 June === * [[http://git.etherboot.org/?p=people/mdeck/gpxe.git;a=commitdiff;h=feb5f34d4c1d83e175aba47603e9842acbe2cf98|[Drivers-eepro100] Formatting, code modularizing, etc.]] Formatting changes were made throughout the code to conform with the aforementioned style guide. Multiple variable declarations on a single line were separated. Multiple assignments on a line were separated. if-statements on a single line were split. Large comments were moved to function comments or to the header. In ''ifec_net_open()'', the ''struct ifec_cfg'' code was redundant. The struct's data declaration and the run-time modification were not necessary. I moved most of the assignments into the data declaration for the struct. Also, Individual Address Setup (ias) command references were coalesced into one sequence for code clarity. In ''ifec_net_poll()'', tx processing and rx processing were separated into ''ifec_tx_process()'' and ''ifec_rx_process()''. In ''ifec_net_transmit()'', the switch statement responsible for issuing resume commands as necessary was moved into ''ifec_tx_wake()''. Also, a pointer check was added in ''ifec_net_close()'' and error handline in ''ifec_net_open()'' was cleaned up. The code segment at the beginning of ''ifec_net_open()'' is currently commented out. I don't believe this is needed, as it was added while trying to debug earlier. I will do more thorough testing before removing it completely.