Day 0 ( July 10 )
Today I finally found out why the status field in the hardware status block was in big endian when the card should have swapped it to little endian. The driver was using an #ifdef __BIG_ENDIAN. As it turns out, this is not the correct way to distinguish between big and little endian systems in gPXE: the macro is always defined, so the expression always evaluates to true. The driver therefore always assumed a big endian system and disabled byte swapping.
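To illustrate why the #ifdef test can never be false, here is a minimal sketch rather than the actual driver code. The DEMO_* macros are hypothetical stand-ins for the byte-order constants; the assumption (following the usual endian.h convention) is that both constants are always defined and a separate byte-order macro selects one of them.

```c
/* Hypothetical stand-ins for the byte-order constants. Both are always
 * defined, no matter which byte order the machine actually uses. */
#define DEMO_LITTLE_ENDIAN 1234
#define DEMO_BIG_ENDIAN    4321

/* Assumption for this sketch: we are building for a little endian CPU. */
#define DEMO_BYTE_ORDER    DEMO_LITTLE_ENDIAN

/* Wrong: only asks whether the constant exists, which it always does. */
int demo_ifdef_says_big_endian(void)
{
#ifdef DEMO_BIG_ENDIAN
	return 1;
#else
	return 0;
#endif
}

/* Right: compares the selected byte order against the constant. */
int demo_if_says_big_endian(void)
{
#if DEMO_BYTE_ORDER == DEMO_BIG_ENDIAN
	return 1;
#else
	return 0;
#endif
}
```

With these definitions the #ifdef variant reports big endian even though the build is configured as little endian, while the #if comparison gives the correct answer.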
Day 1 ( July 11 )
Now that the byteswapping issue is resolved I worked on getting transmit working again. I found out that gPXE would properly transmit DHCP packets on one test card, but not on the other cards. That tells us that we're at least doing _something_ right :). After transmitting a few DHCP packets, however, an assertion in the malloc() code would fail when the DHCP code tried to allocate space for a new packet. I tried tracking down this issue without much luck.
Day 2 ( July 12 )
I dove into how the malloc() code works to get a better understanding of how this assert() could fail. It looks like the list of free memory blocks that malloc() maintains internally must somehow get corrupted. I haven't found the cause of this memory corruption yet.
Day 3 ( July 13 )
After another lengthy debugging session I finally found the cause of the memory corruption. We were allocating memory for the RX ring with an incorrect (too small) size. The memset() that initializes the ring was still clearing the full, expected number of bytes, so it wrote past the end of our undersized buffer and corrupted the memory behind it.
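A minimal sketch of this kind of bug and one way to avoid it. The descriptor layout, the ring length and all demo_* names are made up for illustration, and plain malloc() stands in for whatever allocator the driver really uses. The idea is to compute the ring size once and use that same value for both the allocation and the memset().

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical RX descriptor: four 32-bit words, 16 bytes in total. */
struct demo_rx_desc {
	uint32_t addr_hi;
	uint32_t addr_lo;
	uint32_t len_flags;
	uint32_t opaque;
};

/* Hypothetical ring length. */
#define DEMO_RX_RING_ENTRIES 64

struct demo_rx_ring {
	struct demo_rx_desc *desc;
	size_t bytes;	/* number of bytes actually allocated */
};

/* Allocate and zero the ring. Returns 0 on success, -1 on failure. */
int demo_rx_ring_alloc(struct demo_rx_ring *ring)
{
	/* The buggy pattern would be something like
	 * malloc(DEMO_RX_RING_ENTRIES), which reserves only 64 bytes,
	 * followed by a memset() over 64 * 16 = 1024 bytes.
	 * Here a single size value drives both calls instead. */
	ring->bytes = DEMO_RX_RING_ENTRIES * sizeof(ring->desc[0]);
	ring->desc = malloc(ring->bytes);
	if (ring->desc == NULL)
		return -1;
	memset(ring->desc, 0, ring->bytes);
	return 0;
}
```

Because the overrun lands in memory that belongs to the allocator, the damage only shows up later, when an unrelated allocation trips over the corrupted free list.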
Day 4 ( July 14 )
Back to getting transmit working again. To rule out my test setup as the cause of this bug I connected the test card to a 100MBit/s switch and promptly received the expected DHCP packets in Wireshark. The direct connection I was using before autonegotiated to 1GBit/s. Further testing revealed that 10/100MBit/s works with all three cards I've tested, while 1GBit/s still works with only one card.
Day 5 ( July 15 )
The NIC sets an error bit called “ODI (Output Data Interface) Underrun”. Unfortunately the only place where “ODI” or “Output Data Interface” occurs in the datasheet is the explanation of this error bit, and it doesn't go into detail about what this error actually means. I suspect another PHY issue, or a disagreement between the MAC and PHY about the link speed.