soc:2009:oremanj:journal:week11 [2009/08/05 12:25]
rwcr
soc:2009:oremanj:journal:week11 [2009/08/08 12:15] (current)
rwcr
 ==== Wednesday, August 5 ====
 Back to debugging ath5k...
 +
 +I solved the problem I was having with ath5k: I was processing the status bits in a way that is only suitable for interrupt-driven use of the card. Making a fairly obvious fix (removing the "is interrupt pending"​ check) allows it to work in polling mode. With this, I'm able to scan for networks and associate with a WPA network, including both sending and receiving packets.
 +
 +Unfortunately,​ there'​s a memory corruption bug somewhere, of an extremely difficult-to-track variety. When I DHCP with my neighbor'​s network, gPXE locks up after receiving the DHCPOFFER. It appears that it's the card's DMA that's doing the corruption, as the data structures I set up to track things (in hopes of reading them out over FireWire after the lockup) get zeroed without triggering gdb watchpoints. I added a check that every RX buffer was exactly 2400 bytes long and formatted like an io_buffer (*(bufbase + 2408) == bufbase) and it never failed.
 +
 +Incidentally,​ my debugging led to a rather nifty use for FireWire: with -finstrument-functions (which asks gcc to insert calls to special functions at the beginning and end of every function) and a very small amount of code, one can keep track of function calls and wind up with a backtrace on demand, even after a lockup (as long as it doesn'​t zero out memory like this one is!) [The reboot-variety of crash is already trappable with gdb.] It's not always perfect due to inline functions and optimizations,​ but it works very well. Take a look:
 +
 +    #0  in 0x242a7 <​bios_putchar>​
 +    #1  0x2d5be <​putchar+3a>​ at /​home/​oremanj/​dev/​gpxe/​src/​core/​console.c:​28
 +    #2  0x16321 <​printf_putchar+18>​ at /​home/​oremanj/​dev/​gpxe/​src/​core/​vsprintf.c:​390
 +    #3  0x15ed3 <​cputchar+1a>​ at /​home/​oremanj/​dev/​gpxe/​src/​core/​vsprintf.c:​154
 +    #4  0x15fa7 <​vcprintf+45>​ at /​home/​oremanj/​dev/​gpxe/​src/​core/​vsprintf.c:​179
 +    #5  0x16192 <​vprintf+29>​ at /​home/​oremanj/​dev/​gpxe/​src/​core/​vsprintf.c:​405
 +    #6  0x161cb <​printf+22>​ at /​home/​oremanj/​dev/​gpxe/​src/​core/​vsprintf.c:​420
 +    #7  0x09943 <​ath5k_hw_noise_floor_calibration+b9>​ at /​home/​oremanj/​dev/​gpxe/​src/​drivers/​net/​ath5k/​ath5k_phy.c:​1161
 +    #8  0x0c7cf <​ath5k_hw_reset+d23>​ at /​home/​oremanj/​dev/​gpxe/​src/​drivers/​net/​ath5k/​ath5k_reset.c:​1122
 +    #9  0x00958 <​ath5k_reset+4f>​ at /​home/​oremanj/​dev/​gpxe/​src/​drivers/​net/​ath5k/​ath5k.c:​1550
 +    #10  0x00be9 <​ath5k_chan_set+9d>​ at /​home/​oremanj/​dev/​gpxe/​src/​drivers/​net/​ath5k/​ath5k.c:​708
 +    #11  0x0f4ed <​ath5k_config+e9a1>​ at /​home/​oremanj/​dev/​gpxe/​src/​net/​80211/​net80211.c:​1428
 +    #12  0x0f4ed <​net80211_probe_step+f7>​ at /​home/​oremanj/​dev/​gpxe/​src/​net/​80211/​net80211.c:​1428
 +    #13  0x1032e <​net80211_step_associate+118>​ at /​home/​oremanj/​dev/​gpxe/​src/​net/​80211/​net80211.c:​1756
 +    #14  0x13b22 <​step+35>​ at /​home/​oremanj/​dev/​gpxe/​src/​core/​process.c:​79
 +    #15  0x2d60e <​getchar+34>​ at /​home/​oremanj/​dev/​gpxe/​src/​core/​console.c:​104
 +    #16  0x2de2a <​getkey+16>​ at /​home/​oremanj/​dev/​gpxe/​src/​core/​getkey.c:​67
 +    #17  0x3b423 <​readline+83>​ at /​home/​oremanj/​dev/​gpxe/​src/​hci/​readline.c:​101
 +    #18  0x38c6c <​shell+25>​ at /​home/​oremanj/​dev/​gpxe/​src/​hci/​shell.c:​96
 +    #19  0x2ed99 <​main+d5>​ at /​home/​oremanj/​dev/​gpxe/​src/​core/​main.c:​90
 +    #20  0x2ccff <​prot_to_real+??>​ at comboot_call.c:​0
 +
 +Strangely, -finstrument-functions causes gcc to report phantom "may be used uninitialized in this function" warnings that don't show up without that option.
 +
 +This is definitely a hack, and I don't think it'd be suitable for inclusion into the main tree as it is - too invasive - but it's cool to play around with. :-)
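
For readers who want to try the same trick, here is a minimal sketch of the idea, not the code from my branch. The two ''__cyg_profile_func_*'' hooks are the ones gcc calls when ''-finstrument-functions'' is enabled; everything else (''trace_stack'', ''trace_top()'', ''trace_dump()'', the depth limit of 64) is a hypothetical name or value chosen for illustration.

```c
#include <stdio.h>

#define TRACE_DEPTH 64	/* arbitrary limit for this sketch */

/* The hooks themselves must not be instrumented, or they would recurse. */
#define NO_INSTRUMENT __attribute__((no_instrument_function))

/* Shadow call stack: one entry per function currently executing. */
static void *trace_stack[TRACE_DEPTH];
static unsigned int trace_depth;

/* Called by gcc-generated code on entry to every instrumented function. */
NO_INSTRUMENT void __cyg_profile_func_enter(void *fn, void *call_site)
{
	(void)call_site;
	if (trace_depth < TRACE_DEPTH)
		trace_stack[trace_depth] = fn;
	trace_depth++;	/* keep counting so enter/exit stay balanced */
}

/* Called by gcc-generated code just before every instrumented function returns. */
NO_INSTRUMENT void __cyg_profile_func_exit(void *fn, void *call_site)
{
	(void)fn;
	(void)call_site;
	if (trace_depth > 0)
		trace_depth--;
}

/* Number of functions currently on the shadow stack. */
NO_INSTRUMENT unsigned int trace_current_depth(void)
{
	return trace_depth;
}

/* Address of the innermost recorded function, or NULL if none is recorded. */
NO_INSTRUMENT void *trace_top(void)
{
	if (trace_depth == 0 || trace_depth > TRACE_DEPTH)
		return NULL;
	return trace_stack[trace_depth - 1];
}

/* Print the recorded frames, innermost first, as raw addresses. */
NO_INSTRUMENT void trace_dump(void)
{
	unsigned int top = trace_depth < TRACE_DEPTH ? trace_depth : TRACE_DEPTH;
	unsigned int i;

	for (i = top; i > 0; i--)
		printf("#%u  %p\n", top - i, trace_stack[i - 1]);
}
```

Compile the program with ''-finstrument-functions'' and the array always holds the live call chain. After a lockup the array can be read from outside (over FireWire in my case) and the raw addresses resolved to names with addr2line.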
 +
 +==== Thursday, 6 August ====
 +ath5k works! I still don't know what the problem was, as none of the small changes I made should've fixed it, but when I rebased against git master the memory corruption went away. I have cleaned up my ath5k branch and pushed it to staging, and it's ready for mainline review.
 +
 +While I was still trying to figure out the issue, I figured a malloc() error analysis might be helpful, so I added specially-formatted debugging statements to core/​malloc.c that printed backtraces using the -finstrument-functions backtrace code I developed yesterday, and wrote a couple of small Perl scripts, one to look for alloc/free inconsistencies and print the backtraces for them, one to resolve addresses into function names and file line numbers (using the binutils program addr2line). The result was that I could capture gPXE console output and do
 +
 +  % ./​util/​gpxegrind.pl ../​ath5k-memory.log | ./​util/​resolveaddr.pl | less -R
 +
 +and get valgrind-style output about the locations of double frees, memory leaks, and so forth. It didn't help with ath5k, but it did let me catch a small memory leak in the net80211 code, which is included in the ath5k branch in staging. Currently this is very much a hack, but I'll work on cleaning it up for conditional use in mainline.  ​
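
The bookkeeping behind the alloc/free check is simple. The real scripts are Perl and work on a captured log; the following is only a rough C sketch of the same matching logic, with hypothetical names (''track_alloc()'', ''track_free()'') and a fixed-size table assumed for brevity.

```c
#include <stddef.h>

#define MAX_LIVE 1024	/* arbitrary table size for this sketch */

/* One allocation that has been seen but not yet freed. */
struct live_alloc {
	const void *ptr;
	size_t len;
};

static struct live_alloc live[MAX_LIVE];
static size_t n_live;

/* Record an allocation. Returns 0 on success, -1 if the table is full. */
int track_alloc(const void *ptr, size_t len)
{
	if (n_live == MAX_LIVE)
		return -1;
	live[n_live].ptr = ptr;
	live[n_live].len = len;
	n_live++;
	return 0;
}

/*
 * Record a free. Returns 0 if it matches a live allocation, or -1 if
 * the pointer was never allocated or has already been freed.
 */
int track_free(const void *ptr)
{
	size_t i;

	for (i = 0; i < n_live; i++) {
		if (live[i].ptr == ptr) {
			/* Fill the hole with the last entry. */
			live[i] = live[--n_live];
			return 0;
		}
	}
	return -1;
}

/* Allocations still live at the end of the log are leaks. */
size_t track_leak_count(void)
{
	return n_live;
}

/* Total size of everything still live. */
size_t track_leaked_bytes(void)
{
	size_t total = 0;
	size_t i;

	for (i = 0; i < n_live; i++)
		total += live[i].len;
	return total;
}
```

Feeding every logged malloc() and free() through these two functions flags double frees as they happen, and whatever is left in the table at the end is the leak list. Attaching a backtrace to each entry is what turns this into valgrind-style output.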
 +
 +When compiling an ath5k ROM, I ran into a problem: the linker tries to put the uncompressed sector count into a one-byte field, expecting the compressor to subtract from it as necessary based on the compression achieved. If the uncompressed length is over 128k, the linker will complain of a truncated relocation, even if the compressor fixup would have made everything work. I've worked around this by adding a new type of fixup (ADDx) that adds the compressed length to a field, to complement the current SUBx fixup that subtracts the compression delta from a field. In my tests it worked fine for ROMs that would have worked under the old system; the larger ROM I tried to flash caused the adventure described below, but I don't know if that's the fault of its size.
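
To make the arithmetic concrete, here is a small sketch in C. It assumes the field counts 512-byte sectors (the usual unit for option ROM sizes), so a one-byte field tops out at 255 sectors. The function names are hypothetical and this is not the actual linker script or compressor code; the 150 KiB figure in the note below is invented for illustration.

```c
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Round a byte length up to a whole number of 512-byte sectors. */
unsigned int bytes_to_sectors(unsigned long len)
{
	return (unsigned int)((len + SECTOR_SIZE - 1) / SECTOR_SIZE);
}

/*
 * SUBx-style: the linker must first store the *uncompressed* sector
 * count in the byte, so linking only succeeds if that count fits.
 */
int sub_fixup_links(unsigned long uncompressed_len)
{
	return bytes_to_sectors(uncompressed_len) <= UINT8_MAX;
}

/*
 * ADDx-style: the linker stores a small base value and the compressor
 * adds the *compressed* sector count afterwards. Returns the final
 * field value, or -1 if even that overflows the byte.
 */
int add_fixup_result(unsigned int base, unsigned long compressed_len)
{
	unsigned int total = base + bytes_to_sectors(compressed_len);

	return total <= UINT8_MAX ? (int)total : -1;
}
```

For an image of 150 KiB uncompressed (300 sectors) that compresses to 90 KiB (180 sectors), ''sub_fixup_links()'' fails because 300 does not fit in a byte, while ''add_fixup_result(0, ...)'' yields 180.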
 +
 +After trying and failing to find a way to do ''​REQUIRE_IF()''​ using the linker table system, as suggested by mcb30, I have implemented the ''​config/​config_//​objname//​.o''​ solution where ''//​objname//​.c''​ will automatically ''​REQUEST_OBJECT(config_//​objname//​)''​. It seems to work well. I've pushed this and my earlier linker change to branch **linker** in staging, and it's also ready for review.
 +
 +I received an e1000 NIC, and had the chance to do some wireless ROM tests. Using memdisk, I was able to flash a dual-driver e1000/rtl8185 gPXE, and boot off the wireless using it - huzzah! (The e1000 has a 128KB EEPROM, so it's particularly good for this.) Unfortunately, my flash of an ath5k ROM produced a card that would freeze the system during option ROM scanning by the BIOS (at the point where gPXE normally prints the "Press Ctrl+B to configure..." prompt). Three hours from installation to bricking... not so good. I'm not sure why the ath5k flash failed, but it may have to do with my use of an iSCSI DOS boot instead of memdisk, which I had used before. It's possible that IBAUTIL does something strange that's not respectful of gPXE's low-memory state, and causes iSCSI reads through the int 13h interface to return corrupted data. It's also possible, since this is the first ROM I flashed that required my uncompressed-size-over-128k modifications, that there's an issue (either architecturally or with gPXE's implementation) with pushing the envelope in that way.
 +
 +I'd be able to reflash the e1000 easily enough if I could get into IBAUTIL, but since the system won't boot with it installed, that doesn'​t work. Since I don't have another ROMable card, the only way I could see of fixing it was plugging it into the PCI slot after the system had started. Yes, I did try it, and I was incredibly lucky not to fry either the card or the slot --- folks, regular PCI cards are //not// meant to be hot-plugged. (It locked up the system and wouldn'​t recognize the card until I power-cycled.) I believe the solution to this one lies in getting another card, flashing it properly with a basic gPXE, putting it in a PCI slot that gets scanned before the bricked card, and using gPXE's "Press Ctrl+B to configure..."​ escape hatch to get IBAUTIL loaded before the bad option ROM gets scanned. IBAUTIL has an option to disable the flash.
 +
 +  * On branch **linker** in staging:
 +    * [[http://git.etherboot.org/?p=gpxe-staging.git;a=commit;h=77fdc539b9a3bf33e2541ff7858500ee5940eb7d|[linker] Expand and correct symbol requirement macros]]
 +    * [[http://git.etherboot.org/?p=gpxe-staging.git;a=commit;h=c78419aab1b33c6184abe9d10cacc24a438b3790|[linker] Add mechanism for subsystem-dependent configuration options]]
 +  * On branch **ath5k** in staging:
 +    * [[http://git.etherboot.org/?p=gpxe-staging.git;a=commit;h=9eeb6a04e914b1687b63924b4236e1e6abb08270|[802.11] Enhance support for driver PHY differences]]
 +    * [[http://git.etherboot.org/?p=gpxe-staging.git;a=commit;h=9babe5842ad054b703fc57b15e75546a38e362b4|[802.11] Set channels early on to avoid tuning to an undefined channel]]
 +    * [[http://git.etherboot.org/?p=gpxe-staging.git;a=commit;h=d84586bf362af20972bcb2fec9af9df980d81fbf|[802.11] Fix maximum packet length]]
 +    * [[http://git.etherboot.org/?p=gpxe-staging.git;a=commit;h=e572f57d6ecb969c0521965fd9887a7cd41c4e19|[802.11] Fix memory leak on unsuccessful probes]]
 +    * [[http://git.etherboot.org/?p=gpxe-staging.git;a=commit;h=12e12ecc4c6f115d7d20cd74e2afcbee74c404fd|[legal] Add MIT licence declaration]]
 +    * [[http://git.etherboot.org/?p=gpxe-staging.git;a=commit;h=739a9e76613fac83a1ff0196f96ac78e70b36164|[ath5k] Add support for non-802.11n Atheros wireless NICs]]
 +    * [[http://git.etherboot.org/?p=gpxe-staging.git;a=commit;h=1dac2e8491b0c7a73f0a1bb2b9517f32dda15cbd|[rom] Allow ROM images to have uncompressed size greater than 128k]]
 +
 +==== Friday, 7 August ====
 +I discovered that my test system's video card takes up 55k after it initializes, meaning my 90k gPXE ROM would cause an option ROM overflow (55k + 90k = 145k, more than the 128k option ROM area). Seems my test system's BIOS doesn't handle this gracefully. That's one mystery solved, though it doesn't make my e1000 any less bricked...
 +
 +On Michael'​s suggestion, I implemented a generic means for placing variables in base memory, in much the manner of ''​_''''​_data16''​ for pcbios builds (it actually uses ''​_''''​_data16''​ if it can). On EFI it places the variables in a special section that is relocated into a freshly allocated bit of base memory at init time. This is used by the FireWire debug code to place its portal structure low, where the debugging host can scan for it. It's a much nicer mechanism than the ''​umalloc_low()''​ of my original implementation.
 +
 +With that, the firewire branch is (again) ready to be reviewed:
 +  * On branch **firewire** in staging:
 +    * [[http://git.etherboot.org/?p=gpxe-staging.git;a=commit;h=21c48d156f23bbb802334ef4c0f6d8138bac091a|[basemem] Add facility for placing variables in base memory]]
 +    * [[http://git.etherboot.org/?p=gpxe-staging.git;a=commit;h=7f0ab72f3826e019835fae713398d85821e7440b|[fwdebug] Add generic FireWire debugging interface]]
 +
 +I spent a while cleaning up my 802.11 crypto changes, and have created a branch for them and the iwmgmt commands using the new ''​config_//​subsystem//​.o''​ mechanism for object-dependent configuration:​
 +  * On branch **wireless-pretty** in my personal repository:
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=85f81110f6f0d72e5cc85e83bc87b458f7162866|[802.11] Add core support for detecting and using encrypted networks]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=435a44c6ef4347c7ede8c2749bfeaecb0d7e1ad9|[iwmgmt] Add wireless management commands and text for common errors]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=51b17ad4f41a8395999db7090ac37043d6f22d6a|[digest] Add generic CRC32 function]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=fc1b8a22e169d0701d254af53315f88ae93991c4|[cipher] Add the ARC4 stream cipher]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=be7fb2860acd4b594f1ec747ed5e8917a0a3999e|[digest] Add HMAC-SHA1 based pseudorandom function and PBKDF2]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=acfde649a0864309247c6ca0da4149740c54bd5f|[crypto] Add parentheses around len argument in blocksize assert]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=5db4c2ac9b8ad571355b367a6846beb61f93af53|[crypto] Make AES context size and algorithm structure externally available]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=79dc6c2845b0c93bebd59f5865d098e5a436af12|[crypto] Add AES key-wrap mode (RFC 3394)]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=bf09bf67df2f1f15e7ae1e70cd44f7c99a12d111|[crypto] Add a placeholder for a proper random number generator]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=97c4eedca8437e34a7f09139c1ff1b3db35d9135|[eapol] Add basic support for 802.1X EAP over LANs]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=eb227c549824e95ffe8be0e78749b2b14d362c97|[802.11] Add support for WEP-protected networks]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=a7f151439ab414c877e229c4c19f9508221661b1|[wpa] Add general support for WPA-protected 802.11 networks]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=f13249ce3397929469c3983af8289cf447ea3f8c|[wpa] Add pre-shared key frontend (WPA "Personal" with just a passphrase)]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=f2eb1cb5d62517021dc10c75ab464f57dccc228d|[wpa] Add TKIP backend (legacy RC4-based cryptosystem)]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=80fa41ea472eadc928a20663ca2b75cae6e33076|[wpa] Add CCMP backend (new AES-based cryptosystem)]]
 +
 +I've separated out my EAP code (WPA Enterprise) into a different branch. It's currently useless without any EAP authentication methods implemented,​ but the structure is there if someone (perhaps me at a later date) wants to implement some.
 +  * On branch **eap** in my personal repository:
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=4ce9a93fcb813e96a787a444d63c571c4c2c0719|[eap] Add basic support for the 802.1X Extensible Authentication Protocol]]
 +    * [[http://git.etherboot.org/?p=people/oremanj/gpxe.git;a=commit;h=23c3bd939b47564875afe5119cabf5e5f9afa929|[wpa] Add EAP frontend (WPA "Enterprise" using an authentication server)]]
 +
 +==== Saturday, 8 August ====
 +Meeting today, the majority of which was spent discussing an idea I had for loading large ROMs in crowded option ROM environments:​ have a small ROM stub that loads the rest of the ROM to an area not subject to the 128k option ROM limit. There are several ways of implementing this:
 +  * My initial idea: program the PCI ROM BAR to map the full ROM to an area in high memory. This has some serious practical issues of the "where do you put it?" variety, because one would need to walk both the e820 memory map and the PCI bus to find an area not used by RAM or any other memory-mapped I/O device in the system. And there are some devices, such as the APICs, that don't show up in either.
 +    * On the other hand, it may be safe to do this by looking for a sufficiently large PCI memory BAR mapping (e.g. video card), disabling that mapping while we access the ROM, and reenabling it later - bears trying, at least.
 +  * Michael's idea: use the NVS subsystem to access the flash directly, scan for a gPXE image embedded within it, expose it via int13h, and boot it. The only practical issue here is the fact that most supported NICs don't have an NVS driver. The resulting stub may also be rather larger than one that reprograms the PCI BARs, but tiny code that doesn't work is useless.
 +This will be an interesting project to hack on over the next week. :-)
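
The "where do you put it?" search for the first idea boils down to finding a hole among known address ranges. Below is a rough C sketch of that search only. It assumes the RAM regions (from e820) and the PCI BARs have already been collected into one array, that the window size is a power of two (PCI memory BARs are naturally aligned to their size), and it does nothing about devices such as the APICs that appear in neither list. ''struct mem_range'' and ''find_free_window()'' are hypothetical names.

```c
#include <stdint.h>
#include <stddef.h>

/* An address range already claimed by RAM or a device: [start, end). */
struct mem_range {
	uint64_t start;
	uint64_t end;	/* exclusive */
};

/*
 * Find the lowest size-aligned address in [lo, hi) such that
 * [addr, addr + size) overlaps none of the used ranges.
 * size must be a power of two. Returns 0 if no window exists.
 */
uint64_t find_free_window(const struct mem_range *used, size_t n,
			  uint64_t lo, uint64_t hi, uint64_t size)
{
	uint64_t addr = (lo + size - 1) & ~(size - 1);

	while (addr + size <= hi) {
		int clash = 0;
		size_t i;

		for (i = 0; i < n; i++) {
			if (addr < used[i].end && used[i].start < addr + size) {
				/* Skip past the clashing range and realign. */
				addr = (used[i].end + size - 1) & ~(size - 1);
				clash = 1;
				break;
			}
		}
		if (!clash)
			return addr;
	}
	return 0;
}
```

With RAM occupying the first 2 GB and a device at 0xE0000000, a 128k window lands at 0x80000000. Whether that address is truly unused is exactly the part this sketch cannot guarantee.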
 +
 +Things left to do this summer:
 +  * Make sure all relevant commits described on this page get merged or fixed to be mergeable;
 +  * Clean up the **firmware** branch to use new linker macros, and push to staging;
 +  * Document, document, document!
 +  * With remaining time, work on some of the dangling threads:
 +    * Flash stub
 +    * EAP
 +    * Something else?
 +
