Piotr Jaroszyński: Usermode debugging under Linux

Week 11 [ Aug 2 - Aug 8 2010 ]

drivers in userspace

Addressing spaces

First, a bit of background. There are four different kinds of addresses in gPXE:

  • virtual - these are the ones you can access directly in gPXE via a normal pointer dereference
  • user - these are a superset (not necessarily a proper one) of virtual, possibly allowing more memory to be addressed
  • phys - these are physical memory addresses
  • bus - these are memory addresses as the devices see them

In Linux userspace there is one big flat address space (32 or 64 bit, depending on the architecture). This lets us easily make virtual addresses equal to user addresses.

The problems begin with phys addresses. What a process sees as contiguous memory may not be physically contiguous at all. The kernel maps pages from all over physical memory to construct the process's address space. Hence there is no way of converting user addresses to real physical addresses. The good news is that real physical addresses seem to be really used (i.e. passed somewhere to be dereferenced) only by virtio and some infiniband drivers. For now we can live without them, and currently phys addresses are equal to user and virtual addresses.

Bus addresses have the same general problem as phys addresses. The difference is that all memory that is to be accessed by hardware (except for infiniband) is allocated with malloc_dma(), which we can control. And indeed malloc_dma() is made to use memory allocated by the UIO-DMA kernel module, which guarantees that the memory is ready to be accessed by hardware and provides us with a bus address for it. Hence we can return a proper address when doing the phys_to_bus conversion for addresses returned by malloc_dma().
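The translation described above can be sketched as follows. This is only an illustration, not the actual gPXE code: struct dma_region, sketch_user_to_phys() and sketch_phys_to_bus() are hypothetical names, and the sketch assumes a single DMA buffer whose process address and bus address are both known from the UIO-DMA module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one buffer obtained via the UIO-DMA module. */
struct dma_region {
	uintptr_t user_base; /* where the buffer is mapped in our process */
	uint64_t bus_base;   /* address the device uses for the same memory */
	size_t len;          /* size of the buffer in bytes */
};

/* In userspace virtual, user and phys addresses are the same number. */
static inline uintptr_t sketch_user_to_phys(uintptr_t user)
{
	return user;
}

/*
 * Translate a phys (== user) address into a bus address.
 * Returns 0 and stores the result in *bus on success, or -1 if the
 * address does not lie inside the DMA region.
 */
static inline int sketch_phys_to_bus(const struct dma_region *r,
				     uintptr_t phys, uint64_t *bus)
{
	assert(r != NULL && bus != NULL);

	if (phys < r->user_base || phys - r->user_base >= r->len)
		return -1;

	/* Same offset into the buffer, different base address */
	*bus = r->bus_base + (phys - r->user_base);
	return 0;
}
```

Only addresses inside the DMA buffer get a meaningful bus address here. Anything else is rejected, which matches the point that ordinary process memory has no usable physical or bus address.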

The handling of bus addresses raises the question of why not make phys == bus addresses and handle the conversion in user_to_phys, as that seems to make more sense, especially as the addresses really are equal on some hardware. This was tried and backed out because malloc() aligns allocations by their phys address. That could be worked around, but it doesn't really seem worth it.

Unused functions

I was a bit overoptimistic about the linker stripping unused functions. It turns out that functions are linked with object granularity, so adding strtoull(), which is used only on Linux, to core/misc.c was in fact growing other builds as well. I moved it to a separate object, core/strtoull.c, to eliminate that problem.

That makes me wonder whether we have any dead functions that can be nuked to reduce size. On a similar note, gcc 4.5.0 introduced whole program optimizations, which might be worth looking into. I'm waiting for 4.5.1 though, as 4.5.0 also introduced some nasty regressions that are quite noticeable on a source-based distro like exherbo, which I'm using.

Nicer UIO-DMA support

Until recently the UIO-DMA initialization was a bit hacky. UIO-DMA requires a kernel module to register the device it is supposed to be handling DMA mappings for (and rightly so, as the kernel DMA API needs the device). As part of the registration it returns a unique device_id that the userspace code is supposed to pass to the UIO-DMA module later to tell different devices apart. To handle the device registration I introduced a very simple uio-dma-pci module, but I didn't have a good idea of how to pass the device_id returned by UIO-DMA to userspace (e.g. creating an extra char device with an ioctl just for that seemed like overkill). Not wanting to spend too much time on this back then, I just hacked UIO-DMA to recognize a special id value meaning the id of the last device registered with it.

This worked just fine, but this kind of hack just doesn't let me sleep comfortably at night ;) Since then I have had the chance to inspect some of the code behind the /sys/ PCI interface (see sysfs-pci.txt). Creating new sysfs attributes looked quite simple, and the obvious solution occurred to me: just add a sysfs device attribute with the UIO-DMA device_id. And so I did:

/** Show the UIO-DMA device id as a hexadecimal string */
static ssize_t uio_dma_id_show(struct device *dev, struct device_attribute *attr, char *buf)
{
	struct dev_data *dd = dev_get_drvdata(dev);

	return snprintf(buf, PAGE_SIZE, "0x%08x\n", dd->uio_dma_id);
}

/** Device attribute with the UIO-DMA device id */
static DEVICE_ATTR(uio_dma_id, S_IRUGO, uio_dma_id_show, NULL);

static int __devinit probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	...
	/* Expose the id as the uio_dma_id attribute in the device's sysfs directory */
	device_create_file(&pdev->dev, &dev_attr_uio_dma_id);
	...
}

And done. It also has the nice side effect of ensuring that the device we are using is handled by our simple driver and not by anything else.

$ ls -al /sys/bus/pci/devices/0000:09:00.0/
total 0
drwxr-xr-x 3 root root        0 2010-08-08 15:13 .
drwxr-xr-x 5 root root        0 2010-08-08 15:13 ..
-rw-r--r-- 1 root root     4096 2010-08-08 16:00 broken_parity_status
-r--r--r-- 1 root root     4096 2010-08-08 15:47 class
-rw-r--r-- 1 root root      256 2010-08-08 15:47 config
-r--r--r-- 1 root root     4096 2010-08-08 16:00 consistent_dma_mask_bits
-r--r--r-- 1 root root     4096 2010-08-08 15:47 device
-r--r--r-- 1 root root     4096 2010-08-08 16:00 dma_mask_bits
lrwxrwxrwx 1 root root        0 2010-08-08 15:47 driver -> ../../../../../bus/pci/drivers/uio-dma-pci
-rw------- 1 root root     4096 2010-08-08 16:00 enable
-r--r--r-- 1 root root     4096 2010-08-08 15:47 irq
-r--r--r-- 1 root root     4096 2010-08-08 16:00 local_cpulist
-r--r--r-- 1 root root     4096 2010-08-08 16:00 local_cpus
-r--r--r-- 1 root root     4096 2010-08-08 16:00 modalias
-rw-r--r-- 1 root root     4096 2010-08-08 16:00 msi_bus
drwxr-xr-x 2 root root        0 2010-08-08 16:00 power
--w--w---- 1 root root     4096 2010-08-08 16:00 remove
--w--w---- 1 root root     4096 2010-08-08 16:00 rescan
--w------- 1 root root     4096 2010-08-08 16:00 reset
-r--r--r-- 1 root root     4096 2010-08-08 15:47 resource
-rw------- 1 root root 33554432 2010-08-08 16:00 resource0
-r-------- 1 root root    65536 2010-08-08 16:00 rom
lrwxrwxrwx 1 root root        0 2010-08-08 16:00 subsystem -> ../../../../../bus/pci
-r--r--r-- 1 root root     4096 2010-08-08 16:00 subsystem_device
-r--r--r-- 1 root root     4096 2010-08-08 16:00 subsystem_vendor
-rw-r--r-- 1 root root     4096 2010-08-08 16:00 uevent
-r--r--r-- 1 root root     4096 2010-08-08 15:13 uio_dma_id
-r--r--r-- 1 root root     4096 2010-08-08 15:47 vendor
-rw------- 1 root root      128 2010-08-08 15:47 vpd
$ cat /sys/bus/pci/devices/0000:09:00.0/uio_dma_id 
0x00000002
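On the userspace side, picking the id up is then just a matter of reading that file. A minimal sketch, assuming the attribute always has the "0x%08x\n" format shown above; read_uio_dma_id() is a hypothetical helper name, not necessarily what the lpci code uses:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Read a hexadecimal id such as "0x00000002\n" from a sysfs attribute.
 * Returns 0 and stores the id in *id on success, or -1 if the file
 * cannot be opened or does not contain a hexadecimal number.
 */
static inline int read_uio_dma_id(const char *path, uint32_t *id)
{
	FILE *f;
	unsigned int value;
	int matched;

	assert(path != NULL && id != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	/* %x accepts an optional "0x" prefix */
	matched = fscanf(f, "%x", &value);
	fclose(f);
	if (matched != 1)
		return -1;

	*id = value;
	return 0;
}
```

Calling it with the path /sys/bus/pci/devices/0000:09:00.0/uio_dma_id on the machine above would be expected to yield the id 2. A failure to open the file also indicates that the device is not bound to uio-dma-pci.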

Extra checks and cleanup

Along with the changes described above, some extra checks went in for common problems, such as whether we have root access, whether the device exists in /sys/ at all, etc. Also, the lpci driver can now support only one device at a time. That comes from the fact that gPXE's malloc_dma() API is device agnostic while the Linux one isn't (as mentioned earlier, the Linux DMA API requires a device to be passed).
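The checks are of roughly the following kind. This is a hedged sketch rather than the code that went in: have_root() and pci_device_present() are hypothetical names, and the real checks may well be done differently.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Non-zero if the process is running with root privileges. */
static inline int have_root(void)
{
	return geteuid() == 0;
}

/*
 * Non-zero if <sysfs_root>/<slot> exists, e.g.
 * sysfs_root = "/sys/bus/pci/devices" and slot = "0000:09:00.0".
 */
static inline int pci_device_present(const char *sysfs_root, const char *slot)
{
	char path[256];
	int n;

	assert(sysfs_root != NULL && slot != NULL);

	n = snprintf(path, sizeof(path), "%s/%s", sysfs_root, slot);
	if (n < 0 || (size_t)n >= sizeof(path))
		return 0; /* path did not fit, treat as missing */

	return access(path, F_OK) == 0;
}
```

Failing early with a clear message when either check fails is friendlier than a confusing error later, when opening the device's resources fails.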

bnx2

I received this NIC late on Friday and started working on a driver for it. The lspci output is below:

09:00.0 Ethernet controller [0200]: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet [14e4:164c] (rev 12)
        Subsystem: Hewlett-Packard Company NC373T PCI Express Multifunction Gigabit Server Adapter [103c:7037]
        Flags: 66MHz, medium devsel, IRQ 16
        Memory at fa000000 (64-bit, non-prefetchable) [size=32M]
        Expansion ROM at f9ff0000 [disabled] [size=64K]
        Capabilities: [40] PCI-X non-bridge device
        Capabilities: [48] Power Management version 2
        Capabilities: [50] Vital Product Data
        Capabilities: [58] MSI: Enable- Count=1/1 Maskable- 64bit+
        Kernel driver in use: uio-dma-pci

On the good side, it has a datasheet and a driver in the Linux kernel. On the bad side, the Linux driver is quite big:

  8601 linux-2.6/drivers/net/bnx2.c
  7368 linux-2.6/drivers/net/bnx2.h
 15969 total

and it requires firmware to work at all. All in all I'm still a bit overwhelmed, but I'm trying to get probe() to do something useful ;) As there is no driver needing firmware in gPXE yet, there is no support for it in mainline, but Josh foresaw that problem last year when working on wireless drivers and worked on a firmware branch. I have rebased it on current master and will hopefully get to use it soon ;)

