How to debug OpenSolaris 2009.06 hanging when running as a KVM guest


Here is how I debug OpenSolaris 2009.06 hanging on KVM under Debian sid.

Step 1. Get the OpenSolaris source.

mkdir opensolaris
cd opensolaris
sudo aptitude install mercurial
hg clone http://bitbucket.org/mirror/onnv-gate 
# I prefer bitbucket.org to hg.opensolaris.org,
# because cloning from hg.opensolaris.org is too slow for me.
cd onnv-gate

Step 2. Run OpenSolaris 2009.06 as a guest OS with kmdb.

Open a terminal and start OpenSolaris on KVM with a serial console.

sudo virsh start opensolaris --console
Domain osol-01 started
Connected to domain osol-01
Escape character is ^]

Open another window and start virt-viewer.

sudo virt-viewer opensolaris

On the GRUB screen, select the boot entry with the arrow keys and press 'e' to edit it. Append ',console=ttya -k -v' to the kernel line (-k loads kmdb, -v boots verbosely), press 'Return', then press 'b' to boot.

kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS,console=ttya -k -v
... after editing, press the 'Return' key, then press the 'b' key to boot ...
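
If you would rather prepare the same edit outside the GRUB screen, it can be written as a small text filter. This is only a sketch: append_kmdb_args is a hypothetical helper name, it works purely on text from stdin, and which file holds your boot entry is left to you.

```shell
#!/bin/sh
# append_kmdb_args (hypothetical helper): read GRUB menu text on stdin and
# append the options used above to every line that starts with "kernel$ ".
# All other lines pass through unchanged.
append_kmdb_args() {
    sed '/^kernel\$ / s/$/,console=ttya -k -v/'
}

# Assumed usage, with a copy of the menu text in a file of your choice:
# append_kmdb_args < menu.lst.copy > menu.lst.kmdb
```

The filter is not idempotent, so running it twice appends the options twice.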

Step 3. Examine the kernel panic and find its origin.

After a while, kmdb traps the kernel panic like this:

Escape character is ^]
module /platform/i86pc/kernel//unix: text at [0xfe800000, 0xfe8d5487] data at 0xfec00000
module /kernel/genunix: text at [0xfe8d5488, 0xfeaa1d8f] data at 0xfec4bb80
Loading kmdb...
module /kernel/misc/kmdbmod: text at [0xfeaa1d90, 0xfeb10577] data at 0xfec96c88
module /kernel/misc/ctf: text at [0xfeb10578, 0xfeb17f37] data at 0xfeca7808
SunOS Release 5.11 Version snv_111b 32-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
features: 1007fff<cpuid,sse2,sse,sep,pat,cx8,pae,mca,mmx,cmov,de,pge,mtrr,msr,tsc,lgpg>
mem = 1048168K (0x3ff9a000)
initialized model-specific module 'cpu_ms.AuthenticAMD' on chip 0 core 0 strand 0

panic[cpu0]/thread=fec1fc20: BAD TRAP: type=8 (#df Double fault) rp=fec26a9c addr=0

#df Double fault
pid=0, pc=0xfe80037c, sp=0xfec3ae70, eflags=0x202
... snip ...

fec269d8 unix:die+e5 (8, fec26a9c, 0, 0)
fec26a88 unix:trap+12b9 (fec26a9c, 0, 0)
fec26a9c unix:_cmntrap+7c (1b0, 0, 160)

panic: entering debugger (no dump device, continue to reboot)

Welcome to kmdb
kmdb: unable to determine terminal type: assuming `vt100'
Loaded modules: [ mac unix krtld genunix specfs cpu.generic ]

First, I examine the stack with the $C command (stack backtrace). Unfortunately, in a double fault the stack trace shows only the state of the most recent exception, so it is not useful.

So I try another approach and use the "::threadlist -v" command.

[0]> ::threadlist -v
fec1fc20 fec1f398 fec21580   0  96        0
  PC: panicsys+0x4b    CMD: 
  stack pointer for thread fec1fc20: fec3ae70
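
As a quick cross-check, the pc from the panic message can be compared with the module text ranges printed at boot. A minimal sketch in shell arithmetic: locate_pc is a hypothetical helper, and the addresses are copied from the log above. It assumes a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# locate_pc PC NAME START END (hypothetical helper):
# print NAME+0xOFFSET if PC lies inside the text range [START, END].
locate_pc() {
    pc=$(($1)); start=$(($3)); end=$(($4))
    if [ "$pc" -ge "$start" ] && [ "$pc" -le "$end" ]; then
        printf '%s+0x%x\n' "$2" "$((pc - start))"
        return 0
    fi
    return 1
}

# pc and text ranges taken from the boot messages and the panic line.
locate_pc 0xfe80037c unix    0xfe800000 0xfe8d5487 ||
locate_pc 0xfe80037c genunix 0xfe8d5488 0xfeaa1d8f
# prints: unix+0x37c
```

This only tells which module's text contains the pc. It does not name the function, so kmdb is still needed for that.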


After disassembling wrmsr (with the 'wrmsr::dis' command) and cpu.generic`gcpu_mca_init+0x51c (with the 'cpu.generic`gcpu_mca_init+0x51c::dis' command), the origin of the panic appears to be at gcpu_mca_init+0x51c. Next, I search the kernel source code for the keywords 'gcpu_mca_init', 'cms_mcgctl_val', 'cmi_hdl_wrmsr', and 'cmi_hdl_enable_mce'.

find . -type f -name '*.c' | xargs fgrep 'gcpu_mca_init' | less
... I find that 'opensolaris/onnv-gate/usr/src/uts/i86pc/cpu/generic_cpu/gcpu_mca.c'
contains 'gcpu_mca_init()'. ...
less usr/src/uts/i86pc/cpu/generic_cpu/gcpu_mca.c
... search gcpu_mca_init() in gcpu_mca.c for the call statement that matches
the disassembly at 'gcpu_mca_init+0x51c' ...
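
Since there are several keywords to look up, the search can be wrapped in a small function. This is a sketch, not what I actually ran: files_mentioning is a hypothetical helper name, and it only lists the .c files that contain each keyword.

```shell
#!/bin/sh
# files_mentioning ROOT SYMBOL... (hypothetical helper):
# for each symbol, list the .c files under ROOT that contain it.
files_mentioning() {
    root=$1; shift
    for sym in "$@"; do
        printf '== %s ==\n' "$sym"
        # -l prints file names only, -F treats the symbol as a fixed string.
        find "$root" -type f -name '*.c' -exec grep -lF -- "$sym" {} +
    done
}

# Assumed usage, run from the top of the onnv-gate checkout:
# files_mentioning usr/src/uts gcpu_mca_init cms_mcgctl_val cmi_hdl_wrmsr cmi_hdl_enable_mce
```

Listing file names first keeps the output short. Once a file looks promising, open it with less as above.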

I find line 1973 in gcpu_mca.c. This line most likely corresponds to the disassembly at gcpu_mca_init+0x51c. Looking at this source, it occurred to me that disabling the 'mce' feature might make OpenSolaris run stably.

Step 4. Try this idea and check the result.

I add the following to the libvirt XML definition of the guest OS:

  <cpu match='exact'>
    <feature policy='disable' name='mce'/>
  </cpu>

Then I run 'virsh start opensolaris', and this workaround works well.
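
To confirm that the guest definition really carries the setting, the XML can be checked from the host. A small sketch: mce_disabled is a hypothetical helper name, and the pattern assumes the attribute order shown above (policy before name).

```shell
#!/bin/sh
# mce_disabled (hypothetical helper): read libvirt domain XML on stdin and
# succeed if it contains a <feature> element that disables 'mce'.
# Assumes the policy attribute comes before the name attribute.
mce_disabled() {
    grep -Eq "<feature[^>]*policy=['\"]disable['\"][^>]*name=['\"]mce['\"]"
}

# Assumed usage on the host, with the guest name used in this post:
# sudo virsh dumpxml opensolaris | mce_disabled && echo "mce is disabled"
```

This is a plain text match, so it does not validate the XML. It only saves opening the definition in an editor.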

A better workaround for OpenSolaris 2009.06 hanging as a KVM guest under Debian sid


As I mentioned before, using the 'qemu32' CPU type keeps OpenSolaris 2009.06 from hanging on KVM under Debian sid. However, this workaround has a big side effect: it disables almost all CPU features.

I found a better workaround: disabling only the 'mce' (Machine Check Exception) CPU feature makes OpenSolaris run stably under KVM. Here is a sample libvirt XML directive that disables the 'mce' feature:

  <cpu match='exact'>
    <model>athlon</model> <!-- you can also specify another CPU model, e.g. qemu64 -->
    <feature policy='disable' name='mce'/>
  </cpu>

With this, the OpenSolaris 2009.06 KVM guest runs stably, at least in my environment ;-)

I also note my environment:
