Perfect XEN Setup on Debian Lenny - Live Migration Issues

Discussion in 'HOWTO-Related Questions' started by AlexKent, Apr 2, 2009.

  1. AlexKent

    AlexKent New Member

    Hi,

    As someone who previously struggled to set a virtual server's IP address on KVM, I loved the tutorial; Xen works very well on Debian Lenny.

    One thing though: I've set up two servers using the guide and am now having issues with live migration.

    So the command is:

    xm migrate --live domU.com dom0-2.com

    dom0-1 seems to successfully complete the transfer of the domU to dom0-2.

    However, on dom0-2 the domain stays in the p (paused) state for about 5 minutes, and when it finally changes to b (blocked) or r (running) and you connect to the console, you just see a stream of kernel messages about tasks being blocked for more than a certain number of seconds.

    (I've attached a small selection of the kernel output.)

    Does anyone know if this is 'just me'? Or has anyone else had these issues with Xen live migration not working out of the box on Debian Lenny?
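
    For completeness, these are the relocation settings I have enabled in /etc/xen/xend-config.sxp on both dom0s (the hostnames below are placeholders for my actual machines):

    ```
    # /etc/xen/xend-config.sxp (relevant excerpt; hostnames are placeholders)
    (xend-relocation-server yes)
    (xend-relocation-port 8002)
    (xend-relocation-address '')
    # Only allow localhost and the two dom0s to send/receive domains
    (xend-relocation-hosts-allow '^localhost$ ^dom0-1\.com$ ^dom0-2\.com$')
    ```

    With these in place on both hosts, xend accepts the incoming domain, so I don't think the relocation setup itself is the problem.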

    Thanks,

    Alex
     

    Attached Files:

  2. falko

    falko Super Moderator

    Are the dom0-1 and dom0-2 setups identical? Do they have the same architecture (e.g. x86_64 or i386), do they use the same kernel versions, etc.?
     
  3. AlexKent

    AlexKent New Member

    Confirmation of identical setups

    Hi Falko,

    uname -a on both dom0 is:

    Linux xen02.example.com 2.6.26-1-xen-amd64 #1 SMP Fri Mar 13 21:39:38 UTC 2009 x86_64 GNU/Linux

    Linux xen03.example.com 2.6.26-1-xen-amd64 #1 SMP Fri Mar 13 21:39:38 UTC 2009 x86_64 GNU/Linux

    They are both set up the same way using your tutorial and are otherwise running flawlessly.

    One difference from your first tutorial is that they use LVM for managing their disks, as in your follow-on tutorial.

    I've just tried again, this time without the --live option, and the same problem occurs.

    I can connect to a console, but after typing in a username and pressing enter, the system blocks for a period of time and then returns:

    ---

    dom1.example.com login: alex
    [ 168.431234] vbd vbd-2049: 16 Device in use; refusing to close

    ---

    After a little while longer of nothing happening (a couple of minutes), I get:

    [ 232.123828] INFO: task rsyslogd:1157 blocked for more than 120 seconds.
    [ 232.123837] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 232.123843] rsyslogd D ffff88008162d000 0 1157 1
    [ 232.123850] ffff880001617c58 0000000000000286 ffff88000e130120 ffffffff8037e2cb
    [ 232.123862] ffff88000fd904c0 ffff88000fd42340 ffff88000fd90740 000000008020e810
    [ 232.123872] 0000000000000000 ffff88000e159aa8 0000000000000dfe ffff88008162d000
    [ 232.123879] Call Trace:
    [ 232.123890] [<ffffffff8037e2cb>] gnttab_free_grant_references+0x19/0x87
    [ 232.123898] [<ffffffff8020e911>] xen_clocksource_read+0xd/0x9c
    [ 232.123905] [<ffffffff80265ed3>] sync_page_killable+0x0/0x30
    [ 232.123912] [<ffffffff80243f2e>] getnstimeofday+0x39/0x98
    [ 232.123920] [<ffffffff80265ed3>] sync_page_killable+0x0/0x30
    [ 232.123926] [<ffffffff80434a5f>] io_schedule+0x5c/0x9e
    [ 232.123932] [<ffffffff802646fd>] sync_page+0x3c/0x41
    [ 232.123937] [<ffffffff80265edc>] sync_page_killable+0x9/0x30
    [ 232.123942] [<ffffffff80434bd2>] __wait_on_bit_lock+0x36/0x66
    [ 232.123948] [<ffffffff8026464a>] __lock_page_killable+0x5e/0x64
    [ 232.123956] [<ffffffff8023f6c7>] wake_bit_function+0x0/0x23
    [ 232.123962] [<ffffffff802661fd>] generic_file_aio_read+0x2fa/0x4b2
    [ 232.123972] [<ffffffff8028a23b>] do_sync_read+0xc9/0x10c
    [ 232.123978] [<ffffffff8023f699>] autoremove_wake_function+0x0/0x2e
    [ 232.123985] [<ffffffff8028aa2c>] vfs_read+0xaa/0x152
    [ 232.123990] [<ffffffff8028ae0d>] sys_read+0x45/0x6e
    [ 232.123996] [<ffffffff8020b528>] system_call+0x68/0x6d
    [ 232.124001] [<ffffffff8020b4c0>] system_call+0x0/0x6d
    [ 232.124005]
    [ 232.124009] INFO: task cron:2109 blocked for more than 120 seconds.
    [ 232.124015] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 232.124021] cron D ffff88008162d000 0 2109 1197
    [ 232.124029] ffff88000f2499e8 0000000000000282 ffff88000e1300b0 ffffffff8037e2cb
    [ 232.124039] ffff88000fd436c0 ffff88000fce2140 ffff88000fd43940 000000008020e810
    [ 232.124048] ffff88000f802980 ffff88000fd4d1a8 0000000000000dfe ffff88008162d000
    [ 232.124055] Call Trace:
    [ 232.124061] [<ffffffff8037e2cb>] gnttab_free_grant_references+0x19/0x87
    [ 232.124068] [<ffffffff8020e911>] xen_clocksource_read+0xd/0x9c
    [ 232.124074] [<ffffffff80243f2e>] getnstimeofday+0x39/0x98
    [ 232.124080] [<ffffffff802ac72a>] sync_buffer+0x0/0x3f
    [ 232.124085] [<ffffffff80434a5f>] io_schedule+0x5c/0x9e
    [ 232.124090] [<ffffffff802ac765>] sync_buffer+0x3b/0x3f
    [ 232.124095] [<ffffffff80434cba>] __wait_on_bit+0x40/0x6e
    [ 232.124102] [<ffffffff802ac72a>] sync_buffer+0x0/0x3f
    [ 232.124107] [<ffffffff80434d54>] out_of_line_wait_on_bit+0x6c/0x78
    [ 232.124114] [<ffffffff8023f6c7>] wake_bit_function+0x0/0x23
    [ 232.124128] [<ffffffffa00612e3>] :ext3:ext3_find_entry+0x3f3/0x5d5
    [ 232.124135] [<ffffffff8026c44c>] mark_page_accessed+0x18/0x2b
    [ 232.124141] [<ffffffff802abf64>] __getblk+0x1d/0x222
    [ 232.124153] [<ffffffffa005cb1b>] :ext3:__ext3_get_inode_loc+0xf9/0x2ac
    [ 232.124167] [<ffffffffa0062c72>] :ext3:ext3_lookup+0x31/0xc9
    [ 232.124173] [<ffffffff8029b826>] d_alloc+0x15b/0x1c7
    [ 232.124179] [<ffffffff802919c5>] do_lookup+0xd7/0x1c1
    [ 232.124186] [<ffffffff80293cdb>] __link_path_walk+0x96f/0xdfa
    [ 232.124193] [<ffffffff80294028>] __link_path_walk+0xcbc/0xdfa
    [ 232.124198] [<ffffffff8026908d>] get_page_from_freelist+0xde/0x518
    [ 232.124204] [<ffffffff802941ac>] path_walk+0x46/0x8b
    [ 232.124210] [<ffffffff802944d8>] do_path_lookup+0x158/0x1ce
    [ 232.124215] [<ffffffff8029501b>] __path_lookup_intent_open+0x56/0x97
    [ 232.124221] [<ffffffff80295151>] do_filp_open+0x9c/0x7c4
    [ 232.124228] [<ffffffff8028893f>] get_unused_fd_flags+0x74/0x13f
    [ 232.124234] [<ffffffff80288a50>] do_sys_open+0x46/0xc3
    [ 232.124239] [<ffffffff8020b528>] system_call+0x68/0x6d
    [ 232.124244] [<ffffffff8020b4c0>] system_call+0x0/0x6d
    [ 232.124248]
    [ 251.318307] INFO: task getty:1268 blocked for more than 120 seconds.
    [ 251.318314] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 251.318320] getty D ffffffff80449f00 0 1268 1
    [ 251.318327] ffff88000e707bc8 0000000000000286 ffff88000e130dd0 ffffffff8037e2cb
    [ 251.318338] ffff88000fd7b7c0 ffffffff804fe460 ffff88000fd7ba40 000000008020e810
    [ 251.318348] 0000000000000000 ffff88000e1597a8 0000000000000f88 ffff88008162d000
    [ 251.318355] Call Trace:
    [ 251.318362] [<ffffffff8037e2cb>] gnttab_free_grant_references+0x19/0x87
    [ 251.318369] [<ffffffff8020e911>] xen_clocksource_read+0xd/0x9c
    [ 251.318375] [<ffffffff80265ed3>] sync_page_killable+0x0/0x30
    [ 251.318380] [<ffffffff80243f2e>] getnstimeofday+0x39/0x98
    [ 251.318387] [<ffffffff80265ed3>] sync_page_killable+0x0/0x30
    [ 251.318395] [<ffffffff80434a5f>] io_schedule+0x5c/0x9e
    [ 251.318400] [<ffffffff802646fd>] sync_page+0x3c/0x41
    [ 251.318406] [<ffffffff80265edc>] sync_page_killable+0x9/0x30
    [ 251.318411] [<ffffffff80434bd2>] __wait_on_bit_lock+0x36/0x66
    [ 251.318417] [<ffffffff8026464a>] __lock_page_killable+0x5e/0x64
    [ 251.318423] [<ffffffff8023f6c7>] wake_bit_function+0x0/0x23
    [ 251.318429] [<ffffffff802661fd>] generic_file_aio_read+0x2fa/0x4b2
    [ 251.318436] [<ffffffff8028a23b>] do_sync_read+0xc9/0x10c
    [ 251.318444] [<ffffffff8023f699>] autoremove_wake_function+0x0/0x2e
    [ 251.318451] [<ffffffff802789b8>] __vma_link+0x42/0x4b
    [ 251.318457] [<ffffffff8028aa2c>] vfs_read+0xaa/0x152
    [ 251.318463] [<ffffffff8028e2a9>] kernel_read+0x38/0x4c
    [ 251.318468] [<ffffffff8028f6b8>] do_execve+0xf1/0x215
    [ 251.318474] [<ffffffff80209425>] sys_execve+0x35/0x4c
    [ 251.318479] [<ffffffff8020b970>] stub_execve+0x40/0x70
    [ 251.318485]
    [ 253.498342] INFO: task kjournald:587 blocked for more than 120 seconds.
    [ 253.498349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 253.498355] kjournald D ffffffff80449f00 0 587 2
    [ 253.498362] ffff8800016c7d30 0000000000000246 ffff88000e159b78 0000000000000d5a
    [ 253.498370] ffff88000fc3a080 ffffffff804fe460 ffff88000fc3a300 000000008020e810
    [ 253.498383] ffff8800020709a0 ffff88000e159b68 0000000000000fa4 ffff88008162d000
    [ 253.498390] Call Trace:
    [ 253.498397] [<ffffffff8020e911>] xen_clocksource_read+0xd/0x9c
    [ 253.498403] [<ffffffff80243f2e>] getnstimeofday+0x39/0x98
    [ 253.498409] [<ffffffff802ac72a>] sync_buffer+0x0/0x3f
    [ 253.498414] [<ffffffff80434a5f>] io_schedule+0x5c/0x9e
    [ 253.498421] [<ffffffff802ac765>] sync_buffer+0x3b/0x3f
    [ 253.498426] [<ffffffff80434cba>] __wait_on_bit+0x40/0x6e
    [ 253.498433] [<ffffffff802ac72a>] sync_buffer+0x0/0x3f
    [ 253.498438] [<ffffffff80434d54>] out_of_line_wait_on_bit+0x6c/0x78
    [ 253.498444] [<ffffffff8023f6c7>] wake_bit_function+0x0/0x23
    [ 253.498455] [<ffffffffa004a196>] :jbd:__journal_file_buffer+0xb5/0x154
    [ 253.498465] [<ffffffffa004c426>] :jbd:journal_commit_transaction+0x527/0xe5d
    [ 253.498471] [<ffffffff80235c38>] lock_timer_base+0x26/0x4b
    [ 253.498477] [<ffffffff80235cae>] try_to_del_timer_sync+0x51/0x5a
    [ 253.498488] [<ffffffffa004fba5>] :jbd:kjournald+0xd5/0x25a
    [ 253.498495] [<ffffffff8023f699>] autoremove_wake_function+0x0/0x2e
    [ 253.498504] [<ffffffffa004fad0>] :jbd:kjournald+0x0/0x25a
    [ 253.498510] [<ffffffff8023f56b>] kthread+0x47/0x74
    [ 253.498516] [<ffffffff8022839f>] schedule_tail+0x27/0x5c
    [ 253.498521] [<ffffffff8020be28>] child_rip+0xa/0x12
    [ 253.498527] [<ffffffff8023f524>] kthread+0x0/0x74
    [ 253.498532] [<ffffffff8020be1e>] child_rip+0x0/0x12

    -----

    Any ideas would be appreciated and thanks for your help with the tutorial.

    Regards,

    Alex
     
  4. falko

    falko Super Moderator

    I'm not sure why this is happening. I think I'll have to try it myself.
     
  5. AlexKent

    AlexKent New Member

    Thanks, Falko. I'd love to know if it can be done.

    Regards,

    Alex
     
  6. AlexKent

    AlexKent New Member

    Any updates?

    Hi Falko,

    Just wondering if you got round to testing whether live migration works out of the box?

    Thanks,

    Alex
     
  7. falko

    falko Super Moderator

    Yes, I've tried it and written a tutorial about it (will publish it in a few days). I've used shared storage (iSCSI) for the virtual machines. As far as I know, Xen live migration doesn't work without shared storage.
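
    The important part is that the domU's backing device is reachable under the same path on both dom0s. As a sketch, a domU config using an iSCSI LUN might look like this (the device path, kernel paths, and names are examples, not from a specific setup):

    ```
    # /etc/xen/domU.example.com.cfg - sketch; paths and names are examples
    kernel  = '/boot/vmlinuz-2.6.26-1-xen-amd64'
    ramdisk = '/boot/initrd.img-2.6.26-1-xen-amd64'
    memory  = 256
    name    = 'domU.example.com'
    vif     = ['bridge=eth0']
    # The iSCSI LUN must appear under the same device path on both dom0s
    disk    = ['phy:/dev/sdb1,xvda1,w']
    root    = '/dev/xvda1 ro'
    ```

    Because both dom0s see the same block device, the destination host can attach the disk as soon as the memory transfer finishes, instead of ending up with a device it can't open.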
     
  8. AlexKent

    AlexKent New Member

    Thanks falko

    Hi Falko,

    Thanks for confirming that the live migration issue isn't just me.

    I'm just getting my head round LVM at the moment, and was thinking live migration would be an excellent solution for getting domUs off a failing server, so it's a shame it's not supported.

    I don't quite get why, though, because the data seems to copy and the configuration gets loaded, yet something isn't quite right and the server streams errors.

    If I went back to disk images, do you know if live migration works? I guess that's where I diverged from your tutorial, to my peril.

    I look forward to reading your article on iSCSI - sounds fascinating.

    Best wishes,

    Alex
     
  9. falko

    falko Super Moderator

    I don't think this has anything to do with disk images vs. LVM images.
     