qemu-kvm/kvm-spapr-Reset-CAS-IRQ-subsystem-after-devices.patch

From 2edb7c1181fb69e410ffc688986a12d36899f976 Mon Sep 17 00:00:00 2001
From: Miroslav Rezanina <mrezanin@redhat.com>
Date: Mon, 19 Aug 2019 08:54:16 +0100
Subject: [PATCH 4/7] spapr: Reset CAS & IRQ subsystem after devices
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

RH-Author: Miroslav Rezanina <mrezanin@redhat.com>
Message-id: <9b7c319c271fa2c8cda410e87aef985d8c180049.1566204425.git.mrezanin@redhat.com>
Patchwork-id: 90057
O-Subject: [RHEL-AV-8.1 qemu-kvm PATCH 2/5] spapr: Reset CAS & IRQ subsystem after devices
Bugzilla: 1733977
RH-Acked-by: Philippe Mathieu-Daudé <philmd@redhat.com>
RH-Acked-by: Yash Mankad <ymankad@redhat.com>
RH-Acked-by: Danilo de Paula <ddepaula@redhat.com>

From: David Gibson <david@gibson.dropbear.id.au>

Bugzilla: 1733977

This fixes a nasty regression in qemu-4.1 for the 'pseries' machine,
caused by the new "dual" interrupt controller model.  Specifically,
qemu can crash when used with KVM if a 'system_reset' is requested
while there's active I/O in the guest.

The problem is that in spapr_machine_reset() we:

1. Reset the CAS vector state
	spapr_ovec_cleanup(spapr->ov5_cas);

2. Reset all devices
	qemu_devices_reset()

3. Reset the irq subsystem
	spapr_irq_reset();

However (1) implicitly changes the interrupt delivery mode, because
whether we're using XICS or XIVE depends on the CAS state.  We don't
properly initialize the new irq mode until (3) though - in particular
setting up the KVM devices.

During (2), we can temporarily drop the BQL allowing some irqs to be
delivered which will go to an irq system that's not properly set up.

Specifically, if the previous guest was in (KVM) XIVE mode, the CAS
reset will put us back in XICS mode.  kvm_kernel_irqchip() still
returns true, because XIVE was using KVM, however XICs doesn't have
its KVM components intialized and kernel_xics_fd == -1.  When the irq
is delivered it goes via ics_kvm_set_irq() which assert()s that
kernel_xics_fd != -1.

This change addresses the problem by delaying the CAS reset until
after the devices reset.  The device reset should quiesce all the
devices so we won't get irqs delivered while we mess around with the
IRQ.  The CAS reset and irq re-initialize should also now be under the
same BQL critical section so nothing else should be able to interrupt
it either.

We also move the spapr_irq_msi_reset() used in one of the legacy irq
modes, since it logically makes sense at the same point as the
spapr_irq_reset() (it's essentially an equivalent operation for older
machine types).  Since we don't need to switch between different
interrupt controllers for those old machine types it shouldn't
actually be broken in those cases though.

Cc: Cédric Le Goater <clg@kaod.org>

Fixes: b2e22477 "spapr: add a 'reset' method to the sPAPR IRQ backend"
Fixes: 13db0cd9 "spapr: introduce a new sPAPR IRQ backend supporting
                 XIVE and XICS"
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit 25c9780d38d4494f8610371d883865cf40b35dd6)

Signed-off-by: Miroslav Rezanina <mrezanin@redhat.com>
Signed-off-by: Danilo C. L. de Paula <ddepaula@redhat.com>
---
 hw/ppc/spapr.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index ab64d43..669eae1 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1727,6 +1727,18 @@ static void spapr_machine_reset(MachineState *machine)
     }
 
     /*
+     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
+     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
+     * called from vPHB reset handler so we initialize the counter here.
+     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
+     * must be equally distant from any other node.
+     * The final value of spapr->gpu_numa_id is going to be written to
+     * max-associativity-domains in spapr_build_fdt().
+     */
+    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
+    qemu_devices_reset();
+
+    /*
      * If this reset wasn't generated by CAS, we should reset our
      * negotiated options and start from scratch
      */
@@ -1742,18 +1754,6 @@ static void spapr_machine_reset(MachineState *machine)
     }
 
     /*
-     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
-     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
-     * called from vPHB reset handler so we initialize the counter here.
-     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
-     * must be equally distant from any other node.
-     * The final value of spapr->gpu_numa_id is going to be written to
-     * max-associativity-domains in spapr_build_fdt().
-     */
-    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
-    qemu_devices_reset();
-
-    /*
      * This is fixing some of the default configuration of the XIVE
      * devices. To be called after the reset of the machine devices.
      */
-- 
1.8.3.1
* Mon Aug 19 2019 Danilo Cesar Lemes de Paula <ddepaula@redhat.com> - 4.1.0-2.el8 - kvm-spec-Update-seavgabios-dependency.patch [bz#1725664] - kvm-pc-Don-t-make-die-id-mandatory-unless-necessary.patch [bz#1741451] - kvm-display-bochs-fix-pcie-support.patch [bz#1733977 bz#1740692] - kvm-spapr-Reset-CAS-IRQ-subsystem-after-devices.patch [bz#1733977] - kvm-spapr-xive-Fix-migration-of-hot-plugged-CPUs.patch [bz#1733977] - kvm-riscv-roms-Fix-make-rules-for-building-sifive_u-bios.patch [bz#1733977 bz#1740692] - kvm-Update-version-for-v4.1.0-release.patch [bz#1733977 bz#1740692] - Resolves: bz#1725664 (Update seabios dependency) - Resolves: bz#1733977 (Qemu core dumped: /home/ngu/qemu/hw/intc/xics_kvm.c:321: ics_kvm_set_irq: Assertion `kernel_xics_fd != -1' failed) - Resolves: bz#1740692 (Backport QEMU 4.1.0 rc5 & ga patches) - Resolves: bz#1741451 (Failed to hot-plug vcpus) 2019-08-19 14:50:32 +00:00			`From 2edb7c1181fb69e410ffc688986a12d36899f976 Mon Sep 17 00:00:00 2001`
			`From: Miroslav Rezanina <mrezanin@redhat.com>`
			`Date: Mon, 19 Aug 2019 08:54:16 +0100`
			`Subject: [PATCH 4/7] spapr: Reset CAS & IRQ subsystem after devices`
			`MIME-Version: 1.0`
			`Content-Type: text/plain; charset=UTF-8`
			`Content-Transfer-Encoding: 8bit`

			`RH-Author: Miroslav Rezanina <mrezanin@redhat.com>`
			`Message-id: <9b7c319c271fa2c8cda410e87aef985d8c180049.1566204425.git.mrezanin@redhat.com>`
			`Patchwork-id: 90057`
			`O-Subject: [RHEL-AV-8.1 qemu-kvm PATCH 2/5] spapr: Reset CAS & IRQ subsystem after devices`
			`Bugzilla: 1733977`
			`RH-Acked-by: Philippe Mathieu-Daudé <philmd@redhat.com>`
			`RH-Acked-by: Yash Mankad <ymankad@redhat.com>`
			`RH-Acked-by: Danilo de Paula <ddepaula@redhat.com>`

			`From: David Gibson <david@gibson.dropbear.id.au>`

			`Bugzilla: 1733977`

			`This fixes a nasty regression in qemu-4.1 for the 'pseries' machine,`
			`caused by the new "dual" interrupt controller model. Specifically,`
			`qemu can crash when used with KVM if a 'system_reset' is requested`
			`while there's active I/O in the guest.`

			`The problem is that in spapr_machine_reset() we:`

			`1. Reset the CAS vector state`
			`spapr_ovec_cleanup(spapr->ov5_cas);`

			`2. Reset all devices`
			`qemu_devices_reset()`

			`3. Reset the irq subsystem`
			`spapr_irq_reset();`

			`However (1) implicitly changes the interrupt delivery mode, because`
			`whether we're using XICS or XIVE depends on the CAS state. We don't`
			`properly initialize the new irq mode until (3) though - in particular`
			`setting up the KVM devices.`

			`During (2), we can temporarily drop the BQL allowing some irqs to be`
			`delivered which will go to an irq system that's not properly set up.`

			`Specifically, if the previous guest was in (KVM) XIVE mode, the CAS`
			`reset will put us back in XICS mode. kvm_kernel_irqchip() still`
			`returns true, because XIVE was using KVM, however XICs doesn't have`
			`its KVM components intialized and kernel_xics_fd == -1. When the irq`
			`is delivered it goes via ics_kvm_set_irq() which assert()s that`
			`kernel_xics_fd != -1.`

			`This change addresses the problem by delaying the CAS reset until`
			`after the devices reset. The device reset should quiesce all the`
			`devices so we won't get irqs delivered while we mess around with the`
			`IRQ. The CAS reset and irq re-initialize should also now be under the`
			`same BQL critical section so nothing else should be able to interrupt`
			`it either.`

			`We also move the spapr_irq_msi_reset() used in one of the legacy irq`
			`modes, since it logically makes sense at the same point as the`
			`spapr_irq_reset() (it's essentially an equivalent operation for older`
			`machine types). Since we don't need to switch between different`
			`interrupt controllers for those old machine types it shouldn't`
			`actually be broken in those cases though.`

			`Cc: Cédric Le Goater <clg@kaod.org>`

			`Fixes: b2e22477 "spapr: add a 'reset' method to the sPAPR IRQ backend"`
			`Fixes: 13db0cd9 "spapr: introduce a new sPAPR IRQ backend supporting`
			`XIVE and XICS"`
			`Signed-off-by: David Gibson <david@gibson.dropbear.id.au>`
			`(cherry picked from commit 25c9780d38d4494f8610371d883865cf40b35dd6)`

			`Signed-off-by: Miroslav Rezanina <mrezanin@redhat.com>`
			`Signed-off-by: Danilo C. L. de Paula <ddepaula@redhat.com>`
			`---`
			`hw/ppc/spapr.c \| 24 ++++++++++++------------`
			`1 file changed, 12 insertions(+), 12 deletions(-)`

			`diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c`
			`index ab64d43..669eae1 100644`
			`--- a/hw/ppc/spapr.c`
			`+++ b/hw/ppc/spapr.c`
			`@@ -1727,6 +1727,18 @@ static void spapr_machine_reset(MachineState *machine)`
			`}`

			`/*`
			`+ * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.`
			`+ * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is`
			`+ * called from vPHB reset handler so we initialize the counter here.`
			`+ * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM`
			`+ * must be equally distant from any other node.`
			`+ * The final value of spapr->gpu_numa_id is going to be written to`
			`+ * max-associativity-domains in spapr_build_fdt().`
			`+ */`
			`+ spapr->gpu_numa_id = MAX(1, nb_numa_nodes);`
			`+ qemu_devices_reset();`
			`+`
			`+ /*`
			`* If this reset wasn't generated by CAS, we should reset our`
			`* negotiated options and start from scratch`
			`*/`
			`@@ -1742,18 +1754,6 @@ static void spapr_machine_reset(MachineState *machine)`
			`}`

			`/*`
			`- * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.`
			`- * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is`
			`- * called from vPHB reset handler so we initialize the counter here.`
			`- * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM`
			`- * must be equally distant from any other node.`
			`- * The final value of spapr->gpu_numa_id is going to be written to`
			`- * max-associativity-domains in spapr_build_fdt().`
			`- */`
			`- spapr->gpu_numa_id = MAX(1, nb_numa_nodes);`
			`- qemu_devices_reset();`
			`-`
			`- /*`
			`* This is fixing some of the default configuration of the XIVE`
			`* devices. To be called after the reset of the machine devices.`
			`*/`
			`--`
			`1.8.3.1`