Documentation/x86/mds.rst

   1 Microarchitectural Data Sampling (MDS) mitigation
   2 =================================================
   3
   4 .. _mds:
   5
   6 Overview
   7 --------
   8
   9 Microarchitectural Data Sampling (MDS) is a family of side channel attacks
  10 on internal buffers in Intel CPUs. The variants are:
  11
  12  - Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
  13  - Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
  14  - Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
  15  - Microarchitectural Data Sampling Uncacheable Memory (MDSUM) (CVE-2019-11091)
  16
  17 MSBDS leaks Store Buffer Entries which can be speculatively forwarded to a
  18 dependent load (store-to-load forwarding) as an optimization. The forward
  19 can also happen to a faulting or assisting load operation for a different
  20 memory address, which can be exploited under certain conditions. Store
  21 buffers are partitioned between Hyper-Threads so cross thread forwarding is
  22 not possible. But if a thread enters or exits a sleep state the store
  23 buffer is repartitioned which can expose data from one thread to the other.
  24
  25 MFBDS leaks Fill Buffer Entries. Fill buffers are used internally to manage
  26 L1 miss situations and to hold data which is returned or sent in response
  27 to a memory or I/O operation. Fill buffers can forward data to a load
  28 operation and also write data to the cache. When the fill buffer is
  29 deallocated it can retain the stale data of the preceding operations which
  30 can then be forwarded to a faulting or assisting load operation, which can
  31 be exploited under certain conditions. Fill buffers are shared between
  32 Hyper-Threads so cross thread leakage is possible.
  33
  34 MLPDS leaks Load Port Data. Load ports are used to perform load operations
  35 from memory or I/O. The received data is then forwarded to the register
  36 file or a subsequent operation. In some implementations the Load Port can
  37 contain stale data from a previous operation which can be forwarded to
  38 faulting or assisting loads under certain conditions, which again can be
  39 exploited. Load ports are shared between Hyper-Threads so cross
  40 thread leakage is possible.
  41
  42 MDSUM is a special case of MSBDS, MFBDS and MLPDS. An uncacheable load from
  43 memory that takes a fault or assist can leave data in a microarchitectural
  44 structure that may later be observed using one of the same methods used by
  45 MSBDS, MFBDS or MLPDS.
  46
  47 Exposure assumptions
  48 --------------------
  49
  50 It is assumed that attack code resides in user space or in a guest, with one
  51 exception. The rationale behind this assumption is that the code construct
  52 needed for exploiting MDS requires:
  53
  54  - to control the load to trigger a fault or assist
  55
  56  - to have a disclosure gadget which exposes the speculatively accessed
  57    data for consumption through a side channel.
  58
  59  - to control the pointer through which the disclosure gadget exposes the
  60    data
  61
  62 The existence of such a construct in the kernel cannot be excluded with
  63 100% certainty, but the complexity involved makes it extremely unlikely.
  64
  65 There is one exception, which is untrusted BPF. The functionality of
  66 untrusted BPF is limited, but it needs to be thoroughly investigated
  67 whether it can be used to create such a construct.
  68
  69
  70 Mitigation strategy
  71 -------------------
  72
  73 All variants have the same mitigation strategy at least for the single CPU
  74 thread case (SMT off): Force the CPU to clear the affected buffers.
  75
  76 This is achieved by using the otherwise unused and obsolete VERW
  77 instruction in combination with a microcode update. The microcode clears
  78 the affected CPU buffers when the VERW instruction is executed.
  79
  80 For virtualization there are two ways to achieve CPU buffer
  81 clearing: either via the modified VERW instruction or via the L1D Flush
  82 command. The latter is issued when L1TF mitigation is enabled so the extra
  83 VERW can be avoided. If the CPU is not affected by L1TF then VERW needs to
  84 be issued.
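
The choice between the two clearing methods on guest entry can be sketched as a small decision function. This is only an illustration of the paragraph above: the enum, the function name and both parameters are hypothetical and are not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels, not kernel identifiers. */
enum guest_entry_clear {
    GUEST_CLEAR_NONE,       /* MDS mitigation is off */
    GUEST_CLEAR_L1D_FLUSH,  /* L1D flush issued for L1TF also covers MDS */
    GUEST_CLEAR_VERW,       /* no L1D flush on entry, explicit VERW needed */
};

/*
 * Assumption: the caller already knows whether the MDS mitigation is on
 * and whether an L1D flush is issued on guest entry for L1TF.
 */
static enum guest_entry_clear
pick_guest_entry_clear(bool mds_mitigation_on, bool l1d_flush_on_entry)
{
    if (!mds_mitigation_on)
        return GUEST_CLEAR_NONE;
    if (l1d_flush_on_entry)
        return GUEST_CLEAR_L1D_FLUSH;   /* the extra VERW can be avoided */
    return GUEST_CLEAR_VERW;
}

/* Usage example: the two cases described in the text. */
static void guest_entry_clear_example(void)
{
    assert(pick_guest_entry_clear(true, true) == GUEST_CLEAR_L1D_FLUSH);
    assert(pick_guest_entry_clear(true, false) == GUEST_CLEAR_VERW);
}
```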
  85
  86 If the VERW instruction with the supplied segment selector argument is
  87 executed on a CPU without the microcode update there is no side effect
  88 other than a small number of pointlessly wasted CPU cycles.
  89
  90 This does not protect against cross Hyper-Thread attacks except for MSBDS
  91 which is only exploitable cross Hyper-Thread when one of the Hyper-Threads
  92 enters a C-state.
  93
  94 The kernel provides a function to invoke the buffer clearing:
  95
  96     mds_clear_cpu_buffers()
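
As a rough userspace illustration of what such a helper might look like, the sketch below issues VERW with a segment selector held in memory. It is not the kernel's implementation. The function name is hypothetical, the use of the current stack segment selector as the operand is an assumption made for this example, and GCC or Clang inline assembly syntax is assumed. On non-x86 builds it does nothing.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace approximation of a buffer clearing helper.
 * Returns 1 if VERW was issued, 0 when built for a non-x86 target.
 */
static inline int sketch_clear_cpu_buffers(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint16_t sel;

    /* Assumption: use the current stack segment selector as the operand. */
    __asm__ volatile("mov %%ss, %0" : "=r"(sel));

    /* VERW with a memory operand; architecturally it only modifies ZF. */
    __asm__ volatile("verw %0" : : "m"(sel) : "cc");
    return 1;
#else
    return 0;
#endif
}
```

Whether the buffers are actually cleared depends on the microcode update described above. Without it, the instruction only performs its architectural segment check.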
  97
  98 The mitigation is invoked on kernel/userspace, hypervisor/guest and C-state
  99 (idle) transitions.
 100
 101 As a special quirk to address virtualization scenarios where the host has
 102 the microcode updated, but the hypervisor does not (yet) expose the
 103 MD_CLEAR CPUID bit to guests, the kernel issues the VERW instruction in the
 104 hope that it might actually clear the buffers. The state is reflected
 105 accordingly.
 106
 107 According to current knowledge additional mitigations inside the kernel
 108 itself are not required because the necessary gadgets to expose the leaked
 109 data cannot be controlled in a way which allows exploitation from malicious
 110 user space or VM guests.
 111
 112 Kernel internal mitigation modes
 113 --------------------------------
 114
 115  ======= ============================================================
 116  off      Mitigation is disabled. Either the CPU is not affected or
 117           mds=off is supplied on the kernel command line
 118
 119  full     Mitigation is enabled. CPU is affected and MD_CLEAR is
 120           advertised in CPUID.
 121
 122  vmwerv   Mitigation is enabled. CPU is affected and MD_CLEAR is not
 123           advertised in CPUID. That is mainly for virtualization
 124           scenarios where the host has the updated microcode but the
 125           hypervisor does not expose MD_CLEAR in CPUID. It's a best
 126           effort approach without guarantee.
 127  ======= ============================================================
 128
 129 If the CPU is affected and mds=off is not supplied on the kernel command
 130 line then the kernel selects the appropriate mitigation mode depending on
 131 the availability of the MD_CLEAR CPUID bit.
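
The mode selection described by the table and the paragraph above can be written as a short function. This is a minimal sketch under stated assumptions: the enum, the function names and the three boolean inputs are hypothetical stand-ins for however the kernel actually tracks this state.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical enum mirroring the three modes in the table. */
enum mds_mode_sketch { MODE_OFF, MODE_FULL, MODE_VMWERV };

static enum mds_mode_sketch
select_mds_mode(bool cpu_affected, bool mds_off_on_cmdline,
                bool md_clear_in_cpuid)
{
    /* Unaffected CPU or mds=off: nothing to do. */
    if (!cpu_affected || mds_off_on_cmdline)
        return MODE_OFF;

    /* Affected CPU: the mode depends on the MD_CLEAR CPUID bit. */
    return md_clear_in_cpuid ? MODE_FULL : MODE_VMWERV;
}

static const char *mds_mode_name(enum mds_mode_sketch mode)
{
    switch (mode) {
    case MODE_FULL:   return "full";
    case MODE_VMWERV: return "vmwerv";
    default:          return "off";
    }
}

/* Usage example: a guest whose hypervisor hides MD_CLEAR. */
static void select_mds_mode_example(void)
{
    enum mds_mode_sketch mode = select_mds_mode(true, false, false);

    assert(strcmp(mds_mode_name(mode), "vmwerv") == 0);
}
```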
 132
 133 Mitigation points
 134 -----------------
 135
 136 1. Return to user space
 137 ^^^^^^^^^^^^^^^^^^^^^^^
 138
 139    When transitioning from kernel to user space the CPU buffers are flushed
 140    on affected CPUs when the mitigation is not disabled on the kernel
 141    command line. The mitigation is enabled through the static key
 142    mds_user_clear.
 143
 144    The mitigation is invoked in prepare_exit_to_usermode() which covers
 145    most of the kernel to user space transitions. There are a few exceptions
 146    which are not invoking prepare_exit_to_usermode() on return to user
 147    space. These exceptions use the paranoid exit code.
 148
 149    - Non Maskable Interrupt (NMI):
 150
 151      Access to sensitive data like keys or credentials in the NMI context is
 152      mostly theoretical: the CPU can do prefetching or execute a
 153      misspeculated code path and thereby fetch data which might end up
 154      leaking through a buffer.
 155
 156      But for mounting other attacks the kernel stack address of the task is
 157      already valuable information. So in full mitigation mode, the NMI is
 158      mitigated on the return from do_nmi() to provide almost complete
 159      coverage.
 160
 161    - Double fault (#DF):
 162
 163      A double fault is usually fatal, but the ESPFIX workaround, which can
 164      be triggered from user space through modify_ldt(2) is a recoverable
 165      double fault. #DF uses the paranoid exit path, so explicit mitigation
 166      in the double fault handler is required.
 167
 168    - Machine Check Exception (#MC):
 169
 170      Another corner case is a #MC which hits between the CPU buffer clear
 171      invocation and the actual return to user. As this still is in kernel
 172      space it takes the paranoid exit path which does not clear the CPU
 173      buffers. So the #MC handler repopulates the buffers to some
 174      extent. Machine checks are not reliably controllable and the window is
 175      extremely small so mitigation would just tick a checkbox that this
 176      theoretical corner case is covered. To keep the amount of special
 177      cases small, ignore #MC.
 178
 179    - Debug Exception (#DB):
 180
 181      This takes the paranoid exit path only when the INT1 breakpoint is in
 182      kernel space. #DB on a user space address takes the regular exit path,
 183      so no extra mitigation is required.
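
The return paths listed above can be summarised as a lookup from path to the place where the buffers get cleared. The sketch below only restates the text: all identifiers are hypothetical, and the single boolean stands in for the state of the mds_user_clear static key.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the return paths discussed above. */
enum exit_path_sketch {
    PATH_REGULAR,  /* goes through prepare_exit_to_usermode() */
    PATH_NMI,      /* paranoid exit */
    PATH_DF,       /* paranoid exit, reachable through ESPFIX */
    PATH_MC,       /* paranoid exit, deliberately not mitigated */
    PATH_DB_USER,  /* #DB on a user space address, regular exit */
};

enum clear_site_sketch {
    CLEAR_NOWHERE,
    CLEAR_IN_COMMON_EXIT,  /* covered by the regular exit code */
    CLEAR_IN_HANDLER,      /* explicit clear in the exception handler */
};

static enum clear_site_sketch
where_buffers_are_cleared(enum exit_path_sketch path, bool user_clear_enabled)
{
    if (!user_clear_enabled)
        return CLEAR_NOWHERE;

    switch (path) {
    case PATH_REGULAR:
    case PATH_DB_USER:
        return CLEAR_IN_COMMON_EXIT;
    case PATH_NMI:
    case PATH_DF:
        return CLEAR_IN_HANDLER;
    case PATH_MC:
    default:
        return CLEAR_NOWHERE;  /* the #MC window is ignored on purpose */
    }
}
```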
 184
 185
 186 2. C-State transition
 187 ^^^^^^^^^^^^^^^^^^^^^
 188
 189    When a CPU goes idle and enters a C-State the CPU buffers need to be
 190    cleared on affected CPUs when SMT is active. This addresses the
 191    repartitioning of the store buffer when one of the Hyper-Threads enters
 192    a C-State.
 193
 194    When SMT is inactive, i.e. either the CPU does not support it or all
 195    sibling threads are offline, CPU buffer clearing is not required.
 196
 197    The idle clearing is enabled on CPUs which are only affected by MSBDS
 198    and not by any other MDS variant. The other MDS variants cannot be
 199    protected against cross Hyper-Thread attacks because the Fill Buffer and
 200    the Load Ports are shared. So on CPUs affected by other variants, the
 201    idle clearing would be a window dressing exercise and is therefore not
 202    activated.
 203
 204    The invocation is controlled by the static key mds_idle_clear which is
 205    switched depending on the chosen mitigation mode and the SMT state of
 206    the system.
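
The conditions under which idle clearing is active can be modelled in a few lines. This is a simplified sketch of the rules stated above, not the kernel code: the struct, its fields and both functions are hypothetical, and a counter stands in for the actual buffer clear.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state, for illustration only. */
struct cpu_sketch {
    bool mitigation_enabled;  /* mitigation mode is not "off" */
    bool msbds_only;          /* CPU is affected by MSBDS and no other variant */
    bool smt_active;          /* at least one sibling thread is online */
    unsigned int clears;      /* number of simulated buffer clears */
};

/* Stands in for the state of the mds_idle_clear static key. */
static bool idle_clear_enabled(const struct cpu_sketch *cpu)
{
    assert(cpu);
    return cpu->mitigation_enabled && cpu->msbds_only && cpu->smt_active;
}

static void enter_idle_sketch(struct cpu_sketch *cpu)
{
    if (idle_clear_enabled(cpu))
        cpu->clears++;  /* the buffer clear happens before the C-State entry */

    /* halt()/mwait() would follow here. */
}
```

Calling enter_idle_sketch() on a CPU with SMT inactive, or on one that is affected by variants other than MSBDS, leaves the counter untouched.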
 207
 208    The buffer clear is only invoked before entering the C-State to prevent
 209    stale data from the idling CPU from spilling to the Hyper-Thread
 210    sibling after the store buffer got repartitioned and all entries are
 211    available to the non idle sibling.
 212
 213    When coming out of idle the store buffer is partitioned again so each
 214    sibling has half of it available. The CPU coming back from idle could
 215    then be speculatively exposed to contents of the sibling. The buffers are
 216    flushed either on exit to user space or on VMENTER so malicious code
 217    in user space or the guest cannot speculatively access them.
 218
 219    The mitigation is hooked into all variants of halt()/mwait(), but does
 220    not cover the legacy ACPI IO-Port mechanism because the ACPI idle driver
 221    was superseded by the intel_idle driver around 2010, which is
 222    preferred on all affected CPUs which are expected to gain the MD_CLEAR
 223    functionality in microcode. Aside from that the IO-Port mechanism is a
 224    legacy interface which is only used on older systems which are either
 225    not affected or do not receive microcode updates anymore.