Makoto report a below KASAN error: zram does out-of-bounds read. Because
strscpy copies from source up to count bytes unconditionally. It could
cause out-of-bounds read on next object in slab.
To prevent it, use strlcpy which checks source's length automatically.
BUG: KASAN: slab-out-of-bounds in strscpy+0x68/0x154
Read of size 8 at addr ffffffc0c3495a00 by task system_server/1314
..
Call trace:
strscpy+0x68/0x154
idle_store+0xc4/0x34c
dev_attr_store+0x50/0x6c
sysfs_kf_write+0x98/0xb4
kernfs_fop_write+0x198/0x260
__vfs_write+0x10c/0x338
vfs_write+0x114/0x238
SyS_write+0xc8/0x168
__sys_trace_return+0x0/0x4
Allocated by task 1314:
__kmalloc+0x280/0x318
kernfs_fop_write+0xac/0x260
__vfs_write+0x10c/0x338
vfs_write+0x114/0x238
SyS_write+0xc8/0x168
__sys_trace_return+0x0/0x4
The buggy address belongs to the object at ffffffc0c3495a00
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 0 bytes inside of
128-byte region [ffffffc0c3495a00, ffffffc0c3495a80)
The buggy address belongs to the page:
page:ffffffbf030d2500 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0
flags: 0x4000000000010200(slab|head)
page dumped because: kasan: bad access detected
Memory state around the buggy address: ffffffc0c3495900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffffffc0c3495980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffffffc0c3495a00: 04 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^ ffffffc0c3495a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffffffc0c3495b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Qian Cai [Fri, 29 Mar 2019 03:44:21 +0000 (20:44 -0700)]
mm/page_isolation.c: fix a wrong flag in set_migratetype_isolate()
Due to has_unmovable_pages() taking an incorrect irqsave flag instead of
the isolation flag in set_migratetype_isolate(), there are issues with
HWPOSION and error reporting where dump_page() is not called when there
is an unmovable page.
Link: http://lkml.kernel.org/r/20190320204941.53731-1-cai@lca.pw Fixes: d381c54760dc ("mm: only report isolation failures when offlining memory") Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Qian Cai <cai@lca.pw> Cc: <stable@vger.kernel.org> [5.0.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Fri, 29 Mar 2019 03:44:16 +0000 (20:44 -0700)]
mm/memory_hotplug.c: fix notification in offline error path
When start_isolate_page_range() returned -EBUSY in __offline_pages(), it
calls memory_notify(MEM_CANCEL_OFFLINE, &arg) with an uninitialized
"arg". As the result, it triggers warnings below. Also, it is only
necessary to notify MEM_CANCEL_OFFLINE after MEM_GOING_OFFLINE.
Link: http://lkml.kernel.org/r/20190320204255.53571-1-cai@lca.pw Fixes: 7960509329c2 ("mm, memory_hotplug: print reason for the offlining failure") Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org> [5.0.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrei Vagin [Fri, 29 Mar 2019 03:44:13 +0000 (20:44 -0700)]
ptrace: take into account saved_sigmask in PTRACE{GET,SET}SIGMASK
There are a few system calls (pselect, ppoll, etc) which replace a task
sigmask while they are running in a kernel-space
When a task calls one of these syscalls, the kernel saves a current
sigmask in task->saved_sigmask and sets a syscall sigmask.
On syscall-exit-stop, ptrace traps a task before restoring the
saved_sigmask, so PTRACE_GETSIGMASK returns the syscall sigmask and
PTRACE_SETSIGMASK does nothing, because its sigmask is replaced by
saved_sigmask, when the task returns to user-space.
This patch fixes this problem. PTRACE_GETSIGMASK returns saved_sigmask
if it's set. PTRACE_SETSIGMASK drops the TIF_RESTORE_SIGMASK flag.
Link: http://lkml.kernel.org/r/20181120060616.6043-1-avagin@gmail.com Fixes: 29000caecbe8 ("ptrace: add ability to get/set signal-blocked mask") Signed-off-by: Andrei Vagin <avagin@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oscar Salvador [Fri, 29 Mar 2019 03:44:01 +0000 (20:44 -0700)]
mm/debug.c: fix __dump_page when mapping->host is not set
While debugging something, I added a dump_page() into do_swap_page(),
and I got the splat from below. The issue happens when dereferencing
mapping->host in __dump_page():
Swap address space does not contain an inode information, and so
mapping->host equals NULL.
Although the dump_page() call was added artificially into
do_swap_page(), I am not sure if we can hit this from any other path, so
it looks worth fixing it. We can easily do that by checking
mapping->host first.
Link: http://lkml.kernel.org/r/20190318072931.29094-1-osalvador@suse.de Fixes: 1c6fb1d89e73c ("mm: print more information about mapping in __dump_page") Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Fri, 29 Mar 2019 03:43:55 +0000 (20:43 -0700)]
mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified
When MPOL_MF_STRICT was specified and an existing page was already on a
node that does not follow the policy, mbind() should return -EIO. But
commit 6f4576e3687b ("mempolicy: apply page table walker on
queue_pages_range()") broke the rule.
And commit c8633798497c ("mm: mempolicy: mbind and migrate_pages support
thp migration") didn't return the correct value for THP mbind() too.
If MPOL_MF_STRICT is set, ignore vma_migratable() to make sure it
reaches queue_pages_to_pte_range() or queue_pages_pmd() to check if an
existing page was already on a node that does not follow the policy.
And, non-migratable vma may be used, return -EIO too if MPOL_MF_MOVE or
MPOL_MF_MOVE_ALL was specified.
Tested with https://github.com/metan-ucw/ltp/blob/master/testcases/kernel/syscalls/mbind/mbind02.c
[akpm@linux-foundation.org: tweak code comment] Link: http://lkml.kernel.org/r/1553020556-38583-1-git-send-email-yang.shi@linux.alibaba.com Fixes: 6f4576e3687b ("mempolicy: apply page table walker on queue_pages_range()") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Oscar Salvador <osalvador@suse.de> Reported-by: Cyril Hrubis <chrubis@suse.cz> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Rafael Aquini <aquini@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>> mm/memory.c:3968:21: sparse: incorrect type in assignment (different
>> base types) @@ expected restricted vm_fault_t [usertype] ret @@
>> got e] ret @@
mm/memory.c:3968:21: expected restricted vm_fault_t [usertype] ret
mm/memory.c:3968:21: got int
This patch converts to return vm_fault_t type for hugetlb_fault() when
CONFIG_HUGETLB_PAGE=n.
Regarding the sparse warning, Luc said:
: This is the expected behaviour. The constant 0 is magic regarding bitwise
: types but ({ ...; 0; }) is not, it is just an ordinary expression of type
: 'int'.
:
: So, IMHO, Souptick's patch is the right thing to do.
Link: http://lkml.kernel.org/r/20190318162604.GA31553@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nicolas Boichat [Fri, 29 Mar 2019 03:43:46 +0000 (20:43 -0700)]
iommu/io-pgtable-arm-v7s: request DMA32 memory, and improve debugging
IOMMUs using ARMv7 short-descriptor format require page tables (level 1
and 2) to be allocated within the first 4GB of RAM, even on 64-bit
systems.
For level 1/2 pages, ensure GFP_DMA32 is used if CONFIG_ZONE_DMA32 is
defined (e.g. on arm64 platforms).
For level 2 pages, allocate a slab cache in SLAB_CACHE_DMA32. Note that
we do not explicitly pass GFP_DMA[32] to kmem_cache_zalloc, as this is
not strictly necessary, and would cause a warning in mm/sl*b.c, as we
did not update GFP_SLAB_BUG_MASK.
Also, print an error when the physical address does not fit in
32-bit, to make debugging easier in the future.
Link: http://lkml.kernel.org/r/20181210011504.122604-3-drinkcat@chromium.org Fixes: ad67f5a6545f ("arm64: replace ZONE_DMA with ZONE_DMA32") Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: Huaisheng Ye <yehs1@lenovo.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sasha Levin <Alexander.Levin@microsoft.com> Cc: Tomasz Figa <tfiga@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Yong Wu <yong.wu@mediatek.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nicolas Boichat [Fri, 29 Mar 2019 03:43:42 +0000 (20:43 -0700)]
mm: add support for kmem caches in DMA32 zone
Patch series "iommu/io-pgtable-arm-v7s: Use DMA32 zone for page tables",
v6.
This is a followup to the discussion in [1], [2].
IOMMUs using ARMv7 short-descriptor format require page tables (level 1
and 2) to be allocated within the first 4GB of RAM, even on 64-bit
systems.
For L1 tables that are bigger than a page, we can just use
__get_free_pages with GFP_DMA32 (on arm64 systems only, arm would still
use GFP_DMA).
For L2 tables that only take 1KB, it would be a waste to allocate a full
page, so we considered 3 approaches:
1. This series, adding support for GFP_DMA32 slab caches.
2. genalloc, which requires pre-allocating the maximum number of L2 page
tables (4096, so 4MB of memory).
3. page_frag, which is not very memory-efficient as it is unable to reuse
freed fragments until the whole page is freed. [3]
This series is the most memory-efficient approach.
stable@ note:
We confirmed that this is a regression, and IOMMU errors happen on 4.19
and linux-next/master on MT8173 (elm, Acer Chromebook R13). The issue
most likely starts from commit ad67f5a6545f ("arm64: replace ZONE_DMA
with ZONE_DMA32"), i.e. 4.15, and presumably breaks a number of Mediatek
platforms (and maybe others?).
IOMMUs using ARMv7 short-descriptor format require page tables to be
allocated within the first 4GB of RAM, even on 64-bit systems. On arm64,
this is done by passing GFP_DMA32 flag to memory allocation functions.
For IOMMU L2 tables that only take 1KB, it would be a waste to allocate
a full page using get_free_pages, so we considered 3 approaches:
1. This patch, adding support for GFP_DMA32 slab caches.
2. genalloc, which requires pre-allocating the maximum number of L2
page tables (4096, so 4MB of memory).
3. page_frag, which is not very memory-efficient as it is unable
to reuse freed fragments until the whole page is freed.
This change makes it possible to create a custom cache in DMA32 zone using
kmem_cache_create, then allocate memory using kmem_cache_alloc.
We do not create a DMA32 kmalloc cache array, as there are currently no
users of kmalloc(..., GFP_DMA32). These calls will continue to trigger a
warning, as we keep GFP_DMA32 in GFP_SLAB_BUG_MASK.
This implies that calls to kmem_cache_*alloc on a SLAB_CACHE_DMA32
kmem_cache must _not_ use GFP_DMA32 (it is anyway redundant and
unnecessary).
Link: http://lkml.kernel.org/r/20181210011504.122604-2-drinkcat@chromium.org Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Sasha Levin <Alexander.Levin@microsoft.com> Cc: Huaisheng Ye <yehs1@lenovo.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Yong Wu <yong.wu@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Tomasz Figa <tfiga@google.com> Cc: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hsin-Yi Wang <hsinyi@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Darrick J. Wong [Fri, 29 Mar 2019 03:43:38 +0000 (20:43 -0700)]
ocfs2: fix inode bh swapping mixup in ocfs2_reflink_inodes_lock
ocfs2_reflink_inodes_lock() can swap the inode1/inode2 variables so that
we always grab cluster locks in order of increasing inode number.
Unfortunately, we forget to swap the inode record buffer head pointers
when we've done this, which leads to incorrect bookkeepping when we're
trying to make the two inodes have the same refcount tree.
This has the effect of causing filesystem shutdowns if you're trying to
reflink data from inode 100 into inode 97, where inode 100 already has a
refcount tree attached and inode 97 doesn't. The reflink code decides
to copy the refcount tree pointer from 100 to 97, but uses inode 97's
inode record to open the tree root (which it doesn't have) and blows up.
This issue causes filesystem shutdowns and metadata corruption!
Link: http://lkml.kernel.org/r/20190312214910.GK20533@magnolia Fixes: 29ac8e856cb369 ("ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Fri, 29 Mar 2019 03:43:34 +0000 (20:43 -0700)]
mm/hotplug: fix offline undo_isolate_page_range()
Commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded
memory to zones until online") introduced move_pfn_range_to_zone() which
calls memmap_init_zone() during onlining a memory block.
memmap_init_zone() will reset pagetype flags and makes migrate type to
be MOVABLE.
However, in __offline_pages(), it also call undo_isolate_page_range()
after offline_isolated_pages() to do the same thing. Due to commit 2ce13640b3f4 ("mm: __first_valid_page skip over offline pages") changed
__first_valid_page() to skip offline pages, undo_isolate_page_range()
here just waste CPU cycles looping around the offlining PFN range while
doing nothing, because __first_valid_page() will return NULL as
offline_isolated_pages() has already marked all memory sections within
the pfn range as offline via offline_mem_sections().
Also, after calling the "useless" undo_isolate_page_range() here, it
reaches the point of no returning by notifying MEM_OFFLINE. Those pages
will be marked as MIGRATE_MOVABLE again once onlining. The only thing
left to do is to decrease the number of isolated pageblocks zone counter
which would make some paths of the page allocation slower that the above
commit introduced.
Even if alloc_contig_range() can be used to isolate 16GB-hugetlb pages
on ppc64, an "int" should still be enough to represent the number of
pageblocks there. Fix an incorrect comment along the way.
[cai@lca.pw: v4] Link: http://lkml.kernel.org/r/20190314150641.59358-1-cai@lca.pw Link: http://lkml.kernel.org/r/20190313143133.46200-1-cai@lca.pw Fixes: 2ce13640b3f4 ("mm: __first_valid_page skip over offline pages") Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [4.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Fri, 29 Mar 2019 03:43:30 +0000 (20:43 -0700)]
fs/open.c: allow opening only regular files during execve()
syzbot is hitting lockdep warning [1] due to trying to open a fifo
during an execve() operation. But we don't need to open non regular
files during an execve() operation, for all files which we will need are
the executable file itself and the interpreter programs like /bin/sh and
ld-linux.so.2 .
Since the manpage for execve(2) says that execve() returns EACCES when
the file or a script interpreter is not a regular file, and the manpage
for uselib(2) says that uselib() can return EACCES, and we use
FMODE_EXEC when opening for execve()/uselib(), we can bail out if a non
regular file is requested with FMODE_EXEC set.
Since this deadlock followed by khungtaskd warnings is trivially
reproducible by a local unprivileged user, and syzbot's frequent crash
due to this deadlock defers finding other bugs, let's workaround this
deadlock until we get a chance to find a better solution.
Qian Cai [Fri, 29 Mar 2019 03:43:23 +0000 (20:43 -0700)]
mm/debug.c: add a cast to u64 for atomic64_read()
atomic64_read() on ppc64le returns "long int", so fix the same way as
commit d549f545e690 ("drm/virtio: use %llu format string form
atomic64_t") by adding a cast to u64, which makes it work on all arches.
In file included from ./include/linux/printk.h:7,
from ./include/linux/kernel.h:15,
from mm/debug.c:9:
mm/debug.c: In function 'dump_mm':
./include/linux/kern_levels.h:5:18: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 19 has type 'long int' [-Wformat=]
#define KERN_SOH "A" /* ASCII Start Of Header */
^~~~~~
./include/linux/kern_levels.h:8:20: note: in expansion of macro
'KERN_SOH'
#define KERN_EMERG KERN_SOH "0" /* system is unusable */
^~~~~~~~
./include/linux/printk.h:297:9: note: in expansion of macro 'KERN_EMERG'
printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
^~~~~~~~~~
mm/debug.c:133:2: note: in expansion of macro 'pr_emerg'
pr_emerg("mm %px mmap %px seqnum %llu task_size %lu"
^~~~~~~~
mm/debug.c:140:17: note: format string is defined here
"pinned_vm %llx data_vm %lx exec_vm %lx stack_vm %lx"
~~~^
%lx
Link: http://lkml.kernel.org/r/20190310183051.87303-1-cai@lca.pw Fixes: 70f8a3ca68d3 ("mm: make mm->pinned_vm an atomic64 counter") Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Fri, 29 Mar 2019 03:43:15 +0000 (20:43 -0700)]
kasan: fix variable 'tag' set but not used warning
set_tag() compiles away when CONFIG_KASAN_SW_TAGS=n, so make
arch_kasan_set_tag() a static inline function to fix warnings below.
mm/kasan/common.c: In function '__kasan_kmalloc':
mm/kasan/common.c:475:5: warning: variable 'tag' set but not used [-Wunused-but-set-variable]
u8 tag;
^~~
Link: http://lkml.kernel.org/r/20190307185244.54648-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gao Xiang [Thu, 28 Mar 2019 20:14:58 +0000 (04:14 +0800)]
staging: erofs: keep corrupted fs from crashing kernel in erofs_readdir()
After commit 419d6efc50e9, kernel cannot be crashed in the namei
path. However, corrupted nameoff can do harm in the process of
readdir for scenerios without dm-verity as well. Fix it now.
Joerg Roedel [Thu, 28 Mar 2019 10:44:59 +0000 (11:44 +0100)]
iommu/amd: Reserve exclusion range in iova-domain
If a device has an exclusion range specified in the IVRS
table, this region needs to be reserved in the iova-domain
of that device. This hasn't happened until now and can cause
data corruption on data transfered with these devices.
Treat exclusion ranges as reserved regions in the iommu-core
to fix the problem.
Fixes: be2a022c0dd0 ('x86, AMD IOMMU: add functions to parse IOMMU memory mapping requirements for devices') Signed-off-by: Joerg Roedel <jroedel@suse.de> Reviewed-by: Gary R Hook <gary.hook@amd.com>
Changbin Du [Mon, 25 Mar 2019 15:16:47 +0000 (15:16 +0000)]
kconfig/[mn]conf: handle backspace (^H) key
Backspace is not working on some terminal emulators which do not send the
key code defined by terminfo. Terminals either send '^H' (8) or '^?' (127).
But currently only '^?' is handled. Let's also handle '^H' for those
terminals.
Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Merge tag 'fixes-for-v5.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb into usb-linus
Felipe writes:
usb: fixes for v5.1-rc2
One deadlock fix on f_hid. NET2280 got a fix on its dequeue
implementation and a fix for overrun of OUT messages.
DWC3 learned about another Intel product: Comment Lake PCH.
NET2272 got a similar fix to NET2280 on its dequeue implementation.
* tag 'fixes-for-v5.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb:
USB: gadget: f_hid: fix deadlock in f_hidg_write()
usb: gadget: net2272: Fix net2272_dequeue()
usb: gadget: net2280: Fix net2280_dequeue()
usb: gadget: net2280: Fix overrun of OUT messages
usb: dwc3: pci: add support for Comet Lake PCH ID
powerpc/pseries/mce: Fix misleading print for TLB mutlihit
On pseries, TLB multihit are reported as D-Cache Multihit. This is because
the wrongly populated mc_err_types[] array. Per PAPR, TLB error type is 0x04
and mc_err_types[4] points to "D-Cache" instead of "TLB" string. Fixup the
mc_err_types[] array.
Linus Walleij [Fri, 29 Mar 2019 02:04:47 +0000 (03:04 +0100)]
Merge tag 'gpio-v5.1-rc3-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux into fixes
gpio fixes for v5.1-rc3
- fix for a potential NULL-pointer dereference in the aspeed driver
- revert of the commit using the new gpio_set_config() when setting
debaunce and transitory state config as it caused a regression in
the aspeed driver
- two fixes for gpio-mockup for debugfs problems introduced in the
last merge window
Dave Airlie [Fri, 29 Mar 2019 00:38:25 +0000 (10:38 +1000)]
Merge tag 'drm-intel-fixes-2019-03-28' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
drm/i915 fixes for v5.2-rc3:
- fix mmap range checks
- fix gvt ppgtt mm LRU list access races
- fix selftest error pointer check
- fix a macro definition (pre-emptive for potential further backports)
- fix one AML SKU ULX status
Linus Torvalds [Thu, 28 Mar 2019 20:29:09 +0000 (13:29 -0700)]
Merge tag 'pci-v5.1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
"PCI fixes:
- Clear level-triggered interrupts for the bandwidth notification
supported added for v5.1 (Alexandru Gagniuc)
- Clear bandwidth notification interrupts before enabling them (Lukas
Wunner)
- Report post-enumeration bandwidth changes only once for
multi-function devices (Lukas Wunner)"
* tag 'pci-v5.1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI/LINK: Deduplicate bandwidth reports for multi-function devices
PCI/LINK: Clear bandwidth notification interrupt before enabling it
PCI/LINK: Supply IRQ handler so level-triggered IRQs are acked
Adrian Hunter [Wed, 27 Mar 2019 07:28:26 +0000 (09:28 +0200)]
perf scripts python: exported-sql-viewer.py: Fix python3 support
Unlike python2, python3 strings are not compatible with byte strings.
That results in disassembly not working for the branches reports. Fixup
those places overlooked in the port to python3.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Fixes: beda0e725e5f ("perf script python: Add Python3 support to exported-sql-viewer.py") Link: http://lkml.kernel.org/r/20190327072826.19168-3-adrian.hunter@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Paolo Bonzini [Thu, 28 Mar 2019 18:07:30 +0000 (19:07 +0100)]
Merge tag 'kvmarm-fixes-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
KVM/ARM fixes for 5.1
- Fix THP handling in the presence of pre-existing PTEs
- Honor request for PTE mappings even when THPs are available
- GICv4 performance improvement
- Take the srcu lock when writing to guest-controlled ITS data structures
- Reset the virtual PMU in preemptible context
- Various cleanups
pyside version 1 fails to handle python3 large integers in some cases,
resulting in Qt getting into a never-ending loop. This affects:
samples Table
samples_view Table
All branches Report
Selected branches Report
Add workarounds for those cases.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Fixes: beda0e725e5f ("perf script python: Add Python3 support to exported-sql-viewer.py") Link: http://lkml.kernel.org/r/20190327072826.19168-2-adrian.hunter@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Wei Li [Thu, 28 Feb 2019 09:20:03 +0000 (17:20 +0800)]
perf machine: Update kernel map address and re-order properly
Since commit 1fb87b8e9599 ("perf machine: Don't search for active kernel
start in __machine__create_kernel_maps"), the __machine__create_kernel_maps()
just create a map what start and end are both zero. Though the address will be
updated later, the order of map in the rbtree may be incorrect.
The commit ee05d21791db ("perf machine: Set main kernel end address properly")
fixed the logic in machine__create_kernel_maps(), but it's still wrong in
function machine__process_kernel_mmap_event().
To reproduce this issue, we need an environment which the module address
is before the kernel text segment. I tested it on an aarch64 machine with
kernel 4.19.25:
[root@localhost hulk]# grep _stext /proc/kallsyms ffff000008081000 T _stext
[root@localhost hulk]# grep _etext /proc/kallsyms ffff000009780000 R _etext
[root@localhost hulk]# tail /proc/modules
hisi_sas_v2_hw 77824 0 - Live 0xffff00000191d000
nvme_core 126976 7 nvme, Live 0xffff0000018b6000
mdio 20480 1 ixgbe, Live 0xffff0000018ab000
hisi_sas_main 106496 1 hisi_sas_v2_hw, Live 0xffff000001861000
hns_mdio 20480 2 - Live 0xffff000001822000
hnae 28672 3 hns_dsaf,hns_enet_drv, Live 0xffff000001815000
dm_mirror 40960 0 - Live 0xffff000001804000
dm_region_hash 32768 1 dm_mirror, Live 0xffff0000017f5000
dm_log 32768 2 dm_mirror,dm_region_hash, Live 0xffff0000017e7000
dm_mod 315392 17 dm_mirror,dm_log, Live 0xffff000001780000
[root@localhost hulk]#
Before fix:
[root@localhost bin]# perf record sleep 3
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.011 MB perf.data (9 samples) ]
[root@localhost bin]# perf buildid-list -i perf.data 4c4e46c971ca935f781e603a09b52a92e8bdfee8 [vdso]
[root@localhost bin]# perf buildid-list -i perf.data -H 0000000000000000000000000000000000000000 /proc/kcore
[root@localhost bin]#
Signed-off-by: Wei Li <liwei391@huawei.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Kim Phillips <kim.phillips@arm.com> Cc: Li Bin <huawei.libin@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20190228092003.34071-1-liwei391@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Warning: Kernel ABI header at 'tools/arch/powerpc/include/uapi/asm/kvm.h' differs from latest version at 'arch/powerpc/include/uapi/asm/kvm.h'
diff -u tools/arch/powerpc/include/uapi/asm/kvm.h arch/powerpc/include/uapi/asm/kvm.h
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Link: https://lkml.kernel.org/n/tip-4pb7ywp9536hub2pnj4hu6i4@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
tools headers: Update x86's syscall_64.tbl and uapi/asm-generic/unistd
To pick up the changes introduced in the following csets:
2b188cc1bb85 ("Add io_uring IO interface") edafccee56ff ("io_uring: add support for pre-mapped user IO buffers") 3eb39f47934f ("signal: add pidfd_send_signal() syscall")
This makes 'perf trace' to become aware of these new syscalls, so that
one can use them like 'perf trace -e ui_uring*,*signal' to do a system
wide strace-like session looking at those syscalls, for instance.
More work needed to make the tools/perf/examples/bpf/augmented_raw_syscalls.c
BPF program to copy the 'struct io_uring_params' arguments to perf's ring
buffer so that 'perf trace' can use the BTF info put in place by pahole's
conversion of the kernel DWARF and then auto-beautify those arguments.
This patch produces the expected change in the generated syscalls table
for x86_64:
Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h'
diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
Warning: Kernel ABI header at 'tools/perf/arch/x86/entry/syscalls/syscall_64.tbl' differs from latest version at 'arch/x86/entry/syscalls/syscall_64.tbl'
diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: Christian Brauner <christian@brauner.io> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Yonghong Song <yhs@fb.com> Link: https://lkml.kernel.org/n/tip-p0ars3otuc52x5iznf21shhw@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
e46c2e99f600 ("drm/i915: Expose RPCS (SSEU) configuration to userspace (Gen11 only)")
That don't cause changes in the generated perf binaries.
To silence this perf build warning:
Warning: Kernel ABI header at 'tools/include/uapi/drm/i915_drm.h' differs from latest version at 'include/uapi/drm/i915_drm.h'
diff -u tools/include/uapi/drm/i915_drm.h include/uapi/drm/i915_drm.h
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://lkml.kernel.org/n/tip-h6bspm1nomjnpr90333rrx7q@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
tools arch x86: Sync asm/cpufeatures.h with the kernel sources
To get the changes from:
52f64909409c ("x86: Add TSX Force Abort CPUID/MSR")
That don't cause any changes in the generated perf binaries.
And silence this perf build warning:
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/cpufeatures.h' differs from latest version at 'arch/x86/include/asm/cpufeatures.h'
diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/n/tip-zv8kw8vnb1zppflncpwfsv2w@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
tools headers uapi: Sync linux/fcntl.h to get the F_SEAL_FUTURE_WRITE addition
To get the changes in:
ab3948f58ff8 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")
And silence this perf build warning:
Warning: Kernel ABI header at 'tools/include/uapi/linux/fcntl.h' differs from latest version at 'include/uapi/linux/fcntl.h'
diff -u tools/include/uapi/linux/fcntl.h include/uapi/linux/fcntl.h
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-lvfx5cgf0xzmdi9mcjva1ttl@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
And to have the system's sys/mman.h find the definition of MAP_SHARED
and MAP_PRIVATE, make sure they are defined in the tools/ mman-common.h
in a way that keeps it the same as the kernel's, need for keeping the
Android's NDK cross build working.
This silences these perf build warnings:
Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/mman-common.h' differs from latest version at 'include/uapi/asm-generic/mman-common.h'
diff -u tools/include/uapi/asm-generic/mman-common.h include/uapi/asm-generic/mman-common.h
Warning: Kernel ABI header at 'tools/include/uapi/linux/mman.h' differs from latest version at 'include/uapi/linux/mman.h'
diff -u tools/include/uapi/linux/mman.h include/uapi/linux/mman.h
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-h80ycpc6pedg9s5z2rwpy6ws@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Thu, 14 Mar 2019 14:00:10 +0000 (15:00 +0100)]
perf evsel: Fix max perf_event_attr.precise_ip detection
After a discussion with Andi, move the perf_event_attr.precise_ip
detection for maximum precise config (via :P modifier or for default
cycles event) to perf_evsel__open().
The current detection in perf_event_attr__set_max_precise_ip() is
tricky, because precise_ip config is specific for given event and it
currently checks only hw cycles.
We now check for valid precise_ip value right after failing
sys_perf_event_open() for specific event, before any of the
perf_event_attr fallback code gets executed.
This way we get the proper config in perf_event_attr together with
allowed precise_ip settings.
Adrian Hunter [Mon, 25 Mar 2019 13:51:35 +0000 (15:51 +0200)]
perf intel-pt: Fix TSC slip
A TSC packet can slip past MTC packets so that the timestamp appears to
go backwards. One estimate is that can be up to about 40 CPU cycles,
which is certainly less than 0x1000 TSC ticks, but accept slippage an
order of magnitude more to be on the safe side.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: stable@vger.kernel.org Fixes: 79b58424b821c ("perf tools: Add Intel PT support for decoding MTC packets") Link: http://lkml.kernel.org/r/20190325135135.18348-1-adrian.hunter@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Solomon Tan [Fri, 22 Mar 2019 05:22:55 +0000 (13:22 +0800)]
perf cs-etm: Add missing case value
The following error was thrown when compiling `tools/perf` using OpenCSD
v0.11.1. This patch fixes said error.
CC util/intel-pt-decoder/intel-pt-log.o
CC util/cs-etm-decoder/cs-etm-decoder.o
util/cs-etm-decoder/cs-etm-decoder.c: In function
‘cs_etm_decoder__buffer_range’:
util/cs-etm-decoder/cs-etm-decoder.c:370:2: error: enumeration value
‘OCSD_INSTR_WFI_WFE’ not handled in switch [-Werror=switch-enum]
switch (elem->last_i_type) {
^~~~~~
CC util/intel-pt-decoder/intel-pt-decoder.o
cc1: all warnings being treated as errors
Because `OCSD_INSTR_WFI_WFE` case was added only in v0.11.0, the minimum
required OpenCSD library version for this patch is no longer v0.10.0.
Signed-off-by: Solomon Tan <solomonbobstoner@gmail.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robert Walker <robert.walker@arm.com> Cc: Suzuki K Poulouse <suzuki.poulose@arm.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/20190322052255.GA4809@w-OptiPlex-7050 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jens Axboe [Thu, 28 Mar 2019 17:20:53 +0000 (11:20 -0600)]
Merge branch 'nvme-5.1' of git://git.infradead.org/nvme into for-linus
Pull NVMe fixes from Christoph:
"A few accumulated small fixes:
- fix an endianess misannotation that sneaked in this merge window in
nvme-tcp (me)
- fix nvme-loop to handle multi-page segments (Ming)
- fix error handling in the nvmet configfs code (Max)
- add a missing requeue point in the multipath code (Martin George)"
* 'nvme-5.1' of git://git.infradead.org/nvme:
nvmet: fix error flow during ns enable
nvmet: fix building bvec from sg list
nvme-multipath: relax ANA state check
nvme-tcp: fix an endianess miss-annotation
Ming Lei [Wed, 27 Mar 2019 09:07:22 +0000 (17:07 +0800)]
nvmet: fix building bvec from sg list
There are two mistakes for building bvec from sg list for file
backed ns:
- use request data length to compute number of io vector, this way
doesn't consider sg->offset, and the result may be smaller than required
io vectors
- bvec->bv_len isn't capped by sg->length
This patch fixes this issue by building bvec from sg directly, given
the whole IO stack is ready for multi-page bvec.
Reported-by: Yi Zhang <yi.zhang@redhat.com> Fixes: 3a85a5de29ea ("nvme-loop: add a NVMe loopback host driver") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Martin George [Wed, 27 Mar 2019 08:52:56 +0000 (09:52 +0100)]
nvme-multipath: relax ANA state check
When undergoing state transitions I/O might be requeued, hence
we should always call nvme_mpath_set_live() to schedule requeue_work
whenever the nvme device is live, independent on whether the
old state was live or not.
Signed-off-by: Martin George <marting@netapp.com> Signed-off-by: Gargi Srinivas <sring@netapp.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
If the call to of_gpiochip_scan_gpios() in of_gpiochip_add() fails, no
error handling is performed. This lead to the need of callers to call
of_gpiochip_remove() on failure, which causes "BAD of_node_put() on ..."
if the failure happened before the call to of_node_get().
Fix this by adding proper error handling.
Note that calling gpiochip_remove_pin_ranges() multiple times causes no
harm: subsequent calls are a no-op.
KVM: doc: Document the life cycle of a VM and its resources
The series to add memcg accounting to KVM allocations[1] states:
There are many KVM kernel memory allocations which are tied to the
life of the VM process and should be charged to the VM process's
cgroup.
While it is correct to account KVM kernel allocations to the cgroup of
the process that created the VM, it's technically incorrect to state
that the KVM kernel memory allocations are tied to the life of the VM
process. This is because the VM itself, i.e. struct kvm, is not tied to
the life of the process which created it, rather it is tied to the life
of its associated file descriptor. In other words, kvm_destroy_vm() is
not invoked until fput() decrements its associated file's refcount to
zero. A simple example is to fork() in Qemu and have the child sleep
indefinitely; kvm_destroy_vm() isn't called until Qemu closes its file
descriptor *and* the rogue child is killed.
The allocations are guaranteed to be *accounted* to the process which
created the VM, but only because KVM's per-{VM,vCPU} ioctls reject the
ioctl() with -EIO if kvm->mm != current->mm. I.e. the child can keep
the VM "alive" but can't do anything useful with its reference.
Note that because 'struct kvm' also holds a reference to the mm_struct
of its owner, the above behavior also applies to userspace allocations.
Given that mucking with a VM's file descriptor can lead to subtle and
undesirable behavior, e.g. memcg charges persisting after a VM is shut
down, explicitly document a VM's lifecycle and its impact on the VM's
resources.
Alternatively, KVM could aggressively free resources when the creating
process exits, e.g. via mmu_notifier->release(). However, mmu_notifier
isn't guaranteed to be available, and freeing resources when the creator
exits is likely to be error prone and fragile as KVM would need to
ensure that it only freed resources that are truly out of reach. In
practice, the existing behavior shouldn't be problematic as a properly
configured system will prevent a child process from being moved out of
the appropriate cgroup hierarchy, i.e. prevent hiding the process from
the OOM killer, and will prevent an unprivileged user from being able to
to hold a reference to struct kvm via another method, e.g. debugfs.
KVM: selftests: complete IO before migrating guest state
Documentation/virtual/kvm/api.txt states:
NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and
KVM_EXIT_EPR the corresponding operations are complete (and guest
state is consistent) only after userspace has re-entered the
kernel with KVM_RUN. The kernel side will first finish incomplete
operations and then check for pending signals. Userspace can
re-enter the guest with an unmasked signal pending to complete
pending operations.
Because guest state may be inconsistent, starting state migration after
an IO exit without first completing IO may result in test failures, e.g.
a proposed change to KVM's handling of %rip in its fast PIO handling[1]
will cause the new VM, i.e. the post-migration VM, to have its %rip set
to the IN instruction that triggered KVM_EXIT_IO, leading to a test
assertion due to a stage mismatch.
For simplicitly, require KVM_CAP_IMMEDIATE_EXIT to complete IO and skip
the test if it's not available. The addition of KVM_CAP_IMMEDIATE_EXIT
predates the state selftest by more than a year.
Fixes: fa3899add1056 ("kvm: selftests: add basic test for state save and restore") Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: selftests: disable stack protector for all KVM tests
Since 4.8.3, gcc has enabled -fstack-protector by default. This is
problematic for the KVM selftests as they do not configure fs or gs
segments (the stack canary is pulled from fs:0x28). With the default
behavior, gcc will insert a stack canary on any function that creates
buffers of 8 bytes or more. As a result, ucall() will hit a triple
fault shutdown due to reading a bad fs segment when inserting its
stack canary, i.e. every test fails with an unexpected SHUTDOWN.
Fixes: 14c47b7530e2d ("kvm: selftests: introduce ucall") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM selftests embed the guest "image" as a function in the test itself
and extract the guest code at runtime by manually parsing the elf
headers. The parsing is very simple and doesn't supporting fancy things
like position independent executables. Recent versions of gcc enable
pie by default, which results in triple fault shutdowns in the guest due
to the virtual address in the headers not matching up with the virtual
address retrieved from the function pointer.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: selftests: assert on exit reason in CR4/cpuid sync test
...so that the test doesn't end up in an infinite loop if it fails for
whatever reason, e.g. SHUTDOWN due to gcc inserting stack canary code
into ucall() and attempting to derefence a null segment.
Fixes: ca359066889f7 ("kvm: selftests: add cr4_cpuid_sync_test") Cc: Wei Huang <wei@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
OUT 92h or CF9h. Userspace may emulate said mechanism, i.e. reset a
vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
To avoid corruping %rip after such a reset, commit 0967b7bf1c22 ("KVM:
Skip pio instruction when it is emulated, not executed") changed the
behavior of PIO handlers, i.e. today's "fast" PIO handling to skip the
instruction prior to exiting to userspace. Full emulation doesn't need
such tricks becase re-emulating the instruction will naturally handle
%rip being changed to point at the reset vector.
Updating %rip prior to executing to userspace has several drawbacks:
- Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
fails it will likely yell about the wrong address.
- Single step exits to userspace for are effectively dropped as
KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
- Behavior of PIO emulation is different depending on whether it
goes down the fast path or the slow path.
Rather than skip the PIO instruction before exiting to userspace,
snapshot the linear %rip and cancel PIO completion if the current
value does not match the snapshot. For a 64-bit vCPU, i.e. the most
common scenario, the snapshot and comparison has negligible overhead
as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
VMREAD in this case.
All other alternatives to snapshotting the linear %rip that don't
rely on an explicit reset announcenment suffer from one corner case
or another. For example, canceling PIO completion on any write to
%rip fails if userspace does a save/restore of %rip, and attempting to
avoid that issue by canceling PIO only if %rip changed then fails if PIO
collides with the reset %rip. Attempting to zero in on the exact reset
vector won't work for APs, which means adding more hooks such as the
vCPU's MP_STATE, and so on and so forth.
Checking for a linear %rip match technically suffers from corner cases,
e.g. userspace could theoretically rewrite the underlying code page and
expect a different instruction to execute, or the guest hardcodes a PIO
reset at 0xfffffff0, but those are far, far outside of what can be
considered normal operation.
Fixes: 432baf60eee3 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O") Cc: <stable@vger.kernel.org> Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Vitaly Kuznetsov [Wed, 13 Mar 2019 17:13:42 +0000 (18:13 +0100)]
x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init
When userspace initializes guest vCPUs it may want to zero all supported
MSRs including Hyper-V related ones including HV_X64_MSR_STIMERn_CONFIG/
HV_X64_MSR_STIMERn_COUNT. With commit f3b138c5d89a ("kvm/x86: Update SynIC
timers on guest entry only") we began doing stimer_mark_pending()
unconditionally on every config change.
The issue I'm observing manifests itself as following:
- Qemu writes 0 to STIMERn_{CONFIG,COUNT} MSRs and marks all stimers as
pending in stimer_pending_bitmap, arms KVM_REQ_HV_STIMER;
- kvm_hv_has_stimer_pending() starts returning true;
- kvm_vcpu_has_events() starts returning true;
- kvm_arch_vcpu_runnable() starts returning true;
- when kvm_arch_vcpu_ioctl_run() gets into
(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) case:
- kvm_vcpu_block() gets in 'kvm_vcpu_check_block(vcpu) < 0' and returns
immediately, avoiding normal wait path;
- -EAGAIN is returned from kvm_arch_vcpu_ioctl_run() immediately forcing
userspace to retry.
So instead of normal wait path we get a busy loop on all secondary vCPUs
before they get INIT signal. This seems to be undesirable, especially given
that this happens even when Hyper-V extensions are not used.
Generally, it seems to be pointless to mark an stimer as pending in
stimer_pending_bitmap and arm KVM_REQ_HV_STIMER as the only thing
kvm_hv_process_stimers() will do is clear the corresponding bit. We may
just not mark disabled timers as pending instead.
Fixes: f3b138c5d89a ("kvm/x86: Update SynIC timers on guest entry only") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiaoyao Li [Fri, 8 Mar 2019 07:57:20 +0000 (15:57 +0800)]
kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrs
Since MSR_IA32_ARCH_CAPABILITIES is emualted unconditionally even if
host doesn't suppot it. We should move it to array emulated_msrs from
arry msrs_to_save, to report to userspace that guest support this msr.
Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts
The CPUID flag ARCH_CAPABILITIES is unconditioinally exposed to host
userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
regardless of hardware support under the pretense that KVM fully
emulates MSR_IA32_ARCH_CAPABILITIES. Unfortunately, only VMX hosts
handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).
Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
that it's emulated on AMD hosts.
Fixes: 1eaafe91a0df4 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported") Cc: stable@vger.kernel.org Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The function irqfd_wakeup() has flags defined as __poll_t and then it
has additional flags which is used for irqflags.
Redefine the inner flags variable as iflags so it does not shadow the
outer flags.
Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: kvm@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ben Gardon [Tue, 12 Mar 2019 18:45:59 +0000 (11:45 -0700)]
kvm: mmu: Used range based flushing in slot_handle_level_range
Replace kvm_flush_remote_tlbs with kvm_flush_remote_tlbs_with_address
in slot_handle_level_range. When range based flushes are not enabled
kvm_flush_remote_tlbs_with_address falls back to kvm_flush_remote_tlbs.
This changes the behavior of many functions that indirectly use
slot_handle_level_range, iff the range based flushes are enabled. The
only potential problem I see with this is that kvm->tlbs_dirty will be
cleared less often, however the only caller of slot_handle_level_range that
checks tlbs_dirty is kvm_mmu_notifier_invalidate_range_start which
checks it and does a kvm_flush_remote_tlbs after calling
kvm_unmap_hva_range anyway.
Tested: Ran all kvm-unit-tests on a Intel Haswell machine with and
without this patch. The patch introduced no new failures.
Signed-off-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[3] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported
csky, nds32, riscv
This does not match to the actual KVM support. At least, [2] is
half-baked.
Nor do arch maintainers look like they care about this. For example,
commit 0add53713b1c ("microblaze: Add missing kvm_para.h to Kbuild")
exported <asm/kvm_para.h> to user-space in order to fix an in-kernel
build error.
We have two ways to make this consistent:
[A] export both <linux/kvm_para.h> and <asm/kvm_para.h> for all
architectures, irrespective of the KVM support
[B] Match the header export of <linux/kvm_para.h> and <asm/kvm_para.h>
to the KVM support
My first attempt was [A] because the code looks cleaner, but Paolo
suggested [B].
So, this commit goes with [B].
For most architectures, <asm/kvm_para.h> was moved to the kernel-space.
I changed include/uapi/linux/Kbuild so that it checks generated
asm/kvm_para.h as well as check-in ones.
After this commit, there will be two groups:
[1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported
arm, arm64, mips, powerpc, s390, x86
[2] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported
Wei Yang [Thu, 27 Sep 2018 00:31:26 +0000 (08:31 +0800)]
KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region()
* nr_mmu_pages would be non-zero only if kvm->arch.n_requested_mmu_pages is
non-zero.
* nr_mmu_pages is always non-zero, since kvm_mmu_calculate_mmu_pages()
never return zero.
Based on these two reasons, we can merge the two *if* clause and use the
return value from kvm_mmu_calculate_mmu_pages() directly. This simplify
the code and also eliminate the possibility for reader to believe
nr_mmu_pages would be zero.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Singh, Brijesh [Fri, 15 Feb 2019 17:24:12 +0000 (17:24 +0000)]
KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)
Errata#1096:
On a nested data page fault when CR.SMAP=1 and the guest data read
generates a SMAP violation, GuestInstrBytes field of the VMCB on a
VMEXIT will incorrectly return 0h instead the correct guest
instruction bytes .
Recommend Workaround:
To determine what instruction the guest was executing the hypervisor
will have to decode the instruction at the instruction pointer.
The recommended workaround can not be implemented for the SEV
guest because guest memory is encrypted with the guest specific key,
and instruction decoder will not be able to decode the instruction
bytes. If we hit this errata in the SEV guest then log the message
and request a guest shutdown.
Reported-by: Venkatesh Srinivas <venkateshs@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: Reject device ioctls from processes other than the VM's creator
KVM's API requires thats ioctls must be issued from the same process
that created the VM. In other words, userspace can play games with a
VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the
creator can do anything useful. Explicitly reject device ioctls that
are issued by a process other than the VM's creator, and update KVM's
API documentation to extend its requirements to device ioctls.
Fixes: 852b6d57dc7f ("kvm: add device control API") Cc: <stable@vger.kernel.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: doc: Fix incorrect word ordering regarding supported use of APIs
Per Paolo[1], instantiating multiple VMs in a single process is legal;
but this conflicts with KVM's API documentation, which states:
The only supported use is one virtual machine per process, and one
vcpu per thread.
However, an earlier section in the documentation states:
Only run VM ioctls from the same process (address space) that was used
to create the VM.
and:
Only run vcpu ioctls from the same thread that was used to create the
vcpu.
This suggests that the conflicting documentation is simply an incorrect
ordering of of words, i.e. what's really meant is that a virtual machine
can't be shared across multiple processes and a vCPU can't be shared
across multiple threads.
Tweak the blurb on issuing ioctls to use a more assertive tone, and
rewrite the "supported use" sentence to reference said blurb instead of
poorly restating it in different terms.
KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'
The cr4_pae flag is a bit of a misnomer, its purpose is really to track
whether the guest PTE that is being shadowed is a 4-byte entry or an
8-byte entry. Prior to supporting nested EPT, the size of the gpte was
reflected purely by CR4.PAE. KVM fudged things a bit for direct sptes,
but it was mostly harmless since the size of the gpte never mattered.
Now that a spte may be tracking an indirect EPT entry, relying on
CR4.PAE is wrong and ill-named.
For direct shadow pages, force the gpte_size to '1' as they are always
8-byte entries; EPT entries can only be 8-bytes and KVM always uses
8-byte entries for NPT and its identity map (when running with EPT but
not unrestricted guest).
Likewise, nested EPT entries are always 8-bytes. Nested EPT presents a
unique scenario as the size of the entries are not dictated by CR4.PAE,
but neither is the shadow page a direct map. To handle this scenario,
set cr0_wp=1 and smap_andnot_wp=1, an otherwise impossible combination,
to denote a nested EPT shadow page. Use the information to avoid
incorrectly zapping an unsync'd indirect page in __kvm_sync_page().
Providing a consistent and accurate gpte_size fixes a bug reported by
Vitaly where fast_cr3_switch() always fails when switching from L2 to
L1 as kvm_mmu_get_page() would force role.cr4_pae=0 for direct pages,
whereas kvm_calc_mmu_role_common() would set it according to CR4.PAE.
Fixes: 7dcd575520082 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed") Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: nVMX: Do not inherit quadrant and invalid for the root shadow EPT
Explicitly zero out quadrant and invalid instead of inheriting them from
the root_mmu. Functionally, this patch is a nop as we (should) never
set quadrant for a direct mapped (EPT) root_mmu and nested EPT is only
allowed if EPT is used for L1, and the root_mmu will never be invalid at
this point.
Explicitly setting flags sets the stage for repurposing the legacy
paging bits in role, e.g. nxe, cr0_wp, and sm{a,e}p_andnot_wp, at which
point 'smm' would be the only flag to be inherited from root_mmu.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrey Smirnov [Tue, 26 Mar 2019 06:32:09 +0000 (23:32 -0700)]
gpio: of: Check for "spi-cs-high" in child instead of parent node
"spi-cs-high" is going to be specified in child node of an SPI
controller's representing attached SPI device, so change the code to
look for it there, instead of checking parent node.
Andrey Smirnov [Tue, 26 Mar 2019 06:32:08 +0000 (23:32 -0700)]
gpio: of: Check propname before applying "cs-gpios" quirks
SPI GPIO device has more than just "cs-gpio" property in its node and
would request those GPIOs as a part of its initialization. To avoid
applying CS-specific quirk to all of them add a check to make sure
that propname is "cs-gpios".
Jann Horn [Thu, 28 Mar 2019 15:49:48 +0000 (16:49 +0100)]
x86/cpufeature: Fix __percpu annotation in this_cpu_has()
&cpu_info.x86_capability is __percpu, and the second argument of
x86_this_cpu_test_bit() is expected to be __percpu. Don't cast the
__percpu away and then implicitly add it again. This gets rid of 106
lines of sparse warnings with the kernel config I'm using.
David Howells [Wed, 27 Mar 2019 22:48:02 +0000 (22:48 +0000)]
afs: Fix StoreData op marshalling
The marshalling of AFS.StoreData, AFS.StoreData64 and YFS.StoreData64 calls
generated by ->setattr() ops for the purpose of expanding a file is
incorrect due to older documentation incorrectly describing the way the RPC
'FileLength' parameter is meant to work.
The older documentation says that this is the length the file is meant to
end up at the end of the operation; however, it was never implemented this
way in any of the servers, but rather the file is truncated down to this
before the write operation is effected, and never expanded to it (and,
indeed, it was renamed to 'TruncPos' in 2014).
Fix this by setting the position parameter to the new file length and doing
a zero-lengh write there.
The bug causes Xwayland to SIGBUS due to unexpected non-expansion of a file
it then mmaps. This can be tested by giving the following test program a
filename in an AFS directory:
Linus Torvalds [Thu, 28 Mar 2019 15:35:32 +0000 (08:35 -0700)]
Merge tag 's390-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Martin Schwidefsky:
"Improvements and bug fixes for 5.1-rc2:
- Fix early free of the channel program in vfio
- On AP device removal make sure that all messages are flushed with
the driver still attached that queued the message
- Limit brk randomization to 32MB to reduce the chance that the heap
of ld.so is placed after the main stack
- Add a rolling average for the steal time of a CPU, this will be
needed for KVM to decide when to do busy waiting
- Fix a warning in the CPU-MF code
- Add a notification handler for AP configuration change to react
faster to new AP devices"
* tag 's390-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/cpumf: Fix warning from check_processor_id
zcrypt: handle AP Info notification from CHSC SEI command
vfio: ccw: only free cp on final interrupt
s390/vtime: steal time exponential moving average
s390/zcrypt: revisit ap device remove procedure
s390: limit brk randomization to 32MB
Linus Torvalds [Thu, 28 Mar 2019 15:23:45 +0000 (08:23 -0700)]
Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"A couple of minor fixes only for now
- fix for incorrect DMA channels on Renesas R-Car
- Broadcom bcm2835 error handling fixes
- Kconfig dependency fixes for bcm2835 and davinci
- CPU idle wakeup fix for i.MX6
- MMC regression on Tegra186
- fix incorrect phy settings on one imx board"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
arm64: tegra: Disable CQE Support for SDMMC4 on Tegra186
ARM: dts: nomadik: Fix polarity of SPI CS
ARM: davinci: fix build failure with allnoconfig
ARM: imx_v4_v5_defconfig: enable PWM driver
ARM: imx_v6_v7_defconfig: continue compiling the pwm driver
ARM: dts: imx6dl-yapp4: Use correct pseudo PHY address for the switch
ARM: dts: imx6qdl: Fix typo in imx6qdl-icore-rqs.dtsi
ARM: dts: imx6ull: Use the correct style for SPDX License Identifier
ARM: dts: pfla02: increase phy reset duration
ARM: imx6q: cpuidle: fix bug that CPU might not wake up at expected time
ARM: imx51: fix a leaked reference by adding missing of_node_put
ARM: dts: imx6dl-yapp4: Use rgmii-id phy mode on the cpu port
arm64: bcm2835: Add missing dependency on MFD_CORE.
ARM: dts: bcm283x: Fix hdmi hpd gpio pull
soc: bcm: bcm2835-pm: Fix error paths of initialization.
soc: bcm: bcm2835-pm: Fix PM_IMAGE_PERI power domain support.
arm64: dts: renesas: r8a774c0: Fix SCIF5 DMA channels
arm64: dts: renesas: r8a77990: Fix SCIF5 DMA channels
Fredrik Noring [Wed, 27 Mar 2019 18:12:50 +0000 (19:12 +0100)]
kbuild: modversions: Fix relative CRC byte order interpretation
Fix commit 56067812d5b0 ("kbuild: modversions: add infrastructure for
emitting relative CRCs") where CRCs are interpreted in host byte order
rather than proper kernel byte order. The bug is conditional on
CONFIG_MODULE_REL_CRCS.
For example, when loading a BE module into a BE kernel compiled with a LE
system, the error "disagrees about version of symbol module_layout" is
produced. A message such as "Found checksum D7FA6856 vs module 5668FAD7"
will be given with debug enabled, which indicates an obvious endian
problem within __kcrctab within the kernel image.
The general solution is to use the macro TO_NATIVE, as is done in
similar cases throughout modpost.c. With this correction it has been
verified that a BE kernel compiled with a LE system accepts BE modules.
This change has also been verified with a LE kernel compiled with a LE
system, in which case TO_NATIVE returns its value unmodified since the
byte orders match. This is by far the common case.
Fixes: 56067812d5b0 ("kbuild: modversions: add infrastructure for emitting relative CRCs") Signed-off-by: Fredrik Noring <noring@nocrew.org> Cc: stable@vger.kernel.org Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Joe Lawrence [Tue, 26 Mar 2019 14:50:28 +0000 (10:50 -0400)]
kbuild: strip whitespace in cmd_record_mcount findstring
CC_FLAGS_FTRACE may contain trailing whitespace that interferes with
findstring.
For example, commit 6977f95e63b9 ("powerpc: avoid -mno-sched-epilog on
GCC 4.9 and newer") introduced a change such that on my ppc64le box,
CC_FLAGS_FTRACE="-pg -mprofile-kernel ". (Note the trailing space.)
When cmd_record_mcount is now invoked, findstring fails as the ftrace
flags were found at very end of _c_flags, without the trailing space.
_c_flags=" ... -pg -mprofile-kernel"
CC_FLAGS_FTRACE="-pg -mprofile-kernel "
^
findstring is looking for this extra space
Remove the redundant whitespaces from CC_FLAGS_FTRACE in
cmd_record_mcount to avoid this problem.
[masahiro.yamada: This issue only happens in the released versions
of GNU Make. CC_FLAGS_FTRACE will not contain the trailing space if
you use the latest GNU Make, which contains commit b90fabc8d6f3
("* NEWS: Do not insert a space during '+=' if the value is empty.") ]
Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com> (refactoring) Fixes: 6977f95e63b9 ("powerpc: avoid -mno-sched-epilog on GCC 4.9 and newer"). Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Masahiro Yamada [Tue, 26 Mar 2019 04:26:58 +0000 (13:26 +0900)]
kbuild: do not overwrite .gitignore in output directory
Commit 3a51ff344204 ("kbuild: gitignore output directory") seemed to
bother people who version-control output directories.
Andre Przywara says:
"Unfortunately this breaks my setup, because I keep a totally separate
git repository in my build directories to track (various versions of)
.config. So .gitignore there is carefully crafted to ignore most build
artefacts, but not .config, for instance."
Link: https://lkml.org/lkml/2019/3/22/1819 Reported-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Tested-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Masahiro Yamada [Tue, 26 Mar 2019 04:02:19 +0000 (13:02 +0900)]
kbuild: skip parsing pre sub-make code for recursion
When Make recurses to the top Makefile with sub-make-done unset,
the code block surrounded by 'ifneq ($(sub-make-done),1) ... endif'
is parsed multiple times. This happens for in-tree building of
include/config/auto.conf, *-pkg, etc. with GNU Make 4.x.
This is a slight regression by commit 688931a5ad4e ("kbuild: skip
sub-make for in-tree build with GNU Make 4.x") in terms of performance
since that code block contains one $(shell ...) invocation.
Fix it by exporting the variable irrespective of sub-make being run.
I renamed it because GNU Make cannot properly export variables
containing hyphens. This is probably a bug of GNU Make, and the issue
in Kbuild had already been reported by commit 2bfbe7881ee0 ("kbuild:
Do not use hyphen in exported variable name").
GT VEBOX DISABLE is only 4 bits wide but it was using a 8 bits wide
mask, the remaning reserved bits is set to 0 causing 4 more
nonexistent VEBOX engines being detected as enabled, triggering the
BUG_ON() because of mismatch between vebox_mask and newly added
VEBOX_MASK().
BSpec: 20680 Fixes: 26376a7e74d2 ("drm/i915/icl: Check for fused-off VDBOX and VEBOX instances") Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Oscar Mateo <oscar.mateo@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Link: https://patchwork.freedesktop.org/patch/msgid/20190326230223.26336-1-jose.souza@intel.com
(cherry picked from commit 547fcf9b1c608cf5c43c156a8773a94c6a38dc44) Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Ralph Campbell [Tue, 26 Mar 2019 00:18:17 +0000 (17:18 -0700)]
x86/mm: Don't exceed the valid physical address space
valid_phys_addr_range() is used to sanity check the physical address range
of an operation, e.g., access to /dev/mem. It uses __pa(high_memory)
internally.
If memory is populated at the end of the physical address space, then
__pa(high_memory) is outside of the physical address space because:
For the comparison in valid_phys_addr_range() this is not an issue, but if
CONFIG_DEBUG_VIRTUAL is enabled, __pa() maps to __phys_addr(), which
verifies that the resulting physical address is within the valid physical
address space of the CPU. So in the case that memory is populated at the
end of the physical address space, this is not true and triggers a
VIRTUAL_BUG_ON().
Use __pa(high_memory - 1) to prevent the conversion from going beyond
the end of valid physical addresses.
Fixes: be62a3204406 ("x86/mm: Limit mmap() of /dev/mem to valid physical addresses") Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Craig Bergstrom <craigb@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hans Verkuil <hans.verkuil@cisco.com> Cc: Mauro Carvalho Chehab <mchehab@s-opensource.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sander Eikelenboom <linux@eikelenboom.it> Cc: Sean Young <sean@mess.org> Link: https://lkml.kernel.org/r/20190326001817.15413-2-rcampbell@nvidia.com
Daniel Borkmann [Mon, 25 Mar 2019 13:56:20 +0000 (14:56 +0100)]
x86/retpolines: Disable switch jump tables when retpolines are enabled
Commit ce02ef06fcf7 ("x86, retpolines: Raise limit for generating indirect
calls from switch-case") raised the limit under retpolines to 20 switch
cases where gcc would only then start to emit jump tables, and therefore
effectively disabling the emission of slow indirect calls in this area.
After this has been brought to attention to gcc folks [0], Martin Liska
has then fixed gcc to align with clang by avoiding to generate switch jump
tables entirely under retpolines. This is taking effect in gcc starting
from stable version 8.4.0. Given kernel supports compilation with older
versions of gcc where the fix is not being available or backported anymore,
we need to keep the extra KBUILD_CFLAGS around for some time and generally
set the -fno-jump-tables to align with what more recent gcc is doing
automatically today.
More than 20 switch cases are not expected to be fast-path critical, but
it would still be good to align with gcc behavior for versions < 8.4.0 in
order to have consistency across supported gcc versions. vmlinux size is
slightly growing by 0.27% for older gcc. This flag is only set to work
around affected gcc, no change for clang.
Thomas Gleixner [Tue, 26 Mar 2019 16:36:06 +0000 (17:36 +0100)]
x86/smp: Enforce CONFIG_HOTPLUG_CPU when SMP=y
The SMT disable 'nosmt' command line argument is not working properly when
CONFIG_HOTPLUG_CPU is disabled. The teardown of the sibling CPUs which are
required to be brought up due to the MCE issues, cannot work. The CPUs are
then kept in a half dead state.
As the 'nosmt' functionality has become popular due to the speculative
hardware vulnerabilities, the half torn down state is not a proper solution
to the problem.
Enforce CONFIG_HOTPLUG_CPU=y when SMP is enabled so the full operation is
possible.
Reported-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Konrad Wilk <konrad.wilk@oracle.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Mukesh Ojha <mojha@codeaurora.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Rik van Riel <riel@surriel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Micheal Kelley <michael.h.kelley@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190326163811.598166056@linutronix.de
Thomas Gleixner [Tue, 26 Mar 2019 16:36:05 +0000 (17:36 +0100)]
cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n
Tianyu reported a crash in a CPU hotplug teardown callback when booting a
kernel which has CONFIG_HOTPLUG_CPU disabled with the 'nosmt' boot
parameter.
It turns out that the SMP=y CONFIG_HOTPLUG_CPU=n case has been broken
forever in case that a bringup callback fails. Unfortunately this issue was
not recognized when the CPU hotplug code was reworked, so the shortcoming
just stayed in place.
When a bringup callback fails, the CPU hotplug code rolls back the
operation and takes the CPU offline.
The 'nosmt' command line argument uses a bringup failure to abort the
bringup of SMT sibling CPUs. This partial bringup is required due to the
MCE misdesign on Intel CPUs.
With CONFIG_HOTPLUG_CPU=y the rollback works perfectly fine, but
CONFIG_HOTPLUG_CPU=n lacks essential mechanisms to exercise the low level
teardown of a CPU including the synchronizations in various facilities like
RCU, NOHZ and others.
As a consequence the teardown callbacks which must be executed on the
outgoing CPU within stop machine with interrupts disabled are executed on
the control CPU in interrupt enabled and preemptible context causing the
kernel to crash and burn. The pre state machine code has a different
failure mode which is more subtle and resulting in a less obvious use after
free crash because the control side frees resources which are still in use
by the undead CPU.
But this is not a x86 only problem. Any architecture which supports the
SMP=y HOTPLUG_CPU=n combination suffers from the same issue. It's just less
likely to be triggered because in 99.99999% of the cases all bringup
callbacks succeed.
The easy solution of making HOTPLUG_CPU mandatory for SMP is not working on
all architectures as the following architectures have either no hotplug
support at all or not all subarchitectures support it:
Crashing the kernel in such a situation is not an acceptable state
either.
Implement a minimal rollback variant by limiting the teardown to the point
where all regular teardown callbacks have been invoked and leave the CPU in
the 'dead' idle state. This has the following consequences:
- the CPU is brought down to the point where the stop_machine takedown
would happen.
- the CPU stays there forever and is idle
- The CPU is cleared in the CPU active mask, but not in the CPU online
mask which is a legit state.
- Interrupts are not forced away from the CPU
- All facilities which only look at online mask would still see it, but
that is the case during normal hotplug/unplug operations as well. It's
just a (way) longer time frame.
This will expose issues, which haven't been exposed before or only seldom,
because now the normally transient state of being non active but online is
a permanent state. In testing this exposed already an issue vs. work queues
where the vmstat code schedules work on the almost dead CPU which ends up
in an unbound workqueue and triggers 'preemtible context' warnings. This is
not a problem of this change, it merily exposes an already existing issue.
Still this is better than crashing fully without a chance to debug it.
This is mainly thought as workaround for those architectures which do not
support HOTPLUG_CPU. All others should enforce HOTPLUG_CPU for SMP.
Fixes: 2e1a3483ce74 ("cpu/hotplug: Split out the state walk into functions") Reported-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Konrad Wilk <konrad.wilk@oracle.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Mukesh Ojha <mojha@codeaurora.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Rik van Riel <riel@surriel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Micheal Kelley <michael.h.kelley@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190326163811.503390616@linutronix.de
Thomas Gleixner [Tue, 26 Mar 2019 21:51:02 +0000 (22:51 +0100)]
watchdog: Respect watchdog cpumask on CPU hotplug
The rework of the watchdog core to use cpu_stop_work broke the watchdog
cpumask on CPU hotplug.
The watchdog_enable/disable() functions are now called unconditionally from
the hotplug callback, i.e. even on CPUs which are not in the watchdog
cpumask. As a consequence the watchdog can become unstoppable.
Only invoke them when the plugged CPU is in the watchdog cpumask.
Fixes: 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work") Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1903262245490.1789@nanos.tec.linutronix.de
Thomas Richter [Mon, 18 Mar 2019 14:50:27 +0000 (15:50 +0100)]
s390/cpumf: Fix warning from check_processor_id
Function __hw_perf_event_init() used a CPU variable without
ensuring CPU preemption has been disabled. This caused the
following warning in the kernel log:
The variable is now accessed in proper context. Use
get_cpu_var()/put_cpu_var() pair to disable
preemption during access.
As the hardware authorization settings apply to all CPUs, it
does not matter which CPU is used to check the authorization setting.
Remove the event->count assignment. It is not needed as function
perf_event_alloc() allocates memory for the event with kzalloc() and
thus count is already set to zero.
Fixes: fe5908bccc56 ("s390/cpum_cf_diag: Add support for s390 counter facility diagnostic trace") Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Lorenz Messtechnik has a device that is controlled by the cp210x driver,
so add the device id to the driver. The device id was provided by
Silicon-Labs for the devices from this vendor.