Heiner Kallweit [Mon, 19 Nov 2018 21:40:04 +0000 (22:40 +0100)]
r8169: simplify ocp functions
rtl8168_oob_notify is used in rtl8168dp_driver_start and
rtl8168dp_driver_stop only, so we can rename it to r8168dp_oob_notify.
The same applies to condition rtl_ocp_read_cond which can be renamed
to rtl_dp_ocp_read_cond. This allows to simplify the code.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:37:34 +0000 (22:37 +0100)]
r8169: remove "not PCI Express" message
The ones who want to know can easily identify whether chip is PCI or
PCIe based on the chip name. I doubt there's any benefit in this
message, so remove it.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:36:15 +0000 (22:36 +0100)]
r8169: remove print_mac_version
The syslog message printed on driver load allows to easily identify
the mac version number (based on chip name and XID). So we don't
need this extra debug message which is wrong anyway because e.g.
RTL_GIGA_MAC_VER_01 has value 0.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:34:17 +0000 (22:34 +0100)]
r8169: replace event_slow with irq_mask
Recently the "slow event" handler was removed, therefore the member
name isn't appropriate any longer. In addition store the full mask,
including the RTL_EVENT_NAPI interrupt source bits.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 19 Nov 2018 21:33:00 +0000 (22:33 +0100)]
r8169: remove unused interrupt sources
Setting PCSTimeout interrupt source was copied from the vendor driver
which uses the chip programmable timer interrupt. The mainline driver
doesn't use this timer interrupt.
SYSErr indicates a PCI error and isn't defined on the PCIe models.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
sctp: add subscribe per asoc and sockopt SCTP_EVENT
This patchset mainly adds the Event Subscription sockopt described in
rfc6525#section-6.2:
"Subscribing to events as described in [RFC6458] uses a setsockopt()
call with the SCTP_EVENT socket option. This option takes the
following structure, which specifies the association, the event type
(using the same value found in the event type field), and an on/off
boolean.
The user fills in the se_type field with the same value found in the
strreset_type field, i.e., SCTP_STREAM_RESET_EVENT. The user will
also fill in the se_assoc_id field with either the association to set
this event on (this field is ignored for one-to-one style sockets) or
one of the reserved constant values defined in [RFC6458]. Finally,
the se_on field is set with a 1 to enable the event or a 0 to disable
the event."
As for the old SCTP_EVENTS Option with struct sctp_event_subscribe,
it's being DEPRECATED.
v1->v2:
- fix some key word in changelog that triggerred the filters at
vger.kernel.org.
v2->v3:
- fix an array out of bounds noticed by Neil in patch 1/4.
====================
Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sun, 18 Nov 2018 08:08:51 +0000 (16:08 +0800)]
sctp: define subscribe in sctp_sock as __u16
The member subscribe in sctp_sock is used to indicate to which of
the events it is subscribed, more like a group of flags. So it's
better to be defined as __u16 (2 bytpes), instead of struct
sctp_event_subscribe (13 bytes).
Note that sctp_event_subscribe is an UAPI struct, used on sockopt
calls, and thus it will not be removed. This patch only changes
the internal storage of the flags.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
1) Fix some potentially uninitialized variables and use-after-free in
kvaser_usb can drier, from Jimmy Assarsson.
2) Fix leaks in qed driver, from Denis Bolotin.
3) Socket leak in l2tp, from Xin Long.
4) RSS context allocation fix in bnxt_en from Michael Chan.
5) Fix cxgb4 build errors, from Ganesh Goudar.
6) Route leaks in ipv6 when removing exceptions, from Xin Long.
7) Memory leak in IDR allocation handling of act_pedit, from Davide
Caratti.
8) Use-after-free of bridge vlan stats, from Nikolay Aleksandrov.
9) When MTU is locked, do not force DF bit on ipv4 tunnels. From
Sabrina Dubroca.
10) When NAPI cached skb is reused, we must set it to the proper initial
state which includes skb->pkt_type. From Eric Dumazet.
11) Lockdep and non-linear SKB handling fix in tipc from Jon Maloy.
12) Set RX queue properly in various tuntap receive paths, from Matthew
Cover.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
tuntap: fix multiqueue rx
ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF
tipc: don't assume linear buffer when reading ancillary data
tipc: fix lockdep warning when reinitilaizing sockets
net-gro: reset skb->pkt_type in napi_reuse_skb()
tc-testing: tdc.py: Guard against lack of returncode in executed command
tc-testing: tdc.py: ignore errors when decoding stdout/stderr
ip_tunnel: don't force DF when MTU is locked
MAINTAINERS: Add entry for CAKE qdisc
net: bridge: fix vlan stats use-after-free on destruction
socket: do a generic_file_splice_read when proto_ops has no splice_read
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
Revert "net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs"
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
net/sched: act_pedit: fix memory leak when IDR allocation fails
net: lantiq: Fix returned value in case of error in 'xrx200_probe()'
ipv6: fix a dst leak when removing its exception
net: mvneta: Don't advertise 2.5G modes
drivers/net/ethernet/qlogic/qed/qed_rdma.h: fix typo
net/mlx4: Fix UBSAN warning of signed integer overflow
...
Matthew Cover [Sun, 18 Nov 2018 07:46:00 +0000 (00:46 -0700)]
tuntap: fix multiqueue rx
When writing packets to a descriptor associated with a combined queue, the
packets should end up on that queue.
Before this change all packets written to any descriptor associated with a
tap interface end up on rx-0, even when the descriptor is associated with a
different queue.
The rx traffic can be generated by either of the following.
1. a simple tap program which spins up multiple queues and writes packets
to each of the file descriptors
2. tx from a qemu vm with a tap multiqueue netdev
The queue for rx traffic can be observed by either of the following (done
on the hypervisor in the qemu case).
1. a simple netmap program which opens and reads from per-queue
descriptors
2. configuring RPS and doing per-cpu captures with rxtxcpu
Alternatively, if you printk() the return value of skb_get_rx_queue() just
before each instance of netif_receive_skb() in tun.c, you will get 65535
for every skb.
Calling skb_record_rx_queue() to set the rx queue to the queue_index fixes
the association between descriptor and rx queue.
Signed-off-by: Matthew Cover <matthew.cover@stackpath.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Sun, 18 Nov 2018 18:45:30 +0000 (10:45 -0800)]
ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF
Preethi reported that PMTU discovery for UDP/raw applications is not
working in the presence of VRF when the socket is not bound to a device.
The problem is that ip6_sk_update_pmtu does not consider the L3 domain
of the skb device if the socket is not bound. Update the function to
set oif to the L3 master device if relevant.
Fixes: ca254490c8df ("net: Add VRF support to IPv6 stack") Reported-by: Preethi Ramachandra <preethir@juniper.net> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 18 Nov 2018 15:37:33 +0000 (07:37 -0800)]
tun: use netdev_alloc_frag() in tun_napi_alloc_frags()
In order to cook skbs in the same way than Ethernet drivers,
it is probably better to not use GFP_KERNEL, but rather
use the GFP_ATOMIC and PFMEMALLOC mechanisms provided by
netdev_alloc_frag().
This would allow to use tun driver even in memory stress
situations, especially if swap is used over this tun channel.
Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Petar Penkov <peterpenkov96@gmail.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
IP101GR: devicetree based configuration of SEL_INTR32
The IP101GR is a 32-pin QFN package variant of the IP101G/IP101GA
Ethernet PHY. Due to it's limited amount of pins the RXER (receive
error) and INTR32 (interrupt) functions share pin 21.
The goal of this series is:
- some small cleanups in patches 3, 4 and 5
- allowing the kernel to detect IRQ floods on boards where the IP101GR
is configured in RXER mode but the RXER line is configured on the
host SoC as interrupt line (patch 6)
- configuration of the SEL_INTR32 register so we can use the interrupt
function on boards where the RXER/INTR32 pin (pin 21) is routed to
one of the host SoC's interrupt inputs (patches 1, 2, 7)
A use-case where this is needed is the Endless Mini (EC-100). I have
tested my changes on that board. This also confirms that Heiner
Kallweit's recent icplus.c PHY driver changes are working (at least on
my setup).
This series is based on net-next commit 7c460cf9cd1a ("net: aquantia:
fix spelling mistake "specfield" -> "specified"")
Changes since v1 at [0]:
- collected Andrew's Reviewed-by's (thank you!)
- updated description of patch #2 to explain why two properties were
added instead of adding an "this is a IP101GR" property
- validate that there's no conflicting configuration in patch #7
- rebased on top of latest net-next
net: phy: icplus: allow configuring the interrupt function on IP101GR
The IP101GR is a 32-pin QFN package variant of the IP101G/IP101GA
Ethernet PHY. Due to it's limited amount of pins the RXER (receive
error) and INTR32 (interrupt) functions share pin 21.
By default the PHY is configured to output the "receive error" status on
pin 21. Depending on the board layout and requirements we may want to
re-configure the PHY to output the interrupt signal there.
The mode of pin 21 can be configured in the "Digital I/O Specific
Control Register" (register 29), bit 2:
- 0 = RXER function
- 1 = INTR(32) function
Depending on the devicetree configuration we will now:
- change the mode to either ther RXER or INTR32 function
- keep the SEL_INTR32 value set by the bootloader (default) if no
configuration is provided (to ensure that we're not breaking existing
boards)
- error out if conflicting configuration is given (RXER and INTR32 mode
are enabled at the same time)
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: phy: icplus: implement .did_interrupt for IP101A/G
The IP101A_G_IRQ_CONF_STATUS register has bits to detect which
interrupts have fired. Implement the .did_interrupt callback to let the
PHY core know whether the interrupt was for this specific PHY.
This is useful for debugging interrupt problems with 32-pin IP101GR PHYs
where the interrupt line is shared with the RX_ERR (receive error
status) signal. The default values are:
- RX_ERR is enabled by default (LOW means that there is no receive
error)
- the PHY's interrupt line is configured "active low" by default
Without any additional changes there is a flood of interrupts if the
RX_ERR/INTR32 signal is configured in RX_ERR mode (which is the
default). Having a did_interrupt ensures that the PHY core returns
IRQ_NONE instead of endlessly triggering the PHY state machine.
Additionally the kernel will report this after a while:
irq 28: nobody cared (try booting with the "irqpoll" option)
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
net: phy: icplus: rename IP101A_G_NO_IRQ to IP101A_G_IRQ_ALL_MASK
The datasheet uses the name "All Mask" for this bit. Change the name of
our #define to be consistent with the datasheet. While here also replace
the tab between the #define and IP101A_G_IRQ_ALL_MASK with a space.
No functional changes.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
net: phy: icplus: use the BIT macro where possible
This makes the code consistent by using the BIT() macro instead of
manual bit-shifting for some of the fields. No functional changes.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
net: phy: icplus: keep all ip101a_g functions together
This simply moves ip101a_g_config_init right above
ip101a_g_config_intr so all functions for the ICPlus IP101A/G PHYs are
grouped together.
No functional changes.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
dt-bindings: net: phy: add bindings for the IC Plus Corp. IP101A/G PHYs
The IP101A and IP101G series both have various models. Depending on the
board implementation we need a special property for the IP101GR (32-pin
LQFP package) PHY:
pin 21 ("RXER/INTR_32") outputs the "receive error" signal by default
(LOW means "normal operation", HIGH means that there's either a decoding
error of the received signal or that the PHY is receiving LPI). This pin
can also be switched to INTR32 mode, where the interrupt signal is
routed to this pin. The other PHYs don't need this special handling
because they have more pins available so the interrupt function gets a
dedicated pin.
This adds two properties to either select the "receive error" or
"interrupt" function of pin 21. Not specifying any function means that
the default set by the bootloader is used. This is required because the
IP101GR cannot be differentiated between other IP101 PHYs as the PHY
identification registers on all of these is 0x02430c54.
The IP101G (sold as die only, without package) may suffer from the same
issue depending on how it's integrated into a multi chip package by
another manufacturer. If only the RXER/INTR_32 pin is routed then the
users of the die-only variant may also have to explicitly configure the
mode of hte RXER/INTR_32 pin. This is the reason why no "is-ip101gr"
property was added. I have no evidence though which would confirm this
theory - so the binding itself is independent of that.
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 18 Nov 2018 20:21:09 +0000 (12:21 -0800)]
Merge tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"A small batch of fixes for v4.20-rc3.
The overflow continuation fix addresses something that has been broken
for several releases. Arguably it could wait even longer, but it's a
one line fix and this finishes the last of the known address range
scrub bug reports. The revert addresses a lockdep regression. The unit
tests are not critical to fix, but no reason to hold this fix back.
Summary:
- Address Range Scrub overflow continuation handling has been broken
since it was initially merged. It was only recently that error
injection and platform-BIOS support enabled this corner case to be
exercised.
- The recent attempt to provide more isolation for the kernel Address
Range Scrub state machine from userapace initiated sessions
triggers a lockdep report. Revert and try again at the next merge
window.
- Fix a kasan reported buffer overflow in libnvdimm unit test
infrastrucutre (nfit_test)"
* tag 'libnvdimm-fixes-4.20-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
Revert "acpi, nfit: Further restrict userspace ARS start requests"
acpi, nfit: Fix ARS overflow continuation
tools/testing/nvdimm: Fix the array size for dimm devices.
Linus Torvalds [Sun, 18 Nov 2018 19:31:26 +0000 (11:31 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"16 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/memblock.c: fix a typo in __next_mem_pfn_range() comments
mm, page_alloc: check for max order in hot path
scripts/spdxcheck.py: make python3 compliant
tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
mm/vmstat.c: fix NUMA statistics updates
mm/gup.c: fix follow_page_mask() kerneldoc comment
ocfs2: free up write context when direct IO failed
scripts/faddr2line: fix location of start_kernel in comment
mm: don't reclaim inodes with many attached pages
mm, memory_hotplug: check zone_movable in has_unmovable_pages
mm/swapfile.c: use kvzalloc for swap_info_struct allocation
MAINTAINERS: update OMAP MMC entry
hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
kernel/sched/psi.c: simplify cgroup_move_task()
z3fold: fix possible reclaim races
Linus Torvalds [Sun, 18 Nov 2018 18:58:20 +0000 (10:58 -0800)]
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Fix an exec() related scalability/performance regression, which was
caused by incorrectly calculating load and migrating tasks on exec()
when they shouldn't be"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix cpu_util_wake() for 'execl' type workloads
Linus Torvalds [Sun, 18 Nov 2018 18:54:59 +0000 (10:54 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Fix uncore PMU enumeration for CofeeLake CPUs"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Support CoffeeLake 8th CBOX
perf/x86/intel/uncore: Add more IMC PCI IDs for KabyLake and CoffeeLake CPUs
Linus Torvalds [Sun, 18 Nov 2018 18:52:26 +0000 (10:52 -0800)]
Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
"Misc fixes: two warning splat fixes, a leak fix and persistent memory
allocation fixes for ARM"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Permit calling efi_mem_reserve_persistent() from atomic context
efi/arm: Defer persistent reservations until after paging_init()
efi/arm/libstub: Pack FDT after populating it
efi/arm: Revert deferred unmap of early memmap mapping
efi: Fix debugobjects warning on 'efi_rts_work'
Linus Torvalds [Sun, 18 Nov 2018 18:45:09 +0000 (10:45 -0800)]
Merge branch 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM spectre updates from Russell King:
"These are the currently known final bits that resolve the Spectre
issues. big.Little systems used to be sufficiently identical in that
there were no differences between individual CPUs in the system that
mattered to the kernel. With the advent of the Spectre problem, the
CPUs now have differences in how the workaround is applied.
As a result of previous Spectre patches, these systems ended up
reporting quite a lot of:
"CPUx: Spectre v2: incorrect context switching function, system vulnerable"
messages due to the action of the big.Little switcher causing the CPUs
to be re-initialised regularly. This series resolves that issue by
making the CPU vtable unique to each CPU.
However, since this is used very early, before per-cpu is setup,
per-cpu can't be used. We also have a problem that two of the methods
are not called from preempt-safe paths, but thankfully these remain
identical between all CPUs in the system. To make sure, we validate
that these are identical during boot"
* 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: spectre-v2: per-CPU vtables to work around big.Little systems
ARM: add PROC_VTABLE and PROC_TABLE macros
ARM: clean up per-processor check_bugs method call
ARM: split out processor lookup
ARM: make lookup_processor_type() non-__init
the problem is that we only check for an out of bound order in the slow
path and the node reclaim might happen from the fast path already. This
is fixable by making sure that kvmalloc doesn't ever use kmalloc for
requests that are larger than KMALLOC_MAX_SIZE but this also shows that
the code is rather fragile. A recent UBSAN report just underlines that
by the following report
Note that this is not a kvmalloc path. It is just that the fast path
really depends on having sanitzed order as well. Therefore move the
order check to the fast path.
Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reported-by: Kyungtae Kim <kt0755@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Byoungyoung Lee <lifeasageek@gmail.com> Cc: "Dae R. Jeong" <threeearcat@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Uwe Kleine-König [Fri, 16 Nov 2018 23:08:43 +0000 (15:08 -0800)]
scripts/spdxcheck.py: make python3 compliant
Without this change the following happens when using Python3 (3.6.6):
$ echo "GPL-2.0" | python3 scripts/spdxcheck.py -
FAIL: 'str' object has no attribute 'decode'
Traceback (most recent call last):
File "scripts/spdxcheck.py", line 253, in <module>
parser.parse_lines(sys.stdin, args.maxlines, '-')
File "scripts/spdxcheck.py", line 171, in parse_lines
line = line.decode(locale.getpreferredencoding(False), errors='ignore')
AttributeError: 'str' object has no attribute 'decode'
So as the line is already a string, there is no need to decode it and
the line can be dropped.
/usr/bin/python on Arch is Python 3. So this would indeed be worth
going into 4.19.
Link: http://lkml.kernel.org/r/20181023070802.22558-1-u.kleine-koenig@pengutronix.de Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Joe Perches <joe@perches.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yufen Yu [Fri, 16 Nov 2018 23:08:39 +0000 (15:08 -0800)]
tmpfs: make lseek(SEEK_DATA/SEK_HOLE) return ENXIO with a negative offset
Other filesystems such as ext4, f2fs and ubifs all return ENXIO when
lseek (SEEK_DATA or SEEK_HOLE) requests a negative offset.
man 2 lseek says
: EINVAL whence is not valid. Or: the resulting file offset would be
: negative, or beyond the end of a seekable device.
:
: ENXIO whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond
: the end of the file.
Make tmpfs return ENXIO under these circumstances as well. After this,
tmpfs also passes xfstests's generic/448.
[akpm@linux-foundation.org: rewrite changelog] Link: http://lkml.kernel.org/r/1540434176-14349-1-git-send-email-yuyufen@huawei.com Signed-off-by: Yufen Yu <yuyufen@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Fri, 16 Nov 2018 23:08:35 +0000 (15:08 -0800)]
lib/ubsan.c: don't mark __ubsan_handle_builtin_unreachable as noreturn
gcc-8 complains about the prototype for this function:
lib/ubsan.c:432:1: error: ignoring attribute 'noreturn' in declaration of a built-in function '__ubsan_handle_builtin_unreachable' because it conflicts with attribute 'const' [-Werror=attributes]
This is actually a GCC's bug. In GCC internals
__ubsan_handle_builtin_unreachable() declared with both 'noreturn' and
'const' attributes instead of only 'noreturn':
Janne Huttunen [Fri, 16 Nov 2018 23:08:32 +0000 (15:08 -0800)]
mm/vmstat.c: fix NUMA statistics updates
Scan through the whole array to see if an update is needed. While we're
at it, use sizeof() to be safe against any possible type changes in the
future.
The bug here is that we wouldn't sync per-cpu counters into global ones
if there was an update of numa_stats for higher cpus. Highly
theoretical one though because it is much more probable that zone_stats
are updated so we would refresh anyway. So I wouldn't bother to mark
this for stable, yet something nice to fix.
[mhocko@suse.com: changelog enhancement] Link: http://lkml.kernel.org/r/1541601517-17282-1-git-send-email-janne.huttunen@nokia.com Fixes: 1d90ca897cb0 ("mm: update NUMA counter threshold size") Signed-off-by: Janne Huttunen <janne.huttunen@nokia.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit df06b37ffe5a ("mm/gup: cache dev_pagemap while pinning pages")
modified the signature of follow_page_mask() but left the parameter
description behind.
Update the description to make the code and comments agree again.
While at it, update formatting of the return value description to match
Documentation/doc-guide/kernel-doc.rst guidelines.
Wengang Wang [Fri, 16 Nov 2018 23:08:25 +0000 (15:08 -0800)]
ocfs2: free up write context when direct IO failed
The write context should also be freed even when direct IO failed.
Otherwise a memory leak is introduced and entries remain in
oi->ip_unwritten_list causing the following BUG later in unlink path:
Roman Gushchin [Fri, 16 Nov 2018 23:08:18 +0000 (15:08 -0800)]
mm: don't reclaim inodes with many attached pages
Spock reported that commit 172b06c32b94 ("mm: slowly shrink slabs with a
relatively small number of objects") leads to a regression on his setup:
periodically the majority of the pagecache is evicted without an obvious
reason, while before the change the amount of free memory was balancing
around the watermark.
The reason behind is that the mentioned above change created some
minimal background pressure on the inode cache. The problem is that if
an inode is considered to be reclaimed, all belonging pagecache page are
stripped, no matter how many of them are there. So, if a huge
multi-gigabyte file is cached in the memory, and the goal is to reclaim
only few slab objects (unused inodes), we still can eventually evict all
gigabytes of the pagecache at once.
The workload described by Spock has few large non-mapped files in the
pagecache, so it's especially noticeable.
To solve the problem let's postpone the reclaim of inodes, which have
more than 1 attached page. Let's wait until the pagecache pages will be
evicted naturally by scanning the corresponding LRU lists, and only then
reclaim the inode structure.
Link: http://lkml.kernel.org/r/20181023164302.20436-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reported-by: Spock <dairinin@gmail.com> Tested-by: Spock <dairinin@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: <stable@vger.kernel.org> [4.19.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 16 Nov 2018 23:08:15 +0000 (15:08 -0800)]
mm, memory_hotplug: check zone_movable in has_unmovable_pages
Page state checks are racy. Under a heavy memory workload (e.g. stress
-m 200 -t 2h) it is quite easy to hit a race window when the page is
allocated but its state is not fully populated yet. A debugging patch to
dump the struct page state shows
Note that the state has been checked for both PageLRU and PageSwapBacked
already. Closing this race completely would require some sort of retry
logic. This can be tricky and error prone (think of potential endless
or long taking loops).
Workaround this problem for movable zones at least. Such a zone should
only contain movable pages. Commit 15c30bc09085 ("mm, memory_hotplug:
make has_unmovable_pages more robust") has told us that this is not
strictly true though. Bootmem pages should be marked reserved though so
we can move the original check after the PageReserved check. Pages from
other zones are still prone to races but we even do not pretend that
memory hotremove works for those so pre-mature failure doesn't hurt that
much.
Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org Fixes: 15c30bc09085 ("mm, memory_hotplug: make has_unmovable_pages more robust") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Baoquan He <bhe@redhat.com> Tested-by: Baoquan He <bhe@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vasily Averin [Fri, 16 Nov 2018 23:08:11 +0000 (15:08 -0800)]
mm/swapfile.c: use kvzalloc for swap_info_struct allocation
Commit a2468cc9bfdf ("swap: choose swap device according to numa node")
changed 'avail_lists' field of 'struct swap_info_struct' to an array.
In popular linux distros it increased size of swap_info_struct up to 40
Kbytes and now swap_info_struct allocation requires order-4 page.
Switch to kvzmalloc allows to avoid unexpected allocation failures.
Link: http://lkml.kernel.org/r/fc23172d-3c75-21e2-d551-8b1808cbe593@virtuozzo.com Fixes: a2468cc9bfdf ("swap: choose swap device according to numa node") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aaro Koskinen [Fri, 16 Nov 2018 23:08:08 +0000 (15:08 -0800)]
MAINTAINERS: update OMAP MMC entry
Jarkko's e-mail address hasn't worked for a long time. We still want to
keep this driver working as it is critical for some of the OMAP boards.
I use and test this driver frequently, so change myself as a maintainer
with "Odd Fixes" status.
Link: http://lkml.kernel.org/r/20181106222750.12939-1-aaro.koskinen@iki.fi Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> Acked-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Fri, 16 Nov 2018 23:08:04 +0000 (15:08 -0800)]
hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!
This bug has been experienced several times by the Oracle DB team. The
BUG is in remove_inode_hugepages() as follows:
/*
* If page is mapped, it was faulted in after being
* unmapped in caller. Unmap (again) now after taking
* the fault mutex. The mutex will prevent faults
* until we finish removing the page.
*
* This race can only happen in the hole punch case.
* Getting here in a truncate operation is a bug.
*/
if (unlikely(page_mapped(page))) {
BUG_ON(truncate_op);
In this case, the elevated map count is not the result of a race.
Rather it was incorrectly incremented as the result of a bug in the huge
pmd sharing code. Consider the following:
- Process A maps a hugetlbfs file of sufficient size and alignment
(PUD_SIZE) that a pmd page could be shared.
- Process B maps the same hugetlbfs file with the same size and
alignment such that a pmd page is shared.
- Process B then calls mprotect() to change protections for the mapping
with the shared pmd. As a result, the pmd is 'unshared'.
- Process B then calls mprotect() again to chage protections for the
mapping back to their original value. pmd remains unshared.
- Process B then forks and process C is created. During the fork
process, we do dup_mm -> dup_mmap -> copy_page_range to copy page
tables. Copying page tables for hugetlb mappings is done in the
routine copy_hugetlb_page_range.
In copy_hugetlb_page_range(), the destination pte is obtained by:
dst_pte = huge_pte_alloc(dst, addr, sz);
If pmd sharing is possible, the returned pointer will be to a pte in an
existing page table. In the situation above, process C could share with
either process A or process B. Since process A is first in the list,
the returned pte is a pointer to a pte in process A's page table.
However, the check for pmd sharing in copy_hugetlb_page_range is:
/* If the pagetables are shared don't copy or take references */
if (dst_pte == src_pte)
continue;
Since process C is sharing with process A instead of process B, the
above test fails. The code in copy_hugetlb_page_range which follows
assumes dst_pte points to a huge_pte_none pte. It copies the pte entry
from src_pte to dst_pte and increments this map count of the associated
page. This is how we end up with an elevated map count.
To solve, check the dst_pte entry for huge_pte_none. If !none, this
implies PMD sharing so do not copy.
Link: http://lkml.kernel.org/r/20181105212315.14125-1-mike.kravetz@oracle.com Fixes: c5c99429fa57 ("fix hugepages leak due to pagetable page sharing") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Olof Johansson [Fri, 16 Nov 2018 23:08:00 +0000 (15:08 -0800)]
kernel/sched/psi.c: simplify cgroup_move_task()
The existing code triggered an invalid warning about 'rq' possibly being
used uninitialized. Instead of doing the silly warning suppression by
initializa it to NULL, refactor the code to bail out early instead.
Warning was:
kernel/sched/psi.c: In function `cgroup_move_task':
kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]
Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net Fixes: 2ce7135adc9ad ("psi: cgroup support") Signed-off-by: Olof Johansson <olof@lixom.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vitaly Wool [Fri, 16 Nov 2018 23:07:56 +0000 (15:07 -0800)]
z3fold: fix possible reclaim races
Reclaim and free can race on an object which is basically fine but in
order for reclaim to be able to map "freed" object we need to encode
object length in the handle. handle_to_chunks() is then introduced to
extract object length from a handle and use it during mapping.
Moreover, to avoid racing on a z3fold "headless" page release, we should
not try to free that page in z3fold_free() if the reclaim bit is set.
Also, in the unlikely case of trying to reclaim a page being freed, we
should not proceed with that page.
While at it, fix the page accounting in reclaim function.
This patch supersedes "[PATCH] z3fold: fix reclaim lock-ups".
Link: http://lkml.kernel.org/r/20181105162225.74e8837d03583a9b707cf559@gmail.com Signed-off-by: Vitaly Wool <vitaly.vul@sony.com> Signed-off-by: Jongseok Kim <ks77sj@gmail.com> Reported-by-by: Jongseok Kim <ks77sj@gmail.com> Reviewed-by: Snild Dolkow <snild@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jon Maloy [Sat, 17 Nov 2018 17:17:06 +0000 (12:17 -0500)]
tipc: don't assume linear buffer when reading ancillary data
The code for reading ancillary data from a received buffer is assuming
the buffer is linear. To make this assumption true we have to linearize
the buffer before message data is read.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
It is not enough to return an error code from the driver suspend
routine. The driver must also restore the device functionality.
This commit corrects the issue introduced by commit 0db55093b566
("net: bcmgenet: return correct value 'ret' from bcmgenet_power_down")
by calling the driver resume function if the suspend function returns
an error.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Doug Berger [Sat, 17 Nov 2018 02:00:22 +0000 (18:00 -0800)]
net: bcmgenet: abort suspend on error
If an error occurs during suspension of the driver the driver should
restore the hardware configuration and return an error to force the
system to resume.
Fixes: 0db55093b566 ("net: bcmgenet: return correct value 'ret' from bcmgenet_power_down") Signed-off-by: Doug Berger <opendmb@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Doug Berger [Sat, 17 Nov 2018 02:00:21 +0000 (18:00 -0800)]
net: bcmgenet: code movement
This commit switches the order of bcmgenet_suspend and bcmgenet_resume
in the file to prevent the need for a forward declaration in the next
commit and to make the review of that commit easier.
Signed-off-by: Doug Berger <opendmb@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/net/geneve.c:428:29: error: suggest braces around initialization
of subobject [-Werror,-Wmissing-braces]
struct in6_addr addr6 = { 0 };
^
{}
Rather than trying to appease the various compilers that support the
kernel, use memset, which is unambiguous.
Fixes: a07966447f39 ("geneve: ICMP error lookup handler") Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The reason seems to be that tipc_net_finalize()->tipc_sk_reinit() is
calling the function rhashtable_walk_enter() within a timer interrupt.
We fix this by executing tipc_net_finalize() in work queue context.
Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 18 Nov 2018 05:57:02 +0000 (21:57 -0800)]
net-gro: reset skb->pkt_type in napi_reuse_skb()
eth_type_trans() assumes initial value for skb->pkt_type
is PACKET_HOST.
This is indeed the value right after a fresh skb allocation.
However, it is possible that GRO merged a packet with a different
value (like PACKET_OTHERHOST in case macvlan is used), so
we need to make sure napi->skb will have pkt_type set back to
PACKET_HOST.
Otherwise, valid packets might be dropped by the stack because
their pkt_type is not PACKET_HOST.
napi_reuse_skb() was added in commit 96e93eab2033 ("gro: Add
internal interfaces for VLAN"), but this bug always has
been there.
Fixes: 96e93eab2033 ("gro: Add internal interfaces for VLAN") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Sun, 18 Nov 2018 03:19:14 +0000 (03:19 +0000)]
net: hns3: up/down netdev in hclge module when setting mtu
Currently netdev is down in enet module, and it is before
mtu range checking in hclge module, which may be cause
netdev being down unnecessarily.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Sun, 18 Nov 2018 03:19:13 +0000 (03:19 +0000)]
net: hns3: Add mtu setting support for vf
The patch adds mtu setting support for vf, currently
vf and pf share the same hardware mtu setting. Mtu set
by vf must be less than or equal to pf' mtu, and mtu
set by pf must be greater than or equal to vf' mtu.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Sun, 18 Nov 2018 03:19:12 +0000 (03:19 +0000)]
net: hns3: Add vport alive state checking support
Currently there is no way for pf to know if a vf device is
alive or not, so PF does not know which vf to notify when
reset happens, or which vf's mtu is invalid when vf and pf
share the same hardware mtu setting.
This patch adds vport alive state checking support, in order
to support the above scenario.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Sun, 18 Nov 2018 03:19:11 +0000 (03:19 +0000)]
net: hns3: Refactor mac mtu setting related functions
This patch refactors mac mtu setting related functions,
normalizes the use of mps and mtu.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yunsheng Lin [Sun, 18 Nov 2018 03:19:10 +0000 (03:19 +0000)]
net: hns3: Support two vlan header when setting mtu
This patch adds supports for two vlan header when setting mtu.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 18 Nov 2018 05:54:53 +0000 (21:54 -0800)]
Merge branch 'tdc-fixes'
Lucas Bates says:
====================
Prevent uncaught exceptions in tdc
This patch series addresses two potential bugs in tdc that can
cause exceptions to be raised in certain circumstances. These
exceptions are generally not handled, so instead we will prevent
them from being raised.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Brenda J. Butler [Fri, 16 Nov 2018 22:37:56 +0000 (17:37 -0500)]
tc-testing: tdc.py: Guard against lack of returncode in executed command
Add some defensive coding in case one of the subprocesses created by tdc
returns nothing. If no object is returned from exec_cmd, then tdc will
halt with an unhandled exception.
Signed-off-by: Brenda J. Butler <bjb@mojatatu.com> Signed-off-by: Lucas Bates <lucasb@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lucas Bates [Fri, 16 Nov 2018 22:37:55 +0000 (17:37 -0500)]
tc-testing: tdc.py: ignore errors when decoding stdout/stderr
Prevent exceptions from being raised while decoding output
from an executed command. There is no impact on tdc's
execution and the verify command phase would fail the pattern
match.
Signed-off-by: Lucas Bates <lucasb@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Rob Herring [Fri, 16 Nov 2018 22:11:03 +0000 (16:11 -0600)]
net: fsl: Use device_type helpers to access the node type
Remove directly accessing device_node.type pointer and use the accessors
instead. This will eventually allow removing the type pointer.
Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Rob Herring [Fri, 16 Nov 2018 22:05:37 +0000 (16:05 -0600)]
atm: Convert to using %pOFn instead of device_node.name
In preparation to remove the node name pointer from struct device_node,
convert printf users to use the %pOFn format specifier.
Cc: Chas Williams <3chas3@gmail.com> Cc: linux-atm-general@lists.sourceforge.net Cc: netdev@vger.kernel.org Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Sabrina Dubroca [Fri, 16 Nov 2018 15:58:19 +0000 (16:58 +0100)]
ip_tunnel: don't force DF when MTU is locked
The various types of tunnels running over IPv4 can ask to set the DF
bit to do PMTU discovery. However, PMTU discovery is subject to the
threshold set by the net.ipv4.route.min_pmtu sysctl, and is also
disabled on routes with "mtu lock". In those cases, we shouldn't set
the DF bit.
This patch makes setting the DF bit conditional on the route's MTU
locking state.
This issue seems to be older than git history.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
We would like the existing community to be kept in the loop for any new
developments on CAKE; and I certainly plan to keep maintaining it. Reflect
this in MAINTAINERS.
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bridge: fix vlan stats use-after-free on destruction
Syzbot reported a use-after-free of the global vlan context on port vlan
destruction. When I added per-port vlan stats I missed the fact that the
global vlan context can be freed before the per-port vlan rcu callback.
There're a few different ways to deal with this, I've chosen to add a
new private flag that is set only when per-port stats are allocated so
we can directly check it on destruction without dereferencing the global
context at all. The new field in net_bridge_vlan uses a hole.
v2: cosmetic change, move the check to br_process_vlan_info where the
other checks are done
v3: add change log in the patch, add private (in-kernel only) flags in a
hole in net_bridge_vlan struct and use that instead of mixing
user-space flags with private flags
Fixes: 9163a0fc1f0c ("net: bridge: add support for per-port vlan stats") Reported-by: syzbot+04681da557a0e49a52e5@syzkaller.appspotmail.com Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Slavomir Kaslev [Fri, 16 Nov 2018 09:27:53 +0000 (11:27 +0200)]
socket: do a generic_file_splice_read when proto_ops has no splice_read
splice(2) fails with -EINVAL when called reading on a socket with no splice_read
set in its proto_ops (such as vsock sockets). Switch this to fallbacks to a
generic_file_splice_read instead.
Signed-off-by: Slavomir Kaslev <kaslevs@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Schiller [Fri, 16 Nov 2018 07:38:36 +0000 (08:38 +0100)]
net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
Up until commit 7e5fbd1e0700 ("net: mdio-gpio: Convert to use gpiod
functions where possible"), the _cansleep variants of the gpio_ API was
used. After that commit and the change to gpiod_ API, the _cansleep()
was dropped. This then results in WARN_ON() when used with GPIO
devices which do sleep. Add back the _cansleep() to avoid this.
Fixes: 7e5fbd1e0700 ("net: mdio-gpio: Convert to use gpiod functions where possible") Signed-off-by: Martin Schiller <ms@dev.tdt.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This series extends the NCSI driver to configure multiple packages
and/or channels simultaneously. Since the RFC series this includes a few
extra changes to fix areas in the driver that either made this harder or
were roadblocks due to deviations from the NCSI specification.
Patches 1 & 2 fix two issues where the driver made assumptions about the
capabilities of the NCSI topology.
Patches 3 & 4 change some internal semantics slightly to make multi-mode
easier.
Patch 5 introduces a cleaner way of reconfiguring the NCSI configuration
and keeping track of channel states.
Patch 6 implements the main multi-package/multi-channel configuration,
configured via the Netlink interface.
Readers who have an interesting NCSI setup - especially multi-package
with HWA - please test! I think I've covered all permutations but I
don't have infinite hardware to test on.
Changes in v2:
- Updated use of the channel lock in ncsi_reset_dev(), making the
channel invisible and leaving the monitor check to
ncsi_stop_channel_monitor().
- Fixed ncsi_channel_is_tx() to consider the state of channels in other
packages.
Changes in v3:
- Fixed bisectability bug in patch 1
- Consider channels on all packages in a few places when multi-package
is enabled.
- Avoid doubling up reset operations, and check the current driver state
before reset to let any running operations complete.
- Reorganise the LSC handler slightly to avoid enabling Tx twice.
Changes in v4:
- Fix failover in the single-channel case
- Better handle ncsi_reset_dev() entry during a current config/suspend
operation.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ncsi: Configure multi-package, multi-channel modes with failover
This patch extends the ncsi-netlink interface with two new commands and
three new attributes to configure multiple packages and/or channels at
once, and configure specific failover modes.
NCSI_CMD_SET_PACKAGE mask and NCSI_CMD_SET_CHANNEL_MASK set a whitelist
of packages or channels allowed to be configured with the
NCSI_ATTR_PACKAGE_MASK and NCSI_ATTR_CHANNEL_MASK attributes
respectively. If one of these whitelists is set only packages or
channels matching the whitelist are considered for the channel queue in
ncsi_choose_active_channel().
These commands may also use the NCSI_ATTR_MULTI_FLAG to signal that
multiple packages or channels may be configured simultaneously. NCSI
hardware arbitration (HWA) must be available in order to enable
multi-package mode. Multi-channel mode is always available.
If the NCSI_ATTR_CHANNEL_ID attribute is present in the
NCSI_CMD_SET_CHANNEL_MASK command the it sets the preferred channel as
with the NCSI_CMD_SET_INTERFACE command. The combination of preferred
channel and channel whitelist defines a primary channel and the allowed
failover channels.
If the NCSI_ATTR_MULTI_FLAG attribute is also present then the preferred
channel is configured for Tx/Rx and the other channels are enabled only
for Rx.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When the NCSI driver is stopped with ncsi_stop_dev() the channel
monitors are stopped and the state set to "inactive". However the
channels are still configured and active from the perspective of the
network controller. We should suspend each active channel but in the
context of ncsi_stop_dev() the transmit queue has been or is about to be
stopped so we won't have time to do so.
Instead when ncsi_start_dev() is called if the NCSI topology has already
been probed then call ncsi_reset_dev() to suspend any channels that were
previously active. This resets the network controller to a known state,
provides an up to date view of channel link state, and makes sure that
mode flags such as NCSI_MODE_TX_ENABLE are properly reset.
In addition to ncsi_start_dev() use ncsi_reset_dev() in ncsi-netlink.c
to update the channel configuration more cleanly.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The concepts of a channel being 'active' and it having link are slightly
muddled in the NCSI driver. Tweak this slightly so that
NCSI_CHANNEL_ACTIVE represents a channel that has been configured and
enabled, and NCSI_CHANNEL_INACTIVE represents a de-configured channel.
This distinction is important because a channel can be 'active' but have
its link down; in this case the channel may still need to be configured
so that it may receive AEN link-state-change packets.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/ncsi: Don't deselect package in suspend if active
When a package is deselected all channels of that package cease
communication. If there are other channels active on the package of the
suspended channel this will disable them as well, so only send a
deselect-package command if no other channels are active.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the NCSI driver sends a select-package command to all possible
packages simultaneously to discover what packages are available. However
at this stage in the probe process the driver does not know if
hardware arbitration is available: if it isn't then this process could
cause collisions on the RMII bus when packages try to respond.
Update the probe loop to probe each package one by one, and once
complete check if HWA is universally supported.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/ncsi: Don't enable all channels when HWA available
NCSI hardware arbitration allows multiple packages to be enabled at once
and share the same wiring. If the NCSI driver recognises that HWA is
available it unconditionally enables all packages and channels; but that
is a configuration decision rather than something required by HWA.
Additionally the current implementation will not failover on link events
which can cause connectivity to be lost unless the interface is manually
bounced.
Retain basic HWA support but remove the separate configuration path to
enable all channels, leaving this to be handled by a later
implementation.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Arjun Vynipadath [Fri, 16 Nov 2018 03:45:42 +0000 (09:15 +0530)]
cxgb4: Remove SGE_HOST_PAGE_SIZE dependency on page size
The SGE Host Page Size has nothing to do with the actual
Host Page Size. It's the SGE's BAR2 Doorbell/GTS Page Size
for interpreting the SGE Ingress/Egress Queue per Page values.
Firmware reads all of these things and makes all the
subsequent changes necessary. The Host Driver uses the SGE
Host Page Size in order to properly calculate BAR2 Offsets.
Signed-off-by: Casey Leedom <leedom@chelsio.com> Signed-off-by: Arjun Vynipadath <arjun@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Thu, 15 Nov 2018 09:43:10 +0000 (17:43 +0800)]
tuntap: free XDP dropped packets in a batch
Thanks to the batched XDP buffs through msg_control. Instead of
calling put_page() for each page which involves a atomic operation,
let's batch them by record the last page that needs to be freed and
its refcnt count and free them in a batch.
Jason Wang [Thu, 15 Nov 2018 09:43:09 +0000 (17:43 +0800)]
vhost_net: mitigate page reference counting during page frag refill
We do a get_page() which involves a atomic operation. This patch tries
to mitigate a per packet atomic operation by maintaining a reference
bias which is initially USHRT_MAX. Each time a page is got, instead of
calling get_page() we decrease the bias and when we find it's time to
use a new page we will decrease the bias at one time through
__page_cache_drain_cache().
Testpmd(virtio_user + vhost_net) + XDP_DROP on TAP shows about 1.6%
improvement.
Before: 4.63Mpps
After: 4.71Mpps
Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This series updates the GRED Qdisc. The Qdisc matches nfp offload very
well, but before we can offload it there are a number of improvements
to make.
First few patches add extack messages to the Qdisc and pass extack
to netlink validation.
Next a new netlink attribute group is added, to allow GRED to be
extended more easily. Currently GRED passes C structures as attributes,
and even an array of C structs for virtual queue configuration. User
space has hard coded the expected length of that array, so adding new
fields is not possible.
Statistics are dump only. Patch 4 switches the byte counts to be 64 bit,
and patch 5 introduces the new stats attributes for dump. Patch 6
switches RED flags to be per-virtual queue, and patch 7 allows them
to be dumped and set at virtual queue granularity.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 15 Nov 2018 06:23:51 +0000 (22:23 -0800)]
net: sched: gred: allow manipulating per-DP RED flags
Allow users to set and dump RED flags (ECN enabled and harddrop)
on per-virtual queue basis. Validation of attributes is split
from changes to make sure we won't have to undo previous operations
when we find out configuration is invalid.
The objective is to allow changing per-Qdisc parameters without
overwriting the per-vq configured flags.
Old user space will not pass the TCA_GRED_VQ_FLAGS attribute and
per-Qdisc flags will always get propagated to the virtual queues.
New user space which wants to make use of per-vq flags should set
per-Qdisc flags to 0 and then configure per-vq flags as it
sees fit. Once per-vq flags are set per-Qdisc flags can't be
changed to non-zero. Vice versa - if the per-Qdisc flags are
non-zero the TCA_GRED_VQ_FLAGS attribute has to either be omitted
or set to the same value as per-Qdisc flags.
Update per-Qdisc parameters:
per-Qdisc | per-VQ | result
0 | 0 | all vq flags updated
0 | non-0 | error (vq flags in use)
non-0 | 0 | -- impossible --
non-0 | non-0 | all vq flags updated
Update per-VQ state (flags parameter not specified):
no change to flags
Update per-VQ state (flags parameter set):
per-Qdisc | per-VQ | result
0 | any | per-vq flags updated
non-0 | 0 | -- impossible --
non-0 | non-0 | error (per-Qdisc flags in use)
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 15 Nov 2018 06:23:50 +0000 (22:23 -0800)]
net: sched: gred: store red flags per virtual queue
Right now ECN marking and HARD drop (the common RED flags) can only
be configured for the entire Qdisc. In preparation for per-vq flags
store the values in the virtual queue structure. Setting per-vq
flags will only be allowed when no flags are set for the entire Qdisc.
For the new flags we will also make sure undefined bits are 0.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 15 Nov 2018 06:23:49 +0000 (22:23 -0800)]
net: sched: gred: provide a better structured dump and expose stats
Currently all GRED's virtual queue data is dumped in a single
array in a single attribute. This makes it pretty much impossible
to add new fields. In order to expose more detailed stats add a
new set of attributes. We can now expose the 64 bit value of bytesin
and all the mark stats which were not part of the original design.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: John Hurley <john.hurley@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>