Ioana Radulescu [Fri, 1 Mar 2019 17:47:23 +0000 (17:47 +0000)]
dpaa2-eth: Add software annotation types
We write different metadata information in the software annotation
area of Tx frames, depending on frame type. Make this more explicit
by introducing a type field and separate structures for single buffer
and scatter-gather frames.
Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
====================
sched: Patches from out-of-tree version of sch_cake
This series includes a couple of patches with updates from the out-of-tree
version of sch_cake. The first one is a fix to the fairness scheduling when
dual-mode fairness is enabled. The second patch is an additional feature flag
that allows using fwmark as a tin selector, as a convenience for people who want
to customise tin selection. The third patch is just a cleanup to the tin
selection logic.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With more modes added the logic in cake_select_tin() was getting a bit
hairy, and it turns out we can actually simplify it quite a bit. This also
allows us to get rid of one of the two diffserv parsing functions, which
has the added benefit that already-zeroed DSCP fields won't get re-written.
Suggested-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sch_cake: Permit use of connmarks as tin classifiers
Add flag 'FWMARK' to enable use of firewall connmarks as tin selector.
The connmark (skbuff->mark) needs to be in the range 1->tin_cnt ie.
for diffserv3 the mark needs to be 1->3.
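The range rule above can be sketched as follows. This is a minimal Python model, not the kernel code: the helper name `tin_from_fwmark` and the zero-based tin index are illustrative assumptions; only the valid range 1..tin_cnt comes from the text.

```python
from typing import Optional


def tin_from_fwmark(mark: int, tin_cnt: int) -> Optional[int]:
    """Return a zero-based tin index for a firewall mark, or None.

    Hypothetical helper: the commit text only says the mark must lie in
    1..tin_cnt; mapping mark N to index N-1 is an assumption made here.
    """
    if 1 <= mark <= tin_cnt:
        return mark - 1
    # Out-of-range marks (including 0, i.e. "unmarked") select nothing,
    # so a caller would fall back to another classifier such as DSCP.
    return None
```

With diffserv3 (three tins), marks 1, 2 and 3 select a tin and every other value is ignored.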
Background
Typically CAKE uses DSCP as the basis for tin selection. DSCP values
are relatively easily changed as part of the egress path, usually with
iptables & the mangle table, ingress is more challenging. CAKE is often
used on the WAN interface of a residential gateway where passthrough of
DSCP from the ISP is either missing or set to unhelpful values thus use
of ingress DSCP values for tin selection isn't helpful in that
environment.
An approach to solving the ingress tin selection problem is to use
CAKE's understanding of tc filters. Naive tc filters could match on
source/destination port numbers and force tin selection that way, but
multiple filters don't scale particularly well as each filter must be
traversed whether it matches or not. e.g. a simple example to map 3
firewall marks to tins:
MAJOR=$( tc qdisc show dev $DEV | head -1 | awk '{print $3}' )
tc filter add dev $DEV parent $MAJOR protocol all handle 0x01 fw action skbedit priority ${MAJOR}1
tc filter add dev $DEV parent $MAJOR protocol all handle 0x02 fw action skbedit priority ${MAJOR}2
tc filter add dev $DEV parent $MAJOR protocol all handle 0x03 fw action skbedit priority ${MAJOR}3
Another option is to use eBPF cls_act with tc filters e.g.
MAJOR=$( tc qdisc show dev $DEV | head -1 | awk '{print $3}' )
tc filter add dev $DEV parent $MAJOR bpf da obj my-bpf-fwmark-to-class.o
This has the disadvantages of a) needing someone to write & maintain
the bpf program, b) a bpf toolchain to compile it and c) needing to
hardcode the major number in the bpf program so it matches the cake
instance (or forcing the cake instance to a particular major number)
since the major number cannot be passed to the bpf program via tc
command line.
As already hinted at by the previous examples, it would be helpful
to associate tins with something that survives the Internet path and
ideally allows tin selection on both egress and ingress. Netfilter's
conntrack permits setting an identifying mark on a connection which
can also be restored to an ingress packet with tc action connmark e.g.
tc filter add dev eth0 parent ffff: protocol all prio 10 u32 \
match u32 0 0 flowid 1:1 action connmark action mirred egress redirect dev ifb1
Since tc's connmark action has restored any connmark into skb->mark,
any of the previous solutions are based upon it and in one form or
another copy that mark to the skb->priority field where again CAKE
picks this up.
This change cuts out at least one of the (less intuitive &
non-scalable) middlemen and permits direct access to skb->mark.
Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
George Amanakis [Fri, 1 Mar 2019 15:04:05 +0000 (16:04 +0100)]
sch_cake: Make the dual modes fairer
CAKE host fairness does not work well with TCP flows in dual-srchost and
dual-dsthost setup. The reason is that ACKs generated by TCP flows are
classified as sparse flows, and affect flow isolation from other hosts. Fix
this by calculating host_load based only on the bulk flows a host
generates. In a hash collision the host_bulk_flow_count values must be
decremented on the old hosts and incremented on the new ones *if* the queue
is in the bulk set.
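A toy Python model of the accounting described above may help. It is only a sketch under stated assumptions: the class and method names are hypothetical, new flows are assumed to start sparse, and the floor of 1 on the load is an assumption to keep the value usable as a divisor.

```python
from collections import defaultdict


class HostFlowTracker:
    """Toy model: per-host count of bulk flows only (sparse flows ignored)."""

    def __init__(self):
        self.host_bulk_flow_count = defaultdict(int)
        self.queue_host = {}       # queue index -> host currently using it
        self.bulk_queues = set()   # queues currently in the bulk set

    def attach(self, queue, host):
        """A new flow starts sparse, so no host counter changes."""
        self.queue_host[queue] = host

    def promote_to_bulk(self, queue):
        """Move a queue into the bulk set and count it for its host."""
        if queue not in self.bulk_queues:
            self.bulk_queues.add(queue)
            self.host_bulk_flow_count[self.queue_host[queue]] += 1

    def rehash(self, queue, new_host):
        """Hash collision: the queue is reused by a flow from another host."""
        old_host = self.queue_host[queue]
        if queue in self.bulk_queues:
            # Only bulk queues were counted, so only they move the counters.
            self.host_bulk_flow_count[old_host] -= 1
            self.host_bulk_flow_count[new_host] += 1
        self.queue_host[queue] = new_host

    def host_load(self, host):
        """Load used for fairness: bulk flows only (floor of 1 is assumed)."""
        return max(1, self.host_bulk_flow_count[host])
```

A sparse ACK-only queue never touches the counters, so it no longer inflates the load of the host that generates it.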
Reported-by: Pete Heist <peteheist@gmail.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
spi: sh-msiof: Restrict bits per word to 8/16/24/32 on R-Car Gen2/3
While the MSIOF variants in older SuperH and SH/R-Mobile SoCs support
bits-per-word values in the full range 8..32, the variants present in
R-Car Gen2 and Gen3 SoCs are restricted to 8, 16, 24, or 32.
Obtain the value from family-specific sh_msiof_chipdata to fix this.
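The difference between the two families can be illustrated with a bitmask in which bit (n - 1) stands for "n bits per word is supported". The Python below is a sketch of that encoding only; the function and constant names are hypothetical and are not taken from the driver.

```python
def bpw_mask(bits: int) -> int:
    """Bit (bits - 1) set means 'bits' bits per word is supported."""
    return 1 << (bits - 1)


def bpw_range_mask(lo: int, hi: int) -> int:
    """Mask with every width from lo to hi (inclusive) supported."""
    return ((1 << hi) - 1) & ~((1 << (lo - 1)) - 1)


def supports(mask: int, bits: int) -> bool:
    """Check whether a word width is allowed by a mask."""
    return bool(mask & bpw_mask(bits))


# Older SuperH and SH/R-Mobile variants: any width from 8 to 32.
LEGACY_MASK = bpw_range_mask(8, 32)
# R-Car Gen2/Gen3 variants: only 8, 16, 24 or 32.
RCAR_MASK = bpw_mask(8) | bpw_mask(16) | bpw_mask(24) | bpw_mask(32)
```

Keeping one mask per chip family lets a single lookup reject an unsupported width such as 12 on R-Car while still accepting it on the older parts.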
Reported-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Mark Brown <broonie@kernel.org>
Axel Lin [Thu, 28 Feb 2019 13:40:13 +0000 (21:40 +0800)]
regulator: core: Add set/get_current_limit helpers for regmap users
By setting curr_table, n_current_limits, csel_reg and csel_mask, the
regmap users can use regulator_set_current_limit_regmap and
regulator_get_current_limit_regmap for set/get_current_limit callbacks.
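To show how the four descriptor fields fit together, here is a small Python model that uses a dict as the register map. It is a sketch, not the kernel helpers: the function names are hypothetical, the table is assumed to be sorted in ascending order, and choosing the highest entry inside the requested window is an assumption of this example.

```python
def _field_shift(mask: int) -> int:
    """Position of the lowest set bit in a register field mask."""
    return (mask & -mask).bit_length() - 1


def set_current_limit(regs, desc, min_uA, max_uA):
    """Pick the highest table entry within [min_uA, max_uA] and program it.

    Returns the chosen selector, or raises ValueError if none fits.
    """
    table = desc["curr_table"][: desc["n_current_limits"]]
    shift = _field_shift(desc["csel_mask"])
    for sel in range(len(table) - 1, -1, -1):
        if min_uA <= table[sel] <= max_uA:
            old = regs.get(desc["csel_reg"], 0)
            # Read-modify-write: only the selector field changes.
            regs[desc["csel_reg"]] = (old & ~desc["csel_mask"]) | (sel << shift)
            return sel
    raise ValueError("no current limit in requested range")


def get_current_limit(regs, desc):
    """Decode the selector field and look it up in the table."""
    shift = _field_shift(desc["csel_mask"])
    sel = (regs.get(desc["csel_reg"], 0) & desc["csel_mask"]) >> shift
    if sel >= desc["n_current_limits"]:
        raise ValueError("selector out of range")
    return desc["curr_table"][sel]
```

A driver that fills in the table, register and mask gets both callbacks without writing its own lookup loop.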
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
====================
Macb power management support for ZynqMP
This series adds support for macb suspend/resume with system power down.
In relation to the above, this series also updates mdio_read/write
function for PM and adds tsu clock management.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Harini Katakam [Fri, 1 Mar 2019 10:50:35 +0000 (16:20 +0530)]
net: macb: Add support for suspend/resume with full power down
When macb device is suspended and system is powered down, the clocks
are removed and hence macb should be closed gracefully and restored
upon resume. This patch does the same by switching off the net device,
suspending phy and performing necessary cleanup of interrupts and BDs.
Upon resume, all these are reinitialized again.
Reset of macb device is done only when GEM is not a wake device.
Even when gem is a wake device, tx queues can be stopped and ptp device
can be closed (tsu clock will be disabled in pm_runtime_suspend) as
wake event detection has no dependency on this.
Signed-off-by: Kedareswara rao Appana <appanad@xilinx.com>
Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Harini Katakam [Fri, 1 Mar 2019 10:50:34 +0000 (16:20 +0530)]
net: macb: Add pm runtime support
Add runtime pm functions and move clock handling there.
Add runtime PM calls to mdio functions to allow for active mdio bus.
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@xilinx.com>
Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Harini Katakam [Fri, 1 Mar 2019 10:50:33 +0000 (16:20 +0530)]
net: macb: Support clock management for tsu_clk
TSU clock needs to be enabled/disabled as per support in devicetree
and it should also be controlled during suspend/resume (WOL has no
dependency on this clock).
Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Harini Katakam [Fri, 1 Mar 2019 10:50:32 +0000 (16:20 +0530)]
net: macb: Check MDIO state before read/write and use timeouts
Replace the while loop in MDIO read/write functions with a timeout.
In addition, add a check for MDIO bus busy before initiating a new
operation as well to make sure there is no ongoing MDIO operation.
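The pattern described here, a bounded poll instead of an open-ended loop plus a busy check before starting, can be sketched in Python as below. The bus object and its methods (`is_idle`, `start_read`, `data`) are hypothetical stand-ins and do not correspond to driver functions; the timeout value is likewise an assumption.

```python
import time


def wait_for_idle(is_idle, timeout_s=0.05, now=time.monotonic, pause=lambda: None):
    """Poll is_idle() until it returns True or timeout_s elapses.

    Returns True when idle was observed, False on timeout.
    """
    deadline = now() + timeout_s
    while True:
        if is_idle():
            return True
        if now() >= deadline:
            # One last check so a late transition is not reported as a timeout.
            return is_idle()
        pause()


def mdio_read(bus, reg, timeout_s=0.05):
    """Hypothetical MDIO read: wait before and after starting the transfer."""
    if not wait_for_idle(bus.is_idle, timeout_s):
        raise TimeoutError("MDIO bus busy before read")
    bus.start_read(reg)
    if not wait_for_idle(bus.is_idle, timeout_s):
        raise TimeoutError("MDIO read did not complete")
    return bus.data()
```

The caller now gets an error it can handle when the bus never goes idle, instead of spinning forever.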
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@xilinx.com>
Signed-off-by: Sai Pavan Boddu <sai.pavan.boddu@xilinx.com>
Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net: dsa: microchip: add KSZ9893 switch support
This series of patches is to modify the KSZ9477 DSA driver to support
running KSZ9893 switch.
The KSZ9893 switch is similar to KSZ9477 except the ingress tail tag has
1 byte instead of 2 bytes. The XMII register that governs the MAC
communication also has different register definitions.
v1
- Put KSZ9893 tagging in separate patch
- Remove other switch support
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tristram Ha [Fri, 1 Mar 2019 03:57:24 +0000 (19:57 -0800)]
net: dsa: microchip: add KSZ9893 switch support
Add KSZ9893 switch support in KSZ9477 driver. This switch is similar to
KSZ9477 except the ingress tail tag has 1 byte instead of 2 bytes, so
KSZ9893 tagging will be used.
The XMII register that governs how the host port communicates with the
MAC also has different register definitions.
Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tristram Ha [Fri, 1 Mar 2019 03:57:23 +0000 (19:57 -0800)]
net: dsa: add KSZ9893 switch tagging support
KSZ9893 switch is similar to KSZ9477 switch except the ingress tail tag
has 1 byte instead of 2 bytes. The size of the portmap is smaller and
so the override and lookup bits are also moved.
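The size difference can be illustrated with a short Python sketch that appends a tail tag of either length. Only the 1-byte versus 2-byte sizes come from the text; encoding the destination as a one-hot port bitmap in a big-endian field is an assumption of this example, the function name is hypothetical, and the override and lookup bits are deliberately left out.

```python
def add_tail_tag(frame: bytes, port: int, tag_len: int) -> bytes:
    """Append an ingress tail tag that selects one destination port.

    tag_len is 2 for the KSZ9477 layout and 1 for KSZ9893, per the text.
    The one-hot, big-endian port bitmap is an assumption of this sketch.
    """
    if tag_len not in (1, 2):
        raise ValueError("tag_len must be 1 or 2")
    portmap = 1 << port
    if portmap >= 1 << (8 * tag_len):
        raise ValueError("port does not fit in the tag")
    return frame + portmap.to_bytes(tag_len, "big")
```

A smaller tag leaves fewer bits for the port map, which is why any flag bits sharing the field have to sit at different positions.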
Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tristram Ha [Fri, 1 Mar 2019 03:57:22 +0000 (19:57 -0800)]
dt-bindings: net: dsa: document additional Microchip KSZ9477 family switches
Document additional Microchip KSZ9477 family switches.
Show how KSZ8565 switch should be configured as the host port is port 7
instead of port 5.
Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 3 Mar 2019 21:01:49 +0000 (13:01 -0800)]
Merge branch 'appletalk-small-cleanup-and-bugfix'
Yue Haibing says:
====================
appletalk: small cleanup and bugfix
v2:
- Add cover letter log
This patch series mainly fixes a use-after-free bug in atalk_proc_exit.
Patch 1 uses the remove_proc_subtree helper to simplify the atalk_proc fs code,
along with some other cleanup.
Patch 2 adds a proper error cleanup path in atalk_init to fix the issue; it is
based on patch 1 because of the change to the atalk_proc_exit context.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Memory state around the buggy address:
 ffff8881f41fe480: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff8881f41fe500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8881f41fe580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                   ^
 ffff8881f41fe600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
 ffff8881f41fe680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
atalk_init should check whether atalk_proc_init fails; otherwise
atalk_exit will trigger a use-after-free in pde_subdir_find while
unloading the module. This patch fixes the error cleanup path of atalk_init.
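The cleanup pattern being fixed, checking each init step and unwinding the completed ones in reverse order, can be sketched in Python as follows. The step names used in the usage note are placeholders, not the actual registration calls made by atalk_init.

```python
def module_init(steps):
    """Run (name, setup, teardown) steps; undo completed ones on failure.

    Returns the list of completed step names on success. On failure the
    already-completed steps are torn down in reverse order and the
    exception is re-raised, so nothing is left half-registered.
    """
    done = []
    try:
        for name, setup, teardown in steps:
            setup()
            done.append((name, teardown))
    except Exception:
        for _name, teardown in reversed(done):
            teardown()
        raise
    return [name for name, _ in done]
```

If a hypothetical "proc" step fails after "sock" and "dev" succeeded, the teardown order is "dev" then "sock", and the exit path is never reached with stale state.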
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 28 Feb 2019 20:55:43 +0000 (12:55 -0800)]
net: sched: put back q.qlen into a single location
In the series fc8b81a5981f ("Merge branch 'lockless-qdisc-series'")
John made the assumption that the data path had no need to read
the qdisc qlen (number of packets in the qdisc).
It is true when pfifo_fast is used as the root qdisc, or as direct MQ/MQPRIO
children.
But pfifo_fast can be used as a leaf in classful qdiscs, and existing
logic needs to access the child qlen in an efficient way.
HTB breaks badly, since it uses cl->leaf.q->q.qlen in :
htb_activate() -> WARN_ON()
htb_dequeue_tree() to decide if a class can be htb_deactivated
when it has no more packets.
HFSC, DRR, CBQ, QFQ have similar issues, and some calls to
qdisc_tree_reduce_backlog() also read q.qlen directly.
Using qdisc_qlen_sum() (which iterates over all possible cpus)
in the data path is a non-starter.
It seems we have to put back qlen in a central location,
at least for stable kernels.
For all qdiscs but pfifo_fast, qlen is guarded by the qdisc lock,
so the existing q.qlen{++|--} are correct.
For 'lockless' qdiscs (pfifo_fast so far), we need to use atomic_{inc|dec}()
because the spinlock might not be held (for example from
pfifo_fast_enqueue() and pfifo_fast_dequeue()).
This patch adds atomic_qlen (in the same location as qlen)
and renames the following helpers, since we want to express that
they can be used without the qdisc lock, and that qlen is no longer percpu.
Later (net-next) we might revert this patch by tracking all these
qlen uses and replace them by a more efficient method (not having
to access a precise qlen, but an empty/non_empty status that might
be less expensive to maintain/track).
Another possibility is to have a legacy pfifo_fast version that would
be used as a child qdisc, since the parent qdisc needs
a spinlock anyway. But then, future lockless qdiscs would also
have the same problem.
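The trade-off discussed in this message can be modelled in a few lines of Python. This is a conceptual sketch only: both class names are hypothetical, and a lock stands in for the atomic increment and decrement operations.

```python
import threading


class PerCpuQlen:
    """Per-CPU counters: cheap to update, but reading means summing all."""

    def __init__(self, nr_cpus):
        self.counts = [0] * nr_cpus

    def inc(self, cpu):
        self.counts[cpu] += 1

    def dec(self, cpu):
        self.counts[cpu] -= 1

    def qlen(self):
        return sum(self.counts)  # O(nr_cpus) on every read


class AtomicQlen:
    """Single shared counter; a lock stands in for atomic inc/dec here."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def inc(self):
        with self._lock:
            self._value += 1

    def dec(self):
        with self._lock:
            self._value -= 1

    def qlen(self):
        return self._value  # O(1) read, usable by a parent qdisc such as HTB
```

A parent that checks the child's length on every dequeue pays for the sum in the first model and for a single load in the second.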
Fixes: 7e66016f2c65 ("net: sched: helpers to sum qlen and qlen for per cpu logic")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 2 Mar 2019 22:04:20 +0000 (14:04 -0800)]
Merge tag 'mlx5-updates-2019-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2019-03-01
This series adds multipath offload support and contains some small updates
to mlx5 driver.
Multipath offload support from Roi Dayan:
We are going to track SW multipath route and related nexthops and reflect
that as port affinity to the HW.
1) Some patches are preparation.
2) add the multipath mode and fib events handling.
3) add support to handle offload failure for net error, i.e.
port down.
4) Small updates to match the behavior of multipath
Two small updates from Eran Ben Elisha,
5) Make a function static
6) Update PCIe supported devices list.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next:
1) Add .release_ops to properly unroll .select_ops, use it from nft_compat.
After this change, we can remove list of extensions too to simplify this
codebase.
2) Update amanda conntrack helper to support v3.4, from Florian Tham.
3) Get rid of the obsolete BUGPRINT macro in ebtables, from
Florian Westphal.
4) Merge IPv4 and IPv6 masquerading infrastructure into one single module.
From Florian Westphal.
5) Patchset to remove nf_nat_l3proto structure to get rid of
indirections, from Florian Westphal.
6) Skip unnecessary conntrack timeout updates in case the value is
still the same, also from Florian Westphal.
7) Remove unnecessary 'fall through' comments in empty switch cases,
from Li RongQing.
8) Fix lookup to fixed size hashtable sets on big endian with 32-bit keys.
9) Incorrect logic to deactivate path of fixed size hashtable sets,
element was being tested to self.
10) Remove nft_hash_key(), the bitmap set is always selected for 16-bit
keys.
11) Use boolean whenever possible in IPVS codebase, from Andrea Claudi.
12) Enter close state in conntrack if RST matches exact sequence number,
from Florian Westphal.
13) Initialize dst_cache in tunnel extension, from wenxu.
14) Pass protocol as u16 to xt_check_match and xt_check_target, from
Li RongQing.
15) SCTP header is guaranteed to be in a linear area from IPVS NAT handler,
from Xin Long.
16) Don't steal packets coming from slave VRF device from the
ip_sabotage_in() path, from David Ahern.
17) Fix unsafe update of basechain stats, from Li RongQing.
18) Make sure CONNTRACK_LOCKS is power of 2 to let compiler optimize
modulo operation as bitwise AND, from Li RongQing.
19) Use device_attribute instead of internal definition in the IDLETIMER
target, from Sami Tolvanen.
20) Merge redir, masq and IPv4/IPv6 NAT chain types, from Florian Westphal.
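Item 18 relies on a standard identity: for a power-of-two n, `h % n` equals `h & (n - 1)`. A minimal Python check of that identity follows; the function names are illustrative and the lock count used in the usage note is arbitrary.

```python
def is_power_of_two(n: int) -> bool:
    """True when exactly one bit is set."""
    return n > 0 and (n & (n - 1)) == 0


def lock_index(hash_value: int, nr_locks: int) -> int:
    """Map a hash to a lock slot; needs nr_locks to be a power of two."""
    if not is_power_of_two(nr_locks):
        raise ValueError("nr_locks must be a power of two")
    # Equivalent to hash_value % nr_locks, but a single AND.
    return hash_value & (nr_locks - 1)
```

For example, `lock_index(h, 1024)` matches `h % 1024` for every non-negative `h`, while a count such as 1000 is rejected.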
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Here's one more bluetooth-next pull request for the 5.1 kernel:
- Added support for MediaTek MT7663U and MT7668U UART devices
- Cleanups & fixes to the hci_qca driver
- Fixed wakeup pin behavior for QCA6174A controller
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 2 Mar 2019 19:47:29 +0000 (11:47 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Two last minute fixes:
- Prevent value evaluation via functions happening in the user access
enabled region of __put_user() (put another way: make sure to
evaluate the value to be stored in user space _before_ enabling
user space accesses)
- Correct the definition of a Hyper-V hypercall constant"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/hyper-v: Fix definition of HV_MAX_FLUSH_REP_COUNT
x86/uaccess: Don't leak the AC flag into __put_user() value evaluation
Linus Torvalds [Sat, 2 Mar 2019 19:39:54 +0000 (11:39 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Nine small fixes.
The resume fix is a cosmetic removal of a warning with an incorrect
condition causing it to alarm people wrongly.
The other eight patches correct a thinko in Christoph Hellwig's DMA
conversion series. Without it all these drivers end up with 32 bit DMA
masks meaning they bounce any page over 4GB before sending it to the
controller.
Nowadays, even laptops mostly have memory above 4GB, so this can lead
to significant performance degradation with all the bouncing"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: core: Avoid that system resume triggers a kernel warning
scsi: hptiop: fix calls to dma_set_mask()
scsi: hisi_sas: fix calls to dma_set_mask_and_coherent()
scsi: csiostor: fix calls to dma_set_mask_and_coherent()
scsi: bfa: fix calls to dma_set_mask_and_coherent()
scsi: aic94xx: fix calls to dma_set_mask_and_coherent()
scsi: 3w-sas: fix calls to dma_set_mask_and_coherent()
scsi: 3w-9xxx: fix calls to dma_set_mask_and_coherent()
scsi: lpfc: fix calls to dma_set_mask_and_coherent()
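The cost described in the pull message comes from a simple comparison: a buffer must be bounced when any part of it lies above the highest address the device's DMA mask allows. A Python sketch of that check, with hypothetical function names, is:

```python
def dma_mask(bits: int) -> int:
    """Mask with the low 'bits' bits set."""
    return (1 << bits) - 1


def needs_bounce(phys_addr: int, length: int, mask: int) -> bool:
    """True if any byte of the buffer lies above what the device can address."""
    return phys_addr + length - 1 > mask
```

With a 32-bit mask, a page at 5 GiB needs a bounce buffer; with a 64-bit mask the same page is handed to the controller directly.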
====================
Recently we had linux-next bpf/bpf-next conflict when we added new
functionality to the test_progs.c at the same location. Let's split
test_progs.c the same way we recently split test_verifier.c.
I follow the same pattern we did in commit 2dfb40121ee8 ("selftests: bpf:
prepare for break up of verifier tests") for verifier: create
scaffolding to support dedicated files and slowly move the tests into
separate files.
The first patch adds scaffolding, subsequent patches move progs into
separate files.
In theory, many of the standalone tests can be migrated to this new
framework as well. They get the benefit of common CHECK macro and
bpf_find_map function which a lot of standalone tests need to redefine.
v3 changes:
* respin on top of commit ebace0e981b2 ("selftests/bpf: use
__bpf_constant_htons in test_prog.c for flow dissector")
* put bpf_rlimit.h into test_progs.c instead of test_progs.h
v2 changes:
* added cover letter, added more description about file structure
====================
selftests: bpf: break up test_progs - preparations
Add new prog_tests directory where tests are supposed to land.
Each prog_tests/<filename>.c is expected to have a global function
with signature 'void test_<filename>(void)'. Makefile automatically
generates prog_tests/tests.h file with entry for each prog_tests file:
prog_tests/tests.h is included in test_progs.c in two places with
appropriate defines. This scheme allows us to move each function with
a separate patch without breaking anything.
Compared to the recent verifier split, each separate file here is
a compilation unit and test_progs.[ch] is now used as a place to put
some common routines that might be used by multiple tests.
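The "include the generated header twice with different defines" scheme can be sketched with a small generator. The Python below is an illustration under assumptions: the macro names `DECLARE` and `CALL` and the exact header layout are hypothetical, and only the `test_<filename>` naming rule comes from the text.

```python
import os


def generate_tests_header(filenames):
    """Build a header with one declaration and one call per test file.

    Hypothetical layout: a DECLARE section with prototypes and a CALL
    section with invocations, so including the header twice with a
    different macro defined each time yields both.
    """
    names = sorted(os.path.splitext(os.path.basename(f))[0] for f in filenames)
    lines = ["/* Generated header, do not edit */", "#ifdef DECLARE"]
    lines += [f"extern void test_{n}(void);" for n in names]
    lines += ["#endif", "#ifdef CALL"]
    lines += [f"test_{n}();" for n in names]
    lines += ["#endif"]
    return "\n".join(lines) + "\n"
```

Because the list is derived from the directory contents, moving a test into its own file needs no edit to a central table, which is what lets each move be a separate patch.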
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Sean Wang [Sat, 2 Mar 2019 18:44:09 +0000 (02:44 +0800)]
Bluetooth: mediatek: add support for MediaTek MT7663U and MT7668U UART devices
This adds the support of enabling MT7663U and MT7668U Bluetooth function
running on the top of btmtkuart driver.
There are a few differences between MT766[3,8]U and MT7622 where
MT766[3,8]U are standalone devices based on UART transport while MT7622
bluetooth is a built-in device on MediaTek SoC communicating with the host
through BTIF serial transport. Thus, extra setup sequence is necessary
for these standalone devices such as remote regulator and reset control via
GPIO, baud rate changing handshake between the host and device and so on.
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
====================
Host Bandwidth Manager is a framework for limiting the bandwidth used
by v2 cgroups. It consists of 1 BPF helper, a sample BPF program to
limit egress bandwidth as well as a sample user program and script to
simplify HBM testing.
The sample HBM BPF program is not meant to be production quality, it is
provided as proof of concept. A lot more information, including sample
runs in some cases, is provided in the commit messages of the individual
patches.
A future patch will add support for reducing TCP's cwnd (we are evaluating
alternatives). Another patch will add support for fair queueing's Earliest
Departure Time. Until then, HBM is better suited for flows supporting ECN.
In addition, a BPF program to limit ingress bandwidth will be provided in
an upcoming patchset.
Changes from v1 to v2:
* bpf_tcp_enter_cwr can only be called from a cgroup skb egress BPF
program (otherwise load or attach will fail) where we already hold
the sk lock. Also only applies for ESTABLISHED state.
* bpf_skb_ecn_set_ce uses INET_ECN_set_ce()
* bpf_tcp_check_probe_timer now uses tcp_reset_xmit_timer. Can only be
used by egress cgroup skb programs.
* removed load_cg_skb user program.
* nrm bpf egress program checks packet header in skb to determine
ECN value. Now also works for ECN enabled UDP packets.
Using ECN_ defines instead of integers.
* NRM script test program now uses bpftool instead of load_cg_skb
Changes from v2 to v3:
* Changed name from NRM (Network Resource Manager) to HBM (Host
Bandwidth Manager)
* The bpf helper to set ECN ce now checks that the header is writeable
* Removed helper bpf functions that modified TCP state due to a concern
about whether the socket is locked by the current thread.
====================
brakmo [Fri, 1 Mar 2019 20:38:50 +0000 (12:38 -0800)]
bpf: HBM test script
Script for testing HBM (Host Bandwidth Manager) framework.
It creates a cgroup to use for testing and loads a BPF program to limit
egress bandwidth. It then uses iperf3 or netperf to create
loads. The output is the goodput in Mbps (unless -D is used).
It can work on a single host using loopback or among two hosts (with netperf).
When using loopback, it is recommended to also introduce a delay of at least
1ms (-d=1), otherwise the assigned bandwidth is likely to be underutilized.
USAGE: $name [out] [-b=<prog>|--bpf=<prog>] [-c=<cc>|--cc=<cc>] [-D]
             [-d=<delay>|--delay=<delay>] [--debug] [-E]
             [-f=<#flows>|--flows=<#flows>] [-h] [-i=<id>|--id=<id>] [-l]
             [-N] [-p=<port>|--port=<port>] [-P] [-q=<qdisc>]
             [-R] [-s=<server>|--server=<server>] [--stats]
             [-t=<time>|--time=<time>] [-w] [cubic|dctcp]
  Where:
    out               Egress (default egress)
    -b or --bpf       BPF program filename to load and attach.
                      Default is nrm_out_kern.o for egress.
    -c or --cc        TCP congestion control (cubic or dctcp)
    -d or --delay     Add a delay in ms using netem
    -D                In addition to the goodput in Mbps, it also outputs
                      other detailed information. This information is
                      test dependent (i.e. iperf3 or netperf).
    --debug           Print BPF trace buffer
    -E                Enable ECN (not required for dctcp)
    -f or --flows     Number of concurrent flows (default=1)
    -i or --id        cgroup id (an integer, default is 1)
    -l                Do not limit flows using loopback
    -N                Use netperf instead of iperf3
    -h                Help
    -p or --port      iperf3 port (default is 5201)
    -P                Use an iperf3 instance for each flow
    -q                Use the specified qdisc.
    -r or --rate      Rate in Mbps (default is 1Gbps)
    -R                Use TCP_RR for netperf. 1st flow has req
                      size of 10KB, rest of 1MB. Reply in all
                      cases is 1 byte.
                      More detailed output for each flow can be found
                      in the files netperf.<cg>.<flow>, where <cg> is the
                      cgroup id as specified with the -i flag, and <flow>
                      is the flow id starting at 1 and increasing by 1 for
                      each flow (as specified by -f).
    -s or --server    hostname of netperf server. Used to create netperf
                      test traffic between two hosts (default is within host)
                      netserver must be running on the host.
    --stats           Get HBM stats (marked, dropped, etc.)
    -t or --time      duration of iperf3 in seconds (default=5)
    -w                Work conserving flag. cgroup can increase its
                      bandwidth beyond the rate limit specified
                      while there is available bandwidth. Current
                      implementation assumes there is only one NIC
                      (eth0), but can be extended to support multiple
                      NICs. This is just a proof of concept.
    cubic or dctcp    specify TCP CC to use
Examples:
./do_hbm_test.sh -l -d=1 -D --stats
Runs a 5 second test, using a single iperf3 flow and with the default
rate limit of 1Gbps and a delay of 1ms (using netem) using the default
TCP congestion control on the loopback device (hence we use "-l" to
enforce bandwidth limit on loopback device). Since no direction is
specified, it defaults to egress. Since no TCP CC algorithm is
specified it uses the system default (Cubic for this test).
With no -D flag, only the value of the AGGREGATE OUTPUT would show.
id refers to the cgroup id and is useful when running multi cgroup
tests (supported by a future patch).
This patchset does not support calling TCP's congestion window
reduction, even when packets are dropped by the BPF program, resulting
in a large number of packets dropped. It is recommended that the current
HBM implementation only be used with ECN enabled flows. A future patch
will add support for reducing TCP's cwnd and will increase the
performance of non-ECN enabled flows.
Output:
Details for HBM in cgroup 1
id:1
rate_mbps:493
duration:4.8 secs
packets:11355
bytes_MB:590
pkts_dropped:4497
bytes_dropped_MB:292
pkts_marked_percent: 39.60
bytes_marked_percent: 49.49
pkts_dropped_percent: 39.60
bytes_dropped_percent: 49.49
PING AVG DELAY:2.075
AGGREGATE_GOODPUT:505
./do_nrm_test.sh -l -d=1 -D --stats dctcp
Same as above but using dctcp. Note that fewer bytes are dropped
(0.01% vs. 49%).
Output:
Details for HBM in cgroup 1
id:1
rate_mbps:945
duration:4.9 secs
packets:16859
bytes_MB:578
pkts_dropped:1
bytes_dropped_MB:0
pkts_marked_percent: 28.74
bytes_marked_percent: 45.15
pkts_dropped_percent: 0.01
bytes_dropped_percent: 0.01
PING AVG DELAY:2.083
AGGREGATE_GOODPUT:965
./do_nrm_test.sh -d=1 -D --stats
As first example, but without limiting loopback device (i.e. no
"-l" flag). Since there is no bandwidth limiting, no details for
HBM are printed out.
Output:
Details for HBM in cgroup 1
PING AVG DELAY:2.019
AGGREGATE_GOODPUT:42655
./do_hbm.sh -l -d=1 -D --stats -f=2
Uses iperf3 and does 2 flows
./do_hbm.sh -l -d=1 -D --stats -f=4 -P
Uses iperf3 and does 4 flows, each flow as a separate process.
./do_hbm.sh -l -d=1 -D --stats -f=4 -N
Uses netperf, 4 flows
./do_hbm.sh -f=1 -r=2000 -t=5 -N -D --stats dctcp -s=<server-name>
Uses netperf between two hosts. The remote host name is specified
with -s= and you need to start the program netserver manually on
the remote host. It will use 1 flow, a rate limit of 2Gbps and dctcp.
./do_hbm.sh -f=1 -r=2000 -t=5 -N -D --stats -w dctcp \
-s=<server-name>
As previous, but allows use of extra bandwidth. For this test the
rate is 8Gbps vs. 1Gbps of the previous test.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
brakmo [Fri, 1 Mar 2019 20:38:49 +0000 (12:38 -0800)]
bpf: User program for testing HBM
The program nrm creates a cgroup and attaches a BPF program to the
cgroup for testing HBM (Host Bandwidth Manager) for egress traffic.
One still needs to create network traffic. This can be done through
netesto, netperf or iperf3.
A follow-up patch contains a script to create traffic.
USAGE: hbm [-d] [-l] [-n <id>] [-r <rate>] [-s] [-t <secs>]
           [-w] [-h] [prog]
  Where:
    -d         Print BPF trace debug buffer
    -l         Also limit flows doing loopback
    -n <#>     To create cgroup "/hbm#" and attach prog. Default is /nrm1
               This is convenient when testing HBM in more than 1 cgroup
    -r <rate>  Rate limit in Mbps
    -s         Get HBM stats (marked, dropped, etc.)
    -t <time>  Exit after specified seconds (default is 0)
    -w         Work conserving flag. cgroup can increase its bandwidth
               beyond the rate limit specified while there is available
               bandwidth. Current implementation assumes there is only
               one NIC (eth0), but can be extended to support multiple NICs.
               Currently only supported for egress. Note, this is just
               a proof of concept.
    -h         Print this info
    prog       BPF program file name. Name defaults to hbm_out_kern.o
More information about HBM can be found in the paper "BPF Host Resource
Management" presented at the 2018 Linux Plumbers Conference, Networking Track
(http://vger.kernel.org/lpc_net2018_talks/LPC%20BPF%20Network%20Resource%20Paper.pdf)
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
brakmo [Fri, 1 Mar 2019 20:38:48 +0000 (12:38 -0800)]
bpf: Sample HBM BPF program to limit egress bw
A cgroup skb BPF program to limit cgroup output bandwidth.
It uses a modified virtual token bucket queue to limit average
egress bandwidth. The implementation uses credits instead of tokens.
Negative credits imply that queueing would have happened (this is
a virtual queue, so no queueing is done by it. However, queueing may
occur at the actual qdisc, which is not used for rate limiting).
This implementation uses 3 thresholds, one to start marking packets and
the other two to drop packets:
                           CREDIT
 - <--------------------------|------------------------> +
       |    |          |      0
       |  Large pkt    |
       |  drop thresh  |
Small pkt drop  Mark threshold
    thresh
The effect of marking depends on the type of packet:
a) If the packet is ECN enabled, then the packet is ECN ce marked.
The current mark threshold is tuned for DCTCP.
c) Else, it is dropped if it is a large packet.
If the credit is below the drop threshold, the packet is dropped.
Note that dropping a packet through the BPF program does not trigger CWR
(Congestion Window Reduction) in TCP packets. A future patch will add
support for triggering CWR.
This BPF program actually uses 2 drop thresholds, one threshold
for larger packets (>= 120 bytes) and another for smaller packets. This
protects smaller packets such as SYNs, ACKs, etc.
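The decision logic described above can be sketched in Python. Everything numeric here except the 120-byte size boundary is a hypothetical placeholder chosen to respect the ordering in the diagram; the function names and the credit cap are also assumptions, not values from the BPF program.

```python
LARGE_PKT_BYTES = 120        # from the text: packets >= 120 bytes are "large"
# Hypothetical thresholds (in bytes of credit), ordered as in the diagram:
MARK_THRESH = -50_000
LARGE_PKT_DROP_THRESH = -200_000
SMALL_PKT_DROP_THRESH = -400_000
MAX_CREDIT = 100_000         # hypothetical cap on accumulated credit


def update_credit(credit, elapsed_ns, rate_mbps, pkt_len):
    """Add credit for elapsed time at the configured rate, charge the packet."""
    earned = elapsed_ns * rate_mbps // 8000   # Mbit/s * ns -> bytes
    return min(credit + earned, MAX_CREDIT) - pkt_len


def verdict(credit, pkt_len, ecn_capable):
    """Return 'drop', 'mark' or 'pass' for a packet given current credit."""
    large = pkt_len >= LARGE_PKT_BYTES
    drop_thresh = LARGE_PKT_DROP_THRESH if large else SMALL_PKT_DROP_THRESH
    if credit < drop_thresh:
        return "drop"
    if credit < MARK_THRESH:
        if ecn_capable:
            return "mark"          # set ECN CE instead of dropping
        if large:
            return "drop"          # non-ECN large packet: drop early
    return "pass"
```

Because small packets use the lower drop threshold, a 60-byte ACK still passes at a credit level where a 1500-byte non-ECN segment is already dropped.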
The default bandwidth limit is set at 1Gbps but this can be changed by
a user program through a shared BPF map. In addition, by default this BPF
program does not limit connections using loopback. This behavior can be
overwritten by the user program. There is also an option to calculate
some statistics, such as percent of packets marked or dropped, which
the user program can access.
A later patch provides such a program (hbm.c).
Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
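The credit scheme described in this commit can be sketched roughly as follows. This is a minimal userspace model in Python, not the BPF program itself. The threshold values, the credit cap, and all names are hypothetical; only the 120-byte large-packet boundary and the 1 Gbps default come from the commit message. As a simplification, the sketch charges credit for every packet, including dropped ones.

```python
from dataclasses import dataclass

# Hypothetical values chosen only for illustration (all in bytes).
MAX_CREDIT = 100_000              # cap on accumulated credit
MARK_THRESH = -20_000             # below this, mark (or drop if not ECN capable)
LARGE_PKT_DROP_THRESH = -60_000   # below this, drop large packets
SMALL_PKT_DROP_THRESH = -90_000   # below this, drop small packets too
LARGE_PKT_SIZE = 120              # from the commit message
NSEC_PER_SEC = 1_000_000_000


@dataclass
class VirtualQueue:
    rate_bps: int = 1_000_000_000  # default limit of 1 Gbps
    credit: int = MAX_CREDIT
    last_ns: int = 0

    def on_packet(self, now_ns: int, length: int, ecn_capable: bool) -> str:
        """Return "pass", "mark_ce" or "drop" for one egress packet."""
        # Earn credit for the elapsed time, then pay for this packet.
        earned = (now_ns - self.last_ns) * self.rate_bps // (8 * NSEC_PER_SEC)
        self.last_ns = now_ns
        self.credit = min(self.credit + earned, MAX_CREDIT) - length

        large = length >= LARGE_PKT_SIZE
        drop_thresh = LARGE_PKT_DROP_THRESH if large else SMALL_PKT_DROP_THRESH
        if self.credit < drop_thresh:
            return "drop"
        if self.credit < MARK_THRESH:
            if ecn_capable:
                return "mark_ce"
            if large:
                return "drop"
        return "pass"
```

With these made-up thresholds, a small non-ECN packet such as an ACK still passes at a credit level where a 1500-byte non-ECN packet is dropped, which is the protection the two drop thresholds are meant to provide.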
brakmo [Fri, 1 Mar 2019 20:38:46 +0000 (12:38 -0800)]
bpf: add bpf helper bpf_skb_ecn_set_ce
This patch adds a new bpf helper BPF_FUNC_skb_ecn_set_ce
"int bpf_skb_ecn_set_ce(struct sk_buff *skb)". It is added to
BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently can
be attached to the ingress and egress path. The helper is needed
because this type of bpf_prog cannot modify the skb directly.
This helper is used to set the ECN field of ECN capable IP packets to ce
(congestion encountered) in the IPv6 or IPv4 header of the skb. It can be
used by a bpf_prog to manage egress or ingress network bandwidth limit
per cgroupv2 by inducing an ECN response in the TCP sender.
This works best when using DCTCP.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
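As an illustration of what setting CE means at the header level, here is a small Python sketch operating on the IPv4 TOS / IPv6 traffic class byte. The ECN codepoints are those defined in RFC 3168; the function name and return convention are hypothetical and do not describe the kernel helper's implementation. In IPv4 the header checksum must also be updated after the change, which this sketch leaves out.

```python
from typing import Tuple

# ECN occupies the two least significant bits of the TOS / traffic class byte.
ECN_MASK = 0b11
ECN_NOT_ECT = 0b00   # sender is not ECN capable
ECN_ECT_1 = 0b01     # ECN capable transport, codepoint 1
ECN_ECT_0 = 0b10     # ECN capable transport, codepoint 0
ECN_CE = 0b11        # congestion encountered


def set_ce(tos: int) -> Tuple[int, bool]:
    """Return (new_tos, ok); ok is False when the packet is not ECN capable."""
    if (tos & ECN_MASK) == ECN_NOT_ECT:
        # Marking a non-ECN packet is not allowed; leave it untouched.
        return tos, False
    # ECT(0), ECT(1) and CE all become CE; the DSCP bits are preserved.
    return tos | ECN_CE, True
```

A caller that gets `ok == False` has to fall back to another signal, such as dropping the packet, because a non-ECN sender will not react to a mark.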
1) Fix refcount leak in act_ipt during replace, from Davide Caratti.
2) Set task state properly in tun during blocking reads, from Timur
Celik.
3) Leaked reference in DSA, from Wen Yang.
4) NULL deref in act_tunnel_key, from Vlad Buslov.
5) cipso_v4_error can reference the skb IPCB in inappropriate contexts
thus referencing garbage, from Nazarov Sergey.
6) Don't accept RTA_VIA and RTA_GATEWAY in contexts where those
attributes make no sense.
7) Fix hung sendto in tipc, from Tung Nguyen.
8) Out-of-bounds access in netlabel, from Paul Moore.
9) Grant reference leak in xen-netback, from Igor Druzhinin.
10) Fix tx stalls with lan743x, from Bryan Whitehead.
11) Fix interrupt storm with mv88e6xxx, from Heiner Kallweit.
12) Memory leak in sit on device registry failure, from Mao Wenan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
net: sit: fix memory leak in sit_init_net()
net: dsa: mv88e6xxx: Fix statistics on mv88e6161
geneve: correctly handle ipv6.disable module parameter
net: dsa: mv88e6xxx: prevent interrupt storm caused by mv88e6390x_port_set_cmode
bpf: fix sanitation rewrite in case of non-pointers
ipv4: Add ICMPv6 support when parse route ipproto
MIPS: eBPF: Fix icache flush end address
lan743x: Fix TX Stall Issue
net: phy: phylink: fix uninitialized variable in phylink_get_mac_state
net: aquantia: regression on cpus with high cores: set mode with 8 queues
selftests: fixes for UDP GRO
bpf: drop refcount if bpf_map_new_fd() fails in map_create()
net: dsa: mv88e6xxx: power serdes on/off for 10G interfaces on 6390X
net: dsa: mv88e6xxx: Fix u64 statistics
xen-netback: don't populate the hash cache on XenBus disconnect
xen-netback: fix occasional leak of grant ref mappings under memory pressure
sctp: chunk.c: correct format string for size_t in printk
net: netem: fix skb length BUG_ON in __skb_to_sgvec
netlabel: fix out-of-bounds memory accesses
ipv4: Pass original device to ip_rcv_finish_core
...
Bluetooth: hci_qca: Reduce delay after sending baudrate request for WCN3990
The current 300ms delay after a baudrate change is extremely long.
For WCN3990 it is sufficient to wait 10ms after the baudrate change
request has been sent over the wire.
Linus Torvalds [Sat, 2 Mar 2019 16:32:02 +0000 (08:32 -0800)]
Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull more crypto fixes from Herbert Xu:
"This fixes a couple of issues in arm64/chacha that were introduced in
5.0"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: arm64/chacha - fix hchacha_block_neon() for big endian
crypto: arm64/chacha - fix chacha_4block_xor_neon() for big endian
Ido Schimmel [Fri, 1 Mar 2019 13:38:43 +0000 (13:38 +0000)]
net: ipv4: Fix NULL pointer dereference in route lookup
When calculating the multipath hash for input routes the flow info is
not available and therefore should not be used.
Fixes: 24ba14406c5c ("route: Add multipath_hash in flowi_common to make user-define hash") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Cc: wenxu <wenxu@ucloud.cn> Acked-by: wenxu <wenxu@ucloud.cn> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 2 Mar 2019 07:23:35 +0000 (23:23 -0800)]
Merge branch 'net-mvpp2-fixes-and-improvements'
Antoine Tenart says:
====================
net: mvpp2: fixes and improvements
This series aims to improve the Marvell PPv2 driver and to fix various
issues we encountered while testing the ports in many different
configurations. The series is based on top of Russell's PPv2 phylink
rework and improvement.
I'm not sending a v2 of the previous fixes series as half the patches
are not the same and lots of development happened in between.
While this series contains fixes, it's sent to net-next as it is based
on top of Russell's patches that were merged into net-next. I'm also
aiming at net-next as the series reworks critical paths of the PPv2
driver, such as the reset handling of various blocks, to give users
more weeks to test and to allow possible fixes to be sent before it
lands in a stable kernel version.
The series is divided into three parts:
- Patches 1 to 3 are cosmetic changes, sent alongside the series, as I
saw these small issues while working on this.
- Patches 5 to 8 are fixing (or improving) individual issues that we
found while testing PPv2.1 and PPv2.2 ports while using various
interfaces.
Notable fixes: RGMII interfaces are supported again (on both PPv2.1 and
PPv2.2), as their support was broken by previous patches. We also
reworked the RXQ computation as the RXQ assignment was not checking
the maximum number of RXQs available, and was broken for PPv2.1.
- As discussed in a previous fixes series, patches 9 to 15 rework the
way blocks are set in reset in the PPv2 engine (plus related changes).
There are four blocks whose reset status we want to control: two MACs
(GMAC and XLG MAC) and two PCSs (MPCS and XPCS). The XLG MAC is used
for 10G connections and uses the MPCS or the XPCS depending on the mode
used (10GKR / XAUI / RXAUI) and the GMAC is used for the other modes.
The idea is to keep all blocks in reset by default and whenever they are
not used, and to de-assert the reset only when a block is used. There
are four cases to take into account:
1. Boot time: all four blocks should be put in reset, as we do not
know their initial state (configured by the firmware/bootloader).
2. Link up: only the blocks used by a given mode should be put out of
reset (eg. 10GKR uses the XLG MAC and the MPCS).
3. Mode reconfiguration: some ports may support mode reconfiguration,
and switching between the GMAC and the XLG MAC (or between the two
PCS). All blocks should be put in reset, and only the one used
should be put out of reset.
4. Link down: all four blocks are put in reset.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
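The four reset cases listed in the cover letter can be modelled as a small lookup. This is a Python sketch, not driver code. The event and mode names are hypothetical labels. The 10GKR and RXAUI rows follow the series description, while mapping XAUI to the XPCS is an assumption the text does not state.

```python
ALL_BLOCKS = frozenset({"GMAC", "XLG_MAC", "MPCS", "XPCS"})

# Blocks that a mode needs out of reset. Any mode not listed uses the GMAC.
USED_BY_MODE = {
    "10GKR": frozenset({"XLG_MAC", "MPCS"}),
    "RXAUI": frozenset({"XLG_MAC", "XPCS"}),
    "XAUI": frozenset({"XLG_MAC", "XPCS"}),   # assumption, see lead-in
}


def blocks_out_of_reset(event: str, mode: str) -> frozenset:
    """Return the set of blocks whose reset is de-asserted after `event`."""
    if event in ("boot", "link_down"):
        # Cases 1 and 4: everything is held in reset.
        return frozenset()
    if event in ("link_up", "reconfigure"):
        # Cases 2 and 3: put everything in reset, then release only what
        # the current mode uses.
        return USED_BY_MODE.get(mode, frozenset({"GMAC"}))
    raise ValueError(f"unknown event: {event}")


def blocks_in_reset(event: str, mode: str) -> frozenset:
    """Complement of blocks_out_of_reset()."""
    return ALL_BLOCKS - blocks_out_of_reset(event, mode)
```

Expressed this way, boot and link down produce the same reset state regardless of mode.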
Antoine Tenart [Fri, 1 Mar 2019 10:52:17 +0000 (11:52 +0100)]
net: mvpp2: set the GMAC, XLG MAC, XPCS and MPCS in reset when a port is down
This patch adds calls in the stop() helper to ensure both MACs and
both PCS blocks are set in reset when the user manually sets a port
down. This is done so that we have the exact same block reset states at
boot time and when a port is set down.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Fri, 1 Mar 2019 10:52:16 +0000 (11:52 +0100)]
net: mvpp2: set the XPCS and MPCS in reset when not used
This patch sets both the XPCS and MPCS blocks in reset when they aren't
used. This is done both at boot time and when reconfiguring a port mode.
The advantage now is that only the PCS used is set out of reset when the
port is configured (10GKR uses the MPCS while RXAUI uses the XPCS).
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Fri, 1 Mar 2019 10:52:15 +0000 (11:52 +0100)]
net: mvpp2: reset the MACs when reconfiguring a port
This patch makes sure both PPv2 MACs (GMAC + XLG MAC) are set in reset
while a port is reconfigured. This is done so that we make sure a MAC is
in a reset state when not used, as only one of the two will be set out
of reset after the port is configured properly.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Fri, 1 Mar 2019 10:52:14 +0000 (11:52 +0100)]
net: mvpp2: rework the XLG MAC reset handling
This patch reworks the way the XLG MAC is set in reset: the XLG MAC is
set in reset at probe time and taken out of this state only when used.
The idea is to move towards a situation where only the blocks used are
taken out of reset. This also has the effect of handling the GMAC and the
XLG MAC in a similar way (the GMAC is already set in reset at boot
time).
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Fri, 1 Mar 2019 10:52:13 +0000 (11:52 +0100)]
net: mvpp2: force the XLG MAC link up or down when not using in-band
This patch forces the XLG MAC link state in the phylink link_up() and
link_down() helpers when not using in-band auto-negotiation. This mimics
what's already done for the GMAC and follows what's advised in the
phylink documentation.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Fri, 1 Mar 2019 10:52:12 +0000 (11:52 +0100)]
net: mvpp2: only update the XLG configuration when needed
This patch improves the XLG configuration function, to only update the
XLG configuration register when a change is needed. This avoids
writing the same XLG configuration over and over each time phylink
requests the MAC to be configured. This mimics the GMAC configuration
function.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Fri, 1 Mar 2019 10:52:11 +0000 (11:52 +0100)]
net: mvpp2: always disable both MACs when disabling a port
This patch modifies the port_disable() helper to always disable both the
GMAC and the XLG MAC when called. At boot time we do not know if a port
was enabled in the firmware/bootloader, and if so what mode was used
(hence which of the two MACs was used).
This also helps in implementing a logic where all blocks are disabled
when not used, and only enabled according to the current mode used on a
given port.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Fri, 1 Mar 2019 10:52:10 +0000 (11:52 +0100)]
net: mvpp2: some AN fields require the link to be down when updated
The GMAC configuration helper modifies values in the auto-negotiation
register. Some of these values require the port to be forced down when
they are modified. This patch fixes the check made on the bits to
be updated in this register, so that the port is forced down when
needed. This fixes cases where some of those parameters were updated, but
not taken into account, such as when using RGMII interfaces.
Fixes: d14e078f23cc ("net: marvell: mvpp2: only reprogram what is necessary on mac_config") Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Fri, 1 Mar 2019 10:52:09 +0000 (11:52 +0100)]
net: mvpp2: fix the computation of the RXQs
The patch fixes the computation of RXQs being used by the PPv2 driver,
which is set depending on the PPv2 engine version and the queue mode
used. There are three cases:
- PPv2.1: 1 RXQ per CPU.
- PPv2.2 with MVPP2_QDIST_MULTI_MODE: 1 RXQ per CPU.
- PPv2.2 with MVPP2_QDIST_SINGLE_MODE: 1 RXQ is shared between the CPUs.
The PPv2 engine supports a maximum of 32 queues per port. This patch
adds a check so that we do not overstep this maximum.
It appeared the calculation was broken for PPv2.1 engines since
f8c6ba8424b0, as PPv2.1 ports ended up with a single RXQ while they
needed 4. This patch fixes it.
Fixes: f8c6ba8424b0 ("net: mvpp2: use only one rx queue per port per CPU") Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
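The three cases above reduce to a short function. This is a Python sketch with hypothetical names. The commit message only says a check is added so the 32-queue maximum is not overstepped; clamping to the maximum is an assumption made here for illustration.

```python
MAX_RXQ_PER_PORT = 32   # maximum stated in the commit message


def compute_nrxqs(hw_version: str, queue_mode: str, ncpus: int) -> int:
    """Number of RXQs for a port, following the three cases above."""
    if hw_version == "PPv2.2" and queue_mode == "single":
        # One RXQ shared between all CPUs.
        nrxqs = 1
    else:
        # PPv2.1 (any queue mode) and PPv2.2 in multi mode: one RXQ per CPU.
        nrxqs = ncpus
    # Assumption: clamp rather than fail when exceeding the hardware limit.
    return min(nrxqs, MAX_RXQ_PER_PORT)
```

On a 4-CPU PPv2.1 system this gives 4 RXQs, matching the count the commit message says those ports needed.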
Antoine Tenart [Fri, 1 Mar 2019 10:52:08 +0000 (11:52 +0100)]
net: mvpp2: fix validate for PPv2.1
The Phylink validate function in the Marvell PPv2 driver makes a check
on the GoP id. This is valid and has to be done when using PPv2.2 engines
but makes no sense when using PPv2.1. The check done when using an RGMII
interface makes sure the GoP id is not 0, but this breaks PPv2.1. Fix
it.
Fixes: 0fb628f0f250 ("net: mvpp2: fix phylink handling of invalid PHY modes") Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Fri, 1 Mar 2019 10:52:07 +0000 (11:52 +0100)]
net: mvpp2: reconfiguring the port interface is PPv2.2 specific
This patch adds a check on the PPv2 version in use, so that the port
mode is not reconfigured when an interface is updated on PPv2.1, as the
functions called are PPv2.2 specific.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Fri, 1 Mar 2019 10:52:06 +0000 (11:52 +0100)]
net: mvpp2: a port can be disabled even if we use the link IRQ
We had a check in the mvpp2_mac_link_down() function (called by phylink)
to avoid disabling the port when link interrupts are used. It turned out
the interrupt can still be used with the port disabled. We can thus
remove this check.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Fri, 1 Mar 2019 10:52:04 +0000 (11:52 +0100)]
net: mvpp2: update the port documentation regarding the GoP
The Marvell PPv2 port structure stores the GoP id of a given port. This
information is specific to PPv2.2 and cannot be used by PPv2.1. Update
its comment to denote this specificity.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
cxgb4vf: Call netif_carrier_off properly in pci_probe
netif_carrier_off() should be called only after register_netdev().
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com> Signed-off-by: Vishal Kulkarni <vishal@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Reverting force link up changes since this behaviour can be
achieved using VF link state feature.
Reverts:
commit 0913667ab3ad ("cxgb4vf: Forcefully link up virtual interfaces")
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com> Signed-off-by: Vishal Kulkarni <vishal@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Use ndo_set_vf_link_state to control the link states associated
with the virtual interfaces.
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com> Signed-off-by: Vishal Kulkarni <vishal@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Some of these macros were conflicting with global namespace,
hence prefixing them with CXGB4VF.
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com> Signed-off-by: Vishal Kulkarni <vishal@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Sean Wang <sean.wang@mediatek.com> for mt7530 and mtk_eth_soc Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 28 Feb 2019 23:17:28 +0000 (15:17 -0800)]
net: support 64bit rates for getsockopt(SO_MAX_PACING_RATE)
For legacy applications using 32bit variable, SO_MAX_PACING_RATE
has to cap the returned value to 0xFFFFFFFF, meaning that
rates above 34.35 Gbit are capped.
This patch allows applications to read socket pacing rate
at full resolution, if they provide a 64bit variable to store it,
and the kernel is 64bit.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
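The capping rule can be written out as a small model. This is a Python sketch with hypothetical function names, not the kernel code. It relies on SO_MAX_PACING_RATE being expressed in bytes per second, and treats `optlen` as the size in bytes of the variable the application passes to getsockopt().

```python
U32_MAX = 0xFFFFFFFF


def reported_pacing_rate(rate_bytes_per_sec: int, optlen: int,
                         kernel_64bit: bool = True) -> int:
    """Value an application would read back for a given buffer size."""
    if optlen >= 8 and kernel_64bit:
        # 64-bit variable on a 64-bit kernel: full resolution.
        return rate_bytes_per_sec
    # Legacy 32-bit variable: the value is capped.
    return min(rate_bytes_per_sec, U32_MAX)


def to_gbit_per_sec(rate_bytes_per_sec: int) -> float:
    """Convert bytes per second to Gbit/s."""
    return rate_bytes_per_sec * 8 / 1e9
```

The cap of 0xFFFFFFFF bytes per second is 34,359,738,360 bit/s, which is the roughly 34.35 Gbit limit quoted above.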
Lucas Bates [Thu, 28 Feb 2019 22:38:40 +0000 (17:38 -0500)]
tc-testing: Allow test cases to be skipped
By adding a check for an optional key/value pair to the test case
data, individual test cases may be skipped to prevent tdc from
aborting a test run due to setup or teardown failure.
If a test case is skipped, it will still appear in the results
output to allow for a consistent number of executed tests in each
run. However, the test will be marked as skipped.
This support for skipping extends to any plugins that may generate
additional results for each executed test.
Signed-off-by: Lucas Bates <lucasb@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
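The behaviour described here can be sketched as a simple runner loop. This is a Python illustration, not the tdc code. The key name "skip", its value "yes", the result strings, and the `execute` callback are all assumptions, since the commit message does not name them.

```python
from typing import Callable, Dict, List


def run_tests(test_cases: List[Dict],
              execute: Callable[[Dict], bool]) -> List[Dict]:
    """Run every test case; skipped ones still appear in the results."""
    results = []
    for case in test_cases:
        # Hypothetical optional key/value pair marking a test as skipped.
        if case.get("skip") == "yes":
            results.append({"id": case["id"], "result": "skipped"})
            continue
        passed = execute(case)
        results.append({"id": case["id"],
                        "result": "ok" if passed else "not ok"})
    return results
```

Because skipped cases are still appended, the number of entries in the output stays the same from run to run, which is the consistency the commit message asks for.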
When IPv6 is compiled but disabled at runtime, geneve_sock_add returns
-EAFNOSUPPORT. For metadata based tunnels, this causes failure of the whole
operation of bringing up the tunnel.
Ignore failure of IPv6 socket creation for metadata based tunnels caused by
IPv6 not being available.
This is the same fix as what commit d074bf960044 ("vxlan: correctly handle
ipv6.disable module parameter") is doing for vxlan.
Note there's also commit c0a47e44c098 ("geneve: should not call rt6_lookup()
when ipv6 was disabled") which fixes a similar issue but for regular
tunnels, while this patch is needed for metadata based tunnels.
Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 2 Mar 2019 05:44:11 +0000 (21:44 -0800)]
Merge branch 'mlxsw-rehash-split'
Ido Schimmel says:
====================
mlxsw: spectrum_acl: Split rehash work into chunks
Jiri says:
When rehash happens on a vregion with many rules and they are being
migrated, it might take significant time to finish the job. During that
time vregion->lock is taken which prevents rules from being
added/deleted from the vregion.
The aim of this patchset is to allow the migration of rules to be
interrupted during rehash, rescheduled, and to give rules a chance to be
added/deleted. Migration then continues in another execution of the
scheduled work.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 28 Feb 2019 06:59:26 +0000 (06:59 +0000)]
mlxsw: spectrum_acl: Remember where to continue rehash migration
Store a pointer to the vchunk where the migration was interrupted, as well as
the ventry pointers to start from and to stop at (during rollback). These
saved pointers need to be forgotten when the ventries list or the vchunk
list changes, which is done by a couple of "changed" helpers.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 28 Feb 2019 06:59:25 +0000 (06:59 +0000)]
mlxsw: spectrum_acl: Allow to interrupt/continue rehash work
Currently, migration of vregions with many entries may take a long time,
during which insertions and removals of rules are blocked
waiting to acquire vregion->lock.
To overcome this, allow the rehash work to be interrupted and continued
according to the set credits (the number of rules to migrate).
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
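The credit-based interruption can be illustrated with a toy model. This is a Python sketch, not the mlxsw implementation: the class and method names are hypothetical, a plain list index stands in for the saved vchunk/ventry pointers, and rules are only ever appended, so the saved position never needs to be invalidated.

```python
import threading


class Vregion:
    def __init__(self, rules):
        self.lock = threading.Lock()   # stands in for vregion->lock
        self.rules = list(rules)
        self.next_index = 0            # where migration resumes
        self.migrated = []

    def rehash_work(self, credits: int) -> bool:
        """Migrate at most `credits` rules; return True if work remains."""
        with self.lock:
            while credits > 0 and self.next_index < len(self.rules):
                self.migrated.append(self.rules[self.next_index])
                self.next_index += 1
                credits -= 1
            # The lock is dropped on return, so add_rule() can run before
            # the caller reschedules the remaining work.
            return self.next_index < len(self.rules)

    def add_rule(self, rule) -> None:
        with self.lock:
            self.rules.append(rule)
```

A caller would keep rescheduling `rehash_work()` while it returns True, so the lock is held only for one bounded chunk at a time.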
Jiri Pirko [Thu, 28 Feb 2019 06:59:24 +0000 (06:59 +0000)]
mlxsw: spectrum_acl: Do rollback as another call to mlxsw_sp_acl_tcam_vchunk_migrate_all()
In order to simplify the code and to prepare it for
interrupted/continued migration process, do the rollback in case of
migration error as another call to mlxsw_sp_acl_tcam_vchunk_migrate_all().
It can be understood as "migrate all back".
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 28 Feb 2019 06:59:24 +0000 (06:59 +0000)]
mlxsw: spectrum_acl: Put vchunk migrate start/end code into separate functions
In preparation for interrupt/continue of rehash work, put the code that
is done at the beginning/end of vchunk migrate function into separate
functions.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>