]> asedeno.scripts.mit.edu Git - linux.git/log
linux.git
5 years agoMerge mlx5-next into rdma for-next
Jason Gunthorpe [Wed, 3 Jul 2019 19:43:45 +0000 (16:43 -0300)]
Merge mlx5-next into rdma for-next

From git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Required for dependencies in the next patches.

Resolved the conflicts:
 - esw_destroy_offloads_acl_tables() use the newer mlx5_esw_for_all_vports()
   version
 - esw_offloads_steering_init() drop the cap test
 - esw_offloads_init() drop the extra function arguments

* branch 'mlx5-next': (39 commits)
  net/mlx5: Expose device definitions for object events
  net/mlx5: Report EQE data upon CQ completion
  net/mlx5: Report a CQ error event only when a handler was set
  net/mlx5: mlx5_core_create_cq() enhancements
  net/mlx5: Expose the API to register for ANY event
  net/mlx5: Use event mask based on device capabilities
  net/mlx5: Fix mlx5_core_destroy_cq() error flow
  net/mlx5: E-Switch, Handle UC address change in switchdev mode
  net/mlx5: E-Switch, Consider host PF for inline mode and vlan pop
  net/mlx5: E-Switch, Use iterator for vlan and min-inline setups
  net/mlx5: E-Switch, Reg/unreg function changed event at correct stage
  net/mlx5: E-Switch, Consolidate eswitch function number of VFs
  net/mlx5: E-Switch, Refactor eswitch SR-IOV interface
  net/mlx5: Handle host PF vport mac/guid for ECPF
  net/mlx5: E-Switch, Use correct flags when configuring vlan
  net/mlx5: Reduce dependency on enabled_vfs counter and num_vfs
  net/mlx5: Don't handle VF func change if host PF is disabled
  net/mlx5: Limit scope of mlx5_get_next_phys_dev() to PCI PF devices
  net/mlx5: Move pci status reg access mutex to mlx5_pci_init
  net/mlx5: Rename mlx5_pci_dev_type to mlx5_coredev_type
  ...

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/mlx5: Fixed reporting counters on 2nd port for Dual port RoCE
Parav Pandit [Sun, 30 Jun 2019 07:52:52 +0000 (10:52 +0300)]
IB/mlx5: Fixed reporting counters on 2nd port for Dual port RoCE

Currently during dual port IB device registration in below code flow,

ib_register_device()
  ib_device_register_sysfs()
    ib_setup_port_attrs()
      add_port()
        get_counter_table()
          get_perf_mad()
            process_mad()
              mlx5_ib_process_mad()

mlx5_ib_process_mad() fails on 2nd port when both the ports are not fully
setup at the device level (because 2nd port is unaffiliated).

As a result, get_perf_mad() registers different PMA counter group for 1st
and 2nd port, namely pma_counter_ext and pma_counter. However both ports
have the same capability and counter offsets.

Due to this when counters are read by the user via sysfs in below code
flow, counters are queried from wrong location from the device mainly from
PPCNT instead of VPORT counters.

show_pma_counter()
  get_perf_mad()
    process_mad()
      mlx5_ib_process_mad()
        process_pma_cmd()

This shows all zero counters for 2nd port.

To overcome this, process_pma_cmd() is invoked, and when unaffiliated port
is not yet setup during device registration phase, make the query on the
first port.  while at it, only process_pma_cmd() needs to work on the
native port number and underlying mdev, so shift the get, put calls to
where its needed inside process_pma_cmd().

Fixes: 212f2a87b74f ("IB/mlx5: Route MADs for dual port RoCE")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agonet/mlx5: Expose device definitions for object events
Yishai Hadas [Sun, 30 Jun 2019 16:23:28 +0000 (19:23 +0300)]
net/mlx5: Expose device definitions for object events

Expose an extra device definitions for objects events.

It includes: object_type values for legacy objects and generic data
header for any other object.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
5 years agonet/mlx5: Report EQE data upon CQ completion
Yishai Hadas [Sun, 30 Jun 2019 16:23:27 +0000 (19:23 +0300)]
net/mlx5: Report EQE data upon CQ completion

Report EQE data upon CQ completion to let upper layers use this data.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
5 years agonet/mlx5: Report a CQ error event only when a handler was set
Yishai Hadas [Sun, 30 Jun 2019 16:23:26 +0000 (19:23 +0300)]
net/mlx5: Report a CQ error event only when a handler was set

Report a CQ error event only when a handler was set.

This enables mlx5_ib to not set a handler upon CQ creation and use some
other mechanism to get this event as of other events by the
mlx5_eq_notifier_register API.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
5 years agonet/mlx5: mlx5_core_create_cq() enhancements
Yishai Hadas [Sun, 30 Jun 2019 16:23:25 +0000 (19:23 +0300)]
net/mlx5: mlx5_core_create_cq() enhancements

Enhance mlx5_core_create_cq() to get the command out buffer from the
callers to let them use the output.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
5 years agonet/mlx5: Expose the API to register for ANY event
Yishai Hadas [Sun, 30 Jun 2019 16:23:24 +0000 (19:23 +0300)]
net/mlx5: Expose the API to register for ANY event

Expose the API to register for ANY event, mlx5_ib will be able to use
this functionality for its needs.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
5 years agonet/mlx5: Use event mask based on device capabilities
Yishai Hadas [Sun, 30 Jun 2019 16:23:23 +0000 (19:23 +0300)]
net/mlx5: Use event mask based on device capabilities

Use the reported device capabilities for the supported user events (i.e.
affiliated and un-affiliated) to set the EQ mask.

As the event mask can be up to 256 defined by 4 entries of u64 change
the applicable code to work accordingly.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
5 years agonet/mlx5: Fix mlx5_core_destroy_cq() error flow
Yishai Hadas [Sun, 30 Jun 2019 16:23:22 +0000 (19:23 +0300)]
net/mlx5: Fix mlx5_core_destroy_cq() error flow

The firmware command to destroy a CQ might fail when the object is
referenced by other object and the ref count is managed by the firmware.

To enable a second successful destruction post the first failure need to
change  mlx5_eq_del_cq() to be a void function.

As an error in mlx5_eq_del_cq() is quite fatal from the option to
recover, a debug message inside it should be good enougth and it was
changed to be void.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
5 years agoRDMA/hns: Remove set but not used variable 'fclr_write_fail_flag'
YueHaibing [Wed, 3 Jul 2019 03:10:21 +0000 (03:10 +0000)]
RDMA/hns: Remove set but not used variable 'fclr_write_fail_flag'

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/infiniband/hw/hns/hns_roce_hw_v2.c: In function 'hns_roce_function_clear':
drivers/infiniband/hw/hns/hns_roce_hw_v2.c:1135:7: warning:
 variable 'fclr_write_fail_flag' set but not used [-Wunused-but-set-variable]

It is never used, so can be removed.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/i40iw: Set queue pair state when being queried
Liu, Changcheng [Fri, 28 Jun 2019 06:16:13 +0000 (14:16 +0800)]
RDMA/i40iw: Set queue pair state when being queried

The API for ib_query_qp requires the driver to set qp_state and
cur_qp_state on return, add the missing sets.

Fixes: d37498417947 ("i40iw: add files for iwarp interface")
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/i40iw: Use kmemdup rather than open coding
Fuqian Huang [Wed, 3 Jul 2019 16:27:42 +0000 (00:27 +0800)]
IB/i40iw: Use kmemdup rather than open coding

Use kmemdump instead of kzmalloc + memcpy.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/ipoib: Remove memset after vzalloc in ipoib_cm.c
Fuqian Huang [Thu, 27 Jun 2019 17:38:04 +0000 (01:38 +0800)]
IB/ipoib: Remove memset after vzalloc in ipoib_cm.c

vzalloc has already zeroed the memory.  So a memset is unneeded.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB: Remove unneeded memset
Fuqian Huang [Fri, 28 Jun 2019 02:47:19 +0000 (10:47 +0800)]
IB: Remove unneeded memset

In commit af7ddd8a627c ("Merge tag 'dma-mapping-4.21' of
git://git.infradead.org/users/hch/dma-mapping"),
dma_alloc_coherent/dmam_alloc_coherent always zeroed the returned memory.
So the memset after a coherent allocation function is not needed.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoMerge branch 'siw' into rdma.git for-next
Jason Gunthorpe [Tue, 2 Jul 2019 19:57:54 +0000 (16:57 -0300)]
Merge branch 'siw' into rdma.git for-next

Bernard Metzler says:

====================
This patch set contributes the SoftiWarp driver rebased for latest
rdma-next. SoftiWarp (siw) implements the iWarp RDMA protocol over kernel
TCP sockets. The driver integrates with the linux-rdma framework.

A matching userlevel driver is available as PR at
https://github.com/linux-rdma/rdma-core/pull/536

Many thanks for reviewing and testing the driver, especially to Leon,
Jason, Steve, Doug, Olga, Dennis, Gal. You all helped to significantly
improve the driver over the last year.

Please find below a list of changes and comments, compared to older
versions of the siw driver.

Many thanks!
Bernard.

CHANGES:
========

v3 (this version)
-----------------

- Rebased to rdma-next

- Removed unneccessary initialization of enums in siw-abi.h

- Added comment on sizing of all work queues to power of two.

v2
-----------------

- Changed recieve path CRC calculation to compute CRC32c not
  on target buffer after placement, but on original skbuf.
  This change severely hurts performance, if CRC is switched
  on, since skb must now be walked twice. It is planned to
  work on an extension to skb_copy_bits() to fold in CRC
  computation.

- Moved debugging to using ibdev_dbg().

- Dropped detailed packet debug printing.

- Removed siw_debug.[ch] files.

- Removed resource tracking, code now relies on restrack of
  RDMA midlayer. Only object counting to enforce reported
  device limits is left in place.

- Removed all nested switch-case statements.

- Cleaned up header file #include's

- Moved CQ create/destroy to new semantics,
  where midlayer creates/destroys containing object.

- Set siw's ABI version to 1 (was 0 before)

- Removed all enum initialization where not needed.

- Fixed MAINTANERS entry for siw driver

- This version stays with the current siw specific
  management of user memory (siw_umem_get() vs.
  ib_umem_get(), etc.). This, since the current ib_umem
  implementation is less efficient for user page lookup
  on the fast path, where effciency is important for a
  SW RDMA driver.
  It is planned to contribute enhancements to the ib_umem
  framework, wich makes it suitable for SW drivers as well.

v1 (first version after v9 of siw RFC)
--------------------------------------

- Rebased to 5.2-rc1

- All IDR code got removed.

- Both MR and QP deallocation verbs now synchronously
  free the resources referenced by the RDMA mid-layer.

- IPv6 support was added.

- For compatibility with Chelsio iWarp hardware, the RX
  path was slightly reworked. It now allows packet intersection
  between tagged and untagged RDMAP operations. While not
  a defined behavior as of IETF RFC 5040/5041, some RDMA hardware
  may intersect an ongoing outbound (large) tagged message, such
  as an multisegment RDMA Read Response with sending an untagged
  message, such as an RDMA Send frame. This behavior was only
  detected in an NVMeF setup, where siw was used at target side,
  and RDMA hardware at client side (during file write). siw now
  implements two input paths for tagged and untagged messages each,
  and allows the intersected placement of both messages.

- The siw kernel abi file got renamed from siw_user.h to siw-abi.h.
====================

* branch 'siw':
  SIW addition to kernel build environment
  SIW completion queue methods
  SIW receive path
  SIW transmit path
  SIW queue pair methods
  SIW application buffer management
  SIW application interface
  SIW connection management
  SIW network and RDMA core interface
  SIW main include file
  iWarp wire packet format

5 years agordma/siw: addition to kernel build environment
Bernard Metzler [Thu, 20 Jun 2019 16:21:33 +0000 (18:21 +0200)]
rdma/siw: addition to kernel build environment

Broken up commit to add the Soft iWarp RDMA driver.

Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agordma/siw: completion queue methods
Bernard Metzler [Thu, 20 Jun 2019 16:21:32 +0000 (18:21 +0200)]
rdma/siw: completion queue methods

Broken up commit to add the Soft iWarp RDMA driver.

Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agordma/siw: receive path
Bernard Metzler [Thu, 20 Jun 2019 16:21:31 +0000 (18:21 +0200)]
rdma/siw: receive path

Broken up commit to add the Soft iWarp RDMA driver.

Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agordma/siw: transmit path
Bernard Metzler [Thu, 20 Jun 2019 16:21:30 +0000 (18:21 +0200)]
rdma/siw: transmit path

Broken up commit to add the Soft iWarp RDMA driver.

Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agordma/siw: queue pair methods
Bernard Metzler [Thu, 20 Jun 2019 16:21:29 +0000 (18:21 +0200)]
rdma/siw: queue pair methods

Broken up commit to add the Soft iWarp RDMA driver.

Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agordma/siw: application buffer management
Bernard Metzler [Thu, 20 Jun 2019 16:21:28 +0000 (18:21 +0200)]
rdma/siw: application buffer management

Broken up commit to add the Soft iWarp RDMA driver.

Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agordma/siw: application interface
Bernard Metzler [Thu, 20 Jun 2019 16:21:27 +0000 (18:21 +0200)]
rdma/siw: application interface

Broken up commit to add the Soft iWarp RDMA driver.

Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agordma/siw: connection management
Bernard Metzler [Thu, 20 Jun 2019 16:21:26 +0000 (18:21 +0200)]
rdma/siw: connection management

Broken up commit to add the Soft iWarp RDMA driver.

Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agordma/siw: network and RDMA core interface
Bernard Metzler [Thu, 20 Jun 2019 16:21:25 +0000 (18:21 +0200)]
rdma/siw: network and RDMA core interface

Broken up commit to add the Soft iWarp RDMA driver.

Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agordma/siw: main include file
Bernard Metzler [Thu, 20 Jun 2019 16:21:24 +0000 (18:21 +0200)]
rdma/siw: main include file

Broken up commit to add the Soft iWarp RDMA driver.

Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agordma/siw: iWarp wire packet format
Bernard Metzler [Thu, 20 Jun 2019 16:21:23 +0000 (18:21 +0200)]
rdma/siw: iWarp wire packet format

Broken up commit to add the Soft iWarp RDMA driver.

Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agonet/mlx5: E-Switch, Handle UC address change in switchdev mode
Bodong Wang [Fri, 28 Jun 2019 22:36:23 +0000 (22:36 +0000)]
net/mlx5: E-Switch, Handle UC address change in switchdev mode

When NVME device emulation mode is enabled, more than one PFs use the
same physical port. In this case, MPFS is required to program L2
addresses.

It used to rely on netdev set_rx_mode in switchdev mode, but driver
later changed to not create netdev for eswitch manager once in
switchdev mode. So, UC address event should be handled.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Consider host PF for inline mode and vlan pop
Bodong Wang [Fri, 28 Jun 2019 22:36:22 +0000 (22:36 +0000)]
net/mlx5: E-Switch, Consider host PF for inline mode and vlan pop

When ECPF is the eswitch manager, host PF is treated like other VFs.
Driver should do the same for inline mode and vlan pop.

Add new iterators to include host PF if ECPF is the eswitch manager.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Use iterator for vlan and min-inline setups
Bodong Wang [Fri, 28 Jun 2019 22:36:20 +0000 (22:36 +0000)]
net/mlx5: E-Switch, Use iterator for vlan and min-inline setups

Use the defined iterators to traversal VF reps/vport. Also, rely on
num of VFs rather than the counter of enabled vports as PF will also
be enabled from ECPF side, and the counter will be different from
num of VFs.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Reg/unreg function changed event at correct stage
Bodong Wang [Fri, 28 Jun 2019 22:36:18 +0000 (22:36 +0000)]
net/mlx5: E-Switch, Reg/unreg function changed event at correct stage

When driver is doing eswitch mode change, it's critical to keep number
of enabled VFs unchanged. However, it can be changed on the fly once
function changed event is registered.

To remove this uncertainty, function changed event should not be
registered before all setups, and first be unregistered before all
cleanups. Wrap this functionality together with vport event handler.

Fixes: 61fc880839e6 ("net/mlx5: E-Switch, Handle representors creation in handler context")
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Consolidate eswitch function number of VFs
Bodong Wang [Fri, 28 Jun 2019 22:36:16 +0000 (22:36 +0000)]
net/mlx5: E-Switch, Consolidate eswitch function number of VFs

Enabled number of VFs is key for eswich manager to do flow steering
initialization and vport configurations. However, the number of
enabled VFs may come from two sources as below.

PF: num of VFs is provided by enabled SR-IOV of itself.
ECPF: num of VFs is provided by enabled SR-IOV from its peer PF. And
      SR-IOV can't be enabled from ECPF itself.

Current driver handles the two cases in different stages and passing
the number of enabled VFs among a large scope of internal functions.
It is usually hard to find out where is the real number of VFs from
due to layers of argument pass-in.

This patch consolidated that number from the entry point of doing
eswitch setup, and maintained a copy so that eswitch functions can
refer to it directly.

Eswitch driver shall always use this number when referring to enabled
number of VFs, don't use other numbers such as from SR-IOV.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Refactor eswitch SR-IOV interface
Bodong Wang [Fri, 28 Jun 2019 22:36:15 +0000 (22:36 +0000)]
net/mlx5: E-Switch, Refactor eswitch SR-IOV interface

Devlink eswitch mode is not necessarily related to SR-IOV, e.g, ECPF
can be at offload mode when SR-IOV is not enabled.

Rename the interface and eswitch mode names to decouple from SR-IOV,
and cleanup eswitch messages accordingly.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Handle host PF vport mac/guid for ECPF
Bodong Wang [Fri, 28 Jun 2019 22:36:13 +0000 (22:36 +0000)]
net/mlx5: Handle host PF vport mac/guid for ECPF

When ECPF is eswitch manager, it has the privilege to query and
configure the mac and node guid of host PF.

While vport number of host PF is 0, the vport command should be
issued with other_vport set in this case as the cmd is issued by
ECPF vport(0xfffe).

Add a specific function to query own vport mac. Low level functions
are used by vport manager to query/modify any vport mac and node guid.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Use correct flags when configuring vlan
Bodong Wang [Fri, 28 Jun 2019 22:36:11 +0000 (22:36 +0000)]
net/mlx5: E-Switch, Use correct flags when configuring vlan

Before the offending commit, vlan will be configured if either vlan
or qos is set. After the change with new set flags, function callers
should provide flags accordingly.

Fixes: e33dfe316cf3 ("net/mlx5: E-Switch, Allow fine tuning of eswitch vport push/pop vlan")
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Reduce dependency on enabled_vfs counter and num_vfs
Parav Pandit [Fri, 28 Jun 2019 22:36:06 +0000 (22:36 +0000)]
net/mlx5: Reduce dependency on enabled_vfs counter and num_vfs

While enabling SR-IOV, PCI core already checks that if SR-IOV is already
enabled, it returns failure error code.
Hence, remove such duplicate check from mlx5_core driver.

While at it, make mlx5_device_disable_sriov() to perform cleanup of VFs in
reverse order of mlx5_device_enable_sriov().

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Don't handle VF func change if host PF is disabled
Bodong Wang [Fri, 28 Jun 2019 22:36:04 +0000 (22:36 +0000)]
net/mlx5: Don't handle VF func change if host PF is disabled

When ECPF eswitch manager is at offloads mode, it monitors functions
changed event from host PF side and acts according to the number of
VFs enabled/disabled.

As ECPF and host PF work in two independent hosts, it's possible that
host PF OS reboots but ECPF system is still kept on and continues
monitoring events from host PF. When kernel from host PF side is
booting, PCI iov driver does sriov_init and compute_max_vf_buses by
iterating over all valid num of VFs. This triggers FLR and generates
functions changed events, even though host PF HCA is not enabled at
this time. However, ECPF is not aware of this information, and still
handles these events as usual. ECPF system will see massive number of
reps are created, but destroyed immediately once creation finished.

To eliminate this noise, a bit is added to host parameter context to
indicate host PF is disabled. ECPF will not handle the VF changed
event if this bit is set.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Limit scope of mlx5_get_next_phys_dev() to PCI PF devices
Parav Pandit [Fri, 28 Jun 2019 22:36:02 +0000 (22:36 +0000)]
net/mlx5: Limit scope of mlx5_get_next_phys_dev() to PCI PF devices

As mlx5_get_next_phys_dev is used only for PCI PF devices use case,
limit it to search only for PCI devices.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Vu Pham <vuhuong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Move pci status reg access mutex to mlx5_pci_init
Parav Pandit [Fri, 28 Jun 2019 22:36:00 +0000 (22:36 +0000)]
net/mlx5: Move pci status reg access mutex to mlx5_pci_init

mlx5_pci_init() performs pci specific initialization of the
mlx5_core_dev struct.
Hence move pci_status_mutex to pci initialization routine
mlx5_pci_init().
This allows reusing mlx5_mdev_init() to non PCI devices.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Vu Pham <vuhuong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Rename mlx5_pci_dev_type to mlx5_coredev_type
Huy Nguyen [Fri, 28 Jun 2019 22:35:58 +0000 (22:35 +0000)]
net/mlx5: Rename mlx5_pci_dev_type to mlx5_coredev_type

Rename mlx5_pci_dev_type to mlx5_coredev_type to distinguish different mlx5
device types.

mlx5_coredev_type represents mlx5_core_dev instance type. Hence keep
mlx5_coredev_type in mlx5_core_dev structure.

Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agoRDMA/mlx5: Cleanup rep when doing unload
Bodong Wang [Fri, 28 Jun 2019 22:35:55 +0000 (22:35 +0000)]
RDMA/mlx5: Cleanup rep when doing unload

When an IB rep is loaded, netdev for the same vport is saved for later
reference. However, it's not cleaned up when doing unload. For ECPF,
kernel crashes when driver is referring to the already removed netdev.

Following steps lead to a shown call trace:
1. Create n VFs from host PF
2. Distroy the VFs
3. Run "rdma link" from ARM

Call trace:
  mlx5_ib_get_netdev+0x9c/0xe8 [mlx5_ib]
  mlx5_query_port_roce+0x268/0x558 [mlx5_ib]
  mlx5_ib_rep_query_port+0x14/0x34 [mlx5_ib]
  ib_query_port+0x9c/0xfc [ib_core]
  fill_port_info+0x74/0x28c [ib_core]
  nldev_port_get_doit+0x1a8/0x1e8 [ib_core]
  rdma_nl_rcv_msg+0x16c/0x1c0 [ib_core]
  rdma_nl_rcv+0xe8/0x144 [ib_core]
  netlink_unicast+0x184/0x214
  netlink_sendmsg+0x288/0x354
  sock_sendmsg+0x18/0x2c
  __sys_sendto+0xbc/0x138
  __arm64_sys_sendto+0x28/0x34
  el0_svc_common+0xb0/0x100
  el0_svc_handler+0x6c/0x84
  el0_svc+0x8/0xc

Cleanup the rep and netdev reference when unloading IB rep.

Fixes: 26628e2d58c9 ("RDMA/mlx5: Move to single device multiport ports in switchdev mode")
Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years ago{IB, net}/mlx5: E-Switch, Use index of rep for vport to IB port mapping
Bodong Wang [Fri, 28 Jun 2019 22:35:53 +0000 (22:35 +0000)]
{IB, net}/mlx5: E-Switch, Use index of rep for vport to IB port mapping

In the single IB device mode, the mapping between vport number and
rep relies on a counter. However for dynamic vport allocation, it is
desired to keep consistent map of eswitch vport and IB port.

Hence, simplify code to remove the free running counter and instead
use the available vport index during load/unload sequence from the
eswitch.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Suggested-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Use vport index when init rep
Bodong Wang [Fri, 28 Jun 2019 22:35:51 +0000 (22:35 +0000)]
net/mlx5: E-Switch, Use vport index when init rep

Driver is referring to the array index when doing rep initialization,
using vport is confusing as it's normally interpreted as vport number.

This patch doesn't change any functionality.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Added MCQI and MCQS registers' description to ifc
Shay Agroskin [Fri, 28 Jun 2019 22:35:50 +0000 (22:35 +0000)]
net/mlx5: Added MCQI and MCQS registers' description to ifc

Given a fw component index, the MCQI register allows us to query
this component's information (e.g. its version and capabilities).

Given a fw component index, the MCQS register allows us to query the
status of a fw component, including its type and state
(e.g. PRESET/IN_USE).
It can be used to find the index of a component of a specific type, by
sequentially increasing the component index, and querying each time the
type of the returned component.
If max component index is reached, 'last_index_flag' is set by the HCA.

These registers' description was added to query the running and pending
fw version of the HCA.

Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Add hardware definitions for sub functions
Parav Pandit [Fri, 28 Jun 2019 22:35:48 +0000 (22:35 +0000)]
net/mlx5: Add hardware definitions for sub functions

Update mlx5 device interface data structures for:
1. New command definitions for allocating, deallocating SF
2. Query SF partition
3. Eswitch SF fields
4. HCA CAP SF fields
5. Extend Eswitch functions command for SF

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Vu Pham <vuhuong@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agoIB/hfi1: No need to use try_module_get for debugfs
Dennis Dalessandro [Fri, 28 Jun 2019 18:22:46 +0000 (14:22 -0400)]
IB/hfi1: No need to use try_module_get for debugfs

The call in debugfs.c for try_module_get() is not needed. A reference to
the module will be taken by the VFS layer as long as the owner field is
set in the file ops struct. So set this as well as remove the call.

Suggested-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/rdmavt: Add trace for map_mr_sg
Mike Marciniszyn [Fri, 28 Jun 2019 18:22:39 +0000 (14:22 -0400)]
IB/rdmavt: Add trace for map_mr_sg

Add trace to debug map_mr_sg handling.

Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/rdmavt: Enhance trace information for FRWR debug
Mike Marciniszyn [Fri, 28 Jun 2019 18:22:33 +0000 (14:22 -0400)]
IB/rdmavt: Enhance trace information for FRWR debug

This patch enhances the MR trace information to enable more focused debug
of MR issues.

Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/hfi1: Add missing INVALIDATE opcodes for trace
Mike Marciniszyn [Fri, 28 Jun 2019 18:22:23 +0000 (14:22 -0400)]
IB/hfi1: Add missing INVALIDATE opcodes for trace

This was missed in the original implementation of the memory management
extensions.

Fixes: 0db3dfa03c08 ("IB/hfi1: Work request processing for fast register mr and invalidate")
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/hfi1: Reduce excessive aspm inlines
Michael J. Ruhl [Fri, 28 Jun 2019 18:22:17 +0000 (14:22 -0400)]
IB/hfi1: Reduce excessive aspm inlines

Uninline the aspm API since it increases code space for no reason.

Move the aspm module param to the new aspm C file.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/{rdmavt, hfi1, qib}: Add helpers to hide SWQE WR details
Michael J. Ruhl [Fri, 28 Jun 2019 18:22:11 +0000 (14:22 -0400)]
IB/{rdmavt, hfi1, qib}: Add helpers to hide SWQE WR details

Add some helper functions to hide struct rvt_swqe details.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/{rdmavt, hfi1, qib}: Remove AH refcount for UD QPs
Michael J. Ruhl [Fri, 28 Jun 2019 18:22:04 +0000 (14:22 -0400)]
IB/{rdmavt, hfi1, qib}: Remove AH refcount for UD QPs

Historically rdmavt destroy_ah() has returned an -EBUSY when the AH has a
non-zero reference count.  IBTA 11.2.2 notes no such return value or error
case:

Output Modifiers:
- Verb results:
- Operation completed successfully.
- Invalid HCA handle.
- Invalid address handle.

ULPs never test for this error and this will leak memory.

The reference count exists to allow for driver independent progress
mechanisms to process UD SWQEs in parallel with post sends.  The SWQE will
hold a reference count until the UD SWQE completes and then drops the
reference.

Fix by removing need to reference count the AH.  Add a UD specific
allocation to each SWQE entry to cache the necessary information for
independent progress.  Copy the information during the post send
processing.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/rdmavt: Set QP allowed opcodes after QP allocation
Michael J. Ruhl [Fri, 28 Jun 2019 18:21:58 +0000 (14:21 -0400)]
IB/rdmavt: Set QP allowed opcodes after QP allocation

Currently QP allowed_ops is set after the QP is completely initialized.
This curtails the use of this optimization for any initialization before
allowed_ops is set.

Fix by adding a helper to determine the correct allowed_ops and moving the
setting of the allowed_ops to just after QP allocation.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/{hfi1, qib, rdmavt}: Put qp in error state when cq is full
Kamenee Arumugam [Fri, 28 Jun 2019 18:21:52 +0000 (14:21 -0400)]
IB/{hfi1, qib, rdmavt}: Put qp in error state when cq is full

When a completion queue is full, the associated queue pairs are not put
into the error state. According to the IBTA specification, this is a
violation.

Quote from IBTA spec:
C9-218: A Requester Class F error occurs when the CQ is inaccessible or
full and an attempt is made to complete a WQE.  The Affected QP shall be
moved to the error state and affiliated asynchronous errors generated as
described in 11.6.3.1 Affiliated Asynchronous Events on page 678. The
current WQE and any subsequent WQEs are left in an unknown state.

C11-37: The CI shall generate a CQ Error when a CQ overrun is
detected. This condition will result in an Affiliated Asynchronous Error
for any associated Work Queues when they attempt to use that
CQ. Completions can no longer be added to the CQ. It is not guaranteed
that completions present in the CQ at the time the error occurred can be
retrieved. Possible causes include a CQ overrun or a CQ protection error.

Put the qp in error state when cq is full. Implement a state called full
to continue to put other associated QPs in error state.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/rdmavt: Fracture single lock used for posting and processing RWQEs
Kamenee Arumugam [Fri, 28 Jun 2019 18:04:30 +0000 (14:04 -0400)]
IB/rdmavt: Fracture single lock used for posting and processing RWQEs

Usage of single lock prevents fetching posted and processing receive work
queue entries from progressing simultaneously and impacts overall
performance.

Fracture the single lock used for posting and processing Receive Work
Queue Entries (RWQEs) to allow the circular buffer to be filled and
emptied at the same time. Two new spinlocks - one for the producers and
one for the consumers used for posting and processing RWQEs simultaneously
and the two indices are define on two different cache lines. The threshold
count is used to avoid reading other index in different cache line every
time.

Signed-off-by: Harish Chegondi <harish.chegondi@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/hfi1: Move receive work queue struct into uapi directory
Kamenee Arumugam [Fri, 28 Jun 2019 18:04:24 +0000 (14:04 -0400)]
IB/hfi1: Move receive work queue struct into uapi directory

The rvt_rwqe and rvt_rwq struct elements are shared between rdmavt and the
providers but are not in uapi directory.  As per the comment in
https://marc.info/?l=linux-rdma&m=152296522708522&w=2, The hfi1 driver and
the rdma core driver are not using shared structures in the uapi
directory.

Move rvt_rwqe and rvt_rwq struct into rvt-abi.h header in uapi directory.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/hfi1: Move rvt_cq_wc struct into uapi directory
Kamenee Arumugam [Fri, 28 Jun 2019 18:04:17 +0000 (14:04 -0400)]
IB/hfi1: Move rvt_cq_wc struct into uapi directory

The rvt_cq_wc struct elements are shared between rdmavt and the providers
but not in uapi directory.  As per the comment in
https://marc.info/?l=linux-rdma&m=152296522708522&w=2 The hfi1 driver and
the rdma core driver are not using shared structures in the uapi
directory.

In that case, move rvt_cq_wc struct into the rvt-abi.h header file and
create a rvt_k_cq_w for the kernel completion queue.

Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoMerge tag 'v5.2-rc6' into rdma.git for-next
Jason Gunthorpe [Sat, 29 Jun 2019 00:18:23 +0000 (21:18 -0300)]
Merge tag 'v5.2-rc6' into rdma.git for-next

For dependencies in next patches.

Resolve conflicts:
- Use uverbs_get_cleared_udata() with new cq allocation flow
- Continue to delete nes despite SPDX conflict
- Resolve list appends in mlx5_command_str()
- Use u16 for vport_rule stuff
- Resolve list appends in struct ib_client

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agonet/mlx5: E-Switch, Enable vport metadata matching if firmware supports it
Jianbo Liu [Tue, 25 Jun 2019 17:48:14 +0000 (17:48 +0000)]
net/mlx5: E-Switch, Enable vport metadata matching if firmware supports it

As the ingress ACL rules save vhca id and vport number to packet's
metadata REG_C_0, and the metadata matching for the rules in both fast
path and slow path are all added, enable this feature if supported.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agoRDMA/mlx5: Add vport metadata matching for IB representors
Jianbo Liu [Tue, 25 Jun 2019 17:48:12 +0000 (17:48 +0000)]
RDMA/mlx5: Add vport metadata matching for IB representors

If vport metadata matching is enabled in eswitch, the rule created
must be changed to match on the metadata, instead of source port.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Add match on vport metadata for rule in slow path
Jianbo Liu [Tue, 25 Jun 2019 17:48:09 +0000 (17:48 +0000)]
net/mlx5: E-Switch, Add match on vport metadata for rule in slow path

In slow path, packet that not matched by any offloaded rule is
forwarded to eswitch vport manager for further processing.
Add matching on metadata for peer miss rules in FDB, and rules which
forward packet to correct representor in esw manager NIC_RX table.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Pass metadata from FDB to eswitch manager
Jianbo Liu [Tue, 25 Jun 2019 17:48:07 +0000 (17:48 +0000)]
net/mlx5: E-Switch, Pass metadata from FDB to eswitch manager

In order to do matching on metadata in slow path when demuxing traffic
to representors, explicitly enable the feature that allows HW to pass
metadata REG_C_0 from FDB to eswitch manager NIC_RX table.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Add query and modify esw vport context functions
Jianbo Liu [Tue, 25 Jun 2019 17:48:05 +0000 (17:48 +0000)]
net/mlx5: E-Switch, Add query and modify esw vport context functions

Add esw vport query and modify functions, and exposing them is needed for
enabling or disabling registers passed as metatdata to vport NIC_RX table
in slow path.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Add match on vport metadata for rule in fast path
Jianbo Liu [Tue, 25 Jun 2019 17:48:04 +0000 (17:48 +0000)]
net/mlx5: E-Switch, Add match on vport metadata for rule in fast path

If FW's capabilities and configurations meet the requirement of vport
metadata matching, this feature will be used. As the information
about vport number and vhca_id related to packet is already stored to
its metadata register, which is used as an indicator for perticular
vport, now we can change to match on this metadata for all the
offloading rules in fast path.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5e: Specifying known origin of packets matching the flow
Jianbo Liu [Tue, 25 Jun 2019 17:48:02 +0000 (17:48 +0000)]
net/mlx5e: Specifying known origin of packets matching the flow

In vport metadata matching, source port number is replaced by metadata.
While FW has no idea about what it is in the metadata, a syndrome will
happen. Specify a known origin to avoid the syndrome.
However, there is no functional change because ANY_VPORT (0) is filled
in flow_source, the same default value as before, as a pre-step towards
metadata matching for fast path.
There are two other values can be filled in flow_source. When setting
0x1, packet matching this rule is from uplink, while 0x2 is for packet
from other local vports.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: E-Switch, Tag packet with vport number in VF vports and uplink ingress ACLs
Jianbo Liu [Tue, 25 Jun 2019 17:48:00 +0000 (17:48 +0000)]
net/mlx5: E-Switch, Tag packet with vport number in VF vports and uplink ingress ACLs

When a dual-port VHCA sends a RoCE packet on its non-native port, and the
packet arrives to its affiliated vport FDB, a mismatch might occur on the
rules that match the packet source vport as it is not represented by single
VHCA only in this case. So we change to match on metadata instead of source
vport.
To do that, a rule is created in all vports and uplink ingress ACLs, to
save the source vport number and vhca id in the packet's metadata in order
to match on it later.
The metadata register used is the first of the 32-bit type C registers. It
can be used for matching and header modify operations. The higher 16 bits
of this register are for vhca id, and the lower 16 ones is for vport
number.
This change is not for dual-port RoCE only. If HW and FW allow, the vport
metadata matching is enabled by default.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Add flow context for flow tag
Jianbo Liu [Tue, 25 Jun 2019 17:47:58 +0000 (17:47 +0000)]
net/mlx5: Add flow context for flow tag

Refactor the flow data structures, add new flow_context and move
flow_tag into it, as flow_tag doesn't belong to the rule action.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Introduce a helper API to check VF vport
Parav Pandit [Tue, 25 Jun 2019 17:47:56 +0000 (17:47 +0000)]
net/mlx5: Introduce a helper API to check VF vport

Introduce a helper API mlx5_eswitch_is_vf_vport() to check
if a given vport_num belongs to VF or not.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Jianbo Liu <jianbol@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Support allocating modify header context from ingress ACL
Jianbo Liu [Tue, 25 Jun 2019 17:47:54 +0000 (17:47 +0000)]
net/mlx5: Support allocating modify header context from ingress ACL

That modify header action can be then attached to a steering rule in
the ingress ACL.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Get vport ACL namespace by vport index
Jianbo Liu [Tue, 25 Jun 2019 17:47:52 +0000 (17:47 +0000)]
net/mlx5: Get vport ACL namespace by vport index

The ingress and egress ACL root namespaces are created per vport and
stored into arrays. However, the vport number is not the same as the
index. Passing the array index, instead of vport number, to get the
correct ingress and egress acl namespace.

Fixes: 9b93ab981e3b ("net/mlx5: Separate ingress/egress namespaces for each vport")
Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agonet/mlx5: Introduce vport metadata matching bits and enum constants
Jianbo Liu [Tue, 25 Jun 2019 17:47:50 +0000 (17:47 +0000)]
net/mlx5: Introduce vport metadata matching bits and enum constants

When a dual-port VHCA sends a RoCE packet on its non-native port, and
the packet arrives to its affiliated vport FDB, a mismatch might occur
on the rules that match the packet source vport. So we replace the
match on source port with the match on metadata that was configured in
ingress ACL, and that metadata will be passed further also to the NIC
RX table of the eswitch manager.

Introduce vport metadata matching bits and enum constants as a pre-step
towards metadata matching.
    o metadata type C registers in the misc parameters 2 fields.
    o esw_uplink_ingress_acl bit in esw cap. If it set, the device supports
      ingress ACL for the uplink vport.
    o fdb_to_vport_reg_* bits in flow table cap and esw vport context, to
      support propagating the metadata to the nic rx through the loopback
      path.
    o flow_source in flow context, to indicate the known origin of packets.
    o enum constants, to support the above bits.

Signed-off-by: Jianbo Liu <jianbol@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agoRDMA/hns: fix spelling mistake "attatch" -> "attach"
Colin Ian King [Mon, 24 Jun 2019 12:16:49 +0000 (13:16 +0100)]
RDMA/hns: fix spelling mistake "attatch" -> "attach"

There is a spelling mistake in an dev_err message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/netlink: Audit policy settings for netlink attributes
Doug Ledford [Fri, 21 Jun 2019 21:00:44 +0000 (17:00 -0400)]
RDMA/netlink: Audit policy settings for netlink attributes

For all string attributes for which we don't currently accept the element
as input, we only use it as output, set the string length to
RDMA_NLDEV_ATTR_EMPTY_STRING which is defined as 1.  That way we will only
accept a null string for that element.  This will prevent someone from
writing a new input routine that uses the element without also updating
the policy to have a valid value.

Also while there, make sure the existing entries that are valid have the
correct policy, if not, correct the policy.  Remove unnecessary checks
for nla_strlcpy() overflow once the policy has been set correctly.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Cleanup unnecessary exported symbols
Lijun Ou [Wed, 19 Jun 2019 07:00:47 +0000 (15:00 +0800)]
RDMA/hns: Cleanup unnecessary exported symbols

This patch removes the hns-roce.ko for cleanup all the exported symbols in
common part.

Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agodocs: infiniband: convert docs to ReST and rename to *.rst
Mauro Carvalho Chehab [Sun, 9 Jun 2019 02:27:03 +0000 (23:27 -0300)]
docs: infiniband: convert docs to ReST and rename to *.rst

The InfiniBand docs are plain text with no markups.  So, all we needed to
do were to add the title markups and some markup sequences in order to
properly parse tables, lists and literal blocks.

At its new index.rst, let's add a :orphan: while this is not linked to the
main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: Fix an error code in hns_roce_set_user_sq_size()
Dan Carpenter [Sat, 8 Jun 2019 09:27:14 +0000 (12:27 +0300)]
RDMA/hns: Fix an error code in hns_roce_set_user_sq_size()

This function is supposed to return negative kernel error codes but here
it returns CMD_RST_PRC_EBUSY (2).  The error code eventually gets passed
to IS_ERR() and since it's not an error pointer it leads to an Oops in
hns_roce_v1_rsv_lp_qp()

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/hns: fix potential integer overflow on left shift
Colin Ian King [Mon, 24 Jun 2019 21:46:08 +0000 (22:46 +0100)]
RDMA/hns: fix potential integer overflow on left shift

There is a potential integer overflow when int i is left shifted as this
is evaluated using 32 bit arithmetic but is being used in a context that
expects an expression of type dma_addr_t.  Fix this by casting integer i
to dma_addr_t before shifting to avoid the overflow.

Addresses-Coverity: ("Unintentional integer overflow")
Fixes: 2ac0bc5e725e ("RDMA/hns: Add a group interfaces for optimizing buffers getting flow")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agonet/mlx5: Convert mkey_table to XArray
Matthew Wilcox [Thu, 20 Jun 2019 07:03:47 +0000 (07:03 +0000)]
net/mlx5: Convert mkey_table to XArray

The lock protecting the data structure does not need to be an rwlock.  The
only read access to the lock is in an error path, and if that's limiting
your scalability, you have bigger performance problems.

Eliminate mlx5_mkey_table in favour of using the xarray directly.
reg_mr_callback must use GFP_ATOMIC for allocating XArray nodes as it may
be called in interrupt context.

This also fixes a minor bug where SRCU locking was being used on the radix
tree read side, when RCU was needed too.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
5 years agoRDMA/mlx5: Refactor MR descriptors allocation
Max Gurtovoy [Tue, 11 Jun 2019 15:52:57 +0000 (18:52 +0300)]
RDMA/mlx5: Refactor MR descriptors allocation

Improve code readability using static helpers for each memory region
type. Re-use the common logic to get smaller functions that are easy
to maintain and reduce code duplication.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/mlx5: Use PA mapping for PI handover
Max Gurtovoy [Tue, 11 Jun 2019 15:52:56 +0000 (18:52 +0300)]
RDMA/mlx5: Use PA mapping for PI handover

If possibe, avoid doing a UMR operation to register data and protection
buffers (via MTT/KLM mkeys). Instead, use the local DMA key and map the
SG lists using PA access. This is safe, since the internal key for data
and protection never exposed to the remote server (only signature key
might be exposed). If PA mappings are not possible, perform mapping
using MTT/KLM descriptors.

The setup of the tested benchmark (using iSER ULP):
 - 2 servers with 24 cores (1 initiator and 1 target)
 - ConnectX-4/ConnectX-5 adapters
 - 24 target sessions with 1 LUN each
 - ramdisk backstore
 - PI active

Performance results running fio (24 jobs, 128 iodepth) using
write_generate=1 and read_verify=1 (w/w.o patch):

bs      IOPS(read)        IOPS(write)
----    ----------        ----------
512   1266.4K/1262.4K    1720.1K/1732.1K
4k    793139/570902      1129.6K/773982
32k   72660/72086        97229/96164

Using write_generate=0 and read_verify=0 (w/w.o patch):
bs      IOPS(read)        IOPS(write)
----    ----------        ----------
512   1590.2K/1600.1K    1828.2K/1830.3K
4k    1078.1K/937272     1142.1K/815304
32k   77012/77369        98125/97435

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/mlx5: Improve PI handover performance
Israel Rukshin [Tue, 11 Jun 2019 15:52:55 +0000 (18:52 +0300)]
RDMA/mlx5: Improve PI handover performance

In some loads, there is performance degradation when using KLM mkey
instead of MTT mkey. This is because KLM descriptor access is via
indirection that might require more HW resources and cycles.
Using KLM descriptor is not necessary when there are no gaps at the
data/metadata sg lists. As an optimization, use MTT mkey whenever it
is possible. For that matter, allocate internal MTT mkey and choose the
effective pi_mr for in transaction according to the required mapping
scheme.

The setup of the tested benchmark (using iSER ULP):
 - 2 servers with 24 cores (1 initiator and 1 target)
 - ConnectX-4/ConnectX-5 adapters
 - 24 target sessions with 1 LUN each
 - ramdisk backstore
 - PI active

Performance results running fio (24 jobs, 128 iodepth) using
write_generate=1 and read_verify=1 (w/w.o/baseline):

bs      IOPS(read)                IOPS(write)
----    ----------                ----------
512   1262.4K/1243.3K/1147.1K    1732.1K/1725.1K/1423.8K
4k    570902/571233/457874       773982/743293/642080
32k   72086/72388/71933          96164/71789/93249

Using write_generate=0 and read_verify=0 (w/w.o patch):
bs      IOPS(read)                IOPS(write)
----    ----------                ----------
512   1600.1K/1572.1K/1393.3K    1830.3K/1823.5K/1557.2K
4k    937272/921992/762934       815304/753772/646071
32k   77369/75052/72058          97435/73180/94612

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Suggested-by: Max Gurtovoy <maxg@mellanox.com>
Suggested-by: Idan Burstein <idanb@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/mlx5: Remove unused IB_WR_REG_SIG_MR code
Israel Rukshin [Tue, 11 Jun 2019 15:52:54 +0000 (18:52 +0300)]
RDMA/mlx5: Remove unused IB_WR_REG_SIG_MR code

IB_WR_REG_SIG_MR is not needed after IB_WR_REG_MR_INTEGRITY
was used.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/rw: Use IB_WR_REG_MR_INTEGRITY for PI handover
Israel Rukshin [Tue, 11 Jun 2019 15:52:53 +0000 (18:52 +0300)]
RDMA/rw: Use IB_WR_REG_MR_INTEGRITY for PI handover

Replace the old signature handover API with the new one. The new API
simplifes PI handover code complexity for ULPs and improve performance.
For RW API it will reduce the maximum number of work requests per task
and the need of dealing with multiple MRs (and their registrations and
invalidations) per task. All the mappings and registration of the data
and the protection buffers is done by the LLD using a single WR and a
special MR type (IB_MR_TYPE_INTEGRITY) for the PI handover operation.

The setup of the tested benchmark (using iSER ULP):
 - 2 servers with 24 cores (1 initiator and 1 target)
 - ConnectX-4/ConnectX-5 adapters
 - 24 target sessions with 1 LUN each
 - ramdisk backstore
 - PI active

Performance results running fio (24 jobs, 128 iodepth) using
write_generate=1 and read_verify=1 (w/w.o patch):

bs      IOPS(read)        IOPS(write)
----    ----------        ----------
512   1243.3K/1182.3K    1725.1K/1680.2K
4k    571233/528835      743293/748259
32k   72388/71086        71789/93573

Using write_generate=0 and read_verify=0 (w/w.o patch):
bs      IOPS(read)        IOPS(write)
----    ----------        ----------
512   1572.1K/1427.2K    1823.5K/1724.3K
4k    921992/916194      753772/768267
32k   75052/73960        73180/95484

There is a performance degradation when writing big block sizes.
Degradation is caused by the complexity of combining multiple
indirections and perform RDMA READ operation from it. This will be
fixed in the following patches by reducing the indirections if
possible.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/rw: Introduce rdma_rw_inv_key helper
Israel Rukshin [Tue, 11 Jun 2019 15:52:52 +0000 (18:52 +0300)]
RDMA/rw: Introduce rdma_rw_inv_key helper

This is a preparation for adding new signature API to the rw-API.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Validate integrity handover device cap
Max Gurtovoy [Tue, 11 Jun 2019 15:52:51 +0000 (18:52 +0300)]
RDMA/core: Validate integrity handover device cap

Protect the case that a ULP tries to allocate a QP with signature
enabled flag while the LLD doesn't support this feature.
While we're here, also move integrity_en attribute from mlx5_qp to
ib_qp as a preparation for adding new integrity API to the rw-API
(that is part of ib_core module).

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Rename signature qp create flag and signature device capability
Israel Rukshin [Tue, 11 Jun 2019 15:52:50 +0000 (18:52 +0300)]
RDMA/core: Rename signature qp create flag and signature device capability

Rename IB_QP_CREATE_SIGNATURE_EN to IB_QP_CREATE_INTEGRITY_EN
and IB_DEVICE_SIGNATURE_HANDOVER to IB_DEVICE_INTEGRITY_HANDOVER.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Add an integrity MR pool support
Israel Rukshin [Tue, 11 Jun 2019 15:52:49 +0000 (18:52 +0300)]
RDMA/core: Add an integrity MR pool support

This is a preparation for adding new signature API to the rw-API.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/iser: Unwind WR union at iser_tx_desc
Israel Rukshin [Tue, 11 Jun 2019 15:52:48 +0000 (18:52 +0300)]
IB/iser: Unwind WR union at iser_tx_desc

After decreasing WRs array size from 7 to 3 it is more
readable to give each WR a descriptive name.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoIB/iser: Use IB_WR_REG_MR_INTEGRITY for PI handover
Israel Rukshin [Tue, 11 Jun 2019 15:52:47 +0000 (18:52 +0300)]
IB/iser: Use IB_WR_REG_MR_INTEGRITY for PI handover

Using this new API reduces iSER code complexity.
It also reduces the maximum number of work requests per task and the need
of dealing with multiple MRs (and their registrations and invalidations)
per task. It is done by using a single WR and a special MR type
(IB_MR_TYPE_INTEGRITY) for PI operation.

The setup of the tested benchmark:
 - 2 servers with 24 cores (1 initiator and 1 target)
 - 24 target sessions with 1 LUN each
 - ramdisk backstore
 - PI active

Performance results running fio (24 jobs, 128 iodepth) using
write_generate=0 and read_verify=0 (w/w.o patch):

bs      IOPS(read)        IOPS(write)
----    ----------        ----------
512     1236.6K/1164.3K   1357.2K/1332.8K
1k      1196.5K/1163.8K   1348.4K/1262.7K
2k      1016.7K/921950    1003.7K/931230
4k      662728/600545     595423/501513
8k      385954/384345     333775/277090
16k     222864/222820     170317/170671
32k     116869/114896     82331/82244
64k     55205/54931       40264/40021

Using write_generate=1 and read_verify=1 (w/w.o patch):

bs      IOPS(read)        IOPS(write)
----    ----------        ----------
512     1090.1K/1030.9K   1303.9K/1101.4K
1k      1057.7K/904583    1318.4K/988085
2k      965226/638799     1008.6K/692514
4k      555479/410151     542414/414517
8k      298675/224964     264729/237508
16k     133485/122481     164625/138647
32k     74329/67615       80143/78743
64k     35716/35519       39294/37334

We get performance improvement at all block sizes.
The most significant improvement is when writing 4k bs (almost 30% more
iops).

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/mlx5: Introduce and implement new IB_WR_REG_MR_INTEGRITY work request
Max Gurtovoy [Tue, 11 Jun 2019 15:52:46 +0000 (18:52 +0300)]
RDMA/mlx5: Introduce and implement new IB_WR_REG_MR_INTEGRITY work request

This new WR will be used to perform PI (protection information) handover
using the new API. Using the new API, the user will post a single WR that
will internally perform all the needed actions to complete PI operation.
This new WR will use a memory region that was allocated as
IB_MR_TYPE_INTEGRITY and was mapped using ib_map_mr_sg_pi to perform the
registration. In the old API, in order to perform a signature handover
operation, each ULP should perform the following:
1. Map and register the data buffers.
2. Map and register the protection buffers.
3. Post a special reg WR to configure the signature handover operation
   layout.
4. Invalidate the signature memory key.
5. Invalidate protection buffers memory key.
6. Invalidate data buffers memory key.

In the new API, the mapping of both data and protection buffers is
performed using a single call to ib_map_mr_sg_pi function. Also the
registration of the buffers and the configuration of the signature
operation layout is done by a single new work request called
IB_WR_REG_MR_INTEGRITY.
This patch implements this operation for mlx5 devices that are capable to
offload data integrity generation/validation while performing the actual
buffer transfer.
This patch will not remove the old signature API that is used by the iSER
initiator and target drivers. This will be done in the future.

In the internal implementation, for each IB_WR_REG_MR_INTEGRITY work
request, we are using a single UMR operation to register both data and
protection buffers using KLM's.
Afterwards, another UMR operation will describe the strided block format.
These will be followed by 2 SET_PSV operations to set the memory/wire
domains initial signature parameters passed by the user.
In the end of the whole transaction, only the signature memory key
(the one that exposed for the RDMA operation) will be invalidated.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/mlx5: Update set_sig_data_segment attribute for new signature API
Max Gurtovoy [Tue, 11 Jun 2019 15:52:45 +0000 (18:52 +0300)]
RDMA/mlx5: Update set_sig_data_segment attribute for new signature API

Explicitly pass the sig_mr and the access flags for the mkey segment
configuration. This function will be used also in the new signature
API, so modify it in order to use it in both APIs. This is a preparation
commit before adding new signature API.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/mlx5: Pass UMR segment flags instead of boolean
Max Gurtovoy [Tue, 11 Jun 2019 15:52:44 +0000 (18:52 +0300)]
RDMA/mlx5: Pass UMR segment flags instead of boolean

UMR ctrl segment flags can vary between UMR operations. for example,
using inline UMR or adding free/not-free checks for a memory key.
This is a preparation commit before adding new signature API that
will not need not-free checks for the internal memory key during the
UMR operation.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/mlx5: Add attr for max number page list length for PI operation
Max Gurtovoy [Tue, 11 Jun 2019 15:52:43 +0000 (18:52 +0300)]
RDMA/mlx5: Add attr for max number page list length for PI operation

PI offload (protection information) is a feature that each RDMA provider
can implement differently. Thus, introduce new device attribute to define
the maximal length of the page list for PI fast registration operation. For
example, mlx5 driver uses a single internal MR to map both data and
protection SGL's, so it's equal to max_fast_reg_page_list_len / 2.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/mlx5: Implement mlx5_ib_map_mr_sg_pi and mlx5_ib_alloc_mr_integrity
Max Gurtovoy [Tue, 11 Jun 2019 15:52:42 +0000 (18:52 +0300)]
RDMA/mlx5: Implement mlx5_ib_map_mr_sg_pi and mlx5_ib_alloc_mr_integrity

mlx5_ib_map_mr_sg_pi() will map the PI and data dma mapped SG lists to the
mlx5 memory region prior to the registration operation. In the new
API, the mlx5 driver will allocate an internal memory region for the
UMR operation to register both PI and data SG lists. The internal MR
will use KLM mode in order to map 2 (possibly non-contiguous/non-align)
SG lists using 1 memory key. In the new API, each ULP will use 1 memory
region for the signature operation (instead of 3 in the old API). This
memory region will have a key that will be exposed to remote server to
perform RDMA operation. The internal memory key that will map the SG lists
will stay private.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Add signature attrs element for ib_mr structure
Max Gurtovoy [Tue, 11 Jun 2019 15:52:41 +0000 (18:52 +0300)]
RDMA/core: Add signature attrs element for ib_mr structure

This element will describe the needed characteristics for the signature
operation per signature enabled memory region (type IB_MR_TYPE_INTEGRITY).
Also add meta_length attribute to ib_sig_attrs structure for saving the
mapped metadata length (needed for the new API implementation).

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Introduce ib_map_mr_sg_pi to map data/protection sgl's
Max Gurtovoy [Tue, 11 Jun 2019 15:52:40 +0000 (18:52 +0300)]
RDMA/core: Introduce ib_map_mr_sg_pi to map data/protection sgl's

This function will map the previously dma mapped SG lists for PI
(protection information) and data to an appropriate memory region for
future registration.
The given MR must be allocated as IB_MR_TYPE_INTEGRITY.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Introduce IB_MR_TYPE_INTEGRITY and ib_alloc_mr_integrity API
Israel Rukshin [Tue, 11 Jun 2019 15:52:39 +0000 (18:52 +0300)]
RDMA/core: Introduce IB_MR_TYPE_INTEGRITY and ib_alloc_mr_integrity API

This is a preparation for signature verbs API re-design. In the new
design a single MR with IB_MR_TYPE_INTEGRITY type will be used to perform
the needed mapping for data integrity operations.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Save the MR type in the ib_mr structure
Max Gurtovoy [Tue, 11 Jun 2019 15:52:38 +0000 (18:52 +0300)]
RDMA/core: Save the MR type in the ib_mr structure

This is a preparation for the signature verbs API change. This change is
needed since the MR type will define, in the upcoming patches, the need
for allocating internal resources in LLD for signature handover related
operations. It will also help to make sure that signature related
functions are called with an appropriate MR type and fail otherwise.
Also introduce new mr types IB_MR_TYPE_USER, IB_MR_TYPE_DMA and
IB_MR_TYPE_DM for correctness.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoRDMA/core: Introduce new header file for signature operations
Max Gurtovoy [Tue, 11 Jun 2019 15:52:37 +0000 (18:52 +0300)]
RDMA/core: Introduce new header file for signature operations

Ease the exhausted ib_verbs.h file and make the code more readable.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
5 years agoLinux 5.2-rc6 v5.2-rc6
Linus Torvalds [Sat, 22 Jun 2019 23:01:36 +0000 (16:01 -0700)]
Linux 5.2-rc6

5 years agoMerge tag 'iommu-fix-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro...
Linus Torvalds [Sat, 22 Jun 2019 21:08:47 +0000 (14:08 -0700)]
Merge tag 'iommu-fix-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu fix from Joerg Roedel:
 "Revert a commit from the previous pile of fixes which causes new
  lockdep splats. It is better to revert it for now and work on a better
  and more well tested fix"

* tag 'iommu-fix-v5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"