Switchdev

Switchdev on the Mellanox Platform

Switchdev is a Linux project to directly support networking ASICs in a standard Linux environment.
It provides in-kernel support for the ASIC, so existing tools like ifconfig, ethtool, and ip link just work and are used to configure the forwarding hardware.
Mellanox provides a switchdev driver and ships it in a Fedora Remix ISO that is easy to install.
We will use this platform to take a look at what can be done in this new model.
As in several other examples, we will use ONIE to install the switch OS.
It is important to mention that besides the OS there is also switch firmware in the forwarding ASIC, which must be the correct, current version.
If this is not the case, Mellanox provides images and tools to update the firmware. As this is not native Linux functionality, the tools for this must be provided by the hardware vendor.
The ONIE install is a bit more complex for this system; some additional steps are necessary:

1. We need to extract the ISO image onto the webroot of our installation server
2. The kickstart file needs to be in a directory named ks and point to the installation server
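
The two preparation steps above could be scripted roughly as follows. This is a minimal sketch under stated assumptions: the function names, file names, paths, and server URL are hypothetical, and it assumes the kickstart file selects its install source with the standard kickstart `url --url=` directive. The kickstart file shipped with the Mellanox image may look different.

```shell
#!/bin/sh
# Sketch only: names, paths and the server URL are placeholders.
# The ISO itself can be unpacked as root with a loop mount, for example:
#   mount -o loop,ro image.iso /mnt && cp -a /mnt/. "$webroot"/

# Rewrite the "url --url=..." line of a kickstart file so that the
# installer fetches its packages from our own installation server.
# Note: the URL must not contain "|" or "&", which are special to sed here.
point_kickstart() {
    ks_file=$1      # kickstart file to edit
    server_url=$2   # URL of the extracted ISO on the installation server
    sed "s|^url --url=.*|url --url=$server_url|" "$ks_file" > "$ks_file.tmp" &&
        mv "$ks_file.tmp" "$ks_file"
}

# Copy the kickstart file into the "ks" directory below the webroot.
stage_kickstart() {
    ks_file=$1
    webroot=$2
    mkdir -p "$webroot/ks" && cp "$ks_file" "$webroot/ks/"
}
```

A call such as `point_kickstart ks.cfg http://10.0.0.1/switch-iso` followed by `stage_kickstart ks.cfg /var/www/html/switch-iso` would cover step 2, with both the URL and the webroot being example values.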

After preparing this and starting the installation, it ran through without any problem, and we were able to log in to the device.

=======================================
[ OK ] Reached target Network.
Starting OpenSSH server daemon...
[ OK ] Reached target Network is Online.
Starting Notify NFS peers of a restart...
Starting Permit User Sessions...
[ OK ] Started Notify NFS peers of a restart.
[ OK ] Started Permit User Sessions.
[ OK ] Started Command Scheduler.
[ OK ] Started Job spooling tools.
Starting Hold until boot process finishes up...
Starting Terminate Plymouth Boot Screen...
[ OK ] Started OpenSSH server daemon.

Generic release 25 (Generic)
Kernel 4.8.15-300.fc25.x86_64 on an x86_64 (ttyS0)


localhost login: root
Password:
Welcome to the Mellanox Development System.

Please refer to https://github.com/Mellanox/mlxsw/wiki for user manual.
This is a Fedora Remix. Unmodified Fedora is available at getfedora.org

[root@localhost ~]#
===========================

Running ifconfig shows only the management ports at the moment. We need to update the kernel to a version that works with our firmware. Firmware in this case refers to the microcode on the linecard / ASIC itself. From the OS perspective, the management ports are "standard" Ethernet ports that run with the standard Linux drivers. The dataplane ports are special Ethernet interfaces that need the "advanced" switchdev driver to work. The switchdev driver talks directly to the hardware-accelerated ports through an API exposed by the linecard firmware. Because this API is called by the kernel, the kernel version and the firmware need to be in sync for the dataplane ports to work. Here is the state after installing, where we see only the management interfaces:

============================================
[root@localhost ~]# ifconfig
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.77.10.118 netmask 255.255.255.0 broadcast 10.77.10.255
inet6 fe80::7efe:90ff:fe28:cd6e prefixlen 64 scopeid 0x20<link>
ether 7c:fe:90:28:cd:6e txqueuelen 1000 (Ethernet)
RX packets 17 bytes 2780 (2.7 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 32 bytes 3915 (3.8 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 20 memory 0xf7c00000-f7c20000

ens6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::7efe:90ff:fe28:cd6f prefixlen 64 scopeid 0x20<link>
ether 7c:fe:90:28:cd:6f txqueuelen 1000 (Ethernet)
RX packets 2404 bytes 242124 (236.4 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 13 bytes 2410 (2.3 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 18 memory 0xf5e00000-f5e20000

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1 (Local Loopback)
RX packets 120 bytes 10768 (10.5 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 120 bytes 10768 (10.5 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

[root@localhost ~]#
=========================================
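
Instead of reading through the ifconfig output, one way to check whether the switch driver is serving any ports is to look at which driver each interface is bound to in sysfs. The sketch below assumes the usual `/sys/class/net/<interface>/device/driver` symlink layout and that the dataplane ports expose it. The function name `ports_by_driver` is made up for this example, and the driver name `mlxsw_spectrum` is the one ethtool reports later in this article.

```shell
#!/bin/sh
# List network interfaces bound to a given kernel driver.
# $1: driver name
# $2: sysfs net directory (optional, defaults to /sys/class/net)
ports_by_driver() {
    driver=$1
    net_dir=${2:-/sys/class/net}
    for dev in "$net_dir"/*; do
        link=$dev/device/driver
        # Virtual interfaces such as lo have no device/driver link.
        [ -L "$link" ] || continue
        if [ "$(basename "$(readlink "$link")")" = "$driver" ]; then
            basename "$dev"
        fi
    done
}
```

Running `ports_by_driver mlxsw_spectrum` on the freshly installed system would be expected to print nothing, which matches the ifconfig output above.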

To get the updates, we need to configure networking so that the switch can reach the Fedora RPM servers.
If this is not done automatically through DHCP, you need to set the IP address, gateway, and DNS servers manually, in the way that is usual for the Linux distribution in use. The distribution provided for the Mellanox platform is based on Fedora.
To update the kernel (remember, switchdev is a kernel driver), we use dnf. Dnf is a newer replacement for yum that works much the same way:

================================
[root@localhost ~]# dnf update kernel
Fedora 25 - x86_64 - Updates 17 MB/s | 19 MB 00:01
Fedora 25 - x86_64 16 MB/s | 50 MB 00:03
Last metadata expiration check: 0:00:43 ago on Wed Mar 1 20:32:10 2017.
Dependencies resolved.
================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
kernel x86_64 4.9.12-200.fc25 updates 96 k
kernel-core x86_64 4.9.12-200.fc25 updates 20 M
kernel-modules x86_64 4.9.12-200.fc25 updates 22 M

Transaction Summary
================================================================================
Install 3 Packages

Total download size: 42 M
Installed size: 74 M
Is this ok [y/N]: y
Downloading Packages:
[MIRROR] kernel-core-4.9.12-200.fc25.x86_64.rpm: Status code: 404 for http://archive.linux.duke.edu/pub/fedora/linux/updates/25/x86_64/k/kernel-core-4.9.12-200.fc25.x86_64.rpm
[MIRROR] kernel-modules-4.9.12-200.fc25.x86_64.rpm: Status code: 404 for http://archive.linux.duke.edu/pub/fedora/linux/updates/25/x86_64/k/kernel-modules-4.9.12-200.fc25.x86_64.rpm
[MIRROR] kernel-4.9.12-200.fc25.x86_64.rpm: Status code: 404 for http://archive.linux.duke.edu/pub/fedora/linux/updates/25/x86_64/k/kernel-4.9.12-200.fc25.x86_64.rpm
(1/3): kernel-4.9.12-200.fc25.x86_64.rpm 212 kB/s | 96 kB 00:00
(2/3): kernel-core-4.9.12-200.fc25.x86_64.rpm 7.7 MB/s | 20 MB 00:02
(3/3): kernel-modules-4.9.12-200.fc25.x86_64.rp 7.7 MB/s | 22 MB 00:02
--------------------------------------------------------------------------------
Total 13 MB/s | 42 MB 00:03
warning: /var/cache/dnf/updates-87ad44ec2dc11249/packages/kernel-4.9.12-200.fc25.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID fdb19c98: NOKEY
Importing GPG key 0xFDB19C98:
Userid : "Fedora 25 Primary (25) <fedora-25-primary@fedoraproject.org>"
Fingerprint: C437 DCCD 558A 66A3 7D6F 4372 4089 D8F2 FDB1 9C98
From : /etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-25-x86_64
Is this ok [y/N]: y
Key imported successfully
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Installing : kernel-core-4.9.12-200.fc25.x86_64 1/3
Installing : kernel-modules-4.9.12-200.fc25.x86_64 2/3
Installing : kernel-4.9.12-200.fc25.x86_64 3/3
Verifying : kernel-4.9.12-200.fc25.x86_64 1/3
Verifying : kernel-core-4.9.12-200.fc25.x86_64 2/3
Verifying : kernel-modules-4.9.12-200.fc25.x86_64 3/3

Installed:
kernel.x86_64 4.9.12-200.fc25 kernel-core.x86_64 4.9.12-200.fc25
kernel-modules.x86_64 4.9.12-200.fc25

Complete!
[root@localhost ~]#

======================================


After a reboot the new driver is active, and we can now see all the interfaces. If this were not the case, we would need to update the linecard firmware too. As this is not standard Linux functionality, we would need to use a switch-vendor-specific tool and download the most current firmware from the vendor site. The whole process is described in detail in the Mellanox switchdev wiki.
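
To confirm after the reboot that the system really runs the updated kernel, the running version can be compared against a threshold. The sketch below relies on `sort -V` (version sort, a GNU coreutils feature). The function name `version_ge` is made up, and the threshold 4.9.12 in the usage note is simply the version installed above, not a documented minimum for the firmware.

```shell
#!/bin/sh
# Succeed if version $1 is greater than or equal to version $2.
# Uses GNU "sort -V" so that 4.10 correctly sorts after 4.9.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}
```

On the switch this could be used as `version_ge "$(uname -r)" 4.9.12 || echo "kernel older than expected"`.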

Here we see the output of ifconfig after the update, with the dataplane interfaces now visible:

=================================================
[root@localhost ~]# ifconfig -a
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.77.10.118 netmask 255.255.255.0 broadcast 10.77.10.255
inet6 fe80::7efe:90ff:fe28:cd6e prefixlen 64 scopeid 0x20<link>
ether 7c:fe:90:28:cd:6e txqueuelen 1000 (Ethernet)
RX packets 1 bytes 346 (346.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 14 bytes 1377 (1.3 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 20 memory 0xf7c00000-f7c20000

ens6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::7efe:90ff:fe28:cd6f prefixlen 64 scopeid 0x20<link>
ether 7c:fe:90:28:cd:6f txqueuelen 1000 (Ethernet)
RX packets 228 bytes 22108 (21.5 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 16 bytes 2904 (2.8 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 18 memory 0xf5e00000-f5e20000

eth0: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether 7c:fe:90:ee:45:81 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth1: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether 7c:fe:90:ee:45:83 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
.
.
.
.
eth30: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether 7c:fe:90:ee:45:bd txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth31: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether 7c:fe:90:ee:45:bf txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1 (Local Loopback)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

=================================================

We can now use standard Linux commands and tools to configure the 100G dataplane ports just like the standard management interfaces. First, let's look at an example of how to get hardware-related information such as temperature or fan speeds:

Hardware monitoring:


[root@localhost hwmon1]# ls
device fan3_input fan6_input name subsystem temp1_reset_history
fan1_input fan4_input fan7_input power temp1_highest uevent
fan2_input fan5_input fan8_input pwm1 temp1_input

Find the RPM for a fan:
[root@localhost hwmon1]# cat fan1_input
12562

Find the current ASIC temperature:
[root@localhost hwmon1]# cat temp1_input
34000
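
The raw values are easier to read with a little post-processing: in the hwmon sysfs interface, `temp*_input` files report millidegrees Celsius and `fan*_input` files report RPM, so the 34000 above means 34.0 °C. The following sketch prints all fans and the temperature of one hwmon directory. The function names are made up, and the directory is passed as an argument because the hwmon index (hwmon1 here) can differ between systems.

```shell
#!/bin/sh
# Convert a hwmon temperature reading (millidegrees Celsius) to degrees.
millideg_to_c() {
    LC_ALL=C awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

# Print every fan speed and the first temperature of one hwmon directory.
# $1: hwmon directory, for example /sys/class/hwmon/hwmon1
hwmon_summary() {
    dir=$1
    for f in "$dir"/fan*_input; do
        [ -r "$f" ] || continue
        printf '%s: %s RPM\n' "$(basename "$f" _input)" "$(cat "$f")"
    done
    if [ -r "$dir/temp1_input" ]; then
        printf 'temp1: %s C\n' "$(millideg_to_c "$(cat "$dir/temp1_input")")"
    fi
}
```

For example, `hwmon_summary /sys/class/hwmon/hwmon1` would list one line per fan followed by the ASIC temperature.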

Another way is to use the lm_sensors package. As is often the case in Linux, there are several ways to access the same information:


[root@localhost hwmon1]# sensors
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +34.0°C (high = +87.0°C, crit = +105.0°C)
Core 0: +26.0°C (high = +87.0°C, crit = +105.0°C)
Core 1: +34.0°C (high = +87.0°C, crit = +105.0°C)

acpitz-virtual-0
Adapter: Virtual device
temp1: +27.8°C (crit = +106.0°C)
temp2: +29.8°C (crit = +106.0°C)

mlxsw-pci-0300
Adapter: PCI adapter
fan1: 12798 RPM
fan2: 10861 RPM
fan3: 12448 RPM
fan4: 10861 RPM
fan5: 12562 RPM
fan6: 10861 RPM
fan7: 12798 RPM
fan8: 10861 RPM
temp1: +34.0°C (highest = +35.0°C)

We can use ethtool to get driver and firmware information for a port; again, ethtool works the same way as we are used to:

[root@localhost hwmon1]# ethtool -i eth0
driver: mlxsw_spectrum
version: 1.0
firmware-version: 13.1220.130
expansion-rom-version:
bus-info: 0000:03:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
[root@localhost hwmon1]#
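
Since the kernel and firmware versions need to match, it can be handy to pull single fields out of this output in a script. The following is a small sketch that filters `ethtool -i` output read from standard input. The helper name `ethtool_field` is made up for this example.

```shell
#!/bin/sh
# Print the value of one field from "ethtool -i" output read on stdin.
# $1: field name as printed by ethtool, for example firmware-version
ethtool_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2 }'
}
```

On the switch, `ethtool -i eth0 | ethtool_field firmware-version` would print only the firmware version string.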


We can use ip link to change port states and get link information:


[root@localhost hwmon1]# ip link show dev eth0
4: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
link/ether 7c:fe:90:ee:45:81 brd ff:ff:ff:ff:ff:ff
[root@localhost hwmon1]#
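
The transcript above only reads the link state. Changing it works with `ip link set`, and the resulting operational state can also be read from the `operstate` file in sysfs. The helper names below are made up for this sketch, and `port_up` needs root privileges.

```shell
#!/bin/sh
# Report the operational state of an interface as seen in sysfs.
# $1: interface name
# $2: sysfs net directory (optional, defaults to /sys/class/net)
link_state() {
    cat "${2:-/sys/class/net}/$1/operstate"
}

# Set an interface administratively up, then print its state.
# The state may still read "down" until the link partner is detected.
port_up() {
    ip link set dev "$1" up && link_state "$1"
}
```

With no cable attached, `port_up eth0` would be expected to report `down`, consistent with the NO-CARRIER flag shown above.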


Now we look at configuring some basic networking functionality on the dataplane ports. In this example we use ip link to configure an L2 bridge:

[root@localhost hwmon1]# ip link add name br0 type bridge vlan_filtering 1
[root@localhost hwmon1]# ip link show br0
37: br0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 5a:54:a5:32:5d:4a brd ff:ff:ff:ff:ff:ff
[root@localhost hwmon1]# ip link set dev eth0 master br0
[ 2714.793068] br0: port 1(eth0) entered blocking state
[ 2714.798180] br0: port 1(eth0) entered disabled state
[ 2714.804300] device eth0 entered promiscuous mode
[root@localhost hwmon1]# ip link set dev eth1 master br0
[ 2722.929924] br0: port 2(eth1) entered blocking state
[ 2722.935035] br0: port 2(eth1) entered disabled state
[ 2722.941283] device eth1 entered promiscuous mode
[root@localhost hwmon1]# ip link show br0
37: br0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 7c:fe:90:ee:45:81 brd ff:ff:ff:ff:ff:ff
[root@localhost hwmon1]# bridge vlan show dev eth0
port vlan ids
eth0 1 PVID Egress Untagged

eth0 1 PVID Egress Untagged
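
In the transcript above, br0 still shows state DOWN, and the ports are only members of the default VLAN 1. A possible continuation is sketched below: it prints the commands that create the bridge, enslave the ports, add a tagged VLAN, and bring everything up, so they can be reviewed before being applied. The function name `bridge_plan` and VLAN ID 10 are example choices, not part of the original setup.

```shell
#!/bin/sh
# Print the commands that put ports into a VLAN-aware bridge.
# $1: bridge name, $2: VLAN ID, remaining arguments: member ports.
# Review the output, then pipe it to "sh" as root to apply it.
bridge_plan() {
    br=$1
    vid=$2
    shift 2
    echo "ip link add name $br type bridge vlan_filtering 1"
    for port in "$@"; do
        echo "ip link set dev $port master $br"
        echo "ip link set dev $port up"
        # Without "pvid untagged" the VLAN is carried tagged on the port.
        echo "bridge vlan add dev $port vid $vid"
    done
    echo "ip link set dev $br up"
}
```

For example, `bridge_plan br0 10 eth0 eth1` prints eight commands, and `bridge_plan br0 10 eth0 eth1 | sh` would run them.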

These are just some examples to demonstrate how we can now configure standard networking through Linux tools. This approach has a certain learning curve in how to use Linux commands to configure networking behavior, but it is comparable to learning a new CLI when switching to a new networking vendor. The nice thing is that the Linux commands work the same on devices from different vendors that support switchdev, so there is no other CLI to learn.

In addition to this, we are also able to install standard Linux software this way.
As an example, we will use this platform to install perfSONAR tools directly on the device. You will find the information regarding the perfSONAR installation in a later blog post.