MP-BGP EVPN VXLAN on Proxmox with Ingress Replication

In this post I will talk about how to implement MP-BGP EVPN VXLAN in ingress replication mode on Proxmox and a Nexus N9K; multicast mode will be covered in a future post. It doesn’t need to be a Nexus switch. Your switch of choice just needs to support VXLAN, multiprotocol BGP, EVPN and an IGP like OSPF or IS-IS. I also have a custom switch made from an old PC with FRR running on it. This is a good substitute for a pricey dedicated enterprise switch, not to mention that the power draw is much lower.

I’ve documented what I have learned and my findings. Note that not everything may be fully correct, as these are only my findings from what I could find on the internet, mainly Cisco PDF files and individual tutorials. Do not forget to analyze and go through the configurations yourself.

Chapters

Network preparations
Some warnings
Nexus switch configuration
1. IGP
2. VLANs
3. NVE
4. EVPN
5. MP-BGP
FRRouting configuration
Testing
1. Nexus
2. FRRouting

Network preparations

First of all, you need to prepare the network addresses to use, preferably a /24 network subnetted into /31 networks for the links, plus as many /32 addresses as you have hosts. For example, with three servers each connected with two links to a switch, I would need six /31 networks and four /32 IP addresses (one per server and one for the switch). Those /32 addresses will be configured on loopback interfaces so that ECMP (Equal Cost Multi-Path) can be utilized.
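The address plan above can be sketched with Python's standard ipaddress module. This is only an illustration: the 10.0.0.0/24 block, the function name plan_underlay, and the choice to take the loopbacks from the end of the same block are my assumptions, not something the setup requires.

```python
import ipaddress


def plan_underlay(block, servers, links_per_server):
    """Split one block into /31 link subnets and /32 loopback addresses.

    Hypothetical helper: links are taken from the start of the block,
    loopbacks from the end, so the two ranges never overlap.
    """
    net = ipaddress.ip_network(block)
    link_count = servers * links_per_server
    loopback_count = servers + 1  # one per server plus one for the switch

    # Each /31 uses two addresses, each /32 uses one.
    if link_count * 2 + loopback_count > net.num_addresses:
        raise ValueError("block is too small for this topology")

    links = list(net.subnets(new_prefix=31))[:link_count]
    loopbacks = list(net.subnets(new_prefix=32))[-loopback_count:]
    return links, loopbacks


# Three servers, two links each, as in the example above.
links, loopbacks = plan_underlay("10.0.0.0/24", servers=3, links_per_server=2)

for subnet in links:
    print("link    ", subnet)
for address in loopbacks:
    print("loopback", address)
```

With these inputs the sketch yields six /31 subnets and four /32 addresses, matching the count in the example.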

Some warnings

If you have installed the Wazuh agent on your Linux/Proxmox clients or servers, you will need to uninstall it, as it causes the FRR service to freeze the entire system. Otherwise you will be more or less forced to do a warm reboot, which ZFS doesn’t like very much. At least that is what happened in my case, and I wanted to mention it in case someone else is using Wazuh.

Nexus switch configuration

IGP

Here I will talk about the IGP (OSPF) for MP-BGP EVPN VXLAN.

We need an IGP because MP-BGP peers with loopback addresses, so every router needs a route to the loopback of the other BGP routers or of the Route Reflector. It is possible to run BGP as a full mesh instead of using a Route Reflector, but that is not very scalable.

If you need or want to, you can do multi-area OSPF, although for most homelabbers and people wanting to experiment with it, a single area, area 0, is enough.

It is configured like any other OSPF router would be. By that I mean setting router-id, auto-cost reference-bandwidth, and passive-interface default.

The last option, passive-interface default, together with no ip ospf passive-interface on selected interfaces, ensures that OSPF adjacencies are formed only on explicitly enabled interfaces. In addition, only interfaces that are part of OSPF are advertised via LSAs. This way, if a switch has an IP on, let’s say, interface vlan20 and that interface is not part of OSPF, that IP will not be advertised via LSAs.

Here is how I have configured my OSPF router. It is of course up to you to configure your OSPF as you see fit.

    router ospf 10
      router-id SomeRouterID
      auto-cost reference-bandwidth 1000 Mbps
      passive-interface default

Then, on the interfaces connected to servers, as well as the loopback interface, I have applied this command:

    interface lo0
      ! Omitted configuration
      no ip ospf passive-interface
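To see what the auto-cost reference-bandwidth value in the router configuration does, here is a small sketch. It assumes the usual OSPF cost formula (reference bandwidth divided by interface bandwidth, rounded down, never below 1); the function name ospf_cost is just for illustration.

```python
def ospf_cost(link_mbps, reference_mbps=1000):
    """Interface cost under the assumed formula: reference / link speed, minimum 1."""
    return max(1, reference_mbps // link_mbps)


# Costs for a few common link speeds with a 1000 Mbps reference bandwidth.
for speed in (100, 1000, 10_000, 40_000):
    print(f"{speed} Mbps -> cost {ospf_cost(speed)}")
```

Under that assumption, every link of 1 Gbps or faster ends up with cost 1 when the reference is 1000 Mbps, so OSPF treats a 10G and a 40G link as equal. If you want OSPF to tell them apart, a higher reference bandwidth would be needed.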

VLANs

When configuring VLANs you also need to add vn-segment to the VLAN configuration so that the VLAN gets its VXLAN Network Identifier (VNI). Also, the correct spelling is VXLAN, not VxLAN; that is how RFC 7348 [1] writes it.

Your VLAN configuration would look similar to the following, in which 404 is used as both the VLAN ID and the VXLAN Network Identifier:

    vlan 404
      name Example
      vn-segment 404

NVE

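On the NVE interface you define the VTEP: which loopback it uses as its source, that host reachability is learned via BGP, and how BUM traffic is replicated for each VNI. Every VNI listed under member vni has to match the vn-segment of a VLAN.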

    interface nve1
      no shutdown
      host-reachability protocol bgp
      source-interface loopback0
      member vni 404
        ingress-replication protocol bgp
      member vni 418
        ingress-replication protocol bgp
      ! Then repeat for every VNI

EVPN

At this step you configure EVPN “for” MP-BGP. Ingress replication needs EVPN Type-3 routes to tell the other routers which VTEPs are available, so that BUM (Broadcast, Unknown unicast, Multicast) traffic can be replicated to them.

There are three main EVPN route types: Type-2, Type-3 and Type-5. Type-2 routes advertise MAC addresses and, optionally, the IP addresses bound to them. Type-3 routes are used for BUM traffic. Lastly, Type-5 routes advertise IP prefixes.

    evpn
      vni 1404 l2
        rd auto
        route-target import auto
        route-target export auto
      vni 1418 l2
        rd auto
        route-target import auto
        route-target export auto
      ! Also repeat for every VNI
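Since the per-VNI lines under interface nve1 and under evpn have to be repeated for every VNI, and both places have to use the same VNI numbers, one way to avoid typos is to render both from a single list. The following is only a sketch: the function names are made up, and the VNI values are the ones from the evpn block; use whatever vn-segment values your VLANs actually carry.

```python
def render_nve_members(vnis):
    """Return the 'member vni' lines to paste under 'interface nve1'."""
    lines = []
    for vni in vnis:
        lines.append(f"  member vni {vni}")
        lines.append("    ingress-replication protocol bgp")
    return "\n".join(lines)


def render_evpn(vnis):
    """Return a complete 'evpn' block with one L2 VNI stanza per entry."""
    lines = ["evpn"]
    for vni in vnis:
        lines += [
            f"  vni {vni} l2",
            "    rd auto",
            "    route-target import auto",
            "    route-target export auto",
        ]
    return "\n".join(lines)


# One list feeds both places, so the numbers cannot drift apart.
vnis = [1404, 1418]
print(render_nve_members(vnis))
print(render_evpn(vnis))
```

The output is plain text that still has to be reviewed and pasted into the switch by hand.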

MP-BGP

Configure a /32 IP address on the loopback interface. It is used as the source address of the BGP session (update-source) and is the address the neighbors peer with. Note that the PERMIT_LO route-map permits the loopback addresses to be redistributed via BGP.

    router bgp 64572
      router-id 2.1.3.7
      bestpath as-path multipath-relax
      log-neighbor-changes
      address-family ipv4 unicast
        redistribute direct route-map PERMIT_LO
      address-family l2vpn evpn
        retain route-target all
      ! PURR-SW
      neighbor 10.255.254.254
        remote-as 64572
        update-source loopback0
        address-family ipv4 unicast
          send-community
          send-community extended
          route-reflector-client
        address-family l2vpn evpn
          send-community
          send-community extended
          route-reflector-client
      ! Repeat for each neighbor

FRRouting configuration

Here a bit of “duct tape” will be used. What I mean by that is that we first need to create a vxlan interface that encapsulates and decapsulates the VXLAN packets. On Proxmox, in order to have VMs connected to the vxlan interface, you will need to create a virtual bridge that has the vxlan interface as a bridge port. With that, Proxmox sends the VM traffic through the vxlan interface, which encapsulates each Ethernet frame into a VXLAN packet and sends it to the destination VTEP.

An important thing worth mentioning is how the MTU needs to be handled. VXLAN adds an overhead of around 50 bytes over IPv4 and around 70 bytes over IPv6. This means that if we were to set both the vxlan interface and the virtual bridge to MTU 4096, a full-size packet from a VM would grow past 4096 bytes once encapsulated, and it would have to be dropped. The MTU of the vxlan interface also needs to match the MTU of the physical port. In simple terms, the underlay interface (a physical link like eno1, enp24s0, etc.) must have an MTU that is larger than or equal to the MTU of the vxlan interface (vni1404), and the bridge MTU has to leave room for the VXLAN overhead.

That is exactly why the vxlan interface (the one decapsulating VXLAN packets) needs to have MTU 4096 in my setup. The virtual bridge connected to the vxlan interface, on the other hand, needs an MTU that is 50 bytes lower than 4096, which results in 4046. I personally have set the MTU on the virtual bridges to 4030 to have a safe margin. Even with MTU 4030 my network still managed to achieve 40 Gbps with single-digit retransmissions in a parallel iperf3 test.
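The MTU arithmetic can be written out as a short sketch. The header sizes are the standard ones without an inner VLAN tag (inner Ethernet 14, VXLAN 8, UDP 8, outer IPv4 20 or outer IPv6 40 bytes); the function name max_bridge_mtu is only illustrative.

```python
# Bytes added in front of the inner IP packet, keyed by outer IP version.
VXLAN_OVERHEAD = {
    4: 14 + 8 + 8 + 20,  # inner Ethernet + VXLAN + UDP + outer IPv4 = 50
    6: 14 + 8 + 8 + 40,  # inner Ethernet + VXLAN + UDP + outer IPv6 = 70
}


def max_bridge_mtu(underlay_mtu, ip_version=4):
    """Largest bridge MTU whose packets still fit the underlay after encapsulation."""
    return underlay_mtu - VXLAN_OVERHEAD[ip_version]


print(max_bridge_mtu(4096))          # IPv4 underlay
print(max_bridge_mtu(4096, 6))       # IPv6 underlay
print(max_bridge_mtu(4096) - 4030)   # margin left by a bridge MTU of 4030
```

For an IPv4 underlay with MTU 4096 this gives 4046, so a bridge MTU of 4030 leaves 16 bytes of margin. Note that over an IPv6 underlay the limit would be 4026, so 4030 would no longer fit.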

IGP

The IGP configuration is pretty much the same as on Cisco. The only difference is the syntax, which is still very similar.

    router ospf
     ospf router-id SomeRouterID
     auto-cost reference-bandwidth 1000
     passive-interface default
    exit

VXLAN interfaces

Here you will create the interface that encapsulates and decapsulates the VXLAN packets.

What I recommend is to use the /etc/network/interfaces.d directory and create a file named vni there that holds only the VXLAN interface definitions.

By default Proxmox has a source /etc/network/interfaces.d/* line in the /etc/network/interfaces file. In case that line is not there, add it to the file.

Here is an example of a vxlan interface with VNI 1404. I use vniXXXX as the naming scheme, as it directly shows that it is a VXLAN interface with VNI 1404.

    auto vni1404
    iface vni1404 inet manual
        vxlan-id 1404
        vxlan-local-tunnelip YourLocalLOOPBACKInterfaceIP
        vxlan-dstport 4789
        vxlan-external on
        mtu 4096

Virtual Bridges

Now comes the part where we create the bridges for VMs and LXCs to connect to. These configurations need to be appended to the /etc/network/interfaces file. If the virtual bridges were added under the interfaces.d directory, they would not show up in the Proxmox WebGUI.

Here is an example of how a virtual switch (vmbr in the Proxmox WebGUI) is configured without an IP address. In these examples I use vnibrXXXX as the naming scheme, which shows instantly that it is a bridge for VNI 1404.

    auto vnibr1404
    iface vnibr1404 inet manual
        bridge-ports vni1404
        bridge-stp off
        bridge-fd 0
        learning off
        flooding on
        mtu 4030

And here is another example, for a virtual bridge with an IP address. You can comment out the gateway line if the host should only reach that subnet and not use it for a default route.

    auto vnibr1404
    iface vnibr1404 inet static
        address 192.168.1.2/24
        gateway 192.168.1.1
        bridge-ports vni1404
        bridge-stp off
        bridge-fd 0
        learning off
        flooding on
        mtu 4030

MP-BGP

Here comes an example of how to configure MP-BGP EVPN in FRR. Information-wise it looks pretty similar to the Nexus configuration. The differences are the syntax and that in this example FRR is not a Route Reflector. That means that FRR in this case connects to the Nexus and exchanges MAC addresses using EVPN NLRI. Both the ipv4 unicast and the l2vpn evpn address families are activated for the neighbor, so IP routes and MAC addresses are both advertised.

It is possible to create a BGP full mesh, so that every router, FRR or any other, establishes peering with every other router. It does not scale, however, because the number of sessions grows quadratically: n routers need n(n-1)/2 sessions.

    router bgp SOMEASN
     bgp router-id SomeBGPRouterID
     no bgp default ipv4-unicast
     bgp graceful-restart
     bgp bestpath as-path multipath-relax
     no bgp network import-check
     neighbor NeighborIP remote-as SOMEASN
     neighbor NeighborIP update-source lo
     ! Repeat the previous 2 lines for every neighbor
     !
     address-family ipv4 unicast
      neighbor NeighborIP activate
      ! Repeat for every neighbor
     exit-address-family
     !
     address-family l2vpn evpn
      neighbor NeighborIP activate
      ! Repeat for every neighbor
      advertise-all-vni
     exit-address-family
    exit

Sources

[1] M. Mahalingam et al., “Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks,” www.rfc-editor.org, Aug. 2014, doi: https://doi.org/10.17487/RFC7348.