In this post I will cover how to implement MP-BGP EVPN VXLAN in ingress replication mode on Proxmox and a Nexus N9K; a future post will cover multicast mode. It doesn’t need to be a Nexus switch. Your switch of choice just needs to support VXLAN, multiprotocol BGP, EVPN and an IGP such as OSPF or IS-IS. I also have a custom switch built from an old PC running FRR. It is a good substitute for a pricey dedicated enterprise switch, and its power draw is much lower.
I’ve documented what I have learned along the way. Note that not everything here may be fully correct, as these are only my findings from what I could find on the internet, mainly Cisco PDF files and individual tutorials. Do not forget to go through and analyze the configurations yourself.
Network preparations
First of all, you need to prepare the network addresses, preferably a /24 network subnetted into /31 networks for the point-to-point links, plus one /32 address per device. For example, with three servers each connected to a switch by two links, I need six /31 networks and four /32 addresses (three servers plus the switch). The /32 addresses go on loopback interfaces so that ECMP (Equal-Cost Multi-Path) can be used.
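The address plan above can be sketched with Python's standard ipaddress module. This is only an illustration: the block 10.255.254.0/24, the function name plan_underlay, and the choice to take /31s from the bottom of the block and /32s from the top are my assumptions, not something the setup requires.

```python
import ipaddress
from itertools import islice


def plan_underlay(block, servers, links_per_server):
    """Carve /31 link networks and /32 loopbacks out of one address block."""
    net = ipaddress.ip_network(block)
    links_needed = servers * links_per_server
    loopbacks_needed = servers + 1  # one per server plus one for the switch

    # Each /31 uses 2 addresses and each /32 uses 1; make sure they all fit.
    if 2 * links_needed + loopbacks_needed > net.num_addresses:
        raise ValueError("address block is too small for this topology")

    # /31 link networks are taken from the bottom of the block ...
    links = list(islice(net.subnets(new_prefix=31), links_needed))
    # ... and /32 loopbacks from the top, so the two ranges never overlap.
    loopbacks = list(net.subnets(new_prefix=32))[-loopbacks_needed:]
    return links, loopbacks


# Example block (assumption): three servers, two links each, one switch.
links, loopbacks = plan_underlay("10.255.254.0/24", servers=3, links_per_server=2)
for link in links:
    print("p2p link:", link)
for lo in loopbacks:
    print("loopback:", lo)
```

Keeping the two ranges at opposite ends of the block makes it easy to tell at a glance whether an address belongs to a link or to a loopback.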
Some warnings
If you have the Wazuh agent installed on your Linux/Proxmox clients or servers, you will need to uninstall it, as it causes the FRR service to freeze the entire system. You are then more or less forced to do a warm reboot, which ZFS doesn’t like very much. At least that is what happened in my case, and I wanted to mention it in case someone else is using Wazuh.
Nexus switch configuration
IGP
Here I will talk about the IGP (OSPF) used as the underlay for MP-BGP EVPN VXLAN.
We need an IGP because the MP-BGP sessions run between loopback addresses, so every router needs a route to the loopback of its BGP peer or Route Reflector. It is possible to have BGP working in a full mesh, but that is not very scalable.
If you need or want to, you can do multi-area OSPF, although for most homelabbers and people wanting to experiment with it a single area, area 0 (the backbone, which cannot be a stub area), is enough.
It is configured like any other OSPF router would be. By that I mean setting router-id, auto-cost reference-bandwidth, and passive-interface default.
The last option, passive-interface default, makes every interface passive, and no ip ospf passive-interface then re-enables adjacencies only on the interfaces where you explicitly configure it. Which networks end up in LSAs is a separate matter: only interfaces that are part of the OSPF process are advertised. So if the switch has an IP on, say, interface vlan20 that was never added to OSPF, that IP will not be advertised via LSAs.
Here is how I have configured my OSPF router. It is of course up to you to configure your OSPF as you see fit.
router ospf 10
  router-id SomeRouterID
  auto-cost reference-bandwidth 1000 Mbps
  passive-interface default
Then, on the interfaces connected to servers as well as on the loopback interface, I have applied this command:
interface lo0
  ! Omitted configuration
  no ip ospf passive-interface
VLANs
When configuring VLANs you also need to add vn-segment to the VLAN configuration so that the VLAN gets its VXLAN Network Identifier (VNI). Also, it is spelled VXLAN, not VxLAN, which is how RFC 7348 [1] writes it.
Your VLAN configuration would look similar to the following, in which VLAN ID 404 is mapped to VNI 1404. The VNI must be the same one used under interface nve1, under evpn, and on the FRR side.
vlan 404
  name Example
  vn-segment 1404
NVE
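The NVE (Network Virtualization Edge) interface is the switch’s VTEP. Here you define which loopback sources the tunnels, that remote hosts are learned through BGP, and how BUM traffic is replicated for each VNI. The VNIs must match the ones configured under evpn.
interface nve1
  no shutdown
  host-reachability protocol bgp
  source-interface loopback0
  member vni 1404
    ingress-replication protocol bgp
  member vni 1418
    ingress-replication protocol bgp
  ! Then repeat for every VNI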
EVPN
At this step you configure EVPN “for” MP-BGP. Ingress replication relies on EVPN Type-3 routes to tell the other routers which VTEPs participate in a VNI, so that BUM (Broadcast, Unknown unicast, Multicast) traffic can be replicated to each of them.
There are three main EVPN route types to know here: Type-2, Type-3 and Type-5. Type-2 (MAC/IP Advertisement) carries MAC addresses and, optionally, their IPs. Type-3 (Inclusive Multicast Ethernet Tag) is used for BUM traffic. Lastly, Type-5 (IP Prefix) carries IP prefixes.
evpn
  vni 1404 l2
    rd auto
    route-target import auto
    route-target export auto
  vni 1418 l2
    rd auto
    route-target import auto
    route-target export auto
  ! Also repeat for every VNI
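Since the per-VNI stanzas under interface nve1 and under evpn repeat for every VNI, they can be generated instead of typed by hand. The following is a minimal sketch that only reuses the command lines shown in this post; the function names render_nve_members and render_evpn are hypothetical helpers of mine, and the output should still be reviewed before pasting it into a switch.

```python
def render_nve_members(vnis):
    """Render the 'member vni' stanzas of interface nve1 for each VNI."""
    lines = ["interface nve1"]
    for vni in vnis:
        lines.append(f"  member vni {vni}")
        lines.append("    ingress-replication protocol bgp")
    return "\n".join(lines)


def render_evpn(vnis):
    """Render the per-VNI stanzas of the evpn section for each VNI."""
    lines = ["evpn"]
    for vni in vnis:
        lines.append(f"  vni {vni} l2")
        lines.append("    rd auto")
        lines.append("    route-target import auto")
        lines.append("    route-target export auto")
    return "\n".join(lines)


# Using one list for both sections keeps the VNIs consistent between them.
vnis = [1404, 1418]
print(render_nve_members(vnis))
print(render_evpn(vnis))
```

Feeding both functions from the same list is the point of the exercise: a VNI that exists under evpn but not under nve1 (or the other way around) is an easy mistake to make by hand.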
MP-BGP
Configure a /32 IP on the loopback interface. It is used as the source and destination IP for BGP peering. Note that the PERMIT_LO route-map (its definition is not shown here) is what permits the loopback addresses to be redistributed into BGP.
router bgp 64572
  router-id 2.1.3.7
  bestpath as-path multipath-relax
  log-neighbor-changes
  address-family ipv4 unicast
    redistribute direct route-map PERMIT_LO
  address-family l2vpn evpn
    retain route-target all
  ! PURR-SW
  neighbor 10.255.254.254
    remote-as 64572
    update-source loopback0
    address-family ipv4 unicast
      send-community
      send-community extended
      route-reflector-client
    address-family l2vpn evpn
      send-community
      send-community extended
      route-reflector-client
  ! Repeat for each neighbor
FRRouting configuration
Here a bit of “duct tape” will be used. What I mean is that we first need to create a vxlan interface that receives VXLAN packets and decapsulates them. On Proxmox, in order to connect VMs to the vxlan interface, you need to create a virtual bridge that has the vxlan interface as a bridge port. Proxmox then sends the VM traffic through the vxlan interface, which encapsulates each Ethernet frame into a VXLAN packet and sends it to the destination VTEP.
An important thing worth mentioning is how MTU needs to be handled. VXLAN adds an overhead of around 50 bytes over IPv4 and around 70 bytes over IPv6. This means that if we set both the vxlan interface and the virtual bridge to MTU 4096, a full-size packet from the bridge would no longer fit after encapsulation, and the vxlan interface would have to drop it. The MTU of the vxlan interface also needs to match the MTU of the physical port. In simple terms, the underlay interface (a physical link like eno1, enp24s0, etc.) must have an MTU that is larger than or equal to the MTU of the vxlan interface (vni1404), and the bridge on top must stay below that by at least the VXLAN overhead.
That is exactly why the vxlan interface (the one decapsulating VXLAN packets) has MTU 4096. The virtual bridge connected to it, on the other hand, needs an MTU that is at least 50 bytes lower, which gives 4046. I personally have set the MTU on virtual bridges to 4030 to have a safe margin. Even with MTU 4030 my network still reached 40 Gbps with single-digit retransmissions in a parallel iperf3 test.
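The MTU arithmetic above can be written out as a small calculation. This is a sketch of the usual header breakdown (inner Ethernet, VXLAN, UDP and outer IP headers without options), assuming untagged inner frames; the function names are mine.

```python
# Bytes added around the original packet by VXLAN encapsulation.
INNER_ETHERNET = 14            # Ethernet header of the encapsulated frame
VXLAN_HEADER = 8
UDP_HEADER = 8
OUTER_IP = {4: 20, 6: 40}      # outer IPv4 / IPv6 header, no options


def vxlan_overhead(ip_version=4):
    """Total encapsulation overhead in bytes for the given underlay IP version."""
    return INNER_ETHERNET + VXLAN_HEADER + UDP_HEADER + OUTER_IP[ip_version]


def max_bridge_mtu(underlay_mtu, ip_version=4):
    """Largest bridge MTU that still fits into the underlay after encapsulation."""
    return underlay_mtu - vxlan_overhead(ip_version)


print(vxlan_overhead(4))       # overhead over IPv4
print(vxlan_overhead(6))       # overhead over IPv6
print(max_bridge_mtu(4096))    # upper bound for the bridge with a 4096 underlay
```

With an underlay MTU of 4096 this gives the 4046 mentioned above, so a bridge MTU of 4030 leaves 16 bytes of headroom.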
IGP
The IGP configuration is pretty much the same as on Cisco. The only difference is the syntax, which is still very similar.
router ospf
 ospf router-id SomeRouterID
 auto-cost reference-bandwidth 1000
 passive-interface default
exit
VXLAN interfaces
Here you will create the interface for decapsulating the VXLAN packets.
What I recommend is to use the /etc/network/interfaces.d directory and create a file there that holds only the vni interfaces.
By default Proxmox has a source /etc/network/interfaces.d/* line in the /etc/network/interfaces file. In case that line is not there, add it at the beginning of the file.
Here is an example of a vxlan interface with VNI 1404. I use vniXXXX as the naming scheme, as it shows directly that this is a VXLAN interface with VNI 1404.
auto vni1404
iface vni1404 inet manual
    vxlan-id 1404
    vxlan-local-tunnelip YourLocalLOOPBACKInterfaceIP
    vxlan-dstport 4789
    vxlan-external on
    mtu 4096
Virtual Bridges
Now comes the part where we create the bridges for VMs and LXCs to connect to. These configurations need to be appended to the /etc/network/interfaces file. If the virtual bridges were added under the interfaces.d directory, they would not show up in the Proxmox WebGUI.
Here is an example of how a virtual switch (vmbr in the Proxmox WebGUI) is configured without an IP address. In these examples I use vnibrXXXX as the naming scheme, which shows instantly that it is a bridge for VNI 1404.
auto vnibr1404
iface vnibr1404 inet manual
    bridge-ports vni1404
    bridge-stp off
    bridge-fd 0
    learning off
    flooding on
    mtu 4030
And here is another example, for a virtual bridge with an IP address. You can comment out the gateway line if the host only needs to reach this subnet and should not use it for its default route.
auto vnibr1404
iface vnibr1404 inet static
    address 192.168.1.2/24
    gateway 192.168.1.1
    bridge-ports vni1404
    bridge-stp off
    bridge-fd 0
    learning off
    flooding on
    mtu 4030
MP-BGP
Here is an example of how to configure MP-BGP EVPN in FRR. Information-wise it looks pretty similar to the Nexus configuration. The differences are the syntax and that FRR is not a Route Reflector in this example. That means FRR connects to the Nexus as a client and exchanges MAC addresses using EVPN NLRI. This setup also doesn’t prevent advertising both IP routes and MAC addresses, since both address families are activated.
It is possible to create a BGP full mesh in which every router, FRR or otherwise, peers with every other router, but the number of sessions grows quadratically with the number of routers, so it does not scale.
router bgp SOMEASN
 bgp router-id SomeBGPRouterID
 no bgp default ipv4-unicast
 bgp graceful-restart
 bgp bestpath as-path multipath-relax
 no bgp network import-check
 neighbor NeighborIP remote-as SOMEASN
 neighbor NeighborIP update-source lo
 ! Repeat the previous 2 lines for every neighbor
 !
 address-family ipv4 unicast
  neighbor NeighborIP activate
  ! Repeat for every neighbor
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor NeighborIP activate
  ! Repeat for every neighbor
  advertise-all-vni
 exit-address-family
exit
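To put a number on why a full mesh stops being practical, here is a small sketch comparing the iBGP session count of a full mesh with a design that uses a single Route Reflector, as in this post. The function names are hypothetical, and a setup with redundant Route Reflectors would need more sessions than this simple case.

```python
def full_mesh_sessions(routers):
    """Every router peers with every other router: n * (n - 1) / 2 sessions."""
    return routers * (routers - 1) // 2


def route_reflector_sessions(routers):
    """One Route Reflector, every other router is its client: n - 1 sessions."""
    return routers - 1


for n in (4, 10, 20):
    print(n, "routers:", full_mesh_sessions(n), "mesh sessions vs",
          route_reflector_sessions(n), "with a route reflector")
```

For the four devices in this lab the difference is small (6 sessions versus 3), but it widens quickly as routers are added.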
Sources
[1] M. Mahalingam et al., “Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks,” www.rfc-editor.org, Aug. 2014, doi: https://doi.org/10.17487/RFC7348.