LACP LAG between USG1100 and XGS3700 not working properly

imaohw
imaohw Posts: 124  Ally Member
edited April 2021 in Security

I am trying to set up a LACP LAG between a USG1100 and an XGS3700 switch (three switches in a stack). I want to increase available bandwidth between the switch and the USG for traffic on one particular vlan.

After configuring both devices, the LAG shows as operational on the two XGS3700 ports (1/8 and 1/9) and is active on the USG1100; however, network traffic is not passing properly. This is the first time I am configuring a LAG between a USG and a switch, so I may have a configuration issue, but I cannot find it.

My testing (detailed below) leads me to believe that the network traffic issue is with the USG1100 side of the LAG and not the XGS3700 side. But I could be wrong. Sorry the description is so long but I wanted to provide all of the information I had.

Below is information on my setup and what I have done to try and diagnose the issue. I have a config file and a diaginfo file for the USG1100 which I can attach to a PM.

  • I am setting up a LACP LAG between a USG1100 (ports 4 and 5) and an XGS3700 (ports 1/8 and 1/9).
  • Port 1/8 of the XGS3700 is physically connected to port 4 of the USG1100 and port 1/9 is physically connected to port 5 with patch cables.
  • The USG1100 is running V4.35(AAPK.0)ITS-WK46-2019-11-27-191100445D firmware
  • The XGS3700 is running V4.30(AAGC.2)_20200115 | 01/15/2020 firmware
  • The LACP LAG is being used only for traffic on vlan 2.
  • Vlan 2 is “assigned” (base port) to lag0 (ports 4 and 5) on the USG1100.
  • Vlan 2 is tagged on ports 1/8 and 1/9 on the XGS3700.
  • With both ports physically connected and active on both machines the LACP LAG appears to be set up properly but is not passing traffic properly. Some devices in vlan 2 work fine, others pass traffic intermittently (access internet and other vlans sometimes), and others cannot pass traffic at all.
  • On the Port Statistics page of the USG1100, port 5 always shows less than 256 for Rx B/s. Often (with a 5-second refresh) it shows 0 Rx B/s. Port 4 shows a much higher number (usually in the thousands).
  • All configuration changes made to the XGS3700 and USG1100 for testing were done using a PC on vlan 1 (the management vlan - XGS3700 and USG1100 are in vlan 1).
  • For testing I started a continuous ping from a PC in vlan 2 to the XGS3700 (its IP in vlan 1) and a continuous ping from a PC in vlan 2 to the USG1100 (using its IP in vlan 1). With both ports up the pings were working, but some other PCs and mobile devices in vlan 2 could not reach the XGS3700, the USG1100, other vlans, or the internet.
  • I "downed" (unchecked Active in port setup screen) port 1/9 (part of LAG) on the XGS3700 and pressed Apply. The pings continued to work and all other devices started to function properly (had internet and inter-vlan access). At this point only port 1/8 (connected to USG1100 port 4) was active. 
  • I then activated port 1/9 again and some of the devices in vlan 2 stopped having internet and inter-vlan access.
  • I next downed port 1/8 leaving only port 1/9 as part of the LACP LAG. As soon as I pressed Apply the continuous pings started timing out. None of the devices on vlan 2 had internet or inter-vlan access.

A summary so far:

  • With both ports in the LACP LAG active some devices on the vlan assigned to the lag work, some devices work intermittently, and some devices do not work at all.
  • With only switch port 1/8 active (port 1/9 down) all devices on the vlan assigned to the lag work properly.
  • With only switch port 1/9 active (port 1/8 down) NO devices on the vlan assigned to the lag work properly.
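
The pattern summarized above (some devices fine and others dead with both links up) is what one would expect if the switch spreads flows across the two LAG members by hashing addresses and one member silently loses frames. Below is a minimal Python sketch of that idea. It assumes a simplified layer-2 hash (XOR of the last byte of the source and destination MAC, modulo the number of active links); the hash the XGS3700 actually uses is not stated in this thread, and the MAC addresses and client names are made up for illustration.

```python
def member_for_flow(src_mac: str, dst_mac: str, n_links: int) -> int:
    """Simplified layer-2 hash (an assumption, not the XGS3700's documented
    algorithm): XOR the last byte of each MAC, then take the result modulo
    the number of active LAG members."""
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % n_links


def flow_works(src_mac, dst_mac, active_links, healthy_links):
    """A flow passes only if the member link it hashes onto is healthy."""
    if not active_links:
        return False
    index = member_for_flow(src_mac, dst_mac, len(active_links))
    return active_links[index] in healthy_links


# Hypothetical addresses, for illustration only.
USG_MAC = "00:00:00:00:00:10"
CLIENTS = {"pc-a": "00:00:00:00:00:01", "pc-b": "00:00:00:00:00:02"}
HEALTHY = {"1/8"}  # assumption: frames sent on 1/9 (to USG port 5) are lost

for active in (["1/8", "1/9"], ["1/8"], ["1/9"]):
    status = {name: flow_works(mac, USG_MAC, active, HEALTHY)
              for name, mac in CLIENTS.items()}
    print(active, status)
```

In this model, clients whose flows hash onto 1/9 fail while the others keep working, and taking 1/9 out of the LAG moves every flow onto the healthy link.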

Next I decided to check the physical patch cables between the USG1100 and the XGS3700.

  • I checked the Port Status on ports 1/8 and 1/9. Neither port showed errors. 
  • I ran a cable test on port 1/8 and 1/9. Both cables passed.
  • To be extra cautious I replaced the cable between port 1/9 on the XGS3700 and port 5 on the USG1100 with a known working cable. I still saw the same issues. 
  • I physically disconnected the cable from switch port 1/9. All devices on vlan 2 could pass traffic. 
  • I physically moved the cable from switch port 1/8 to port 1/9. A cable now connected port 4 on the USG1100 to port 1/9 on the XGS3700. All devices on vlan 2 could pass traffic.
  • I disconnected the cable from port 4 of the USG1100 and tried with a cable from the USG1100 port 5 first to the XGS3700 port 1/8 and then 1/9. Neither setup worked. No devices on vlan 2 could pass traffic.

It appeared that the issue was with port 5 of the USG1100. I double checked the configuration but did not see any issues.
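
The elimination reasoning behind that conclusion can be written out as a small Python sketch. The records below encode only the four cable combinations described above; the field names and the `suspects` helper are hypothetical, introduced here for illustration.

```python
# Each record is one cabling test from the list above.
TESTS = [
    {"usg_port": 4, "switch_port": "1/8", "ok": True},
    {"usg_port": 4, "switch_port": "1/9", "ok": True},
    {"usg_port": 5, "switch_port": "1/8", "ok": False},
    {"usg_port": 5, "switch_port": "1/9", "ok": False},
]


def suspects(tests):
    """Return (component, value) pairs that appear in failing tests
    but never in a passing test."""
    failing = [t for t in tests if not t["ok"]]
    passing = [t for t in tests if t["ok"]]
    found = []
    for key in ("usg_port", "switch_port"):
        fail_values = {t[key] for t in failing}
        pass_values = {t[key] for t in passing}
        for value in sorted(fail_values - pass_values, key=str):
            found.append((key, value))
    return found


print(suspects(TESTS))
```

Both switch ports appear in passing tests, so they are ruled out, and the only component left is USG port 5. This narrows down where the fault shows up; it does not say whether the cause is hardware or configuration.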

Next I decided to use the Packet Capture capability of the USG1100 to gather data for reporting the issue.

  • I started with only port 1/9 of the LACP LAG on the XGS3700 active (port 1/8 down) and its patch cable physically connected to port 5 of the USG1100.
  • The continuous pings from a PC on vlan 2 to the XGS3700 and the USG1100 were timing out.
  • On the USG1100 I went to the Maintenance tab, Diagnostics, Packet Capture, Capture and selected interface 5. I then pressed "Capture".
  • As soon as I started the Packet Capture the continuous pings from the PC in vlan 2 to the XGS3700 and the USG1100 started working.
  • As soon as I stopped the Packet Capture the continuous pings from the PC in vlan 2 to the XGS3700 and the USG1100 started timing out again.
  • This behavior of the pings working only when the packet capture was running is repeatable.

I'm not sure where to go from here. It appears the issue is with the USG1100. I may have a configuration issue, but I don't see it. Please let me know what additional information you would like me to capture and send to you.

For now, I have the LACP LAG up but only port 1/8 on the XGS3700 connected to port 4 on the USG1100. This is the only way I can reliably have traffic flowing in vlan 2.

Thanks for your help.

All Replies

  • Zyxel_Vic
    Zyxel_Vic Posts: 282  Zyxel Employee

    Hi @imaohw

    Would you share your settings (USG and XGS) with us in a private message? Some further checking in our lab may be required. Per your description, all pings pass while packet capture is enabled. This could be a key clue, and we will focus on this part first.

    For a complete check, we may need you to provide topology information and configuration files so that we can rebuild the setup in our lab.

  • Zyxel_Vic
    Zyxel_Vic Posts: 282  Zyxel Employee

    By the way, the firmware you're using on the USG1100 is not a formal release. This version is mainly for analyzing the specific issue you reported to us. Can you change the firmware to the official version (V4.35(AAPK.2)C0)?

    You can download the firmware directly from the Myzyxel.com portal.

  • imaohw
    imaohw Posts: 124  Ally Member

    @Zyxel_Vic - private message sent.

  • Zyxel_Vic
    Zyxel_Vic Posts: 282  Zyxel Employee

    Hi @imaohw

    Thanks for your cooperation on this case. We're now working with our USA colleagues to find the possible causes.

  • bzzt
    bzzt Posts: 6  Freshman Member
    I have a very similar problem with an LACP LAG between an XGS3700 stack and a Cisco 3750E.
    In the same fashion, when I down one of the LAG ports the connection persists, but if I restore it and down the second LAG port I lose all connectivity between the switches.
    I tried an LACP LAG between two Cisco switches and it works as one would expect, providing redundancy and everything.
    This makes me think the problem lies within the XGS3700 and the way LACP operates on it. I am running the latest firmware, which is 4.30(AAGC.2)C0 at the moment of this post.
    I will make a separate thread with our topology and configs if necessary.
  • Zyxel_Emily
    Zyxel_Emily Posts: 1,396  Zyxel Employee
    Symptom:
    USG1100 ge4-------------1/8 XGS3700------client
                     ge5-------------1/9
    When both links are connected, traffic from the client sometimes fails.
    If the ge4 link is disconnected, no traffic works.
    If the ge5 link is disconnected, all traffic works.

    The root cause of the issue is that the old Device HA mode does not fully support LAG.
    Only Device HA Pro mode works with a LAG interface.


    Hence, you can select one of the following methods to make it work.
    Method 1.
    Use the old Device HA mode.
    Inactivate interface P5-Home2 and reboot the device to solve the issue.

    Method 2. 
    Use Device HA Pro mode.
    Keep interface P5-Home2 activated and reboot the device.
  • imaohw
    imaohw Posts: 124  Ally Member
    @Zyxel_Emily - I want to verify that what you are saying is that having Device HA turned on and monitoring an interface that is not part of the LAG is causing the issue with the traffic on the LAG interfaces?  

    In other words, you are saying Device HA cannot be used if the USG/Zywall has any LAGs of any type (active-backup, 802.3ad, or balance-alb) configured.

    This is not how the knowledge base article describes Device HA (see https://kb.zyxel.com/KB/searchArticle!gwsViewDetail.action?articleOid=015983&lang=EN).

    The knowledge base article says Device HA cannot monitor a LAG.  Not that you cannot have a LAG.

    This seems to me like a defect that should be fixed.  The USG/Zywall is not functioning as described in Zyxel’s own documentation.

    At a minimum, if you cannot have any LAGs with Device HA, then the USG/Zywall UI (and CLI) should not allow Device HA to be turned on if there are any LAGs configured and, vice versa, should not allow LAGs to be created if Device HA is on.

    Basically you are telling me I have two options: “Don’t use a LAG”, which is unacceptable, or “Use Device HA-Pro”, which I do not currently use for the following two reasons:
    1. it requires a dedicated port on the USG, which I do not currently have available
    2. when the “Main” USG is offline for any reason the “Backup” USG becomes the “Main” and does not fall back when the original “Main” USG comes back up. This is a problem for my installation because the internet connections on my “Backup” USG are much more expensive to use.
  • Zyxel_Emily
    Zyxel_Emily Posts: 1,396  Zyxel Employee

    USG1100 ge4-------------1/8 XGS3700------client
                     ge5-------------1/9

    If you don't want to use Device HA Pro, try the workaround with the current Device HA mode and LAG setting.
    Inactivate interface P5-Home2 and reboot the USG1100.
    Whether ge4 or ge5 is disconnected, all traffic still works.

  • imaohw
    imaohw Posts: 124  Ally Member
    edited September 2020
    @Zyxel_Emily - if I inactivate P5-Home2 (ge5), doesn’t that mean that the LAG is only using P5-Home (ge4)?

    I would use Device HA Pro if it had the option to go back to the “master” device when the “master” came back up, the way Device HA does.
  • Zyxel_Emily
    Zyxel_Emily Posts: 1,396  Zyxel Employee
    When you inactivate interface P5-Home2, only the IP setting of the eth4 interface is disabled.
    This prevents device-ha from running “ip link ethx down/up”, which affects the LAG.
    The LAG interface can still work with both interfaces.
