Adding LAGs kills AP Management

dkyeager
dkyeager Posts: 69  Ally Member
edited August 2022 in Switch
We have an XS3800-28 acting as the campus core switch at a single location. It connects to individual XS1930-12HPs, which in turn connect to a USG Flex 500 and to multiple WAX650S, WAX610D, and one WAX510D access points (8 APs total). Two pairs of OM4 fiber, at 10 Gb/s each, connect each XS1930 to the XS3800.

If we have just one pair between these switches the AP Management on the USG Flex 500 works fine. 

Adding static Link Aggregation (without LACP) using src-dst-mac across both pairs, combined with Rapid Spanning Tree, typically kills AP Management within 24 hours. The APs are still usable by client devices, but CAPWAP errors for "AP Disconnect." "Reason: Idle in RUN state" appear in the USG Flex 500 log. Disabling one of the two links per XS1930-12HP also kills AP Management for the APs connecting through the LAG. APs with no LAG link between them and the USG Flex 500 continue to work fine. We had the same errors on a LAG using copper links.
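
For readers less familiar with static LAGs, here is a minimal Python sketch of how a src-dst-mac hash might pin a conversation to one member link. It is only an illustration: the XOR-and-modulo hash is a common textbook scheme and is an assumption here, not the algorithm Zyxel switches are known to use, and the MAC addresses and port names are hypothetical.

```python
def mac_to_int(mac: str) -> int:
    """Convert a MAC such as 'aa:bb:cc:dd:ee:ff' to an integer."""
    return int(mac.replace(":", "").replace("-", ""), 16)


def pick_member_link(src_mac: str, dst_mac: str, active_links: list[str]) -> str:
    """Choose a LAG member port from the src/dst MAC pair.

    Illustrative hash only (XOR of both addresses, modulo link count);
    real switch ASICs may use a different function.
    """
    if not active_links:
        raise ValueError("LAG has no active member links")
    key = mac_to_int(src_mac) ^ mac_to_int(dst_mac)
    return active_links[key % len(active_links)]


# Hypothetical addresses and port names, for illustration only.
AP_MAC = "00:00:5e:00:53:01"
GATEWAY_MAC = "00:00:5e:00:53:10"
LINKS = ["port25", "port26"]

# The same MAC pair always maps to the same member link, in both directions,
# because XOR is symmetric.
print(pick_member_link(AP_MAC, GATEWAY_MAC, LINKS))
print(pick_member_link(GATEWAY_MAC, AP_MAC, LINKS))

# With one member disabled, every flow falls back to the remaining link.
print(pick_member_link(AP_MAC, GATEWAY_MAC, ["port25"]))
```

Under this kind of scheme, each AP-to-gateway conversation rides a single member link, so a fault that affects how one switch forwards on the LAG could break management traffic for some devices while client traffic on other flows keeps working.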

(Some Zyxel switches have been left out for simplification.  All equipment on the latest public firmware. Not using Nebula).

Suggestions to fix this? Questions?  Thanks

Best Answers

  • dkyeager
    dkyeager Posts: 69  Ally Member
    Answer ✓
    All 8 APs are properly shown in the USG Flex 500 AP Management List (most are doing two LAG hops).  Thank you very much Zyxel_Chris for your help and the patched firmware.  Please close any tickets you have on this.
  • Zyxel_Chris
    Zyxel_Chris Posts: 660  Zyxel Employee
    Answer ✓
    Update:
    The root cause is on the XS1930 with a LAG interface: traffic was sent to the wrong interface, which caused traffic loss. This case has been resolved by the datecode firmware; the official release will be in December (f/w: 4.70).
    Chris

All Replies

  • mMontana
    mMontana Posts: 1,300  Guru Member
    Is it mandatory to use Rapid Spanning Tree?
  • dkyeager
    dkyeager Posts: 69  Ally Member
    The low port density of the XS1930 switches means we need more of them, which increases the risk of accidentally creating a link loop.  We have about 10 VLANs, so using MSTP might be possible.  Going dynamic with LACP on the LAGs rather than static is also possible, although the manuals discouraged that for performance reasons.  There is also loop guard.  With other vendors, using these in conjunction with broadcast storm control creates issues.  Does it also create issues on Zyxel switches?

    Just found out that firmware updates were released while we were doing this testing, so the APs were upgraded to 6.25 a few hours ago.  So far so good, although previously it took 24 hours for APs to fall out of AP Management with the LAGs and Rapid Spanning Tree.

    We also just saw the 5.10 update for the USG Flex 500.  Our test unit was updated to it today.  We plan to leave it on just the test Flex until next week to see whether we have any issues, given the multitude of changes.

    Either of these updates could solve the above issue.  Will post the AP 6.25 results tomorrow night.
  • dkyeager
    dkyeager Posts: 69  Ally Member
    The 6.25 update for the APs did not solve the issue: 3 out of 4 APs using a LAG have fallen off AP Management.  The 5.10 update will be put on the production USG Flex 500 on Sunday, assuming the test unit continues to perform well.
  • Zyxel_Chris
    Zyxel_Chris Posts: 660  Zyxel Employee
    @dkyeager
    Does your topology look like the following screenshot?
    Also, if the issue still persists, please collect the tech support file (while the issue is occurring) on the XS3800 and on one of the XS1930s. I'd like to check the MAC table, so please provide the AP MAC address to me as well.


    Chris
  • dkyeager
    dkyeager Posts: 69  Ally Member
    We currently do not have a LAG on the USG Flex 500 (but plan to add one once this issue is resolved).  Otherwise the diagram is correct for all of our XS1930-12HPs (some have more than one AP), plus a few other Zyxel switches.  I will message the other information to you.  Thanks for your help.
  • dkyeager
    dkyeager Posts: 69  Ally Member
    Update:  The 5.10 update on the USG Flex 500 did not solve the issue either.  4 APs on LAGs dropped from AP Management within a few hours.  4 APs without a LAG, off various XS1930-12HP switches, are doing fine.  The tech support data has yielded no likely suspects at this time.  Zyxel's requests have moved on to packet captures at key points, which I will be supplying soon.
  • dkyeager
    dkyeager Posts: 69  Ally Member
    Update: Zyxel was able to replicate the issue. The current theory is that this is an XS3800-28 issue, but this is subject to change until the analysis is complete.
  • dkyeager
    dkyeager Posts: 69  Ally Member
    Zyxel has sent an XS1930 patch, which appears to be working after 24 hours with a WAX650S.  Now testing with a WAX510D and a WAX610D; after that we will test with 2 LAGs in series and the remaining APs and XS1930s.