XGS3700 switch stack issues after one switch in stack restarts

imaohw Posts: 115  Ally Member
edited April 14 in Switch
I have four XGS3700 switches in a stack using ring topology.  All switches are running the latest firmware.

Yesterday the UPS for the switch in slot 2 failed (each switch in the stack is on a separate UPS).  This caused the switch in slot 2 to lose power.  The other three switches in the stack continued to operate and pass traffic properly.  After approximately 30 minutes power was restored to the XGS3700 in slot 2. 

When power was restored, the XGS3700 in slot 2 rebooted. As soon as it rejoined the stack, all of the switches in the stack stopped passing traffic properly.  The switch UI could not be reached by any device connected to any port on any of the switches in the stack.  The switch UI also could not be reached through a device connected to the management port of any of the four XGS3700s.  Devices needing an IP address could not get one from the DHCP server. Some traffic did get through the stack, but most did not.

To rule out an inter-VLAN routing problem, I attempted to connect to the UI of our USG1100 (connected to the switch) to see if it was reporting any issues.  I found a PC connected to the switch which could reach the login screen of the USG1100 UI. However, when I entered the credentials they were rejected as invalid.  I tried this several times (until the PC was locked out from the USG1100).  The correct username and password were definitely entered.  It was as if the data was not reaching the USG1100 intact.

At this point I pulled the plug on the XGS3700 in slot 2, waited 30 seconds, and plugged it back in to see if rebooting the switch would help.  It did not.  After the switch in slot 2 rebooted, traffic would still not flow properly.

Ultimately, to try to get back to a fully functioning state, I pulled the plug on all four XGS3700s in the stack.  I then plugged them in one by one, in slot order, waiting for each switch to fully boot before plugging in the next switch.  This worked, and the switch stack went back to passing traffic normally.

This cascading failure when one switch in the stack comes back online is totally unacceptable when in theory an XGS3700 stack should provide a high level of fault tolerance.

Unfortunately, I could not get any diagnostic files from the switch stack before I had to pull the power as I could not reach the switch UI and I needed to get the switch stack operating ASAP.


All Replies

  • Zyxel_Jonas
    Zyxel_Jonas Posts: 233  Zyxel Employee
    Hi @imaohw,

    I ran a local test with the same equipment and scenario, but saw no anomaly on my side.
    You mentioned "slot 2". To clarify: when the issue occurred, did you notice whether the slot numbers shown on the front panels of all four XGS3700 switches were normal?
    Also, if it is convenient for you, please PM me your configuration so I can try to reproduce the issue in my lab.

    Thanks.
    Jonas,
  • imaohw
    imaohw Posts: 115  Ally Member
    @Nebula_Jonas - any update?
  • Zyxel_Jonas
    Zyxel_Jonas Posts: 233  Zyxel Employee
    Hi @imaohw,

    Yes. Unfortunately, after testing in my lab with your configuration, everything works fine.
    Regarding this issue, you mentioned that when slot 2 came back, the master indicator light was off, which is odd. We need to collect the console output to check for any error logs.

    In case the issue happens again, I may need your help collecting the following information for investigation:
    1. Connect a console cable (RS-232) to the Master and to the switch that is powered down (for example, slot 2).
    2. Open Tera Term or PuTTY to capture and save the output logs.
    3. Power on the slot 2 switch.
    4. The Master and the slot 2 switch will print logs.
    5. After the switch has fully booted, please collect tech-support from both switches via the CLI [show tech-support].
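
If a terminal emulator is not convenient, the capture in steps 2 to 4 could also be scripted. Below is a minimal Python sketch, not an official Zyxel tool. It copies lines from any byte stream to a log, prefixing each with a UTC timestamp and a label so the Master and slot 2 captures can be lined up afterwards. The function name `capture_console`, the labels, and the sample lines are hypothetical; opening the actual serial port (for example with the third-party pyserial package) is assumed and not shown.

```python
import io
from datetime import datetime, timezone
from typing import BinaryIO, Callable, TextIO


def _utc_now() -> datetime:
    """Return the current time in UTC."""
    return datetime.now(timezone.utc)


def capture_console(
    stream: BinaryIO,
    sink: TextIO,
    label: str,
    max_idle_reads: int = 1,
    now: Callable[[], datetime] = _utc_now,
) -> int:
    """Copy console lines from `stream` to `sink` with a timestamp and label.

    Stops after `max_idle_reads` consecutive empty reads. An empty read means
    end of data for a file-like object, or a read timeout for a serial port
    opened with a timeout. Returns the number of lines written.
    """
    lines = 0
    idle = 0
    while idle < max_idle_reads:
        raw = stream.readline()
        if not raw:
            idle += 1
            continue
        idle = 0
        # Console output may contain stray bytes; replace rather than fail.
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        sink.write(f"{now().isoformat()} [{label}] {text}\n")
        lines += 1
    return lines


if __name__ == "__main__":
    # Placeholder bytes standing in for real console output (hypothetical).
    fake_console = io.BytesIO(b"placeholder line 1\r\nplaceholder line 2\r\n")
    log = io.StringIO()
    written = capture_console(fake_console, log, "slot2")
    print(f"captured {written} lines")
    print(log.getvalue(), end="")
```

With real hardware, `stream` would be the opened serial port object and `sink` a file opened for writing, one per switch. The port name and serial settings depend on your setup, and a larger `max_idle_reads` keeps the capture running through quiet periods during boot.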

    Sorry for the inconvenience.
    Jonas,
  • imaohw
    imaohw Posts: 115  Ally Member
    @Nebula_Jonas - when we have some down time I will try and reproduce and collect logs.
  • Zyxel_Jonas
    Zyxel_Jonas Posts: 233  Zyxel Employee
    Hi @imaohw,

    Got it.
    Thanks for the support. 

    Jonas,