Contents
1.1.1 Finding the “error-disabled” ports across all my edge switches in the network.
1.1.2 When was the last time the spanning tree topology changed in my switched network?
1.1.3 Are all my switches pointing to the correct root bridge?
1.1.4 Let’s check that all my devices are running the same release of software.
1.1.5 What release is going to be booted the next time I do a reboot?
1.1.6 Any core dump files present on any of my network devices?
1.1.7 Are my devices getting their time sync from the same source?
1.1.8 Way too many MAC addresses showing on an edge switch. What's going on?
1.1.9 What errors are showing up in the logs?
1.1.10 Do my multicast routers agree on a BSR and RP?
1.1.11 Let’s check that the underlying interfaces that keep the network up and running are all configured in a like manner.
1.1.12 Have any of my devices in the network rebooted ungracefully? That is, have they rebooted due to a crash of some kind or an unexpected power failure?
1.1.13 Useful commands to check the health of your SBx8100 units.
1.1.14 Going to mess around with config files and want to first make a snapshot of your running config on all your devices?
1.1.15 I have a bunch of PoE devices in my network and want to quickly check their status.
1.1.1 Finding the “error-disabled” ports across all my edge switches in the network.
Cmd:
atmf working-set group all
sh int bri | grep err
Output:
=====================================================================
STAR-core, Virtulization-Sw-Val, stsw523, stswS1, stswcsg, stswe1, stswe2,
stswr1, stswr2, stswr3, stswr4, stswr7:
=====================================================================
=======
stswr5:
=======
port1.0.6 err-disabled down
=======
stswr6:
=======
port1.0.23 err-disabled down
Analysis:
Use this to find any error-disabled ports that you may have. I use it across all my edge switches, as the devices on the end of the ports sometimes create loops in the network and the ports go error-disabled. I’ve set the timeout to 24 hours to flush out the offending devices/ports/users.
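Once the loop at the far end has been dealt with, you can clear the offending port from the same session. This is just a sketch using the node and port from the output above (stswr5, port1.0.6); bouncing the port usually clears the err-disabled state, but adjust to suit your own setup:
Cmd:
atmf working-set stswr5
configure terminal
interface port1.0.6
shutdown
no shutdown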
1.1.2 When was the last time the spanning tree topology changed in my switched network?
Cmd:
atmf working-set group all
show spanning-tree | grep last
Output:
==========
STAR-core:
==========
% Default: 42 topology change(s) - last topology change Tue Jan 26 09:47:34 2016
=====================
Virtulization-Sw-Val:
=====================
% Default: 35 topology change(s) - last topology change Fri Feb 12 14:51:34 2016
========
stsw523:
========
% Default: 7 topology change(s) - last topology change Tue Jan 26 09:47:33 2016
=======
stswS1:
=======
% Default: 27 topology change(s) - last topology change Tue Jan 26 09:47:33 2016
========
stswcsg:
========
% Default: 49 topology change(s) - last topology change Tue Jan 26 09:47:33 2016
=======
stswe1:
=======
% Default: 42 topology change(s) - last topology change Tue Jan 26 09:47:33 2016
=======
stswe2:
=======
% Default: 3 topology change(s) - last topology change Tue Jan 26 09:47:34 2016
=======
stswr1:
=======
% Default: 0 topology change(s) - last topology change Never
=======
stswr7:
=======
% Default: 36 topology change(s) - last topology change Tue Jan 26 09:47:33 2016
Analysis:
Use this to highlight spanning tree issues. Too many topology changes point to issues in the network, and this will highlight the switch(es) that are changing too often. In the output above most switches have a last topology change of Tue Jan 26 at 09:47:33. This is correct, as a brief power outage affected the network and spanning tree reconverged on that date. stswr1 and stswr2 have different times because they are EPSR nodes and are not partaking in spanning tree. Virtulization-Sw-Val, on the other hand, should be the same as the other switches, but because it is not set up to point to the correct root bridge, it has been changing topology due to other outside influences. This must be corrected, as it has the potential to destabilise the network.
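To dig into the switch that keeps changing, narrow the working-set down to just that node and look at the full spanning-tree output (the root it has elected, plus the per-port roles and states). A quick sketch using the suspect switch from above:
Cmd:
atmf working-set Virtulization-Sw-Val
show spanning-tree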
1.1.3 Are all my switches pointing to the correct root bridge?
Cmd:
atmf working-set group all
show spanning-tree bri | inc "Root Id"
Output:
=====================================================================
STAR-core, stsw523, stswS1, stswcsg, stswe1, stswe2, stswr3, stswr4, stswr5,
stswr6, stswr7:
=====================================================================
Default: Root Id 1000:0000cd3704c1
=====================
Virtulization-Sw-Val:
=====================
Default: Root Id 8000:eccd6dc11a8d
=======
stswr1:
=======
Default: Root Id 8000:eccd6d37c319
=======
stswr2:
=======
Default: Root Id 0000:eccd6d20c011
Analysis:
Most of my switches are correctly pointing to the root bridge of cd3704c1. So far so good, but stswr1 and stswr2 are pointing to something different. stswr1 and stswr2 are nodes in an EPSR ring, so they won’t point to cd3704c1. Virtulization-Sw-Val should point to cd3704c1, so there is work to do: get onto the Virtulization-Sw-Val switch and adjust its spanning tree attributes so that it points to cd3704c1. This will help the stability of the network.
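One way to keep the intended root winning the election is to pin its bridge priority down low. The Root Id above starts with 1000, i.e. priority 4096, so the sketch below simply makes that explicit on whichever node owns cd3704c1. This assumes plain RSTP (MSTP uses per-instance priority commands) and that BPDUs are actually reaching Virtulization-Sw-Val over its uplink, so treat it as a starting point rather than the whole fix:
Cmd:
atmf working-set <root-bridge-node>
configure terminal
spanning-tree priority 4096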
1.1.4 Let’s check that all my devices are running the same release of software.
Cmd:
atmf working-set group all
show sys | grep Software
Output:
=====================================================================
STAR-core, Virtulization-Sw-Val, stswcsg, stswe1, stswe2, stswr1, stswr2,
stswr3, stswr4, stswr5, stswr6, stswr7:
=====================================================================
Software version : 5.4.5-2.1
========
stsw523:
========
Software version : 5.4.5-2.1
B2 Software Lab
=======
stswS1:
=======
Software version : 5.4.5-2.1
B1 L2 Software
Analysis:
Yep, all the switches are running the same release of software, which is what I want. It would be instantly obvious if a switch was on a different release.
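If one node had come back with a different release, AMF can push the correct file out to it from the working-set. A hedged sketch, using a node and release file name taken from the next section (check the exact 'atmf distribute firmware' options for your release first):
Cmd:
atmf working-set stswe1
atmf distribute firmware x230-5.4.5-2.1.rel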
1.1.5 What release is going to be booted the next time I do a reboot?
Cmd:
atmf working-set group all
show boot | inc "Current boot image"
Output:
==========
STAR-core:
==========
Current boot image : flash:/x930-5.4.5-2.1.rel (file exists)
=====================================
Virtulization-Sw-Val, stswr3, stswr5:
=====================================
Current boot image : flash:/x310-5.4.5-2.1.rel (file exists)
========================================
stsw523, stswcsg, stswr4, stswr6, stswr7:
=========================================
Current boot image : flash:/x510-5.4.5-2.1.rel (file exists)
=======
stswS1:
=======
Current boot image : flash:/x210-5.4.5-2.1.rel (file exists)
=======
stswe1:
=======
Current boot image : flash:/x230-5.4.5-2.1.rel (file exists)
=======================
stswe2, stswr1, stswr2:
=======================
Current boot image : flash:/x610-5.4.5-2.1.rel (file exists)
Analysis:
Yep, the next time any of the switches reboots, it will boot from the release of software that I want it to. This helps keep a neat and tidy network.
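If the boot pointer on a node were wrong, it can be fixed from here as well. Note that each platform needs its own .rel file, so do this per node (or per platform group) rather than across a mixed working-set. A sketch using names from the output above:
Cmd:
atmf working-set stsw523
configure terminal
boot system flash:/x510-5.4.5-2.1.rel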
1.1.6 Any core dump files present on any of my network devices?
Cmd:
atmf working-set group all
dir *.tgz
Output:
% No such file or directory
Analysis:
This command checks across all the devices for any core dump files in the directory structure. No files is a good sign; it shows the devices are happy in their work. If there are some, then get onto the device and start checking last reboot times and other signs of a device not functioning correctly. Note: 'show exception log' will do the same thing for you.
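The exception log check mentioned above runs the same way across the working-set:
Cmd:
atmf working-set group all
show exception log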
1.1.7 Are my devices getting their time sync from the same source?
Cmd:
atmf working-set group all
show run | grep ntp
Output:
==========
STAR-core:
==========
ntp peer 10.32.16.210
====================================================================
Virtulization-Sw-Val, stsw523, stswS1, stswcsg, stswe1, stswe2, stswr1,
stswr2, stswr3, stswr4, stswr5, stswr6, stswr7:
=====================================================================
ntp peer 10.36.250.13
Analysis:
The STAR-core switch gets its time sync from 10.32.16.210 and all the other switches get their time sync from 10.36.250.13. This is correct; it is as it was set up. All good.
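The config lines only tell you what the devices are meant to do. To confirm they are actually synchronised, follow up with the standard NTP show commands (assuming your release supports them):
Cmd:
atmf working-set group all
show ntp associations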
1.1.8 Way too many MAC addresses showing on an edge switch. What's going on?
One of my edge switches that I monitor with an SNMP monitoring tool is showing that it has way too many MAC address entries in the FDB. It really sticks out compared to the other edge switches. Why, what’s going on? See the chart series below and look for stswr6.
stswr6 needs attention
Cmd:
atmf working-set group all
sh run | grep "mac address-table"
Output:
========================================================================
STAR-core, Virtulization-Sw-Val, stsw523, stswS1, stswcsg, stswe1, stswe2,
stswr1, stswr2, stswr3, stswr4, stswr5, stswr7:
========================================================================
=======
stswr6:
=======
mac address-table ageing-time none
Analysis:
We discussed what would cause one edge switch to have a much greater number of MAC address entries than the other edge switches performing a similar function. Someone suggested that it looked like the MAC addresses weren’t ageing out, which prompted us to look for MAC address ageing commands in the running configs of all the switches, and bingo: stswr6 has a command to never age out its MAC address entries. The command had been put onto the device about a year prior to try and trap another issue that stswr6 had been seeing, and it hadn’t been removed.
Note: To have done the troubleshooting via the conventional method of opening up several terminal sessions and comparing running configs would have been a nightmare, not to mention time-consuming. With the above command, it was all confirmed and highlighted in a minute.
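The fix itself is a one-liner on the offending switch: remove the 'none' setting so the entries age out normally again (or set an explicit ageing time if that is what you want). A sketch:
Cmd:
atmf working-set stswr6
configure terminal
no mac address-table ageing-time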
In the graphs below, all the edge switches are now running at around 100 MAC address entries, as I would expect. The only exception is the STswcore switch, which is correctly running at about 330 MAC address entries.
stswr6 now performing as other edge switches
1.1.9 What errors are showing up in the logs?
Cmd:
atmf working-set group all
show log | grep err
Output:
=============
01-X81CFC960:
=============
2016 Mar 3 06:34:12 local6.err awplus-2.5 chassis[832]: Neighbor discovery has timed out on link 2.5.3
2016 Mar 3 06:34:13 local6.err awplus-1.6 chassis[791]: Neighbor discovery has timed out on link 1.6.3
2016 Mar 3 06:34:15 local6.err awplus-1.5 chassis[840]: Neighbor discovery has timed out on link 1.5.4
2016 Mar 3 06:50:36 user.notice 01-X81CFC960 IMISH[21577]: [manager@ttyS0]show log | grep err
========
03-X930:
========
2016 Mar 3 06:50:36 user.notice 03-X930 IMISH[1964]: [manager@01-X81CFC960.atmf]show log | grep err
==========
05-AR4050:
==========
2016 Mar 2 21:24:50 kern.notice awplus kernel: Kernel command line: console=ttyS0,115200 root=/dev/ram0 releasefile=AR4050S-tb150.rel bootversion=5.0.6 loglevel=1 extraflash=00000000 mtdoops.mtddev=errlog securitylevel=1 reladdr=0x8000000020010000,28c706f
2016 Mar 2 21:50:36 user.notice 05-AR4050 IMISH[2401]: [manager@01-X81CFC960.atmf]show log | grep err
==========
06-AR4050:
==========
2016 Mar 3 06:24:40 kern.notice awplus kernel: Kernel command line: console=ttyS0,115200 root=/dev/ram0 releasefile=AR4050S-tb150.rel bootversion=4.0.5-ARC loglevel=1 extraflash=00000000 mtdoops.mtddev=errlog securitylevel=1 reladdr=0x8000000030000000,28c706f
2016 Mar 3 06:50:36 user.notice 06-AR4050 IMISH[2093]: [manager@01-X81CFC960.atmf]show log | grep err
==========
08-AR3050:
==========
2016 Mar 2 21:24:59 kern.err awplus kernel: mtdoops: mtd device (mtddev=name/number) must be supplied
2016 Mar 2 21:25:13 user.err 08-AR3050 pbrd: PBR: Failed to add route table 1
2016 Mar 2 21:25:14 user.err 08-AR3050 pbrd: PBR: Failed to add policy route 10
2016 Mar 2 21:25:15 user.err 08-AR3050 pbrd: PBR: Failed to add policy route 20
2016 Mar 2 21:25:15 user.err 08-AR3050 pbrd: PBR: Failed to add policy route 30
2016 Mar 2 21:25:15 user.err 08-AR3050 pbrd: PBR: Failed to add policy route 40
2016 Mar 2 21:25:15 user.err 08-AR3050 pbrd: PBR: Failed to add policy route 50
2016 Mar 2 21:50:36 user.notice 08-AR3050 IMISH[1937]: [manager@01-X81CFC960.atmf]show log | grep err
========
21-X510:
========
2016 Mar 2 21:50:36 user.notice 21-X510 IMISH[2164]: [manager@01-X81CFC960.atmf]show log | grep err
Analysis:
Use this command to look for errors in your log files, across all or some of your devices. Note how in this instance some devices show lots of errors in the logs, which an administrator should follow up on, while other devices just show the command line entered, which means their logs are free from errors. (The kernel command-line entries above match only because they contain the string 'err' in 'errlog'; they are not actual errors.) That's a good state to be in; have yourself a coffee and a donut!
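You can widen the net to the other high-severity syslog levels using the same alternation trick as the PoE check later in this guide:
Cmd:
atmf working-set group all
show log | grep "err\|crit\|alert\|emerg"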
1.1.10 Do my multicast routers agree on a BSR and RP?
Cmd:
atmf working-set group all
show ip pim sparse-mode bsr-router
Output:
========
A1-Edge:
========
PIMv2 Bootstrap information
BSR address: 172.16.0.2
Uptime: 22:06:54, BSR Priority: 64, Hash mask length: 10
Expires: 00:01:41
Role: Non-candidate BSR
State: Accept Preferred
========
A2-Core:
========
PIMv2 Bootstrap information
This system is the Bootstrap Router (BSR)
BSR address: 172.16.0.2
Uptime: 03d01h58m, BSR Priority: 64, Hash mask length: 10
Next bootstrap message in 00:00:30
Role: Candidate BSR
State: Elected BSR
Candidate RP: 172.16.0.2(lo)
Advertisement interval 60 seconds
Next C-RP advertisement in 00:00:21
================================
A3-Access, B5-Access, B7-Access:
================================
==========
A4-Access:
==========
PIMv2 Bootstrap information
BSR address: 172.16.0.2
Uptime: 02d23h57m, BSR Priority: 64, Hash mask length: 10
Expires: 00:01:40
Role: Non-candidate BSR
State: Accept Preferred
========
B1-Edge:
========
PIMv2 Bootstrap information
BSR address: 172.16.0.2
Uptime: 22:05:31, BSR Priority: 64, Hash mask length: 10
Expires: 00:01:41
Role: Non-candidate BSR
State: Accept Preferred
========
B2-Core:
========
PIMv2 Bootstrap information
BSR address: 172.16.0.2
Uptime: 00:25:20, BSR Priority: 64, Hash mask length: 10
Expires: 00:01:41
Role: Non-candidate BSR
State: Accept Preferred
=================================
B3-Distribution, B4-Distribution:
=================================
PIMv2 Bootstrap information
BSR address: 172.16.0.2
Uptime: 06:16:30, BSR Priority: 64, Hash mask length: 10
Expires: 00:01:40
Role: Non-candidate BSR
State: Accept Preferred
==========
B6-Access:
==========
PIMv2 Bootstrap information
BSR address: 172.16.0.2
Uptime: 06:16:31, BSR Priority: 64, Hash mask length: 10
Expires: 00:01:40
Role: Non-candidate BSR
State: Accept Preferred
Analysis:
All PIM routers in a network must agree on a BSR. This command shows that the routers all have 172.16.0.2 as the BSR. That’s good; it shows your PIM domain is nice and stable. If they don’t all report back with the same IP address, start troubleshooting.
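Agreeing on the BSR is half the story; the routers also need to agree on the RP(s) the BSR is advertising. A follow-up check, assuming your release has the usual PIM-SM show commands:
Cmd:
atmf working-set group all
show ip pim sparse-mode rp mapping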
1.1.11 Let’s check that the underlying interfaces that keep the network up and running are all configured in a like manner.
Cmd:
atmf working-set group all
show run int | inc inter|description|trunk|atmf
Output:
==================
box10, box8, box9:
==================
interface port1.0.1-1.0.4
interface port1.0.5
description <<< Connection to: box7 >>>
switchport atmf-link
switchport mode trunk
switchport trunk allowed vlan add 1004,4000
switchport trunk native vlan none
interface port1.0.6-1.0.24
interface vlan1004
description <<< EPSR Data vlan >>>
interface vlan4000
description <<< Connection to: testbox >>>
======
box11:
======
interface port1.0.1
description <<< Connection to testbox through AMF_L2_dist >>>
interface port1.0.2-1.0.4
interface port1.0.5
description <<< Connection to: box41 >>>
switchport atmf-link
switchport mode trunk
switchport trunk allowed vlan add 1004,1011
switchport trunk native vlan none
interface port1.0.6
interface port1.0.7-1.0.10
description <<< Connection to downstream ATMF nodes >>>
switchport atmf-link
switchport mode trunk
switchport trunk allowed vlan add 1004,4000
switchport trunk native vlan none
interface port1.0.11-1.0.12
description <<< Part of SA1 connection to: box2 >>>
switchport mode trunk
switchport trunk allowed vlan add 1004,1011
switchport trunk native vlan none
interface port1.0.13-1.0.24
interface sa1
description <<< Connection to: box2 >>>
switchport atmf-link
switchport mode trunk
switchport trunk allowed vlan add 1004,1011
switchport trunk native vlan none
interface vlan1004
description <<< EPSR Data vlan >>>
interface vlan4000
description <<< Connection to: testbox >>>
Analysis:
This command quickly lets you see, across the whole ATMF area: which interfaces (ports or LAGs) are ATMF links, which VLANs are trunked over them, and whether native untagged VLAN support has been removed, which is best practice for ATMF. Nodes are grouped together if they return the same data.
In the ATMF validation scenario, this command quickly highlighted links that had not had the native VLAN removed, which was then resolved within a couple of hours. By grouping the output, it also showed that the issues could be fixed through working-sets, instead of having to log into each device and make the change sequentially. This command is powerful and gives you the ability to understand your network topology in a concise format. It's worth a coffee and two donuts.
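For any ATMF link that still carries a native VLAN, the fix is the same line already shown in the good configs above, applied to the offending interface. The node and port below are just placeholders; apply it to whichever links the command highlighted:
Cmd:
atmf working-set box11
configure terminal
interface port1.0.5
switchport trunk native vlan none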
1.1.12 Have any of my devices in the network rebooted ungracefully? That is, have they rebooted due to a crash of some kind or an unexpected power failure?
Cmd:
atmf working-set group all
show reboot history | grep Unexpected
Output:
==========
STAR-core:
==========
2015-08-24 10:50:05 Unexpected System reboot
2015-07-02 21:21:31 Unexpected Rebooting due to VCS duplicate member-ID
2015-07-02 21:20:57 Unexpected System reboot
=====================
Virtulization-Sw-Val:
=====================
2016-02-02 11:35:02 Unexpected System reboot
========
stsw523:
========
2015-07-07 11:34:28 Unexpected System reboot
2015-07-01 15:41:25 Unexpected System reboot
2015-05-04 21:55:37 Unexpected System reboot
2015-04-01 11:37:57 Unexpected System reboot
2015-02-17 01:07:15 Unexpected System reboot
2015-02-16 21:42:37 Unexpected System reboot
=======
stswr7:
=======
2015-11-05 09:46:34 Unexpected System reboot
2015-11-04 18:48:52 Unexpected System reboot
Analysis:
Network administrators take note, this is a powerful command. I’ve just asked all the network devices to tell me about any ‘unexpected’ reboots. Any administrator prides themselves on uptime: the longer, the better. In the output above, just about all of these unexpected reboots can be put down to someone pulling the power cord on the device. That’s just a consequence of a very busy lab where the switches are generally available. There are, however, a couple of instances that I can’t put down to power interruption and that I should follow up on. You should run this command daily, as weekly or monthly can be too long in a busy network environment. I’m going to put that command in my daily .scp file now. You can also run the command without the grep to see an interesting history of the devices’ reboots and the reasons for them.
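A daily .scp file is just a list of CLI commands, so mine ends up looking something like the sketch below. The file name is my own invention; fill it with whichever of the checks from this guide you care about:
daily-health.scp:
atmf working-set group all
show reboot history | grep Unexpected
show log | grep err
dir *.tgz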
1.1.13 Useful commands to check the health of your SBx8100 units.
Cmd:
atmf working-set group all
show stack | include Init|Provisioned
Output:
===============
SIT-Core5-X930:
===============
2 - eccd.6dd1.64e4 128 Init Backup Member
===============
SIT-Core7-X930:
===============
3 - eccd.6dd1.6342 128 Init Backup Member
Analysis:
It would be much better if the output showed nothing; that way you would know the units were free of the issues/errors you were searching for. In this case there are stack members in the initialising phase. Check back in five minutes to see that they have moved on from the ‘Init’ state. If not, start troubleshooting.
Cmd:
atmf working-set group all
show card | include Unsupported|Incompatible|Disabled|Booting|Initializing
Output:
================
SIT-Core1-x8100:
================
1.1 AT-SBx81GT24 Booting
2.1 AT-SBx81GT24 Unsupported
================
SIT-Core6-x8100:
================
1.7 AT-SBx81GT24 Booting
1.8 AT-SBx81GT24 Initializing
Analysis:
The cards highlighted above are in one of the conditions searched for, which is not good. It could well be that these states are just temporary and the cards will soon not show up at all because they are in a satisfactory condition. Check again in five minutes to make sure the states have changed to a good condition; otherwise it’s into troubleshooting.
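If anything does show up, drop the include filter and look at the whole picture on the affected chassis (and 'show stack' on the stacked units) before deciding it is a real problem:
Cmd:
atmf working-set SIT-Core1-x8100
show card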
1.1.14 Going to mess around with config files and want to first make a snapshot of your running config on all your devices?
This will put a copy of the running config on all devices:
Cmd:
atmf working-set group all
show running-config > run_conf_20150310-1.cfg
Analysis:
As a responsible network administrator, you decide to take a quick copy of the current running config(s) before entering some unknown commands, just in case. This is the way to do it. Sure, you have backups of the configs, but they are usually centrally located; this way the config is held on the device(s). From a working-set, you can later use a 'delete force' to get rid of them:
del force run_conf_20150310-1.cfg
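A quick way to confirm the snapshot landed on every device is the same wildcard 'dir' trick used in the core dump check earlier:
Cmd:
atmf working-set group all
dir run_conf*.cfg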
1.1.15 I have a bunch of PoE devices in my network and want to quickly check their status.
This can be done through the ATMF automatic working-set groups, of which PoE is one:
Cmd:
atmf working-set group poe
show power-inline | grep "Powered\|Fault"
Output:
===========================================================================
SIT-Backup, SIT-Core1-x8100, SIT-Core2-IX5, SIT-Core3-X930, SIT-Edge5-X610:
===========================================================================
===============
SIT-Edge7-X230:
===============
port1.0.1 Enabled Low Powered 2230 n/a 1 4000 [C]
port1.0.2 Enabled Low Powered 2230 n/a 1 4000 [C]
SIT[6]#
Analysis:
This is a useful and easy way to find out what your PoE ports are up to.
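For the full picture on a given switch (power budget, per-port classes and so on), drop the grep and run the command on its own:
Cmd:
atmf working-set SIT-Edge7-X230
show power-inline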
Merci @ Björn