This particular incident was actually the CPE’s (Customer Premises Equipment) fault, but first let me give you the history of a similar problem we experienced before. We have several circuits from different companies. Unfortunately, I am not going to name any of them because I am not sure whether I have the right to disclose that information.
To give you some background on our world: we have mixed WAN connectivity, including VSAT (Very Small Aperture Terminal), MPLS (Multiprotocol Label Switching), Frame Relay, and fixed wireless. MPLS and fixed wireless are pretty new to us, about 2 years old. Our VSAT network has been in place since 1981, but over the years it has been upgraded to different RF heads, a.k.a. ODUs (outdoor units), and different IDUs (indoor units). Recently, we upgraded our VSAT network to cut the latency in half, from ~1200ms to ~600ms. Yup, you guessed right: it is still not a viable solution for VoIP or IPT (IP Telephony), because the maximum acceptable latency is 120ms.
Company A was the very first provider we used for MPLS, and we had several issues with it during the first few months of the conversion from Frame Relay to MPLS. One of those issues was the default route not being propagated to our router. All the BGP routes were there, but we were pinging our router at ~600ms, which normally means that we are communicating over our VSAT. However, a traceroute from my desktop showed that the traffic actually went through the telco’s cloud, and only the last hop (our serial interface address) was at ~600ms. The guys who saw this sent the ticket to Company A for investigation. They were unable to find the problem and blamed the CPE. Our configurations were correct, so we knew there was something wrong with the telco’s cloud or their configuration. Upon investigating, I found that the default route in the routing table pointed to our VSAT, which is a floating static route. I immediately informed our SP (service provider) that we were not receiving the default route via BGP, but several techs and engineers didn’t know how to fix it and were still blaming the CPE. Finally, another engineer looked at the problem and found a missing statement on their PE (Provider Edge) router: neighbor x.x.x.x as-override. Below is a sample of what the routing table looked like:
Router>sh ip route
S* 0.0.0.0/0 [250/0] via 172.24.0.1
After the engineer entered the command on their PE router, the routing table changed and the router started pinging at ~20ms, which means we were really communicating via the telco’s cloud. Sample output below:
Router>sh ip route
B* 0.0.0.0/0 [20/0] via 172.25.177.62, 1m
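As a rough illustration of why the default route flipped between the two outputs above, here is a minimal Python sketch of the selection rule for routes to the same prefix: the candidate with the lowest administrative distance (the first number in the brackets) gets installed. The Route class and the best_route function are hypothetical names made up for this sketch, not anything from IOS, and a real router also considers prefix length and metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    source: str    # how the route was learned, e.g. "static" or "bgp"
    next_hop: str  # next-hop address as shown in the routing table
    distance: int  # administrative distance (first number in [AD/metric])


def best_route(candidates):
    """Pick the candidate with the lowest administrative distance.

    Simplification: all candidates are assumed to be for the same prefix
    (here 0.0.0.0/0), so only the administrative distance is compared.
    """
    return min(candidates, key=lambda route: route.distance)


# Values taken from the two routing table samples above.
vsat_floating = Route("static", "172.24.0.1", 250)
mpls_bgp = Route("bgp", "172.25.177.62", 20)

# Before the fix: no BGP default was received, so the floating static wins.
print(best_route([vsat_floating]).next_hop)

# After the fix: the BGP default (AD 20) beats the floating static (AD 250).
print(best_route([vsat_floating, mpls_bgp]).next_hop)
```

The floating static only takes over when nothing with a lower distance is available, which is exactly what you want from a backup path.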
You may be asking why we have a bunch of BGP routes if this router is a stub router. Yes, this is a stub router and it shouldn’t need those BGP routes; besides, it has the default route in the table, so having a bunch of BGP routes is pretty much useless. I honestly do not know why they are there. One explanation I heard was that some applications need all the routes in the routing table in order to work, which I highly doubt. I think our Network Architecture team doesn’t even want to bother with that anymore. They should, though, because RAM is better spent on something other than the routing table, especially on a stub router.
Now that you know the history, let’s talk about what happened the other day. My colleague was troubleshooting a problem, and our 1st shift lead went over to my colleague’s desk, which is right next to mine, so I overheard some of their conversation. I got curious and started looking over my colleague’s shoulder and listening to what they were saying. Once I had the gist of the problem, I had to dig around. Before I do anything else, my routine is to ping the router, and I noticed that it was pinging at ~600ms (as mentioned earlier, that usually means it is communicating via VSAT). I ran a tracert from the command prompt and found that it was going through the telco’s cloud. With that information, I figured it was the same problem that we had with Company A, except this was Company B’s implementation of MPLS. Upon checking the router’s routing table, I saw the output below:
Router>sh ip route
If you are not paying attention to the result of the command, you will surely miss that the AD (administrative distance) value is 1, which means it is a plain static route, not a floating static route like the one we use for our VSAT backup. With an AD of 1, the static route beats the BGP default (AD 20), so the router keeps using VSAT as its default path even though MPLS is up. If you have been exposed to a similar or identical problem (which we had, as described earlier), you will be tempted to blame the telco for a screw-up in their configuration. Unfortunately, we didn’t pay attention to the output and went ahead and called the telco (we always blame the SP; like I mentioned earlier, 99.9% of the time it is their fault). After working with the technician to make sure that everything was configured exactly the same way as on a working router, I issued the command again and really paid attention to the output. That’s when I saw that the problem was the ip route statement on our router. Just to be sure, I issued show run | i ip route and, sure enough, the administrative distance at the end of the command was missing.
What the router was set to:
ip route 0.0.0.0 0.0.0.0 192.168.0.100
It should have been set to something like:
ip route 0.0.0.0 0.0.0.0 192.168.0.100 250 name VSAT
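A check like this is easy to script against the output of show run | i ip route. Below is a minimal Python sketch, under a few assumptions of mine: the next hop is a plain IP address (not an interface name), the primary default comes from eBGP with an AD of 20 as in the earlier sample, and the function names are hypothetical. It flags any static default route whose distance would beat BGP.

```python
EBGP_DISTANCE = 20    # AD of the BGP default seen earlier ([20/0])
STATIC_DEFAULT_AD = 1  # AD a static route gets when none is configured


def static_default_distance(line):
    """Return the AD of a static default route line, or None for other lines.

    Simplified parsing: expects "ip route 0.0.0.0 0.0.0.0 <next-hop-ip>"
    optionally followed by a bare integer distance and other keywords.
    """
    tokens = line.split()
    if tokens[:4] != ["ip", "route", "0.0.0.0", "0.0.0.0"] or len(tokens) < 5:
        return None
    # tokens[4] is the next hop; an explicit distance, if present, follows it.
    if len(tokens) > 5 and tokens[5].isdigit():
        return int(tokens[5])
    return STATIC_DEFAULT_AD


def risky_default_routes(config_text):
    """List static default routes that would override the BGP default."""
    risky = []
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        distance = static_default_distance(line)
        if distance is not None and distance <= EBGP_DISTANCE:
            risky.append(line)
    return risky


broken = "ip route 0.0.0.0 0.0.0.0 192.168.0.100"
fixed = "ip route 0.0.0.0 0.0.0.0 192.168.0.100 250 name VSAT"

print(risky_default_routes(broken))  # the broken line is reported
print(risky_default_routes(fixed))   # nothing is reported
```

Running something like this on your own configs before calling the SP would have caught the missing distance in seconds.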
What’s the lesson learned from this incident? Don’t always go with the flow of the blame game. Pay attention to all the details before blaming someone else. But, like I said, 99.9% of the time in our world the problem points to the SP, so we always blame them! c”,) Seriously though, make sure everything is good on your end before blaming them.
I will post some more stuff about the blame game. Stay tuned!
Written By: Andr01d
"I know nothing except the fact of my ignorance" - Socrates