DaemonForums  

  #1   (View Single Post)  
Old 11th March 2013
kbeaucha kbeaucha is offline
Port Guard
 
Join Date: May 2008
Posts: 36
Ethernet port becomes unresponsive - troubleshooting suggestions

Hello:

I have a remote site where I'm having a problem with the OpenBSD network gateway I'm using there. This site is one of five that are all configured basically the same, and this site has been in service for many years. What we thought was a minor change has apparently caused a new problem.

The remote site's gateway forwards packets between its upstream port and its local network port. Most traffic comes in on enc0, because the gateway is one end of a point-to-point VPN tunnel set up using IPsec, but the upstream port is also pingable and permits ssh logins.

For the longest time a Soekris 4801 ran the tunnel flawlessly.

A recent change put a new embedded controller behind this gateway. From the local network, you can log into the controller by telnetting to port 1400, and the same port is used to push data back through the tunnel to a Macintosh on our main campus.

No changes were made to our remote ruleset to accommodate this move.

After we added this controller and Mac connection, we began to experience times when the upstream port at the remote site would become unresponsive. Data wasn't traversing the tunnel for anything behind the Soekris; I believe the tunnel was being dropped. The upstream port would not allow ssh logins and would not respond to pings.

Power-cycling the Soekris would bring everything back.

To eliminate the possibility that the Soekris was the cause, we replaced it with a (faster) PC Engines Alix unit. The problems seemed to go away for over a year, until last week, when the tunnel dropped again.

Due to some other problems I wasn't able to log into the Alix's serial port, but the upstream (and local network) ports still had link, and the admin for the switch that the upstream port was plugged into said he could see link and get the MAC address of the gateway. I am open to suggestions on what to look for if this should occur again to help resolve the problem.


tia
kmb

Last edited by kbeaucha; 11th March 2013 at 08:09 PM. Reason: Add some details on state of upstream port from other admin
  #2   (View Single Post)  
Old 11th March 2013
ocicat ocicat is offline
Administrator
 
Join Date: Apr 2008
Posts: 3,318

..and the version of OpenBSD used is what?
  #3   (View Single Post)  
Old 12th March 2013
jggimi jggimi is offline
More noise than signal
 
Join Date: May 2008
Location: USA
Posts: 7,977

Quote:
Originally Posted by kbeaucha View Post
...After we added this controller and Mac connection, we began to experience times when the upstream port at the remote site would become unresponsive. Data wasn't traversing the tunnel for anything behind the Soekris; I believe the tunnel was being dropped. The upstream port would not allow ssh logins and would not respond to pings.
Your problem reflects something more than an IPsec tunnel being dropped: your non-VPN communications, ping and ssh, were also non-functional.
Quote:
Power-cycling the Soekris would bring everything back.
Was there ever an admin monitoring the console at this time? For example, the OS may have been functional but the NIC was not, or the OS may have panicked and dropped into ddb(4), or the OS may have been hung. Without a console (and an admin, local or remote) you would not be able to determine which of these three possibilities was occurring.
Quote:
To eliminate the possibility that the Soekris was the cause, we replaced it with a (faster) PC Engines Alix unit. The problems seemed to go away for over a year, until last week, when the tunnel dropped again.

Due to some other problems I wasn't able to log into the Alix's serial port, but the upstream (and local network) ports still had link, and the admin for the switch that the upstream port was plugged into said he could see link and get the MAC address of the gateway.
Do you also lose ping response on the Alix? It's not completely clear if that's the case.

I am not sure what you mean by "link" -- if you are describing status lights on Ethernet hardware (switches / hubs), these have various meanings depending on the NIC manufacturer, but they relate to physical connectivity and not to data transfer. In the event of a software failure (OS hang / OS panic / NIC bug), electrical connections would not necessarily be severed.

I know enough about Ethernet to use it and administer it. I'm not a NIC hardware expert, nor a NIC driver writer. With that disclaimer out of the way, I think it is perfectly reasonable for Ethernet NICs to manage traffic independently of the OS, and to ignore (or pass on, depending on the type of Ethernet) non-broadcast Ethernet frames destined for MAC addresses other than their own. In like manner, I assume a NIC could respond appropriately to Ethernet frames that query for its MAC address. This is different from responding to ARP requests for IP address / MAC address resolution.
Quote:
I am open to suggestions on what to look for if this should occur again to help resolve the problem.
  1. Plug in a console, for use by you or by the admin for the switch. Use it when this occurs to determine whether the OS is still operating, has hung, or has panicked.
  2. Monitor ongoing operation, while things are going well. Pay special attention to free mbufs -- if you run out of message buffers, your network stack will stop moving data. You can see current mbuf usage with the -m option to netstat. Script something that notifies you if you start running out of mbuf capacity.
  3. Review system logs from the time of the problem. In the event of a hang or crash, these probably will not aid you, and the same is true if you were logging to a remote syslog server. But if you were logging to /var/log locally, inspect the /var/log/messages* files for any messages from the time the errors occurred. If mbuf shortages were the cause, look for "mclpool limit reached" messages.
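For the log review in item 3, here is a rough sketch in Python of how one might sweep the local message logs for limit warnings. It is only an illustration: the function names are made up, the default search string "limit reached" is deliberately loose because the exact kernel wording should be checked on your own system, and it assumes rotated logs may be gzip-compressed.

```python
import glob
import gzip


def open_log(path):
    """Open a log file as text, handling gzip-compressed rotated copies."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def find_limit_warnings(lines, needle="limit reached"):
    """Return the lines that contain needle (case-insensitive match).

    The default needle is an assumption; tighten it once you know the
    exact message your kernel logs on mbuf exhaustion.
    """
    needle = needle.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


def scan_logs(pattern="/var/log/messages*"):
    """Return (path, line) pairs for every matching line in matching files."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        with open_log(path) as handle:
            hits.extend((path, line) for line in find_limit_warnings(handle))
    return hits
```

Calling `scan_logs()` as a user who can read /var/log and printing each pair should be enough to see whether any such warning was logged around the time of an outage.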
Quote:
Originally Posted by ocicat View Post
..and the version of OpenBSD used is what?
That, too, is a good question. When you last mentioned it, in May of 2012, your systems were running 5.0.

Last edited by jggimi; 12th March 2013 at 12:32 AM. Reason: typos
  #4   (View Single Post)  
Old 15th March 2013
kbeaucha kbeaucha is offline
Port Guard
 
Join Date: May 2008
Posts: 36

I tried to log into the Alix, but was unable to due to problems unrelated to the Alix itself (another story). I opted for the power cycle because this was an after-hours call and I wanted to restore service as quickly as possible.

I do lose ping response on the Alix. The admin for the upstream switch logged into his Cisco and checked the status on the port our Alix plugs into. It was the Cisco that reported a "link"ed device on the appropriate port and its (the Alix's) MAC address (although I'm uncertain if the Cisco just had that information cached).

As you suspect, the version running is 5.0.

I'm in the process of resolving the console availability issue.

When monitoring mbufs (netstat -m), is there something specific I should look for, or just usage over time?

Thanks for the suggestions.

kmb
  #5   (View Single Post)  
Old 15th March 2013
jggimi jggimi is offline
More noise than signal
 
Join Date: May 2008
Location: USA
Posts: 7,977

Quote:
When monitoring mbufs (netstat -m), is there something specific I should look for, or just usage over time?
Take a look at the netstat output. You will see significant details of mbuf use - including types of usage and a list of mbufs consumed by size, and counts of requests cancelled or deferred.

But you will also see an output line showing a percentage in use. I recommend a cron job that parses that line for the percentage, and notifies you if it exceeds some threshold you set -- for example, choose a threshold of 50% or 75%.
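A minimal sketch of such a check, written in Python purely for illustration (a few lines of sh and awk would do the same job). It assumes the summary line looks like "576 Kbytes allocated to network (14% in use)"; confirm that against your own netstat -m output. The function names and the 75% default are made up for the example.

```python
import re
import subprocess

# Assumed shape of the summary line in `netstat -m` output, for example
# "576 Kbytes allocated to network (14% in use)". Adjust if yours differs.
IN_USE = re.compile(r"\((\d+)% in use\)")


def mbuf_percent_in_use(netstat_output):
    """Return the 'in use' percentage from netstat -m text, or None if absent."""
    match = IN_USE.search(netstat_output)
    return int(match.group(1)) if match else None


def mbuf_alert(netstat_output, threshold=75):
    """Return a warning string when usage exceeds threshold, otherwise None."""
    percent = mbuf_percent_in_use(netstat_output)
    if percent is None:
        return "could not find an 'in use' percentage in netstat -m output"
    if percent > threshold:
        return f"mbuf usage at {percent}% exceeds threshold of {threshold}%"
    return None


def read_netstat():
    """Run `netstat -m` and return its output as text."""
    result = subprocess.run(
        ["netstat", "-m"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A cron job would call `mbuf_alert(read_netstat())` and print the result only when it is not None. Since cron mails any output to the job's owner, staying silent when usage is below the threshold means you hear about it only when something is wrong.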

While writing that cron job, you might also want to look at consumption of PF states on that remote router. The default limit is 10,000 states, and while a small office firewall should not exceed that, perhaps there is a problem causing excess state table consumption. See the pfctl(8) man page, and the -s info option.
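The state table check can be sketched the same way. This assumes the "State Table" section of the pfctl -s info output contains a "current entries" line followed by the count, and it takes the 10,000 default mentioned above as the limit. The function names and the 75% threshold are hypothetical; if you have raised the state limit in pf.conf, pass your own value.

```python
import re

# Assumed line in the "State Table" section of `pfctl -s info` output:
#   "  current entries                     8200"
CURRENT_ENTRIES = re.compile(r"current entries\s+(\d+)")


def pf_state_count(pfctl_info_output):
    """Return the current number of PF states, or None if not found."""
    match = CURRENT_ENTRIES.search(pfctl_info_output)
    return int(match.group(1)) if match else None


def pf_state_alert(pfctl_info_output, limit=10000, threshold=75):
    """Return a warning when states exceed threshold percent of limit."""
    count = pf_state_count(pfctl_info_output)
    if count is None:
        return "could not find 'current entries' in pfctl -s info output"
    percent = 100 * count // limit  # integer percentage, rounded down
    if percent > threshold:
        return f"PF states at {count} of {limit} ({percent}%)"
    return None
```

Feed it the text captured from pfctl -s info (which needs root) and print whatever comes back, in the same cron job as the mbuf check.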
All times are GMT.