Probably not the best title for my thread but here it goes...
I've gone and built a ZFS web server to stream videos using lighttpd, and I'm hitting a performance bottleneck I can't quite pin down. I can't get more than about 300 Mbit/s out of the box, and I think it should be able to handle quite a bit more. There are 14 SATA drives plus 2 flash drives (for cache) set up as a single pool, mirrored in pairs.
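For reference, a pool layout like the one described would look something like this. This is my own sketch, not the actual command used on the box; the device names (da0..da15) are placeholders:

```shell
# Hypothetical layout matching the description: 14 SATA drives mirrored
# in pairs (7 mirror vdevs) plus 2 flash devices as L2ARC cache.
# Device names are placeholders, not from the original post.
zpool create tank \
    mirror da0 da1   mirror da2 da3    mirror da4 da5    mirror da6 da7 \
    mirror da8 da9   mirror da10 da11  mirror da12 da13 \
    cache da14 da15
```

With this layout, random reads can be served by either side of each mirror, so the pool gets roughly 14 spindles' worth of read IOPS but only 7 vdevs' worth of write IOPS.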
Here's what the box is up to right now:
Code:
vmstat -l output:
tx: 32330.48 KiB/s 12000 p/s
Code:
zpool iostat -v output:
              operations      bandwidth
             read   write    read   write
            ------ ------   ------ -------
             2.16K     14     269M   63.6K
cache
              1.2K      2     127M    512K
Code:
netstat -na | grep ESTABLISHED | wc -l = 22236
kern.openfiles = 48790
Now, obviously I'm servicing an assload of connections. Lighttpd is configured to use 500 processes with 100 connections each, which I know is a ridiculous amount, but the box stays responsive when processing new connections. I would imagine the drive system is random reading like a muthafucker (excuse my language), but I don't know how to get stats on that. top output shows most lighttpd processes in the "zfs" or "zio" state. Since I'm streaming video, each connection is long-running and almost always in a read state. I'm thinking my bottleneck here is random read IOPS, not throughput. Caching on the flash drives doesn't seem to be used as heavily as I thought it would be. Load averages on this box are running 35 35 35 or higher.
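One way to check whether the flash cache is actually absorbing reads is to turn the ARC counters under `kstat.zfs.misc.arcstats` into a hit ratio. A minimal sketch; the counter values below are made-up placeholders, since on the live box you'd substitute the output of `sysctl -n kstat.zfs.misc.arcstats.hits` and `.misses` (and `l2_hits`/`l2_misses` for the L2ARC specifically):

```shell
# ARC hit ratio from arcstats counters. The numbers here are hypothetical;
# on FreeBSD pull real values with e.g.
#   hits=$(sysctl -n kstat.zfs.misc.arcstats.hits)
#   misses=$(sysctl -n kstat.zfs.misc.arcstats.misses)
hits=900000
misses=300000
ratio=$(( 100 * hits / (hits + misses) ))
echo "ARC hit ratio: ${ratio}%"   # prints "ARC hit ratio: 75%"
```

A low `l2_hits` relative to `l2_misses` would confirm the flash cache isn't helping much, which can happen when the working set (all the videos being streamed at once) is far larger than the L2ARC.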
This box is obviously heavily loaded, but I can't quite pinpoint what is being hit the hardest. Methinks I've maxed out the I/O system: the number of connections served must be taxing the crap out of it with random read requests. Am I correct, and what is the best way to determine this?
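A back-of-the-envelope check (my own sketch, not from the thread): divide the pool's read ops/s from the zpool iostat output across the spindles. With mirrored pairs, reads can be served by either side of a mirror, so all 14 drives share the load, and a 7200 rpm SATA drive typically sustains only around 100-150 random read IOPS:

```shell
# Rough per-drive random read load, using the 2.16K ops/s figure from
# `zpool iostat` above. Assumes mirror reads spread evenly over 14 drives.
pool_reads=2160    # ~2.16K read ops/s
drives=14
per_drive=$(( pool_reads / drives ))
echo "~${per_drive} random read IOPS per drive"   # prints "~154 random read IOPS per drive"
```

At ~154 IOPS per drive that's right at, or past, what consumer SATA spindles manage for random reads, which would fit the "zio"-state processes and the flat ~300 Mbit/s ceiling.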
Thanks! (hey phoenix, this is a Vancouver project - you available for a consult?)