mirror device detached on large file copy

lil_elvis2000 · #1 · 2nd June 2008

Hello, I'm running FreeBSD 7.0 on a Celeron 800 MHz and I copied a 1.8G file across from my Windows box to the FreeBSD box. Either at the end of the copy or during it, I got the following:

kernel : ad6 : FAILURE - device detached
kernel : subdisk6: detached
kernel : ad6 : detached
kernel : GEOM_MIRROR : device dat : provider ad6 disconnected.
kernel : g_vfs_done():mirror/dats1d[READ(offset=196937613312, length=16384)]error=6

I can only think that perhaps it's a timeout?

The configuration is roughly thus:
- ad4 - 320G SATA
- ad6 - 320G SATA

I ran newfs on ad4 but not ad6; I thought gmirror would take care of that. I have copied other files of several hundred megabytes without a problem, but when I tried this 1.8G file this error happened.

This is the first problem I have had with the mirror.

Any tips?
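
For readers following along, the g_vfs_done line in the log above can be decoded a little. On FreeBSD, error 6 is ENXIO ("Device not configured"), which is consistent with the provider having been detached underneath the file system, not with a bad block at that offset. The sketch below is only an illustration: the function name `decode` and the regular expression are assumptions written against the log lines quoted in this thread, not part of any FreeBSD tool.

```python
import errno
import re

# Matches the g_vfs_done lines quoted in this thread, for example:
#   g_vfs_done():mirror/dat[READ(offset=267860606976, length=131072)]error = 6
# The pattern is an assumption based on those samples only.
G_VFS_DONE = re.compile(
    r"g_vfs_done\(\):(?P<provider>[^\[]+)\["
    r"(?P<op>READ|WRITE)\(offset=(?P<offset>\d+), length=(?P<length>\d+)\)\]"
    r"\s*error\s*=\s*(?P<error>\d+)"
)


def decode(line):
    """Return a small summary dict for a g_vfs_done line, or None if no match."""
    match = G_VFS_DONE.search(line)
    if match is None:
        return None
    offset = int(match.group("offset"))
    err = int(match.group("error"))
    return {
        "provider": match.group("provider").strip(),
        "op": match.group("op"),
        # Byte offset into the provider, expressed in GiB for readability.
        "offset_gib": round(offset / 2**30, 1),
        "length_kib": int(match.group("length")) // 1024,
        # Symbolic errno name; 6 maps to ENXIO.
        "errno_name": errno.errorcode.get(err, "unknown"),
    }
```

Fed the READ line from this post, `decode` reports a 16 KiB read at roughly 183.4 GiB into the provider that failed with ENXIO.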

lil_elvis2000 · #2 · 6th June 2008

Well, I've had further problems. I was extracting files from the large gzip and the mirror crashed again. This time it was ad4.
I did a "gmirror forget" and then tried to re-insert ad4, but could not: ad4 did not appear in the /dev folder either.
So I did a reboot, and to my horror gmirror remounted ad4, but not ad6, in a DEGRADED state.
When I did a "gmirror status", the screen filled with a stream of g_vfs_done error messages. I had to hit the power switch.

I managed to get the system back up, unmounted the mirror and destroyed it. That was a bit tricky, as gmirror kept remounting it after a few seconds; I had to quickly do a "gmirror unload".

So, after making a backup, I relabelled:
gmirror label -b split -s 2048 dat ad4 ad6
and newfs'd the mirror:
newfs -U /dev/mirror/dat

I mounted it and tried to copy the data back. The mirror broke again. I rebuilt the mirror and this time left off soft-updates, as I suspected a problem there. I remounted, copied back all the data, and the mirror held together.

Don't know if that is a coincidence or not. Is there an issue with soft-updates and gmirror?

radcapricorn · #3 · 6th June 2008

Are you using the whole drive for the root (/) partition? In this case, there indeed may be some troubles with soft-updates, because using soft-updates for the root partition is not recommended (sysinstall even disables soft-updates for this partition by default when you create it).

lil_elvis2000 · #4 · 6th June 2008

I am using the entire drive, but it is mounted at /home.
My other drive, a small 11GB one, holds / and swap.

gkontos · #5 · 6th June 2008

If the hard drive did not appear in the /dev directory, then it might be a hardware problem with the drive. Run the disk check utilities from the manufacturer of the HD.

George

lil_elvis2000 · #6 · 9th June 2008

When the drive "failed" in the mirror, it is reasonable that BSD removes it from the /dev folder. But on the reboot the drive reappeared, and I have rebuilt the mirror and it is working perfectly.

I suspect a problem with soft-updates and gmirror, and perhaps certain configurations.

I haven't had a chance yet to look through the log file. I wonder how I could submit this as a problem to the BSD developers.

halber_mensch · #7 · 10th June 2008

I too lean towards a power problem here. SATA seems to be very fickle where power is concerned. If I soft reset my machine (AMD64 3000+, 1G RAM, 2x SATA 160G, 1x Memorex CD/DVD burner on a 450W supply), I get READ_DMA timeouts from the disks. It's especially a pain if I've had a power failure, because my gmirror rebuild and file system fscks spit out tons of READ_DMA timeout errors, and most likely the mirror rebuild will fail with a READ_DMA failure. However, if I completely power the machine down and cold boot, the errors go away. I can't explain it. So now I've got my box on a UPS running apcupsd so it never goes down ungracefully.

gkontos · #8 · 10th June 2008

Strange, I never had this problem on my home fileserver (3 years now). It is an HP ML 110G3, a very low entry level model. Of course I use a cheap UPS for temporary power failures, but I have had my experiences with hard reboots.

It could be a motherboard problem with the SATA drives. Especially if it is old, it might not handle SATA any differently than IDE, so soft updates may not report correctly. What happens if you remove soft-updates? Do you have the same issues?

George

Weaseal · #9 · 10th June 2008

Please, please be sure to send-pr this (send problem report). This sounds like a big issue, especially since multiple people are experiencing it. Sounds high-priority to me.

lil_elvis2000 · #10 · 11th June 2008

Quote:

Originally Posted by gkontos

Strange, I never had this problem on my home fileserver (3 years now). It is an HP ML 110G3, a very low entry level model. Of course I use a cheap UPS for temporary power failures, but I have had my experiences with hard reboots.

It could be a motherboard problem with the SATA drives. Especially if it is old, it might not handle SATA any differently than IDE, so soft updates may not report correctly. What happens if you remove soft-updates? Do you have the same issues?

George

Yes, same issues, whether soft-updates are enabled or not. The SATA drives are on a PCI SATA card. I can't imagine that being an issue: before controllers were integrated on the MB, millions of PCs had to have IDE cards to run their hard disks. I have NOT configured RAID 1 in the SATA card's BIOS.

However, it could be a power issue. I have consulted a couple of other people and they both think 235W isn't enough. I wonder if this also affects my USB ports and USB KVM on this machine.

lil_elvis2000 · #11 · 17th June 2008

Well, I installed the new power supply and... nope, it didn't help.

Here's what I got when I extracted a file from one Samba share to another Samba share on the same mirror:

Jun 17 10:22:48 ChamRAID01 kernel: xl0: transmission error: 90
Jun 17 10:22:48 ChamRAID01 kernel: xl0: tx underrun, increasing tx start threshold to 120 bytes
Jun 17 10:41:33 ChamRAID01 kernel: xl0: transmission error: 90
Jun 17 10:41:33 ChamRAID01 kernel: xl0: tx underrun, increasing tx start threshold to 180 bytes
Jun 17 10:54:49 ChamRAID01 kernel: xl0: transmission error: 90
Jun 17 10:54:49 ChamRAID01 kernel: xl0: tx underrun, increasing tx start threshold to 240 bytes
Jun 17 10:55:12 ChamRAID01 kernel: xl0: transmission error: 90
Jun 17 10:55:12 ChamRAID01 kernel: xl0: tx underrun, increasing tx start threshold to 300 bytes
Jun 17 10:56:22 ChamRAID01 kernel: ad4: FAILURE - device detached
Jun 17 10:56:22 ChamRAID01 kernel: subdisk4: detached
Jun 17 10:56:22 ChamRAID01 kernel: ad4: detached
Jun 17 10:56:22 ChamRAID01 kernel: GEOM_MIRROR: Device dat: provider ad4 disconnected.
Jun 17 10:56:22 ChamRAID01 kernel: g_vfs_done():mirror/dat[READ(offset=267860606976, length=131072)]error = 6
Jun 17 10:56:41 ChamRAID01 kernel: ad6: FAILURE - device detached
Jun 17 10:56:41 ChamRAID01 kernel: subdisk6: detached
Jun 17 10:56:41 ChamRAID01 kernel: ad6: detached
Jun 17 10:56:41 ChamRAID01 kernel: GEOM_MIRROR: Device dat: provider ad6 disconnected.
Jun 17 10:56:41 ChamRAID01 kernel: GEOM_MIRROR: Device dat: provider mirror/dat destroyed.
Jun 17 10:56:41 ChamRAID01 kernel: GEOM_MIRROR: Device dat destroyed.
Jun 17 10:56:41 ChamRAID01 kernel: g_vfs_done():mirror/dat[WRITE(offset=268466061312, length=16384)]error = 6
Jun 17 10:56:41 ChamRAID01 kernel: g_vfs_done():mirror/dat[WRITE(offset=268467388416, length=131072)]error = 6
Jun 17 10:56:41 ChamRAID01 kernel: g_vfs_done():mirror/dat[WRITE(offset=268467519488, length=131072)]error = 6
.... A LOT of these ....

The machine then crashed. Not sure where that's logged.
It rebooted itself and then complained about file blocks on the mirror, so I ran fsck on it. The fsck completed and I then attempted a reboot, but the machine crashed, dumped a core and rebooted.

Time to give up?
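
The log excerpt above can be lined up by timestamp to see how close the xl0 underruns are to the detaches. The following is a rough sketch, not a diagnostic tool: `parse` and `summarize` are hypothetical helper names, the year 2008 is assumed from the post date because syslog lines carry no year, and only a few lines from the excerpt are reproduced. It shows timing only and says nothing about cause.

```python
from datetime import datetime

# A few lines copied from the log excerpt in this post.
LOG = """\
Jun 17 10:54:49 ChamRAID01 kernel: xl0: transmission error: 90
Jun 17 10:55:12 ChamRAID01 kernel: xl0: transmission error: 90
Jun 17 10:56:22 ChamRAID01 kernel: ad4: FAILURE - device detached
Jun 17 10:56:41 ChamRAID01 kernel: ad6: FAILURE - device detached
"""


def parse(line, year=2008):
    """Split a syslog line into (timestamp, message). The year is an assumption."""
    # The first 15 characters are the "Mon DD HH:MM:SS" stamp.
    when = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    # The remainder is "<host> kernel: <message>".
    _host, _tag, message = line[16:].split(" ", 2)
    return when, message


def summarize(log):
    """Report the gaps between the last NIC underrun and the disk detaches."""
    events = [parse(line) for line in log.splitlines() if line.strip()]
    underruns = [t for t, msg in events if msg.startswith("xl0: transmission error")]
    detaches = [
        (t, msg.split(":")[0])
        for t, msg in events
        if msg.endswith("FAILURE - device detached")
    ]
    first_detach = detaches[0][0]
    last_underrun = max(t for t in underruns if t <= first_detach)
    return {
        "underrun_to_first_detach_s": (first_detach - last_underrun).total_seconds(),
        "between_detaches_s": (detaches[-1][0] - first_detach).total_seconds(),
        "detach_order": [dev for _, dev in detaches],
    }
```

On these lines the last underrun precedes the ad4 detach by 70 seconds, and ad6 follows ad4 by 19 seconds.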

lil_elvis2000 · #12 · 11th June 2008

Quote:

Originally Posted by halber_mensch

I too lean towards a power problem here. SATA seems to be very fickle where power is concerned. If I soft reset my machine (AMD64 3000+, 1G RAM, 2x SATA 160G, 1x Memorex CD/DVD burner on a 450W supply), I get READ_DMA timeouts from the disks. It's especially a pain if I've had a power failure, because my gmirror rebuild and file system fscks spit out tons of READ_DMA timeout errors, and most likely the mirror rebuild will fail with a READ_DMA failure. However, if I completely power the machine down and cold boot, the errors go away. I can't explain it. So now I've got my box on a UPS running apcupsd so it never goes down ungracefully.

This sounds like a bug to me. I would pass this on.

lil_elvis2000 · #13 · 10th June 2008

Well, I've ordered a power supply, so let's see what happens there first. It's a 380W unit, so it should be able to handle the load. It seems interesting that the rebuild is okay, which hits both disks heavily, but large writes and reads (especially reads) fail and break gmirror.

This will be okay until I decide to replace the MB, when I'll probably need to upgrade the PSU again.

This was supposed to be a cheap project.....

halber_mensch · #14 · 17th June 2008

Are those xl0 transmission underruns possibly related? Try dd'ing a large file to your mirror and see if it causes trouble. I'm suspicious about the proximity of those underruns to the mirror failure.
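
The idea behind the dd suggestion is to generate heavy local disk I/O with the network card out of the picture. A minimal sketch of the same test in script form follows; `write_and_verify` is a hypothetical helper, and the path and size in the comment are placeholders, not values from this thread. Note that the read-back may be served from the buffer cache, so a matching checksum does not prove the data is intact on disk; the point is only to load the mirror and watch the kernel log.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # write and read in 1 MiB blocks


def write_and_verify(path, total_bytes):
    """Write total_bytes of random data to path, fsync, read back, compare hashes."""
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            block = os.urandom(min(CHUNK, remaining))
            f.write(block)
            written.update(block)
            remaining -= len(block)
        # Push the data out of userspace and ask the kernel to flush it.
        f.flush()
        os.fsync(f.fileno())

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            read_back.update(block)
    return written.hexdigest() == read_back.hexdigest()


# Example call (hypothetical path on the mirrored file system, about 2 GiB):
#   write_and_verify("/home/ddtest.bin", 2 * 1024**3)
```

If the mirror detaches during a run like this while no network traffic is involved, the NIC underruns are less likely to be the trigger.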

lil_elvis2000 · #15 · 18th June 2008

Quote:

Originally Posted by halber_mensch

Are those xl0 transmission underruns possibly related? Try dd'ing a large file to your mirror and see if it causes trouble. I'm suspicious about the proximity of those underruns to the mirror failure.

I tried a large copy from the console, from one location to another on the mirror. Same effect, just faster! Data underrun/overrun. PCI card incompatibility?

I've turned off the mirror and have installed a cron job to copy the files over at the end of the day. I don't have a debugging kernel either, so I can't debug it. But I have a core, a couple of them, in case someone can have a look at them.

I'm now considering whether to switch to openSUSE or Xubuntu. But I've no time to do that for a couple of months, so I just have to keep it in "crippled" mode for now.

halber_mensch · #16 · 18th June 2008

Wait.. you got xl0 underruns copying within the mirror? Your NIC should not be in the picture at that point.

lil_elvis2000 · #17 · 18th June 2008

Sorry to confuse. I got the underruns copying from one Samba share to another on the same mirror, from my Windows box. Then I did the same file copy again from the console, to see if it was the NIC or Samba. gmirror broke again.

I can't keep having these problems, so I just got rid of gmirror for now and have installed a cron copy job. Ugly, I know, but at least the system is stable now.

My options now are either to swap the sil 3512 card for something else (an ICH5-based card maybe?) and try again, or to change to Linux. At any rate I'm very busy with an Oracle project for the next month and a half, so I've got no cycles to spare.

The most I will do is check and clean all the drive and card connectors.

Weaseal · #18 · 18th June 2008

Did you send-pr yet? There's a chance that with an issue this urgent they'd hurry a patch into -CURRENT that you could roll in.

lil_elvis2000 · #19 · 19th June 2008

Yes, I have sent a PR, but I don't have a core that I can debug to supply a backtrace (someone asked). I have two cores which I believe were caused by GEOM, but I don't know what to do with them.

IMHO there is an issue with g_vfs_done with my specific configuration. Or maybe it's the configuration of the drives: I got that "cannot use BIOS cyl/head/track calculating my own" message from BSD. I just went with what BSD suggested.

It's too bad, because BSD performs very, very well.

lil_elvis2000 · #20 · 25th June 2008

As an update, I have managed to build a debug kernel and look at the cores. Nothing spectacular; it looks pretty random to me. Today I got a core dump as well... and I'm not running gmirror! So the problem is more fundamental. I have done some tests and it seems fairly random, as if there is a random memory or metadata problem.

I was running the disks in dedicated mode, where my mount is something like /dev/ad4 instead of /dev/ad4s1d. The bsdlabel looked pretty funny to me:
two partitions, one large one at a small offset and then a large one at 0, which looked suspicious. Probably a leftover from gmirror label?

So I've now gone into sysinstall, used the fdisk and label options, and redone the disks. Will have to see how that runs. I've still got to finish it off tomorrow...
