YASU + nVRAID == array corruption ?



Nodens
29.03.2007, 05:43
Not sure if this is actually YASU related, but I need to check. :)

The system is running XP 64-bit, SP2, with the latest version of DT and YASU build 7033 (I just downloaded 7035). The chipset is the 680i with the latest chipset drivers.

I didn't need to use YASU at all until recently; I have been using it for about 6 days now.

I have a software RAID array on the nvRAID south bridge controller on this board. The type is 0+1, with four 74GB WD Raptors.
Three days ago nvraidservice reported that the disk on channel 1.1 had failed, just after I had quit Heroes of Might and Magic 5, which I was running with YASU. I had quit the game because I was noticing visual artifacts after a bit of alt-tabbing. Thinking the drive was toast, I shut down to check it.

After a bit of troubleshooting with cables etc. (I will spare you the useless details here) I figured out the drive was fine. I just rebuilt the array and it worked fine. And I was left thinking... odd.

30 minutes ago I quit the same game again, and this time nvraidservice reported a disk failing again. My first thought was that the drive was dying slowly. But when I actually checked this time, it was a different drive: channel 2.0.

I tried rebuilding from the crappy nVIDIA Windows tool. It said it couldn't find the drive, although the drive was visible in the tool. Reboot. In the BIOS RAID tool I now saw 2 disks corrupted: channels 2.0 and 2.1. The funny thing is that the BIOS tool saw those 2 drives as separate corrupted arrays (the same as it did with the previous occurrence).

Right now the system is live and I'm rebuilding the array and synchronizing those same disks. Everything is working fine, other than me boggling at what causes this.

If I am not wrong, YASU hides the SCSI drives by blocking access to specific registry keys, no? (I've noticed SPTD does that to its key.)

Is it possible that it somehow blocks access to nvraid-related keys? The thing is, this started happening after I began using YASU, and with my not-so-modest knowledge I think it's the only possible culprit.
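For what it's worth, here's the kind of quick probe I'd use to check whether those keys are even reachable. This is only a rough sketch under my own assumptions: the key paths are the ones I'd expect SPTD/YASU to guard, not confirmed ones.

import winreg  # Python 3, Windows only

# Assumed key paths; adjust to whatever keys your tools actually guard.
KEYS = [
    r"HARDWARE\DEVICEMAP\Scsi",                 # SCSI device map
    r"SYSTEM\CurrentControlSet\Services\sptd",  # SPTD's own key
]

for path in KEYS:
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path):
            print(path, "-> readable")
    except PermissionError:
        print(path, "-> ACCESS DENIED (possibly cloaked)")
    except FileNotFoundError:
        print(path, "-> not present")

If any nvraid-related keys came back ACCESS DENIED only while YASU is running, that would support the theory.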

I did check the filesystem for corruption and even ran stress tests on the array with various monitoring tools. There was no corruption whatsoever. If it's not YASU related, the only other possible culprit is Server 2003 SP2.

I will not use YASU until I hear from you, both as a test to see if this happens again with YASU not running, and as a precaution as well. I'd be in deep trouble if the corrupted channels were not the 2 mirror ones...

Thanks in advance,
Nodens

Nodens
02.04.2007, 08:35
Just wanted to report that no RAID array issues have been noted up to now since I haven't been running YASU...

Elandril
02.04.2007, 12:41
I could imagine that the nvraid background service regularly queries the SCSI devices for changes, and that a failure to do so might prompt it to issue an unfounded command to the RAID driver indicating a problem.

Nodens
02.04.2007, 14:41
Aye, this is my assumption as well. It is quite possible that the array was never corrupted, but the driver just thinks it is because the service provider cannot access the channels, and so it chooses to remove the degraded/problematic channel from the array.

Which brings us to the question: does YASU block access to all kinds of SCSI devices, or is this device being affected unintentionally? And if the first scenario stands, is it possible to exclude devices like RAID miniports, or does the implementation not allow that?

I hope sYk0 can shed some light on this. :)

Chiefnuts
02.04.2007, 15:23
I have an nForce4 2-drive array (RAID 0) for my disc images and files. My games and OS are on the same non-RAIDed drive. I haven't had any problems with my array while using YASU, any version. I'm on 32-bit XP Pro. I wonder if it's related to 64-bit XP?

Nodens
03.04.2007, 01:57
I have an nForce4 2-drive array (RAID 0) for my disc images and files. My games and OS are on the same non-RAIDed drive. I haven't had any problems with my array while using YASU, any version. I'm on 32-bit XP Pro. I wonder if it's related to 64-bit XP?

It is quite possible that this issue has not manifested for you because your RAID array is not being paged by the nvraidservice provider while YASU is running. Your operating system files, pagefile, game files, and running applications run off your non-RAID drive. In theory your array gets paged only while a copy protection scheme checks for the original disc during a game's initialization, since your images reside there. That period may not be enough for the issue to manifest itself (there may be several timers involved).

Another option to consider is that your chipset is nForce4 based. This means that the whole service provider and driver approach could be entirely different from the 680i chipset (though I'm fairly sure the software codec for the RAID controller is compatible between those chipsets; I've been able to mount arrays created on nF4 on the 680i).

The third option, as you suggest, is the AMD64/EM64T extension difference. On the 64-bit platform a driver inherently works quite differently; certain coding practices are banned on this platform (e.g. kernel-level system call hooks), so it is possible that this issue is 64-bit specific.

Lastly, it could be related to the installation of SP2 and not YASU, and it may just be a coincidence that, so far, this has happened twice while running YASU. For example, Server 2003 SP2 broke nHancer for me, and others, and the developer cannot reproduce it. I'm currently waiting for the developer to send me a debug build on that one so I can provide him with debugger info.
It's just the very nature of what YASU does, and the fact that it has now happened twice while YASU was running, that makes it the number 1 suspect.

Chiefnuts
03.04.2007, 14:59
I agree with you completely. I thought it could very well be the x64 piece, due to different driver access levels and calls, and because YASU is usually loaded all the time and cloaking my drives. But all OS/pagefile/app loading occurs on non-array drives. Hopefully you can get to the bottom of this, because I was going to migrate the array to a 650i/Core 2 setup I just bought, and was planning on switching to x64. So it's good to know...

arfett
15.04.2007, 01:15
I use x64 and an nForce 570 chipset, running RAID 0 with two 16MB-cache 74GB Raptors. I do not use the software tool in Windows and I haven't had any issues. I also disable the NVIDIA RAID service via Start > Run > services.msc. Try setting up and using your RAID in the BIOS only and you should be fine. Possibly try disabling the NVIDIA RAID service in Windows as I do, as it is not necessary.

Nodens
15.04.2007, 14:21
I use x64 and an nForce 570 chipset, running RAID 0 with two 16MB-cache 74GB Raptors. I do not use the software tool in Windows and I haven't had any issues. I also disable the NVIDIA RAID service via Start > Run > services.msc. Try setting up and using your RAID in the BIOS only and you should be fine. Possibly try disabling the NVIDIA RAID service in Windows as I do, as it is not necessary.

I never said anything about any software tool. I always set up the arrays via the BIOS tool. What I meant is that the whole nVRAID solution is a software codec, BIOS assisted; it's not a hardware RAID solution.

Also, if you disable nvraidservice you won't be able to see degraded arrays (in your case it's not needed, since if your stripe set gets degraded your array is toast anyway). Moreover, nvraidservice is THE ONLY way of rebuilding an array when your mirrored array or RAID 5 gets degraded. It doesn't matter whether you select rebuild from the control panel or the BIOS tool; nvraidservice polls the array, reads the flag, and performs the operation. So the service is necessary unless you are on a simple RAID 0 like yours. Every other RAID mode supported by the codec (1, 0+1, 5) needs that service.
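If you want to double-check that the service is actually up before relying on it for a rebuild, a quick sketch along these lines should do. Note that "NVRaidService" is my assumption for the service name; verify the real one in services.msc first.

import subprocess

# Query the Service Control Manager for the (assumed) NVIDIA RAID service.
SERVICE = "NVRaidService"  # hypothetical name; check services.msc
out = subprocess.run(["sc", "query", SERVICE],
                     capture_output=True, text=True).stdout
print(SERVICE, "is running" if "RUNNING" in out else "is not running (or not found)")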

But let's get back on topic. I am beginning to believe YASU is entirely unrelated to the degradation issue. Yesterday it happened again without YASU running. My latest theory is that 3 of the 4 Raptors were running on the same PSU line, and that's a common no-no, as they can't draw enough amperage. The rule of thumb is 2 hard disks per line... I will stress test with YASU on, now that I've backed up the array.

Aerowinder
16.04.2007, 23:53
I posted the following message three days ago on a more private forum:


I'm having another weird issue; I seem to attract them. It's somewhat complicated, and seems to happen mostly under certain conditions. Though I have money to replace whatever needs replacing, if it comes to that, I cannot afford trial and error with Newegg.

It first started a long time ago when I was playing Rainbow Six: Vegas. One day I updated it; the next day the OS was complaining about some registry corruption. I could do nothing but reinstall. All was well until recently, when I got C&C 3. Since then I've had registry corruption resulting in reinstalls twice, plus some HD corruption issues. Both of these games are SecuROM games, by the way.

I am also running Alcohol 120% with updated SPTD drivers, along with YASU. At first I thought these were to blame, but the system was crashing without YASU, and I still have crashing issues after a clean install while not using it, so that's not the issue. I've never had a problem with Alcohol.

I can play WoW for hours at a time with no issue, and play C&C for hours with no issue. Then I boot it up again and it crashes within 5 minutes, the system reboots, and the RAID is not detected, or there is registry corruption. I should mention the only time I have RAID detection issues is when the system reboots itself after a crash, and that a power clear (unplug the PSU, hit power, plug back in) always fixes it.

MemTest86/Prime95 ran stable for 6 hours straight; this should rule out memory/CPU/PSU. One would think that since the RAID is not detected after an automated reboot (which only a "power clear" will fix), this is an HD/controller issue. I find it odd, however, that until today this has only been happening while I was playing SecuROM games. WoW crashed a few times today, and it never does that. I tried to delete a file in the WoW folder (some character settings); the system complained that the files were corrupted and could not be deleted. I ran chkdsk and was then able to delete them.

So I'm leaning more towards HD/controller. I'm running 2x SATA 74GB Raptors in RAID 0 on a DFI LP nF4 board. I've run extended tests with the WD diagnostic tools and SpinRite level 4; both turn out fine. I know this is not the be-all and end-all of HD diagnostics, but it's something.

My issues first started while running R6:V in combination with YASU; it appears you are not alone here. My full system specs are as follows:

Case: Coolermaster Praetorian PAC-T01-EK Black
PSU: Antec TruePowerII 550W
Mobo: DFi nF4 Ultra-D w/BIOS 3-10-05
Proc: AMD64 4000+ (San Diego)
---------ThermalTake AquariusIII External Liquid Cooling System (AS5)
Video: Sapphire Radeon X1800XT 512MB GDDR3 PCI-E x16
Memory: Corsair XMS 2GB (2x1GB) 2-3-3-6 @ 2.8V TWINX2048-3200C2
Storage: 2x Western Digital Raptor 74GB 10,000RPM SATA150
Lite-On 165x DVD Burner
NEC 1.44MB 3.5" Floppy Drive
OS: Windows XP SP2


I took the liberty of relocating my drives from SATA channels 1/2 to 3/4. As much as this looks like a software issue... I cannot get over the fact that the RAID will not detect upon an automated reboot (no blue screen).

Nodens
17.04.2007, 13:31
Hi mate, you seem to be making an invalid assumption: running Memtest86 and Prime95 stress tests does not rule out your PSU. It does rule out your memory and CPU, and certainly the memory controller (north bridge, or the CPU again, as AMD chips carry it on-die).

The thing is that hard disks draw power from the +5V rail. If they cannot draw enough amperage from your PSU, it may manifest in a number of ways: file system corruption, clicking sounds (the hard disk heads parking/unparking), and a few other things. You should really check your event log for errors that mention timed-out I/O operations or retries.
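If you'd rather script that check than dig through Event Viewer by hand, here's a rough sketch using pywin32. The "disk" source is the standard one for timeouts and retries; "nvraid" and "nvgts" are my guesses for the NVIDIA driver sources, so adjust as needed.

import win32evtlog  # pywin32

# Scan the System log, newest first, for disk-related error sources.
log = win32evtlog.OpenEventLog(None, "System")
flags = win32evtlog.EVENTLOG_BACKWARDS_READ | win32evtlog.EVENTLOG_SEQUENTIAL_READ
SOURCES = ("disk", "nvraid", "nvgts")  # the last two are assumptions

while True:
    records = win32evtlog.ReadEventLog(log, flags, 0)
    if not records:
        break
    for ev in records:
        if ev.SourceName.lower() in SOURCES:
            # mask off the severity bits to get the familiar event number
            print(ev.TimeGenerated, ev.SourceName, "event", ev.EventID & 0xFFFF)

win32evtlog.CloseEventLog(log)

Timeouts typically show up as event 9 ("did not respond within the timeout period") and controller/paging errors as 11 or 51, if I remember correctly.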

The way you describe your issue really sounds like a PSU issue to me.
http://www.antec.com/specs/TPII550_spe.html

Those are the specs of your PSU: it states 40A for the +5V rail, but the little asterisk notes: "+5V, +3.3V, +12V1, +12V2 maximum output 530 Watts max."

It could be that some other rail is overloading your PSU; the usual culprits in such cases are the +12V rails, since they feed the nowadays power-hungry GFX cards.

I also find it quite interesting that your Corsair modules run at 2.8V. That is extremely high for memory modules... does your liquid cooling system have memory blocks? I'd expect any memory running at 2.8V to eventually fry... I'm also wondering why you need such a high voltage; your memory timings are not that aggressive...




Jito463
17.04.2007, 14:24
I'd have to agree about the memory voltage. I'm running Corsair XMS Xtra-Low Latency sticks (2-2-2-5), and I only bump mine by 0.1V for stability. Corsair does need more power than generic stuff, but 2.8V sounds high, since the standard should be 2.5V; that means you have a 0.3V bump above stock settings. It might be worth dropping that down a notch or two.

Aerowinder
17.04.2007, 22:02
On the Corsair website, for this particular RAM, they call for 2.75V, confirmed on their forums. My board rounds to the nearest tenth, and 2.7V wasn't stable, hence the 2.8V.

I hear no unusual sounds from my drives; the system is quiet, as are the hard drives. There are no errors in the event log pertaining to timeouts or I/O operations, and there never have been. The only time I get an error is typically when I crash: I get NTFS error 55.

Is there a benchmarking program I can look into that will stress everything in the system at once, to see if I can reliably reproduce the problem (3DMark?)? Or is there a better way to test voltage stability? I still find it odd that my problems are not consistent; is that typical of power supply issues? I've never run into a bad or underpowered one before.

Why, after I get auto-rebooted, will the RAID not detect (the system sees only one drive)? I thought the power was reset (or whatever) when the system reset?

But anyhow, thank you both for the info, certainly something to investigate.

Jito463
17.04.2007, 23:07
You can download BurnInTest from www.passmark.com, but the trial version will only run for 15 minutes at a time. To run for longer periods you have to purchase a license (it's not too bad, though: only $24 for Standard or $49 for Professional). Which reminds me, I need to upgrade to v5; I'm still running v4.

LocutusofBorg
18.04.2007, 06:19
I still find it odd that my problems are not consistent; is that typical of power supply issues?


Yes! It is, and that's why there is no "easy" 100% remote "diagnosis" for us. But what you describe could be a typical PSU issue.


As your problem also occurs without YASU, I doubt that it's a software-related problem, and my best bet is your PSU.

Jito463
18.04.2007, 14:12
As a computer tech who's been doing this for well over a decade now (wow, has it been that long?), I just wanted to confirm what LoB said. PSU issues affect everything in the system, so without a PSU tester (which aren't always perfect, but usually suffice) or a spare PSU, it's hard to pin down the PSU as the cause of a problem. It's definitely the least expensive component to swap out and test, though.

It's even more likely if you have a no-name brand PSU in your system. In this case you have a decent brand, but even good ones can go bad. I had an Enermax once that fried one mobo and nearly fried a second before I discovered it was bad. At the very least, it's worth testing. Worst-case scenario, if your PSU doesn't turn out to be the problem, at least you've got a backup PSU for testing in the future.

Aerowinder
18.04.2007, 21:36
Thanks for the input. I haven't had any problems since I posted my original message (I changed some things around), but if they continue, I will pick up a new PSU.


Does this one (http://www.newegg.com/Product/Product.aspx?Item=N82E16817341002) look good to you? I've heard a lot of good things about OCZ. It probably has a little more power than I will need, though; I think a minimum of something like 480W was recommended for this board.

Jito463
19.04.2007, 01:17
Well, while you wouldn't want for power with that thing in your system, you don't need anywhere near that much for your computer; even a 400W would suffice. Here's a 500W Thermaltake that would work nicely as a replacement if you felt the need.

http://www.newegg.com/Product/Product.aspx?Item=N82E16817153028

Nodens
23.04.2007, 21:05
Well, that PSU is powerful enough, maybe a bit overkill for your system, but I always say it's better to plan ahead. I'd suggest a 500W PSU with at least 32A on the +12V (preferably split across 2 or 3 rails, since it's safer that way) and another 32A on the +5V.
But like I said, I like to plan ahead, so considering future upgrades and how power requirements have been climbing exponentially lately, I'd go for the one you linked.

Back on topic. YASU had nothing to do with my RAID issues.

Well, the realization came after I set up an identical system with Vista x64 for code-compatibility testing and witnessed exactly the same thing happen on that machine too. At first I thought it was a chipset issue or even a BIOS one, so I decided to get a bit dirty and use a bus sniffer.

The issue is this: apparently the nVIDIA chipset drivers do something entirely stupid. They spin down the hard disks when you choose to shut down. On Win XP x64 that started happening after the installation of SP2, while on Vista x64 it happens every now and then with the OOB drivers and 100% of the time with the v15 drivers. The drives spin down, and apparently they sometimes spin down before the write cache gets flushed, resulting in data corruption. Vista x64 detects that and disables the write cache on the array.

The problem is far more apparent on Vista x64, since a warm boot with the v15 drivers ALWAYS results in the RAID BIOS extension timing out on "Detecting array" or reporting a degraded array.

Talking with nVIDIA, they said they are aware of the issue and are working on a fix. They already have a beta driver that fixes it, but it's only for Vista x86.

Workaround for anyone experiencing the same issue:

1. If Vista has not disabled write caching on your array, disable it manually via Device Manager > array > Properties. That will reduce the chances of actual corruption, and you'll only have to deal with the "Detecting array" issue.

2. To avoid the "Detecting array" issue, don't choose Restart; simply shut down instead. Wait for your system to shut down entirely and then start up normally. This gives your drives enough time to spin up properly.

3. I'd suggest keeping a backup of whatever sensitive data you have on your RAID array until we get the replacement driver.
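As an extra belt-and-braces step before shutting down, you can also explicitly ask Windows to flush everything cached for the array volume. A rough sketch with pywin32 follows; it needs admin rights, and \\.\E: is just my placeholder for the array's drive letter.

import win32file  # pywin32; run elevated

# Open the (assumed) array volume and request a flush of all cached writes.
VOLUME = r"\\.\E:"  # hypothetical drive letter of the RAID volume
handle = win32file.CreateFile(
    VOLUME,
    win32file.GENERIC_WRITE,
    win32file.FILE_SHARE_READ | win32file.FILE_SHARE_WRITE,
    None,                      # default security attributes
    win32file.OPEN_EXISTING,
    0,
    None)
win32file.FlushFileBuffers(handle)  # asks the OS (and device) to flush caches
handle.Close()

This doesn't fix the driver's premature spin-down, of course; it just narrows the window in which unflushed data can be lost.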