Dell PowerVault 220 are Junk, and I’m Tired!

What a couple of weeks I’ve had. We have a pair of Dell PowerVault 220s at the office. Each of these holds 1.2 TB of data and provides about half the data storage for the entire company. When the boxes work, they work just fine, but watch out if they start to fail. What follows are my tired, frustrated ramblings.

Two weeks ago, we had one just go totally offline from the server to which it was attached (for no reason). That resulted in an entire day of downtime as we had to restore from tape (it wouldn’t come back on-line cleanly). Then this past Saturday, we had a drive failure in the same one. That resulted in 6 hours of work on Saturday to get it back on-line and an update to Windows Server 2003 SP1 (Dell told us to do this), before they would admit that it was a hardware problem and ship us a replacement drive.

Jump forward three days. On Tuesday morning another drive in the array died while we were still in the process of doing the full backup. Of course the RAID-5 array, which is supposed to fail over gracefully when a drive dies, didn’t, and the server crashed hard. We were able to bring the crippled server back on-line and complete the backup. After that was done, a Dell technician replaced every part in the unit except for the “good” drives and the case. So we started the process of restoring data. Down all day Tuesday.

Today, we’re having problems with the restore, so part of the server is still offline (as of 7:00 PM). We have to babysit the restore because it stops after 1 hour and 45 minutes (Veritas says we need to do a reinstall with the latest patch, but not until we’ve completed the restores we have going right now). So we’ve broken the job up into parts that take about an hour each. It is going to be another really late night.
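
For anyone in the same boat, the "break it into hour-sized parts" idea can be sketched roughly like this. It is only an illustration, not what we actually ran: the folder names, sizes, and the 60 GB/hour throughput are made-up assumptions, and `plan_restore_batches` is a hypothetical helper, not part of any backup product.

```python
from typing import List, Tuple


def plan_restore_batches(
    items: List[Tuple[str, float]],
    throughput_gb_per_hour: float,
    max_hours: float = 1.0,
) -> List[List[str]]:
    """Greedily group (name, size_gb) items, in order, into batches whose
    estimated restore time stays at or under max_hours.

    An item that is too big on its own still gets a batch to itself.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_hours = 0.0
    for name, size_gb in items:
        hours = size_gb / throughput_gb_per_hour
        # Close the current batch if adding this item would exceed the limit.
        if current and current_hours + hours > max_hours:
            batches.append(current)
            current, current_hours = [], 0.0
        current.append(name)
        current_hours += hours
    if current:
        batches.append(current)
    return batches


# Hypothetical folders and sizes (GB), restored at an assumed 60 GB/hour.
folders = [("Projects", 30.0), ("Accounting", 25.0), ("Archive", 40.0), ("Users", 10.0)]
for number, batch in enumerate(plan_restore_batches(folders, 60.0), start=1):
    print(f"Restore job {number}: {', '.join(batch)}")
```

Keeping each job well under the 1 hour 45 minute mark leaves some slack for tape throughput that is slower than estimated.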

AAGGGHHH!!!!

Needless to say, I’m frustrated, frazzled, and tired. The firm’s partners are fairly understanding, but they’re not happy, to say the least. At the height of downtime, it cost us a bunch in lost billings per hour. So on Friday, I will have to do a de-brief and lessons-learned presentation. That will be followed by a couple of months of intensive research into what technologies we can implement to better protect ourselves and minimize downtime. (Until now they weren’t ready to spend the kind of money that this will cost.)

The silver lining in all of this is that we’ll get a lot of new technology to implement. Finally something “fun” to sink my teeth into again.

35 Comments on “Dell PowerVault 220 are Junk, and I’m Tired!”

  1. Master Foley Says:

    there is always a bright side to things

  2. 51 Hours in 3 Days at Nerhood Weblog Says:

    [...] We had a major server crash. The crash was similar to one we experienced back in September. It involved our second Dell PowerVault 220. We are not sure why the array initially went offline (Dell tech support wants to blame a cable, but that is really just a bunch of hogwash), but we subsequently find three drives with media errors so they ship us three new ones. (After the fact we determine that we are one revision behind in firmware on the array. Guess what that update fixes? It corrects problems with the array timing out under heavy loads, just what we are experiencing.) It tried to fail over to the hot spare, but the server crashed completely. Oh great, we’ve been here before. We try to bring it up and do a check disk. That fails (just as we expected). So our only recourse is to rebuild the array and restore from backup. [...]

  3. Frodo Says:

    hi,
    we had some similar problems with these 220s from Dell.

    One day a disk failed in the storage box and was replaced.
    A week later another disk failed in the storage box and was also replaced.

    So far so good.
    But then the storage box was off after the weekend, for no apparent reason.
    The result was corrupt data on the system and a lot of night work.

    The box had this error three times on Sunday at more or less the same time. The answer from Dell was: “Could you please first update all firmware and all drivers and BIOS? But be sure that you have a full backup before you start.”
    What a bad joke!!
    After that we monitored our power lines and UPS for three weeks, but we didn’t find anything. So it’s really weird and makes no sense, because we did not update all that stuff and the problem has stopped by itself for the moment.
    For me it’s clear that we need other storage hardware as soon as possible.

  4. kbn Says:

    As I just posted, we had another outage. That makes 4 times in 6 months across two different PowerVaults. The funny (in an ironic sense) thing is that next week I have a storage consultant coming in to help us analyze our system and plan a strategy for replacing them.

    I’ll be posting about how that goes.

  5. Eddy Says:

    We are having the same situation with the PowerVault 220s. After 2 hard drive failures last month, we got 4 today at the same time. I really don’t understand what is happening. Just wanted to share this with you.

  6. david hajek Says:

    We have the same issues: faulty drives that bring the PowerVault to its knees. We have had 5 outages in the last 6 months. This is not acceptable. I’m working with Dell to return that box.

  7. kbn Says:

    Eddy and David,

    Sorry to hear you are having problems too. The more research I do and the more I talk to experts in the data world, the more I come to believe that the PERC controllers are the real problem. The PowerVaults are just a “dumb” disk array. It is the PERC cards where the supposed “smarts” are, and they are really bad.

    Because of our ongoing problems, Dell shipped us an entirely new PowerVault, with new drives and a new PERC card. All had supposedly been “burned in”. When we hooked it all up, the first 4 drives could not be seen. Turns out it was the PERC controller.

    In any case, we are actively looking to move away from the Dell product line for storage.

  8. Linda Says:

    If you are still looking for a replacement, call me. (603)964-7840. The best thing to do is to get away from a host based RAID solution; too difficult to troubleshoot and too much of an impact on your server. We have a stand alone RAID array at the same price as the Powervault 220 or 221S. Excellent reliability, simple plug and play, high performance, at a Powervault price!

  9. Goodbye Dell, Hello NetApp (almost) at Nerhood Weblog Says:

    [...] I’ve been writing about how Dell’s PowerVault 220s are junk for quite awhile now. We experienced our 3rd major crash for the year this week. We had everything restored and back in operation 23 hours later. [...]

  10. Gam Says:

    Ohhh I just googled PV220S for fun since I just installed 2 today on 4 PE26xx/1850 machines to Perc4 hosts - I guess I get to have “job security” at least - and I can put in my 2 weeks when it starts to look too bad. Thanks for the head’s up. :)
    “Oh but that SAN is $20k!” says the boss.

    Where should I go for vacation after I put my 2 weeks in?

  11. kbn Says:

    Gam,

    With each additional crash, I’m more convinced it is the PERC controllers and not really the PV220s. We actually had a drive crash on us again this past week, but this time we had forgotten to configure the hot spare.

    The system ran just fine in degraded mode. We then got everyone off the server and replaced the drive. It started the rebuild and then we let everyone back on. It worked like it was supposed to.

    I think that the PERC 3 (and now PERC 4) controllers just can’t handle failing over or moving to the hot spare when they are under any kind of load.

    My recommendation is not to configure any hot spares, but to watch the boxes like a hawk. I recommend using Nagios, and I’ve even written some scripts specific to the Dell PERC controllers; you can find them in my “Monitoring Dell Hardware with Nagios” post.
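
To give a feel for what such a check looks like, here is a minimal sketch of the decision logic only. It is not one of the scripts from that post: the state names ("online", "rebuilding", "failed") and the `check_array` helper are assumptions for illustration, and actually reading drive states from a PERC controller is left out entirely. The only part taken from Nagios itself is the standard plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from typing import Dict, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_array(drive_states: Dict[int, str]) -> Tuple[int, str]:
    """Map drive states (slot -> state string) to a Nagios status and message.

    The state strings are hypothetical; a real check would translate whatever
    the controller's management tool reports into these values.
    """
    if not drive_states:
        return UNKNOWN, "UNKNOWN - no drive states reported"
    failed = sorted(slot for slot, state in drive_states.items() if state == "failed")
    rebuilding = sorted(slot for slot, state in drive_states.items() if state == "rebuilding")
    if failed:
        return CRITICAL, f"CRITICAL - failed drive(s) in slot(s) {failed}"
    if rebuilding:
        return WARNING, f"WARNING - rebuild in progress on slot(s) {rebuilding}"
    return OK, f"OK - {len(drive_states)} drives online"


if __name__ == "__main__":
    # Hypothetical sample data; a real plugin would query the controller here.
    status, message = check_array({0: "online", 1: "online", 2: "failed"})
    print(message)
    raise SystemExit(status)
```

With no hot spare configured, a CRITICAL from a check like this is the cue to get everyone off the server and swap the drive by hand.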

  12. Steve Says:

    I have a bit of a different take and a question. I relocated offices with several 220s and 2650s, and I have had one lose its RAID configuration. Any good suggestions? Yes, they are the !@#$%^&*() PERC 3 cards. I’m at the point of just blowing away the 2 bad logical volumes and rebuilding them. At least it’s only data, not the OS. But then one of my site techs notified me that one of his sites lost power, and now it is doing the same thing to him on one of his 220s. All the lights are amber. Any good suggestions, other than the trash can?
    I am also looking at making the move to alternative storage solutions. I have spoken to EMC and a company called Left Hand Networks: http://www.lefthandnetworks.com/press/bcbr_reprint_121304_sm.pdf

    I like their solution; it saves me from buying Fibre Channel switches. It uses iSCSI and NICs. I have worked with Network Appliance and I don’t like them: other problems aside, they are costly, you pay for any option you want, and there are yearly renewals.

    Thanks for anything.

    Steve

  13. kbn Says:

    Steve,

    When we were looking at vendors, one that made our short list for iSCSI solutions was EqualLogic. They made a really nice product that seemed like it would scale very nicely. You might want to take a look at them. They were competitively priced too. I know some people who went the Left Hand route and seemed to be happy with them.

    As far as your problems with dead and/or missing arrays, I would first suggest that you call Dell. Hopefully you are still under maintenance. If not, then my only advice is to blow away the arrays and restore from tape. We got very successful at doing that. Dell always wanted to spend days troubleshooting our problems, and I just couldn’t do that with production machines. For every minute they were down we were losing money.

    Good luck.

    –ken

  14. JAyBOD Says:

    I too am experiencing “issues” restoring data to a PV220S hung off a PERC 4/DC installed in a PE2650 running W2k3 Server (Latest BIOS, Drivers, FW, and patches all around.)

    Data from tape is coming from BackupExec 10.0 with LTO2 drives and going to the system above with bandwidth to spare.

    The desired restore job contains a fairly broad directory tree (10,000 across) but not necessarily very deep.

    I’m trying to restore a mere 38GB of data to a new 550GB RAID 5 volume in the 220S. A little less than halfway through the restore and the data throughput from tape starts to crawl and then stops altogether. I’m left with no logging other than Backupexec telling me “The job failed with the following error: A communications failure has occurred.” (App Event 34113).

    Any help on this would be greatly appreciated! Thanks.

  15. Dan Draper Says:

    I am having the same issue and it is frequent. I was in doing maintenance one night and I heard the unit squealing. I looked at it and a drive had failed. I went to the KVM to log onto the server and investigate when the alarm became louder. I went back and looked, and the array now had 5 bad drives. This is attached to a production server that houses our SQL server for one of our websites. Fortunately, we use replistore, so I failed over to a hot spare and I was back up. Days later the PV on the hot spare failed and I had to fail over again. I have had this happen on these 2 boxes about 3 times each in a few months.

    Dell has replaced all components in both PVs but it still happens. I am having both replaced. I am also thinking of purchasing Adaptec RAID controllers for the arrays. Has anyone tried changing from the PERC POS to another controller????

    Dan Draper

  16. Chris Says:

    Two junk 1850 x64 poweredge servers. Perc 4/dc and 220s arrays.

    I bought these with the exact same config. I have updated them all the way to the latest versions. When they transfer a lot of data.. Like a 20 gig file. Boom. They go down.

    Anyone else with W2K R2 x64 issues?

    I can’t believe two of them have the same issue! My fibre arrays are great though.

  17. Rusty Says:

    Well, I’m not convinced the problem is with the PowerVault chassis or the PERC controller… at least, not in our case. We have approximately 240 Dell servers and around 25 PowerVault 220’s. Over the last year and a half, we’ve had a large number of drive failures - somewhere in the neighborhood of 35 drives, I think. Some were due to age, but others involved brand new equipment. They seemed to occur in batches, too… 10 drives in a few weeks, then nothing for several months, then another six, etc. I found the problem occurred with both PowerVaults and with servers. I also found the problem occurred not only with RAIDed drives on PERC controllers, but also with non-RAIDed drives attached to Adaptec SCSI cards (ouch!). I was so concerned at one point, I actually had the power in our server room checked for purity.

    After a while I realized something. All of the failed drives were Seagates. Through the years we’ve gotten Dell equipment with both Seagate and Fujitsu drives. The Fujitsu drives, however, haven’t had a failure (yet). I don’t know if this is coincidental, especially since Seagate drives outnumber the Fujitsu drives in our shop 4 to 1, but given the number of drive failures we have experienced, the higher failure rate with the Seagate drives has grabbed my attention.

    By the way, I should note one of our drive “failures” was actually a faulty backplane in the 2650, not the drive itself.
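
As a rough back-of-the-envelope check on the "is it coincidental?" question, one can ask how likely it would be for every failure to land on a Seagate if brand made no difference. The sketch below assumes the figures quoted above (about 35 failures, a 4 to 1 Seagate to Fujitsu ratio) and, importantly, that failures are independent of each other, which the batch pattern described above suggests may not hold. The function name is hypothetical.

```python
def prob_all_failures_one_brand(brand_share: float, failures: int) -> float:
    """Chance that every one of `failures` independent failures hits the brand
    that makes up `brand_share` of the drives, if brand does not matter."""
    return brand_share ** failures


# 4 to 1 Seagate to Fujitsu means Seagate is 4/5 of the drives.
p = prob_all_failures_one_brand(4 / 5, 35)
print(f"Chance of 35 out of 35 failures being Seagate by luck alone: {p:.4%}")
```

Under those assumptions the chance comes out at roughly 0.04%, so pure coincidence looks unlikely, though a bad manufacturing lot or correlated batch failures could explain the pattern just as well as a brand-wide problem.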

  18. kbn Says:

    Rusty,

    Interesting theory about the drive failures. I honestly don’t remember which kind we had, and I’m not about to go pulling drives out of our remaining system.

    In our case I’m still 85% certain that it is the PERC cards that are the problem (at least with our crashes). We had what I felt was an exceedingly large number of drive failures with these PV220S units. More than with all of my other Dell servers combined.

    Our final solution (even though it sucked) was to not have any hot spares. When a drive failed, we kicked everyone off the server and did a manual failover. This worked every time. Yes, it meant downtime (of about 15 minutes), but it was clean. But when we would try to do it with a global hot spare, every time it would crash the server hard and corrupt the disks such that it would require a restore from tape. At the end we were able to do that in just over 18 hours.

    Yes, I think something is wrong with the PowerVault 220s, particularly given the large number of drive failures, but they are JBOD (Just a Bunch of Disks) boxes, with the “smarts” being on the PERC card. And the PERC cards just couldn’t handle these units under load.

    But hey, in my case it doesn’t matter any more because I’ve replaced all of my production units with a NetApp Storage Appliance that is working BEAUTIFULLY.

  19. Are Dell PowerVault 220s Junk? « What KNot Says:

    [...] Do you have any Dell PowerVault 220s? Are you having problems with disk corruption and crashing servers? Do you not trust your PERC cards? There is a lively discussion about the problems users are having with their PowerVault 220 systems going on at a post called Dell PowerVault 220 are Junk, and I’m Tired!. I recommend that you take a couple minutes and check out the site. [...]

  20. Gareth Chambers Says:

    There is an urgent firmware upgrade for the PERC4/DC card which I suspect is the cause of this. I’m just about to put this on after yet another failure with our 220S.

  21. Anthony Says:

    Hi,

    Just thought I’d poke around to see if any PowerVault 220s were going cheap… and I bumped into this. It’s interesting to hear of the problems, so I thought I must add mine. I had an Adaptec dual-channel SCSI controller with brand new Fuji drives that fell off the RAID volume when I hit the DB really hard. I swapped all sorts and got really hacked off, thinking that RAID never really does save your bacon when the drives go offline reporting no errors whatsoever anywhere, which makes it hard to fix. Eventually I plugged brand new Adaptec U320 round cables from the controller to the drives on one line instead of through the external SCSI connector, and I have not had a single failure since, period. I guess the connectors changing from 68pin to vhdi were too cheap and nasty to tolerate under load. Ho hum, I nearly threw the drives back at Fuji as a last resort.

    Now I thought I’d start to spread the SQL DB across controllers and spindles to get a speed increase. I was looking at an ‘upmarket’ enclosure but now I’m worried about the 220 :)…. Still, from what you guys have said to date: get the firmware up to date on the PERC 4 and stay away from a global hot spare. (Of course, just to test this out I could set up a 220, thrash the living daylights out of it, and pull a drive to test whether the hot spare kicks in under load!)…. and practice recovery if it doesn’t!

    Anthony

  22. Wayne Fischer Says:

    I’ve run a PowerVault 220S without any issues for over four years now and until I switched to newer PowerEdge 1850 servers with a PERC 4DC card in it I didn’t have a single issue.

    Now I find that the PowerVault frequently loses communications, reports that the hard drives are bad, claims that the entire enclosure has failed etc. I’m entirely convinced that there is an issue with the PERC 4/DC cards and the PowerVault 220S after reading this thread. I’ve spent over $3000 replacing the power supplies, EMMs, and backplane as well as bad drives etc. In short, I’ve replaced every component in the PowerVault 220S Enclosure

    I’ve replaced every component (including new hard drives) in the PV 220 to no avail. Thus, it has to be the host, or in this case the PERC 4/DC cards I was assured would work with the PV 220S. I’m very disappointed, and I feel all of your pain and frustration.

    Has anyone come up with a working solution? I’m going to try swapping out the PERC 4/DC cards with a PERC 3/DC (which worked fine previously) and see if it works (even though the Dell Sales Technician informed me that there were “heating” issues with the PERC 3/DC in PowerEdge 1850 servers). I’m not sure I trust their technical knowledge on this matter at this point.

    Dell you really let me down on this one.

  23. kbn Says:

    Wayne,

    I feel your pain and I wish you luck. As you can see from this discussion, no one has a “working” solution. My particular problems were with both the PERC 3 and PERC 4 controllers in PowerEdge 2650s.

    Let me know if going back to the PERC 3 helps and good luck, and yes I agree that Dell does not know what they are talking about when it comes to this product set.

  24. SteveB Says:

    I have a 220S, and it’s connected to an Adaptec SCSI controller (160). I am doing backup to disk, and periodically the disk pack goes offline. A reboot brings it back online, but the next night during backups, it goes again. Diagnostics show a bad disk, but it seems fine after a reboot.

    I too am very disappointed with my purchase.

    I do not have a perc controller in my Dell 2850.

  25. Don Says:

    It is with somewhat perverse satisfaction that I read these posts. I cannot say how amazingly underwhelmed I am with the entire Dell product line, the PowerVaults leading the way. The complete and utter failure of our production PV220S and Dell’s complete oblivious approach to “Customer Support” enabled me to ban Dell purchases and move up to an Enterprise-Class vendor (Dell can dream on this front, but an EC vendor does not sell $500 laptops out the backdoor). Our new Sun Servers, Storagetek Arrays, and Apple MacPros (yes, running XP and/or Vista) and MacBooks have made us all vow never to look back to Dell again.

    Oh, yeah — part way through the process of upgrading, Dell followed up on our complaints with a proposal to sell us EMC product. The very fact that Dell was backing them up caused us to strike EMC from our potential vendors for storage as a sort of “knee-jerk” reaction (anybody at EMC paying attention?). Perhaps if they had offered some credits to make up for the four days of lost business (we host thousands of high-traffic/e-commerce sites) we might have let them buy us lunch, but as it is — they will never see another penny from any organization that listens to me.

  26. Emerson Says:

    Hi folks,
    Now we understand the reason for the joint venture between Dell and EMC for corporate accounts. But in terms of price for entry-level machines, HP and IBM are well positioned. I had a good experience with an IBM DS3000 storage unit that I just customized for my customer. Easy, friendly, and exceptional performance. I recommend it.

  27. Dycie Says:

    Hi All,

    We have had a PowerVault 220S for about 2 years now, and it’s been a lemon since day one. In the first two weeks it crashed 5 times, taking the server down in the middle of the working day.
    After much heartache we got a flash dump utility from Dell which dumped the plain text log from the controller. The PERC 4/D card was detecting a failed disk in the array and marking it off-line, but would later mark it as ready and try to rebuild onto it. This would repeat until the log in the controller filled, which hung the controller and crashed the server. Throughout this experience the PowerVault showed no warnings or visible alarms of any kind.
    It’s been OK since, but the posts here have only confirmed my suspicions that a disk failure is more than likely to kill the system again.

    Thanks to the wonder of this experience and continual assurances that the hardware was perfectly fine we moved to IBM for all of our new servers and have not looked back. I have been reviewing whether to replace this heap of crap out of cycle and clearly with everyone else’s horror stories I need to.

  28. SWalker Says:

    You can learn something new every day in this business. We use PV220s and 3DC/4DC PERCs. I upgrade the cache to 256MB on the controllers, blow the latest firmware on the PERCs and ZEMMs, use matched drives with matching firmware in every array (including hot spare - no global) and do a read/write burn-in for 10 days. We have RAID 5 and RAID 10 arrays. I did receive a 4DC from Dell with a bad connector, once. Seagate DX10 firmware eliminates disconnect issues. I have Quantum, Fuji and Seagate U160 & U320 drives and they all come from the same lots. Not one failure. We use Diskeeper and we love it. If we could afford an EC solution, I would use it, as I come from Compaq ISSG competitive analysis and you can induce all kinds of failures in these boxes as well as other manufacturers’ JBODs. Drive rebuilds are particularly precarious and Dell support does not document any step-by-step procedures to do this. Without going into great lengths - try this.

    Use a Dell box (I found it can be any) to check consistency and low-level format the drive. If you have rebooted the system with the failing drive, wait until the amber light comes back on, as the PERC is recognizing the failure again and setting up to prep failover. I ensure the replacement drive is the same as the failed one (firmware also). Unplug and replace. It begins rebuilding, and after it completes, reboot after hours. I had to do this with 2550, 2650 & 2850 internal drives but have had no failures in PV220s (knock on wood). I have bought product both new and used, and wound up always having to test, burn in and blow firmware in any event. In several burn-in sessions, I have lost a couple of drives, mostly Seagates. A little hint - if you are using Seagate OEM drives with firmware 0007 or lower on PERC controllers, you are asking for trouble.

  29. Dave Grannas Says:

    We have been having issues with the PowerVault 220 for a couple years now. After physically moving a server, and the attached array, 3 times and each time having an issue with the PV recognizing the physical drives, we think we finally came across a fix.

    First of all, verify you are connecting the array to the right SCSI card after the move. In one scenario, we had two scsi cards and had “failed” physical and “offline” logical drives when the array was attached to each card. This added a couple hours of troubleshooting.

    TIP: Watch the bootup sequence and notice which SCSI BIOS recognizes the attached drives. An Adaptec controller will list the physical drives attached to its interface during the bootup sequence.

    Finally after reading these (and many other) forums, we upgraded the PERC4/DC firmware to 350O and the drives were finally recognized (when connected to the right controller). However, we had to find the firmware upgrade on Dell’s ftp site (not the website) at http://ftp.us.dell.com/scsi-raid/.

    And, thanks to Ben Gordy at Dell, we got a direct link to the latest firmware for the PERC4/DC: http://ftp.us.dell.com/scsi-raid/RAID_FRMW_LX_R130906.BIN

    As a side note, your OS, particularly Linux, may not recognize the drives after bootup. For Linux, run “fdisk -l” to list all the physical drives attached to the system. If you see your array drive(s), try to mount them based on the device.

    Example: mount /dev/sdb1 /mount_dir

  30. Robin Kearney Says:

    Just thought I’d say something nice about the 220’s! I’ve had 10 or so of them attached to Perc 4’s for between 3 and 4 years now, never missed a beat. Until today when a blower went. Still this proves they can be reliable sometimes! :)

  31. kbn Says:

    Robin,

    Glad to hear that yours have worked great. I’ll keep my fingers crossed for you that you never run into all the problems others have faced.

  32. BW Says:

    I have 2 PowerVault 220s, and the only issue I have had in 3 years was 2 hard drives that went bad, and DELL replaced them the next day. I have nothing but nice things to say about DELL and the PowerVault 220s.

  33. Craig Says:

    Don’t really see these as enterprise systems. But still they should at least work. Ours were retired to a test lab, and it is OK if they fail there.

    Move them to test, and forget about it.

    Go with entry level Hitachi SAN when you need over 1TB of fast disk. Also SAN allows sharing that disk if you cluster.

    You are looking at 20K for sure to get started, but nothing else really ensures uptime.

    Craig

  34. soiledhalo Says:

    Hiya, just googled the 220s and came across this site. So sad to see you, as well as others experienced so much hell. Needless to say, we own a 220 and for the past 5 years it’s been very good, only had one problem when a drive in our raid 5 array died. We were able to quickly restore to the hotspare and all is well. Guess my mileage varied.

    Regards,
    Richard.

  35. kbn Says:

    Richard,

    I’m glad you have had good success with them. I really think the problems that we and others faced were due to the amount of sustained data access as well as potential firmware problems. Ours have been gone for over 2 years now, replaced with NetApp equipment (a good investment for us).

    I wish you continued success (and hopefully you didn’t jinx yourself by saying you’ve had almost no problems).

    –Ken
