Posted by: esa | January 6, 2011

Aggregate Snap Reserve NetApp

Discussion from the NetApp administrators mailing list:

Question: How many of you NetApp admins set your aggregate snap reserve to zero as soon as you create the aggregate? How many leave it at the 5% default, or even grow it? Assuming no SyncMirror, of course. What do you believe are the pros and cons?

Eugene Kashpureff answers:

There are two issues to consider here: the aggregate snap reserve and the aggregate snapshot schedule. Most admins ignore both and leave them at the defaults. Some zero out both the snap schedule and the snap reserve, but I don’t think this is a good idea.

First, we should ask: why use aggregate snapshots at all? As far as I can tell, aggregate snapshots are used in three places by Data ONTAP:
1. Snap Restore. You can snap restore an aggregate, which will revert all volumes in the aggregate to the state they were in at the time of the aggregate snapshot.
It’s not very likely you’d want to do this; Snap Restore is normally done at the volume level.
2. SyncMirror. SyncMirror uses snapshots to resynchronize the plexes when recovering from a broken mirror. These snapshots are independent of the default scheduled snapshots on the aggregate. The snapshot frequency for SyncMirror aggregates is set with the ‘aggr options aggrname resyncsnaptime’ option, which defaults to 60 minutes.
3. WAFL_check or wafliron. Although there are all kinds of mechanisms built into Data ONTAP to keep data safe and available, it is possible to corrupt a WAFL filesystem. This is usually the result of doing something stupid, like turning off disk shelves while the filer is running, adding an incompatible shelf to a filer while it is running, etc.
wafliron will try to find a recent aggregate snapshot to leverage, and it can run much faster if one exists (a quick way to check whether such a snapshot exists is shown below).
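As a minimal illustration (assuming an aggregate named aggr0, as in the examples further down), you can list the aggregate snapshots that currently exist and, on a SyncMirror aggregate, adjust the resync snapshot interval mentioned in point 2:

sim1> snap list -A aggr0
sim1> aggr options aggr0 resyncsnaptime 60

The first command shows whether any aggregate-level snapshots exist for wafliron to leverage; the second sets the SyncMirror resync snapshot interval in minutes (60 being the default mentioned above).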

So.. you probably do not need to keep the default aggregate snapshot schedule:

sim1> snap sched -A
Aggregate aggr0: 0 1 4@9,14,19

You may want to consider setting the schedule for a non-SyncMirror aggregate to something close to what SyncMirror uses:

sim1> snap sched -A aggr0 0 0 2
sim1> snap sched -A aggr0
Aggregate aggr0: 0 0 2

This schedule takes an aggregate snapshot every hour and retains the two most recent. If you ever do get into a corrupted-aggregate situation, this can save you from a painful wafliron experience; the difference can be doing the recovery in 3 hours instead of 3 days.
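To confirm the new schedule is actually producing snapshots, and to see how little space they typically consume, list them (aggr0 is just the example aggregate used above):

sim1> snap list -A aggr0

If the hourly snapshots show up in that listing, the schedule is working; the space-usage columns also tell you whether the reserve is anywhere near being consumed.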

Second, we should consider the aggregate snapshot reserve. Although the reserve may not be needed to account for snapshot space, there are several other issues to consider with regard to using and reserving space in the aggregate. There are three other reasons I might want to reserve extra space in an aggregate and keep it from being used by flexible volumes:
1. Synchronous SnapMirror: SnapMirror in synchronous mode in Data ONTAP prior to 7.2.2 logged all NVRAM writes on the destination to the root volume in /etc/sync_snapmirror_nvlog/<dstfsid>.log[0|1]. In Data ONTAP 7.2.2 and later these NVLOG files are stored in the free space of the aggregate that contains the SnapMirror destination volume. The Data ONTAP 7.3.4 Online Backup and Recovery Guide has a page on ‘Estimating aggregate size for synchronous SnapMirror destination volumes’. It says you should have 20 times the NVRAM size of free space in the aggregate containing the destination volume. I’m not sure where this comes from, and I have a hard time believing this much space is needed (see the rough numbers in the example after this list)! You may want to use the aggregate snap reserve to guarantee this space is available in aggregates containing synchronous SnapMirror destination volumes.
2. ASIS (Dedupe): In Data ONTAP 7.3 the dedupe fingerprint database was moved from the sis volumes to the free space of the containing aggregate. You should ensure that the aggregate has free space that is at least 3 percent of the total data usage for all volumes.
3. Write Performance (Fragmentation): All file systems are subject to fragmentation over time. Two factors drive file system fragmentation: how dynamic the data is, and how much free space is available for new writes. It’s best to keep your file systems at less than 90% full to avoid ongoing fragmentation. With flexible volumes on an aggregate there is no pre-allocation of space to volumes; all volumes are allocated space from the containing aggregate. Therefore, it is at the aggregate level, rather than the volume level, that we should be concerned with utilization of the file system. You may want to use the aggregate snap reserve to guarantee this extra space is always there on the aggregate. See also the ‘reallocate’ command (an example check appears after this list).
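A few quick checks tie these three points together. To put the 20 x NVRAM rule into perspective: assuming, purely for illustration, a controller with 2 GB of NVRAM, the rule asks for 20 x 2 GB = 40 GB of free space, which is only about 1% of a 4 TB aggregate, so a 5-10% reserve covers it comfortably. Aggregate free space, per-volume dedupe savings, and volume layout can be checked with commands along these lines (aggr0 and vol1 are hypothetical names, and reallocation may need to be enabled with ‘reallocate on’ first):

sim1> df -A aggr0
sim1> df -s vol1
sim1> reallocate measure /vol/vol1

df -A reports aggregate space usage, df -s shows the space saved by dedupe on a volume, and reallocate measure gives an indication of how fragmented a volume’s layout has become.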

To summarize: I don’t zero out the aggregate snap reserve; I increase it to 10%. I also change the aggregate snapshot schedule to take hourly snapshots.
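In CLI terms that summary boils down to two commands, again assuming an aggregate named aggr0 (the 10% figure is the preference stated above, not a NetApp default):

sim1> snap reserve -A aggr0 10
sim1> snap sched -A aggr0 0 0 2

The first sets the aggregate snapshot reserve to 10% of the aggregate; the second is the hourly schedule shown earlier.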

____________________________________________________________________________________________

Sam Wozniak replies: Eugene brought up all the points I was aware of, in addition to some others. I usually leave it at the default 5% with the default schedule. Also remember that if the aggregate was created with System Manager, the reserve will be set to 0 by default; I always go back and reset it to 5%. I do like the idea of increasing it to 10% in consideration of the performance impact of running the file system at that level of utilization.

One other benefit of keeping the aggregate snap reserve and schedule is recovery in the event that you or a customer accidentally deletes a volume while intending to delete, say, a snapshot. Just something I’ve heard of happening.

Most engagements I work on are pretty cookie-cutter, so my rule of thumb is to stick with the recommended NetApp defaults and, in most situations, best practices. Just what I’ve been taught…

____________________________________________________________________________________________

David Nixon • I set it to 0% and always keep in mind that I need to leave the aggregate with 20% free space. We have too many systems to ever do a restore of an aggregate; if a volume were destroyed, we would just SnapMirror the data back from the DR site. But based on the comments above, I can see where it could be helpful for certain issues.

____________________________________________________________________________________________

Chris A. • Good points, but for those who don’t already know: an aggregate snapshot can only be used to revert an entire aggregate, not specific FlexVols within it. I would go with David’s suggestion and use SnapMirror to restore individual volumes, keeping a small number of aggregate snapshots in case of dire emergency.

____________________________________________________________________________________________

Lian Fan • If you use dedupe and your Data ONTAP is 7.3 or above, I strongly recommend reserving at least 5%, because in 7.3 the dedupe metadata is stored in the aggregate.

____________________________________________________________________________________________

Earl Bryant (VCP NCDA): One issue that bears mentioning for the responsible storage architect/engineer is monitoring. Utilizing NOM (Operations Manager) or another applicable tool will give you the confidence to drop your aggregate reserve to a more sensible number, as you will have a better handle on volume utilization in the aggregate. Also, if there is no DR storage implemented in your scenario (duplicate NetApp storage with SnapMirror/SnapVault implemented), this may dictate a dependence on the aggregate reserve for wafliron processes, should you need to recover from catastrophic corruption. However, I have only needed to do this once in 10 years; something must have really “screwed the pooch” in order to have to utilize it.

Another point is recovery. As has been mentioned here, unless you are prepared to recover every volume in the aggr (no aggregate level “FlexClone” yet!), or are utilizing SyncMirror, maintaining aggr snap reserve isn’t buying you a lot.

This leaves us with two options for aggr snap reserve:
1. No-brainer insurance policy against filling up the aggregate (but we ARE monitoring our consumption already, right?)
2. To assist in advanced recovery with wafliron (but then again, since a single initial snapshot takes up nearly no space, are we really worried about consumption? We’d probably configure the aggr snap schedule to retain only the ONE latest snapshot, as in the one-liner below.)
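For reference, a “keep only the latest one” schedule would look something like this (aggr0 again being just an example aggregate name):

sim1> snap sched -A aggr0 0 0 1

This keeps a single hourly aggregate snapshot, which costs almost nothing in space but still gives wafliron something recent to leverage.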

As I mention to my DOTF students, if their main concern is clawing back (a now “measly”) 400-600GB of space from their fully-scaled aggr, then it’s time to consider reorganization of their volumes, ASIS, compression, and some better storage utilization monitoring.

____________________________________________________________________________________________

OK… maybe this helps. 🙂
