Oracle Extreme Performance Systems
During the Oracle 12c Beta Test program new engineering possibilities emerged, so how fast can we now go? The answer: faster than you ever imagined.
Over the last decade, consolidation of computing resources has changed the cost model for system provisioning, and architectural patterns have emerged for high performance database systems that disrupt all previous thinking. In 2003 I spoke at length with senior executives from Intel and asked a simple question: how long would we have to wait to see the traditional spinning disk replaced by solid state alternatives? On every occasion the question went unanswered, met only with wry smiles.
The birth of the current generation of high performance database systems started when Oracle and HP introduced Exadata to the world in 2008, at Oracle Open World. It began with an internal project called SAGE, an acronym for Storage Appliance for Grid Environments. This was to be an open hardware stack solution, i.e. with no proprietary hardware. The competition was Teradata and Netezza (now owned by IBM), who relied much more on defined or proprietary hardware solutions, so Oracle's aim was to compete through a) openness, and b) Moore's law. The latter meant that as commodity hardware improved with technological advances, the SAGE software running on it would benefit from these advances whilst the competition struggled to test and release newer appliance-based solutions. The idea of openness is also important, because it could be argued that much of Oracle's success in becoming the dominant relational database vendor over the past few decades came from the policy of offering a wide variety of ports and platform choices.
The headline from the Open World release read, "Oracle Introduces The HP Oracle Database Machine: Delivering 10x Faster Performance Than Current Oracle Data Warehouses".
In September 2009, Oracle introduced Exadata V2, this time on Sun Intel-based hardware, utilizing the 11g R2 database platform and a significantly upgraded cluster and ASM stack that allowed individual ASM disks to go offline and return without system outages and long disk synchronization recovery periods. This new feature is not restricted to ASM alone: its operation is only available from 11.2.0.2 of the database, and that database compatibility setting is needed in the ASM instance for it to take effect. So if you have ever wondered why Oracle does not allow Exadatas to run pre-11g R2 databases, this is the reason, a reason that will become more evident as we cover high speed PCIe-based flash cards later in this article.
Exadata V2 was a turning point. It was the first Oracle Database appliance architected as a no-compromise engineered system to be released to the market. Its architecture set new standards and ignored the conventional wisdom promoted by vendors like IBM, EMC, HDS and Cisco. Technologies that were beyond consideration, just too new, found their way into this platform. Storage Cells that run database processing within the storage device had never been seen before, and today they form a major part of Oracle's strategy.
The challenge of today’s high performance architecture
As a result of Oracle’s alignment with open standards and the view that inclusion is paramount to extended market share we are provided with options in achieving high performance database systems with performance characteristics that rewrite the rules book on system performance and bottlenecks in tier performance. In the early part of the last decade tablespaces were tuned with data files spread across multiple disks or jbods using combinations of RAID and individual disks to resolve IO bottlenecks with disks attached by SCSI interfaces, the fore runners to SAN technologies and then Fibre Channel.
If we spend a minute looking at how this world has changed, we see 40 Gigabit InfiniBand network connections used to access high speed SAN storage, commonly fronted by SSDs with blocks migrating in the background to SAS and finally SATA tiers over time. These arrays are capable of sustained rates well above 30,000 I/Os per second. However, if we are to match Exadata then we need another quantum leap in performance, not an incremental increase. One answer comes in the form of PCIe flash cards that reside within the server, deliver 1 to 1.5 million IOPS, and can be mirrored using ASM failure groups. InfiniBand, at 40 Gigabit, offers five times the capacity of 8 Gigabit Fibre Channel, and although it was esoteric when Oracle first used it with Exadata V2, today it is readily within reach of any enterprise architecture.
Any Exadata DBA will now arc up and tell me there is a great deal more to Exadata than networks and servers, and this is absolutely true. For this article, however, I will focus on the foundations of achieving system performance and leave management and virtualization for the next article. I have no illusions that Oracle are well advanced on the next leap, but until then we can explore how we optimise our existing systems to meet the changing performance expectations our business and customers impose on us all.
The example system using this technology
To take the architectural theory into practice, I have used an AUSOUG partner's technology, HP, and assembled an Oracle certified system configuration using DL380 servers, InfiniBand fabric and switching, and Virident (or, as they are now known, HGST) flash storage. I then built a three node 12.1.0.2 RAC environment on Oracle Enterprise Linux 6.3 with the UEK kernel to see how things go, and yes, no SAN. I have done this because the combination establishes a pattern for blocks of compute and storage that can be used to build scalable RAC environments without the need for SAN storage. So we can now build modular high performance compute systems by racking and stacking servers alone, with no compromise in performance, redundancy or scalability. For AUSOUG members there is also a chance to test these systems and see just how they work through a series of webinars and workshops, including an option to carry out a POC of your own to validate the systems and their performance.
The test system comprises:
3 x DL380 servers with 64 gigabytes of memory and 4 x 8-core Intel CPUs
3 x HGST (Virident) 1 terabyte PCIe flash cards
3 x Dual InfiniBand network cards
2 x InfiniBand network switches
Oracle Enterprise Linux 6.3
Oracle 12.1.0.2 Enterprise Edition Database and Clusterware
Diagram 1 illustrates our system including its network topology.
Diagram 1
By using InfiniBand we are able to assemble mirrored storage across all three nodes with lower latency and better performance than we can achieve from Fibre Channel connectivity to a shared storage array such as a SAN. We also have the option to use Oracle's network virtualization (previously Xsigo) to establish link aggregation across multiple cards and traffic separation with quality of service standards for key traffic. The layout employed uses ASM failure groups, configured when establishing the initial disk groups by setting the redundancy to High, providing three-way mirroring of data across the three nodes. The PCIe flash cards ship with Linux RPMs that enable these block devices to be exposed to all nodes in the cluster across the network. They therefore appear as shared disks on all nodes and are treated like LUNs, with the capacity to name them with meaningful aliases that identify the node, card and logical unit. These devices are then enabled using udev rules and grouped into ASM disk groups. Diagram 2 illustrates this at a logical level, with data striped across all three nodes in our cluster.
Diagram 2, Logical presentation of LUNs across all nodes as shared disk storage devices
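To make the relationship between the vShare devices and the ASM layout concrete, the sketch below shows how a high redundancy disk group might be created from one device on each node, with each node acting as its own failure group. This is a minimal illustration only, run as the Grid Infrastructure owner; the disk group name, device aliases and compatibility attributes are my assumptions based on the naming convention used later in this article, not the exact configuration of the test system.
$ sqlplus / as sysasm <<'EOF'
-- One failure group per node so ASM mirrors every extent across all three servers
CREATE DISKGROUP FLASHDATA HIGH REDUNDANCY
  FAILGROUP RAC1 DISK '/dev/vshare-rac1-a' NAME FLASH_RAC1_A
  FAILGROUP RAC2 DISK '/dev/vshare-rac2-a' NAME FLASH_RAC2_A
  FAILGROUP RAC3 DISK '/dev/vshare-rac3-a' NAME FLASH_RAC3_A
  ATTRIBUTE 'compatible.asm'='12.1', 'compatible.rdbms'='12.1';
EOF
Because the redundancy is High, the loss of any single node, and therefore its flash card, still leaves two mirrored copies of every extent available to the surviving instances.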
To obtain maximum performance, dynamic CPU frequency scaling should be disabled. This is especially important when measuring I/O latencies. To disable dynamic CPU frequency scaling, run the following commands:
# service cpuspeed stop
# chkconfig cpuspeed off
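On the OEL 6 image used here, the effect can be confirmed through the kernel's cpufreq interface; this check is an assumption about the standard sysfs layout rather than part of the Virident or Oracle procedure, and the path may not exist at all if no frequency scaling driver is loaded.
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
A value of performance, or the absence of the cpufreq directory altogether, indicates the cores are not being throttled; ondemand or powersave means frequency scaling is still active and I/O latency measurements will be noisy.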
One of the reasons I selected OEL was that the Linux UEKR3 kernel has the OFED 2.0 driver embedded and does not require a separate OFED driver installation. However, installation of the InfiniBand support packages is still required.
To install and enable InfiniBand software on each node:
1. Run the following commands to install InfiniBand software:
# yum install infiniband-diags.x86_64
# yum install libibverbs-utils
# yum install rdma
# yum install opensm
2. Add the following line to the /etc/rdma/rdma.conf file:
MLX4_LOAD=yes
3. Enable and start rdma service:
# chkconfig rdma on
# service rdma start
Configuring the Subnet Manager for Two InfiniBand Links
Each InfiniBand link must be a part of a subnet. It is necessary to configure, enable and start the opensmd (InfiniBand subnet manager) service on each server if:
A back-to-back (no switch) InfiniBand connection is used
Unmanaged (no embedded subnet manager) InfiniBand switches are used; this is not recommended for production deployments with two or more InfiniBand ports per server.
If each server has only one InfiniBand link then it is not necessary to customize the opensmd configuration.
If each server has two InfiniBand links then each server must be configured to manage one subnet assigned to one of the InfiniBand ports. This can be done by adding the IB Port GUID of the corresponding port to the opensmd configuration file. For example, in a 2-node cluster with two InfiniBand ports (assuming that port 1 on node A is connected to port 1 on node B):
On node A, add the line "guid <local port 1 GUID>" to the /etc/opensm/opensm.conf file
On node B, add the line "guid <local port 2 GUID>" to the /etc/opensm/opensm.conf file
# cat /etc/opensm/opensm.conf
guid 0x0002c90300e7cd18
IB Port GUIDs are available in the ibstat -p command output, as shown below.
# ibstat -p
0x0002c90300e7cd10
0x0002c90300e7cd18
Enabling and Starting the opensmd Service
To enable and start the opensmd service
Run the following commands:
# chkconfig opensmd on
# service opensmd start
Starting IB Subnet Manager. [ OK ]
This service should be started and enabled on all servers connected either back-to-back or to an unmanaged switch.
Validating the InfiniBand Connectivity
To validate the InfiniBand connectivity
On each server use the ibstat command to verify that the InfiniBand driver is loaded and the InfiniBand links are established.
# ibstat
CA 'mlx5_0'
CA type: MT4113
Number of ports: 2
Firmware version: 10.10.1000
Hardware version: 0
Node GUID: 0x0002c90300e7cd10
System image GUID: 0x0002c90300e7cd10
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 4
LMC: 0
SM lid: 1
Capability mask: 0x06514848
Port GUID: 0x0002c90300e7cd10
Link layer: InfiniBand
Port 2:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 2
LMC: 0
SM lid: 2
Capability mask: 0x0651484a
Port GUID: 0x0002c90300e7cd18
Link layer: InfiniBand
If the Physical state of any link is not LinkUp, check whether the InfiniBand cables are connected correctly. If the State of any link is not Active, this indicates that the link does not have a subnet configured. If you are using InfiniBand switches that have an embedded Subnet Manager, refer to the administration manual for the switch to configure subnets. If you are using back-to-back (no switch) connections or InfiniBand switches that do not have an embedded Subnet Manager, see the section above on configuring the Subnet Manager.
Do not stop the InfiniBand drivers or the openibd service while the FlashMAX Connect drivers are loaded (that is, while the vgcd service is running). Doing so can cause a kernel panic.
Validating Accessibility of All Cluster Nodes via InfiniBand
To verify that all cluster nodes are accessible via InfiniBand
Run the ibhosts command as shown below.
# ibhosts
Ca : 0x0002c9030036da30 ports 1 "server02 HCA-1"
Ca : 0x0002c9030036d970 ports 1 "server01 HCA-1"
Blacklisting Devices in a Multipath Driver Configuration
If the multipath driver is installed, it must be configured with all Virident devices in its blacklist. Otherwise the multipath driver will lock the Virident devices and prevent normal operation. The multipath driver is typically used with Fibre Channel cards, but may be present in other cases too.
To add devices to the multipath driver blacklist
Add the following to the "/etc/multipath.conf" file (assuming there are no other devices already in the blacklist):
blacklist {
devnode "^vgc*"
devnode "^vshare*"
}
The above format assumes that vShare devices will be configured with the recommended “vshare” prefix in their names.
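If multipathd was already running when the blacklist entries were added, the daemon only honours them after it re-reads its configuration. A minimal sketch, assuming the stock OEL 6 service scripts, is to restart the service and then confirm that no vShare devices are being claimed:
# service multipathd restart
# multipath -ll | grep -i vshare
The grep should return nothing once the blacklist is in effect.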
Verifying the Installation of FlashMAX II Devices
Installation of the FlashMAX II devices must be verified when the system is powered on for the first time after installation.
To verify the device installation
Run the lspci command as shown below. This displays a listing for every Virident device installed. It verifies that the device is detected and initialized properly by the PCI subsystem.
# lspci -d 1a78:
05:00.0 FLASH memory: Virident Systems Inc. Device 0040 (rev 01)
Installing FlashMAX Connect Drivers and Utilities
To install the drivers and utilities
1. Download the RPMs listed below from the Virident website. It is necessary to select the RPMs corresponding to your distribution and kernel version.
Base/vStore and vCache driver RPM:
kmod-vgc-drivers-<distribution or kernel version>-1.2.FC.<driver version>-<driver build>.x86_64.rpm
Utilities RPM: vgc-utils-<distribution name and version>-1.2.FC.<driver version>-<driver build>.x86_64.rpm
RDMA transport driver RPM (for vShare and vHA only): vgc_rdma-<distribution or kernel version>-1.2.FC.<driver version>-<driver build>.x86_64.rpm
RDMA Utilities RPM (for vShare and vHA only):
vgc-rdma-utils-<distribution name and version>-1.2.FC.<driver version>-<driver build>.x86_64.rpm
2. Confirm the running kernel version with the uname -a command.
# uname -a
Linux hostname 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 x86_64 x86_64 GNU/Linux
3. Install the RPMs using the standard rpm command as shown below.
# rpm -ivh kmod-vgc-redhat6.1+-1.2.FC-65244.V5A.x86_64.rpm
Preparing... ######################################### [100%]
1:kmod-vgc-redhat6.1+ ######################################### [100%]
# rpm -ivh vgc_rdma-2.6.32-220.el6.x86_64-1.2.FC-65244.V5A.x86_64.rpm
Preparing... ######################################### [100%]
1:vgc_rdma-2.6.32-220.el6######################################### [100%]
# rpm -ivh vgc-utils-1.2.FC-65244.V5A.x86_64.rpm
Preparing... ######################################### [100%]
1:vgc-utils ######################################### [100%]
# rpm -ivh vgc-rdma-utils-1.2.FC-65244.V5A.x86_64.rpm
Preparing... ######################################### [100%]
1:vgc-rdma-utils ######################################### [100%]
Starting the Driver
After installing the drivers and utilities RPMs, the driver will load automatically on every system boot.
To start the driver without rebooting the system
1. Run the service vgcd start command as shown below.
# service vgcd start
Loading kernel modules... [ OK ]
Rescanning SW RAID volumes... [ OK ]
Rescanning LVM volumes... [ OK ]
Enabling swap devices... [ OK ]
Rescanning mount points... [ OK ]
2. After the driver starts successfully, run the vgc-monitor command to confirm that the status is Good.
# vgc-monitor
vgc-monitor: FlashMAX Connect Software Suite 1.2(65244.V5A)
Driver Uptime: 13:57
Card_Name Num_Partitions Card_Type Status
/dev/vgca 1 VIR-M2-LP-550-1A Good
Partition Usable_Capacity RAID vCache vHA vShare
/dev/vgca0 555 GB enabled disabled disabled disabled
Formatting and Enabling vShare
Before configuring vShare targets, it is necessary to enable vShare on all FlashMAX II devices. This requires formatting the physical partition on each FlashMAX II device using the vgc-config utility.
The formatting mode can be changed while formatting the physical partition. The modes available are:
Max Capacity: This is the default mode. It makes the full advertised capacity of the card available for user data.
Max Performance: This mode is useful for write-intensive applications. It provides twice the sustained random write performance of "Max Capacity" while reducing the available user capacity of the device by 15%. Only workloads that generate a significant amount of random write I/O will benefit from using the "Max Performance" mode.
Read performance and sequential write performance are the same in both modes.
Oracle ASM version 11.2 has a 2 TB disk size limit. When using FlashMAX II cards of 2.2 TB capacity there are three possible ways to configure them:
Use Max Performance mode, which reduces the user capacity of the card to 1.85 TB.
Configure a size of 1999 GB or smaller in the vShare auto-configuration file. See the following section.
Split each card into two 1.1 TB physical partitions. Contact Virident support for help if this is your preferred option. Do not use the fdisk or parted utilities.
This limitation was removed in Oracle ASM 12c.
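Whichever option is chosen, it is worth confirming the sizes ASM actually reports for the candidate disks before creating any disk groups. The sketch below uses the standard V$ASM_DISK dictionary view from the ASM instance; it is a convenience check of my own rather than part of the Virident procedure.
$ sqlplus / as sysasm <<'EOF'
-- Candidate and member disks as seen by ASM, with their sizes in GB
SELECT path, header_status, ROUND(os_mb/1024) AS os_gb, ROUND(total_mb/1024) AS total_gb
FROM   v$asm_disk
ORDER  BY path;
EOF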
Configuring a vShare Cluster Using a Configuration File
vShare devices can be configured automatically using the vgc-vshare-auto-config utility. This utility offers a series of commands that can be used to configure, monitor and maintain vShare target vSpaces and initiators in a cluster. It reduces the manual effort of configuring servers/hosts individually and makes vShare configuration easier and faster.
Automatic configuration requires password-protected, ssh-based remote root access to the hosts in the cluster. The utility can be used with:
SSH setups where the passwords are all the same, and setups where different passwords are used on different servers.
SSH setups that require shared, password-protected keys for authentication.
The passwords, however, are neither stored persistently nor logged during the configuration.
Configuring vShare automatically can be done by creating a configuration file and then using it as an input to configure the target vSpaces and initiators. The configuration file contains information like cluster name, vShare name, hosts, backing device path, size of the vShare vSpaces, and IB port GUIDs.
To configure vShare automatically with a configuration file
1. Run the ibstat -p command on each server node to get the IB port GUIDs. This step is optional and is required only if you want to enable multipathing with two InfiniBand ports in each server.
# ibstat -p
0x0002c90300e7cd10
0x0002c90300e7cd18
2. Create a configuration file with any name (e.g. example.conf) on any node of the cluster.
[cluster]
name: myrac
[vShare:vshare-rac1-a]
host: rac1
backing-dev: /dev/vgca0
size: 300
initiators: rac2 rac3
[vShare:vshare-rac1-b]
host: rac1
backing-dev: /dev/vgcb0
size: 300
initiators: rac2 rac3
[vShare:vshare-rac2-a]
host: rac2
backing-dev: /dev/vgca0
size: 300
initiators: rac1 rac3
[vShare:vshare-rac2-b]
host: rac2
backing-dev: /dev/vgcb0
size: 300
initiators: rac1 rac3
[vShare:vshare-rac3-a]
host: rac3
backing-dev: /dev/vgca0
size: 300
initiators: rac1 rac2
[vShare:vshare-rac3-b]
host: rac3
backing-dev: /dev/vgcb0
size: 300
initiators: rac1 rac2
[ib:path1]
rac1: 0x002590ffff4813fd
rac2: 0x002590ffff481401
rac3: 0x002590ffff481267
[ib:path2]
rac1: 0x002590ffff4823eb
rac2: 0x002590ffff482413
rac3: 0x002590ffff4822d4
This configuration file creates one vShare target on each of the FlashMAX II cards on all nodes and configures initiators for each target on all other nodes. The targets and initiators are configured so that they use the specified IB ports to communicate with each other.
The first section is [cluster], with the cluster name. The name can be anything, but the first character must be alphanumeric ([a-zA-Z0-9]) or '_', and the name can contain alphanumeric characters or any of the characters '+', '=', '-' or '_'.
The [vShare:vshare-rac1-a] section specifies parameters for a vShare target named 'vshare-rac1-a'. It is recommended that the name consist of the following:
A fixed word in the beginning (e.g. “vshare”) to make it easy to filter vShare devices by their names
Name of the server node to make it easy to identify the location of the device
A letter corresponding to the FlashMAX II device, for example, ‘a’ for /dev/vgca0
This name is currently limited to 24 characters. The first character in the name must be alphanumeric ([a-zA-Z0-9]) or '_', and the name can contain alphanumeric characters or any of the characters '+', '=', '-' or '_'. Under this section:
“host” is the name of the target host/server.
“backing-dev” is the name of the existing FlashMAX II backing device, /dev/vgc[a-z][0-1] on the target server.
“size” is the size of the target vSpace in GB.
“initiators” is the list of one or more initiator host(s)/server(s) for this target, separated by a space. Typically the initiators list includes all nodes, except for the target node itself.
The [ib:path1] and [ib:path2] sections are optional and are used for configuring InfiniBand multipathing. Each of these sections provides a list of IB port GUIDs that are part of the same subnet (connected to the same IB switch or, in the case of a back-to-back connection, connected directly to each other). Each of the two sections must list host names for all nodes with their corresponding IB port GUIDs.
3. Run the vgc-vshare-auto-config --configure command, as shown below. This creates the target vSpaces, grants access, and creates the initiators, with multipathing enabled. The prerequisites (IB connectivity and configuration) are verified. The cluster configuration is stored on all cluster hosts as "/var/lib/vshare/*_cluster.conf". The steps involved in the configuration are displayed one by one until completion. The process ends with disconnecting the SSH sessions.
The cluster configuration is persisted on the cluster hosts for use by the other management commands.
If the command is run on a host that is not a member of the cluster, the cluster configuration file is not persisted locally. Future management commands must be run from one of the configured cluster hosts.
# vgc-vshare-auto-config --configure example.conf
Is the same root password used for all hosts [y/N]: y
Enter root password:
Step: Verify Passwords
[rac1, rac2]: Opening SSH sessions
Succeeded.
...
Step: Completion
[rac1, rac2]: Disconnecting SSH sessions
Succeeded.
Verifying the vShare Devices
After the vShare cluster is successfully configured, verify that all vShare devices are visible on all cluster nodes with the ls -l command, as shown below.
# ls -l /dev/vshare*
brw-rw----. 1 root disk 250, 32 Dec 25 22:16 /dev/vshare-rac1-a
brw-rw----. 1 root disk 250, 48 Dec 25 22:16 /dev/vshare-rac2-a
Enabling Synchronized Boot Service
In an Oracle RAC environment the vgc-vshare-wait service must be enabled immediately after configuring vShare and before disk groups are created in Oracle ASM. During system boot the vgc-vshare-wait service pauses the boot process until all vShare initiators are in a connected state; the service then exits and the Oracle ohasd service starts. Without this service enabled, ASM may find some of its disks missing when a vShare/RAC cluster is brought up and will fail to start. This might lead to unexpected startup failures if any of the cluster nodes boots with a significant delay compared to the others. The vgc-vshare-wait service is only for environments running Oracle RAC. If Oracle is not installed on the node and the ohasd service is not present, the vgc-vshare-wait service exits without waiting.
To enable the service
1. On each server node that has vShare initiators configured, run the chkconfig vgc-vshare-wait on command. This enables the service.
2. Verify that the service has been enabled, with the command chkconfig --list | grep vshare.
[@initiatorhost]# chkconfig vgc-vshare-wait on
[@initiatorhost]# chkconfig --list | grep vshare
vgc-vshare-wait 0:off 1:off 2:on 3:on 4:on 5:on 6:off
The vgc-vshare-wait utility has configuration options that the administrator can customize. They are found in the “vgc-vshare-wait.conf” file.
To customize the configuration options
1. On the target server, run the command vim /etc/sysconfig/vgc-vshare-wait.conf. This opens the configuration file in the vim editor; you can use any available text editor. The options TIMEOUT, SLEEP_INTERVAL, MSG_THROTTLE and IGNORE_OHASD_AUTOSTART are displayed with their default/recommended values, along with comments explaining the accepted values.
# Copyright (C) 2013 Virident Systems, Inc.
# TIMEOUT < 0 Wait until all devices are connected.
# TIMEOUT = 0 Do nothing (no check) and exit. disables the script.
# TIMEOUT > 0 Check for vShare devices for this number of seconds.
# TIMEOUT > 0 always checks at least once.
TIMEOUT=-1
# SLEEP_INTERVAL the number of seconds to wait between checking vShare status.
# SLEEP_INTERVAL will be adjusted to be at least 0 and at least TIMEOUT seconds.
SLEEP_INTERVAL=5
# MSG_THROTTLE controls how often this script prints runtime details of the disconnected vShares.
# Note MSG_THROTTLE < SLEEP_INTERVAL implies MSG_THROTTLE = SLEEP_INTERVAL.
# Status messages will always be printed when TIMEOUT expires or all vShares are connected.
MSG_THROTTLE=5
# IGNORE_OHASD_AUTOSTART controls whether this script will change the current ohasd autostart value to disable if vShares aren't connected at timeout.
# IGNORE_OHASD_AUTOSTART = 0 Allow changing.
# IGNORE_OHASD_AUTOSTART = 1 Don't change anything.
IGNORE_OHASD_AUTOSTART=0
The TIMEOUT value specifies the amount of time, in seconds, that the vgc-vshare-wait service will wait for vShare initiators to be connected. By default the vgc-vshare-wait service waits indefinitely until all vShare initiators are in a connected state. This value can be set as follows:
A TIMEOUT value of ‘-1’ causes the service to wait indefinitely until all devices are connected. This is the default value.
A TIMEOUT value of '0' disables the script and no checking of vShare status is performed.
A TIMEOUT value greater than '0' will check the connection of the vShare devices for that number of seconds.
The SLEEP_INTERVAL value is the number of seconds that the service waits between checks of the vShare connection status. By default the vgc-vshare-wait service checks the vShare connection status once every 5 seconds.
While waiting for vShare devices to connect, the vgc-vshare-wait service prints periodic status messages. The MSG_THROTTLE value controls how often the service prints runtime details of the disconnected vShares.
The IGNORE_OHASD_AUTOSTART value specifies whether or not vgc-vshare-wait should attempt to change the automatic startup of Oracle Clusterware High Availability Services. If the TIMEOUT value is set and expires before all vShare devices are in a connected state, the vgc-vshare-wait service will change the current Oracle Clusterware High Availability Services autostart value in order to disable its startup. To prevent the ohasd autostart value being changed, set this value to 1.
2. Change the values as required.
3. Save and close the file.
To disable the service
1. On the target or initiator server, run the command chkconfig vgc-vshare-wait off. This disables the service.
2. Verify that the service has been disabled with the command chkconfig --list | grep vshare.
[@initiatorhost]# chkconfig vgc-vshare-wait off
[@initiatorhost]# chkconfig --list | grep vshare
vgc-vshare-wait 0:off 1:off 2:off 3:off 4:off 5:off 6:off
Setting Device Permissions for Oracle ASM
All devices used by Oracle ASM must have the correct permissions set. If the permissions are not set correctly, ASM will not allow those devices to be used. The permissions can be set using a udev rule or using ASMLib. Virident recommends using a udev rule, as described below.
1. Create a file named /etc/udev/rules.d/99-vgc-oracle.rules
2. Add the following line to the file and save it. Change the owner and group names as needed.
KERNEL=="vshare*", OWNER="grid", GROUP="dba", MODE=660
3. Run the following command to apply the rule.
# udevadm trigger
4. Verify that the permissions are correct.
# ls -l /dev/vshare*
brw-rw---- 1 grid dba 252, 0 Dec 4 15:03 /dev/vshareXXXX
brw-rw---- 1 grid dba 252, 16 Dec 4 14:59 /dev/vshareXXXX
Enabling Direct I/O and Asynchronous I/O
It is important for most of the storage I/O to go directly to the vShare devices, bypassing the Linux page cache. Using asynchronous I/O also provides better I/O concurrency and performance.
To enable both direct I/O and asynchronous I/O
Set the Oracle initialization parameter "FILESYSTEMIO_OPTIONS=SETALL" in one of the following ways:
By setting it in Database Configuration Assistant, in Initialization Parameters (step 9) -> All Initialization Parameters… -> Show Advanced Parameters
By adding it to the init<ORACLE_SID>.ora file (or by changing it, if the parameter already exists)
By setting it in SPFILE, as follows:
SQL> ALTER SYSTEM SET FILESYSTEMIO_OPTIONS=SETALL SCOPE=SPFILE;
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP
Measuring Storage Performance
The easiest way to assess the performance of an Oracle storage subsystem is to use Oracle's standard PL/SQL procedure CALIBRATE_IO. This procedure uses real Oracle database processes accessing the actual blocks in the database files. It can be used with any storage configuration, including Oracle ASM and its various redundancy modes. Its use depends on asynchronous I/O being enabled, which is the default with ASM.
The following query displays the current state of your environment and whether asynchronous I/O is enabled.
SELECT d.name, i.asynch_io
FROM v$datafile d, v$iostat_file i
WHERE d.file# = i.file_no
AND i.filetype_name = 'Data File'
/
/u01/oradata/CDB1/CDB1/EAF64EB7BEE6102EE045000000000001/datafile/o1_mf_martin_983n9kxz_.dbf ASYNC_ON
If the tablespaces and their matching data files are not displayed with ASYNC_ON then run the following command and restart the database.
ALTER SYSTEM SET filesystemio_options=setall SCOPE=SPFILE;
Syntax: DBMS_RESOURCE_MANAGER.CALIBRATE_IO (<DISKS>, <MAX_LATENCY>, iops, mbps, lat);
The CALIBRATE_IO procedure has two input parameters:
DISKS: This parameter affects the increment and the maximum number of outstanding I/Os used during the test. In the case of HDDs, the user has to set this parameter equal to the number of physical spindles. However, due to the much higher concurrency of FlashMAX II, we recommend setting it to 4 per card. For example, if you have 6 cards in the cluster, set the parameter to 24.
MAX_LATENCY: This parameter sets the limit for acceptable latency in milliseconds. The procedure will stop increasing the amount of outstanding I/Os when latencies become higher than this value. We recommend setting this parameter to the minimum allowed value of 10 milliseconds.
The CALIBRATE_IO procedure has three output values:
iops: This value returns the best measured number of I/Os per second achieved with latencies lower than MAX_LATENCY.
mbps: This value returns the best measured bandwidth in MB/s using large blocks. Note that the mbps value is not the iops value multiplied by the block size; a separate test using larger block sizes is used for measuring the mbps value.
lat: This value returns the actual latencies in milliseconds measured during the iops test. On vShare, the lat value is expected to be zero in most cases, as the latencies are below 1 ms.
To execute the CALIBRATE_IO procedure
Run the following PL/SQL script in SQL*Plus. It takes 10 minutes per node (30 minutes on a 3-node cluster) to complete. Make sure to modify the DISKS parameter to match the number of FlashMAX cards being used (4 per card).
SET SERVEROUTPUT ON
DECLARE
lat INTEGER;
iops INTEGER;
mbps INTEGER;
BEGIN
DBMS_RESOURCE_MANAGER.CALIBRATE_IO (24, 10, iops, mbps, lat);
DBMS_OUTPUT.PUT_LINE ('max_iops = ' || iops);
DBMS_OUTPUT.PUT_LINE ('latency = ' || lat);
DBMS_OUTPUT.PUT_LINE ('max_mbps = ' || mbps);
end;
/
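In addition to the values printed by the script, the most recent calibration results are recorded in the DBA_RSRC_IO_CALIBRATE dictionary view, which makes it easy to compare runs over time. A quick follow-up query, again a convenience of mine rather than part of the documented procedure, might look like this:
$ sqlplus / as sysdba <<'EOF'
-- Results of the last DBMS_RESOURCE_MANAGER.CALIBRATE_IO run
SELECT start_time, max_iops, max_mbps, latency, num_physical_disks
FROM   dba_rsrc_io_calibrate;
EOF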
Summary and Findings.
The system we assembled delivered a sustained 1.43 million IOPS with latency too low to measure. Compare this to systems of three years ago that typically achieved fewer than 20,000 IOPS on enterprise grade SAN technology and Fibre Channel networks. The assembly comprised off-the-shelf components from recognised and trusted suppliers like HP and Virident, in an Oracle certified combination, illustrating just what can be done to engineer cost competitive, high performance solutions. The combination is also certified to run on Oracle VM, opening the door to running Exchange and SQL Server on the same platform with startling results.
As I said earlier in the article, we are lucky enough to have this equipment for about six months. I will be running face-to-face and online sessions on this system, with comprehensive notes to allow you to reassemble the system at your leisure. I look forward to seeing you at one of our conferences in Australia and Asia.
References.
Exadata history details: http://flashdba.com/history-of-exadata/
Test system setup guidelines: Oracle RAC 11.2 with FlashMAX Connect™ vShare ver. 1.2, Deployment Guide rev 1.01
About the Author.
Martin Power is an Oracle ACE and a member of Oracle's beta test teams for Database, Virtualization and the Linux operating system. He is the current president of the Australian Oracle User Group, a regular contributor to the Oracle community's publications and a regular speaker. He works with Parish Crest, an Australian Oracle Gold Partner and professional services company, across infrastructure, database administration, architecture and middleware. He can be contacted via email at president@ausoug.org.au or martin.power@parishcrest.com.au, or by phone on +61 (0)411 248 486.
This post was published in the Australian Oracle User Group's Foresight Magazine in 2014.