Our Experiences with Virtualized SUSE Linux (SLES 11 SP2) on Our Mainframe under z/VM

Overall, we are very satisfied with the virtualization capabilities of z/VM on our two machines, a z196 and a z9. Great performance, reliability, functionality, and stability all come together.

It is not always a bed of roses. There are times when we run into issues, but most of the time it just works out well. I am listing some of the challenges we faced recently to help others planning to run the penguin on the mainframe.

File Scan (fsck) Issue

Why does the file scan occur? How are we addressing it? How is the grouping being scheduled or forced to stagger the file scan? How can we permanently stop the read-only check if we need to?

 

We have our zLinux application data on a file server (NFS and Samba) with large, multi-terabyte filesystems. When we upgraded from SLES 9 to SLES 11, all of the filesystems were recreated at almost the same time. The problem was that once the default 180-day fsck interval expired, all of the filesystems would be scanned on the same boot, and a single boot of the server took 6+ hours, a definite outage. (They say fsck is not required for ext3 since it is journaled, but we still want to be safe with our data and did not want to disable it completely.)

Our options were:

1) Disable fsck and run it manually whenever we want (tedious and prone to mistakes).

2) Stagger the filesystems into groups of similar size and use tune2fs to adjust the parameters that trigger the scan, so that the groups do not all scan together. This requires scheduled reboots of the servers so that we stay in control and the filesystems never all scan at once. This is our current solution; a sketch of the tune2fs staggering follows this list.

3) We are looking into whether we can disable the scan at boot time and run fsck from a script instead, so that we can run scans in parallel and/or speed up DASD access using parallel HyperPAV, etc.

4) If we cannot do the above option, we might have to break up the filesystems into smaller, business-unit-dedicated file servers to resolve the issue permanently, at least until they outgrow that design. This is the least recommended approach, as we are not maxing out anything (memory, bandwidth, CPU) on the file server.
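
A minimal sketch of the staggering approach, assuming ext3 filesystems on DASD; the device names, mount counts, and intervals below are illustrative, not our actual groups:

    # Give each group of filesystems its own fsck trigger so the groups never expire together.
    tune2fs -c 60 -i 120d /dev/dasdb1    # group A: check after 60 mounts or 120 days
    tune2fs -c 70 -i 150d /dev/dasdc1    # group B: check after 70 mounts or 150 days
    tune2fs -c 80 -i 180d /dev/dasdd1    # group C: check after 80 mounts or 180 days

    # Verify the current settings for a filesystem:
    tune2fs -l /dev/dasdb1 | egrep 'Maximum mount count|Check interval'

    # For option 3, the boot-time check could be disabled entirely (tune2fs -c 0 -i 0)
    # and fsck run from a maintenance-window script with the filesystems unmounted,
    # e.g. several scans in parallel:
    fsck.ext3 -f /dev/dasdb1 & fsck.ext3 -f /dev/dasdc1 & wait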

 

IBM GA Driver 93 Firmware Update on the Mainframe –

What did Driver 93 actually cause? Why did it totally disable us? How did we circumvent it?

How did we manage to recover the SMT server, and then patch it so we could patch the other guests that were affected?

If the HiperSockets network is disabled, do we fail over to use the internal network?

 

The Driver 93 update required a Linux kernel newer than version 2.6.32.29-0.3.1. Most of our images had been patched three months earlier and were running kernel 2.6.32.27-0.2-default, so they did not meet the requirement (except, luckily, our production environment, which was on kernel 2.6.32.59-0.3-default). Driver 93 introduced a new feature called QIOASSIST (Queued I/O Assist) on HiperSockets, and because IBM decided to keep it enabled by default, the Linux kernels using HiperSockets panicked. Many of our guests, including infrastructure servers such as our SMT (Subscription Management Tool) server, abruptly started kernel panicking.

The solution was to repatch the failing systems to the latest kernel, but patching required SMT, and SMT itself was failing. Fortunately, QIOASSIST can be turned off at the guest level (the 'CP SET QIOASSIST OFF' command, or NOQIOASSIST in the guest's z/VM directory entry). We did this on the SMT guest first so that it was available for the other guests to use for patching, and we also had to do it on every guest we wanted to patch, because we did not want a guest to panic while it was being patched.
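
A minimal sketch of how the setting can be checked and turned off from inside a running guest, assuming the vmcp utility from s390-tools is available and the CP command forms are as described above (privilege class permitting; otherwise the same commands can be issued from a z/VM console):

    modprobe vmcp                 # load the vmcp driver if it is not already available
    vmcp QUERY QIOASSIST          # show the current queued-I/O-assist setting for this guest
    vmcp SET QIOASSIST OFF        # turn it off so the HiperSockets device can initialize safely

The change can be made persistent with NOQIOASSIST in the guest's directory entry, as mentioned above.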

The kernel panic happened during the HiperSockets initialization phase, so guests that were not using HiperSockets were safe, but we use it on almost every guest.

HiperSockets is a separate network segment, so there was no failover option.

SAMBA –

What development or business need initiated the requirement to implement Samba?

Files had to be shared between Windows and Linux. Using an NFS client on Windows was not efficient because of limited ACLs and permission-related issues; the client software also had to be licensed and did not support multiple NFS shares on the same drive letter. Samba turned out to be the better option because of its ease of use and simplicity, because no coding changes were required, and because it is free.

Procedure-wise, how was it to implement?

Implementation is very easy: install a few packages, configure the shares, set up user ID mapping, and set up passwords.
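
A minimal sketch of that procedure on SLES; the share name, path, group, and user below are placeholders, not our actual configuration:

    zypper install samba samba-client         # install the packages

    # Define a share by appending to /etc/samba/smb.conf:
    cat >> /etc/samba/smb.conf <<'EOF'
    [appdata]
        path = /srv/appdata
        valid users = @appusers
        read only = no
    EOF

    smbpasswd -a appuser1                     # set a Samba password for an existing Linux user
    /etc/init.d/smb start && /etc/init.d/nmb start
    chkconfig smb on && chkconfig nmb on      # start the services at boot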

Windows File System to Linux

How would you proceed to develop a strategy to migrate Windows file systems to Linux?

To successfully migrate Windows file servers to zLinux, Samba seems to be the way to go. Active Directory integration is required, and the filesystem and shares should support ACLs. HA is important. OSA performance and throughput matter because, with our VDI implementation, we put even PST files on the network. Backups should complete in a timely manner, and with lots of small files an efficient backup system will be required.

What steps are necessary? What are the implications of using NetApp versus just SAN-attached FCP?

Would zFCP on zLinux be more complicated than a Red Hat on Intel migration?

Steps would be:

  • Choose the disk devices: DASD or zFCP.
  • Choose a cluster filesystem, because HA is important. Novell SLES has a cluster suite option that is free for zLinux.
  • Configure Samba for HA.
  • Configure shares and permissions.
  • Configure Active Directory integration for Samba (a sketch follows this list).
  • Size the OSAs; is link aggregation a possibility?
  • Design a reliable and efficient backup infrastructure.
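
On the Active Directory point, a minimal sketch of joining Samba to AD with winbind; the EXAMPLE domain, idmap range, and admin account are placeholders:

    # Key [global] settings in /etc/samba/smb.conf for an AD member server:
    #   workgroup = EXAMPLE
    #   realm = EXAMPLE.COM
    #   security = ads
    #   idmap config * : backend = tdb
    #   idmap config * : range = 10000-99999
    #   winbind use default domain = yes

    net ads join -U Administrator      # join the domain with a privileged AD account
    /etc/init.d/winbind restart
    chkconfig winbind on
    wbinfo -u                          # list domain users to verify the join works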

One option is to use a NetApp appliance, but we currently don't have the necessary skills to manage one and are reluctant to go that way. It can be a good option if designed properly.

NetScaler / Native Linux LB –

Why were we not able to upgrade the TCP listener guest to SLES 11 outright?

Because the new versions of the IBM load balancer forced us to use the MAC forwarding method, the option for the TCP listener application to use the TCP load balancer was gone for one of our critical applications.

What problems did we see?

We could not load balance TCP, so a different load balancer had to be used, preferably inside the System z.

Why did the WebSphere Edge technology not suffice? What is being used now on SLES 9 for the LB?

The NAT forwarding method used in IBM Load Balancer version 6 worked fine for a TCP load-balancing application, but the new version only supports MAC forwarding.

How was NetScaler implemented? Be as detailed as possible. How would we accomplish an HA design with NetScaler?

NetScaler has HA built in: two NetScalers can work together to provide HA within the load balancer while load balancing the TCP application. It is a very powerful but pricey load-balancing solution for the enterprise. NetScaler has health checks for the real servers, an easy web interface to manage, support for VLANs, and many advanced features for routing traffic.

How did you develop the native Linux LB? Be as detailed as possible. How would we implement an HA design?

We tested a self-built load-balancing solution from the open source community: a combination of Pen, Pound, UCARP (the equivalent of VRRP in the open-source world), and HTTP-check and TCP-check Perl scripts from the Nagios world. It worked well in a VMware image for our TCP listener service hosted in zLinux, and we are currently working to compile the same packages on System z, since all of their sources are freely available.

UCARP is the open-source world's IP failover protocol, similar to VRRP, which Cisco claims ownership of.
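
A minimal sketch of the active node of such a pair; the addresses, port, interface, and scripts are placeholders for illustration, not our actual setup:

    # Pen listens on the service port and balances TCP connections across two real servers.
    pen -r 1414 10.1.1.11:1414 10.1.1.12:1414

    # UCARP floats the virtual service IP between the two load-balancer guests.
    # The passive partner runs the same command with a higher --advskew value.
    ucarp --interface=eth0 --srcip=10.1.1.21 --vhid=10 --pass=secret \
          --addr=10.1.1.100 \
          --upscript=/etc/vip-up.sh --downscript=/etc/vip-down.sh
    # vip-up.sh / vip-down.sh simply add or remove 10.1.1.100 on eth0
    # (e.g. "ip addr add 10.1.1.100/24 dev eth0").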

Why do these solutions work so well with the SLES 11 upgrade requirements?

Keeping the SLES 11 systems updated and running the latest kernel has really given us a lot of features and enhancements.

DR Recovery of zLinux Guests

Describe the process you use to get a successful, recoverable FlashCopy backup.

For zLinux, our FlashCopy backups were initially failing for two main reasons:

  • Cached filesystem data in Linux (data in memory not yet flushed to disk at the time the FlashCopy ran).
  • Different FlashCopy timestamps across the volumes of a zLinux guest, including members of an LVM volume group.

The solution was simple:

  • We wrote a scheduled sync script that issues the sync command in all zLinux guests just before the FlashCopy is triggered (a sketch follows this list).
  • We implemented consistency groups in the FlashCopy process to make sure all volumes of each guest get flashed at the same time.
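
A minimal sketch of that pre-FlashCopy step, assuming SSH key access from a central scheduler; the guest list is a placeholder:

    #!/bin/bash
    # Flush cached writes on every zLinux guest just before the FlashCopy is triggered.
    GUESTS="guest1 guest2 guest3"
    for g in $GUESTS; do
        ssh root@"$g" 'sync; sync' &    # run the flush on each guest in parallel
    done
    wait                                # proceed to FlashCopy only after every guest has synced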

Could we run the EMC NetWorker server and/or client on zLinux?

The EMC NetWorker server is not supported on zLinux, but the client is very well supported. If we go with zFCP, the only way to back up in our environment would be to use the EMC NetWorker client.

WebSphere Application Server v8 Cutover (WAS8) –

Describe the type of problems you and Martin had and what scripts you wrote to reduce the implementation time.

Since we were upgrading the entire zLinux cell to WAS8 together using a parallel approach, we needed to upgrade 35 servers at once, most of them WebSphere application servers. A parallel environment was built with the WAS8 binaries and the whole upgraded software stack of agents. The challenge was to switch the identity of these servers (VM user name, IP address, hostname, HiperSockets IP) and sync the data within the limited 1.5-hour window available.

The solution was to:

  • Implement REXX scripts to rename the VM users en masse.
  • Implement scripts to change the identity and all parameters of the 35 listed servers automatically.
  • Implement rsync scripts to keep the data consistent, and run them at cutover time (a sketch follows this list).
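
A minimal sketch of the cutover data sync; the host name and path are placeholders for illustration:

    # Pull the application data from the old SLES 9 guest onto its WAS8 replacement,
    # preserving ownership and permissions, and removing files deleted on the source.
    rsync -aH --delete --numeric-ids root@oldguest:/opt/appdata/ /opt/appdata/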

Everything was done perfectly on time, with zero issues.
