QMAN$MASTER.DAT on shadow set in cluster causes crashes on x86-64

OpenVMS x86 Field Test questions, reports, and feedback.

Post by racingmars » Mon Apr 17, 2023 6:25 pm

(Not sure if this belongs in the Clustering forum, X86-64 forum, or here...)

I have a three-system cluster with OpenVMS E9.2-1 for x86-64 on each node. Each node has its own system disk, and each system has a second disk which is a member of a shadow set across all three nodes.

Code: Select all

$ show dev shdt1

Device                  Device           Error   Volume          Free  Trans Mnt
 Name                   Status           Count    Label         Blocks Count Cnt
DSA0:                   Mounted              0 SHDT1          50182695     5   3
$1$DKA100:    (VMSX01)  ShadowSetMember      0 (member of DSA0:)
$2$DKA100:    (VMSX02)  ShadowSetMember      0 (member of DSA0:)
$3$DKA100:    (VMSX03)  ShadowSetMember      0 (member of DSA0:)

The cluster documentation says the queue manager's QMAN$MASTER.DAT file must be accessible at the same location from every cluster member. I created a directory for the queue manager file on the shadow set volume and defined the QMAN$MASTER logical name cluster-wide:

Code: Select all

$ CREATE/DIR SHDT1:[CLUSTER$CONFIG.Q]
$ DEFINE/SYSTEM/EXECUTIVE/CLUSTER QMAN$MASTER  SHDT1:[CLUSTER$CONFIG.Q]

Now, from one cluster member (VMSX01), I started the queue manager:

Code: Select all

$ START/QUEUE/MANAGER/NEW_VERSION

As expected, this created QMAN$MASTER.DAT in SHDT1:[CLUSTER$CONFIG.Q]:

Code: Select all

$ dir shdt1:[cluster$config.q]

Directory SHDT1:[CLUSTER$CONFIG.Q]

QMAN$MASTER.DAT;1   

Total of 1 file.

All three cluster members agree that the queue manager is running on node VMSX01. All three produce this exact same output:

Code: Select all

$ show queue/manager
Queue manager SYS$QUEUE_MANAGER, running, on VMSX01::

I ran "enable /autostart /queues" on all three cluster members.

Then I created the sys$batch queue from node VMSX01:

Code: Select all

$ init /queue /start /autostart_on=(vmsx01::,vmsx02::,vmsx03::) /batch sys$batch
$ show queue sys$batch
Batch queue SYS$BATCH, idle, on VMSX01::

The other two cluster members also show "Batch queue SYS$BATCH, idle, on VMSX01::" when I run "show queue sys$batch" on them.

So at first glance, everything appears to be working as expected. However, if I try to stop the queue manager (this also happens if I try to shut down the cluster), VMSX01 crashes.

Code: Select all

$ stop/queue/manager/cluster
$ show queue/manager
Queue manager SYS$QUEUE_MANAGER, stopping, on VMSX01::

After a few seconds, the console for VMSX01 shows:

Code: Select all

VSI Dump Kernel SYSBOOT Jan 23 2023 14:03:45

**** OpenVMS x86_64 Operating System E9.2-1   - BUGCHECK ****

** Bugcheck code = 0000019C: INCONSTATE, Inconsistent I/O data base
** Crash Time:            17-APR-2023 20:54:37.33
** Crash CPU: 00000000    Primary CPU: 00000000    Node Name: VMSX01
** Highest CPU number:    00000001
** Active CPUs:           00000000.00000003
** Current Process:       "QUEUE_MANAGER"
** Current PSB ID:        00000001
** Image Name:            $1$DKA0:[SYS0.SYSCOMMON.][SYSEXE]QMAN$QUEUE_MANAGER.EXE;1
   
** Dumping error logs to the system disk ($1$DKA0:)
** Error logs dumped to $1$DKA0:[SYS0.SYSEXE]SYS$ERRLOG.DMP
** (used 52 out of 64 available blocks)
** Dumping memory to the system disk ($1$DKA0:)

Am I doing something wrong by putting the QMAN$MASTER.DAT file on a shadow set? If so, how are you supposed to share the QMAN$MASTER.DAT file in a cluster without shared disks?

Or is this a bug?

Thanks,
Matthew


Post by bhall » Mon Apr 17, 2023 10:36 pm

You moved QMAN$MASTER.DAT to a common location, SHDT1:[CLUSTER$CONFIG.Q]. Your “START/QUEUE/MANAGER/NEW_VERSION” command created the queue definition and journal files in the default location on node VMSX01. The default location is SYS$COMMON:[SYSEXE]. To create all three files in your cluster common directory:

Code: Select all

$ START/QUEUE/MANAGER/NEW_VERSION SHDT1:[CLUSTER$CONFIG.Q]

I would also recommend that you reconsider your use of a failover execution queue with a name of SYS$BATCH. It is the default batch queue, and most users expect jobs submitted to SYS$BATCH to execute on the same node they were submitted on. I suggest creating node-specific batch queues and then defining the logical name SYS$BATCH to point to the node-specific execution queue on each node. For example:

Code: Select all

$ init/queue /start /autostart_on=(vmsx01::) /batch vmsx01_sys$batch
$ init/queue /start /autostart_on=(vmsx02::) /batch vmsx02_sys$batch
$ init/queue /start /autostart_on=(vmsx03::) /batch vmsx03_sys$batch
$ RUN SYS$SYSTEM:SYSMAN
set environment/cluster
do define/system sys$batch 'f$getsyi("nodename")'_sys$batch
do show queue/full sys$batch
exit

Be sure to add the node-specific definition of SYS$BATCH to the SYLOGICALS.COM on each node's system disk so that it is available during system startup.
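
As a rough sketch of what that SYLOGICALS.COM entry might look like (an assumption for illustration, not taken from any VSI template): it assumes the node-specific queues are named `<node>_SYS$BATCH` as in the INIT/QUEUE commands above, and the symbol name `node` is made up.

```dcl
$! Hypothetical fragment for SYLOGICALS.COM on each node's system disk.
$! Assumes the execution queues are named <node>_SYS$BATCH, as created above.
$ node = F$GETSYI("NODENAME")                  ! e.g. VMSX01 on the first node
$ DEFINE/SYSTEM SYS$BATCH 'node'_SYS$BATCH     ! point SYS$BATCH at the local queue
```

Because the node name is looked up at run time, the same two lines can be used unchanged on every node.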
Last edited by bhall on Mon Apr 17, 2023 10:42 pm, edited 1 time in total.


Post by racingmars » Mon Apr 17, 2023 11:34 pm

bhall wrote:
Mon Apr 17, 2023 10:36 pm
You moved QMAN$MASTER.DAT to a common location, SHDT1:[CLUSTER$CONFIG.Q]. Your “START/QUEUE/MANAGER/NEW_VERSION” command created the queue definition and journal files in the default location on node VMSX01. The default location is SYS$COMMON:[SYSEXE]. To create all three files in your cluster common directory:

Code: Select all

$ START/QUEUE/MANAGER/NEW_VERSION SHDT1:[CLUSTER$CONFIG.Q]

I've tried it both ways: leaving the directory off the START/QUEUE/MANAGER/NEW_VERSION command, so that only QMAN$MASTER.DAT is created in the shared directory, and including it. Either way produces the same behavior, whether or not the journal files are in the shared directory: the system later crashes with the "Inconsistent I/O data base" bugcheck.

bhall wrote:
Mon Apr 17, 2023 10:36 pm
I would also recommend that you reconsider your use of a failover execution queue with a name of SYS$BATCH. It is the default batch queue and most users expect jobs submitted to SYS$BATCH to execute on the same node that they were submitted on. I suggest creating node specific batch queues and then define the logical name SYS$BATCH to point to the node specific execution queue on each node. For example: ...

Thanks for the tip -- once things are stable, yeah, I'll try to set things up a bit more correctly!

Post by volkerhalle » Tue Apr 18, 2023 12:49 am

Matthew,

I tried a simple test with a non-clustered E9.2-1 system and the queue manager database on a shadowed non-system disk and could not reproduce your problem.

NOTE: a system crash is always a bug! Even if it were an 'operator error', that should NEVER crash the system!

The next step would be to try to have a look at the CLUE footprint of the crash. Please do the following on VMSX01:

$ TYPE CLUE$HISTORY

This should show the crash history for this node, one line per crash. Unfortunately this may not work, as there is a bug in SDA, when processing the CLUE HISTORY command on a newly written crashdump.

Please issue the following commands:

$ ANAL/CRASH SYS$SYSTEM
SDA> CLUE HISTORY/OVERRIDE
SDA> EXIT

This should create the CLUE$HISTORY.DAT file, so the above $ TYPE CLUE$HISTORY command should work now. And it should also create a CLUE$COLLECT:CLUE$VMSX01_ddmmyy_hhmm*.LIS CLUE file. This file contains the most important information about the crash footprint - readable ASCII text.
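
To find the exact name of the generated CLUE file, something along these lines should do. This is only a sketch: it assumes the CLUE$COLLECT logical name is defined on the node and uses the file-name pattern given above.

```dcl
$! List the CLUE listing files for node VMSX01 together with their creation dates.
$ DIRECTORY/DATE CLUE$COLLECT:CLUE$VMSX01_*.LIS
```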

Unfortunately, one cannot attach .LIS files in this forum, and the CLUE file is a little too big to post in a CODE box. You could send the CLUE file to me by mail (as an attachment), if you can figure out my mail address (hint: I still work at invenate.de). I maintained the CANASTA mail server (automatic crashdump analysis) at Digital/Compaq/HP long ago; the server still exists and has been enhanced to also process CLUE files from VSI OpenVMS x86-64. It will not have a solution for this crash, but I'll be able to tell you more...

If you can raise a support call to VSI, you should definitely do this. It would help, if you obtain the VSI$SUPPORT.COM procedure and run it on VMSX01 before logging the call. This procedure collects information about the system and also includes CLUE files found. Attach the output of the procedure to your call.

Volker.
Last edited by volkerhalle on Tue Apr 18, 2023 3:21 am, edited 2 times in total.


Post by racingmars » Tue Apr 18, 2023 3:56 am

volkerhalle wrote:
Tue Apr 18, 2023 12:49 am
I tried a simple test with a non-clustered E9.2-1 system and the queue manager database on a shadowed non-system disk and could not reproduce your problem.

Yeah... I suspect it has to do with multiple cluster members trying to touch the file, with something in the locking going wrong either on its own or because it's a shadow set spanning multiple cluster members. Dunno...

volkerhalle wrote:
Tue Apr 18, 2023 12:49 am
NOTE: a system crash is always a bug! Even if it were an 'operator error', that should NEVER crash the system!

Agreed! Just wanted to leave open the door that I was doing something Very Bad :-). I did, however, just set up a three-member VMS 7.3 on VAX cluster under simh that is configured the same way, and the bug doesn't reproduce there.

volkerhalle wrote:
Tue Apr 18, 2023 12:49 am
The next step would be to try to have a look at the CLUE footprint of the crash. Please do the following on VMSX01

Here's a link to the CLUE file generated per your instructions:

https://mattwilson.org/filedrop/CLUE$VM ... 41.LIS.txt
volkerhalle wrote:
Tue Apr 18, 2023 12:49 am
If you can raise a support call to VSI, you should definitely do this. It would help, if you obtain the VSI$SUPPORT.COM procedure and run it on VMSX01 before logging the call. This procedure collects information about the system and also includes CLUE files found. Attach the output of the procedure to your call.

I'm just a hobbyist so we're only supposed to post to the forum and not open support tickets... this does seem to be a legit system-crashing bug of some sort, though, so hopefully the folks at VSI monitoring the forums can pass the word along internally.

Thanks!

-Matthew

Post by volkerhalle » Tue Apr 18, 2023 4:11 am

Matthew,

this is an inconsistency detected in SYS$IPC_SERVICES. The major crash footprint from the CLUE file is this:

Code: Select all

Crash Time:        18-APR-2023 07:41:00.31
Bugcheck Type:     INCONSTATE, Inconsistent I/O data base
Node:              VMSX01  (Cluster)
CPU Type:          VMware, Inc. VMware7,1
VMS Version:       E9.2-1  
Current Process:   QUEUE_MANAGER
Current Image:     $1$DKA0:[SYS0.SYSCOMMON.][SYSEXE]QMAN$QUEUE_MANAGER.EXE;1
Failing PC:        FFFF8300.07706432    SYS$IPC_SERVICES+8007E032   
Failing PS:        00000000.00000000
Module:            SYS$IPC_SERVICES    (Link Date/Time: 23-JAN-2023 14:14:55.29)
Offset:            8007E032

This should NOT happen. IPC is clusterwide interprocess communication.

Does $ SHOW QUEUE/MANAGER/FULL show the SAME information on all 3 OpenVMS x86-64 nodes ? And there are no other nodes in this cluster than these 3 x86-64 systems ?

Volker.


Post by racingmars » Tue Apr 18, 2023 4:56 am

volkerhalle wrote:
Tue Apr 18, 2023 4:11 am
Does $ SHOW QUEUE/MANAGER/FULL show the SAME information on all 3 OpenVMS x86-64 nodes ?

Yep, if I recreate the scenario, starting with the directory pointed to by QMAN$MASTER being empty, and then running START/QUEUE/MANAGER/NEW_VERSION QMAN$MASTER, all three members show the exact same output for SHOW QUEUE/MANAGER/FULL. They all agree that the queue manager is running on member VMSX01 and that the master file and database location are the same:

Code: Select all

$ SHOW QUEUE/MANAGER/FULL
Master file:  SHDT1:[CLUSTER$CONFIG.Q]QMAN$MASTER.DAT;

Queue manager SYS$QUEUE_MANAGER, running, on VMSX01::
  /ON=(*)
  Database location:  SHDT1:[CLUSTER$CONFIG.Q]

At that point, running STOP/QUEUE/MANAGER/CLUSTER causes the crash after about 5 seconds 100% of the time.

volkerhalle wrote:
Tue Apr 18, 2023 4:11 am
And there are no other nodes in this cluster than these 3 x86-64 systems ?

That's correct, only these three systems in the cluster.

Thanks,
Matthew

Post by volkerhalle » Fri Apr 21, 2023 10:17 am

Matthew,

thanks for reporting this problem.

Answer from VSI: Problem will be fixed in V9.2-1.

Volker.


Post by racingmars » Fri Apr 21, 2023 3:38 pm

volkerhalle wrote:
Fri Apr 21, 2023 10:17 am
Answer from VSI: Problem will be fixed in V9.2-1.

Or, hopefully, V9.2-2, since V9.2-1 is the version that it's happening in. (Or do they mean an update patch for 9.2-1?)

Good to hear they've identified it and will have a fix, though.

Thanks!


tim.stegner
VSI Expert
Valued Contributor

Post by tim.stegner » Fri Apr 21, 2023 3:39 pm

correction - it's happening in "E"9.2-1. "V"9.2-1 will be coming out later.
