Understanding SYSLOST

issinoho · Post by **issinoho** » Thu Jan 23, 2025 12:00 pm

Hardware type: HP rx2660 (1.59GHz/6.0MB)
Software type: OpenVMS IA64 V8.4

We have 2x servers clustered with a fibre HPE MSA P2000 G3 Modular Smart Array controller which provides shared external storage for the cluster.
The disk array is presented to the servers as 2 disk devices, i.e.

Code: Select all

Disk $1$DGA1501: (NODE1), device type HP P2000 G3 FC, is online, mounted, file-
    oriented device, shareable, device has multiple I/O paths, served to cluster
    via MSCP Server, error logging is enabled.
    
Disk $1$DGA1502: (CTCUP2), device type HP P2000 G3 FC, is online, mounted, file-
    oriented device, shareable, available to cluster, error logging is enabled.

DGA1501 is a data disk; DGA1502 acts primarily as a quorum disk.

The system has been operational for coming up to 20 years and recently suffered a SAN controller failure which presented initially as mass data corruption on DGA1501.

An ANA/DISK/REPAIR was run which didn't do any good but has resulted in about 3GB of data appearing in DGA1501:[SYSLOST...]

The controller has since been replaced and the firmware restored all the data without any loss or ongoing corruption.

My question is what do I do with all the files in SYSLOST? They look like copies of files that exist in their original locations, post-incident.

I have one reservation which is manifesting in batch queue entries. Let me try to explain...

I have a folder, say PROJECT$ROOT:[JOBS] which contains command files for batch jobs, most of which are self re-submitting. I would run the command file and the job would execute and submit itself back to run again in x minutes.
On the surface this is working fine, however if I examine a batch entry in full it shows the following, e.g.

Code: Select all

  Entry  Jobname         Username     Blocks  Status
  -----  -------         --------     ------  ------
    753  TEST1_CODE      MYUSER               Holding until 23-JAN-2025 16:51:25
         On idle batch queue SERVER_BATCH
         Submitted 23-JAN-2025 16:50:25.79 /KEEP /NOLOG /NOPRINT /PRIORITY=100
         File: _$1$DGA1501:[SYSLOST.PROJECT.JOBS]TEST1_CODE.COM;7

Note that the file location is the SYSLOST version of the file, and not the actual file, almost as if there is a symlink between the two ??!?

Based on the above I am keen to understand what is going on and also hesitant to delete anything from SYSLOST.
Can anyone shed any light here?

abrsvc · Post by **abrsvc** » Thu Jan 23, 2025 1:11 pm

My understanding is that the display will query the file location based upon the "open" channel for that file. This process was active when the ANAL/DISK/REPAIR was done and the file header was not in a directory file, so it was moved to [SYSLOST]. Future submissions should use the original location as defined by the batch job.

The directory [SYSLOST] is used for any file that does not have an associated directory when found by the scan. In older versions of VMS, you could delete directories just like files and they were not protected as they are today. A delete today will fail if there are files within it; not so years ago. So... If a directory was deleted by mistake, the ANAL/DISK/REPAIR would find the files and place them in the [SYSLOST] directory where you could either move them back to the proper place or "rename" the [syslost] directory to the name of the deleted one.

-Dan

issinoho · Post by **issinoho** » Thu Jan 23, 2025 1:31 pm

Thanks, Dan.

Unfortunately, submitting a fresh job from the original folder still ends up being reported as being in SYSLOST from the SHOW ENTRY command. This is my conundrum.

amuir · Post by **amuir** » Thu Jan 23, 2025 1:56 pm

Providing the output from the following may prove helpful in understanding what's happening:

Code: Select all

DIR/FILE $1$DGA1501:[SYSLOST.PROJECT.JOBS]TEST1_CODE.COM
DIR/FILE PROJECT$ROOT:[JOBS]TEST1_CODE.COM
SUBMIT/HOLD PROJECT$ROOT:[JOBS]TEST1_CODE.COM /USER=MYUSER /QUEUE=SERVER_BATCH
SHOW ENTRY/FULL '$ENTRY'

The /HOLD qualifier ensures this test job won't execute. You can simply use the following command to cancel the test job once you've collected the information:

Code: Select all

DELETE/ENTRY='$ENTRY'

issinoho · Post by **issinoho** » Thu Jan 23, 2025 2:18 pm

Code: Select all

$ dir/file _$1$DGA1501:[SYSLOST.PROJECT.JOBS]TEST1_CODE.COM;6

Directory _$1$DGA1501:[SYSLOST.PROJECT.JOBS]

TEST1_CODE.COM;6   (210,2469,0)

Total of 1 file.

Code: Select all

$ dir/file PROJECT$ROOT:[JOBS]TEST1_CODE.COM;6

Directory PROJECT$ROOT:[000000.JOBS]

TEST1_CODE.COM;6   (210,2469,0)

Total of 1 file.

Code: Select all

$ submit/hold PROJECT$ROOT:[JOBS]TEST1_CODE.COM /user=MYUSER /queue=SERVER_BATCH
Job TEST1_CODE (queue SERVER_BATCH, entry 942) holding

Code: Select all

$ sh entry/full 942
  Entry  Jobname         Username     Blocks  Status
  -----  -------         --------     ------  ------
    942  TEST1_CODE      CYGNET               Holding
         On idle batch queue SERVER_BATCH
         Submitted 23-JAN-2025 19:13:44.16 /PRIORITY=100
         File: _$1$DGA1501:[SYSLOST.PROJECT.JOBS]TEST1_CODE.COM;6

sms · Post by **sms** » Thu Jan 23, 2025 4:19 pm

Code: Select all

> $ dir/file PROJECT$ROOT:[JOBS]TEST1_CODE.COM;6

   Logical names can hide important details.  What's "PROJECT$ROOT"? 

   I know nothing, but I can imagine that some directory or other got
put into [SYSLOST], and it keeps dragging you back there.  (PROJECT.DIR?
JOBS.DIR?)

   Another ANAL /DISK (without /REPAIR?) might reveal something.

issinoho · Post by **issinoho** » Thu Jan 23, 2025 5:52 pm

Code: Select all

$ sh log PROJECT$ROOT
   "PROJECT$ROOT" [exec] = "SAN$ROOT:[PROJECT.]" [concealed] (LNM$SYSTEM_TABLE)
$ sh log san$root
   "SAN$ROOT" [exec] = "$1$DGA1501:" [concealed] (LNM$SYSTEM_TABLE)

ANA /DISK on $1$DGA1501: is giving me two types of errors as follows.

Code: Select all

%ANALDISK-W-DIRNAME, directory file [SYSLOST]TEMPREGION.DIR;2 is not named '.DIR;1'

and,

Code: Select all

%ANALDISK-W-BACKLINK, directory (45,1,0) [SYSLOST]SERVER1.DIR;1
        incorrect back link for entry TEMPREGION.DIR;1

I can't see any reference to the specific file or folder of the batch command file example above.

Is there any significance that the DIR /FILE of the original and SYSLOST versions of the file have the same FILE_ID ?

jon.pinkley · Post by **jon.pinkley** » Thu Jan 23, 2025 9:19 pm

TL;DR don't delete anything out of [SYSLOST] until you are sure there are not alias entries in other directories.

If you don't have DFU or FIND, finding the duplicates is not straightforward. DFU will allow you to find all instances of alias directory entries with the command:

$ dfu directory /alias $1$dga1501:

On system disks, aliases are expected. That's how sys$common works.

For data disks, they are not common; they won't normally exist unless someone added entries (or hard links on ODS-5).

So you should investigate any aliases that DFU finds.
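To make the idea of an alias concrete, here is a small Python sketch. It is not how DFU works internally; it only groups directory entries by file ID, on the assumption that two entries carrying the same FID are two names for one file. The first two rows are the DIR/FILE listings posted earlier in this thread; the third row and the function name `find_aliases` are made up for illustration.

```python
from collections import defaultdict

# (directory, file name, FID) as DIRECTORY/FILE_ID would list them.
entries = [
    ("$1$DGA1501:[SYSLOST.PROJECT.JOBS]", "TEST1_CODE.COM;6", (210, 2469, 0)),
    ("PROJECT$ROOT:[000000.JOBS]", "TEST1_CODE.COM;6", (210, 2469, 0)),
    # Hypothetical extra entry so there is a non-alias case as well.
    ("$1$DGA1501:[SYSLOST.PROJECT.JOBS]", "OTHER_JOB.COM;1", (211, 17, 0)),
]

def find_aliases(entries):
    """Return {fid: [full names]} for every FID cataloged more than once."""
    by_fid = defaultdict(list)
    for directory, name, fid in entries:
        by_fid[fid].append(directory + name)
    return {fid: names for fid, names in by_fid.items() if len(names) > 1}

aliases = find_aliases(entries)
for fid, names in aliases.items():
    print(fid, "->", names)
```

Deleting either name of an aliased file deletes the one underlying file, which is why the aliases need to be identified before anything is removed from [SYSLOST].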

See help set file/enter and set file/remove

It's too late now, but if you ever have a disk corruption, you should do a backup/physical of it while it is mounted /foreign /nowrite before trying to repair anything, because when you repair, you lose information about what was corrupted.

And before running analyze/disk, turn on bypass priv, otherwise it may not be able to access everything it needs to. For example, some users may set protections to deny System access, although doing so just makes it more likely that someone will see the files. And the first analyze should be an analyze/disk/norepair/lock just to see the extent of the damage. If there was a known corruption, and you don't have a physical backup, you should only mount the disk /nowrite.

Ok, dire warnings out of the way, let's proceed to what you should do to fix the problem.

First some background about how the queuing database stores files. It stores the name and the file id. When it displays the entries, it calls lib$fid_to_name to display the path (or something that behaves like lib$fid_to_name).

If you do a show entry /full, the file name will be shown in unconcealed lib$fid_to_name format, e.g. for this entry, if the back link were correct, the file would show up as

File: _$1$DGA1501:[PROJECT.JOBS]TEST1_CODE.COM;6

instead of

File: _$1$DGA1501:[SYSLOST.PROJECT.JOBS]TEST1_CODE.COM;6

as it does now, but it would never show up as

File: PROJECT$ROOT:[JOBS]TEST1_CODE.COM;6
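A toy Python model of the lookup described above may help. It assumes that the name is rebuilt purely by following back links from the file header up to the master file directory; the real lib$fid_to_name may do more than this. All FIDs except (210,2469,0) from the thread and (4,4,0) for 000000.DIR are invented, as is the `headers` layout.

```python
MFD = (4, 4, 0)  # 000000.DIR, the master file directory

# Hypothetical file headers: FID -> name and back link (FID of the parent directory).
headers = {
    MFD:            {"name": "000000.DIR",       "back_link": MFD},
    (100, 1, 0):    {"name": "SYSLOST.DIR",      "back_link": MFD},
    (101, 1, 0):    {"name": "PROJECT.DIR",      "back_link": (100, 1, 0)},
    (102, 1, 0):    {"name": "JOBS.DIR",         "back_link": (101, 1, 0)},
    (210, 2469, 0): {"name": "TEST1_CODE.COM;6", "back_link": (102, 1, 0)},
}

def fid_to_name(fid, device="_$1$DGA1501:"):
    """Rebuild a file specification by walking back links up to the MFD."""
    dirs = []
    parent = headers[fid]["back_link"]
    while parent != MFD:
        dirs.append(headers[parent]["name"][: -len(".DIR")])
        parent = headers[parent]["back_link"]
    path = ".".join(reversed(dirs)) or "000000"
    return device + "[" + path + "]" + headers[fid]["name"]

before = fid_to_name((210, 2469, 0))
print(before)  # _$1$DGA1501:[SYSLOST.PROJECT.JOBS]TEST1_CODE.COM;6

# If PROJECT.DIR's back link pointed at the MFD instead of SYSLOST:
headers[(101, 1, 0)]["back_link"] = MFD
after = fid_to_name((210, 2469, 0))
print(after)   # _$1$DGA1501:[PROJECT.JOBS]TEST1_CODE.COM;6
```

Note that the logical name used at SUBMIT time plays no part in the model: only the FID and the back links decide what is displayed.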

It seems unlikely that the original definition of PROJECT$ROOT was:
"PROJECT$ROOT" [exec] = "SAN$ROOT:[PROJECT.]" [concealed] (LNM$SYSTEM_TABLE).

I think it is more likely that it was:
"PROJECT$ROOT" [exec] = "SAN$ROOT:[TEMPREGION.PROJECT.]" [concealed] (LNM$SYSTEM_TABLE).

My guess is that the [000000]TEMPREGION.DIR got clobbered, and when you did the analyze/disk/repair the backlink of the PROJECT.DIR file got "fixed" to point to [SYSLOST]

Is there a [000000]TEMPREGION.DIR;1 file?

What dates are displayed for the [000000]*.dir files?
$ directory/size=all/date/width=(file:38,size:8)/date=(cre,mod,bac) $1$dga1501:[000000]*.dir

What is displayed if you enter the commands:

$ pipe dump/header/block=(s:1,c:0) project$root:[jobs]test1_code.com;6 | search/nowin sys$pipe "back link","file name"
$ pipe dump/header/block=(s:1,c:0) $1$dga1501:[project]jobs.dir | search/nowin sys$pipe "back link","file name"
$ pipe dump/header/block=(s:1,c:0) $1$dga1501:[syslost.project]jobs.dir | search/nowin sys$pipe "back link","file name"
$ pipe dump/header/block=(s:1,c:0) $1$dga1501:[syslost]project.dir | search/nowin sys$pipe "back link","file name"
$ pipe dump/header/block=(s:1,c:0) $1$dga1501:[000000]syslost.dir | search/nowin sys$pipe "back link","file name"

The best documentation about ODS-2 is VMS File System Internals, by Kirby McCoy, but it is long out of print and not easy to find. Here's a link I just found with a text file written by Andy Goldstein in 1985 "Files-11 On-Disk Structure Specification"
https://web-docs.gsi.de/~kraemer/COLLEC ... S/ods2.txt

Every directory file has a "back link" to a single "primary" directory that it is cataloged in. If you delete a directory (i.e. set it to /nodirectory, then set protection to delete, then delete it), all the files that were in that directory will now be lost, i.e. there is no path from the root [000000] master file directory to the files. Analyze/disk/repair scans the index file (which contains all the file headers) and verifies that the directory named in each back link is valid. If it is not, it creates a directory entry in the [syslost] directory and resets the back link in the file header to the syslost directory.
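The scan just described can be sketched in a few lines of Python. This is a simplified model, not the actual ANALYZE/DISK_STRUCTURE logic: it treats a back link as valid whenever it names an existing directory, and every FID other than (4,4,0) for 000000.DIR is hypothetical.

```python
MFD = (4, 4, 0)            # 000000.DIR
SYSLOST = (100, 1, 0)      # hypothetical FID for SYSLOST.DIR
DELETED_DIR = (101, 1, 0)  # hypothetical FID of a directory that no longer exists

# File headers as found in the index file: FID -> name and back link.
headers = {
    MFD:         {"name": "000000.DIR",  "back_link": MFD},
    SYSLOST:     {"name": "SYSLOST.DIR", "back_link": MFD},
    (200, 1, 0): {"name": "A.DAT;1",     "back_link": DELETED_DIR},
    (201, 1, 0): {"name": "B.DAT;1",     "back_link": MFD},
}

# Directory contents: directory FID -> set of FIDs cataloged in it.
entries = {MFD: {MFD, SYSLOST, (201, 1, 0)}, SYSLOST: set()}

def repair_lost_files(headers, entries, syslost):
    """Enter every file whose back link names no existing directory into SYSLOST."""
    moved = []
    for fid, hdr in headers.items():
        if hdr["back_link"] not in entries:
            entries[syslost].add(fid)     # new directory entry in [SYSLOST]
            hdr["back_link"] = syslost    # back link reset to SYSLOST
            moved.append(hdr["name"])
    return moved

moved = repair_lost_files(headers, entries, SYSLOST)
print(moved)  # ['A.DAT;1']
```

After the pass, A.DAT;1 is reachable again, but only under [SYSLOST], which is the same effect seen on the data disk in this thread.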

You can probably "fix" this by turning on bypass priv, and renaming [syslost]project.dir [000000]project.dir if the project directory really was in the [000000] directory to begin with.

But before doing anything more, it would be best to make a physical backup of the disk (while it is mounted foreign; you don't want changes being made to the disk while it is mounted read/write). Then at least you have a "bit for bit" copy of the disk as it is now.

roberbrooks · Post by **roberbrooks** » Thu Jan 23, 2025 10:40 pm

Code: Select all

$ dir/file _$1$DGA1501:[SYSLOST.PROJECT.JOBS]TEST1_CODE.COM;6

Directory _$1$DGA1501:[SYSLOST.PROJECT.JOBS]

TEST1_CODE.COM;6   (210,2469,0)

$ dir/file PROJECT$ROOT:[JOBS]TEST1_CODE.COM;6

Directory PROJECT$ROOT:[000000.JOBS]

TEST1_CODE.COM;6   (210,2469,0)

You'll note that the FID is the same.

I have no idea how that happened.

I think your disk is a bit more messed up than you realize.

I have also had corruption problems with P2000 G3's; I'll never use one again.

You realize that even the newer MSA 2040's are past end-of-life, right?

I'd get a newer array if possible.

m_detommaso · Post by **m_detommaso** » Fri Jan 24, 2025 3:38 am

roberbrooks wrote: ↑
Thu Jan 23, 2025 10:40 pm

You realize that even the newer MSA 2040's are past end-of-life, right?
I'd get a newer array if possible.

Unfortunately, all MSA (Modular Smart Array) class storage certified by VSI has been classified by the vendor as "End-of-life Sales and Service".

https://vmssoftware.com/products/supported-platforms/

Code: Select all

Supported MSA storage arrays:

    MSA 1000 Active/Active (rx1600, rx2600, rx2620, rx2660, rx3600, rx6600, rx7640, and rx8640 only)
    MSA 1500 Active/Active (rx1600, rx2600, rx2620, rx2660, rx3600, rx6600, rx7640, and rx8640 only)
    MSA 2300fc
    P2000 G3 FC
    P2000 G3 FC/iSCSI (FC Connect)
    MSA 2040 FC
    MSA 2050/2052 FC (excl. rx1600, rx2600, and rx2620)

The evolution of this storage class is the MSA 2060/2062 family, which, however, does not appear to have been certified by VSI. And this brings us back to the thread https://forum.vmssoftware.com/viewtopic.php?f=31&t=9362 (the few storage devices still certified appear to be very expensive compared to the MSA solution, which is entry-level class storage, and this forces customers to remain on current but dated architectures).

Now that standard support for VMS I64 V8.4-2L3 has been extended to 12/31/2035 and many customers are still unable to migrate to x86, it would be desirable for more modern storage models to be certified for the Integrity platform and x86 (VMDirectPath I/O functionality and future support of iSCSI).
