getting system health state


Post by willemgrooters » Tue Jan 04, 2022 10:54 am

Is there a way for a program to read the "physical health state" of an Alpha (PWS, DS10) or Itanium (RX2620) system: CPU and memory temperature, fan state, power state, etc.? I would like to monitor these values and signal "extreme" situations BEFORE they occur.
The DS10 has the RMC in snoop mode, which lets me display a status (when it responds, which is not always the case), but that can only be done via the console.



Post by jonesd » Tue Jan 04, 2022 2:29 pm

For my DS10L, I can get the CPU temperature by examining f$getsyi("TEMPERATURE_VECTOR"). The last 2 digits are a hex encoding of the temperature in Celsius. I haven't found any other systems on which this works. Since the 10L is a 1U box, keeping the thing cool is a problem.
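
As a rough illustration of the decoding described above, here is a minimal DCL sketch. It assumes that the string returned for TEMPERATURE_VECTOR ends in two hexadecimal digits holding the temperature in degrees Celsius, as reported for the DS10L; the symbol names (tv, hex, temp) are only illustrative, and other systems may return nothing useful.

```dcl
$! Sketch only: decode the CPU temperature from TEMPERATURE_VECTOR.
$! Assumption: the last two characters of the returned string are the
$! temperature in degrees Celsius, encoded as hexadecimal (DS10L).
$ tv = F$GETSYI("TEMPERATURE_VECTOR")
$ hex = F$EXTRACT(F$LENGTH(tv) - 2, 2, tv)   ! last two hex digits
$ temp = %X'hex'                             ! hex string -> integer
$ WRITE SYS$OUTPUT "CPU temperature: ''temp' C"
```

The integer in temp can then be compared against a threshold with .GE., in the same way as any other DCL integer symbol.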



Post by dirk.bogaerts » Wed Jan 05, 2022 4:38 am

This script used to work on my DS20E Alphas, but no longer works on my current Integrity servers. I'm currently using CockpitMgr, which does all the HW monitoring.

I added an extra check to the "env_check" script:

Code: Select all

---
$ activecpu_cnt = f$getsyi("ACTIVECPU_CNT")    
$ availcpu_cnt  = f$getsyi("AVAILCPU_CNT")     
----
$ gosub cpu_check    
----
$cpu_check:                                                                     
$ if availcpu_cnt  .gt. 1                                                      
$ then                                                                         
$   if activecpu_cnt .lt. availcpu_cnt                                         
$   then write sys$output                                                     -
            "CPU is BAD : avail ''availcpu_cnt' / active ''activecpu_cnt'"     
$   else write sys$output "CPUs are Good"                                      
$   endif                                                                      
$ else                                                                         
$   write sys$output "CPU is Good"                                             
$ endif                                                                        
$ return                                                                        
----



Post by willemgrooters » Wed Jan 05, 2022 11:39 am

Thanks - great script, it gives the information I needed. It would be nice if CockpitMgr were available to community members in some form.
Last edited by willemgrooters on Wed Jan 05, 2022 11:41 am, edited 1 time in total.


Post by imiller » Mon Feb 27, 2023 10:59 am

What I'm doing is using DCL and Kermit scripting to connect to the iLO, run a PS command to get the power supply status and temperature, and then check the output. Works for Alpha and I64.
Ian Miller
[ personal opinion only. usual disclaimers apply. Do not taunt happy fun ball ].



Post by marxch » Wed Aug 16, 2023 7:20 am

Hello Ian,

Is there an example of how this can be done?
That would be very helpful.

The script from Neil is not working on my Integrity. Some "Termal_vector" is not loaded.


Best Regards,

Christoph


Post by imiller » Wed Aug 16, 2023 7:53 am

The DCL is like this:

Code: Select all

$!
$! CHECK_TEMP - record temp  and report if too high
$!
$!'F$VERIFY(0)
$START:
$ SET RESTART="START"
$ ON ERROR THEN GOTO END
$!
$ warming_temp = 29     ! it's a bit warm in here
$ warning_temp = 34     ! one less than the warning temp set on the iLo
$ critical_temp = 36    ! two less than the critical temp set on the iLo
$!
$ nodename = F$GETSYI("NODENAME")
$ script = F$SEARCH("CMANAGER:GET-''nodename'-TEMP.KERMIT")
$ IF script .EQS. ""
$ THEN
$     WRITE SYS$OUTPUT "**** UNKNOWN NODE ''nodename'"
$     EXIT 4
$ ENDIF
$ temprecord = "CLOGS:''nodename'_TEMPRATURE.CSV" ! record of temp sensor values - datetime,value
$!
$ SUBMIT/QUEUE='nodename'_SYS_BATCH/USER=SYSTEM/LOG=CLOGS:CHECK_'nodename'_TEMP.LOG/AFTER="+01:00"/RESTART CMANAGER:CHECK_TEMP.COM
$ @UTILS:PRUNE CLOGS:CHECK_'nodename'_TEMP.LOG 10
$ CKERMIT :== $UTILS:CKV300-I64-VMS831H1-UCX56.EXE ! Kermit - a blast from the past :-)
$ DEFINE/USER SYS$OUTPUT T.TMP
$ CKERMIT "-B" -y 'script'
$! pick out the line "Ambient temperature : 25 C"
$ SEARCH/OUTPUT=T2.TMP T.TMP "Ambient"
$ IF $SEVERITY .NE. 1 THEN GOTO TMPERR
$ OPEN/READ tf T2.TMP
$ READ tf line
$ CLOSE tf
$ temp = F$ELEMENT(1,":",F$EDIT(line,"COLLAPSE")) - "C"
$ now = F$ELEMENT(0,".",F$TIME()) ! get current datetime without fractional seconds
$ IF F$SEARCH(temprecord) .EQS. "" THEN COPY NL: 'temprecord'
$ OPEN/APPEND tr 'temprecord'
$ WRITE tr "''now',''temp'"
$ CLOSE tr
$!
$ DEFINE/NOLOG TCPIP$SMTP_FROM "Node-Checks@xxx.COM"
$!
$ IF temp .GE. critical_temp
$ THEN
$     MAIL NL: "@CMANAGER:SYSMAN.DIS" /SUBJECT="''nodename' has reached critical temperature and may shutdown"
$     GOTO END1
$ ENDIF
$ IF temp .GE. warning_temp
$ THEN
$     MAIL NL: "@CMANAGER:SYSMAN.DIS" /SUBJECT="''nodename' has reached warning temperature"
$     GOTO END1
$ ENDIF
$ IF temp .GE. warming_temp
$ THEN
$     MAIL NL: "@CMANAGER:SYSMAN.DIS" /SUBJECT="''nodename' is getting warm"
$     GOTO END1
$ ENDIF
$END1:
$!
$! Check fan status
$! [ search result of PS command for Fan status lines then those lines for status not Normal ]
$ SEARCH T.TMP "Fan Unit"/OUTPUT=T1.TMP
$ OPEN/READ T1 T1.TMP
$L1:
$ READ/END=T1END T1 LINE
$ LINE = F$EDIT(LINE,"COMPRESS")
$ unit = F$ELEMENT(1," ", LINE)
$ IF unit .EQS. "Unit"  ! Check line is one we want
$ THEN
$     unitnum = F$ELEMENT(2," ", LINE)
$     fanstatus = F$ELEMENT(3," ", LINE)
$     IF fanstatus .NES. "Normal"
$     THEN
$         MAIL NL: "@CMANAGER:SYSMAN.DIS" /SUBJECT="''nodename' has problem with fan ''unitnum' status is ''fanstatus'"
$     ENDIF
$ ENDIF
$ GOTO L1
$T1END:
$ CLOSE T1
$!
$END:
$ DELETE/NOCONFIRM/NOLOG T.TMP;*
$ DELETE/NOCONFIRM/NOLOG T1.TMP;*
$ DELETE/NOCONFIRM/NOLOG T2.TMP;*
$ EXIT
$TMPERR:
$ MAIL NL: "@CMANAGER:SYSMAN.DIS" /SUBJECT="error getting temperature of ''nodename'"
$ IF F$SEARCH("T.TMP") .NES. "" THEN TYPE T.TMP
$ GOTO END

The Kermit script is like this:

Code: Select all

set host host.company.com
input 3 MP login:
lineout Oper
input 3 MP password:
lineout somepassword
input 3 hpiLO->
lineout CM
set input echo on
input 3 CM:hpiLO->
lineout PS
input 3 CM:hpiLO->
output \x02
input 3 hpiLO->
lineout X
exit 0
Ian Miller
[ personal opinion only. usual disclaimers apply. Do not taunt happy fun ball ].
