Friday, August 19, 2011

SCOM 2007 R2: Disconnected Agents (Event Id 21034)

One day I noticed that some of my servers (agents) were greyed out in the Computers node of the Monitoring section. What's strange is that those greyed-out computers were not even showing up in the Agent Managed node of the Administration section.

All of these greyed-out servers were logging event ID 21034 in the Operations Manager event log:

Event Type:        Warning
Event Source:        OpsMgr Connector
Event Category:        None
Event ID:        21034
Date:                8/12/2011
Time:                10:03:10 AM
User:                N/A
Computer:        SERVER-NAME
Description:
The Management Group Watch-Men has no configured parents and most monitoring tasks cannot be performed. This can happen if a management group in Active Directory does not have any server SCPs or if the agent does not have access to any server SCPs.
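If many servers are greyed out, it helps to know exactly which ones are logging this warning. Below is a minimal Python sketch, assuming you have already exported the events into simple records; the `Event` class, its field names and the function name are hypothetical and only mirror the Event Source, Event ID and Computer fields shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical record holding the fields shown in the event above.
    source: str
    event_id: int
    computer: str


def computers_with_21034(events):
    """Return the sorted, de-duplicated names of computers that logged
    OpsMgr Connector event 21034 ("no configured parents")."""
    return sorted({e.computer for e in events
                   if e.source == "OpsMgr Connector" and e.event_id == 21034})


if __name__ == "__main__":
    sample = [  # made-up sample data
        Event("OpsMgr Connector", 21034, "SERVER-A"),
        Event("OpsMgr Connector", 21034, "SERVER-A"),
        Event("OpsMgr Connector", 9999, "SERVER-B"),
        Event("Some Other Source", 21034, "SERVER-C"),
    ]
    print(computers_with_21034(sample))  # ['SERVER-A']
```

Matching on both the source and the ID keeps an unrelated event that happens to share the number out of the list.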

I usually try two things when a SCOM agent is not talking to the SCOM management server:

1. Restart the System Center Management service (HealthService) on the affected server, and watch the Operations Manager event log.

2. If step 1 doesn't help, I:

a. Stop the System Center Management (HealthService) service.

b. Open an Explorer window and go to “C:\Program Files\System Center Operations Manager 2007”. Rename (or delete) the folder named “Health Service State”.

c. Start the System Center Management service (HealthService).
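Here is a minimal Python sketch of step 2 (a to c) for a single server, meant to be run from an elevated prompt on the agent itself. It assumes the default agent path used above and uses `net stop`/`net start`, which wait for the service to finish stopping or starting. The function name and the timestamped rename are my own choices, not part of any SCOM tooling; the `run` parameter only exists so the sketch can be tried without touching a real service.

```python
import os
import shutil
import subprocess
import time

SERVICE = "HealthService"
# Default agent install folder used in this post (assumption: the agent lives there).
AGENT_DIR = r"C:\Program Files\System Center Operations Manager 2007"
STATE_DIR = "Health Service State"


def reset_agent_cache(agent_dir=AGENT_DIR, delete=False, run=subprocess.run):
    """Stop HealthService, rename (or delete) the Health Service State
    folder, then start HealthService again. Returns the new folder path
    after a rename, or None if the folder was deleted or missing."""
    run(["net", "stop", SERVICE], check=True)  # net stop waits for the stop
    state = os.path.join(agent_dir, STATE_DIR)
    renamed_to = None
    if os.path.isdir(state):
        if delete:
            shutil.rmtree(state)
        else:
            # Keep the old cache under a timestamped name instead of deleting it.
            renamed_to = state + "." + time.strftime("%Y%m%d%H%M%S")
            os.rename(state, renamed_to)
    run(["net", "start", SERVICE], check=True)
    return renamed_to
```

Calling `reset_agent_cache()` renames the folder; `reset_agent_cache(delete=True)` deletes it instead. Note that `check=True` makes the sketch stop with an error if `net stop` fails, for example because the service was already stopped.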

Well, well, well. That didn't work. So I called Microsoft support and got help reconnecting the disconnected agents.

On the SQL Server that hosts the Operations Manager database, open SQL Server Management Studio, browse to the OperationsManager database, and open a new query window against it. If you need help, ask your DBA.

SCRIPT A: Execute the following query to list all disconnected agents in the database.

declare @DiscoverySourceId uniqueidentifier;
set @DiscoverySourceId = dbo.fn_DiscoverySourceId_User();
SELECT TME.[TypedManagedEntityid], HS.PrincipalName
FROM MTV_HealthService HS
INNER JOIN dbo.[BaseManagedEntity] BHS with(nolock)
ON BHS.[BaseManagedEntityId] = HS.[BaseManagedEntityId]
-- get host managed computer instances
INNER JOIN dbo.[TypedManagedEntity] TME with(nolock)
ON TME.[BaseManagedEntityId] = BHS.[TopLevelHostEntityId]
AND TME.[IsDeleted] = 0
INNER JOIN dbo.[DerivedManagedTypes] DMT with(nolock)
ON DMT.[DerivedTypeId] = TME.[ManagedTypeId]
INNER JOIN dbo.[ManagedType] BT with(nolock)
ON DMT.[BaseTypeId] = BT.[ManagedTypeId]
AND BT.[TypeName] = N'Microsoft.Windows.Computer'
-- only with missing primary
LEFT OUTER JOIN dbo.Relationship HSC with(nolock)
ON HSC.[SourceEntityId] = HS.[BaseManagedEntityId]
AND HSC.[RelationshipTypeId] = dbo.fn_RelationshipTypeId_HealthServiceCommunication()
AND HSC.[IsDeleted] = 0
INNER JOIN DiscoverySourceToTypedManagedEntity DSTME with(nolock)
ON DSTME.[TypedManagedEntityId] = TME.[TypedManagedEntityId]
AND DSTME.[DiscoverySourceId] = @DiscoverySourceId
WHERE HS.[IsAgent] = 1
AND HSC.[RelationshipId] IS NULL

If you see any results, copy and paste them into Notepad so you have a record of all the disconnected agents.
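If the list is long, a small script can turn the pasted rows into clean (ID, name) pairs for the later steps. A minimal Python sketch, assuming the rows were pasted as tab-separated text (which is what copying from the SSMS results grid usually gives you); the function name is hypothetical.

```python
import uuid


def parse_disconnected(text):
    """Turn pasted SCRIPT A rows ("TypedManagedEntityId<TAB>PrincipalName")
    into (uuid.UUID, principal_name) tuples. Blank lines and rows whose first
    column is not a GUID (such as a copied header row) are skipped."""
    agents = []
    for line in text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) < 2:
            continue
        try:
            entity_id = uuid.UUID(parts[0].strip())
        except ValueError:
            continue  # not a GUID, e.g. the column header
        agents.append((entity_id, parts[1].strip()))
    return agents
```

Parsing each ID with `uuid.UUID` also catches a GUID that was truncated while copying, before it gets anywhere near the database.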

Now we need to delete all the disconnected agents. Before you do, make a backup of the OperationsManager database.

Execute this script to delete all disconnected agents. Note: you are on your own here. I am NOT responsible for your actions.

declare @TypedManagedEntityId uniqueidentifier;
declare @DiscoverySourceId uniqueidentifier;
declare @LastErr int;
declare @TimeGenerated datetime;

set @TimeGenerated = GETUTCDATE();
set @DiscoverySourceId = dbo.fn_DiscoverySourceId_User();

-- same selection as SCRIPT A: agents with no HealthServiceCommunication relationship
DECLARE EntitiesToBeRemovedCursor CURSOR LOCAL FORWARD_ONLY READ_ONLY FOR
SELECT TME.[TypedManagedEntityId]
FROM MTV_HealthService HS
INNER JOIN dbo.[BaseManagedEntity] BHS
ON BHS.[BaseManagedEntityId] = HS.[BaseManagedEntityId]
-- get host managed computer instances
INNER JOIN dbo.[TypedManagedEntity] TME
ON TME.[BaseManagedEntityId] = BHS.[TopLevelHostEntityId]
AND TME.[IsDeleted] = 0
INNER JOIN dbo.[DerivedManagedTypes] DMT
ON DMT.[DerivedTypeId] = TME.[ManagedTypeId]
INNER JOIN dbo.[ManagedType] BT
ON DMT.[BaseTypeId] = BT.[ManagedTypeId]
AND BT.[TypeName] = N'Microsoft.Windows.Computer'
-- only with missing primary
LEFT OUTER JOIN dbo.Relationship HSC
ON HSC.[SourceEntityId] = HS.[BaseManagedEntityId]
AND HSC.[RelationshipTypeId] = dbo.fn_RelationshipTypeId_HealthServiceCommunication()
AND HSC.[IsDeleted] = 0
INNER JOIN DiscoverySourceToTypedManagedEntity DSTME
ON DSTME.[TypedManagedEntityId] = TME.[TypedManagedEntityId]
AND DSTME.[DiscoverySourceId] = @DiscoverySourceId
WHERE HS.[IsAgent] = 1
AND HSC.[RelationshipId] IS NULL;

OPEN EntitiesToBeRemovedCursor;
FETCH NEXT FROM EntitiesToBeRemovedCursor INTO @TypedManagedEntityId;

WHILE @@FETCH_STATUS = 0
BEGIN
    BEGIN TRAN;

    -- Delete entity
    EXEC @LastErr = [p_RemoveEntityFromDiscoverySourceScope] @TypedManagedEntityId, @DiscoverySourceId, @TimeGenerated;
    IF @LastErr <> 0 GOTO Err;

    COMMIT TRAN;

    -- Get the next TypedManagedEntity to delete.
    FETCH NEXT FROM EntitiesToBeRemovedCursor INTO @TypedManagedEntityId;
END

CLOSE EntitiesToBeRemovedCursor;
DEALLOCATE EntitiesToBeRemovedCursor;
GOTO Done;

Err:
-- a removal failed: undo the current entity's transaction and stop
ROLLBACK TRAN;
GOTO Done;

Done:
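It is worth being clear about what this script does when a removal fails. Each entity is removed in its own transaction, so entities processed before the failure stay removed, the failing one is rolled back, and the rest are not touched. The Python sketch below only mirrors that control flow for illustration; `remove`, `begin`, `commit` and `rollback` are hypothetical stand-ins for the stored procedure call and the transaction statements.

```python
def remove_each(entity_ids, remove, begin, commit, rollback):
    """Mirror the cursor loop above: one transaction per entity,
    stopping at the first non-zero return code.
    Returns (removed_ids, failed_id), where failed_id is None on success."""
    removed = []
    for entity_id in entity_ids:
        begin()                      # BEGIN TRAN
        if remove(entity_id) != 0:   # EXEC @LastErr = ...; IF @LastErr <> 0
            rollback()               # Err: ROLLBACK TRAN
            return removed, entity_id
        commit()                     # COMMIT TRAN
        removed.append(entity_id)
    return removed, None


if __name__ == "__main__":
    log = []
    result = remove_each(
        ["A", "B", "C"],
        remove=lambda e: 1 if e == "B" else 0,  # pretend "B" fails
        begin=lambda: log.append("begin"),
        commit=lambda: log.append("commit"),
        rollback=lambda: log.append("rollback"),
    )
    print(result)  # (['A'], 'B')
    print(log)     # ['begin', 'commit', 'begin', 'rollback']
```

This is why running SCRIPT A again afterwards matters: after a failure, it shows exactly which agents are still left.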

 

Execute SCRIPT A again to see whether any disconnected agents are still listed. Hopefully none are. If some are, execute the following script: replace the GUID assigned to @EntityId with the TypedManagedEntityId from the SCRIPT A results, and run the script once for each remaining disconnected server with its corresponding ID.

DECLARE @EntityId uniqueidentifier;
DECLARE @TimeGenerated datetime;

-- change the GUID below to the TypedManagedEntityId of the invalid entity
SET @EntityId = '3B2F8221-9F7B-5FFD-B80D-DEEAFFB6E342';
SET @TimeGenerated = getutcdate();

BEGIN TRANSACTION
EXEC dbo.p_TypedManagedEntityDelete @EntityId, @TimeGenerated;
COMMIT TRANSACTION
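When several servers are left over, editing the GUID by hand for each run is error-prone. A minimal Python sketch that writes one batch per entity, reusing the statements above. The `GO` separator is there because SQL Server rejects a second `DECLARE @EntityId` in the same batch. Each ID is parsed with `uuid.UUID` first, so a typo fails in Python rather than in the database. The function name is hypothetical; review the generated script before running it.

```python
import uuid

# The delete statements from the script above, one batch per entity.
TEMPLATE = """DECLARE @EntityId uniqueidentifier;
DECLARE @TimeGenerated datetime;
SET @EntityId = '{entity_id}';
SET @TimeGenerated = getutcdate();
BEGIN TRANSACTION
EXEC dbo.p_TypedManagedEntityDelete @EntityId, @TimeGenerated;
COMMIT TRANSACTION
GO
"""


def build_delete_script(entity_ids):
    """Return one T-SQL batch per entity ID. Raises ValueError if any ID
    is not a well-formed GUID, so only validated GUIDs reach the SQL text."""
    batches = []
    for raw in entity_ids:
        entity_id = str(uuid.UUID(raw.strip())).upper()
        batches.append(TEMPLATE.format(entity_id=entity_id))
    return "".join(batches)
```

For example, `print(build_delete_script(["3B2F8221-9F7B-5FFD-B80D-DEEAFFB6E342"]))` prints a script you can paste into the SSMS query window.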

Execute SCRIPT A again to confirm that the server is no longer listed as disconnected.

Check the SCOM console to confirm that these servers have disappeared from the Computers node in the Monitoring section.

After fixing the database, you have to do the following on every server that was originally disconnected:

a. Stop the System Center Management (HealthService) service.

b. Open an Explorer window and go to “C:\Program Files\System Center Operations Manager 2007”. Rename (or delete) the folder named “Health Service State”.

c. Start the System Center Management service (HealthService).

I wrote a little VBScript to do the above on a whole list of servers. Copy and paste the following script into Notepad and save it as “FixSCOMAgent.vbs”. In the same folder, create a new text file called Servers.txt and type the names of the disconnected servers into it, one server name per line, e.g.:

servername1
servername2
servername3

 

' #######              #####   #####  ####### #     #
' #       # #    #    #     # #     # #     # ##   ##
' #       #  #  #     #       #       #     # # # # #
' #####   #   ##       #####  #       #     # #  #  #
' #       #   ##            # #       #     # #     #
' #       #  #  #     #     # #     # #     # #     #
' #       # #    #     #####   #####  ####### #     #
'                                                    
'
'    #                              
'   # #    ####  ###### #    # #####
'  #   #  #    # #      ##   #   #  
' #     # #      #####  # #  #   #  
' ####### #  ### #      #  # #   #  
' #     # #    # #      #   ##   #  
' #     #  ####  ###### #    #   #  
'
' Script Name: FixSCOMAgent.vbs                                   
' Description: This script will stop the SCOM agent service, Delete
' Health Service State folder and start the SCOM agent service on
' all servers listed in servers.txt.
'
' Written by Anand Venkatachalpathy
'

On Error Resume Next

'how long to wait (in milliseconds) after initiating the service stop
intSleep = 18000

'agent Health Service State folder to be deleted (relative to the server's c$ share)
HSFolder = "\c$\Program Files\System Center Operations Manager 2007\Health Service State"

' creating a File system object
Set objFSO = CreateObject("Scripting.FileSystemObject")

' Servers.txt must be located in the same folder as this script
strListFile = objFSO.BuildPath(objFSO.GetParentFolderName(WScript.ScriptFullName), "servers.txt")
If Not objFSO.FileExists(strListFile) Then
  WScript.Echo "Cannot find " & strListFile
  WScript.Quit 1
End If
Set f = objFSO.OpenTextFile(strListFile, 1)

' Read the file one line at a time
Do While f.AtEndOfStream <> True
  strComputer = Trim(f.ReadLine)
  'skip blank lines
  If strComputer <> "" Then
    WScript.Echo "Fixing " & strComputer & " ..."
    'call the sub routine to fix the server
    RestartHealthService strComputer
  End If
Loop
f.Close

'-*-*-*-*-* End of Script -*-*-*-*-*-*


'Sub Routine: RestartHealthService
'Parameter: server name
'Description: This sub routine stops the SCOM agent service,
'deletes the Health Service State folder and starts the
'agent service.

Sub RestartHealthService(strComputer)

  'Service Name
  strService = " 'HealthService' "

  'Get WMI object on the given server
  Set objWMIService = GetObject("winmgmts:" _
  & "{impersonationLevel=impersonate}!\\" _
  & strComputer & "\root\cimv2")

  'Get the services WMI object
  Set colListOfServices = objWMIService.ExecQuery _
  ("Select * from Win32_Service Where Name =" _
  & strService & " ")

  'Folder to delete
  strSource = "\\" & strComputer & HSFolder

  For Each objService in colListOfServices
    WScript.Echo vbTab & "Stopping SCOM Agent Service"
    objService.StopService()
    WScript.Sleep intSleep

    WScript.Echo vbTab & "Deleting the folder: " & strSource
    If objFSO.FolderExists(strSource) Then
      'True = also delete read-only files
      objFSO.DeleteFolder strSource, True
    End If

    WScript.Echo vbTab & "Starting SCOM Agent Service"
    objService.StartService()
  Next
End Sub
'-*-*-*- End of Sub Routine -*-*-*-*-*-
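The script waits a fixed 18 seconds (`intSleep = 18000` milliseconds) after `StopService()`, because that WMI call returns once the stop request has been accepted, not when the service has actually stopped. If 18 seconds is too short or too long for your servers, polling is an alternative. A minimal Python sketch, assuming the usual `sc query` output layout (a line such as `STATE : 1  STOPPED`); the helper names are hypothetical.

```python
import re
import subprocess
import time

# Matches e.g. "        STATE              : 1  STOPPED" in `sc query` output.
STATE_RE = re.compile(r"^\s*STATE\s*:\s*\d+\s+(\w+)", re.MULTILINE)


def service_state(sc_output):
    """Extract the state name (e.g. 'STOPPED') from `sc query` output, or None."""
    match = STATE_RE.search(sc_output)
    return match.group(1) if match else None


def sc_query(server, service="HealthService"):
    """Query a service on a remote server with sc and return its text output."""
    result = subprocess.run(["sc", rf"\\{server}", "query", service],
                            capture_output=True, text=True)
    return result.stdout


def wait_for_state(query, target="STOPPED", timeout=120.0, interval=2.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll query() until it reports `target`; return False on timeout."""
    deadline = clock() + timeout
    while True:
        if service_state(query()) == target:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

For example, `wait_for_state(lambda: sc_query("servername1"))` waits up to two minutes for HealthService on servername1 to report STOPPED, and returns as soon as it does instead of always sleeping for a fixed time.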

 

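Now open a command prompt as administrator, go to the folder where you saved the script, and run it with CScript FixSCOMAgent.vbs. Use an account that is an administrator on the target servers, because the script uses each server's c$ share and remote WMI.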

After you run the script, watch the SCOM console. The servers will start showing up correctly, although it can sometimes take about 15 minutes. Just in case, check each fixed server's event log for errors.
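To check the Operations Manager event log on several fixed servers without logging on to each one, `wevtutil` can query it remotely on Windows Server 2008 and later (it is not available on Windows Server 2003). A minimal Python sketch that builds and runs the query; the XPath filter for Error-level events (Level 2) plus event 21034 is my own choice, and the function names are hypothetical.

```python
import subprocess


def event_query_cmd(server, count=20):
    """Build a wevtutil command that lists the newest Error-level (Level=2)
    events and any 21034 warnings from the remote Operations Manager log."""
    return [
        "wevtutil", "qe", "Operations Manager",
        f"/r:{server}",
        "/q:*[System[(Level=2 or EventID=21034)]]",
        f"/c:{count}",
        "/rd:true",   # newest events first
        "/f:text",
    ]


def check_server(server):
    """Run the query and return wevtutil's text output (empty if nothing matched)."""
    result = subprocess.run(event_query_cmd(server), capture_output=True, text=True)
    return result.stdout
```

Passing the arguments as a list avoids quoting problems with the space in "Operations Manager" and in the XPath filter. Loop over the same names you put in Servers.txt and print `check_server(name)` for each.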

Whew! I hope this post helped you.
