anduin revised this gist 7 months ago.
                
                
                    2 files changed, 16 insertions, 1 deletion
FixBatch.md
| @@ -42,7 +42,6 @@ $machines | Where-Object { $_.DesiredMachineDefinition -eq 'AD' } | Group-Object | |||
| 42 | 42 | ||
| 43 | 43 | If DMS is unavailable, you can use the CADW database: [CADW](https://dataexplorer.azure.com/clusters/cadwprod.westus2/databases/Exchange) | |
| 44 | 44 | ||
| 45 | - | ||
| 46 | 45 | ```kusto | |
| 47 | 46 | SubstrateMachine | |
| 48 | 47 | | where DeployRing == "SDFV2" | |
FixStruggler.md (file created)
| @@ -0,0 +1,16 @@ | |||
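| 1 | + | ## Chapter 3 - Diagnosing why the remaining machines are not deploying | |
| 2 | + | ||
| 3 | + | 1. Run the query below to view information about the remaining machines | |
| 4 | + | ||
| 5 | + | Use the CADW database: [CADW](https://dataexplorer.azure.com/clusters/cadwprod.westus2/databases/Exchange) | |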
| 6 | + | ||
| 7 | + | ```kusto | |
| 8 | + | SubstrateMachine | |
| 9 | + | | where ActivityState == "DotBuildUpgrade" and DesiredMachineDefinition == "BE" | |
| 10 | + | | where ActualExchangeVersion contains "15.20.8534" | |
| 11 | + | | where DeployRing in ('SIP', 'WW') | |
| 12 | + | | extend unpatched = strcmp(ActualExchangeVersion, "15.20.8534.031") < 0 | |
| 13 | + | | summarize TotalCount=count(), unpatchedCount = countif(unpatched) by Forest | |
| 14 | + | | extend UnPatchedPercentage = round(100.0 * unpatchedCount / TotalCount, 2) | |
| 15 | + | | order by UnPatchedPercentage desc | |
| 16 | + | ``` | |
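
The query added in FixStruggler.md amounts to a per-forest tally of machines still below the patched build. The following is a minimal Python sketch of the same arithmetic, not the author's tooling: the `unpatched_by_forest` helper and the sample rows are hypothetical. It mirrors `strcmp` with a plain string comparison, which only works if every version string uses the same zero-padded layout as `15.20.8534.031`.

```python
from collections import defaultdict

TARGET = "15.20.8534.031"  # patched build taken from the query above


def unpatched_by_forest(rows):
    """Return (forest, total, unpatched, percentage) tuples, highest percentage first.

    rows: iterable of (forest, actual_exchange_version) tuples (hypothetical shape).
    """
    totals = defaultdict(int)
    unpatched = defaultdict(int)
    for forest, version in rows:
        totals[forest] += 1
        # Same idea as strcmp(version, TARGET) < 0: ordinal string comparison.
        if version < TARGET:
            unpatched[forest] += 1
    result = [
        (forest, totals[forest], unpatched[forest],
         round(100.0 * unpatched[forest] / totals[forest], 2))
        for forest in totals
    ]
    return sorted(result, key=lambda r: r[3], reverse=True)


# Hypothetical sample data, for illustration only.
sample = [
    ("forestA", "15.20.8534.020"),
    ("forestA", "15.20.8534.031"),
    ("forestB", "15.20.8534.031"),
    ("forestB", "15.20.8534.040"),
    ("forestA", "15.20.8534.015"),
]
report = unpatched_by_forest(sample)
```

Forests at the top of `report` are the ones with the largest share of unpatched machines, which is the same ordering the Kusto `order by UnPatchedPercentage desc` produces.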
                
                
                
anduin revised this gist 7 months ago.
                
                
                    1 file changed, 2 insertions
FixMachine.md
| @@ -96,6 +96,8 @@ CentralAdminWorkflows_Global | |||
| 96 | 96 | | sort by CreateTimeUtc asc | |
| 97 | 97 | ``` | |
| 98 | 98 | ||
| 99 | + | For ITAR, use [Jarvis](https://portal.microsoftgeneva.com/logs/dgrep?be=DGrep&ep=CA%20Fairfax&ns=O365PassiveITAR&en=CentralAdminWorkflows&time=2025-03-05T07:23:00.000Z&UTC=true&offset=-3&offsetUnit=Days&conditions=[[%22ClassName%22,%22%3D%3D%22,%22PatchPersistenceInspector%22]]&kqlClientQuery=source%0A|%20extend%20WorkflowId%20%3D%20strcat(%22\\\\%22,%20ManagementUnit,%20%22\\%22,%20Id)%0A|%20project%20ClassName,%20Result,%20CreateTimeUtc,%20EndTimeUtc,%20WorkflowId,%20Exception,%20LastGoodKnownState,%20UserContext,%20TenantVersion%0A|%20sort%20by%20CreateTimeUtc%20desc&aggregates=[%22Count%20by%20env_cloud_roleInstance%22]&chartEditorVisible=true&chartType=line&chartLayers=[[%22New%20Layer%22,%22%22],[%22Count%20by%20env_cloud_roleInstance%22,%22groupby%20env_time.roundDown(\%22PT1M\%22)%20as%20X,%20env_cloud_roleInstance\nwhere%20env_cloud_roleInstance%20%3D%3D%20\%22DM3MGT04CS0029\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22PH1MGT0401CS001\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22BN8MGT0401CS001\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22SN5MGT0401CS009\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22PH1MGT0401CS013\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22DM3MGT04CS0031\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22DM3MGT04CS0037\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22SN1MGT04CS103\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22BN8MGT0401CS019\%22%20||%20env_cloud_roleInstance%20%3D%3D%20\%22CY1MGT04CS110\%22\nlet%20Count%20%3D%20Count()%22]]%20) instead. | |
| 100 | + | ||
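| 99 | 101 | At this point we usually already know why the machine failed to deploy. If it is still unclear, continue with the steps below. | |
| 100 | 102 | ||
| 101 | 103 | 6. Classify the deployment errors by cause: | |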
                
                
                
anduin revised this gist 8 months ago.
                
                
                    1 file changed, 3 insertions, 2 deletions
FixBatch.md
| @@ -48,10 +48,11 @@ SubstrateMachine | |||
| 48 | 48 | | where DeployRing == "SDFV2" | |
| 49 | 49 | | where DesiredMachineDefinition == "BE" | |
| 50 | 50 | | where DesiredVersion contains "15.20.8495" | |
| 51 | - | | count | |
| 51 | + | | where ProvisioningState != "Provisioned" | |
| 52 | + | | project Name, ActualVersion, DesiredVersion, Dag, Forest, DesiredMachineDefinition, ProvisioningState, ActivityState | |
| 53 | + | | sort by Dag | |
| 52 | 54 | ``` | |
| 53 | 55 | ||
| 54 | - | ||
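| 55 | 56 | * At this step: determine the Role of the machines that cannot deploy | |
| 56 | 57 | ||
| 57 | 58 | 5. Check expectations: in DMS, group the machines by DesiredVersion and check whether any machine is trying to deploy this version. | |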
                
                
                
anduin revised this gist 8 months ago.
                
                
                    1 file changed, 12 insertions
FixBatch.md
| @@ -40,6 +40,18 @@ $machines | Where-Object { $_.DesiredMachineDefinition -eq 'FE' } | Group-Object | |||
| 40 | 40 | $machines | Where-Object { $_.DesiredMachineDefinition -eq 'AD' } | Group-Object ActualVersion | Sort-Object { $_.Name } | |
| 41 | 41 | ``` | |
| 42 | 42 | ||
| 43 | + | If DMS is unavailable, you can use the CADW database: [CADW](https://dataexplorer.azure.com/clusters/cadwprod.westus2/databases/Exchange) | |
| 44 | + | ||
| 45 | + | ||
| 46 | + | ```kusto | |
| 47 | + | SubstrateMachine | |
| 48 | + | | where DeployRing == "SDFV2" | |
| 49 | + | | where DesiredMachineDefinition == "BE" | |
| 50 | + | | where DesiredVersion contains "15.20.8495" | |
| 51 | + | | count | |
| 52 | + | ``` | |
| 53 | + | ||
| 54 | + | ||
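| 43 | 55 | * At this step: determine the Role of the machines that cannot deploy | |
| 44 | 56 | ||
| 45 | 57 | 5. Check expectations: in DMS, group the machines by DesiredVersion and check whether any machine is trying to deploy this version. | |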
                
                
                
anduin revised this gist 8 months ago.
                
                
                    1 file changed, 10 insertions
FixMachine.md
| @@ -86,6 +86,16 @@ Enable-SeeAnything | |||
| 86 | 86 | See-Workflow $workflowId | |
| 87 | 87 | ``` | |
| 88 | 88 | ||
| 89 | + | If DMS is unavailable, consider using the Kusto query below: | |
| 90 | + | ||
| 91 | + | ```kusto | |
| 92 | + | CentralAdminWorkflows_Global | |
| 93 | + | | where RootWorkflowId == '$guid' | |
| 94 | + | | extend WorkflowId = strcat("\\\\", ManagementUnit, "\\", Id) | |
| 95 | + | | project ClassName, Result, CreateTimeUtc, EndTimeUtc, WorkflowId, Exception, LastGoodKnownState, UserContext, TenantVersion,RootWorkflowId | |
| 96 | + | | sort by CreateTimeUtc asc | |
| 97 | + | ``` | |
| 98 | + | ||
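| 89 | 99 | At this point we usually already know why the machine failed to deploy. If it is still unclear, continue with the steps below. | |
| 90 | 100 | ||
| 91 | 101 | 6. Classify the deployment errors by cause: | |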
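
The `extend WorkflowId = strcat("\\\\", ManagementUnit, "\\", Id)` line in the Kusto query from this revision builds a UNC-style identifier with two leading backslashes. As a rough illustration of the resulting shape, here is a minimal Python sketch; `build_workflow_id` and the sample values are hypothetical and are not part of DMS or Kusto.

```python
def build_workflow_id(management_unit: str, workflow_id: str) -> str:
    """Join the parts the same way the Kusto strcat call does.

    In both Kusto and Python string literals, "\\\\" is two backslashes
    and "\\" is one, so the result looks like \\<ManagementUnit>\<Id>.
    """
    return "\\\\" + management_unit + "\\" + workflow_id


# Hypothetical values, for illustration only.
example = build_workflow_id("MU01", "1234")
```

Here `example` has the form `\\MU01\1234`, which appears to be the shape of the value passed as `$workflowId` to `See-Workflow` earlier in the same file.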
                
                
                
anduin revised this gist 8 months ago.
                
                
                    1 file changed, 11 insertions
FixBatch.md
| @@ -78,6 +78,17 @@ APSFailedWorkitemEvent_Global | |||
| 78 | 78 | | order by targetIntention asc, Count desc | |
| 79 | 79 | ``` | |
| 80 | 80 | ||
| 81 | + | If the output contains a large number of DownloadComponent errors, you can use this query to check how they are distributed: | |
| 82 | + | ||
| 83 | + | ```kusto | |
| 84 | + | ComponentReplicationCogsEvent_Global() | |
| 85 | + | | where deployRing == "TDF" and env_time > ago(100h) | |
| 86 | + | | summarize | |
| 87 | + | Failed = countif(result == 'Failed'), | |
| 88 | + | Succeeded = countif(result == 'Succeeded') by bin(env_time, 30min) | |
| 89 | + | | render timechart | |
| 90 | + | ``` | |
| 91 | + | ||
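| 81 | 92 | The query above outputs some sample machines. See Chapter 2 to diagnose these machines further. | |
| 82 | 93 | ||
| 83 | 94 | 8. Find the error message, check the logs, and identify the right owner. | |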
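
The `ComponentReplicationCogsEvent_Global()` query in this revision counts failed and succeeded results per 30-minute bucket before charting them. The bucketing step can be sketched in Python as follows; this is an illustrative assumption about the data shape, with `bin_30min`, `summarize`, and the sample events all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bin_30min(ts: datetime) -> datetime:
    """Floor a timestamp to its 30-minute boundary, like bin(env_time, 30min)."""
    return ts - timedelta(
        minutes=ts.minute % 30, seconds=ts.second, microseconds=ts.microsecond
    )


def summarize(events):
    """Return (bucket, failed, succeeded) tuples in chronological order.

    events: iterable of (timestamp, result) tuples (hypothetical shape).
    """
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [failed, succeeded]
    for ts, result in events:
        counts = buckets[bin_30min(ts)]  # touching the key keeps empty buckets too
        if result == "Failed":
            counts[0] += 1
        elif result == "Succeeded":
            counts[1] += 1
    return [(b, c[0], c[1]) for b, c in sorted(buckets.items())]


# Hypothetical sample events, for illustration only.
events = [
    (datetime(2025, 1, 1, 10, 5), "Failed"),
    (datetime(2025, 1, 1, 10, 29), "Succeeded"),
    (datetime(2025, 1, 1, 10, 31), "Failed"),
    (datetime(2025, 1, 1, 10, 59), "Failed"),
]
summary = summarize(events)
```

A bucket where failures clearly outnumber successes is the kind of spike the rendered timechart is meant to surface.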
                
                
                
anduin revised this gist 9 months ago.
                
                
                    1 file changed, 8 insertions
FixBatch.md
| @@ -131,6 +131,14 @@ Get-DeploymentConfigApprovedVersion -ApprovedVersion 15.20.74 | |||
| 131 | 131 | Get-DeploymentConfigPrerequisiteVersion -EntityName BE -ApprovedVersion 15.20.7472.030 | ft -a | |
| 132 | 132 | ``` | |
| 133 | 133 | ||
| 134 | + | When DMS is unavailable, use the Kusto query below as a fallback: | |
| 135 | + | ||
| 136 | + | ``` | |
| 137 | + | SubstrateConfigWorkItem | |
| 138 | + | | where DeployRing contains "TDF" and ApprovedVersion contains "8374" and ServerRole contains "BE" | |
| 139 | + | | project HandlerType, HandlerStatus, WhenChanged | |
| 140 | + | ``` | |
| 141 | + | ||
| 134 | 142 | is complete | |
| 135 | 143 | ||
| 136 | 144 | 12. Check whether a config version has been created for the preceding Ring | |
                
                
                
anduin revised this gist 10 months ago.
                
                
                    1 file changed, 2 insertions
FixBatch.md
| @@ -1,3 +1,5 @@ | |||
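| 1 | + | This section describes a general method for diagnosing deployment failures on Substrate datacenter machines. It helps identify the core problem at a high level. | |
| 2 | + | ||
| 1 | 3 | 1. Prepare the workspace: immediately open two DMS sessions, two OSP windows, and one Kusto Explorer. | |
| 2 | 4 | ||
| 3 | 5 | 2. Identify: determine the scope of the failure, whether it is a version or a Ring. Check the trend chart for this Ring in OSP. Check the Substrate version history to confirm the version type (Dogfood, Daily). | |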
                
                
                
anduin revised this gist 10 months ago.
                
                
                    1 file changed, 1 insertion
FixMachine.md
| @@ -82,6 +82,7 @@ ApsPrioritizerTraceEvent_Global | |||
| 82 | 82 | The output of step 2 includes a WorkflowId. We can use this WorkflowId to view the machine's deployment errors. | |
| 83 | 83 | ||
| 84 | 84 | ```powershell | |
| 85 | + | Enable-SeeAnything | |
| 85 | 86 | See-Workflow $workflowId | |
| 86 | 87 | ``` | |
| 87 | 88 | ||
                
                
                
anduin revised this gist 10 months ago.
                
                
                    1 file changed, 1 insertion, 1 deletion
FixBatch.md
| @@ -10,7 +10,7 @@ | |||
| 10 | 10 | ||
| 11 | 11 | **Do not** skip this step! Many problems are caused by Overrides. You may well find that someone is already overriding this issue. | |
| 12 | 12 | ||
| 13 | - | On the OSP Overrides page, search for: | |
| 13 | + | On the [OSP Overrides](https://m365pulse.microsoft.com/DeploymentCore/DeploymentMonitorApp/control%20panel/override) page, search for: | |
| 14 | 14 | ||
| 15 | 15 | * Information about this version itself | |
| 16 | 16 | * Overrides containing 999 | |