Tip: How to Use the Hadoop APIs to Collect Cluster Statistics

(Applies to Hadoop 2.7 and later.)

The following RESTful APIs are involved:

ResourceManager REST APIs:

https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html

WebHDFS REST API:

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/WebHDFS.html

MapReduce History Server REST APIs:

https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html

Spark Monitoring and Instrumentation:

http://spark.apache.org/docs/latest/monitoring.html

1. Collecting real-time usage statistics for the HDFS file system

URL

http://emr-header-1:50070/webhdfs/v1/?user.name=hadoop&op=GETCONTENTSUMMARY

Response:

{"ContentSummary":

Description of the response fields:

{"name" : "ContentSummary","properties":

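Note how length and spaceConsumed relate: length is the logical size of the content, while spaceConsumed is the disk space it occupies across all replicas, so the ratio between the two depends on the HDFS replication factor.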
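As a rough illustration of that relationship, the sketch below divides spaceConsumed by length to estimate the average replication factor of a directory. The function name and the sample numbers are hypothetical; only the field names length and spaceConsumed come from the WebHDFS response.

```python
def effective_replication(content_summary):
    """Estimate the average replication factor from a ContentSummary dict."""
    length = content_summary["length"]                 # logical bytes, replicas not counted
    space_consumed = content_summary["spaceConsumed"]  # bytes on disk, replicas counted
    if length == 0:
        return 0.0  # empty directory: the ratio is undefined, so report 0
    return space_consumed / length


# Hypothetical values, not taken from a real cluster.
sample = {"ContentSummary": {"length": 1000, "spaceConsumed": 3000}}
ratio = effective_replication(sample["ContentSummary"])
print(ratio)
```

A ratio well below the configured replication factor may simply mean that some files were written with a lower per-file replication setting.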

To collect usage statistics for each group's working directory, use a request like this:

http://emr-header-1:50070/webhdfs/v1/user/feed_aliyun?user.name=hadoop&op=GETCONTENTSUMMARY
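When several working directories need to be checked, the request URLs can be generated rather than typed by hand. The following is a minimal sketch using only the Python standard library; the helper names are illustrative, /user/other_group is a hypothetical second directory, and the host, port, and user are the ones used in this article.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

NAMENODE = "http://emr-header-1:50070"  # host and port as used in this article


def content_summary_url(path, user="hadoop", base=NAMENODE):
    """Build the GETCONTENTSUMMARY URL for an absolute HDFS path."""
    query = urlencode({"user.name": user, "op": "GETCONTENTSUMMARY"})
    return f"{base}/webhdfs/v1{quote(path)}?{query}"


def fetch_content_summary(path, user="hadoop", base=NAMENODE, timeout=10):
    """Call WebHDFS and return the inner ContentSummary dict (needs network access)."""
    with urlopen(content_summary_url(path, user, base), timeout=timeout) as resp:
        return json.load(resp)["ContentSummary"]


# /user/feed_aliyun is from the article; /user/other_group is hypothetical.
urls = [content_summary_url(p) for p in ("/user/feed_aliyun", "/user/other_group")]
for url in urls:
    print(url)
```

fetch_content_summary is defined but not called here, because it only works from a machine that can reach the NameNode.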

2. Viewing real-time cluster information and status

URL

http://emr-header-1:8088/ws/v1/cluster

Response:

{ "clusterInfo": { "id": 1495123166259,
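A monitoring script usually needs only a few fields from this response. The sketch below picks them out of an already parsed payload. The field names startedOn, state, and haState are taken from the ResourceManager REST documentation; the function name and all sample values other than the id shown above are hypothetical.

```python
from datetime import datetime, timezone


def summarize_cluster_info(payload):
    """Pick a few fields out of a parsed /ws/v1/cluster response."""
    info = payload["clusterInfo"]
    started_ms = info.get("startedOn")  # epoch time in milliseconds
    started = (
        datetime.fromtimestamp(started_ms / 1000, tz=timezone.utc)
        if started_ms is not None
        else None
    )
    return {
        "id": info.get("id"),
        "state": info.get("state"),
        "haState": info.get("haState"),
        "startedOn": started,
    }


# Only "id" is taken from the article; the other values are hypothetical.
sample = {
    "clusterInfo": {
        "id": 1495123166259,
        "startedOn": 1495123166259,
        "state": "STARTED",
        "haState": "ACTIVE",
    }
}
summary = summarize_cluster_info(sample)
print(summary)
```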

3. Viewing real-time information for resource queues, including queue quotas and current resource usage

URL

http://emr-header-1:8088/ws/v1/cluster/scheduler

Response:

{ "scheduler": { "schedulerInfo": { "type": "capacityScheduler",

For a description of each parameter, see: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Queue_API
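Because queues can be nested, it can help to flatten the scheduler response into one row per queue. The sketch below assumes the capacity scheduler layout, in which child queues sit under a "queues" object holding a "queue" list; other schedulers return a different structure. The function names and the sample queues are hypothetical.

```python
def iter_queues(queue_container, parent="root"):
    """Yield (path, capacity, usedCapacity, maxCapacity) for every queue, depth first."""
    for q in (queue_container or {}).get("queue", []):
        path = f"{parent}.{q['queueName']}"
        yield path, q.get("capacity"), q.get("usedCapacity"), q.get("maxCapacity")
        # Parent queues nest their children under another "queues" object.
        yield from iter_queues(q.get("queues"), path)


def queue_usage(payload):
    """Flatten a parsed /ws/v1/cluster/scheduler response into a list of rows."""
    info = payload["scheduler"]["schedulerInfo"]
    return list(iter_queues(info.get("queues")))


# Hypothetical queue tree for illustration only.
sample = {"scheduler": {"schedulerInfo": {
    "type": "capacityScheduler",
    "queueName": "root",
    "queues": {"queue": [
        {"queueName": "default", "capacity": 60.0, "usedCapacity": 10.0, "maxCapacity": 100.0},
        {"queueName": "etl", "capacity": 40.0, "usedCapacity": 0.0, "maxCapacity": 100.0,
         "queues": {"queue": [
             {"queueName": "daily", "capacity": 100.0, "usedCapacity": 0.0, "maxCapacity": 100.0},
         ]}},
    ]},
}}}
rows = queue_usage(sample)
for row in rows:
    print(row)
```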

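4. Viewing the real-time job list. Each entry also carries details about the job run, including the job name, ID, running state, start and finish times, and resource usage.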

URL

http://emr-header-1:8088/ws/v1/cluster/apps

Response:

{ "apps":

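To restrict the statistics to a fixed time window, append the parameters "?finishedTimeBegin={timestamp}&finishedTimeEnd={timestamp}", with both timestamps given in milliseconds since the epoch. For example: http://emr-header-1:8088/ws/v1/cluster/apps?finishedTimeBegin=1496742124000&finishedTimeEnd=1496742134000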
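The timestamps are epoch milliseconds, which are awkward to write by hand. The sketch below builds the query from datetime objects and totals resource usage over the returned applications. The helper names are illustrative, and memorySeconds and vcoreSeconds are assumed to be present in each app entry, as listed in the ResourceManager REST documentation; missing fields are counted as 0.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

RM = "http://emr-header-1:8088"  # host and port as used in this article


def to_millis(dt):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)


def apps_url(begin, end, base=RM):
    """Build the apps URL limited to jobs that finished within [begin, end]."""
    query = urlencode({
        "finishedTimeBegin": to_millis(begin),
        "finishedTimeEnd": to_millis(end),
    })
    return f"{base}/ws/v1/cluster/apps?{query}"


def total_resource_seconds(payload):
    """Sum memorySeconds and vcoreSeconds over all apps in a parsed response."""
    apps = (payload.get("apps") or {}).get("app", [])  # "apps" may be null when empty
    return {
        "apps": len(apps),
        "memorySeconds": sum(a.get("memorySeconds", 0) for a in apps),
        "vcoreSeconds": sum(a.get("vcoreSeconds", 0) for a in apps),
    }


# The same 10-second window as the example URL in the article.
url = apps_url(
    datetime(2017, 6, 6, 9, 42, 4, tzinfo=timezone.utc),
    datetime(2017, 6, 6, 9, 42, 14, tzinfo=timezone.utc),
)
print(url)
```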

5. Measuring the amount of data scanned by a job

The amount of data a job scans has to be queried through the History Server RESTful API, and the procedure differs somewhat between MapReduce and Spark.

5.1 Data scanned by a MapReduce job

URL

http://emr-header-1:19888/ws/v1/history/mapreduce/jobs/job_1495123166259_0962/counters

Response:

{ "jobCounters" : { "id" : "job_1326381300833_2_2", "counterGroup" : [

Here, BYTES_READ in the org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter group is the amount of data the job scanned.

For details on the parameters, see: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html#Job_Counters_API
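Locating that counter in the response means walking the counter groups. The sketch below does so on an already parsed payload. The keys counterGroupName, counter, name, and totalCounterValue follow the Job Counters API documentation; the function name and the sample values are hypothetical.

```python
INPUT_GROUP = "org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter"


def bytes_read(payload, group=INPUT_GROUP, counter="BYTES_READ"):
    """Return the total value of one counter from a parsed jobCounters response."""
    for g in payload["jobCounters"].get("counterGroup", []):
        if g.get("counterGroupName") != group:
            continue
        for c in g.get("counter", []):
            if c.get("name") == counter:
                return c.get("totalCounterValue")
    return None  # counter not present, for example a job with no file input


# Hypothetical counter values for illustration only.
sample = {"jobCounters": {
    "id": "job_1495123166259_0962",
    "counterGroup": [
        {"counterGroupName": "org.apache.hadoop.mapreduce.TaskCounter",
         "counter": [{"name": "MAP_INPUT_RECORDS", "totalCounterValue": 12}]},
        {"counterGroupName": INPUT_GROUP,
         "counter": [{"name": "BYTES_READ", "totalCounterValue": 2483}]},
    ],
}}
scanned = bytes_read(sample)
print(scanned)
```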

5.2 Data scanned by a Spark job

URL

http://emr-header-1:18080/api/v1/applications/application_1495123166259_1050/executors

The sum of totalInputBytes across all executors is the total amount of data the job scanned.
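That sum can be sketched as follows, assuming the endpoint returns a JSON array with one object per executor. The function name and the sample executors are hypothetical; only the totalInputBytes field comes from the text above.

```python
def total_input_bytes(executors):
    """Sum totalInputBytes over a parsed list of executor objects."""
    return sum(e.get("totalInputBytes", 0) for e in executors)


# Hypothetical executor list for illustration only.
sample = [
    {"id": "driver", "totalInputBytes": 0},
    {"id": "1", "totalInputBytes": 1024},
    {"id": "2", "totalInputBytes": 2048},
]
total = total_input_bytes(sample)
print(total)
```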

Further reference: http://spark.apache.org/docs/latest/monitoring.html