Karpenter is an open-source autoscaling project built for Kubernetes. It improves the availability of Kubernetes applications without requiring you to manually provision or over-provision compute resources. Karpenter observes the aggregate resource requests of unschedulable pods and makes decisions to launch and terminate nodes, minimizing scheduling latency and providing the right compute resources to meet your application's needs in seconds rather than minutes.
Karpenter can replace the traditional cluster-autoscaler. To install Karpenter on an existing EKS cluster, we follow the Karpenter documentation and use the "Migrating from Cluster Autoscaler" installation path, which serves as the reference for this walkthrough.
Environment
| Software | Version | Install location |
|---|---|---|
| AWS EKS | 1.34 | AWS (Oregon region, us-west-2) |
| eksctl | 0.215.0 | Local machine |
| kubectl | 1.34 | Local machine |
| aws cli | 2.21.0 | Local machine |
IAM OIDC provider setup
In the EKS console, find the cluster's OpenID Connect provider URL and copy it.
Then, in the AWS IAM console, add a new identity provider.
Paste the copied OpenID Connect provider URL and enter sts.amazonaws.com in the Audience field.
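If you prefer the command line over the console, the same OIDC provider association can be done with eksctl (a sketch; replace `<your cluster name>` and `<region-code>` with your own values):

```bash
# Associate an IAM OIDC identity provider with the cluster (idempotent; safe to re-run)
eksctl utils associate-iam-oidc-provider \
  --cluster <your cluster name> \
  --region <region-code> \
  --approve

# Confirm the provider is now registered in IAM
aws iam list-open-id-connect-providers
```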
Install Karpenter
Following the Migrating from Cluster Autoscaler guide.
Set environment variables
```bash
KARPENTER_NAMESPACE=kube-system
# Cluster name
CLUSTER_NAME=<your cluster name>
```
It is recommended to run these commands in a bash environment (on Windows you can install MinGW).
```bash
# Use "aws" for commercial regions, "aws-cn" for China regions
AWS_PARTITION="aws"
# Get the AWS region
AWS_REGION="$(aws configure list | grep region | tr -s " " | cut -d" " -f3)"
# Get the OIDC endpoint
OIDC_ENDPOINT="$(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.identity.oidc.issuer" --output text)"
# Get the AWS account ID
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
K8S_VERSION=$(aws eks describe-cluster --name "${CLUSTER_NAME}" --query "cluster.version" --output text)
ALIAS_VERSION="$(aws ssm get-parameter --name "/aws/service/eks/optimized-ami/${K8S_VERSION}/amazon-linux-2023/x86_64/standard/recommended/image_id" --query Parameter.Value | xargs aws ec2 describe-images --query 'Images[0].Name' --image-ids | sed -r 's/^.*(v[[:digit:]]+).*$/\1/')"
```
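A quick sanity check before continuing (a small sketch; it only prints the derived values so an empty variable is easy to spot):

```bash
echo "AWS_PARTITION=${AWS_PARTITION}"
echo "AWS_REGION=${AWS_REGION}"
echo "OIDC_ENDPOINT=${OIDC_ENDPOINT}"
echo "AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID}"
echo "K8S_VERSION=${K8S_VERSION}"
echo "ALIAS_VERSION=${ALIAS_VERSION}"
```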
Create the KarpenterNodeRole IAM role
Create a KarpenterNodeRole-${CLUSTER_NAME} role for the nodes that Karpenter will manage:
```bash
echo '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}' > node-trust-policy.json

aws iam create-role --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --assume-role-policy-document file://node-trust-policy.json
```
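To confirm the role exists (a quick check with the AWS CLI, assuming the same shell session and credentials):

```bash
aws iam get-role --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --query 'Role.Arn' --output text
```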
Attach policies to the newly created role:
```bash
aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" --policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" --policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" --policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" --policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonSSMManagedInstanceCore
```
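You can verify that all four managed policies are attached (a quick check):

```bash
aws iam list-attached-role-policies \
    --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --query 'AttachedPolicies[].PolicyName' --output text
```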
Create the KarpenterControllerRole IAM role
The KarpenterControllerRole IAM role is used by the Karpenter controller itself:
```bash
cat << EOF > controller-trust-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ENDPOINT#*//}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "${OIDC_ENDPOINT#*//}:aud": "sts.amazonaws.com",
                    "${OIDC_ENDPOINT#*//}:sub": "system:serviceaccount:${KARPENTER_NAMESPACE}:karpenter"
                }
            }
        }
    ]
}
EOF

aws iam create-role --role-name "KarpenterControllerRole-${CLUSTER_NAME}" \
    --assume-role-policy-document file://controller-trust-policy.json
```
Attach a policy to the KarpenterControllerRole:
```bash
cat << EOF > controller-policy.json
{
    "Statement": [
        {
            "Action": [
                "ssm:GetParameter",
                "ec2:DescribeImages",
                "ec2:RunInstances",
                "ec2:DescribeSubnets",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeInstanceTypeOfferings",
                "ec2:DeleteLaunchTemplate",
                "ec2:CreateTags",
                "ec2:CreateLaunchTemplate",
                "ec2:CreateFleet",
                "ec2:DescribeSpotPriceHistory",
                "pricing:GetProducts"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "Karpenter"
        },
        {
            "Action": "ec2:TerminateInstances",
            "Condition": {
                "StringLike": {
                    "ec2:ResourceTag/karpenter.sh/nodepool": "*"
                }
            },
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "ConditionalEC2Termination"
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}",
            "Sid": "PassNodeIAMRole"
        },
        {
            "Effect": "Allow",
            "Action": "eks:DescribeCluster",
            "Resource": "arn:${AWS_PARTITION}:eks:${AWS_REGION}:${AWS_ACCOUNT_ID}:cluster/${CLUSTER_NAME}",
            "Sid": "EKSClusterEndpointLookup"
        },
        {
            "Sid": "AllowScopedInstanceProfileCreationActions",
            "Effect": "Allow",
            "Resource": "*",
            "Action": [
                "iam:CreateInstanceProfile"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned",
                    "aws:RequestTag/topology.kubernetes.io/region": "${AWS_REGION}"
                },
                "StringLike": {
                    "aws:RequestTag/karpenter.k8s.aws/ec2nodeclass": "*"
                }
            }
        },
        {
            "Sid": "AllowScopedInstanceProfileTagActions",
            "Effect": "Allow",
            "Resource": "*",
            "Action": [
                "iam:TagInstanceProfile"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned",
                    "aws:ResourceTag/topology.kubernetes.io/region": "${AWS_REGION}",
                    "aws:RequestTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned",
                    "aws:RequestTag/topology.kubernetes.io/region": "${AWS_REGION}"
                },
                "StringLike": {
                    "aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass": "*",
                    "aws:RequestTag/karpenter.k8s.aws/ec2nodeclass": "*"
                }
            }
        },
        {
            "Sid": "AllowScopedInstanceProfileActions",
            "Effect": "Allow",
            "Resource": "*",
            "Action": [
                "iam:AddRoleToInstanceProfile",
                "iam:RemoveRoleFromInstanceProfile",
                "iam:DeleteInstanceProfile"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/kubernetes.io/cluster/${CLUSTER_NAME}": "owned",
                    "aws:ResourceTag/topology.kubernetes.io/region": "${AWS_REGION}"
                },
                "StringLike": {
                    "aws:ResourceTag/karpenter.k8s.aws/ec2nodeclass": "*"
                }
            }
        },
        {
            "Sid": "AllowInstanceProfileReadActions",
            "Effect": "Allow",
            "Resource": "*",
            "Action": "iam:GetInstanceProfile"
        },
        {
            "Sid": "AllowUnscopedInstanceProfileListAction",
            "Effect": "Allow",
            "Resource": "*",
            "Action": "iam:ListInstanceProfiles"
        }
    ],
    "Version": "2012-10-17"
}
EOF

aws iam put-role-policy --role-name "KarpenterControllerRole-${CLUSTER_NAME}" \
    --policy-name "KarpenterControllerPolicy-${CLUSTER_NAME}" \
    --policy-document file://controller-policy.json
```
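To confirm the inline policy was stored (a quick check; it lists the statement Sids from the policy document):

```bash
aws iam get-role-policy \
    --role-name "KarpenterControllerRole-${CLUSTER_NAME}" \
    --policy-name "KarpenterControllerPolicy-${CLUSTER_NAME}" \
    --query 'PolicyDocument.Statement[].Sid' --output text
```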
Tag the subnets and security groups of the nodes that Karpenter will manage
Tag the subnets used by the existing node groups:
```bash
# Tag the subnets used by the existing node groups
for NODEGROUP in $(aws eks list-nodegroups --cluster-name "${CLUSTER_NAME}" --query 'nodegroups' --output text); do
    aws ec2 create-tags \
        --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}" \
        --resources $(aws eks describe-nodegroup --cluster-name "${CLUSTER_NAME}" \
        --nodegroup-name "${NODEGROUP}" --query 'nodegroup.subnets' --output text )
done
```
Tag the security groups used by the existing node group:
```bash
# Look up the node group
NODEGROUP=$(aws eks list-nodegroups --cluster-name "${CLUSTER_NAME}" \
    --query 'nodegroups[0]' --output text)
# Look up the EC2 launch template
LAUNCH_TEMPLATE=$(aws eks describe-nodegroup --cluster-name "${CLUSTER_NAME}" \
    --nodegroup-name "${NODEGROUP}" --query 'nodegroup.launchTemplate.{id:id,version:version}' \
    --output text | tr -s "\t" ",")

# If your EKS setup is configured to use only the cluster security group, then execute:
SECURITY_GROUPS=$(aws eks describe-cluster \
    --name "${CLUSTER_NAME}" --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text)

# If your setup uses the security groups in the launch template of a managed node group, then:
SECURITY_GROUPS="$(aws ec2 describe-launch-template-versions \
    --launch-template-id "${LAUNCH_TEMPLATE%,*}" --versions "${LAUNCH_TEMPLATE#*,}" \
    --query 'LaunchTemplateVersions[0].LaunchTemplateData.[NetworkInterfaces[0].Groups||SecurityGroupIds]' \
    --output text)"
# Tag the security groups
aws ec2 create-tags \
    --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}" \
    --resources "${SECURITY_GROUPS}"
```
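To confirm the discovery tags are in place, you can list the subnets and security groups that now carry the tag (a sketch):

```bash
aws ec2 describe-subnets \
    --filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" \
    --query 'Subnets[].SubnetId' --output text

aws ec2 describe-security-groups \
    --filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" \
    --query 'SecurityGroups[].GroupId' --output text
```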
Edit aws-auth
First set up your kubeconfig so that your local machine can manage the EKS cluster, replacing <region-code> and <your-eks-cluster-name> with your actual values:
```bash
aws eks update-kubeconfig --region <region-code> --name <your-eks-cluster-name>
```
Start editing:
```bash
kubectl edit configmap aws-auth -n kube-system
```
Add an entry following the template below; AWS_PARTITION, AWS_ACCOUNT_ID, and CLUSTER_NAME are variables that must be replaced with your own actual values:
```yaml
- groups:
  - system:bootstrappers
  - system:nodes
  ## If you intend to run Windows workloads, the kube-proxy group should be specified.
  # For more information, see https://github.com/aws/karpenter/issues/5099.
  # - eks:kube-proxy-windows
  # Replace the placeholders on this line with your own values
  rolearn: arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}
  # Do NOT change this line!
  username: system:node:{{EC2PrivateDNSName}}
```
Make sure to replace AWS_PARTITION, AWS_ACCOUNT_ID, and CLUSTER_NAME with the actual values for your account; otherwise, the nodes launched later will not have permission to join the EKS cluster.
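If the environment variables from the earlier steps are still set in your shell, a simple way to avoid typos is to print the fully substituted role ARN and paste it into aws-auth (a small sketch):

```bash
# Prints something like: arn:aws:iam::111122223333:role/KarpenterNodeRole-my-cluster
echo "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}"
```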
Install Karpenter 1.8.1
Set the Karpenter version to 1.8.1; in general, picking the latest release is fine.
```bash
export KARPENTER_VERSION="1.8.1"
```
Use Helm to render the Karpenter deployment YAML template:
```bash
helm template karpenter oci://public.ecr.aws/karpenter/karpenter --version "${KARPENTER_VERSION}" --namespace "${KARPENTER_NAMESPACE}" \
    --set "settings.clusterName=${CLUSTER_NAME}" \
    --set "settings.interruptionQueue=${CLUSTER_NAME}" \
    --set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterControllerRole-${CLUSTER_NAME}" \
    --set controller.resources.requests.cpu=1 \
    --set controller.resources.requests.memory=1Gi \
    --set controller.resources.limits.cpu=1 \
    --set controller.resources.limits.memory=1Gi > karpenter.yaml
```
Edit the generated karpenter.yaml and add the node affinity entry below to the Karpenter Deployment, so the controller is pinned to your existing node group; make sure the value matches your actual EKS node group name.
```yaml
      - matchExpressions:
        - key: eks.amazonaws.com/nodegroup
          operator: In
          values:
          # Enter your own node group name(s) here
          - ${NODEGROUP}
```
Deploy Karpenter
```bash
kubectl create namespace "${KARPENTER_NAMESPACE}" || true
kubectl create -f \
    "https://raw.githubusercontent.com/aws/karpenter-provider-aws/v${KARPENTER_VERSION}/pkg/apis/crds/karpenter.sh_nodepools.yaml"
kubectl create -f \
    "https://raw.githubusercontent.com/aws/karpenter-provider-aws/v${KARPENTER_VERSION}/pkg/apis/crds/karpenter.k8s.aws_ec2nodeclasses.yaml"
kubectl create -f \
    "https://raw.githubusercontent.com/aws/karpenter-provider-aws/v${KARPENTER_VERSION}/pkg/apis/crds/karpenter.sh_nodeclaims.yaml"
kubectl apply -f karpenter.yaml
```
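Before moving on, you can wait for the controller Deployment to become ready (a quick check; the Deployment is named karpenter by the Helm chart):

```bash
kubectl rollout status deployment/karpenter -n "${KARPENTER_NAMESPACE}" --timeout=120s
kubectl get pods -n "${KARPENTER_NAMESPACE}" -l app.kubernetes.io/name=karpenter
```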
Create a default Karpenter NodePool (the successor to the old "provisioner"). Note that ${CLUSTER_NAME} must resolve to your own EKS cluster name:
```bash
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h # 30 * 24h = 720h
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: "KarpenterNodeRole-${CLUSTER_NAME}" # replace with your cluster name
  amiSelectorTerms:
    - alias: "al2023@latest"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}" # replace with your cluster name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}" # replace with your cluster name
EOF
```
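To confirm both custom resources were accepted by the API server (a quick check):

```bash
kubectl get nodepools
kubectl get ec2nodeclasses
```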
At this point the installation is complete. You can run the following command to check the logs; as long as there are no error messages, you are fine:
```bash
kubectl logs -f -n "${KARPENTER_NAMESPACE}" -l app.kubernetes.io/name=karpenter -c controller
```
If you see an error like "The specified queue does not exist", it is not a big problem; it appears because we have not yet set up the SQS queue that Karpenter uses to handle EC2 Spot interruption notices.
You can also check the status of the Karpenter pods:
```bash
kubectl get pods -n "${KARPENTER_NAMESPACE}" -l app.kubernetes.io/name=karpenter
```
Test Karpenter
We can deploy the following nginx Deployment to trigger Karpenter's automatic scaling of EC2 nodes. I intentionally set the replica count to 100 to force Karpenter in EKS to launch more EC2 instances.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eks-sample-linux-deployment
  namespace: default
  labels:
    app: eks-sample-linux-app
spec:
  replicas: 100
  selector:
    matchLabels:
      app: eks-sample-linux-app
  template:
    metadata:
      labels:
        app: eks-sample-linux-app
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - amd64
                      - arm64
      containers:
        - name: nginx
          image: public.ecr.aws/nginx/nginx:1.23
          ports:
            - name: http
              containerPort: 80
          imagePullPolicy: IfNotPresent
      nodeSelector:
        kubernetes.io/os: linux
```
Karpenter detects that the underlying EC2 capacity is insufficient and immediately launches a new EC2 instance to satisfy the resource demands of the many nginx pods.
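One way to watch the scale-out from the command line (a sketch; NodeClaims are the Karpenter resources that represent requested instances):

```bash
# In one terminal: watch the NodeClaims Karpenter creates for the pending pods
kubectl get nodeclaims -w

# In another terminal: watch the new EC2-backed nodes join the cluster
kubectl get nodes -w
```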
When we remove this Deployment, Karpenter automatically terminates the EC2 instances after a period of time. With this we have also simulated EKS automatically scaling EC2 capacity up and down based on load or resource demand; combined with HPA, you can achieve elasticity for both pods and EC2 instances. For a test demo of HPA in EKS, see Scale pod deployments with Horizontal Pod Autoscaler.
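A sketch of the teardown for this test; after the consolidation delay, Karpenter drains and terminates the extra nodes on its own:

```bash
# Remove the test workload
kubectl delete deployment eks-sample-linux-deployment -n default

# Check that the extra NodeClaims and nodes are eventually removed
kubectl get nodeclaims
kubectl get nodes
```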
