kserve-poc
This is a proof of concept for using Kserve as model-serving infrastructure. Its ability to scale each model individually, including scaling to zero, helps with both performance and cost.
Installation
Prerequisites
- A k8s cluster
- kubectl
- Helm
- A GCS bucket
Installing Kserve
0. Create a namespace for testing
kubectl create namespace kserve-test
Reference doc: https://kserve.github.io/website/master/admin/serverless/serverless/
1. Install Knative Serving
1.1 Install Knative Serving CRDs
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.13.1/serving-crds.yaml -n kserve-test
1.2 Install Knative Serving
Please note that the namespace here needs to be knative-serving, not the namespace that you created for testing.
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.13.1/serving-core.yaml -n knative-serving
2. Install Istio
2.1 Install Istio and CRD
kubectl apply -l knative.dev/crd-install=true -f https://github.com/knative/net-istio/releases/download/knative-v1.13.1/istio.yaml
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.13.1/istio.yaml
2.2 Install Istio controller
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.13.1/net-istio.yaml
2.3 Verify the installation
kubectl get pods -n knative-serving
You should see six pods running.
3. Install Certificate Manager
3.1 Install the cert-manager CRDs
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.crds.yaml
3.2 Install cert-manager with Helm
helm repo add jetstack https://charts.jetstack.io && helm install cert-manager-kserve --namespace kserve-test --version v1.14.4 jetstack/cert-manager
4. Install Kserve
4.1 Install Kserve CRD
helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version v0.12.0 -n kserve-test
4.2 Install Kserve
helm install kserve oci://ghcr.io/kserve/charts/kserve --version v0.12.0 --values values.yaml -n kserve-test
5. Set up a k8s secret
This secret lets Kserve access the model stored in GCS. Create a key for a service account that has access to the GCS bucket. Note: the key file must be named gcloud-application-credentials.json.
kubectl create secret generic storage-config --from-file=<path-to-key-file>/gcloud-application-credentials.json -n kserve-test
6. Create a GCS bucket for storing models
Make sure you are in the correct GCP project.
gcloud config set project <your-project-id>
gsutil mb gs://<your-bucket>
Deploy a model
1. Copy your model(s) to GCS
gsutil cp cluster_2_xgboost_forecast/model.ubj gs://<your-bucket>/<model-name>/
There is a sample model in the cluster_2_xgboost_forecast folder.
Note that the model file must be named model.xxx (for example, model.ubj).
2. Apply the model config to k8s
There are a few examples in the model_configs dir.
Modify the YAML files to match your model's path in GCS.
kubectl apply -f <path-to-your-model-config>.yaml -n kserve-test
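One gotcha when editing those YAML files: storageUri must point at the GCS directory that contains the model.xxx file, not at the file itself. A small sketch of the convention, using hypothetical bucket and directory names:

```shell
# Hypothetical names; substitute your own bucket and model directory.
BUCKET="my-bucket"
MODEL_DIR="cluster-2-xgboost"
# Point storageUri at the directory, not at .../model.ubj
STORAGE_URI="gs://${BUCKET}/${MODEL_DIR}"
echo "$STORAGE_URI"   # prints "gs://my-bucket/cluster-2-xgboost"
```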
3. Invoking the deployment with grpcurl (Optional)
Create a ConfigMap containing the proto file (it is later mounted into the client pod as a volume):
kubectl create configmap proto-files --from-file=grpc_predict_v2.proto -n kserve-test
export MODEL_NAME=<your-model-name>
export CONTENT_LENGTH=<content-length>
./grpcurl/model_infer.sh $MODEL_NAME $CONTENT_LENGTH
You can check the output in the calling-<your-model-name> pod.
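CONTENT_LENGTH is the number of FP32 input features in a single request. A quick local sketch, assuming dummy feature values 1..N, of how the scripts in this repo expand it into the request payload (the same seq/sed trick appears in the performance-test script):

```shell
CONTENT_LENGTH=5
# Build a comma-separated list "1,2,3,4,5" with no trailing comma;
# this becomes the fp32_contents array in the inference request.
FP32_CONTENTS=$(seq -s , 1 "$CONTENT_LENGTH" | sed 's/,$//')
echo "$FP32_CONTENTS"   # prints "1,2,3,4,5"
```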
Running Performance Tests
To start a performance test, first set the following environment variables:
export MODEL_NAME=<your-model-name>
export CONTENT_LENGTH=38
export CONCURRENCY=20
export RPS=20
The script below is what I used for the performance tests. Make sure to download grpc_predict_v2.proto from the Kserve website, and replace the model's path on GCS with your own.
#!/bin/bash
BASE_MODEL_NAME=$1
CONTENT_LENGTH=$2
CONCURRENCY=$3
RPS=$4
LOAD_SCHEDULE=$5
TIMEOUT=$6
MAX_DURATION=$7
SCALING_METRIC=$8
SCALING_TARGET=$9
LOAD_SCHEDULE=${LOAD_SCHEDULE:-"step"}
TIMEOUT=${TIMEOUT:-"30s"}
MAX_DURATION=${MAX_DURATION:-"180s"}
SCALING_METRIC=${SCALING_METRIC:-"rps"}
SCALING_TARGET=${SCALING_TARGET:-"10"}
MODEL_NAME=${BASE_MODEL_NAME}-${SCALING_METRIC}-${SCALING_TARGET}
echo "MODEL_NAME: $MODEL_NAME"
echo "CONTENT_LENGTH: $CONTENT_LENGTH"
echo "CONCURRENCY: $CONCURRENCY"
echo "RPS: $RPS"
echo "LOAD_SCHEDULE: $LOAD_SCHEDULE"
echo "TIMEOUT: $TIMEOUT"
echo "MAX_DURATION: $MAX_DURATION"
echo "SCALING_METRIC: $SCALING_METRIC"
echo "SCALING_TARGET: $SCALING_TARGET"
# Generate a list of numbers from 1 to CONTENT_LENGTH without a trailing comma
FP32_CONTENTS=$(seq -s , 1 $CONTENT_LENGTH | sed 's/,$//')
echo "FP32_CONTENTS: ${FP32_CONTENTS}"
CONFIG_JSON=$(cat <<EOF
{
  "proto": "/protos/grpc_predict_v2.proto",
  "call": "inference.GRPCInferenceService.ModelInfer",
  "total": 200,
  "concurrency": ${CONCURRENCY},
  "rps": ${RPS},
  "data": {
    "model_name": "${MODEL_NAME}",
    "inputs": [{
      "name": "predict",
      "shape": [1, $CONTENT_LENGTH],
      "datatype": "FP32",
      "contents": {
        "fp32_contents": [$FP32_CONTENTS]
      }
    }]
  },
  "metadata": {
    "foo": "bar",
    "trace_id": "{{.RequestNumber}}",
    "timestamp": "{{.TimestampUnix}}"
  },
  "load-schedule": "${LOAD_SCHEDULE}",
  "load-start": 10,
  "load-end": $RPS,
  "load-step": 5,
  "load-step-duration": "5s",
  "timeout": "${TIMEOUT}",
  "import-paths": [
    "/protos"
  ],
  "max-duration": "${MAX_DURATION}",
  "host": "${MODEL_NAME}.kserve-test.svc.cluster.local:80"
}
EOF
)
echo "CONFIG_JSON: ${CONFIG_JSON}"
kubectl create configmap ghz-config --from-literal=config.json="$CONFIG_JSON" -n kserve-test --dry-run=client -o yaml | kubectl apply -f - -n kserve-test
# Deploy the model for the performance test
cat <<EOF | kubectl apply -f -
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: ${MODEL_NAME}
  namespace: kserve-test
  annotations:
    serving.kserve.io/secretName: storage-config
spec:
  predictor:
    minReplicas: 0
    scaleTarget: ${SCALING_TARGET}
    scaleMetric: "${SCALING_METRIC}"
    model:
      resources:
        requests:
          cpu: "100m"      # Request 100 milli-CPUs
          memory: "200Mi"  # Request 200 MiB of memory
        limits:
          cpu: "500m"      # Limit to 500 milli-CPUs
          memory: "500Mi"  # Limit to 500 MiB of memory
      modelFormat:
        name: xgboost
      protocolVersion: v2
      # This should be the path to the model file, but NOT include the file's name
      # The file should be named like model.xxx
      storageUri: gs://<path-to-model>
      ports:
        - name: h2c  # knative expects grpc port name to be 'h2c'
          protocol: TCP
          containerPort: 9000
      readinessProbe:
        httpGet:
          path: /v2/models/${MODEL_NAME}/ready
          port: 8080
EOF
# Wait for the InferenceService to roll out before starting the load test
sleep 180
# Deploy the performance test job
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: perf-${MODEL_NAME}-rps-${RPS}-cc-${CONCURRENCY}-ls-${LOAD_SCHEDULE}
  namespace: kserve-test
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ghz
          image: obvionaoe/ghz
          args:
            - "--insecure"
            - "--config"
            - "/config/config.json"
          volumeMounts:
            - name: proto-files
              mountPath: "/protos"
            - name: config-volume
              mountPath: "/config"
      volumes:
        - name: proto-files
          configMap:
            name: proto-files
        - name: config-volume
          configMap:
            name: ghz-config
  backoffLimit: 4
EOF
To run the script:
./performance_test/model_infer-ghz.sh $MODEL_NAME $CONTENT_LENGTH $CONCURRENCY $RPS $LOAD_SCHEDULE $TIMEOUT $MAX_DURATION $SCALING_METRIC $SCALING_TARGET
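Arguments 5 through 9 are optional: the script falls back to defaults ("step", "30s", "180s", "rps", "10") using Bash's ${VAR:-default} expansion. A minimal illustration of that expansion:

```shell
# ${VAR:-default} yields the default only when VAR is unset or empty;
# this is how the script's optional trailing arguments get their values.
unset LOAD_SCHEDULE
LOAD_SCHEDULE=${LOAD_SCHEDULE:-"step"}
echo "$LOAD_SCHEDULE"   # prints "step"
```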
Example result:
│ Summary: │
│ Count: 1809 │
│ Total: 120.00 s │
│ Slowest: 2.90 s │
│ Fastest: 109.63 ms │
│ Average: 1.25 s │
│ Requests/sec: 15.07 │
│ │
│ Response time histogram: │
│ 109.631 [1] | │
│ 388.539 [6] |∎ │
│ 667.448 [15] |∎ │
│ 946.356 [11] |∎ │
│ 1225.264 [24] |∎∎ │
│ 1504.173 [148] |∎∎∎∎∎∎∎∎∎∎∎∎∎ │
│ 1783.081 [453] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ │
│ 2061.989 [268] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ │
│ 2340.898 [62] |∎∎∎∎∎ │
│ 2619.806 [25] |∎∎ │
│ 2898.714 [17] |∎∎ │
│ │
│ Latency distribution: │
│ 10 % in 1.40 s │
│ 25 % in 1.60 s │
│ 50 % in 1.70 s │
│ 75 % in 1.80 s │
│ 90 % in 2.09 s │
│ 95 % in 2.30 s │
│ 99 % in 2.70 s │
│ │
│ Status code distribution: │
│ [DeadlineExceeded] 19 responses │
│ [InvalidArgument] 739 responses │
│ [OK] 1030 responses │
│ [Unavailable] 21 responses │
│ │
│ Error distribution: │
│ [20] rpc error: code = Unavailable desc = error reading from server: read tcp 10.96.15.22:42192->10.100.9.175:80: use │
│ [1] rpc error: code = Unavailable desc = upstream request timeout │
│ [19] rpc error: code = DeadlineExceeded desc = context deadline exceeded │
│ [739] rpc error: code = InvalidArgument desc = Model my-model with version is not ready yet. │
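The large InvalidArgument count likely reflects requests that arrived while the model was still scaling up from zero (the "is not ready yet" error in the last line). A quick tally of the example run's status codes:

```shell
# Status-code counts copied from the example summary above
OK=1030; INVALID=739; DEADLINE=19; UNAVAILABLE=21
TOTAL=$(( OK + INVALID + DEADLINE + UNAVAILABLE ))
echo "total=$TOTAL ok_pct=$(( OK * 100 / TOTAL ))"   # prints "total=1809 ok_pct=56"
```

So only about 56% of requests succeeded in this run; raising minReplicas above zero (or warming the model before the test) would remove the cold-start failures at the cost of idle capacity.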