Inference methods
Amazon SageMaker offers a variety of inference options to suit different machine learning deployment needs[1][2]. These options include:
- Real-Time Inference: Serves predictions from a persistent endpoint with low-latency responses, suitable for applications requiring instant feedback such as fraud detection, recommendation systems, and chatbots[5][6].
- Serverless Inference: Automatically provisions and scales compute based on incoming traffic, so you pay only while requests are being processed; a good fit for workloads with intermittent or unpredictable traffic that can tolerate cold starts[1].
- Asynchronous Inference: Queues incoming requests, processes them asynchronously, and writes the results to Amazon S3, which suits large payloads or long processing times where callers can wait for results[6]. It is also cost-effective because the endpoint can scale its instance count down to zero when there is no traffic[5].
- Batch Transform: An offline inference option for running predictions over very large datasets that do not need a persistent endpoint[2]. SageMaker Processing offers a managed compute environment to run a custom batch inference container with a custom script[7]. Minimal, hedged deployment sketches for these options follow this list.
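To make the first two options concrete, here is a minimal sketch using the SageMaker Python SDK that deploys the same model behind a real-time endpoint and a serverless endpoint. The container image URI, model artifact path, IAM role, and endpoint names are placeholders chosen for illustration, not values from any source.

```python
import sagemaker
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/MySageMakerExecutionRole"  # placeholder IAM role


def make_model() -> Model:
    """Build a Model object; the image URI and artifact path are placeholders."""
    return Model(
        image_uri="111122223333.dkr.ecr.us-east-1.amazonaws.com/my-inference-image:latest",
        model_data="s3://my-bucket/models/model.tar.gz",
        role=role,
        sagemaker_session=session,
    )


# Real-time inference: a persistent HTTPS endpoint on provisioned instances.
realtime_predictor = make_model().deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="my-realtime-endpoint",
)

# Serverless inference: SageMaker provisions and scales compute automatically;
# you choose only the memory size and the maximum concurrent invocations.
serverless_predictor = make_model().deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=5,
    ),
    endpoint_name="my-serverless-endpoint",
)

# Once the endpoints are in service, both predictors are invoked the same way
# via predictor.predict(...); the payload format depends on your container.
```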
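A similar sketch covers the asynchronous and batch options. It reuses the placeholder make_model() helper from the previous sketch, and the S3 bucket, prefixes, and endpoint name are again illustrative assumptions.

```python
from sagemaker.async_inference import AsyncInferenceConfig

# Asynchronous inference: requests are queued and results land in S3, so the
# caller does not block on the HTTP response.
async_predictor = make_model().deploy(  # make_model() as defined in the previous sketch
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://my-bucket/async-outputs/",  # placeholder output location
    ),
    endpoint_name="my-async-endpoint",
)

# Submit a request by pointing at an input object in S3; the call returns
# immediately with the S3 location where the result will appear.
response = async_predictor.predict_async(
    input_path="s3://my-bucket/async-inputs/payload.json"
)
print(response.output_path)

# Batch Transform: an offline job that reads a whole dataset from S3, runs
# inference on transient instances, and writes predictions back to S3.
transformer = make_model().transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/batch-outputs/",
)
transformer.transform(
    data="s3://my-bucket/batch-inputs/dataset.csv",
    content_type="text/csv",
    split_type="Line",  # send the file to the container one line at a time
)
transformer.wait()
```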
Amazon SageMaker also provides options for optimizing inference performance and cost, such as deploying multiple models on a single endpoint, using serial inference pipelines, and autoscaling compute resources[1]. You can also shadow test models to validate their performance and use intelligent routing to reduce latency[1].
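As one hedged example of the cost-optimization side, the sketch below attaches a target-tracking autoscaling policy to a real-time endpoint variant through the Application Auto Scaling API in boto3. The endpoint name, capacity bounds, and target value are assumptions chosen for illustration.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

endpoint_name = "my-realtime-endpoint"   # placeholder endpoint
variant_name = "AllTraffic"              # default production variant name
resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

# Register the endpoint variant as a scalable target with instance-count bounds.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target-tracking policy: keep invocations per instance near the target value,
# scaling out quickly and scaling in more conservatively.
autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance-target",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```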