

Amazon SageMaker Feature Store is a fully managed, purpose-built repository for storing, sharing, and managing features for machine learning (ML) models. Features are the input variables a model uses during training and inference. Because the same features are reused by multiple teams, feature quality is critical to model accuracy. In addition, when features used to train models offline in batch are also served for real-time inference, it is hard to keep the offline and online copies synchronized. SageMaker Feature Store provides a secure, unified store to process, standardize, and use features at scale across the ML lifecycle.
Ingestion - You can ingest data from various sources into the feature store. Some of the sources are:
Processing - Through feature processing, you specify the data source and the feature transformation functions (e.g. count of product views or time-window aggregates), and SageMaker Feature Store transforms the data into ML features at ingestion time. With Amazon SageMaker Data Wrangler, you can publish features directly into SageMaker Feature Store.
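As a minimal sketch of the processing and ingestion steps above, the following Python computes a time-window aggregate (product views in the last 24 hours) and shapes each result into the list of FeatureName/ValueAsString pairs that the PutRecord API of the boto3 sagemaker-featurestore-runtime client accepts. The event layout, the feature names (customer_id, views_last_24h, event_time), and the helper names are assumptions made for illustration. The aggregation runs in plain Python here, not in a managed feature processing pipeline.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def count_views_in_window(events, now, window):
    """Count 'view' events per customer with a timestamp in [now - window, now]."""
    counts = defaultdict(int)
    for event in events:
        if event["type"] == "view" and now - window <= event["time"] <= now:
            counts[event["customer_id"]] += 1
    return dict(counts)


def to_record(customer_id, views, event_time):
    """Build one record in the FeatureName/ValueAsString shape used by PutRecord."""
    return [
        {"FeatureName": "customer_id", "ValueAsString": str(customer_id)},
        {"FeatureName": "views_last_24h", "ValueAsString": str(views)},
        # ISO-8601 UTC timestamp used as the record's event time.
        {"FeatureName": "event_time",
         "ValueAsString": event_time.strftime("%Y-%m-%dT%H:%M:%SZ")},
    ]


def ingest(records, feature_group_name):
    """Write records to a feature group. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the pure functions above run without AWS

    client = boto3.client("sagemaker-featurestore-runtime")
    for record in records:
        client.put_record(FeatureGroupName=feature_group_name, Record=record)


# Hypothetical raw events; the field names are illustrative only.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
events = [
    {"customer_id": "c1", "type": "view", "time": now - timedelta(hours=1)},
    {"customer_id": "c1", "type": "view", "time": now - timedelta(hours=23)},
    {"customer_id": "c1", "type": "view", "time": now - timedelta(hours=25)},
    {"customer_id": "c2", "type": "purchase", "time": now - timedelta(hours=2)},
    {"customer_id": "c2", "type": "view", "time": now - timedelta(hours=3)},
]

counts = count_views_in_window(events, now, timedelta(hours=24))
records = [to_record(cid, n, now) for cid, n in sorted(counts.items())]
```

Calling ingest(records, "<your-feature-group>") would then write the records, assuming a feature group with matching feature definitions already exists.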
The Store - Feature Store tags and indexes feature groups so that they are easily discoverable through the visual interface of SageMaker Studio.
The Catalog - The catalog allows teams to discover existing features they can confidently reuse, avoiding duplicate pipelines. SageMaker Feature Store uses the AWS Glue Data Catalog by default but allows you to use a different catalog if desired. You can also query features using familiar SQL with Amazon Athena or another query tool of your choice.
Consistency - During training, models often use the complete dataset and can take hours to complete, while inference needs to happen in milliseconds and usually uses a subset of the data. When the offline and online stores are used together, SageMaker Feature Store keeps the two in sync, which is critical because divergence between them can degrade model accuracy.
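One way to picture how a single write path keeps both stores consistent is the toy in-memory model below. It is a sketch, not the service's implementation: the class TinyFeatureStore and its methods are hypothetical. It assumes the offline store keeps every record as history while the online store keeps only the record with the latest event time per identifier. How ties on event time are handled here is also an assumption.

```python
class TinyFeatureStore:
    """Toy model: one write updates an append-only history and a latest-value view."""

    def __init__(self):
        self.online = {}   # record id -> record with the latest event time
        self.offline = []  # every record ever written, in arrival order

    def put_record(self, record_id, event_time, features):
        record = {"record_id": record_id, "event_time": event_time, **features}
        # The offline copy always receives the record, including late arrivals.
        self.offline.append(record)
        # The online copy is replaced only by a strictly newer event time.
        current = self.online.get(record_id)
        if current is None or event_time > current["event_time"]:
            self.online[record_id] = record

    def get_record(self, record_id):
        """Low-latency lookup of the latest known features for one identifier."""
        return self.online.get(record_id)


store = TinyFeatureStore()
store.put_record("c1", 1, {"views": 3})
store.put_record("c1", 3, {"views": 7})
store.put_record("c1", 2, {"views": 5})  # late-arriving historical record

latest = store.get_record("c1")
history_size = len(store.offline)
```

Because training reads the history and inference reads the latest view, both populated by the same put_record call, the two cannot drift apart through separate pipelines.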
Tracking - It is important to know how features were built and which models and endpoints are using them. SageMaker Feature Store allows data scientists to track their features in Amazon SageMaker Studio with SageMaker Lineage. SageMaker Lineage lets you track scheduled pipeline executions, visualize upstream lineage to trace features back to their data sources, and view feature processing code, all in one environment.
Time Travel - You may need to train models with the exact set of feature values from a specific time in the past, without the risk of including data from beyond that time (also referred to as feature leakage), for example patient medical data before a diagnosis. Point-in-time queries retrieve the state of each feature at the historical time of interest.
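The idea behind a point-in-time lookup can be sketched in a few lines of plain Python. This is an illustration of the concept only, not the Feature Store query API. The function name feature_as_of, the (event_time, value) layout of the history, and the sample lab values are all hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def feature_as_of(history, as_of):
    """Return the most recent value whose event time is <= as_of, else None.

    history is an iterable of (event_time, value) pairs in any order.
    """
    ordered = sorted(history, key=lambda pair: pair[0])
    times = [event_time for event_time, _ in ordered]
    # bisect_right finds the first position strictly after as_of, so the
    # entry just before it is the latest one that does not leak future data.
    idx = bisect_right(times, as_of)
    return ordered[idx - 1][1] if idx else None


# Hypothetical lab measurements for one patient.
history = [
    (datetime(2023, 1, 10), 5.4),
    (datetime(2023, 3, 5), 6.1),
    (datetime(2023, 6, 20), 7.3),  # recorded after the diagnosis
]
diagnosis_time = datetime(2023, 4, 1)

value_at_diagnosis = feature_as_of(history, diagnosis_time)
value_before_any_data = feature_as_of(history, datetime(2022, 12, 31))
```

The June measurement is excluded from the training example because it was recorded after the diagnosis, which is the leakage the paragraph above warns about.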
MLOps - Feature stores manage datasets and feature pipelines, speeding up data science tasks and eliminating the duplicate work of creating the same features multiple times.
Security & Compliance - To support security and compliance needs, you may need granular control over how shared ML features are accessed. These needs often go beyond table and column-level access control to individual row-level access control. For example, you may want to let account representatives see rows from a sales table for only their accounts and mask the prefix of sensitive data like credit card numbers. SageMaker Feature Store together with AWS Lake Formation can be used to implement fine-grained access controls to protect feature store data and grant access based on role.
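To make the account-representative example concrete, the sketch below applies a row-level filter and a prefix mask in plain Python. It only illustrates the effect of such a policy. In practice the controls would be defined in AWS Lake Formation, not in application code. The function names, the column names (account_rep, card_number), and the sample rows are hypothetical.

```python
def mask_prefix(value, visible=4, mask_char="*"):
    """Hide all but the last `visible` characters of a sensitive string."""
    if len(value) <= visible:
        return value
    return mask_char * (len(value) - visible) + value[-visible:]


def rows_for_rep(rows, rep_id):
    """Return only the rows owned by rep_id, with card numbers masked."""
    return [
        {**row, "card_number": mask_prefix(row["card_number"])}
        for row in rows
        if row["account_rep"] == rep_id
    ]


# Hypothetical sales rows.
sales = [
    {"account": "A", "account_rep": "rep-1", "card_number": "4111111111111111"},
    {"account": "B", "account_rep": "rep-2", "card_number": "5500000000000004"},
]

visible_rows = rows_for_rep(sales, "rep-1")
```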
Practical Implementations
Sources