Data Preprocessing
Cortex automatically connects every step of the Machine Learning process into end-to-end Machine Learning Pipelines that anyone in your business can run. In this guide, we’ll discuss the first step of a Cortex pipeline: data preprocessing.
What is data preprocessing?
Cortex’s data preprocessing step has two main goals:
- Loading raw data into Cortex from disparate sources
- Merging that data into a unified view
Cortex can ingest multiple datasets from multiple data sources. In other words, you can send User Events separately from User Attributes, and even within the User Events dataset, you can send mobile events in a separate feed from web events. Regardless of type or source, Cortex automatically merges all of your data into a single unified view.
Why does it matter?
Consolidating your business’s data sets the stage for it to flow through the rest of the ML Pipeline and be used to generate predictions. For example, users can purchase both online and in-store; by combining those separate data sources, we get a fuller picture of the user than by analyzing each one independently. Likewise, your CRM may store user behavior separately from user attributes, and ensuring those are tied together in Cortex leads to more accurate predictions.
Which data preprocessing techniques does Cortex use?
Loading Data
Cortex offers flexible connectors so that your business’s datasets (e.g. events, attributes) can be ingested from various sources (e.g. Amazon S3, Google BigQuery, Adobe Analytics, API, MRSS) and in various formats (e.g. JSON, CSV, TSV, PSV, Parquet). This data can either be streamed into Cortex in real time or uploaded as batched files on a recurring schedule.
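Cortex handles this ingestion for you, but as a rough illustration of what loading different formats into one consistent table can look like, here is a minimal Python sketch using pandas. The feeds, column names, and values below are hypothetical and are not part of Cortex’s actual connectors.

    import io
    import pandas as pd

    # Hypothetical batch feed delivered as CSV (e.g. a recurring file upload).
    csv_feed = io.StringIO(
        "user_id,event,timestamp\n"
        "u1,purchase,2024-01-01T10:00:00\n"
        "u2,page_view,2024-01-01T10:05:00\n"
    )

    # Hypothetical streaming feed delivered as newline-delimited JSON events.
    json_feed = io.StringIO(
        '{"user_id": "u3", "event": "purchase", "timestamp": "2024-01-01T10:07:00"}\n'
        '{"user_id": "u1", "event": "page_view", "timestamp": "2024-01-01T10:09:00"}\n'
    )

    # Each format is parsed into the same tabular structure...
    csv_events = pd.read_csv(csv_feed, parse_dates=["timestamp"])
    json_events = pd.read_json(json_feed, lines=True, convert_dates=["timestamp"])

    # ...so downstream steps see one consistent events table regardless of source format.
    events = pd.concat([csv_events, json_events], ignore_index=True)
    print(events)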
Merging Data
Once loaded, Cortex merges each source into a comprehensive view of your business’s data. Some example preprocessing techniques used by Cortex include the following (the sketches after this list illustrate a few of them):
- Dataset matching: If a single dataset (e.g. user events) is fed by multiple sources, Cortex automatically joins the data into a single table. For example, user transactions may be sent into Cortex through a different feed than online browsing behavior, but both will ultimately be recognized by Cortex as user events.
- Object matching: Cortex links each object’s datasets together (for example, each user’s events and attributes) to get a complete picture of the user.
- ID stitching: If an object is associated with multiple identifiers, Cortex will stitch together data from each of these IDs. For example, Cortex can be supplied a mapping table so that various cookie IDs are consolidated into a single representation of the user.
- Attribute timestamping: Cortex stores attributes with a timestamp indicating when each was loaded. This helps prevent data leakage later in the pipeline when an object’s attributes are matched up with its events. For example, by recording that Item ABC cost $10 on January 1st but changed to $15 on January 8th, your pipeline will know the exact price at the time of each purchase event.
- Numerical transformations: Cortex transforms categorical and other non-numerical data into a numerical form that can be processed and used in mathematical operations.
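Cortex performs these merges automatically, but the following Python sketch (using pandas, with purely hypothetical tables, IDs, and column names) illustrates the general ideas behind dataset matching, ID stitching, and object matching; it is not Cortex’s implementation.

    import pandas as pd

    # Two hypothetical feeds of the same dataset (user events):
    # purchase transactions and online browsing behavior.
    transactions = pd.DataFrame({
        "cookie_id": ["c1", "c2"],
        "event": ["purchase", "purchase"],
        "revenue": [10.0, 15.0],
    })
    browsing = pd.DataFrame({
        "cookie_id": ["c3", "c1"],
        "event": ["page_view", "page_view"],
    })

    # Dataset matching: both feeds are combined into a single user-events table
    # (columns are aligned, and values missing from a feed become NaN).
    events = pd.concat([transactions, browsing], ignore_index=True)

    # ID stitching: a mapping table consolidates multiple cookie IDs into one user.
    id_map = pd.DataFrame({
        "cookie_id": ["c1", "c2", "c3"],
        "user_id": ["u1", "u1", "u2"],  # c1 and c2 belong to the same user
    })
    events = events.merge(id_map, on="cookie_id", how="left")

    # Object matching: link each user's events to that user's attributes.
    attributes = pd.DataFrame({
        "user_id": ["u1", "u2"],
        "country": ["US", "CA"],
    })
    unified = events.merge(attributes, on="user_id", how="left")
    print(unified)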
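The attribute timestamps described above enable a point-in-time (as-of) join between events and attribute values. Here is a minimal sketch of that idea using the Item ABC pricing example, with hypothetical table and column names; it is not Cortex’s implementation.

    import pandas as pd

    # Hypothetical price history: each attribute value is stored with the time it was loaded.
    price_history = pd.DataFrame({
        "item_id": ["ABC", "ABC"],
        "loaded_at": pd.to_datetime(["2024-01-01", "2024-01-08"]),
        "price": [10.0, 15.0],
    })

    # Purchase events for the same item at different times.
    purchases = pd.DataFrame({
        "item_id": ["ABC", "ABC"],
        "event_time": pd.to_datetime(["2024-01-03", "2024-01-10"]),
    })

    # A point-in-time join picks the attribute value that was current at each event,
    # so the January 3rd purchase sees $10 and the January 10th purchase sees $15.
    joined = pd.merge_asof(
        purchases.sort_values("event_time"),
        price_history.sort_values("loaded_at"),
        left_on="event_time",
        right_on="loaded_at",
        by="item_id",
    )
    print(joined)

Both tables must be sorted by their time columns; the default backward matching picks the most recent attribute value at or before each event, which is what keeps future attribute values from leaking into past events.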
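Finally, one common numerical transformation is one-hot encoding, which turns a categorical attribute into 0/1 indicator columns. The sketch below is illustrative only; the attribute and its values are hypothetical, and Cortex may apply different transformations.

    import pandas as pd

    # Hypothetical categorical attribute that needs a numerical representation.
    users = pd.DataFrame({
        "user_id": ["u1", "u2", "u3"],
        "plan": ["free", "premium", "free"],
    })

    # One-hot encoding: each category becomes its own 0/1 column,
    # so the data can be used in mathematical operations downstream.
    encoded = pd.get_dummies(users, columns=["plan"], dtype=int)
    print(encoded)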
Related Links
- What is a Machine Learning Pipeline?
- Data Cleaning
- Feature Engineering
- Model Selection
- Prediction Generation
Still have questions? Reach out to support@mparticle.com for more info!