In the development of any moderately complex predictive algorithm, many factors and decisions need to be evaluated thoroughly and quantitatively. Even fairly simple predictive algorithms have 10 or more variables that directly affect overall performance. Determining which algorithmic methods and associated settings yield the highest overall predictive performance is therefore a critical step in developing high-performing predictive algorithms. Medici has developed a robust and effective system for this development scenario, which it uses to the benefit of its client base.
Rapid Assessment System and Design of Experiments
Medici’s Rapid Assessment System (RAS) is a highly modular data-processing system composed of a large code base, a computing cluster, an algorithm status messaging system, and results databases. The Rapid Assessment System can quickly create a large number of “processing recipes” that contain varying combinations of preprocessing techniques, feature selection methods, noise filtering, regression methods, and sub-modeling. These recipes are evaluated in experiments that can involve hundreds to thousands of different processing recipes within a controlled framework.
The Rapid Assessment System rests on four framework pillars that support the systematic development of predictive models: Strict and Efficient Cross Validation; Modular Modeling Environment; Efficient Searching of Entire Modeling Space; and Messaging System and Searchable Results Database.
Medici Algorithm Development Tools
Strict and Efficient Cross Validation
Single-sample-out (leave-one-out) cross validation (CV) is inefficient and may not adequately characterize model stability. RAS instead uses repeated N-fold CV and can use any combination of parameters to define rigid hold-out rules (e.g. subject-out, where all samples from a given subject are held out together). Additionally, variables thought to have spurious correlation with the target variable can be statistically balanced between training and validation sets to avoid model bias.
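The repeated N-fold scheme with a subject-out rule can be sketched as follows. This is a minimal illustration using only the Python standard library, not Medici's implementation; the function name `repeated_group_kfold`, its parameters, and the sample subject labels are all assumptions made for the example.

```python
import random
from collections import defaultdict

def repeated_group_kfold(groups, n_folds=5, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, valid_idx), holding out whole groups.

    Hypothetical helper: every sample that shares a group label (e.g. a
    subject ID) lands on the same side of each split.
    """
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    unique = sorted(by_group)
    if n_folds > len(unique):
        raise ValueError("n_folds cannot exceed the number of distinct groups")
    rng = random.Random(seed)
    for rep in range(n_repeats):
        shuffled = unique[:]
        rng.shuffle(shuffled)  # a new group-to-fold assignment per repetition
        for fold in range(n_folds):
            held_out = set(shuffled[fold::n_folds])
            valid = [i for i, g in enumerate(groups) if g in held_out]
            train = [i for i, g in enumerate(groups) if g not in held_out]
            yield rep, fold, train, valid

# Hypothetical data: 12 samples drawn from 6 subjects.
subjects = ["s1", "s1", "s2", "s2", "s3", "s3",
            "s4", "s4", "s5", "s5", "s6", "s6"]
splits = list(repeated_group_kfold(subjects, n_folds=3, n_repeats=2))
```

Each repetition reshuffles the subjects before assigning them to folds, so the repeated runs give a spread of validation results rather than a single estimate.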
Modular Modeling Environment
The Rapid Assessment System uses a sophisticated application programming interface (API) for rapidly defining a modeling “recipe” from the existing toolbox of methods and for plugging any analytical method into the system for evaluation on the dataset. Modeling recipes can be rapidly prototyped and evaluated, and their performance results compared directly. Training and test dataset separation is strictly enforced during the multiple N-fold cross validation repetitions.
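One way to picture a plug-in recipe interface is a registry of named processing steps that a recipe chains together. The sketch below is an assumption about how such an interface could look, not the RAS API; `REGISTRY`, `register`, `run_recipe`, and the three toy steps are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry mapping a step name to a processing function.
REGISTRY: Dict[str, Callable[..., List[float]]] = {}

def register(name: str):
    """Decorator that plugs a new method into the toolbox under `name`."""
    def decorator(fn):
        REGISTRY[name] = fn
        return fn
    return decorator

@register("mean_center")
def mean_center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

@register("scale")
def scale(values, factor=1.0):
    return [v * factor for v in values]

@register("clip")
def clip(values, low=-1.0, high=1.0):
    return [min(max(v, low), high) for v in values]

def run_recipe(recipe: List[Tuple[str, dict]], values: List[float]) -> List[float]:
    """Apply each (step name, parameters) pair of the recipe in order."""
    for name, params in recipe:
        values = REGISTRY[name](values, **params)
    return values

# A recipe is just an ordered list of step names and their settings.
recipe = [("mean_center", {}),
          ("scale", {"factor": 2.0}),
          ("clip", {"low": -3.0, "high": 3.0})]
result = run_recipe(recipe, [1.0, 2.0, 3.0, 6.0])
print(result)  # [-3.0, -2.0, 0.0, 3.0]
```

Because a recipe is plain data, two recipes that differ in one step or one setting can be generated, stored, and compared without changing any code.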
Efficient Searching of Entire Modeling Space
Medici’s RAS includes tools for defining the modeling space to be explored, which can include different combinations or sequencing of methods as well as changes to method parameters, with up to dozens of degrees of freedom. The modeling recipe for any possible combination in the modeling space can be rapidly and automatically generated. A design of experiments (DOE) approach enables rapid characterization of the entire modeling space without evaluating every combination, even when the space contains millions or billions of possible models.
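To illustrate how a modeling space can be defined and sampled without enumerating it, the sketch below decodes an integer index into one combination of factor levels and draws a random subset of indices. Simple random sampling stands in here for whatever experimental design RAS actually uses, which the text does not specify; the factor names, levels, and helper functions are hypothetical.

```python
import math
import random

# Hypothetical modeling space: each factor lists its candidate levels.
space = {
    "preprocessing": ["none", "mean_center", "derivative"],
    "feature_selection": ["all", "top_50", "top_200"],
    "noise_filter": ["none", "smoothing"],
    "regression": ["pls", "ridge", "svr"],
    "n_components": [2, 4, 8, 16],
}

def space_size(space):
    """Number of distinct recipes in the full factorial space."""
    return math.prod(len(levels) for levels in space.values())

def recipe_at(space, index):
    """Decode a mixed-radix index into exactly one combination of levels."""
    recipe = {}
    for factor, levels in space.items():
        index, pos = divmod(index, len(levels))
        recipe[factor] = levels[pos]
    return recipe

def sample_recipes(space, n, seed=0):
    """Draw n distinct recipes at random without building the full space."""
    rng = random.Random(seed)
    indices = rng.sample(range(space_size(space)), n)
    return [recipe_at(space, i) for i in indices]

total = space_size(space)           # 3 * 3 * 2 * 3 * 4 = 216
design = sample_recipes(space, 10)  # 10 of the 216 possible recipes
```

Because `recipe_at` generates a recipe directly from its index, the same approach scales to spaces far too large to list in memory.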
Messaging System
The messaging system provides real-time updates on progress, model internals, and interim results. Large-scale algorithm development often leaves data scientists blind to intermediate results and trends. The RAS messaging system allows every stage of the algorithm development process to store diagnostic information as desired. Because this information is immediately available, long-running recipes can be tracked in real time, which allows better management of computing time. Further, trends across recipes can be calculated and assessed. Because the messaging system uses a document storage database, the algorithm designer is free to store almost anything. Tools for quickly searching messages and preparing dashboards or progress reports are incorporated in the system.
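The idea of schema-free diagnostic messages can be sketched with an in-memory stand-in for a document store. This is an illustration only: the `MessageStore` class, its `post` and `find` methods, and the recipe IDs and metric values are hypothetical and do not describe the actual RAS messaging system.

```python
import time

class MessageStore:
    """Toy in-memory stand-in for a document storage database."""

    def __init__(self):
        self._docs = []

    def post(self, recipe_id, stage, **payload):
        """Store one diagnostic message; the payload can hold any fields."""
        doc = {"recipe_id": recipe_id, "stage": stage,
               "time": time.time(), **payload}
        self._docs.append(doc)
        return doc

    def find(self, **criteria):
        """Return every message whose fields match all given criteria."""
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in criteria.items())]

store = MessageStore()
# Different stages are free to report different fields.
store.post("r001", "cross_validation", fold=1, rmse=0.42)
store.post("r001", "cross_validation", fold=2, rmse=0.40)
store.post("r002", "feature_selection", n_features=50)

# Interim result for a recipe that is still running.
cv_msgs = store.find(recipe_id="r001", stage="cross_validation")
mean_rmse = sum(m["rmse"] for m in cv_msgs) / len(cv_msgs)
```

Querying messages while a recipe is still running is what makes it possible to stop an unpromising run early and free the computing time for other recipes.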
Searchable Results Database
The RAS database stores the specifics of all performance parameters calculated on every recipe for rapid analysis. A leaderboard of current results can be instantaneously generated at any time during the development process.
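A leaderboard query over stored results might look like the sketch below, which uses Python's built-in `sqlite3` module with an in-memory database. The table layout, the `leaderboard` function, and the sample values are assumptions made for illustration; the source does not describe the actual RAS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (recipe, metric) pair.
conn.execute("CREATE TABLE results (recipe_id TEXT, metric TEXT, value REAL)")
rows = [
    ("r001", "rmse", 0.41),
    ("r002", "rmse", 0.37),
    ("r003", "rmse", 0.52),
    ("r001", "r2", 0.88),
]
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)

def leaderboard(conn, metric, top_n=3):
    """Return the top_n recipes for a metric where lower values are better."""
    query = (
        "SELECT recipe_id, value FROM results "
        "WHERE metric = ? ORDER BY value ASC, recipe_id ASC LIMIT ?"
    )
    return conn.execute(query, (metric, top_n)).fetchall()

top = leaderboard(conn, "rmse", top_n=2)
print(top)  # [('r002', 0.37), ('r001', 0.41)]
```

The query can be run at any point during development, since it only reads whatever results have been written so far.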
The Rapid Assessment System is an analytical toolbox composed of algorithms for data pretreatment, feature selection, regression modeling and classification model development. The system allows the user to effectively “plug-and-play” different processing approaches to determine the overall predictive capability of an integrated processing scheme. A given processing approach is referred to as a processing recipe. All recipes as well as their outputs are automatically saved and archived in a database.