Machine Learning Models use machine learning algorithms to detect duplicate records.  When a new model is created, it needs to be trained. Supervised training is a process in which you provide examples of duplicate and distinct records so that DataGroomr can learn to identify patterns in the data based on the fields you specified.


Tip: A model may be trained multiple times to improve accuracy.


When you press the TRAIN button, DataGroomr analyzes your existing data to identify duplicate (and non-duplicate) sets of records.  The time required depends on the number of records in your Salesforce environment and the number of fields selected.  You may exit this window and return at any time.



You will be shown sets of potentially duplicate records along with three options:

  1. YES - the records are duplicates
  2. NO - the records are not duplicates
  3. NOT SURE - you cannot determine whether the records are duplicates

We recommend identifying 5 duplicate sets (positive examples) and 5 non-duplicate sets (negative examples) for each field included in the model.  For example, if your model consists of 4 fields, you should review at least 40 sets.  Once a sufficient number of sets has been reviewed, the FINISH button becomes active.  Pressing this button opens a confirmation window with additional information.
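The recommendation above is simple arithmetic, and the following minimal sketch shows one way to work it out for a model of any size. It is only an illustration: the function names `recommended_sets` and `remaining_reviews` are hypothetical and are not part of DataGroomr, and the sketch assumes that NOT SURE answers do not count toward either total. The point at which the FINISH button actually becomes active is decided by DataGroomr, not by this calculation.

```python
from collections import Counter

POSITIVE_PER_FIELD = 5  # recommended duplicate sets per field
NEGATIVE_PER_FIELD = 5  # recommended non-duplicate sets per field


def recommended_sets(num_fields: int) -> int:
    """Total sets to review: (5 positive + 5 negative) for each field."""
    if num_fields < 1:
        raise ValueError("a model needs at least one field")
    return num_fields * (POSITIVE_PER_FIELD + NEGATIVE_PER_FIELD)


def remaining_reviews(num_fields: int, answers: list[str]) -> dict[str, int]:
    """How many more YES and NO answers are needed to meet the recommendation.

    Assumption: NOT SURE answers are ignored because they are neither
    positive nor negative examples.
    """
    counts = Counter(answers)
    return {
        "YES": max(0, num_fields * POSITIVE_PER_FIELD - counts["YES"]),
        "NO": max(0, num_fields * NEGATIVE_PER_FIELD - counts["NO"]),
    }


if __name__ == "__main__":
    # A model with 4 fields, matching the example in the text.
    print(recommended_sets(4))
    # Progress after 12 YES, 20 NO, and 3 NOT SURE answers.
    print(remaining_reviews(4, ["YES"] * 12 + ["NO"] * 20 + ["NOT SURE"] * 3))
```

For a 4-field model this gives the 40 sets mentioned above, split evenly into 20 positive and 20 negative examples.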


Press the CONFIRM button to activate the model.


Retraining an Existing Model


Occasionally, an existing model needs to be retrained.  This can happen when a model is not detecting all the duplicates or when an important new field has been added to Salesforce.


Notice that an existing model has a version number.  To retrain this model, select it from the list and press OPEN.



Then press the TRAIN button, followed by CONFIRM, to create a new version.



On the next screen, you are asked to evaluate whether a pair of records are duplicates. The process is the same as training a new model, except that training picks up where the previous version stopped.


Press the FINISH button to complete the retraining.  Note that a new version of the model will be created.


Good to know:  All versions of a model remain available for selection by a dataset.