To protect user privacy and ensure training quality, BigCode annotated 400 samples before training the model, then established and continuously improved RegEx rules to remove email addresses, keys, IP addresses, and other sensitive information from the code in the dataset.
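As a rough illustration of what RegEx-based redaction of this kind can look like, here is a minimal Python sketch. The patterns, the placeholder strings, and the `redact` function are assumptions made for this example; they are not BigCode's actual rules, which the article does not reproduce.

```python
import re

# Illustrative patterns only. These are assumptions for the example,
# not the rules BigCode actually used.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# IPv4 addresses with each octet limited to 0-255.
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)

# Hypothetical "key" heuristic: a quoted token of 16+ characters assigned to
# a name that contains "key", "secret", or "token" (case-insensitive).
KEY_RE = re.compile(
    r"""(?i)\b([\w-]*(?:key|secret|token)[\w-]*\s*[:=]\s*)(["'])[A-Za-z0-9_\-+/=]{16,}\2"""
)


def redact(source: str) -> str:
    """Replace matches of the patterns above with placeholder tags."""
    # Keys go first so the assignment prefix and the quotes are preserved.
    source = KEY_RE.sub(r"\1\2<KEY>\2", source)
    source = EMAIL_RE.sub("<EMAIL>", source)
    source = IPV4_RE.sub("<IP_ADDRESS>", source)
    return source


if __name__ == "__main__":
    snippet = 'API_KEY = "abcd1234efgh5678ijkl"\nadmin = "jane.doe@example.com"'
    print(redact(snippet))
```

Simple patterns like these will miss some cases and flag others incorrectly, which is consistent with the article's point that the rules were checked against annotated samples and improved over time.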
So that developers can use code generated by SantaCoder with confidence, BigCode launched a Dataset Search tool. With it, developers can trace the source of generated code and, if that code belongs to an existing project, comply with the corresponding licensing requirements.
In addition, BigCode launched the “Am I in The Stack?” tool, which lets developers check whether repositories under their name are part of the training dataset and remove their open source repositories from it.
BigCode currently provides a demo of SantaCoder on the Hugging Face website for anyone to explore and try.