machine learning - How to randomly split a dataset into training set, test set, and dev set in Python?
I have a large dataset and want to randomly split it into 70% train, 25% test, and 5% dev. How can I do this in Python with scikit-learn?
I wonder whether I can use the sklearn.cross_validation.train_test_split(*arrays, **options) function, as in the example at the following link?
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
You can use:
    import numpy as np
    from numpy.random import multinomial

    n_total_samples = 1000  # or whatever
    indices = np.arange(n_total_samples)
    # Draw one category per sample (0 = train, 1 = test, 2 = dev) with the given probabilities
    inds_split = multinomial(n=1, pvals=[0.7, 0.25, 0.05], size=n_total_samples).argmax(axis=1)
    train_inds = indices[inds_split == 0]
    test_inds = indices[inds_split == 1]
    dev_inds = indices[inds_split == 2]
    # The observed fractions vary from run to run
    print(len(train_inds) / float(n_total_samples))  # => 0.713
    print(len(test_inds) / float(n_total_samples))   # => 0.24
    print(len(dev_inds) / float(n_total_samples))    # => 0.047
It's not as pretty as a built-in function, but I believe it does what you need.
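Note that the multinomial draw only hits the 70/25/5 proportions approximately, as the printed fractions show. If exact set sizes matter, one alternative is to shuffle the indices once and cut them at fixed points. The following is a minimal sketch of that idea, not part of the original answer; the helper name split_indices, its fractions argument, and the fixed seed are all assumptions made for illustration.

```python
import numpy as np

def split_indices(n_total_samples, fractions=(0.7, 0.25, 0.05), seed=0):
    """Hypothetical helper: shuffle 0..n-1 and cut into consecutive chunks."""
    rng = np.random.default_rng(seed)  # seed is fixed here only for reproducibility
    shuffled = rng.permutation(n_total_samples)
    # Cut points come from the cumulative fractions; the last chunk takes the remainder.
    # Rounding guards against floating-point results such as 949.9999999.
    cuts = np.round(np.cumsum(fractions)[:-1] * n_total_samples).astype(int)
    return np.split(shuffled, cuts)

train_inds, test_inds, dev_inds = split_indices(1000)
print(len(train_inds), len(test_inds), len(dev_inds))  # 700 250 50
```

The returned index arrays can be used the same way as train_inds, test_inds, and dev_inds above, for example data[train_inds] for a NumPy array named data.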