xenonpy.datatools package
Submodules
xenonpy.datatools.dataset module
- class xenonpy.datatools.dataset.Dataset(*paths, backend='pandas', prefix=None)[source]
Bases:
object
- classmethod from_http(url, save_to, *, filename=None, chunk_size=262144, params=None, **kwargs)[source]
Get a file object via an HTTP request.
- Parameters:
url (str) – The resource url.
save_to (str) – The path of a dir to save the downloaded object into.
filename (str, optional) – Specify the file name when saving. Set to None (default) to use a name inferred from the HTTP header.
chunk_size (int, optional) – Chunk size.
params (any, optional) – Parameters to be passed to the requests.get function. See Also: requests
kwargs (dict, optional) – Passed to the requests.get function as keyword parameters.
- Returns:
File path containing the file name.
- Return type:
str
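A minimal usage sketch; the URL, save_to dir, and resulting file name below are hypothetical placeholders.
>>> from xenonpy.datatools.dataset import Dataset
>>> # download a file and let the name be inferred from the HTTP header;
>>> # the URL and save_to path are placeholders
>>> path = Dataset.from_http('https://example.com/files/data.csv', save_to='/tmp/downloads')
>>> path
'/tmp/downloads/data.csv'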
- property csv
- property excel
- property pandas
- property pickle
xenonpy.datatools.preset module
- class xenonpy.datatools.preset.Preset(*args, **kwargs)[source]
Bases:
Dataset
Load data from the datasets embedded in XenonPy or from user-created data saved in the
~/.xenonpy/cached
dir. Data can also be fetched via HTTP request. The following is a sample to demonstrate how to use it. Also see the parameter documentation for details.
>>> from xenonpy.datatools import preset
>>> elements = preset.elements
>>> elements.info()
<class 'pandas.core.frame.DataFrame'>
Index: 118 entries, H to Og
Data columns (total 74 columns):
atomic_number         118 non-null int64
atomic_radius         88 non-null float64
atomic_radius_rahm    96 non-null float64
atomic_volume         91 non-null float64
atomic_weight         118 non-null float64
boiling_point         96 non-null float64
brinell_hardness      59 non-null float64
bulk_modulus          69 non-null float64
...
- sync(data, to=None)[source]
Load data.
Note
Try to load the data from the local ~/.xenonpy/dataset dir. If the data are not found there, try to fetch them from the remote repository.
- Parameters:
data (str) – Name of the data.
to (str) – The version of the repository. See Also: https://github.com/yoshida-lab/dataset/releases
- Returns:
ret
- Return type:
DataFrame or Saver or local file path.
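A minimal sketch of a sync() call, using the 'elements' dataset name from the example above; the release tag shown is illustrative.
>>> from xenonpy.datatools import preset
>>> preset.sync('elements')             # load 'elements' locally, or fetch it from the remote repository
>>> preset.sync('elements', to='v0.1')  # pin to a specific repository release; the tag name is illustrative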
- property atom_init
The initialization vector for each element.
See Also: https://github.com/txie-93/cgcnn#usage
xenonpy.datatools.splitter module
- class xenonpy.datatools.splitter.Splitter(size, *, test_size=0.2, k_fold=None, random_state=None, shuffle=True)[source]
Bases:
BaseEstimator
Data splitter for train and test
- Parameters:
size (int) – Total sample size. All data must have the same length in their first dim.
test_size (Union[float, int]) – If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. Can be 0 if cv is not None; in this case, cv() will yield a tuple that only contains training and validation on each step. By default, the value is set to 0.2.
k_fold (Union[int, Iterable, None]) – Number of k-folds. If int, must be at least 2. If Iterable, it should provide a label for each element, which will be used for group cv. In this case, the input of cv() must be a pandas.DataFrame object. Default value is None to specify no cv.
random_state (Optional[int]) – If int, random_state is the seed used by the random number generator. Default is None.
shuffle (bool) – Whether or not to shuffle the data before splitting.
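A minimal sketch of constructing a splitter and retrieving the split indices. The unpacking order (train indices first, test indices second) follows the usual sklearn convention and is an assumption here.
>>> from xenonpy.datatools.splitter import Splitter
>>> sp = Splitter(100, test_size=0.2, random_state=0)  # 100 samples, 20% held out for test
>>> idx_train, idx_test = sp.split()                   # no data given, so only the split indices are returned
>>> len(idx_train), len(idx_test)
(80, 20)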
- cv(*arrays, less_for_train=False)[source]
Split data with cross-validation.
- Parameters:
*arrays (DataFrame, Series, ndarray, list) – Data for split. Must be a sequence of indexables with the same length / shape[0]. If None, return the split indices.
less_for_train (bool) – If True, use the smaller part of each split for training. E.g. [1, 2, 3, 4, 5, 6, 7, 8, 9, 0] with 5-fold cv will be split into [1, 2] and [3, 4, 5, 6, 7, 8, 9, 0]. Usually, [1, 2] (the smaller part) is used for validation; with less_for_train=True, [1, 2] is used for training instead. Default is False.
- Yields:
tuple – List containing the cv split of the inputs on each step. If the inputs are None, only the split indices are returned. If test_size is 0, the test data/indices are not returned.
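A minimal sketch of iterating cv() with test_size=0, so each step yields only the training and validation parts. The data are illustrative, and the shapes shown assume row-wise splitting of a single input array.
>>> import numpy as np
>>> from xenonpy.datatools.splitter import Splitter
>>> x = np.arange(10).reshape(-1, 1)                     # 10 samples, 1 feature
>>> sp = Splitter(len(x), test_size=0, k_fold=5, random_state=0)
>>> for x_train, x_val in sp.cv(x):                      # 5 folds: 8 training rows, 2 validation rows each
...     print(x_train.shape, x_val.shape)
(8, 1) (2, 1)
(8, 1) (2, 1)
(8, 1) (2, 1)
(8, 1) (2, 1)
(8, 1) (2, 1)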
- split(*arrays)[source]
Split data.
- Parameters:
*arrays (Union[ndarray, DataFrame, Series]) – Dataset for split. Size of dim 0 must be equal to size(). If None, return the split indices.
- Returns:
List containing the split of the inputs. If the inputs are None, only the split indices are returned. If test_size is 0, the test data/indices are not returned.
- Return type:
list
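A minimal sketch of split() with data arrays. The unpacking order (train/test for each input array in turn, following the sklearn train_test_split convention) is an assumption here.
>>> import numpy as np
>>> from xenonpy.datatools.splitter import Splitter
>>> x = np.random.randn(100, 5)                          # 100 samples, 5 features
>>> y = np.random.randn(100)
>>> sp = Splitter(x.shape[0], test_size=0.2, random_state=0)
>>> x_train, x_test, y_train, y_test = sp.split(x, y)    # assumed order: train/test per array
>>> x_train.shape, x_test.shape
((80, 5), (20, 5))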
- property k_fold
- property random_state
- property shuffle
- property size
- property test_size
xenonpy.datatools.transform module
- class xenonpy.datatools.transform.PowerTransformer(*, method='yeo-johnson', standardize=False, lmd=None, tolerance=(-inf, inf), on_err=None)[source]
Bases:
BaseEstimator, TransformerMixin
Box-Cox transform.
References
G.E.P. Box and D.R. Cox, “An Analysis of Transformations”, Journal of the Royal Statistical Society B, 26, 211-252 (1964).
- Parameters:
method ('yeo-johnson' or 'box-cox') – 'yeo-johnson' works with positive and negative values; 'box-cox' only works with strictly positive values.
standardize (boolean) – Normalize to standard normal or not. Recommend using a separate standard function instead of using this option.
lmd (list or 1-dim ndarray) – You can assign each input xs a specific lmd yourself. Leave None (default) to use an inferred value. See PowerTransformer for details.
tolerance (tuple) – Tolerance of lmd. Set None to accept any value. Default is (-np.inf, np.inf), but (-2, 2) is recommended for the Box-Cox transform.
on_err (None or str) – Error handling when inferring lambda fails. Can be None, 'log', 'nan', or 'raise'. 'log' returns the logarithmic transform of xs with a minimum shift to 1. 'nan' returns an ndarray with shape xs.shape filled with np.nan. 'raise' raises a FloatingPointError, which you can catch yourself. The default (None) returns the input series without the scale transform.
See Also: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html#sklearn.preprocessing.PowerTransformer
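A minimal sketch of both methods; the data are random and purely illustrative.
>>> import numpy as np
>>> from xenonpy.datatools.transform import PowerTransformer
>>> x_pos = np.abs(np.random.randn(100, 3)) + 1.0        # strictly positive, as 'box-cox' requires
>>> pt = PowerTransformer(method='box-cox', tolerance=(-2, 2))
>>> x_t = pt.fit_transform(x_pos)                        # infer a lambda per column, then transform
>>> pt_yj = PowerTransformer(method='yeo-johnson')       # also works with negative values
>>> x_t2 = pt_yj.fit_transform(np.random.randn(100, 3))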
- class xenonpy.datatools.transform.Scaler[source]
Bases:
BaseEstimator, TransformerMixin
A value-matrix container for data transform.
- Parameters:
value (DataFrame) – Inner data.
- fit(x)[source]
Compute the minimum and maximum to be used for later scaling.
- Parameters:
x – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.
- fit_transform(x, y=None, **fit_params)[source]
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
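A minimal sketch of the fit/transform/reset call sequence documented on this page. The data are random and purely illustrative, and what scaling a freshly constructed Scaler applies is not covered here, so only the call pattern is shown.
>>> import numpy as np
>>> from xenonpy.datatools.transform import Scaler
>>> x = np.random.randn(50, 4)
>>> scaler = Scaler()
>>> x_new = scaler.fit_transform(x)   # fit to x, then return the transformed version
>>> scaler.reset()                    # drop the data-dependent state; __init__ parameters are kept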
- reset()[source]
Reset internal data-dependent state of the scaler, if necessary. __init__ parameters are not touched.