xenonpy.datatools package

Submodules

xenonpy.datatools.dataset module

class xenonpy.datatools.dataset.Dataset(*paths, backend='pandas', prefix=None)[source]

Bases: object

__call__(*args, **kwargs)[source]

Call self as a function.

classmethod from_http(url, save_to, *, filename=None, chunk_size=262144, params=None, **kwargs)[source]

Get file object via a http request.

Parameters:
  • url (str) – The resource url.

  • save_to (str) – Path of the directory to save the downloaded file into.

  • filename (str, optional) – Specify the file name when saving. Set to None (default) to infer the name from the HTTP header.

  • chunk_size (int, optional) – Chunk size in bytes used when streaming the download.

  • params (any, optional) – Parameters passed to the requests.get function. See Also: requests

  • kwargs (dict, optional) – Passed to the requests.get function as keyword arguments.

Returns:

Full path to the downloaded file, including the file name.

Return type:

str
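The chunked download and name inference that from_http performs can be sketched as follows. This is a minimal illustration, not XenonPy's implementation; download_stream and infer_filename are hypothetical helper names, and the stream here stands in for a requests response body:

```python
import io
import os
import re


def infer_filename(content_disposition, url):
    # Prefer the name in the Content-Disposition header; fall back to the
    # last path segment of the URL.
    if content_disposition:
        match = re.search(r'filename="?([^";]+)"?', content_disposition)
        if match:
            return match.group(1)
    return url.rstrip('/').rsplit('/', 1)[-1]


def download_stream(stream, save_to, filename, chunk_size=262144):
    # Write the stream to `save_to/filename` in fixed-size chunks, so a
    # large file is never held in memory all at once.
    path = os.path.join(save_to, filename)
    with open(path, 'wb') as f:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)
    return path
```

With a real HTTP request, the stream would come from requests.get(url, stream=True) and the header from the response's Content-Disposition field.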

classmethod to(obj, path, *, force_pkl=False)[source]
property csv
property excel
property pandas
property pickle

xenonpy.datatools.preset module

class xenonpy.datatools.preset.Preset(*args, **kwargs)[source]

Bases: Dataset

Load data from XenonPy's embedded datasets or from user-created data saved in the ~/.xenonpy/cached directory. Data can also be fetched via HTTP request.

The following sample demonstrates how to use it. See the parameter documentation for details.

>>> from xenonpy.datatools import preset
>>> elements = preset.elements
>>> elements.info()
<class 'pandas.core.frame.DataFrame'>
Index: 118 entries, H to Og
Data columns (total 74 columns):
atomic_number                    118 non-null int64
atomic_radius                    88 non-null float64
atomic_radius_rahm               96 non-null float64
atomic_volume                    91 non-null float64
atomic_weight                    118 non-null float64
boiling_point                    96 non-null float64
brinell_hardness                 59 non-null float64
bulk_modulus                     69 non-null float64
...
build(*keys, save_to=None, **kwargs)[source]
sync(data, to=None)[source]

Load data.

Note

Tries to load data locally from ~/.xenonpy/dataset. If the data is not found, fetches it from the remote repository.

Parameters:
Returns:

ret

Return type:

DataFrame or Saver or local file path.

property atom_init

The initialization vector for each element.

See Also: https://github.com/txie-93/cgcnn#usage

property elements

Element properties from embed dataset. These properties are summarized from mendeleev, pymatgen, CRC Handbook and magpie.

See Also: Features

Returns:

element properties in pd.DataFrame

Return type:

DataFrame

property elements_completed

Completed element properties, imputed using MICE (Multivariate Imputation by Chained Equations).

See Also: Features

Return type:

imputed element properties in pd.DataFrame
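The MICE-style imputation behind elements_completed can be reproduced with scikit-learn's IterativeImputer, which implements a chained-equations strategy. This is a sketch of the imputation idea, not XenonPy's exact pipeline, and the toy table below merely stands in for preset.elements:

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental in scikit-learn and must be enabled first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy element-property table with missing values, standing in for the
# partially observed columns of `preset.elements`.
raw = pd.DataFrame({
    'atomic_number': [1.0, 2.0, 3.0, 4.0],
    'atomic_radius': [0.79, np.nan, 2.05, 1.40],
    'boiling_point': [20.3, 4.2, np.nan, 2742.0],
})

# Each feature with missing values is modeled as a function of the others,
# iterating until the estimates stabilize.
imputer = IterativeImputer(random_state=0)
completed = pd.DataFrame(imputer.fit_transform(raw), columns=raw.columns)
```

After fitting, completed has the same shape as raw but with every missing cell filled in.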

xenonpy.datatools.splitter module

class xenonpy.datatools.splitter.Splitter(size, *, test_size=0.2, k_fold=None, random_state=None, shuffle=True)[source]

Bases: BaseEstimator

Data splitter for train and test

Parameters:
  • size (int) – Total sample size. All data must have the same length along their first dimension.

  • test_size (Union[float, int]) – If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. Can be 0 if cv is None; in this case, cv() will yield a tuple containing only training and validation data at each step. By default, the value is set to 0.2.

  • k_fold (Union[int, Iterable, None]) – Number of k-folds. If int, must be at least 2. If Iterable, it should provide a label for each element, which will be used for grouped CV. In this case, the input of cv() must be a pandas.DataFrame object. Default is None, meaning no CV.

  • random_state (Optional[int]) – If int, random_state is the seed used by the random number generator. Default is None.

  • shuffle (bool) – Whether or not to shuffle the data before splitting.
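A plain train/test split with these parameters can be sketched with scikit-learn directly. This illustrates the semantics only, not XenonPy's implementation: splitting an index array of length size, then indexing every aligned dataset with the same train/test indices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

size, test_size, random_state = 10, 0.2, 42

# Split an index array once, then apply the same indices to every dataset,
# so all aligned arrays are split consistently.
indices = np.arange(size)
train_idx, test_idx = train_test_split(
    indices, test_size=test_size, random_state=random_state, shuffle=True)

X = np.random.RandomState(0).rand(size, 3)
y = np.random.RandomState(1).rand(size)
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

Because the first dimension of every array equals size, one set of indices splits them all.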

cv(*arrays, less_for_train=False)[source]

Split data with cross-validation.

Parameters:
  • *arrays (DataFrame, Series, ndarray, list) – Data to split. Must be a sequence of indexables with the same length / shape[0]. If None, return the split indices.

  • less_for_train (bool) – If True, use the smaller fold for training. E.g., [1, 2, 3, 4, 5, 6, 7, 8, 9, 0] with 5-fold CV will be split into [1, 2] and [3, 4, 5, 6, 7, 8, 9, 0]. Usually the smaller fold [1, 2] would be used for validation; with less_for_train=True, [1, 2] is used for training instead. Default is False.

Yields:

tuple – List containing the CV splits of the inputs. If the inputs are None, only the split indices are returned. If test_size is 0, test data/indices are not returned.
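The less_for_train swap described above can be illustrated with a plain scikit-learn KFold. This is a sketch of the fold-swapping idea under 5-fold CV without shuffling, not XenonPy's implementation:

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 0])
kf = KFold(n_splits=5, shuffle=False)
less_for_train = True

for train_idx, val_idx in kf.split(data):
    # Normally the larger fold trains and the smaller fold validates.
    # With less_for_train=True the roles are swapped.
    if less_for_train:
        train_idx, val_idx = val_idx, train_idx
    # First iteration: train on [1, 2], validate on [3, 4, 5, 6, 7, 8, 9, 0].
    break
```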

roll(random_state=None)[source]
split(*arrays)[source]

Split data.

Parameters:

*arrays (Union[ndarray, DataFrame, Series]) – Datasets to split. The size of dim 0 must equal size. If None, return the split indices.

Returns:

List containing the splits of the inputs. If the inputs are None, only the split indices are returned. If test_size is 0, test data/indices are not returned.

Return type:

tuple

property k_fold
property random_state
property shuffle
property size
property test_size

xenonpy.datatools.transform module

class xenonpy.datatools.transform.PowerTransformer(*, method='yeo-johnson', standardize=False, lmd=None, tolerance=(-inf, inf), on_err=None)[source]

Bases: BaseEstimator, TransformerMixin

Box-Cox transform.

References

G.E.P. Box and D.R. Cox, “An Analysis of Transformations”, Journal of the Royal Statistical Society B, 26, 211-252 (1964).

Parameters:
  • method ('yeo-johnson' or 'box-cox') – 'yeo-johnson' works with positive and negative values; 'box-cox' only works with strictly positive values.

  • standardize (boolean) – Whether to normalize to a standard normal distribution. Using a separate standardization function instead of this option is recommended.

  • lmd (list or 1-dim ndarray) – Assign a specific lambda to each input yourself. Leave as None (default) to use an inferred value. See PowerTransformer for details.

  • tolerance (tuple) – Tolerance of lmd. Set to None to accept any value. Default is (-np.inf, np.inf), but (-2, 2) is recommended for the Box-Cox transform.

  • on_err (None or str) – Error handling when lambda inference fails. Can be None, 'log', 'nan', or 'raise'. 'log' returns the logarithmic transform of xs after shifting its minimum to 1. 'nan' returns an ndarray with shape xs.shape filled with np.nan. 'raise' raises a FloatingPointError, which you can catch yourself. The default (None) returns the input series without any scale transform.

See Also: sklearn.preprocessing.PowerTransformer – https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html#sklearn.preprocessing.PowerTransformer
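Since this class wraps scikit-learn's PowerTransformer (linked above), the core behavior, lambda inference and invertibility, can be seen with the wrapped class directly. The tolerance and on_err handling described above are XenonPy additions on top of this:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
x = rng.lognormal(size=(100, 1))  # strictly positive, right-skewed data

# Box-Cox requires strictly positive inputs; the optimal lambda per
# feature is inferred by maximum likelihood during fit.
pt = PowerTransformer(method='box-cox', standardize=False)
x_t = pt.fit_transform(x)

lmd = pt.lambdas_[0]               # the inferred lambda for the feature
x_back = pt.inverse_transform(x_t)  # the transform is invertible
```

For log-normal data the inferred lambda is close to 0, i.e. a near-logarithmic transform.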

fit(x)[source]
Parameters:

X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature transformation

Returns:

self – Fitted scaler.

Return type:

object

inverse_transform(x)[source]
transform(x)[source]
class xenonpy.datatools.transform.Scaler[source]

Bases: BaseEstimator, TransformerMixin

A value-matrix container for data transform.

Parameters:

value (DataFrame) – Inner data.

box_cox(*args, **kwargs)[source]
fit(x)[source]

Compute the minimum and maximum to be used for later scaling.

Parameters:

x – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.

fit_transform(x, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

inverse_transform(x)[source]
log()[source]
min_max(*args, **kwargs)[source]
power_transformer(*args, **kwargs)[source]
reset()[source]

Reset internal data-dependent state of the scaler, if necessary. __init__ parameters are not touched.

standard(*args, **kwargs)[source]
transform(x)[source]

Scale features of X according to feature_range.

Parameters:

x (array-like, shape [n_samples, n_features]) – Input data that will be transformed.
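Scaler's chainable methods (min_max, standard, log, box_cox, yeo_johnson, ...) queue transforms that fit and transform then apply in order. The effect can be sketched with a scikit-learn Pipeline; this is an illustration of the chaining idea, not the actual Scaler implementation:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

# Chain a Yeo-Johnson transform with min-max scaling, mirroring
# a Scaler().yeo_johnson().min_max() style of chained transforms.
pipe = Pipeline([
    ('yeo_johnson', PowerTransformer(method='yeo-johnson', standardize=False)),
    ('min_max', MinMaxScaler(feature_range=(0, 1))),
])

x = np.random.RandomState(0).normal(size=(50, 2))
x_t = pipe.fit_transform(x)

# Both steps are invertible, so inverse_transform undoes the whole chain.
x_back = pipe.inverse_transform(x_t)
```

Each step's inverse is applied in reverse order, which is what makes inverse_transform on the container meaningful.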

yeo_johnson(*args, **kwargs)[source]

Module contents