Descriptor calculation
XenonPy comes with a general interface for descriptor calculation. With this interface, users can implement their own descriptor calculator in only a few lines of code and run it smoothly.
We also use this system to provide built-in calculators. Currently, 17 featurizers in 4 types are available out of the box. The following table summarizes them.
| Featurizer | Type | Description |
|---|---|---|
| Counting | Composition | Encodes the number of each element in a compound as a vector |
| WeightedAverage | Composition | Weighted average (abbr: ave): \(f_{ave, i} = w_{A}^* f_{A,i} + w_{B}^* f_{B,i}\) |
| WeightedVariance | Composition | Weighted variance (abbr: var): \(f_{var, i} = w_{A}^* (f_{A,i} - f_{ave, i})^2 + w_{B}^* (f_{B,i} - f_{ave, i})^2\) |
| WeightedSum | Composition | Weighted sum (abbr: sum): \(f_{sum, i} = w_{A} f_{A,i} + w_{B} f_{B,i}\) |
| GeometricMean | Composition | Geometric mean (abbr: gmean): \(f_{gmean, i} = \sqrt[w_A + w_B]{f_{A,i}^{w_A} \cdot f_{B,i}^{w_B}}\) |
| HarmonicMean | Composition | Harmonic mean (abbr: hmean): \(f_{hmean, i} = \frac{w_A + w_B}{\frac{w_A}{f_{A,i}} + \frac{w_B}{f_{B,i}}}\) |
| MaxPooling | Composition | Max-pooling (abbr: max): \(f_{max, i} = \max\{f_{A,i}, f_{B,i}\}\) |
| MinPooling | Composition | Min-pooling (abbr: min): \(f_{min, i} = \min\{f_{A,i}, f_{B,i}\}\) |
| RDKitFP | Fingerprint | RDKit fingerprint |
| AtomPairFP | Fingerprint | Atom Pair fingerprints |
| MACCS | Fingerprint | The MACCS keys for a molecule |
| ECFP | Fingerprint | Morgan (Circular) fingerprints (ECFP) |
| FCFP | Fingerprint | Morgan (Circular) fingerprints + feature-based (FCFP) |
| TopologicalTorsionFP | Fingerprint | Topological Torsion fingerprints |
| OrbitalFieldMatrix | Structure | Representation based on the valence shell electrons of neighboring atoms |
| RadialDistributionFunction | Structure | Radial distribution in crystal |
| FrozenFeaturizer | NN | Features extracted by a neural network |
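As a worked example of the composition formulas in the table above, the following sketch evaluates them for a two-element compound using plain Python. It is not XenonPy's implementation. The function name `composition_features` is hypothetical, and the sketch assumes that the starred weights \(w^*\) are the amounts normalized to sum to one, while the unstarred weights \(w\) are the raw amounts.

```python
import math


def composition_features(amounts, values):
    """Combine one element-level property into compound-level features.

    amounts: dict mapping element symbol -> amount, e.g. {'H': 2, 'O': 1}
    values:  dict mapping element symbol -> property value f_i
    (Hypothetical helper for illustration only.)
    """
    total = sum(amounts.values())
    # Assumption: w* is the amount normalized so that all weights sum to one.
    w_star = {el: amt / total for el, amt in amounts.items()}

    ave = sum(w_star[el] * values[el] for el in amounts)
    var = sum(w_star[el] * (values[el] - ave) ** 2 for el in amounts)
    weighted_sum = sum(amounts[el] * values[el] for el in amounts)
    # (w_A + w_B)-th root of the product of f^w terms.
    gmean = math.prod(values[el] ** amounts[el] for el in amounts) ** (1 / total)
    hmean = total / sum(amounts[el] / values[el] for el in amounts)

    return {
        "ave": ave,
        "var": var,
        "sum": weighted_sum,
        "gmean": gmean,
        "hmean": hmean,
        "max": max(values[el] for el in amounts),
        "min": min(values[el] for el in amounts),
    }


# Atomic numbers of H and O serve as the element-level property here.
atomic_number = {"H": 1, "O": 8}
features = composition_features({"H": 2, "O": 1}, atomic_number)
for name, value in features.items():
    print(f"{name}:atomic_number = {value:.4f}")
```

The printed labels follow the `abbr:property` pattern that appears in the column names of the calculator output, such as `ave:atomic_number`.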
Compositional descriptors
XenonPy can calculate 290 compositional features for a given chemical composition.
The calculation uses the 58 element-level properties recorded in elements_completed.
See Data access for details.
>>> from xenonpy.descriptor import Compositions
>>> cal = Compositions()
>>> cal
Compositions:
|- composition:
| |- Counting
| |- WeightedAverage
| |- WeightedSum
| |- WeightedVariance
| |- GeometricMean
| |- HarmonicMean
| |- MaxPooling
| |- MinPooling
The structure of the calculator cal is shown above.
It tells us that cal has one featurizer group called composition, which contains the featurizers
Counting, WeightedAverage, WeightedSum, WeightedVariance, GeometricMean, HarmonicMean, MaxPooling and MinPooling.
To use this calculator, build an iterable object that contains the composition of each compound, then pass it to the transform or fit_transform method of cal.
These methods accept two types of input: pymatgen.Structure objects, or dicts of the form {'H': 2, 'O': 1}.
With our sample data, the result is a pandas.DataFrame object that contains all the compositional descriptors.
>>> from xenonpy.datatools import preset
>>> samples = preset.mp_samples
>>> comps = samples['composition']
>>> descriptor = cal.transform(comps)
>>> descriptor
ave:atomic_number ... min:Polarizability
0 24.666667 ... 0.802000
1 33.000000 ... 1.100000
2 21.600000 ... 0.802000
... ... ... ...
928 44.500000 ... 5.500000
929 24.250000 ... 25.000000
930 26.750000 ... 4.800000
931 36.000000 ... 6.600000
932 16.500000 ... 0.802000
[933 rows x 290 columns]
where
>>> comps.__class__
pandas.core.series.Series
>>> comps[0].__class__
dict
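If your compositions are stored as formula strings rather than dicts, they need to be converted to the dict form shown above first. The following is a minimal sketch of such a conversion using only the standard library. The helper `formula_to_dict` is hypothetical and not part of XenonPy, and it handles only flat formulas such as `Fe2O3`, not parentheses or charges.

```python
import re


def formula_to_dict(formula):
    """Parse a flat formula such as 'Fe2O3' into {'Fe': 2.0, 'O': 3.0}.

    Hypothetical helper for illustration; it does not support
    parentheses, hydrates, or charges.
    """
    counts = {}
    # An element symbol is one uppercase letter plus an optional lowercase
    # letter, followed by an optional (possibly fractional) amount.
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


# An iterable of dicts, in the same form as the entries of comps.
compositions = [formula_to_dict(f) for f in ["H2O", "Fe2O3", "NaCl"]]
print(compositions)
```

A list built this way has the same shape as the entries of `comps`, so under the input rules described above it could be passed to `cal.transform` in the same manner.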
If the input is a pandas.DataFrame object, the calculator first tries to read the columns whose names match its featurizer groups.
In the example above, the featurizer group is named composition.
Therefore, the whole samples object can be passed to the calculator’s methods without explicitly extracting the composition column:
>>> descriptor = cal.transform(samples)
>>> descriptor
ave:atomic_number ... min:Polarizability
0 24.666667 ... 0.802000
1 33.000000 ... 1.100000
2 21.600000 ... 0.802000
... ... ... ...
928 44.500000 ... 5.500000
929 24.250000 ... 25.000000
930 26.750000 ... 4.800000
931 36.000000 ... 6.600000
932 16.500000 ... 0.802000
[933 rows x 290 columns]
This produces the same result as the previous call.
Structural descriptors
Like the Compositions calculator, Structures accepts pymatgen.Structure objects as input and returns the calculated results as a pandas.DataFrame.
>>> from xenonpy.descriptor import Structures
>>> cal = Structures()
>>> cal
Structures:
|- structure:
| |- RadialDistributionFunction
| |- OrbitalFieldMatrix
Structures contains one featurizer group called structure, which holds RadialDistributionFunction and OrbitalFieldMatrix.
The samples data also contains structure information, which we can use to calculate structural descriptors.
>>> descriptor = cal.transform(samples)
This takes about 3 to 5 minutes to run. The result looks like this:
>>> descriptor.head(5)
0.1 0.2 0.30000000000000004 ... f14_f12 f14_f13 f14_f14
mp-1008807 0.0 0.0 0.0 ... 0.0 0.0 0.0000
mp-1009640 0.0 0.0 0.0 ... 0.0 0.0 0.0000
mp-1016825 0.0 0.0 0.0 ... 0.0 0.0 0.0000
mp-1017582 0.0 0.0 0.0 ... 0.0 0.0 0.3851
mp-1021511 0.0 0.0 0.0 ... 0.0 0.0 0.0000
[5 rows x 1224 columns]
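To give a feel for what a radial distribution descriptor encodes, the following sketch counts pairwise distances between a handful of points in fixed-width radial bins. It is a simplified illustration, not XenonPy's RadialDistributionFunction: it ignores the periodic images of a crystal and applies no normalization. The function name and its `r_max` and `bin_width` parameters are hypothetical.

```python
import math
from itertools import combinations


def pair_distance_histogram(points, r_max=2.0, bin_width=0.5):
    """Count pairwise distances in radial bins of equal width.

    Hypothetical helper for illustration: a finite, non-periodic set of
    points and raw counts only (no shell-volume or density normalization).
    """
    n_bins = int(round(r_max / bin_width))
    hist = [0] * n_bins
    for a, b in combinations(points, 2):
        distance = math.dist(a, b)
        index = int(distance / bin_width)
        # Distances at or beyond r_max fall outside the histogram.
        if index < n_bins:
            hist[index] += 1
    return hist


# Three atoms at the corners of a right triangle with unit legs.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
hist = pair_distance_histogram(points)
print(hist)
```

Each bin becomes one entry of a fixed-length vector, which is why a descriptor of this kind can be stored as a row of numeric columns regardless of how many atoms the structure contains.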
Advanced
Some details of the descriptor calculator system are not yet covered in this tutorial. Until this document is complete, see https://github.com/yoshida-lab/XenonPy/blob/master/samples/custom_descriptor_calculator.ipynb for more information.