{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cours Science de données - IFRISSE 2020 - PART3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dans cette partie de notre étude, nous aborderons la notion de feature engineering et la construction de modèle de Machine Learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# I - Feature Engineering - Extraction de caractéristiques"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"L'extraction de caractéristiques est considérée comme étant les méthodes permettant de sélectionner et/ou combinent des variables caractéristiques dans le but **1)** de réduire efficacement la quantité de données à traiter, et **2)** tout en décrivant de manière précise et complète l'ensemble de données d'origine (sans perdre le contenu principal des données)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Qu'est-ce qu'une caractéristique et pourquoi avons-nous besoin d'une ingénierie ?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fondamentalement, tous les algorithmes d'apprentissage machine utilisent des données données d'entrées pour créer/proposer des sorties. \n",
"\n",
"Ces données d'entrées comportent des caractéristiques, qui généralement se présentent sous la forme de colonnes structurées. \n",
"\n",
"Pour fonctionner correctement, les algorithmes ont besoin de caractéristiques spécifiques (plus ou moins précises). C'est là que le besoin d'ingénierie des caractéristiques se fait sentir. Elles ont principalement deux objectifs :\n",
"\n",
"**1° Préparer le jeu de données d'entrée approprié, compatible avec les exigences du modèle ou de l'algorithme d'apprentissage.**\n",
"\n",
"**2° Améliorer les performances des modèles d'apprentissage.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bon à Savoir : src => Source: https://www.forbes.com/sites/gilpress\n",
"\n",
"![alt text](f_eng.jpg \"Source: https://www.forbes.com/sites/gilpress\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Quelques techniques utilisées pour l'extraction de caractéristiques "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1.Imputation => Imputation\n",
"\n",
"2.Handling Outliers => le traitement des valeurs aberrantes\n",
"\n",
"3.Binning => jumelage/échantilonnage\n",
"\n",
"4.Log Transform => transformation logarithmic\n",
"\n",
"5.One-Hot Encoding => codage à valeur binaire 0|1\n",
"\n",
"6.Grouping Operations => opérations de regroupement\n",
"\n",
"7.Feature Split => Découpage \n",
"\n",
"8.Scaling => mise à l'échelle\n",
"\n",
"#### NB: Certaines techniques pourraient être utilisées dans des cas spécifiques, tandis que d'autres pourraient être bénéfiques dans tous les cas."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"# The most commonly used base libraries\n",
"import pandas as pd # data analysis / transformation\n",
"import numpy as np # manipulation of numerical values and arrays"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1 - Imputation => resoudre les problèmes de valeurs manquentes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Les valeurs manquantes sont l'un des problèmes les plus courants que vous pouvez rencontrer lorsque vous essayez de préparer vos données pour l'apprentissage. \n",
"\n",
"Les valeurs manquantes peuvent être dues à des erreurs humaines, des interruptions dans le flux (de collecte) de données, des problèmes de confidentialité, etc."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Solutions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### a° Supprimer les colonnes ou les lignes contenant des valeurs manquentes en fonction d'un seuil donné\n",
"\n",
"threshold = 0.7\n",
"\n",
"#Dropping columns with missing value rate higher than threshold !attention car cela peut affecter fortement la distribution des données d'origines\n",
"\n",
"data = data[data.columns[data.isnull().mean() < threshold]]\n",
"\n",
"#Dropping rows with missing value rate higher than threshold\n",
"\n",
"data = data.loc[data.isnull().mean(axis=1) < threshold]"
]
},
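{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal runnable sketch of the threshold rule above. The DataFrame df_miss and its columns are hypothetical, invented only for illustration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example data (not part of the course dataset)\n",
"df_miss = pd.DataFrame({'a': [1, 2, np.nan, 4, 5],\n",
"                        'b': [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing\n",
"                        'c': ['x', 'y', None, 'z', 'w']})\n",
"\n",
"threshold = 0.7\n",
"\n",
"# Keep only the columns whose missing-value rate is below the threshold (drops 'b')\n",
"df_cols = df_miss[df_miss.columns[df_miss.isnull().mean() < threshold]]\n",
"\n",
"# Keep only the rows whose missing-value rate is below the threshold (drops the 3rd row)\n",
"df_rows = df_miss.loc[df_miss.isnull().mean(axis=1) < threshold]\n",
"\n",
"df_cols"
]
},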
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### b° Remplacer par une valeur X lorsque les valeurs manquentes sont des valeurs numériques\n",
"\n",
"#Filling all missing values with 0\n",
"\n",
"data = data.fillna(0)\n",
"\n",
"#Filling missing values with medians of the columns\n",
"\n",
"data = data.fillna(data.median())\n",
"\n",
"#Filling missing values with std of the columns\n",
"\n",
"data = data.fillna(data.std())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### c° Remplacer par la valeur la plus fréquente lorsque les valeurs manquentes sont des valeurs catégorielles\n",
"\n",
"#Max fill function for categorical columns\n",
"\n",
"data['column_name'].fillna(data['column_name'].value_counts().idxmax(), inplace=True)"
]
},
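{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small runnable sketch combining b° and c° on a hypothetical DataFrame (df_fill, its columns and its values are invented for illustration only):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example data (not part of the course dataset)\n",
"df_fill = pd.DataFrame({'age': [25, np.nan, 31, np.nan, 40],\n",
"                        'city': ['Ouaga', 'Bobo', None, 'Ouaga', None]})\n",
"\n",
"# Numerical column: fill with the median of the column (b°)\n",
"df_fill['age'] = df_fill['age'].fillna(df_fill['age'].median())\n",
"\n",
"# Categorical column: fill with the most frequent value (c°)\n",
"df_fill['city'] = df_fill['city'].fillna(df_fill['city'].value_counts().idxmax())\n",
"\n",
"df_fill"
]
},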
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.Handling Outliers => le traitement des valeurs aberrantes\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Les valeurs aberantes peuvent être détecté de deux façons. Soit en partant de **1)** l'écart type, ou par **2)** la technique des percentiles"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Solutions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### a° la technique de la deviation\n",
"\n",
"Si une valeur a une distance par rapport à la moyenne supérieure à **x * écart type**, elle peut être considérée comme une valeur aberrante. Alors que devrait être **x** ?\n",
"\n",
"Il n'y a pas de solution triviale pour **x**, mais généralement, une valeur comprise entre **2 et 4** semble pratique.\n",
"\n",
"#### obtenir et supprimer les valeurs aberantes\n",
"\n",
"x = 3\n",
"\n",
"upper_lim = data['column'].mean () + data['column'].std () * x\n",
"\n",
"lower_lim = data['column'].mean () - data['column'].std () * x\n",
"\n",
"data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]"
]
},
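{
"cell_type": "markdown",
"metadata": {},
"source": [
"A runnable sketch of the standard-deviation rule on a small hypothetical series. The values in df_out are made up; with such a tiny sample we use x = 2 (the lower end of the 2-4 range suggested above) so that the extreme value is actually flagged:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example data with one obvious outlier (200)\n",
"df_out = pd.DataFrame({'column': [12, 15, 14, 13, 16, 15, 14, 200]})\n",
"\n",
"x = 2  # small sample, so we take the lower end of the suggested 2-4 range\n",
"upper_lim = df_out['column'].mean() + df_out['column'].std() * x\n",
"lower_lim = df_out['column'].mean() - df_out['column'].std() * x\n",
"\n",
"# Keep only the values that fall inside [lower_lim, upper_lim]\n",
"df_out = df_out[(df_out['column'] < upper_lim) & (df_out['column'] > lower_lim)]\n",
"df_out"
]
},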
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.Binning\n",
"\n",
"Cette technique est utilisable que ce soit sur des variables numériques ou catégorielles. La principale motivation du binning est de rendre le modèle plus robuste et d'éviter les problèmes de sur-apprentissage.\n",
"\n",
"### Exemple avec de variables Numéric\n",
"Value Bin \n",
"0-30 -> Low \n",
"31-70 -> Mid \n",
"71-100 -> High\n",
"### Exemple avec de variables Categorielles\n",
"Value Bin \n",
"Spain -> Europe \n",
"Italy -> Europe \n",
"Chile -> South America\n",
"Brazil -> South America"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 45 | \n",
"
\n",
" \n",
" 2 | \n",
" 7 | \n",
"
\n",
" \n",
" 3 | \n",
" 85 | \n",
"
\n",
" \n",
" 4 | \n",
" 28 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value\n",
"0 2\n",
"1 45\n",
"2 7\n",
"3 85\n",
"4 28"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_n = pd.DataFrame([2,45,7,85,28], columns=['value'])\n",
"df_n"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" bin | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2 | \n",
" Low | \n",
"
\n",
" \n",
" 1 | \n",
" 45 | \n",
" Mid | \n",
"
\n",
" \n",
" 2 | \n",
" 7 | \n",
" Low | \n",
"
\n",
" \n",
" 3 | \n",
" 85 | \n",
" High | \n",
"
\n",
" \n",
" 4 | \n",
" 28 | \n",
" Low | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value bin\n",
"0 2 Low\n",
"1 45 Mid\n",
"2 7 Low\n",
"3 85 High\n",
"4 28 Low"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Exemple avec de variables Numéric\n",
"\n",
"df_n['bin'] = pd.cut(df_n['value'], bins=[0,30,70,100], labels=[\"Low\", \"Mid\", \"High\"])\n",
"df_n"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Country | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Spain | \n",
"
\n",
" \n",
" 1 | \n",
" Chile | \n",
"
\n",
" \n",
" 2 | \n",
" Australia | \n",
"
\n",
" \n",
" 3 | \n",
" Italy | \n",
"
\n",
" \n",
" 4 | \n",
" Brazil | \n",
"
\n",
" \n",
" 5 | \n",
" Burkina Faso | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Country\n",
"0 Spain\n",
"1 Chile\n",
"2 Australia\n",
"3 Italy\n",
"4 Brazil\n",
"5 Burkina Faso"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_c = pd.DataFrame(['Spain','Chile','Australia','Italy','Brazil', 'Burkina Faso'], columns=['Country'])\n",
"df_c"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Country | \n",
" Countryx | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Spain | \n",
" Europe | \n",
"
\n",
" \n",
" 1 | \n",
" Chile | \n",
" South America | \n",
"
\n",
" \n",
" 2 | \n",
" Australia | \n",
" Other | \n",
"
\n",
" \n",
" 3 | \n",
" Italy | \n",
" Europe | \n",
"
\n",
" \n",
" 4 | \n",
" Brazil | \n",
" South America | \n",
"
\n",
" \n",
" 5 | \n",
" Burkina Faso | \n",
" Africa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Country Countryx\n",
"0 Spain Europe\n",
"1 Chile South America\n",
"2 Australia Other\n",
"3 Italy Europe\n",
"4 Brazil South America\n",
"5 Burkina Faso Africa"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Exemple avec de variables Categorielles\n",
"conditions = [\n",
" df_c['Country'].str.contains('Spain'),\n",
" df_c['Country'].str.contains('Italy'),\n",
" df_c['Country'].str.contains('Chile'),\n",
" df_c['Country'].str.contains('Brazil'),\n",
" df_c['Country'].str.contains('Burkina Faso')]\n",
"\n",
"choices = ['Europe', 'Europe', 'South America', 'South America', 'Africa']\n",
"\n",
"df_c['Countryx'] = np.select(conditions, choices, default='Other')\n",
"df_c"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4.Log Transform => transformation logarithmic"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"C'est l'une des transformations mathématiques les plus couramment utilisées dans l'ingénierie des caractéristiques.\n",
"\n",
"Quels sont les avantages de la transformation logarithmique ?\n",
"\n",
"##### Elle permet de traiter des données biaisées et, après transformation, la distribution devient plus proche de la normale.\n",
"##### Il diminue également l'effet des valeurs aberrantes, grâce à la normalisation des différences d'amplitude et permet d'avoir un modèle devient plus robuste."
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" log+1 | \n",
" log | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2 | \n",
" 1.098612 | \n",
" 3.258097 | \n",
"
\n",
" \n",
" 1 | \n",
" 45 | \n",
" 3.828641 | \n",
" 4.234107 | \n",
"
\n",
" \n",
" 2 | \n",
" -23 | \n",
" NaN | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 3 | \n",
" 85 | \n",
" 4.454347 | \n",
" 4.691348 | \n",
"
\n",
" \n",
" 4 | \n",
" 28 | \n",
" 3.367296 | \n",
" 3.951244 | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" 1.098612 | \n",
" 3.258097 | \n",
"
\n",
" \n",
" 6 | \n",
" 35 | \n",
" 3.583519 | \n",
" 4.077537 | \n",
"
\n",
" \n",
" 7 | \n",
" -12 | \n",
" NaN | \n",
" 2.484907 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value log+1 log\n",
"0 2 1.098612 3.258097\n",
"1 45 3.828641 4.234107\n",
"2 -23 NaN 0.000000\n",
"3 85 4.454347 4.691348\n",
"4 28 3.367296 3.951244\n",
"5 2 1.098612 3.258097\n",
"6 35 3.583519 4.077537\n",
"7 -12 NaN 2.484907"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Log Transform Example\n",
"data = pd.DataFrame({'value':[2,45, -23, 85, 28, 2, 35, -12]})\n",
"data['log+1'] = (data['value']+1).transform(np.log)\n",
"\n",
"#Negative Values Handling\n",
"#Note that the values are different\n",
"data['log'] = (data['value']-data['value'].min()+1) .transform(np.log)\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. One-hot encoding "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cette méthode répartit les valeurs d'une colonne sur plusieurs colonnes et leur attribue 0 ou 1. Ces valeurs binaires expriment la relation entre la colonne groupée et la colonne codée."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![alt text](one_hot.png 'one_hot encod')\n"
]
},
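{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a runnable illustration of this idea, pandas offers pd.get_dummies. The sketch below applies it to the df_c DataFrame created in the binning section; the new column names follow pandas' prefix_value convention and are not part of the original course material:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# One-hot encode the Country column of df_c (defined in the binning section above)\n",
"encoded_columns = pd.get_dummies(df_c['Country'], prefix='Country')\n",
"\n",
"# Each country becomes its own 0/1 column next to the original one\n",
"pd.concat([df_c, encoded_columns], axis=1)"
]
},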
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7.Feature Split.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cette techiq permet de rendre certaines variables plus compréhensible, favorisant la compréhension de l'algo au moment de l'apprentissage (moins d'ambiguité):\n",
"\n",
"* Nous permettons aux algorithmes d'apprentissage de les comprendre.\n",
"\n",
"* Rendre possible leur binning et leur regroupement.\n",
"\n",
"* Améliorer les performances des modèles en découvrant des informations potentielles."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Luther N. Gonzalez | \n",
"
\n",
" \n",
" 1 | \n",
" Charles M. Young | \n",
"
\n",
" \n",
" 2 | \n",
" Terry Lawson | \n",
"
\n",
" \n",
" 3 | \n",
" Kristen White | \n",
"
\n",
" \n",
" 4 | \n",
" Thomas Logsdon | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name\n",
"0 Luther N. Gonzalez\n",
"1 Charles M. Young\n",
"2 Terry Lawson\n",
"3 Kristen White\n",
"4 Thomas Logsdon"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_split = pd.DataFrame({'name':['Luther N. Gonzalez','Charles M. Young', 'Terry Lawson', 'Kristen White', 'Thomas Logsdon']})\n",
"df_split"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Luther\n",
"1 Charles\n",
"2 Terry\n",
"3 Kristen\n",
"4 Thomas\n",
"Name: name, dtype: object"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Extraire le prenom\n",
"df_split.name.str.split(\" \").map(lambda x: x[0])\n"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Gonzalez\n",
"1 Young\n",
"2 Lawson\n",
"3 White\n",
"4 Logsdon\n",
"Name: name, dtype: object"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Extraire le nom\n",
"df_split.name.str.split(\" \").map(lambda x: x[-1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Scaling => mise à l'échelle"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Normalisation\n",
"\n",
"Dans la plupart des cas, les variables n'ont pas une certaine harmonisation et elles diffèrent les unes des autres.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" normalized | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2 | \n",
" 0.231481 | \n",
"
\n",
" \n",
" 1 | \n",
" 45 | \n",
" 0.629630 | \n",
"
\n",
" \n",
" 2 | \n",
" -23 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" 3 | \n",
" 85 | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 4 | \n",
" 28 | \n",
" 0.472222 | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" 0.231481 | \n",
"
\n",
" \n",
" 6 | \n",
" 35 | \n",
" 0.537037 | \n",
"
\n",
" \n",
" 7 | \n",
" -12 | \n",
" 0.101852 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value normalized\n",
"0 2 0.231481\n",
"1 45 0.629630\n",
"2 -23 0.000000\n",
"3 85 1.000000\n",
"4 28 0.472222\n",
"5 2 0.231481\n",
"6 35 0.537037\n",
"7 -12 0.101852"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# exemple de normalisation =>\n",
"data = pd.DataFrame({'value':[2,45, -23, 85, 28, 2, 35, -12]})\n",
"\n",
"data['normalized'] = (data['value'] - data['value'].min()) / (data['value'].max() - data['value'].min())\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Standardization\n",
"\n",
"La standardisation (ou normalisation du z-score) met à l'échelle les valeurs tout en tenant compte de l'écart type. \n",
"\n",
"Si l'écart-type des caractéristiques est différent, leur plage sera également différente. Cela permet de réduire l'effet des valeurs aberrantes des caractéristiques.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" value | \n",
" standardized | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2 | \n",
" -0.518878 | \n",
"
\n",
" \n",
" 1 | \n",
" 45 | \n",
" 0.703684 | \n",
"
\n",
" \n",
" 2 | \n",
" -23 | \n",
" -1.229670 | \n",
"
\n",
" \n",
" 3 | \n",
" 85 | \n",
" 1.840952 | \n",
"
\n",
" \n",
" 4 | \n",
" 28 | \n",
" 0.220346 | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" -0.518878 | \n",
"
\n",
" \n",
" 6 | \n",
" 35 | \n",
" 0.419367 | \n",
"
\n",
" \n",
" 7 | \n",
" -12 | \n",
" -0.916922 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" value standardized\n",
"0 2 -0.518878\n",
"1 45 0.703684\n",
"2 -23 -1.229670\n",
"3 85 1.840952\n",
"4 28 0.220346\n",
"5 2 -0.518878\n",
"6 35 0.419367\n",
"7 -12 -0.916922"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# xemple de standardisation\n",
"data = pd.DataFrame({'value':[2,45, -23, 85, 28, 2, 35, -12]})\n",
"\n",
"data['standardized'] = (data['value'] - data['value'].mean()) / data['value'].std()\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# END OF FEATURE ENGINEERING"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}