{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Science Course - IFRISSE 2020 - PART3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# In this part of our study, we will see how to build a Machine Learning model and how to evaluate it" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# II - Building and Evaluating an ML Model" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "ea25cdf7-bdbc-3cf1-0737-bc51675e3374", "_uuid": "fed5696c67bf55a553d6d04313a77e8c617cad99" }, "source": [ "# Titanic Data Science Solutions\n", "\n", "\n", "### This notebook is a companion to the book [Data Science Solutions](https://www.amazon.com/Data-Science-Solutions-Startup-Workflow/dp/1520545312). \n", "\n", "## Workflow\n", "\n", "The competition solution workflow goes through six stages described in the book Data Science Solutions.\n", "\n", "1. Define the question or problem.\n", "2. Acquire the training and test data.\n", "3. Prepare and clean the data.\n", "4. Analyze the data, identify patterns, and explore.\n", "5. Model, predict, and solve the problem.\n", "6. Visualize, report, and present the problem-solving steps and the final solution.\n", "\n", "\n", "The workflow indicates the general sequence in which the stages may follow one another. However, there are exceptions:\n", "\n", "- Several workflow stages can be combined. We may analyze by visualizing the data.\n", "- A stage can be performed earlier than indicated. We may analyze the data before and after preparing it.\n", "- A stage can be performed multiple times. The visualization stage may be used several times.\n", "- A stage can be dropped altogether. 
\n", "\n", "\n", "\n", "## Question and Problem Definition\n", "\n", "Competition sites like Kaggle define the problem to solve or the questions to ask while providing the datasets for training your data science model and testing the model results against a test dataset. The question or problem definition for the Titanic Survival competition is [described here at Kaggle](https://www.kaggle.com/c/titanic).\n", "\n", "> Knowing from a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine, based on a given test dataset that does not contain the survival information, whether these passengers survived or not?\n", "\n", "We may also want to develop an early understanding of the domain of our problem. This is described on the [Kaggle competition description page here](https://www.kaggle.com/c/titanic). Here are the highlights to note.\n", "\n", "- On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. 
That translates to a 32% survival rate.\n", "- One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.\n", "- Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others.\n", "\n", "\n", "\n", "## Engineering Work for Reaching Data Science Goals\n", "\n", "The answers a data science study typically provides fall into seven categories.\n", "\n", "**Classifying**: We may want to classify or categorize our samples.\n", "\n", "**Correlating**: One can approach the problem based on the features available in the training dataset. Which features in the dataset contribute significantly to our solution goal? Statistically speaking, is there a [correlation](https://en.wikiversity.org/wiki/Correlation) between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa? This can be tested for both numerical and categorical features in the given dataset. We may also want to determine the correlation among features other than survival. Correlating certain features may help in creating, completing, or correcting features.\n", "\n", "**Converting**: For the modeling stage, the data must be prepared. Depending on the chosen modeling algorithm, it may be necessary to convert all features to equivalent numerical values. 
For example, converting text categorical values to numeric values.\n", "\n", "**Completing**: Data preparation may also require estimating any missing values within a feature. Model algorithms may work best when there are no missing values.\n", "\n", "**Correcting**: We may also analyze the given training dataset for errors or possibly inaccurate values within features and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We may also completely discard a feature if it does not contribute to the analysis or if it may significantly skew the results.\n", "\n", "**Creating**: We can create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals.\n", "\n", "**Charting**: How to select the right visualization plots and charts depending on the nature of the data and the solution goals."
] }, { "cell_type": "code", "execution_count": 213, "metadata": { "_cell_guid": "5767a33c-8f18-4034-e52d-bf7a8f7d8ab8", "_uuid": "847a9b3972a6be2d2f3346ff01fea976d92ecdb6" }, "outputs": [], "source": [ "# data analysis and wrangling\n", "import pandas as pd\n", "import numpy as np\n", "import random as rnd\n", "\n", "# visualization\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "6b5dc743-15b1-aac6-405e-081def6ecca1", "_uuid": "2d307b99ee3d19da3c1cddf509ed179c21dec94a" }, "source": [ "## Acquire data\n", "\n", "The Python Pandas library helps us work with our datasets. We start by acquiring the training and test datasets into Pandas DataFrames. We also group these datasets in a list so that certain operations can be run on both of them together. Note that the training and test sets have already been split for us." ] }, { "cell_type": "code", "execution_count": 214, "metadata": { "_cell_guid": "e7319668-86fe-8adc-438d-0eef3fd0a982", "_uuid": "13f38775c12ad6f914254a08f0d1ef948a2bd453" }, "outputs": [], "source": [ "# load the data\n", "train_df = pd.read_csv('train.csv')\n", "test_df = pd.read_csv('test.csv')\n", "# keep both DataFrames in a list (not concatenated) so the same operation can be applied to each\n", "combine = [train_df, test_df]" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "3d6188f3-dc82-8ae6-dabd-83e28fcbf10d", "_uuid": "79282222056237a52bbbb1dbd831f057f1c23d69" }, "source": [ "## Analyze by describing data\n", "\n", "Pandas also helps describe the datasets, answering the following questions early in our project.\n", "\n", "**Which features are available in the dataset?**"
] }, { "cell_type": "code", "execution_count": 215, "metadata": { "_cell_guid": "ce473d29-8d19-76b8-24a4-48c217286e42", "_uuid": "ef106f38a00e162a80c523778af6dcc778ccc1c2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'\n", " 'Ticket' 'Fare' 'Cabin' 'Embarked']\n" ] }, { "data": { "text/plain": [ "(891, 12)" ] }, "execution_count": 215, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(train_df.columns.values)\n", "\n", "train_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Survived -> Survival: 0 = No, 1 = Yes\n", "* Pclass -> Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd\n", "* Sex -> Sex\n", "* Age -> Age in years\n", "* SibSp -> Number of siblings / spouses aboard the Titanic\n", "* Parch -> Number of parents / children aboard the Titanic\n", "* Ticket -> Ticket number\n", "* Fare -> Passenger fare\n", "* Cabin -> Cabin number\n", "* Embarked -> Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "cd19a6f6-347f-be19-607b-dca950590b37", "_uuid": "1d7acf42af29a63bc038f14eded24e8b8146f541" }, "source": [ "**Which features are categorical?**\n", "\n", "These values classify the samples into sets of similar samples. \n", "\n", "- Categorical: Survived, Sex, and Embarked. Ordinal: Pclass.\n", "\n", "**Which features are numerical?**\n", "\n", "These values change from sample to sample. Within numerical features, are the values discrete, continuous, or timeseries based? \n", "\n", "- Continuous: Age, Fare. Discrete: SibSp, Parch."
] }, { "cell_type": "code", "execution_count": 216, "metadata": { "_cell_guid": "8d7ac195-ac1a-30a4-3f3f-80b8cf2c1c0f", "_uuid": "e068cd3a0465b65a0930a100cb348b9146d5fd2f" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "5 6 0 3 \n", "6 7 0 1 \n", "7 8 0 3 \n", "8 9 1 3 \n", "9 10 1 2 \n", "10 11 1 3 \n", "11 12 1 1 \n", "12 13 0 3 \n", "13 14 0 3 \n", "14 15 0 3 \n", "15 16 1 2 \n", "16 17 0 3 \n", "17 18 1 2 \n", "18 19 0 3 \n", "19 20 1 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "5 Moran, Mr. James male NaN 0 \n", "6 McCarthy, Mr. Timothy J male 54.0 0 \n", "7 Palsson, Master. Gosta Leonard male 2.0 3 \n", "8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 \n", "9 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 \n", "10 Sandstrom, Miss. Marguerite Rut female 4.0 1 \n", "11 Bonnell, Miss. Elizabeth female 58.0 0 \n", "12 Saundercock, Mr. William Henry male 20.0 0 \n", "13 Andersson, Mr. Anders Johan male 39.0 1 \n", "14 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 \n", "15 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 \n", "16 Rice, Master. Eugene male 2.0 4 \n", "17 Williams, Mr. Charles Eugene male NaN 0 \n", "18 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 \n", "19 Masselmani, Mrs. Fatima female NaN 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S \n", "5 0 330877 8.4583 NaN Q \n", "6 0 17463 51.8625 E46 S \n", "7 1 349909 21.0750 NaN S \n", "8 2 347742 11.1333 NaN S \n", "9 0 237736 30.0708 NaN C \n", "10 1 PP 9549 16.7000 G6 S \n", "11 0 113783 26.5500 C103 S \n", "12 0 A/5. 
2151 8.0500 NaN S \n", "13 5 347082 31.2750 NaN S \n", "14 0 350406 7.8542 NaN S \n", "15 0 248706 16.0000 NaN S \n", "16 1 382652 29.1250 NaN Q \n", "17 0 244373 13.0000 NaN S \n", "18 0 345763 18.0000 NaN S \n", "19 0 2649 7.2250 NaN C " ] }, "execution_count": 216, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# preview the data\n", "train_df[:20]" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "97f4e6f8-2fea-46c4-e4e8-b69062ee3d46", "_uuid": "c34fa51a38336d97d5f6a184908cca37daebd584" }, "source": [ "**Which features are mixed data types?**\n", "\n", "Numerical and alphanumeric data within the same feature. These are candidates for the correcting goal.\n", "\n", "- Ticket is a mix of numeric and alphanumeric data types. Cabin is alphanumeric.\n", "\n", "**Which features may contain errors or typos?**\n", "\n", "This is harder to review for a large dataset, but reviewing a few samples from a smaller dataset may tell us outright which features need correcting.\n", "\n", "- The Name feature may contain errors or typos, as there are several ways to write a name, including titles, round brackets, and quotes used for alternative or short names." ] }, { "cell_type": "code", "execution_count": 217, "metadata": { "_cell_guid": "f6e761c2-e2ff-d300-164c-af257083bb46", "_uuid": "3488e80f309d29f5b68bbcfaba8d78da84f4fb7d" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass Name \\\n", "886 887 0 2 Montvila, Rev. Juozas \n", "887 888 1 1 Graham, Miss. Margaret Edith \n", "888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n", "889 890 1 1 Behr, Mr. Karl Howell \n", "890 891 0 3 Dooley, Mr. Patrick \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin Embarked \n", "886 male 27.0 0 0 211536 13.00 NaN S \n", "887 female 19.0 0 0 112053 30.00 B42 S \n", "888 female NaN 1 2 W./C. 6607 23.45 NaN S \n", "889 male 26.0 0 0 111369 30.00 C148 C \n", "890 male 32.0 0 0 370376 7.75 NaN Q " ] }, "execution_count": 217, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.tail()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "8bfe9610-689a-29b2-26ee-f67cd4719079", "_uuid": "699c52b7a8d076ccd5ea5bc5d606313c558a6e8e" }, "source": [ "**Which features contain blank, null, or empty values?**\n", "\n", "These will require correcting.\n", "\n", "- The Cabin > Age > Embarked features contain a number of null values, in that order, for the training dataset.\n", "\n", "- Cabin > Age > Fare are incomplete in the test dataset (Fare is missing a single value).\n", "\n", "**What are the data types of the various features?**\n", "\n", "This helps us during the converting goal.\n", "\n", "- Seven features are integers or floats. Six in the case of the test dataset.\n", "- Five features are strings (object)."
] }, { "cell_type": "code", "execution_count": 218, "metadata": { "_cell_guid": "9b805f69-665a-2b2e-f31d-50d87d52865d", "_uuid": "817e1cf0ca1cb96c7a28bb81192d92261a8bf427" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 891 entries, 0 to 890\n", "Data columns (total 12 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 PassengerId 891 non-null int64 \n", " 1 Survived 891 non-null int64 \n", " 2 Pclass 891 non-null int64 \n", " 3 Name 891 non-null object \n", " 4 Sex 891 non-null object \n", " 5 Age 714 non-null float64\n", " 6 SibSp 891 non-null int64 \n", " 7 Parch 891 non-null int64 \n", " 8 Ticket 891 non-null object \n", " 9 Fare 891 non-null float64\n", " 10 Cabin 204 non-null object \n", " 11 Embarked 889 non-null object \n", "dtypes: float64(2), int64(5), object(5)\n", "memory usage: 83.7+ KB\n", "########################################\n", "\n", "RangeIndex: 418 entries, 0 to 417\n", "Data columns (total 11 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 PassengerId 418 non-null int64 \n", " 1 Pclass 418 non-null int64 \n", " 2 Name 418 non-null object \n", " 3 Sex 418 non-null object \n", " 4 Age 332 non-null float64\n", " 5 SibSp 418 non-null int64 \n", " 6 Parch 418 non-null int64 \n", " 7 Ticket 418 non-null object \n", " 8 Fare 417 non-null float64\n", " 9 Cabin 91 non-null object \n", " 10 Embarked 418 non-null object \n", "dtypes: float64(2), int64(4), object(5)\n", "memory usage: 36.0+ KB\n" ] } ], "source": [ "train_df.info()\n", "print('#'*40)\n", "test_df.info()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "859102e1-10df-d451-2649-2d4571e5f082", "_uuid": "2b7c205bf25979e3242762bfebb0e3eb2fd63010" }, "source": [ "**What is the distribution of numerical feature values across the samples?**\n", "\n", "This helps us determine, among other things, how representative the training dataset is for our study.\n", "\n", "- The total number of samples is 891, or 40% of the actual number of passengers on board the Titanic (2,224).\n", "- Survived is a categorical feature with 0 or 1 values.\n", "- Around 38% of the samples survived, which is roughly representative of the actual survival rate of 32%.\n", "- Most passengers (> 75%) did not travel with parents or children.\n", "- Nearly 30% of the passengers had siblings and/or a spouse aboard.\n", "- Fares varied significantly, with few passengers (<1%) paying as much as $512.\n", "- Few elderly passengers (<1%) were in the 65-80 age range." ] }, { "cell_type": "code", "execution_count": 219, "metadata": { "_cell_guid": "58e387fe-86e4-e068-8307-70e37fe3f37b", "_uuid": "380251a1c1e0b89147d321968dc739b6cc0eecf2" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass Age SibSp \\\n", "count 891.000000 891.000000 891.000000 714.000000 891.000000 \n", "mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n", "std 257.353842 0.486592 0.836071 14.526497 1.102743 \n", "min 1.000000 0.000000 1.000000 0.420000 0.000000 \n", "25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n", "50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n", "75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n", "max 891.000000 1.000000 3.000000 80.000000 8.000000 \n", "\n", " Parch Fare \n", "count 891.000000 891.000000 \n", "mean 0.381594 32.204208 \n", "std 0.806057 49.693429 \n", "min 0.000000 0.000000 \n", "25% 0.000000 7.910400 \n", "50% 0.000000 14.454200 \n", "75% 0.000000 31.000000 \n", "max 6.000000 512.329200 " ] }, "execution_count": 219, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.describe()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "5462bc60-258c-76bf-0a73-9adc00a2f493", "_uuid": "33bbd1709db622978c0c5879e7c5532d4734ade0" }, "source": [ "**What is the distribution of categorical features?**\n", "\n", "- Names are unique across the dataset (count=unique=891).\n", "- The Sex variable has two possible values, with 65% male (top=male, freq=577/count=891).\n", "- Cabin values have several duplicates across samples. That is, several passengers shared a cabin.\n", "- Embarked takes three possible values. Port S was used by most passengers (top=S).\n", "- The Ticket feature has a high ratio (about 24%) of duplicate values (unique=681)." ] }, { "cell_type": "code", "execution_count": 220, "metadata": { "_cell_guid": "8066b378-1964-92e8-1352-dcac934c6af3", "_uuid": "daa8663f577f9c1a478496cf14fe363570457191" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name Sex Ticket Cabin Embarked\n", "count 891 891 891 204 889\n", "unique 891 2 681 147 3\n", "top Daly, Mr. Eugene Patrick male 347082 G6 S\n", "freq 1 577 7 4 644" ] }, "execution_count": 220, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.describe(include=['O'])" ] }, { "cell_type": "code", "execution_count": 221, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 221, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import seaborn as sns\n", "# check for missing values\n", "sns.heatmap(train_df.isna(), cmap='gnuplot')" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "2cb22b88-937d-6f14-8b06-ea3361357889", "_uuid": "c1d35ebd89a0cf7d7b409470bbb9ecaffd2a9680" }, "source": [ "### Assumptions based on data analysis\n", "\n", "We arrive at the following assumptions based on the data analysis done so far. We may validate these assumptions further before taking appropriate actions.\n", "\n", "**Correlating.**\n", "\n", "We want to know how well each feature correlates with survival. We want to do this early in our project and match these quick correlations with modeled correlations later in the project.\n", "\n", "**Completing.**\n", "\n", "1. We may want to complete the Age feature, as it is definitely correlated with survival.\n", "2. We may want to complete the Embarked feature, as it may also correlate with survival or another important feature.\n", "\n", "**Correcting.**\n", "\n", "1. The Ticket feature may be dropped from our analysis, as it contains a high ratio of duplicates (about 24%) and there may be no correlation between Ticket and survival.\n", "2. The Cabin feature may be dropped, as it is highly incomplete, containing many null values in both the training and test datasets.\n", "3. PassengerId may be dropped from the training dataset, as it does not contribute to survival.\n", "4. The Name feature is relatively non-standard and may not contribute directly to survival, so it may be dropped.\n", "\n", "**Creating.**\n", "\n", "1. 
We may want to create a new feature called Family based on Parch and SibSp to get the total count of family members on board.\n", "2. We may want to engineer the Name feature to extract Title as a new feature.\n", "3. We may want to create a new feature for age bands. This turns a continuous numerical feature into an ordinal categorical feature.\n", "4. We may also want to create a fare range feature if it helps our analysis.\n", "\n", "**Classifying.**\n", "\n", "We may also add to our assumptions based on the problem description noted earlier.\n", "\n", "1. Women (Sex=female) were more likely to have survived.\n", "2. Children (Age < ?) were more likely to have survived. \n", "3. The upper-class passengers (Pclass=1) were more likely to have survived." ] }, { "cell_type": "code", "execution_count": 222, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Sex Survived\n", "female 1 233\n", " 0 81\n", "male 0 468\n", " 1 109\n", "Name: Survived, dtype: int64" ] }, "execution_count": 222, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_v = train_df.groupby('Sex')['Survived'].value_counts()\n", "df_v" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "6db63a30-1d86-266e-2799-dded03c45816", "_uuid": "946ee6ca01a3e4eecfa373ca00f88042b683e2ad" }, "source": [ "## Analyze by pivoting features\n", "\n", "To confirm some of our observations and assumptions, we can quickly analyze the feature correlations by pivoting features against each other. At this stage we can do so only for features that have no empty values. 
It also makes sense to do so only for features that are categorical (Sex), ordinal (Pclass), or discrete (SibSp, Parch).\n", "\n", "- **Pclass** We observe a significant correlation (>0.5) between Pclass=1 and Survived (classifying #3). We decide to include this feature in our model.\n", "- **Sex** We confirm the observation made during problem definition that Sex=female had a very high survival rate at 74% (classifying #1).\n", "- **SibSp and Parch** These features have zero correlation for certain values. It may be best to derive a feature or a set of features from these individual features (creating #1)." ] }, { "cell_type": "code", "execution_count": 223, "metadata": { "_cell_guid": "0964832a-a4be-2d6f-a89e-63526389cee9", "_uuid": "97a845528ce9f76e85055a4bb9e97c27091f6aa1" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Pclass Survived\n", "0 1 0.629630\n", "1 2 0.472826\n", "2 3 0.242363" ] }, "execution_count": 223, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)" ] }, { "cell_type": "code", "execution_count": 224, "metadata": { "_cell_guid": "68908ba6-bfe9-5b31-cfde-6987fc0fbe9a", "_uuid": "00a2f2bca094c5984e6a232c730c8b232e7e20bb" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Sex Survived\n", "0 female 0.742038\n", "1 male 0.188908" ] }, "execution_count": 224, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df[[\"Sex\", \"Survived\"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)" ] }, { "cell_type": "code", "execution_count": 225, "metadata": { "_cell_guid": "01c06927-c5a6-342a-5aa8-2e486ec3fd7c", "_uuid": "a8f7a16c54417dcd86fc48aeef0c4b240d47d71b" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " SibSp Survived\n", "1 1 0.535885\n", "2 2 0.464286\n", "0 0 0.345395\n", "3 3 0.250000\n", "4 4 0.166667\n", "5 5 0.000000\n", "6 8 0.000000" ] }, "execution_count": 225, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df[[\"SibSp\", \"Survived\"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)" ] }, { "cell_type": "code", "execution_count": 226, "metadata": { "_cell_guid": "e686f98b-a8c9-68f8-36a4-d4598638bbd5", "_uuid": "5d953a6779b00b7f3794757dec8744a03162c8fd" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Parch Survived\n", "3 3 0.600000\n", "1 1 0.550847\n", "2 2 0.500000\n", "0 0 0.343658\n", "5 5 0.200000\n", "4 4 0.000000\n", "6 6 0.000000" ] }, "execution_count": 226, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df[[\"Parch\", \"Survived\"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "0d43550e-9eff-3859-3568-8856570eff76", "_uuid": "5c6204d01f5a9040cf0bb7c678686ae48daa201f" }, "source": [ "## Analyze by visualizing data\n", "\n", "We can now continue confirming some of our assumptions by using visualizations to analyze the data.\n", "\n", "### Correlating numerical features\n", "\n", "Let us start by understanding the correlations between the numerical features and our target (Survived).\n", "\n", "A histogram is useful for analyzing continuous numerical variables such as Age, where bands or ranges help identify useful patterns. 
\n", "\n", "This helps us answer questions about specific bands (did children have a better survival rate?).\n", "\n", "Note that the y-axis in the histogram visualizations represents the number of samples, that is, passengers.\n", "\n", "**Observations.**\n", "\n", "- Infants (Age <= 4) had a high survival rate.\n", "- The oldest passengers (Age = 80) survived.\n", "- A large number of 15 to 25 year olds did not survive.\n", "- Most passengers are in the 15-35 age range.\n", "\n", "**Decisions.**\n", "\n", "This simple analysis confirms our assumptions and turns them into decisions for the next stages of the project.\n", "\n", "- We should consider Age (our classifying assumption #2) when training our model.\n", "- Complete the Age feature for null values (completing #1).\n", "- We should band the age groups (creating #3)." ] }, { "cell_type": "code", "execution_count": 227, "metadata": { "_cell_guid": "50294eac-263a-af78-cb7e-3778eb9ad41f", "_uuid": "d3a1fa63e9dd4f8a810086530a6363c94b36d030" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 227, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAagAAADQCAYAAABStPXYAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/d3fzzAAAACXBIWXMAAAsTAAALEwEAmpwYAAAQuUlEQVR4nO3dfZBddX3H8fdHQKngA8ElEwEb2zIo0vK0Kki11YgTH2poBQsVJ87gpH9gi62ODfWP6jid4kzH0anFMaPW+FAFUUomdoQ0QKsdBwkKSEQN1RSikSSoKE5HDXz7xz2BHbJhb3bv3fvbve/XzJ1zz7lPnw375Xt/v3P2nFQVkiS15gmjDiBJ0nRsUJKkJtmgJElNskFJkppkg5IkNckGJUlqkg1qniR5Z5KtSe5IcluSFw7ofV+bZO2A3uvBAbzHk5JcmeTuJDcnWT6AaBoTY1QnL0ny9SR7k5w3iFyL0aGjDjAOkpwFvAY4vap+meQZwBMP4vWHVtXe6R6rqg3AhsEkHYiLgZ9U1e8kuQB4L/CnI86kBWDM6uQe4E3A20eco2mOoObHMmBPVf0SoKr2VNUPAZJs7wqRJJNJburuvyvJuiTXA5/oRiPP2/eGSW5KckaSNyX5YJKnde/1hO7xJye5N8lhSX47yZeS3Jrky0me0z3n2Um+muSWJO8Z0M+6Cljf3b8aWJEkA3pvLW5jUydVtb2q7gAeHsT7LVY2qPlxPXB8ku8muSLJH/T5ujOAVVX1Z8BngdcDJFkGPLOqbt33xKp6ALgd2PfefwRcV1W/BtYBf1FVZ9D7xnZF95wPAB+qqucDPzpQiK5Yb5vm9vJpnn4scG+XaS/wAHB0nz+vxts41Yn64BTfPKiqB5OcAbwYeClwZZK1VfXxGV66oar+r7t/FbAJ+Dt6Bfi5aZ5/Jb3ptBuBC4ArkhwJvAj43JSBzJO65dnA67r7n6Q3HTdd/hfPkHOq6UZLnk9LMxqzOlEfbFDzpKoeAm4CbkryTWA18HFgL4+OZA9/zMt+MeX1P0hyf5Lfo1dcfz7Nx2wA/iHJEnrfKm8AjgB+WlWnHijaTNmTfBl4yjQPvb2q/uMx23YAxwM7khwKPA348UyfIcFY1Yn64BTfPEhyYpITpmw6Ffjf7v52ekUCj35LO5DPAu8AnlZV33zsg1X1IPA1elMSG6vqoar6GfD9JOd3WZLklO4l/03vGyTAGw70oVX14qo6dZrbdEW3gd7/VADOA24oz0isPoxZnagPNqj5cSSwPsm3ktwBnAS8q3vs3cAHum9fD83wPlfTK5SrHuc5VwIXdct93gBcnOR2YCu9AxkALgUuSXILvZHOIHwUODrJ3cBfAwM5tFdjYWzqJMnzk+wAzgc+nGTrIN53sYlfbiVJLXIEJUlqkg1KktQkG5QkqUk2KElSk+a1Qa1cubLo/T2BN2/jcJsV68TbGN6mNa8Nas+ePfP5cdKCZJ1IPU7xSZKaZIOSJDXJBiVJapINSpLUJBuUJKlJNihJUpO8HtSALV/7xcd9fPvlr56nJJK0sDmCkiQ1yQYlSWqSDUqS1CQblCSpSR4kMc88iEKS+uMISpLUJBuUJKlJNihJUpNsUJKkJtmgJElNskFJkprU12HmSbYDPwceAvZW1WSSJcCVwHJgO/D6qvrJcGLOHw8Dl6Q2HMwI6qVVdWpVTXbra4HNVXUCsLlblyRpIOYyxbcKWN/dXw+cO+c0kiR1+m1QBVyf5NYka7ptS6tqJ0C3PGa6FyZZk2RLki27d++ee2JpEbJOpP3126DOrqrTgVcClyR5Sb8fUFXrqmqyqiYnJiZmFVJa7KwTaX99Naiq+mG33AVcA7wAuC/JMoBuuWtYISVJ42fGBpXkiCRP2XcfeAVwJ7ABWN09bTVw7bBCSpLGTz+HmS8Frkmy7/n/WlVfSnILcFWSi4F7gPOHF1OSNG5mbFBV9T3glGm23w+sGEaols30d1KSpMHwTBKSpCbZoCRJTbJ
BSZKaZIOSJDXJBiVJapINSpLUJBuUJKlJNihJUpNsUJKkJtmgJElNskFJkppkg5IkNckGJUlqkg1KktQkG5QkqUk2KElSk/puUEkOSfKNJBu79SVJNiXZ1i2PGl5MSdK4OZgR1KXAXVPW1wKbq+oEYHO3LknSQPTVoJIcB7wa+MiUzauA9d399cC5A00mSRpr/Y6g3g+8A3h4yralVbUToFseM90Lk6xJsiXJlt27d88lq7RoWSfS/mZsUEleA+yqqltn8wFVta6qJqtqcmJiYjZvIS161om0v0P7eM7ZwGuTvAo4HHhqkk8B9yVZVlU7kywDdg0zqCRpvMw4gqqqy6rquKpaDlwA3FBVFwEbgNXd01YD1w4tpSRp7Mzl76AuB85Jsg04p1uXJGkg+pnie0RV3QTc1N2/H1gx+EiSJHkmCUlSo2xQkqQm2aAkSU2yQUmSmnRQB0lI0sFavvaLj/v49stfPU9JtNA4gpIkNckGJUlqklN8kpo30zRhP5xKXHgcQUmSmuQIagFxZ7OkceIISpLUJBuUJKlJNihJUpNsUJKkJtmgJElNskFJkpo0Y4NKcniSryW5PcnWJO/uti9JsinJtm551PDjSpLGRT8jqF8CL6uqU4BTgZVJzgTWApur6gRgc7cuSdJAzNigqufBbvWw7lbAKmB9t309cO4wAkqSxlNf+6CSHJLkNmAXsKmqbgaWVtVOgG55zNBSSpLGTl+nOqqqh4BTkzwduCbJyf1+QJI1wBqAZz3rWbPJOFYGcVJMLTzjXCf+zutADuoovqr6KXATsBK4L8kygG656wCvWVdVk1U1OTExMbe00iJlnUj76+covolu5ESS3wBeDnwb2ACs7p62Grh2SBklSWOonym+ZcD6JIfQa2hXVdXGJF8FrkpyMXAPcP4Qc0qSxsyMDaqq7gBOm2b7/cCKYYSSJMnrQS0iXi9K0mLiqY4kSU1yBCUtQP0cmj0fI2YPEdcwOYKSJDXJBiVJapINSpLUJBuUJKlJNihJUpNsUJKkJtmgJElNskFJkppkg5IkNckzSegRnstPUkscQUmSmmSDkiQ1yQYlSWqSDUqS1KQZG1SS45PcmOSuJFuTXNptX5JkU5Jt3fKo4ceVJI2LfkZQe4G3VdVzgTOBS5KcBKwFNlfVCcDmbl2SpIGYsUFV1c6q+np3/+fAXcCxwCpgffe09cC5Q8ooSRpDB7UPKsly4DTgZmBpVe2EXhMDjjnAa9Yk2ZJky+7du+cYV1qcrBNpf303qCRHAp8H3lpVP+v3dVW1rqomq2pyYmJiNhmlRc86kfbXV4NKchi95vTpqvpCt/m+JMu6x5cBu4YTUZI0jvo5ii/AR4G7qup9Ux7aAKzu7q8Grh18PEnSuOrnXHxnA28Evpnktm7b3wKXA1cluRi4Bzh/KAklSWNpxgZVVV8BcoCHVww2jiRJPZ5JQpLUJBuUJKlJXg9qjMx0vSdpMevn999rnrXFEZQkqUk2KElSk2xQkqQm2aAkSU3yIAn1baadzO5gXnw8sEaj5AhKktQkR1CSNEDONAyOIyhJUpNsUJKkJjU5xecQWZLkCEqS1KQmR1CSNAoeVt8WR1CSpCb1c8n3jyXZleTOKduWJNmUZFu3PGq4MSVJ46afKb6PAx8EPjFl21pgc1VdnmRtt/43g4938DzAQpIWhxlHUFX1X8CPH7N5FbC+u78eOHewsSRJ4262+6CWVtVOgG55zIGemGRNki1JtuzevXuWHyctbtaJtL+hHyRRVeuqarKqJicmJob9cdKCZJ1I+5ttg7ovyTKAbrlrcJEkSZr930FtAFYDl3fLaweWSNJAeMCQFrp+DjP/DPBV4MQkO5JcTK8xnZNkG3BOty5J0sDMOIKqqgsP8NCKAWfRIua3eUkHyzNJSJKaZIOSJDXJk8VqYOZyok2nADUu+qkTf997HEFJkppkg5IkNckpPi0ITgFK48cRlCSpSQtyBDXMnfGSNGrOGPQ4gpI
kNckGJUlq0oKc4pMOllMm+3O6e+Eal7+lcgQlSWqSDUqS1CQblCSpSTYoSVKTPEhCi4I7/KXFxxGUJKlJcxpBJVkJfAA4BPhIVXnpd0kaE8M+3H3WI6gkhwD/DLwSOAm4MMlJs04iSdIUc5niewFwd1V9r6p+BXwWWDWYWJKkcZeqmt0Lk/OAlVX15m79jcALq+otj3neGmBNt3oi8J3HedtnAHtmFWj+mXU4FlPWPVW1sp83sk6aYNbh6CfrtLUyl31QmWbbft2uqtYB6/p6w2RLVU3OIdO8MetwjGtW62T0zDocc8k6lym+HcDxU9aPA344h/eTJOkRc2lQtwAnJHl2kicCFwAbBhNLkjTuZj3FV1V7k7wFuI7eYeYfq6qtc8zT1xRHI8w6HGZt93Nnw6zDMRZZZ32QhCRJw+SZJCRJTbJBSZKa1ESDSrIyyXeS3J1k7ajzTJXk+CQ3JrkrydYkl3bblyTZlGRbtzxq1Fn3SXJIkm8k2ditN5k1ydOTXJ3k292/71kNZ/2r7r//nUk+k+TwUWRttVask+EZ5zoZeYNaAKdM2gu8raqeC5wJXNLlWwtsrqoTgM3deisuBe6ast5q1g8AX6qq5wCn0MvcXNYkxwJ/CUxW1cn0Dgq6gHnO2nitWCfDM751UlUjvQFnAddNWb8MuGzUuR4n77XAOfT+0n9Zt20Z8J1RZ+uyHNf9ErwM2Nhtay4r8FTg+3QH6kzZ3mLWY4F7gSX0jnzdCLxivrMupFqxTgaWc6zrZOQjKB79ofbZ0W1rTpLlwGnAzcDSqtoJ0C2PGWG0qd4PvAN4eMq2FrP+FrAb+JdumuUjSY6gwaxV9QPgH4F7gJ3AA1V1PfOfdUHUinUyUGNdJy00qL5OmTRqSY4EPg+8tap+Nuo800nyGmBXVd066ix9OBQ4HfhQVZ0G/IIGpimm082ZrwKeDTwTOCLJRaOIMs22pmrFOhm4sa6TFhpU86dMSnIYvaL7dFV9odt8X5Jl3ePLgF2jyjfF2cBrk2ynd3b5lyX5FG1m3QHsqKqbu/Wr6RVii1lfDny/qnZX1a+BLwAvYv6zNl0r1slQjHWdtNCgmj5lUpIAHwXuqqr3TXloA7C6u7+a3pz7SFXVZVV1XFUtp/fveENVXUSbWX8E3JvkxG7TCuBbNJiV3pTFmUme3P0+rKC3o3q+szZbK9bJcIx9nYx6x1q34+xVwHeB/wHeOeo8j8n2+/SmUe4AbuturwKOpreTdVu3XDLqrI/J/Yc8uvO3yazAqcCW7t/234CjGs76buDbwJ3AJ4EnjSJrq7VinQw149jWiac6kiQ1qYUpPkmS9mODkiQ1yQYlSWqSDUqS1CQblCSpSTaoRSDJHyepJM8ZdRapZdbKwmKDWhwuBL5C748OJR2YtbKA2KAWuO7cZ2cDF9MVXZInJLmiuy7LxiT/nuS87rEzkvxnkluTXLfvFCTSYmetLDw2qIXvXHrXivku8OMkpwN/AiwHfhd4M73LNOw7V9o/AedV1RnAx4C/H0FmaRTOxVpZUA4ddQDN2YX0Lh0AvRNfXggcBnyuqh4GfpTkxu7xE4GTgU29U2VxCL3T4kvjwFpZYGxQC1iSo+ldcO3kJEWviAq45kAvAbZW1VnzFFFqgrWyMDnFt7CdB3yiqn6zqpZX1fH0rr65B3hdN7++lN4JMaF3ZcuJJI9MYyR53iiCS/PMWlmAbFAL24Xs/w3w8/QuFraD3hmFP0zvyqYPVNWv6BXqe5PcTu+M0y+at7TS6FgrC5BnM1+kkhxZVQ92UxtfA86u3rVlJE1hrbTLfVCL18YkTweeCLzHgpMOyFpplCMoSVKT3AclSWqSDUqS1CQblCSpSTYoSVKTbFCSpCb9P81FgQhLzgCrAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "g = sns.FacetGrid(train_df, col='Survived')\n", "g.map(plt.hist, 'Age', bins=20)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "87096158-4017-9213-7225-a19aea67a800", "_uuid": "892259f68c2ecf64fd258965cff1ecfe77dd73a9" }, "source": [ "### Corrélation entre les caractéristiques numériques et ordinales\n", "\n", "Nous pouvons combiner plusieurs caractéristiques pour identifier des corrélations en utilisant une seule parcelle. Cela peut être fait avec des caractéristiques numériques et catégorielles qui ont des valeurs numériques.\n", "\n", "**Observations.**\n", "\n", "- La classe P=3 a accueilli la plupart des passagers, mais la plupart n'ont pas survécu. Confirme notre hypothèse de classification n°2.\n", "- Les passagers en bas âge en classe P=2 et P=3 ont pour la plupart survécu. Ce qui confirme notre hypothèse de classification n°2.\n", "- La plupart des passagers de la classe P=1 ont survécu. Confirme notre hypothèse de classement n° 3.\n", "- La classe P varie en fonction de la répartition des passagers par âge.\n", "\n", "**Décisions.\n", "\n", "- Considérer la classe P pour l'entrainement du modèles." ] }, { "cell_type": "code", "execution_count": 228, "metadata": { "_cell_guid": "916fdc6b-0190-9267-1ea9-907a3d87330d", "_uuid": "4f5bcfa97c8a72f8b413c786954f3a68e135e05a" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/rodrique/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py:316: UserWarning: The `size` parameter has been renamed to `height`; please update your code.\n", " warnings.warn(msg, UserWarning)\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# grid = sns.FacetGrid(train_df, col='Pclass', hue='Survived')\n", "grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', size=2.2, aspect=1.6)\n", "grid.map(plt.hist, 'Age', alpha=.5, bins=20)\n", "grid.add_legend();" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "36f5a7c0-c55c-f76f-fdf8-945a32a68cb0", "_uuid": "892ab7ee88b1b1c5f1ac987884fa31e111bb0507" }, "source": [ "### Corrélation des caractéristiques catégorielles\n", "\n", "Nous pouvons maintenant corréler les caractéristiques catégorielles.\n", "\n", "**Observations.**\n", "\n", "- Les passagers féminins avaient un taux de survie bien plus élevé que les passagers masculins. Confirme la classification (#1).\n", "- Exception dans Embarqué=C où les hommes avaient un taux de survie plus élevé. Il pourrait s'agir d'une corrélation entre la classe P et Embarqué et, à son tour, la classe P et Survécu, mais pas nécessairement d'une corrélation directe entre Embarqué et Survécu.\n", "- Les hommes ont eu un meilleur taux de survie dans la classe P=3 par rapport à la classe P=2 pour les ports C et Q. Achèvement (#2).\n", "- Les ports d'embarquement ont des taux de survie variables pour la classe P=3 et parmi les passagers masculins. Corrélation (#1).\n", "\n", "**Décisions.\n", "\n", "- Ajouter la fonction Sexe à la formation des modèles.\n", "- Compléter et ajouter la fonction Embarqué à la formation du modèle." 
] }, { "cell_type": "code", "execution_count": 229, "metadata": { "_cell_guid": "db57aabd-0e26-9ff9-9ebd-56d401cdf6e8", "_uuid": "c0e1f01b3f58e8f31b938b0e5eb1733132edc8ad" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/rodrique/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py:316: UserWarning: The `size` parameter has been renamed to `height`; please update your code.\n", " warnings.warn(msg, UserWarning)\n", "/home/rodrique/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py:645: UserWarning: Using the pointplot function without specifying `order` is likely to produce an incorrect plot.\n", " warnings.warn(warning)\n", "/home/rodrique/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py:650: UserWarning: Using the pointplot function without specifying `hue_order` is likely to produce an incorrect plot.\n", " warnings.warn(warning)\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 229, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# grid = sns.FacetGrid(train_df, col='Embarked')\n", "grid = sns.FacetGrid(train_df, row='Embarked', size=2.2, aspect=1.6)\n", "grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')\n", "grid.add_legend()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "6b3f73f4-4600-c1ce-34e0-bd7d9eeb074a", "_uuid": "fd824f937dcb80edd4117a2927cc0d7f99d934b8" }, "source": [ "### Correlating categorical and numerical features\n", "\n", "We may also want to correlate categorical features with numerical ones. We can consider correlating Embarked (categorical, non-numeric), Sex (categorical, non-numeric), and Fare (continuous numeric) with Survived (categorical, numeric).\n", "\n", "**Observations.**\n", "\n", "- Passengers who paid a higher fare had better survival. This confirms our assumption for creating (#4) fare ranges.\n", "- The port of embarkation correlates with survival rates. This confirms correlating (#1) and completing (#2).\n", "\n", "**Decisions.**\n", "\n", "- Consider adding the Fare feature." 
] }, { "cell_type": "code", "execution_count": 230, "metadata": { "_cell_guid": "a21f66ac-c30d-f429-cc64-1da5460d16a9", "_uuid": "c8fd535ac1bc90127369027c2101dbc939db118e" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/rodrique/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py:316: UserWarning: The `size` parameter has been renamed to `height`; please update your code.\n", " warnings.warn(msg, UserWarning)\n", "/home/rodrique/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py:645: UserWarning: Using the barplot function without specifying `order` is likely to produce an incorrect plot.\n", " warnings.warn(warning)\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 230, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# grid = sns.FacetGrid(train_df, col='Embarked', hue='Survived', palette={0: 'k', 1: 'w'})\n", "grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', size=2.2, aspect=1.6)\n", "grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)\n", "grid.add_legend()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "cfac6291-33cc-506e-e548-6cad9408623d", "_uuid": "73a9111a8dc2a6b8b6c78ef628b6cae2a63fc33f" }, "source": [ "## Data wrangling\n", "\n", "We have collected several assumptions and decisions about our datasets, along with suggestions on which variables could help build a good model. So far we have not had to change a single feature or value to get there.\n", "\n", "### Correcting by dropping features\n", "\n", "This is a good first goal. By dropping features we deal with less data, which speeds up our notebook and eases the analysis.\n", "\n", "Based on our assumptions and decisions, we want to drop the Cabin (correcting #2) and Ticket (correcting #1) features.\n", "\n", "Note that, where applicable, we perform the operations on both the training and test datasets together to stay consistent." 
] }, { "cell_type": "code", "execution_count": 231, "metadata": { "_cell_guid": "da057efe-88f0-bf49-917b-bb2fec418ed9", "_uuid": "e328d9882affedcfc4c167aa5bb1ac132547558c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before (891, 12) (418, 11) (891, 12) (418, 11)\n" ] }, { "data": { "text/plain": [ "('After', (891, 10), (418, 9), (891, 10), (418, 9))" ] }, "execution_count": 231, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"Before\", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)\n", "\n", "train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)\n", "test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)\n", "combine = [train_df, test_df]\n", "\n", "\"After\", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "6b3a1216-64b6-7fe2-50bc-e89cc964a41c", "_uuid": "21d5c47ee69f8fbef967f6f41d736b5d4eb6596f" }, "source": [ "### Creating a new feature from an existing one\n", "\n", "We want to check whether the Name feature can be engineered to extract titles, and test the correlation between titles and survival, before dropping the Name and PassengerId features.\n", "\n", "In the following code we extract the Title feature using regular expressions. The regex pattern `([A-Za-z]+)\\.` matches the first word that ends with a dot in the Name feature and captures it without the dot.\n", "\n", "**Observations.**\n", "\n", "When we compare Title, Age, and Survived, we note the following observations.\n", "\n", "- Most titles band age groups accurately. 
For example, the Master title has a mean age of 5 years.\n", "- Survival varies slightly across the age bands of each title.\n", "- Passengers with certain titles survived (Mme, Lady, Sir) or did not (Don, Rev, Jonkheer).\n", "\n", "**Decision.**\n", "\n", "- We decide to keep the new Title feature for model building." ] }, { "cell_type": "code", "execution_count": 232, "metadata": { "_cell_guid": "df7f0cd4-992c-4a79-fb19-bf6f0c024d4b", "_uuid": "c916644bd151f3dc8fca900f656d415b4c55e2bc" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Sex female male\n", "Title \n", "Capt 0 1\n", "Col 0 2\n", "Countess 1 0\n", "Don 0 1\n", "Dr 1 6\n", "Jonkheer 0 1\n", "Lady 1 0\n", "Major 0 2\n", "Master 0 40\n", "Miss 182 0\n", "Mlle 2 0\n", "Mme 1 0\n", "Mr 0 517\n", "Mrs 125 0\n", "Ms 1 0\n", "Rev 0 6\n", "Sir 0 1" ] }, "execution_count": 232, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for dataset in combine:\n", " dataset['Title'] = dataset.Name.str.extract('([A-Za-z]+)\\.', expand=False)\n", "\n", "pd.crosstab(train_df['Title'], train_df['Sex'])" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "908c08a6-3395-19a5-0cd7-13341054012a", "_uuid": "f766d512ea5bfe60b5eb7a816f482f2ab688fd2f" }, "source": [ "We can replace many titles with a more common name or classify them as `Rare`." ] }, { "cell_type": "code", "execution_count": 233, "metadata": { "_cell_guid": "553f56d7-002a-ee63-21a4-c0efad10cfe9", "_uuid": "b8cd938fba61fb4e226c77521b012f4bb8aa01d0" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Title Survived\n", "0 Master 0.575000\n", "1 Miss 0.702703\n", "2 Mr 0.156673\n", "3 Mrs 0.793651\n", "4 Rare 0.347826" ] }, "execution_count": 233, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for dataset in combine:\n", " dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col',\\\n", " \t'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')\n", "\n", " dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')\n", " dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')\n", " dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')\n", " \n", "train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "6d46be9a-812a-f334-73b9-56ed912c9eca", "_uuid": "de245fe76474d46995a5acc31b905b8aaa5893f6" }, "source": [ "Let us convert the categorical titles to ordinal values." ] }, { "cell_type": "code", "execution_count": 234, "metadata": { "_cell_guid": "67444ebc-4d11-bac1-74a6-059133b6e2e8", "_uuid": "e805ad52f0514497b67c3726104ba46d361eb92c" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Fare Embarked Title \n", "0 0 7.2500 S 1 \n", "1 0 71.2833 C 3 \n", "2 0 7.9250 S 2 \n", "3 0 53.1000 S 3 \n", "4 0 8.0500 S 1 " ] }, "execution_count": 234, "metadata": {}, "output_type": "execute_result" } ], "source": [ "title_mapping = {\"Mr\": 1, \"Miss\": 2, \"Mrs\": 3, \"Master\": 4, \"Rare\": 5}\n", "for dataset in combine:\n", " dataset['Title'] = dataset['Title'].map(title_mapping)\n", " dataset['Title'] = dataset['Title'].fillna(0)\n", "\n", "train_df.head()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "f27bb974-a3d7-07a1-f7e4-876f6da87e62", "_uuid": "5fefaa1b37c537dda164c87a757fe705a99815d9" }, "source": [ "We can now safely drop the \"Name\" column from the training and test datasets. We also do not need the PassengerId feature in the training dataset." 
] }, { "cell_type": "code", "execution_count": 235, "metadata": { "_cell_guid": "9d61dded-5ff0-5018-7580-aecb4ea17506", "_uuid": "1da299cf2ffd399fd5b37d74fb40665d16ba5347" }, "outputs": [ { "data": { "text/plain": [ "((891, 9), (418, 9))" ] }, "execution_count": 235, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df = train_df.drop(['Name', 'PassengerId'], axis=1)\n", "test_df = test_df.drop(['Name'], axis=1)\n", "combine = [train_df, test_df]\n", "train_df.shape, test_df.shape" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "2c8e84bb-196d-bd4a-4df9-f5213561b5d3", "_uuid": "a1ac66c79b279d94860e66996d3d8dba801a6d9a" }, "source": [ "### Converting 'object' features to numeric\n", "\n", "We can now convert the features that contain strings to numeric values. Most modeling algorithms require this. It will also help us reach the goal of completing the features.\n", "\n", "Let us start by converting the Sex feature in place to numeric values, where female=1 and male=0." ] }, { "cell_type": "code", "execution_count": 236, "metadata": { "_cell_guid": "c20c1df2-157c-e5a0-3e24-15a828095c96", "_uuid": "840498eaee7baaca228499b0a5652da9d4edaf37" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass Sex Age SibSp Parch Fare Embarked Title\n", "0 0 3 0 22.0 1 0 7.2500 S 1\n", "1 1 1 1 38.0 1 0 71.2833 C 3\n", "2 1 3 1 26.0 0 0 7.9250 S 2\n", "3 1 1 1 35.0 1 0 53.1000 S 3\n", "4 0 3 0 35.0 0 0 8.0500 S 1" ] }, "execution_count": 236, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for dataset in combine:\n", " dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)\n", "\n", "train_df.head()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "d72cb29e-5034-1597-b459-83a9640d3d3a", "_uuid": "6da8bfe6c832f4bd2aa1312bdd6b8b4af48a012e" }, "source": [ "### Completing a continuous numerical feature\n", "\n", "We should now start estimating and completing the features that have missing or null values. We will do this first for the Age feature.\n", "\n", "We can consider three methods for completing a continuous numerical feature.\n", "\n", "1. A simple method is to generate random numbers within one [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) of the mean.\n", "\n", "2. A more accurate way of guessing missing values is to use other correlated features. In our case we note a correlation between Age, Sex, and Pclass. Guess the Age values using the [median](https://en.wikipedia.org/wiki/Median) Age across each combination of Pclass and Sex. That is, the median Age for Pclass=1 and Sex=0, Pclass=1 and Sex=1, and so on.\n", "\n", "3. Combine methods 1 and 2. Instead of guessing Age values from the median, use random numbers within one standard deviation of the mean, computed for each combination of Pclass and Sex.\n", "\n", "Methods 1 and 3 will introduce random noise into our models. 
Results from multiple executions may vary. We will prefer method 2." ] }, { "cell_type": "code", "execution_count": 237, "metadata": { "_cell_guid": "c311c43d-6554-3b52-8ef8-533ca08b2f68", "_uuid": "345038c8dd1bac9a9bc5e2cfee13fcc1f833eee0" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/rodrique/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py:316: UserWarning: The `size` parameter has been renamed to `height`; please update your code.\n", " warnings.warn(msg, UserWarning)\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 237, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# grid = sns.FacetGrid(train_df, col='Pclass', hue='Gender')\n", "grid = sns.FacetGrid(train_df, row='Pclass', col='Sex', size=2.2, aspect=1.6)\n", "grid.map(plt.hist, 'Age', alpha=.5, bins=20)\n", "grid.add_legend()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "a4f166f9-f5f9-1819-66c3-d89dd5b0d8ff", "_uuid": "6b22ac53d95c7979d5f4580bd5fd29d27155c347" }, "source": [ "Let us start by preparing a zero-filled array to hold the guessed Age values for each Sex x Pclass combination." ] }, { "cell_type": "code", "execution_count": 191, "metadata": { "_cell_guid": "9299523c-dcf1-fb00-e52f-e2fb860a3920", "_uuid": "24a0971daa4cbc3aa700bae42e68c17ce9f3a6e2" }, "outputs": [ { "data": { "text/plain": [ "array([[0., 0., 0.],\n", " [0., 0., 0.]])" ] }, "execution_count": 191, "metadata": {}, "output_type": "execute_result" } ], "source": [ "guess_ages = np.zeros((2,3))\n", "guess_ages" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "ec9fed37-16b1-5518-4fa8-0a7f579dbc82", "_uuid": "8acd90569767b544f055d573bbbb8f6012853385" }, "source": [ "Now we iterate over Sex (0 or 1) and Pclass (1, 2, 3) to calculate guessed values of Age for the six combinations." ] }, { "cell_type": "code", "execution_count": 192, "metadata": { "_cell_guid": "a4015dfa-a0ab-65bc-0cbe-efecf1eb2569", "_uuid": "31198f0ad0dbbb74290ebe135abffa994b8f58f3" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassSexAgeSibSpParchFareEmbarkedTitle
003022107.2500S1
1111381071.2833C3
213126007.9250S2
3111351053.1000S3
403035008.0500S1
\n", "
" ], "text/plain": [ " Survived Pclass Sex Age SibSp Parch Fare Embarked Title\n", "0 0 3 0 22 1 0 7.2500 S 1\n", "1 1 1 1 38 1 0 71.2833 C 3\n", "2 1 3 1 26 0 0 7.9250 S 2\n", "3 1 1 1 35 1 0 53.1000 S 3\n", "4 0 3 0 35 0 0 8.0500 S 1" ] }, "execution_count": 192, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for dataset in combine:\n", " for i in range(0, 2):\n", " for j in range(0, 3):\n", " guess_df = dataset[(dataset['Sex'] == i) & \\\n", " (dataset['Pclass'] == j+1)]['Age'].dropna()\n", "\n", " # age_mean = guess_df.mean()\n", " # age_std = guess_df.std()\n", " # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)\n", "\n", " age_guess = guess_df.median()\n", "\n", " # Convert random age float to nearest .5 age\n", " guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5\n", " \n", " for i in range(0, 2):\n", " for j in range(0, 3):\n", " dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\\\n", " 'Age'] = guess_ages[i,j]\n", "\n", " dataset['Age'] = dataset['Age'].astype(int)\n", "#guess_df\n", "train_df.head()\n", "\n", "# ex: si nous avons 10 femmes en 1ère classe, et que l'âge median de ces 10 femmes == 30. \n", "# Alors pour toute femme en 1ère \n", "# classe dont l'âge n'est pas d\"finit prendra la valeur 30." ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "dbe0a8bf-40bc-c581-e10e-76f07b3b71d4", "_uuid": "e7c52b44b703f28e4b6f4ddba67ab65f40274550" }, "source": [ "Créons des intervalles d'âge et déterminons les corrélations avec Survived." ] }, { "cell_type": "code", "execution_count": 193, "metadata": { "_cell_guid": "725d1c84-6323-9d70-5812-baf9994d3aa1", "_uuid": "5c8b4cbb302f439ef0d6278dcfbdafd952675353" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeBandSurvived
0(-0.08, 16.0]0.550000
1(16.0, 32.0]0.337374
2(32.0, 48.0]0.412037
3(48.0, 64.0]0.434783
4(64.0, 80.0]0.090909
\n", "
" ], "text/plain": [ " AgeBand Survived\n", "0 (-0.08, 16.0] 0.550000\n", "1 (16.0, 32.0] 0.337374\n", "2 (32.0, 48.0] 0.412037\n", "3 (48.0, 64.0] 0.434783\n", "4 (64.0, 80.0] 0.090909" ] }, "execution_count": 193, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df['AgeBand'] = pd.cut(train_df['Age'], 5)\n", "train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "ba4be3a0-e524-9c57-fbec-c8ecc5cde5c6", "_uuid": "856392dd415ac14ab74a885a37d068fc7a58f3a5" }, "source": [ "Remplaçons l'âge par des ordinaux basés sur ces intervalles. #Binning" ] }, { "cell_type": "code", "execution_count": 194, "metadata": { "_cell_guid": "797b986d-2c45-a9ee-e5b5-088de817c8b2", "_uuid": "ee13831345f389db407c178f66c19cc8331445b0" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassSexAgeSibSpParchFareEmbarkedTitleAgeBand
00301107.2500S1(16.0, 32.0]
111121071.2833C3(32.0, 48.0]
21311007.9250S2(16.0, 32.0]
311121053.1000S3(32.0, 48.0]
40302008.0500S1(32.0, 48.0]
\n", "
" ], "text/plain": [ " Survived Pclass Sex Age SibSp Parch Fare Embarked Title \\\n", "0 0 3 0 1 1 0 7.2500 S 1 \n", "1 1 1 1 2 1 0 71.2833 C 3 \n", "2 1 3 1 1 0 0 7.9250 S 2 \n", "3 1 1 1 2 1 0 53.1000 S 3 \n", "4 0 3 0 2 0 0 8.0500 S 1 \n", "\n", " AgeBand \n", "0 (16.0, 32.0] \n", "1 (32.0, 48.0] \n", "2 (16.0, 32.0] \n", "3 (32.0, 48.0] \n", "4 (32.0, 48.0] " ] }, "execution_count": 194, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for dataset in combine: \n", " dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0\n", " dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1\n", " dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2\n", " dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3\n", " dataset.loc[ dataset['Age'] > 64, 'Age']\n", "train_df.head()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "004568b6-dd9a-ff89-43d5-13d4e9370b1d", "_uuid": "8e3fbc95e0fd6600e28347567416d3f0d77a24cc" }, "source": [ "Supprimons la colonne AgeBand." ] }, { "cell_type": "code", "execution_count": 195, "metadata": { "_cell_guid": "875e55d4-51b0-5061-b72c-8a23946133a3", "_uuid": "1ea01ccc4a24e8951556d97c990aa0136da19721" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassSexAgeSibSpParchFareEmbarkedTitle
00301107.2500S1
111121071.2833C3
21311007.9250S2
311121053.1000S3
40302008.0500S1
\n", "
" ], "text/plain": [ " Survived Pclass Sex Age SibSp Parch Fare Embarked Title\n", "0 0 3 0 1 1 0 7.2500 S 1\n", "1 1 1 1 2 1 0 71.2833 C 3\n", "2 1 3 1 1 0 0 7.9250 S 2\n", "3 1 1 1 2 1 0 53.1000 S 3\n", "4 0 3 0 2 0 0 8.0500 S 1" ] }, "execution_count": 195, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df = train_df.drop(['AgeBand'], axis=1)\n", "combine = [train_df, test_df]\n", "train_df.head()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "1c237b76-d7ac-098f-0156-480a838a64a9", "_uuid": "e3d4a2040c053fbd0486c8cfc4fec3224bd3ebb3" }, "source": [ "### \n", "\n", "Nous pouvons créer une nouvelle variable pour FamilySize qui combine Parch et SibSp. Cela nous permettra de supprimer Parch et SibSp de nos ensembles de données." ] }, { "cell_type": "code", "execution_count": 238, "metadata": { "_cell_guid": "7e6c04ed-cfaa-3139-4378-574fd095d6ba", "_uuid": "33d1236ce4a8ab888b9fac2d5af1c78d174b32c7" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FamilySizeSurvived
340.724138
230.578431
120.552795
670.333333
010.303538
450.200000
560.136364
780.000000
8110.000000
\n", "
" ], "text/plain": [ " FamilySize Survived\n", "3 4 0.724138\n", "2 3 0.578431\n", "1 2 0.552795\n", "6 7 0.333333\n", "0 1 0.303538\n", "4 5 0.200000\n", "5 6 0.136364\n", "7 8 0.000000\n", "8 11 0.000000" ] }, "execution_count": 238, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for dataset in combine:\n", " dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1\n", "\n", "train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "842188e6-acf8-2476-ccec-9e3451e4fa86", "_uuid": "67f8e4474cd1ecf4261c153ce8b40ea23cf659e4" }, "source": [ "Nous pouvons créer une autre caractéristique IsAlone pour voir ceux qui ont suvecu en étant seul" ] }, { "cell_type": "code", "execution_count": 239, "metadata": { "_cell_guid": "5c778c69-a9ae-1b6b-44fe-a0898d07be7a", "_uuid": "3b8db81cc3513b088c6bcd9cd1938156fe77992f" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
IsAloneSurvived
000.505650
110.303538
\n", "
" ], "text/plain": [ " IsAlone Survived\n", "0 0 0.505650\n", "1 1 0.303538" ] }, "execution_count": 239, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for dataset in combine:\n", " dataset['IsAlone'] = 0\n", " dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1\n", "\n", "train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "e6b87c09-e7b2-f098-5b04-4360080d26bc", "_uuid": "3da4204b2c78faa54a94bbad78a8aa85fbf90c87" }, "source": [ "Supprimons maintenant Parch, SibSp, and FamilySize en faveur de IsAlone." ] }, { "cell_type": "code", "execution_count": 198, "metadata": { "_cell_guid": "74ee56a6-7357-f3bc-b605-6c41f8aa6566", "_uuid": "1e3479690ef7cd8ee10538d4f39d7117246887f0" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassSexAgeFareEmbarkedTitleIsAlone
003017.2500S10
1111271.2833C30
213117.9250S21
3111253.1000S30
403028.0500S11
\n", "
" ], "text/plain": [ " Survived Pclass Sex Age Fare Embarked Title IsAlone\n", "0 0 3 0 1 7.2500 S 1 0\n", "1 1 1 1 2 71.2833 C 3 0\n", "2 1 3 1 1 7.9250 S 2 1\n", "3 1 1 1 2 53.1000 S 3 0\n", "4 0 3 0 2 8.0500 S 1 1" ] }, "execution_count": 198, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)\n", "test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)\n", "combine = [train_df, test_df]\n", "\n", "train_df.head()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "f890b730-b1fe-919e-fb07-352fbd7edd44", "_uuid": "71b800ed96407eba05220f76a1288366a22ec887" }, "source": [ "Nous pouvons créer ue variable artificielle comme Age*Class, qui regroupe l'age et la class" ] }, { "cell_type": "code", "execution_count": 199, "metadata": { "_cell_guid": "305402aa-1ea1-c245-c367-056eef8fe453", "_uuid": "aac2c5340c06210a8b0199e15461e9049fbf2cff" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Age*ClassAgePclass
0313
1221
2313
3221
4623
5313
6331
7003
8313
9002
\n", "
" ], "text/plain": [ " Age*Class Age Pclass\n", "0 3 1 3\n", "1 2 2 1\n", "2 3 1 3\n", "3 2 2 1\n", "4 6 2 3\n", "5 3 1 3\n", "6 3 3 1\n", "7 0 0 3\n", "8 3 1 3\n", "9 0 0 2" ] }, "execution_count": 199, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for dataset in combine:\n", " dataset['Age*Class'] = dataset.Age * dataset.Pclass\n", "\n", "train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "13292c1b-020d-d9aa-525c-941331bb996a", "_uuid": "8264cc5676db8cd3e0b3e3f078cbaa74fd585a3c" }, "source": [ "### Completing a categorical feature\n", "\n", "La caractéristique embarquée prend les valeurs S, Q, C en fonction de la porte d'embarquement. Deux valeurs manquent dans notre ensemble de données d'entraînement. Nous les remplissons simplement avec l'occurrence la plus courante." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "_cell_guid": "bf351113-9b7f-ef56-7211-e8dd00665b18", "_uuid": "1e3f8af166f60a1b3125a6b046eff5fff02d63cf" }, "outputs": [ { "data": { "text/plain": [ "'S'" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "freq_port = train_df.Embarked.dropna().mode()[0]\n", "freq_port" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "_cell_guid": "51c21fcc-f066-cd80-18c8-3d140be6cbae", "_uuid": "d85b5575fb45f25749298641f6a0a38803e1ff22" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
EmbarkedSurvived
0C0.553571
1Q0.389610
2S0.339009
\n", "
" ], "text/plain": [ " Embarked Survived\n", "0 C 0.553571\n", "1 Q 0.389610\n", "2 S 0.339009" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for dataset in combine:\n", " dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)\n", " \n", "train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "f6acf7b2-0db3-e583-de50-7e14b495de34", "_uuid": "d8830e997995145314328b6218b5606df04499b0" }, "source": [ "### Converting categorical feature to numeric\n", "\n", "Nous pouvons maintenant convertir la variable Embarked en remplacant par de variable numérique." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "_cell_guid": "89a91d76-2cc0-9bbb-c5c5-3c9ecae33c66", "_uuid": "e480a1ef145de0b023821134896391d568a6f4f9" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassSexAgeFareEmbarkedTitleIsAloneAge*Class
003017.25000103
1111271.28331302
213117.92500213
3111253.10000302
403028.05000116
\n", "
" ], "text/plain": [ " Survived Pclass Sex Age Fare Embarked Title IsAlone Age*Class\n", "0 0 3 0 1 7.2500 0 1 0 3\n", "1 1 1 1 2 71.2833 1 3 0 2\n", "2 1 3 1 1 7.9250 0 2 1 3\n", "3 1 1 1 2 53.1000 0 3 0 2\n", "4 0 3 0 2 8.0500 0 1 1 6" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for dataset in combine:\n", " dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)\n", "\n", "train_df.head()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "e3dfc817-e1c1-a274-a111-62c1c814cecf", "_uuid": "d79834ebc4ab9d48ed404584711475dbf8611b91" }, "source": [ "### Quick completing and converting a numeric feature\n", "\n", "Nous pouvons maintenant compléter la variable Fare pour une valeur manquante unique dans l'ensemble de données de test en utilisant la valeur la plus fréquente. \n", "\n", "Notez que nous ne créons pas de nouvelle variable intermédiaire et que nous ne faisons pas d'analyse supplémentaire de corrélation pour deviner la fonctionnalité manquante, car nous ne remplaçons qu'une seule valeur. L'objectif d'achèvement permet d'atteindre l'exigence souhaitée pour que l'algorithme du modèle fonctionne sur des valeurs non nulles.\n" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "_cell_guid": "3600cb86-cf5f-d87b-1b33-638dc8db1564", "_uuid": "aacb62f3526072a84795a178bd59222378bab180" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdPclassSexAgeFareEmbarkedTitleIsAloneAge*Class
08923027.82922116
18933127.00000306
28942039.68752116
38953018.66250113
489631112.28750303
\n", "
" ], "text/plain": [ " PassengerId Pclass Sex Age Fare Embarked Title IsAlone Age*Class\n", "0 892 3 0 2 7.8292 2 1 1 6\n", "1 893 3 1 2 7.0000 0 3 0 6\n", "2 894 2 0 3 9.6875 2 1 1 6\n", "3 895 3 0 1 8.6625 0 1 1 3\n", "4 896 3 1 1 12.2875 0 3 0 3" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True)\n", "test_df.head()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "4b816bc7-d1fb-c02b-ed1d-ee34b819497d", "_uuid": "3466d98e83899d8b38a36ede794c68c5656f48e6" }, "source": [ "Nous pouvons créer une variable FareBand, regrouper les tarifs par interval" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "_cell_guid": "0e9018b1-ced5-9999-8ce1-258a0952cbf2", "_uuid": "b9a78f6b4c72520d4ad99d2c89c84c591216098d" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FareBandSurvived
0(-0.001, 7.91]0.197309
1(7.91, 14.454]0.303571
2(14.454, 31.0]0.454955
3(31.0, 512.329]0.581081
\n", "
" ], "text/plain": [ " FareBand Survived\n", "0 (-0.001, 7.91] 0.197309\n", "1 (7.91, 14.454] 0.303571\n", "2 (14.454, 31.0] 0.454955\n", "3 (31.0, 512.329] 0.581081" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)\n", "train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "d65901a5-3684-6869-e904-5f1a7cce8a6d", "_uuid": "89400fba71af02d09ff07adf399fb36ac4913db6" }, "source": [ "Convertissons Fare en fonction des valeurs de FareBand." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "_cell_guid": "385f217a-4e00-76dc-1570-1de4eec0c29c", "_uuid": "640f305061ec4221a45ba250f8d54bb391035a57" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassSexAgeFareEmbarkedTitleIsAloneAge*Class
0030100103
1111231302
2131110213
3111230302
4030210116
5030112113
6010330113
7030020400
8131110303
9121021300
\n", "
" ], "text/plain": [ " Survived Pclass Sex Age Fare Embarked Title IsAlone Age*Class\n", "0 0 3 0 1 0 0 1 0 3\n", "1 1 1 1 2 3 1 3 0 2\n", "2 1 3 1 1 1 0 2 1 3\n", "3 1 1 1 2 3 0 3 0 2\n", "4 0 3 0 2 1 0 1 1 6\n", "5 0 3 0 1 1 2 1 1 3\n", "6 0 1 0 3 3 0 1 1 3\n", "7 0 3 0 0 2 0 4 0 0\n", "8 1 3 1 1 1 0 3 0 3\n", "9 1 2 1 0 2 1 3 0 0" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for dataset in combine:\n", " dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0\n", " dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1\n", " dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2\n", " dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3\n", " dataset['Fare'] = dataset['Fare'].astype(int)\n", "\n", "train_df = train_df.drop(['FareBand'], axis=1)\n", "combine = [train_df, test_df]\n", " \n", "train_df.head(10)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "27272bb9-3c64-4f9a-4a3b-54f02e1c8289", "_uuid": "531994ed95a3002d1759ceb74d9396db706a41e2" }, "source": [ "Les données de test ressemble à ça finalement" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "_cell_guid": "d2334d33-4fe5-964d-beac-6aa620066e15", "_uuid": "8453cecad81fcc44de3f4e4e4c3ce6afa977740d" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdPclassSexAgeFareEmbarkedTitleIsAloneAge*Class
089230202116
189331200306
289420312116
389530110113
489631110303
589730010110
689831102213
789920120102
890031101313
990130120103
\n", "
" ], "text/plain": [ " PassengerId Pclass Sex Age Fare Embarked Title IsAlone Age*Class\n", "0 892 3 0 2 0 2 1 1 6\n", "1 893 3 1 2 0 0 3 0 6\n", "2 894 2 0 3 1 2 1 1 6\n", "3 895 3 0 1 1 0 1 1 3\n", "4 896 3 1 1 1 0 3 0 3\n", "5 897 3 0 0 1 0 1 1 0\n", "6 898 3 1 1 0 2 2 1 3\n", "7 899 2 0 1 2 0 1 0 2\n", "8 900 3 1 1 0 1 3 1 3\n", "9 901 3 0 1 2 0 1 0 3" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_df.head(10)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "69783c08-c8cc-a6ca-2a9a-5e75581c6d31", "_uuid": "a55f20dd6654610ff2d66c1bf3e4c6c73dcef9e5" }, "source": [ "## Model, predict and solve\n", "\n", "Nous sommes maintenant prêts à entrainer un modèle et à prédire la question de qui à la chance de survivre. Il existe plus de 60 algorithmes de modélisation prédictive parmi lesquels choisir. \n", "\n", "Nous devons comprendre le type de problème et la solution requise pour nous limiter à quelques modèles que nous pouvons évaluer. \n", "\n", "Le problème posé est un problème de classification et de régression. Nous voulons identifier la relation entre la sortie (Survécu ou non) avec d'autres variables ou caractéristiques (Sexe, Age, Port...). \n", "\n", "Nous sommes également en train de mettre au point une catégorie d'apprentissage machine appelée apprentissage supervisé, car nous formons notre modèle avec un ensemble de données donné. Avec ces deux critères - l'apprentissage supervisé plus la classification et la régression, nous pouvons réduire notre choix de modèles à quelques uns. 
These include:\n", "\n", "- Logistic Regression\n", "- KNN or k-Nearest Neighbors\n", "- Support Vector Machines\n", "- Naive Bayes classifier\n", "- Decision Tree\n", "- Random Forest\n", "- Perceptron\n", "- Artificial neural network\n", "- RVM or Relevance Vector Machine" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# machine learning\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.svm import SVC, LinearSVC\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.linear_model import Perceptron\n", "from sklearn.linear_model import SGDClassifier\n", "from sklearn.tree import DecisionTreeClassifier" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "_cell_guid": "0acf54f9-6cf5-24b5-72d9-29b30052823a", "_uuid": "04d2235855f40cffd81f76b977a500fceaae87ad" }, "outputs": [ { "data": { "text/plain": [ "((891, 8), (891,), (418, 8), (418,))" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Training and test data\n", "X_train = train_df.drop(\"Survived\", axis=1)\n", "Y_train = train_df[\"Survived\"]\n", "X_test = test_df.drop(\"PassengerId\", axis=1).copy()\n", "# The test set has no Survived labels: Y_test only holds the passenger IDs\n", "Y_test = test_df[\"PassengerId\"]\n", "X_train.shape, Y_train.shape, X_test.shape, Y_test.shape" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "579bc004-926a-bcfe-e9bb-c8df83356876", "_uuid": "782903c09ec9ee4b6f3e03f7c8b5a62c00461deb" }, "source": [ "Logistic regression is a useful model to run early in the workflow. It measures the relationship between the categorical dependent variable (the target) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. 
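
As a minimal sketch of the logistic function mentioned above (not part of the original notebook), the snippet below maps a few log-odds values to probabilities. The `logistic` helper and the sample `log_odds` values are assumptions chosen for illustration only.

```python
import numpy as np


def logistic(z):
    # Map a log-odds value z to a probability in the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical log-odds values, chosen only for illustration
log_odds = np.array([-2.0, 0.0, 2.0])
probabilities = logistic(log_odds)
print(probabilities)
```

A log-odds of 0 corresponds to a probability of 0.5; positive values push the probability towards 1 and negative values towards 0.
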
Reference: [Wikipedia](https://en.wikipedia.org/wiki/Logistic_regression).\n", "\n", "Note the confidence score generated by the model on our training dataset (this is the accuracy on the training data, not on unseen data)." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "_cell_guid": "0edd9322-db0b-9c37-172d-a3a4f8dec229", "_uuid": "a649b9c53f4c7b40694f60f5c8dc14ec5ef519ec" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/rodrique/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n", " FutureWarning)\n" ] }, { "data": { "text/plain": [ "80.36" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Logistic Regression\n", "\n", "logreg = LogisticRegression()\n", "\n", "logreg.fit(X_train, Y_train)\n", "\n", "Y_pred = logreg.predict(X_test)\n", "\n", "acc_log = round(logreg.score(X_train, Y_train) * 100, 2)\n", "acc_log" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "# Illustration of a 60/40 train/test split on a small list\n", "values = [1, 2, 3, 4, 5, 6, 7]\n", "\n", "n_train = round(len(values) * 0.6)  # about 60% of the list\n", "\n", "train = random.sample(values, n_train)  # values drawn at random for training\n", "\n", "test = [v for v in values if v not in train]  # the remaining ~40%\n", "\n", "train, test" ] }, { "cell_type": "code", "execution_count": 245, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization terminated successfully.\n", " Current function value: 0.441994\n", " Iterations 8\n", " Results: Logit\n", "=================================================================\n", "Model: Logit Pseudo R-squared: 0.336 \n", "Dependent Variable: Survived AIC: 803.6334 \n", "Date: 2020-12-17 13:17 BIC: 841.9721 \n", "No. 
Observations: 891 Log-Likelihood: -393.82 \n", "Df Model: 7 LL-Null: -593.33 \n", "Df Residuals: 883 LLR p-value: 3.8665e-82\n", "Converged: 1.0000 Scale: 1.0000 \n", "No. Iterations: 8.0000 \n", "------------------------------------------------------------------\n", " Coef. Std.Err. z P>|z| [0.025 0.975]\n", "------------------------------------------------------------------\n", "Pclass -0.7583 0.1107 -6.8531 0.0000 -0.9752 -0.5415\n", "Sex 2.2939 0.2039 11.2516 0.0000 1.8943 2.6935\n", "Age 0.2949 0.1045 2.8235 0.0047 0.0902 0.4997\n", "Fare -0.0832 0.0768 -1.0834 0.2786 -0.2337 0.0673\n", "Embarked 0.2640 0.1380 1.9137 0.0557 -0.0064 0.5345\n", "Title 0.3938 0.0904 4.3552 0.0000 0.2166 0.5710\n", "IsAlone 0.1588 0.1727 0.9197 0.3577 -0.1796 0.4972\n", "Age*Class -0.3186 0.1016 -3.1359 0.0017 -0.5177 -0.1195\n", "=================================================================\n", "\n" ] } ], "source": [ "import statsmodels.api as sm\n", "\n", "# Fit the same logistic model with statsmodels to get a coefficient summary\n", "logit_model = sm.Logit(Y_train, X_train)\n", "result = logit_model.fit()\n", "print(result.summary2())" ] }, { "cell_type": "code", "execution_count": 240, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The best parameters for log classifier: {'C': 0.1, 'max_iter': 100, 'solver': 'lbfgs'}\n" ] } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "# Define the hyperparameters and the values to try\n", "param_grid_logreg = [{'C': [0.1, 1, 10], 'solver': ['lbfgs', 'liblinear'],\n", "                      'max_iter': [100, 300]}]\n", "\n", "# Set up the grid search with 5-fold cross-validation\n", "grid_logreg = GridSearchCV(logreg, param_grid=param_grid_logreg, cv=5)\n", "\n", "# Fit the model with grid search\n", "grid_logreg.fit(X_train, Y_train)\n", "print('The best parameters for log classifier: ', grid_logreg.best_params_)" ] }, { "cell_type": "code", "execution_count": 250, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "80.81" ] }, "execution_count": 250, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Logistic Regression with the tuned hyperparameters\n", "\n", "logreg = 
LogisticRegression(C=0.1, max_iter=100, solver='lbfgs')\n", "logreg.fit(X_train, Y_train)\n", "Y_pred = logreg.predict(X_test)\n", "acc_log = round(logreg.score(X_train, Y_train) * 100, 2)\n", "acc_log" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "3af439ae-1f04-9236-cdc2-ec8170a0d4ee", "_uuid": "180e27c96c821656a84889f73986c6ddfff51ed3" }, "source": [ "We can use logistic regression to validate our assumptions and decisions about feature creation and completion. This can be done by looking at the coefficients of the features in the decision function.\n", "\n", "Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).\n", "\n", "- Sex has the highest positive coefficient, which means that as the Sex value increases (male: 0 to female: 1), the probability of Survived=1 increases the most.\n", "- Conversely, as Pclass increases, the probability of Survived=1 decreases the most.\n", "- In this sense Age*Class is a good artificial feature to model, since it has the second largest negative coefficient.\n", "- The same goes for Title, which has the second largest positive coefficient." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "_cell_guid": "e545d5aa-4767-7a41-5799-a4c5e529ce72", "_uuid": "6e6f58053fae405fc93d312fc999f3904e708dbe" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Feature Correlation\n", "1 Sex 2.201527\n", "5 Title 0.398234\n", "2 Age 0.287163\n", "4 Embarked 0.261762\n", "6 IsAlone 0.129140\n", "3 Fare -0.085150\n", "7 Age*Class -0.311200\n", "0 Pclass -0.749007" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coeff_df = pd.DataFrame(train_df.columns.delete(0))\n", "coeff_df.columns = ['Feature']\n", "coeff_df[\"Correlation\"] = pd.Series(logreg.coef_[0])\n", "\n", "coeff_df.sort_values(by='Correlation', ascending=False)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "ac041064-1693-8584-156b-66674117e4d0", "_uuid": "ccba9ac0a9c3c648ef9bc778977ab99066ab3945" }, "source": [ "Next, we model using Support Vector Machines (SVMs), which are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. \n", "\n", "Given a set of training samples, each marked as belonging to one or the other of **two categories**, an SVM training algorithm builds a model that assigns new test samples to one category or the other, making it a non-probabilistic binary linear classifier. Reference: [Wikipedia](https://en.wikipedia.org/wiki/Support_vector_machine).\n", "\n", "Note that the model's training accuracy is higher than that of the logistic regression model." ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "_cell_guid": "7a63bf04-a410-9c81-5310-bdef7963298f", "_uuid": "60039d5377da49f1aa9ac4a924331328bd69add1" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/rodrique/anaconda3/lib/python3.7/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. 
Set gamma explicitly to 'auto' or 'scale' to avoid this warning.\n", " \"avoid this warning.\", FutureWarning)\n" ] }, { "data": { "text/plain": [ "83.84" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Support Vector Machines\n", "\n", "svc = SVC()\n", "svc.fit(X_train, Y_train)\n", "Y_pred = svc.predict(X_test)\n", "acc_svc = round(svc.score(X_train, Y_train) * 100, 2)\n", "acc_svc" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The best parameters for svm classifier: {'C': 100, 'gamma': 'scale', 'kernel': 'rbf'}\n" ] } ], "source": [ "#----------------------------------------------------------------SVM classifier\n", "#define hyper parameters and ranges\n", "param_grid_svc = [{'C': [100, 50, 10, 1.0, 0.1, 0.01], 'gamma': ['scale'], \n", " 'kernel': ['poly', 'rbf', 'sigmoid'] }]\n", "#apply gridsearch\n", "grid_svc = GridSearchCV(svc, param_grid=param_grid_svc, cv=5)\n", "#fit model with grid search\n", "grid_svc.fit(X_train, Y_train)\n", "print('The best parameters for svm classifier: ', grid_svc.best_params_)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "81.03" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Support Vector Machines with the grid-search parameters\n", "\n", "svc = SVC(C = 100, gamma= 'scale', kernel ='rbf')\n", "svc.fit(X_train, Y_train)\n", "Y_pred = svc.predict(X_test)\n", "acc_svc = round(svc.score(X_train, Y_train) * 100, 2)\n", "acc_svc" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "172a6286-d495-5ac4-1a9c-5b77b74ca6d2", "_uuid": "bb3ed027c45664148b61e3aa5e2ca8111aac8793" }, "source": [ "In **pattern recognition**, the k-Nearest Neighbors algorithm (k-NN for short) is a non-parametric method used for classification and regression. 
\n", "\n", "A sample is classified by a majority vote of its neighbors: it is assigned to the class most common among its k nearest neighbors, as measured by a distance or similarity metric (k is a positive integer, typically small). If k = 1, the sample is simply assigned to the class of its single nearest neighbor. Reference: [Wikipedia](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).\n", "\n", "In the runs recorded in this notebook, the KNN training accuracy is higher than that of logistic regression and at least as high as that of the SVM." ] }, { "cell_type": "code", "execution_count": 257, "metadata": { "_cell_guid": "ca14ae53-f05e-eb73-201c-064d7c3ed610", "_uuid": "54d86cd45703d459d452f89572771deaa8877999" }, "outputs": [ { "data": { "text/plain": [ "83.95" ] }, "execution_count": 257, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn = KNeighborsClassifier()#(n_neighbors = 3)\n", "knn.fit(X_train, Y_train)\n", "Y_pred = knn.predict(X_test)\n", "acc_knn = round(knn.score(X_train, Y_train) * 100, 2)\n", "acc_knn" ] }, { "cell_type": "code", "execution_count": 258, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The best parameters for knn classifier: {'metric': 'manhattan', 'n_neighbors': 6, 'weights': 'uniform'}\n" ] } ], "source": [ "#----------------------------------------------------------------kNN classifier\n", "#define hyper parameters and ranges\n", "param_grid_knn = [{'n_neighbors': [2, 3, 4, 6, 8, 10], 'weights': [ 'uniform', 'distance'], \n", " 'metric': ['euclidean', 'manhattan', 'minkowski']}]\n", "#apply gridsearch\n", "grid_knn = GridSearchCV(knn, param_grid=param_grid_knn, cv=5)\n", "#fit model with grid search\n", "grid_knn.fit(X_train, Y_train)\n", "print('The best parameters for knn classifier: ', grid_knn.best_params_)" ] }, { "cell_type": "code", "execution_count": 262, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "83.84" ] }, "execution_count": 262, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "knn = KNeighborsClassifier(metric= 'manhattan', n_neighbors= 6, weights= 'uniform')\n", "knn.fit(X_train, Y_train)\n", "Y_pred = knn.predict(X_test)\n", "acc_knn = round(knn.score(X_train, Y_train) * 100, 2)\n", "acc_knn" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "810f723d-2313-8dfd-e3e2-26673b9caa90", "_uuid": "1535f18113f851e480cd53e0c612dc05835690f3" }, "source": [ "In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. \n", "\n", "Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) of a learning problem. Reference: [Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_classifier).\n", "\n", "The model's training accuracy is the lowest among the models evaluated so far." ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "_cell_guid": "50378071-7043-ed8d-a782-70c947520dae", "_uuid": "723c835c29e8727bc9bad4b564731f2ca98025d0" }, "outputs": [ { "data": { "text/plain": [ "72.28" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Gaussian Naive Bayes\n", "\n", "gaussian = GaussianNB()\n", "gaussian.fit(X_train, Y_train)\n", "Y_pred = gaussian.predict(X_test)\n", "acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)\n", "acc_gaussian" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "1e286e19-b714-385a-fcfa-8cf5ec19956a", "_uuid": "df148bf93e11c9ec2c97162d5c0c0605b75d9334" }, "source": [ "The perceptron is a supervised learning algorithm for binary classifiers (functions that can decide whether an input, represented by a vector of numbers, belongs to a specific class or not). 
\n", "\n", "It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes the elements of the training set one at a time. Reference: [Wikipedia](https://en.wikipedia.org/wiki/Perceptron)." ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "_cell_guid": "ccc22a86-b7cb-c2dd-74bd-53b218d6ed0d", "_uuid": "c19d08949f9c3a26931e28adedc848b4deaa8ab6" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/rodrique/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/stochastic_gradient.py:166: FutureWarning: max_iter and tol parameters have been added in Perceptron in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.\n", " FutureWarning)\n" ] }, { "data": { "text/plain": [ "78.0" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Perceptron\n", "\n", "perceptron = Perceptron()\n", "perceptron.fit(X_train, Y_train)\n", "Y_pred = perceptron.predict(X_test)\n", "acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)\n", "acc_perceptron" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "_cell_guid": "a4d56857-9432-55bb-14c0-52ebeb64d198", "_uuid": "52ea4f44dd626448dd2199cb284b592670b1394b" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/rodrique/anaconda3/lib/python3.7/site-packages/sklearn/svm/base.py:931: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.\n", " \"the number of iterations.\", ConvergenceWarning)\n" ] }, { "data": { "text/plain": [ "79.12" ] }, "execution_count": 47, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "# Linear SVC\n", "\n", "linear_svc = LinearSVC()\n", "linear_svc.fit(X_train, Y_train)\n", "Y_pred = linear_svc.predict(X_test)\n", "acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)\n", "acc_linear_svc" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "_cell_guid": "dc98ed72-3aeb-861f-804d-b6e3d178bf4b", "_uuid": "3a016c1f24da59c85648204302d61ea15920e740" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/rodrique/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/stochastic_gradient.py:166: FutureWarning: max_iter and tol parameters have been added in SGDClassifier in 0.19. If both are left unset, they default to max_iter=5 and tol=None. If tol is not None, max_iter defaults to max_iter=1000. From 0.21, default max_iter will be 1000, and default tol will be 1e-3.\n", " FutureWarning)\n" ] }, { "data": { "text/plain": [ "79.24" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Stochastic Gradient Descent\n", "\n", "sgd = SGDClassifier()\n", "sgd.fit(X_train, Y_train)\n", "Y_pred = sgd.predict(X_test)\n", "acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)\n", "acc_sgd" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "bae7f8d7-9da0-f4fd-bdb1-d97e719a18d7", "_uuid": "1c70e99920ae34adce03aaef38d61e2b83ff6a9c" }, "source": [ "This model uses a decision tree as a predictive model that maps features (tree branches) to conclusions about the target value (tree leaves). \n", "\n", "Tree models in which the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. 
\n", "\n", "Decision trees in which the target variable can take continuous values (typically real numbers) are called regression trees. Reference: [Wikipedia](https://en.wikipedia.org/wiki/Decision_tree_learning).\n", "\n", "The model's training accuracy is the highest among the models evaluated so far." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "_cell_guid": "dd85f2b7-ace2-0306-b4ec-79c68cd3fea0", "_uuid": "1f94308b23b934123c03067e84027b507b989e52" }, "outputs": [ { "data": { "text/plain": [ "86.76" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Decision Tree\n", "\n", "decision_tree = DecisionTreeClassifier()\n", "decision_tree.fit(X_train, Y_train)\n", "Y_pred = decision_tree.predict(X_test)\n", "acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)\n", "acc_decision_tree" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "85693668-0cd5-4319-7768-eddb62d2b7d0", "_uuid": "24f4e46f202a858076be91752170cad52aa9aefa" }, "source": [ "The next model, Random Forests, is one of the most popular. Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. They work by constructing a multitude of decision trees (n_estimators=100) at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Reference: [Wikipedia](https://en.wikipedia.org/wiki/Random_forest).\n", "\n", "The model's training accuracy is among the highest of the models evaluated so far. 
" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "_cell_guid": "f0694a8e-b618-8ed9-6f0d-8c6fba2c4567", "_uuid": "483c647d2759a2703d20785a44f51b6dee47d0db" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/rodrique/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n", " \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n" ] }, { "data": { "text/plain": [ "86.31" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Random Forest\n", "\n", "random_forest = RandomForestClassifier()#(n_estimators=100)\n", "random_forest.fit(X_train, Y_train)\n", "Y_pred = random_forest.predict(X_test)\n", "random_forest.score(X_train, Y_train)\n", "acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)\n", "acc_random_forest" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The best parameters for rtree classifier: {'criterion': 'gini', 'max_depth': 5, 'n_estimators': 300}\n" ] } ], "source": [ "#--------------------------------------------------------------random forest classifier\n", "#define hyper parameters and ranges\n", "param_grid_random_forest = [{'max_depth': [5, 10, 15, 20], 'n_estimators':[100,300,500] ,\n", " 'criterion': ['gini', 'entropy']}]\n", "#apply gridsearch\n", "grid_rtree = GridSearchCV(random_forest, param_grid=param_grid_random_forest, cv=5)\n", "#fit model with grid search\n", "grid_rtree.fit(X_train, Y_train)\n", "print('The best parameters for rtree classifier: ', grid_rtree.best_params_)" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "84.06" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Random Forest with GS parameter\n", "\n", 
"random_forest = RandomForestClassifier(criterion = 'gini', max_depth = 5, n_estimators = 300)\n", "random_forest.fit(X_train, Y_train)\n", "Y_pred = random_forest.predict(X_test)\n", "random_forest.score(X_train, Y_train)\n", "acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)\n", "acc_random_forest" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "f6c9eef8-83dd-581c-2d8e-ce932fe3a44d", "_uuid": "2c1428d022430ea594af983a433757e11b47c50c" }, "source": [ "### Model evaluation\n", "\n", "We can now rank our evaluation of all the models to choose the one best suited to our problem. Note that every score here is accuracy measured on the training set (`score(X_train, Y_train)`), so it reflects fit to the training data rather than performance on unseen passengers. " ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "_cell_guid": "1f3cebe0-31af-70b2-1ce4-0fd406bcdfc6", "_uuid": "06a52babe50e0dd837b553c78fc73872168e1c7d" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Model Score\n", "3 Random Forest 86.76\n", "8 Decision Tree 86.76\n", "1 KNN 84.74\n", "0 Support Vector Machines 83.84\n", "2 Logistic Regression 80.36\n", "6 Stochastic Gradient Descent 79.24\n", "7 Linear SVC 79.12\n", "5 Perceptron 78.00\n", "4 Naive Bayes 72.28" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "models = pd.DataFrame({\n", " 'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', \n", " 'Random Forest', 'Naive Bayes', 'Perceptron', \n", " 'Stochastic Gradient Descent', 'Linear SVC', \n", " 'Decision Tree'],\n", " 'Score': [acc_svc, acc_knn, acc_log, \n", " acc_random_forest, acc_gaussian, acc_perceptron, \n", " acc_sgd, acc_linear_svc, acc_decision_tree]})\n", "models.sort_values(by='Score', ascending=False)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "aeec9210-f9d8-cd7c-c4cf-a87376d5f693", "_uuid": "cdae56d6adbfb15ff9c491c645ae46e2c91d75ce" }, "source": [ "## References\n", "\n", "- [Model evaluation metrics 1](https://towardsdatascience.com/whats-the-deal-with-accuracy-precision-recall-and-f1-f5d8b4db1021)\n", "- [Model evaluation metrics 2](https://medium.com/@shrutisaxena0617/precision-vs-recall-386cf9f89488)\n", "- https://towardsdatascience.com/practical-machine-learning-tutorial-part-2-build-model-validate-c98c2ddad744\n", "- https://www.kaggle.com/omarelgabry/titanic/a-journey-through-titanic\n", "- https://www.kaggle.com/c/titanic/details/getting-started-with-random-forests\n", "- https://www.kaggle.com/sinakhorami/titanic/titanic-best-working-classifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "_change_revision": 0, "_is_fork": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", 
"name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 1 }