What I Learned During My Data Science Internship at Elite Tech Intern
DATA SCIENCE INTERN
Introduction
Hi! I’m Saravanan K, a third-year B.Sc. Computer Science (Data Science and Analytics) student at Subbalakshmi Lakshmipathy College of Science. I’m excited to share that I’ve been selected as an intern at Elite Tech Intern, where I’ll be gaining hands-on experience and deepening my skills in the tech field. This blog will serve as a journal of my internship journey, highlighting what I learn.
Email : saravanank212006@gmail.com
LinkedIn : https://www.linkedin.com/in/saravanan-k-dsa/
Contact : +91 9344731828
Projects I Did
Task 1: Create a pipeline for data preprocessing, transformation, and loading using tools like pandas and scikit-learn.
🔧 Key Features:
- Data Extraction: Robust file handling with dynamic file selection
- Preprocessing: Handling missing values, encoding categorical variables, and scaling numerical data
- Transformation Pipelines: Powered by ColumnTransformer for clean, modular preprocessing
- Loading: Final cleaned dataset exported and ready for modeling or analysis
1. Extract
- Loads the CSV file using pandas' read_csv.
- Checks if the file exists; if not, it raises an error.
- After reading, it prints the shape (rows x columns).
- Returns a DataFrame containing the raw data.
2. Transform
- Splits the data into features (X) and target (y).
- Identifies the numerical and categorical columns.
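The extract and split steps above can be sketched roughly as follows. This is a minimal sketch, not the exact project code: the function names `extract_data` and `split_features` and the `target_column` parameter are hypothetical, and I am assuming the column types are detected from the pandas dtypes.

```python
from pathlib import Path

import pandas as pd


def extract_data(file_path):
    """Read a CSV file into a DataFrame, failing early if the file is missing."""
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"Input file not found: {path}")
    df = pd.read_csv(path)
    print(f"Data extracted: {df.shape[0]} rows, {df.shape[1]} columns.")
    return df


def split_features(df, target_column):
    """Separate features from the target and group feature columns by dtype.

    target_column is a hypothetical parameter naming the label column.
    """
    X = df.drop(columns=[target_column])
    y = df[target_column]
    numerical_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = X.select_dtypes(exclude="number").columns.tolist()
    return X, y, numerical_cols, categorical_cols
```

Checking for the file before reading keeps the error message clear instead of letting pandas fail deeper in the call.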
Pipelines:
- Numerical pipeline:
  - Fills missing values with the column mean.
  - Standardizes (scales) data with StandardScaler.
- Categorical pipeline:
  - Fills missing values with the most common value.
  - One-hot encodes categorical variables.
- ColumnTransformer: Applies the right pipeline to the right columns.
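Here is a small sketch of how the two pipelines and the ColumnTransformer fit together in scikit-learn. The helper name `build_preprocessor`, the step names, and the toy columns `age` and `city` are only for illustration; `handle_unknown="ignore"` is my assumption, not something the task required.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numerical_cols, categorical_cols):
    """Combine a numerical and a categorical pipeline in one ColumnTransformer."""
    numerical_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),  # fill with column mean
        ("scaler", StandardScaler()),                 # zero mean, unit variance
    ])
    categorical_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),  # fill with mode
        ("encoder", OneHotEncoder(handle_unknown="ignore")),   # one column per category
    ])
    return ColumnTransformer([
        ("num", numerical_pipeline, numerical_cols),
        ("cat", categorical_pipeline, categorical_cols),
    ])


# Toy data (made up) with one missing value in each column
X = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Madurai", "Chennai", "Madurai", np.nan],
})
preprocessor = build_preprocessor(["age"], ["city"])
transformed = preprocessor.fit_transform(X)
print(transformed.shape)
```

One-hot encoding is why the column count grows after the transform: every distinct category becomes its own column.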
3. Load
load_data(processed_data, OUTPUT_FILE)
- Ensures the folder for saving exists.
- Saves the final DataFrame as a new CSV file.
- Prints the save location.
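A minimal sketch of what a `load_data` function along these lines could look like. The exact body in my project may differ; dropping the index with `index=False`, the message text, and returning the path are assumptions made for this example.

```python
from pathlib import Path

import pandas as pd


def load_data(processed_data, output_file):
    """Save the processed DataFrame as a CSV, creating the folder if needed."""
    output_path = Path(output_file)
    # Create the parent folder (and any missing parents) before writing
    output_path.parent.mkdir(parents=True, exist_ok=True)
    processed_data.to_csv(output_path, index=False)
    print(f"Processed data saved to: {output_path}")
    return output_path
```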
Data extracted: 25128 rows, 21 columns.
Data transformed: 25128 rows, 55 columns.
Task 2: Implement a deep learning model for image classification using TensorFlow.
Task 3: Develop a full data science project, from data collection and preprocessing to model deployment, using Flask or FastAPI.
Project Title: 🩺 Diabetes Prediction Web API using FastAPI
This project is a complete machine learning application that predicts whether a person is likely to have diabetes based on medical input data. It is built with Python, trained using scikit-learn, and deployed using the FastAPI web framework.
🔧 Features:
- Cleaned and preprocessed the Pima Indian Diabetes dataset
- Replaced missing values and standardized the input features
- Trained a Random Forest Classifier for accurate prediction
- Saved the trained model, imputer, and scaler using joblib
- Created a FastAPI backend to expose the model as a REST API
- Includes Swagger UI for easy testing and interaction
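The training and saving steps can be sketched like this. To keep the example self-contained it uses synthetic random data in place of the real dataset, so the accuracy it prints means nothing. The median imputation strategy, the `artifacts` folder, and the file names are assumptions for illustration.

```python
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eight feature columns (NOT the real dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))
y = (X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # knock out some values to impute

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the imputer and scaler on the training split only, then reuse them
imputer = SimpleImputer(strategy="median")  # strategy is an assumption
scaler = StandardScaler()
X_train_prepared = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_prepared = scaler.transform(imputer.transform(X_test))

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_prepared, y_train)
print(f"Test accuracy: {model.score(X_test_prepared, y_test):.2f}")

# Persist all three objects so the API can repeat the same preprocessing
artifact_dir = Path("artifacts")  # hypothetical folder name
artifact_dir.mkdir(exist_ok=True)
joblib.dump(imputer, artifact_dir / "imputer.joblib")
joblib.dump(scaler, artifact_dir / "scaler.joblib")
joblib.dump(model, artifact_dir / "model.joblib")
```

Saving the imputer and scaler next to the model matters because new inputs must go through exactly the same preprocessing as the training data.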
About the Pima Indian Diabetes dataset:
- Pregnancies: Number of times the patient has been pregnant.
- Glucose: Plasma glucose concentration after a 2-hour oral glucose tolerance test.
- BloodPressure: Diastolic blood pressure (mm Hg).
- SkinThickness: Thickness of the triceps skin fold (mm), used to estimate body fat.
- Insulin: 2-Hour serum insulin (mu U/ml), an indicator of insulin levels in the body.
- BMI: Body Mass Index, calculated as weight in kg/(height in m)^2.
- DiabetesPedigreeFunction: A score based on family history and genetics to quantify diabetes risk.
- Age: Age of the patient (years).
Task 4:
Problem Definition: Defining a business problem, such as maximizing profit or minimizing cost.