What I Learned During My Data Science Internship at Elite Tech Intern

 DATA SCIENCE INTERN

Introduction

Hi! I’m Saravanan K, a third-year B.Sc. Computer Science (Data Science and Analytics) student at Subbalakshmi Lakshmipathy College of Science. I’m excited to share that I’ve been selected as an intern at Elite Tech Intern, where I’ll be gaining hands-on experience and deepening my skills in the tech field. This blog will serve as a journal of my internship journey, highlighting what I learn.

Email : saravanank212006@gmail.com

LinkedIn : https://www.linkedin.com/in/saravanan-k-dsa/

Contact : +91 9344731828

Projects I Did

Task 1:

Create a pipeline for data preprocessing, transformation, and loading using tools like pandas and scikit-learn.

🔧 Key Features:

  • Data Extraction: Robust file handling with dynamic file selection

  • Preprocessing: Handling missing values, encoding categorical variables, and scaling numerical data

  • Transformation Pipelines: Powered by ColumnTransformer for clean, modular preprocessing

  • Loading: Final cleaned dataset exported and ready for modeling or analysis.


How it works:
ETL = Extract → Transform → Load.

1. Extract

raw_data = extract_data(INPUT_FILE)

  • Checks if the file exists; if not, it raises an error.
  • Loads the CSV file using pandas.read_csv.
  • After reading, it prints the shape (rows x columns).
  • Returns a DataFrame containing the raw data.
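The extract step above can be sketched as follows. This is a minimal reconstruction, not the project's exact code; the function name `extract_data` comes from the post, while the implementation details are assumptions:

```python
import os
import pandas as pd

def extract_data(input_file):
    """Load a CSV into a DataFrame, failing fast if the file is missing."""
    # Check the file exists before reading; raise a clear error otherwise.
    if not os.path.exists(input_file):
        raise FileNotFoundError(f"Input file not found: {input_file}")
    df = pd.read_csv(input_file)
    # Report the shape (rows x columns) so the run is easy to audit.
    print(f"Data extracted: {df.shape[0]} rows, {df.shape[1]} columns.")
    return df
```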
2. Transform

processed_data = transform_data(raw_data)

  • Splits the data into features (X) and target (y).
  • Identifies numerical and categorical columns.

Pipelines:

  • Numerical pipeline:

    • Fills missing values with the column mean.

    • Standardizes (scales) data with StandardScaler.

  • Categorical pipeline:

    • Fills missing values with the most common value.

    • One-hot encodes categorical variables.

  • ColumnTransformer: Applies the right pipeline to the right columns.
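The two pipelines and the ColumnTransformer described above can be sketched like this. It is a simplified version under stated assumptions: the target column name (`target_column`) and the dtype-based column detection are mine, not necessarily the project's:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def transform_data(raw_data, target_column="target"):
    # Split features from the target column.
    X = raw_data.drop(columns=[target_column])
    y = raw_data[target_column]

    # Identify numerical and categorical columns by dtype.
    num_cols = X.select_dtypes(include="number").columns
    cat_cols = X.select_dtypes(exclude="number").columns

    # Numerical pipeline: mean-impute missing values, then standardize.
    num_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ])
    # Categorical pipeline: fill with the most frequent value, then one-hot encode.
    cat_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])

    # ColumnTransformer applies the right pipeline to the right columns.
    preprocessor = ColumnTransformer([
        ("num", num_pipe, num_cols),
        ("cat", cat_pipe, cat_cols),
    ])
    features = preprocessor.fit_transform(X)
    # One-hot encoding may return a sparse matrix; densify for a DataFrame.
    if hasattr(features, "toarray"):
        features = features.toarray()
    processed = pd.DataFrame(features)
    processed[target_column] = y.values
    print(f"Data transformed: {processed.shape[0]} rows, {processed.shape[1]} columns.")
    return processed
```

One-hot encoding is what grows the dataset from 21 to 55 columns in the run shown below: each categorical column expands into one column per category.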
3. Load 

load_data(processed_data, OUTPUT_FILE)
  • Ensures the folder for saving exists.
  • Saves the final DataFrame as a new CSV file.
  • Prints the save location.
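The load step can be sketched as below; again a minimal reconstruction, with only the function name `load_data` taken from the post:

```python
import os
import pandas as pd

def load_data(processed_data, output_file):
    """Write the processed DataFrame to CSV, creating the folder if needed."""
    # Ensure the destination folder exists before saving.
    os.makedirs(os.path.dirname(output_file) or ".", exist_ok=True)
    processed_data.to_csv(output_file, index=False)
    # Print the save location for confirmation.
    print(f"Data saved to: {output_file}")
```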

Data extracted: 25128 rows, 21 columns.
Data transformed: 25128 rows, 55 columns.

Task 2:

Implement a deep learning model for image classification using TensorFlow.

Task 3:

Develop a full data science project, from data collection and preprocessing to model deployment using Flask or FastAPI.


Project Title: 🩺 Diabetes Prediction Web API using FastAPI

This project is a complete machine learning application that predicts whether a person is likely to have diabetes based on medical input data. It is built with Python, trained using scikit-learn, and deployed using the FastAPI web framework.

🔧 Features:

  • Cleaned and preprocessed the Pima Indian Diabetes dataset

  • Replaced missing values and standardized the input features

  • Trained a Random Forest Classifier for accurate prediction

  • Saved the trained model, imputer, and scaler using joblib

  • Created a FastAPI backend to expose the model as a REST API

  • Includes Swagger UI for easy testing and interaction
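The training side of the project (impute, scale, fit a Random Forest, persist with joblib) might look roughly like this. The artifact file names and hyperparameters are assumptions, not the project's actual values:

```python
import os

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def train_and_save(X, y, model_dir="artifacts"):
    """Train a Random Forest on imputed, scaled data and persist the artifacts."""
    os.makedirs(model_dir, exist_ok=True)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Impute missing values, then standardize the input features.
    imputer = SimpleImputer(strategy="mean")
    scaler = StandardScaler()
    X_train = scaler.fit_transform(imputer.fit_transform(X_train))
    X_test = scaler.transform(imputer.transform(X_test))

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)

    # Save model, imputer, and scaler so the API applies identical preprocessing.
    joblib.dump(model, os.path.join(model_dir, "model.joblib"))
    joblib.dump(imputer, os.path.join(model_dir, "imputer.joblib"))
    joblib.dump(scaler, os.path.join(model_dir, "scaler.joblib"))
    return accuracy
```

Saving the imputer and scaler alongside the model matters: the API must transform incoming requests with the exact statistics learned at training time.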

About Pima Indian Diabetes dataset:
  • Pregnancies: Number of times the patient has been pregnant.
  • Glucose: Plasma glucose concentration after a 2-hour oral glucose tolerance test.
  • BloodPressure: Diastolic blood pressure (mm Hg).
  • SkinThickness: Thickness of the triceps skin fold (mm), used to estimate body fat.
  • Insulin: 2-Hour serum insulin (mu U/ml), an indicator of insulin levels in the body.
  • BMI: Body Mass Index, calculated as weight in kg/(height in m)^2.
  • DiabetesPedigreeFunction: A score based on family history and genetics to quantify diabetes risk.
  • Age: Age of the patient (years).
  • Outcome: Target variable (1 = diabetic, 0 = non-diabetic).

Task 4:

Problem Definition: Defining a business problem, such as maximizing profit or minimizing cost.





