Student Performance Dataset

In this post, we will explore the student performance dataset available on Kaggle. Taking part in the data competition improved my confidence in my ability to use the acquired knowledge in practical applications. Interestingly, the highest exam score was achieved by an undergraduate student. Also, we will use Pandas as a tool for manipulating dataframes. First, we create a dataframe with only numeric columns (df_num). Figure 3 presents student scores for classification and regression questions. In both courses this accounted for 10% of the final mark. We also want to sort the list in descending order. They may not be familiar with sophisticated data science principles, but it is convenient for them to look at graphs and charts. To do this, we select the column sex, then call the value_counts() method with the normalize parameter set to True. Researchers from the University of Southern Queensland and UNSW Sydney looked at the association between internet use other than for schoolwork and electronic gaming, and NAPLAN performance. Generally, the results support the conclusion that the competition improved performance. Each scatter plot shows the interrelation between two of the specified columns. 5 Summary of responses to survey of Kaggle competition participants. We specify that we want to take only float64 and int64 data types, but for this dataset it is enough to take only integer columns (there are no float values). But this is beyond the scope of our tutorial. Predicting students' performance has become an urgent need in most educational institutions. Performance scores that are pretty close to each other should be given the same rank, reflecting that there may not be a discernible difference between them. Two datasets were compiled for the Kaggle challenges: Melbourne property auction prices and spam classification. (2) Academic background features such as educational stage, grade level and section.
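The two Pandas steps described above (building df_num from the numeric columns and computing the share of each value in the sex column) can be sketched as follows. This is a minimal illustration on a made-up four-row dataframe, not the real dataset; the column values are assumptions chosen only to keep the example small.

```python
import pandas as pd

# Toy stand-in for the student performance data; all values are made up.
df = pd.DataFrame({
    "sex": ["F", "M", "F", "F"],
    "age": [18, 17, 15, 16],
    "absences": [6, 4, 10, 2],
    "final_target": [6, 15, 10, 14],
})

# Keep only numeric columns (float64 and int64 data types).
df_num = df.select_dtypes(include=["float64", "int64"])

# Share of each value in the sex column; normalize=True returns
# fractions instead of raw counts.
sex_share = df["sex"].value_counts(normalize=True)

print(list(df_num.columns))
print(sex_share)
```

With normalize=True the result sums to 1, which makes it easy to read the values as proportions rather than counts.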
filterwarnings("ignore") The code and image are below: From the histogram above, we can say that the most frequent grade is around 10-12, but there is a tail on the left side (near zero). Student performance will be categorized as Fail, Fair, Good, or Excellent; the definitions are up to you. Perform an exploratory data analysis (EDA) and apply a machine learning model to the Students Performance in Exams dataset to predict students' exam performance in each subject. Hello, let's do some analysis on the Students Performance dataset to learn and explore the factors that affect the marks scored by students. The purpose is to predict students' end-of-term performance using ML techniques. However, the results became available to the lecturers only after all the grades were released to the students. Although it may be surprising, the undergraduate students provide a reasonable comparison for the graduate students. The graph for fathers' jobs is shown below: The boxplot shows the median and the lower and upper quartiles of the data. (3) Behavioral features such as raised hand in class, opening resources, answering survey by parents, and school satisfaction. Copy the AWS Access Key and AWS Access Secret after pressing the Show Access Key toggle: In the Dremio GUI, click on the button to add a new source. Then we call the plot() method. Using a permutation test, this corresponds to a discernible difference in medians. Such a system provides users with synchronous access to educational resources from any device with an Internet connection. To be able to manage S3 from Python, we need to create a user on whose behalf the code will act. We have also shown how to connect to your data lake using Dremio, as well as Dremio and Python code. In A. Brito and J. Teixeira (Eds.), Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008), pp.
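The permutation test for a difference in medians mentioned above can be sketched in plain Python. The text does not say how the test was implemented, so this is only one possible version: the function name, the two-sided comparison, the add-one correction and the example scores are all assumptions made for illustration.

```python
import random
import statistics


def permutation_test_median(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in medians.

    Hypothetical helper: shuffles the pooled scores, re-splits them into
    groups of the original sizes, and counts how often the shuffled
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.median(a) - statistics.median(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(
            statistics.median(pooled[:len(a)])
            - statistics.median(pooled[len(a):])
        )
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Made-up exam scores for two groups, for illustration only.
group_a = [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
group_b = [80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
p_value = permutation_test_median(group_a, group_b)
print(p_value)
```

A small p-value suggests that the observed gap in medians would be unlikely if group labels were assigned at random.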
The data is collected using a learner activity tracker tool, which is called the experience API (xAPI). To do this, use the create_bucket() method of the client object: Here is the output of the list_buckets() method after the creation of the bucket: You can also see the created bucket in the AWS web console: We have two files that we need to load into Amazon S3, student-por.csv and student-mat.csv. Calnon, Gifford, and Agah (2012) discussed robotics competitions as part of computer science education. This column should be binary. For all questions in the exam, difficulty and discrimination scores were computed, using the mean and standard deviations. This article has described an experiment to examine the effectiveness of data competitions on student learning, using Kaggle InClass as the vehicle for conducting the competition. The response rate for ST-PG was 50%: 17 of 34 students completed the survey. Students formed their own teams of 2-4 members to compete. (One of the 63 students elected not to take part in the competition, and another student did not sit the exam, producing a final sample size of 61.) Perhaps the link between the two could be emphasized by instructors when the competition is presented to students. We examine the percentage correct overall on the final exam for the different groups and the scores the students received for the second assignment. The data contain some challenges that make standard off-the-shelf modeling less successful, such as different variable types that need processing or transforming, some outliers, and a large number of variables. But often, the most interesting column is the target column. Being able to make multiple submissions over a time frame of several weeks enables them to try out approaches to improve their models.
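The bucket creation and the loading of the two CSV files into S3 could look roughly like the sketch below. It assumes the client object is a boto3 S3 client; the helper names make_s3_client and upload_datasets and the bucket name are hypothetical, and the credentials are placeholders for the key pair generated for the dedicated user.

```python
def make_s3_client(access_key, secret_key):
    """Create an S3 client (hypothetical helper; requires boto3)."""
    import boto3  # imported lazily so the helper below works without it

    return boto3.client(
        "s3",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


def upload_datasets(client, bucket, filenames):
    """Create a bucket, upload local files, and return all bucket names."""
    # Outside us-east-1, create_bucket also needs a
    # CreateBucketConfiguration with a LocationConstraint.
    client.create_bucket(Bucket=bucket)
    for name in filenames:
        # Arguments: local file name, bucket, object key in the bucket.
        client.upload_file(name, bucket, name)
    return [b["Name"] for b in client.list_buckets()["Buckets"]]


# Example call (placeholder credentials and a hypothetical bucket name):
# client = make_s3_client("<access key>", "<secret key>")
# upload_datasets(client, "my-student-bucket",
#                 ["student-por.csv", "student-mat.csv"])
```

Passing the client in as an argument keeps the upload logic separate from credential handling, so it can be exercised with a stand-in object before touching a real AWS account.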
Now, we use the hist() method on the df_num dataframe to build a graph: In the parameters of the hist() method, we have specified the size of the plot, the size of labels, and the number of bins. The Seaborn package has many convenient functions for comparing graphs. That is reasonable to expect. This task is being addressed by educational data mining. The experience API helps the learning activity providers to determine the learner, activity and objects that describe a learning experience. Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets: 1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira) The features are classified into three major categories: (1) Demographic features such as gender and nationality. Now everything is ready for coding! This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. In the years prior to this experiment, the undergraduate scores on the final exam were comparable to those of the graduate students, although undergraduates typically have a larger range with both higher and lower scores. It is often useful to know basic statistics about the dataset. The sample() method returns N random rows from the dataframe. It may be advisable to limit students to one submission per day. Several years ago they released a simplified service that is ideal for instructors to run competitions in a classroom setting. Computational Statistics and Data Mining (CSDM) is designed for postgraduate level students with math, statistics, information technology or actuarial backgrounds.
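To make the basic-statistics and sample() steps above concrete, here is a small sketch. The dataframe is a made-up stand-in with only three columns, and using describe() for the basic statistics is our assumption about which Pandas method fits that step.

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "age": [15, 16, 17, 18, 16, 17],
    "G1": [10, 12, 8, 14, 11, 9],
    "G3": [11, 13, 7, 15, 10, 9],
})

# Basic statistics (count, mean, std, min, quartiles, max) per column.
stats = df.describe()

# Three random rows; random_state makes the draw repeatable.
rows = df.sample(n=3, random_state=42)

print(stats.loc[["mean", "min", "max"]])
print(rows)
```

Fixing random_state is optional, but it helps when the same rows should appear every time the notebook is rerun.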
The training and the testing datasets of the Melbourne auction price data were similar but not identical across the two institutions. For the CSDM and ST-PG regression competitions, a clear pattern is that predictions improved substantially with more submissions. The xAPI is a component of the training and learning architecture (TLA) that makes it possible to monitor learning progress and learners' actions, such as reading an article or watching a training video. It is well known for its competitions (e.g., Rhodes 2011), some of which come with rich monetary prizes (e.g., Howard 2013). Undergraduate students' performance in other tasks and exam questions not relevant to the competition was equivalent to that of the postgraduate students. Scores for the relevant questions were summed and converted into a percentage of the possible score. A short description of the datasets, including variable descriptions, is given in the Online Supplementary file. Let's say we want to create a new column, famsize_bin_int. In our case, this visualization may not be as useful as it could be. What's more, Freeman et al. These data address student achievement in secondary education at two Portuguese schools. In our case, this column is called final_target (it represents the final grade of a student). The second line of the code filters out all weak correlations. The academic assessment is recorded at two moments in the student's life. Increasing student awareness of the association between the knowledge obtained from the data competition, better understanding of the material, and better marks might increase all students' engagement with the competition.
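Filtering out weak correlations with the target column, as mentioned above, might look like the following sketch. The original code is not shown in the text, so the 0.5 cut-off, the variable names and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Made-up values; only the column names follow the dataset.
df = pd.DataFrame({
    "G1": [5, 8, 10, 12, 15, 18],
    "G2": [6, 7, 11, 12, 14, 19],
    "absences": [4, 0, 6, 2, 7, 3],
    "final_target": [5, 8, 10, 13, 15, 19],
})

threshold = 0.5  # hypothetical cut-off for a "weak" correlation

# Correlation of every numeric column with the target column.
corr = df.corr()["final_target"].drop("final_target")

# Keep only correlations whose absolute value exceeds the threshold.
strong = corr[corr.abs() > threshold].sort_values(ascending=False)

print(strong)
```

Using the absolute value keeps strong negative correlations as well as strong positive ones.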
Table 1. Computational Statistics and Data Mining: summary statistics of the exam score (out of 100) and the second assignment (out of 10) for the two competition groups. Exploratory Data Analysis: Students Performance in Exam. The data need to be split into training and testing sets. Scores for the question on regression (Q7a,b,c) in the final exam were compared with the total exam score (RE). In awarding course points for student effort, we typically align them with performance. Here is the SQL code for implementing this idea: On the following image, you can see that the column famsize_int_bin appears in the dataframe after clicking on the button: Finally, we want to sort the values in the dataframe based on the final_target column. Parts b and c were in the top 10 for discrimination and part a was at rank 13. Our advice is to keep it simple, so that you and the students can understand the student scores. Just call the isnull() method on the dataframe and then aggregate the values using the sum() method: As we can see, our dataframe is already well preprocessed, and it contains no missing values. A student who is more engaged in the competition may learn more about the material, and consequently perform better on the exam. The following window should appear: In the window above, you should specify the name of the source (student_performance) and the credentials that you generated in the previous step. It can be helpful if you want to look not only at the beginning or end of the table but also to display different rows from different parts of the dataframe: To inspect what columns your dataframe has, you may use the columns attribute: If you need to write code for doing something with a column name, you can do this easily using Python's native lists.
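The missing-value check, the column inspection and the sort on final_target described above can be sketched together in Pandas. The dataframe here is a made-up four-row example, and sorting in descending order is an assumption about the intended direction.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset.
df = pd.DataFrame({
    "school": ["GP", "MS", "GP", "MS"],
    "famsize": ["GT3", "LE3", "LE3", "GT3"],
    "final_target": [11, 17, 6, 14],
})

# Number of missing values per column.
missing = df.isnull().sum()

# Column names as a native Python list.
column_names = list(df.columns)

# Rows ordered by final grade, highest first.
df_sorted = df.sort_values(by="final_target", ascending=False)

print(missing)
print(column_names)
print(df_sorted)
```

Converting df.columns to a list makes it straightforward to filter or loop over column names with ordinary Python code.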
Student ID
1- Student Age (1: 18-21, 2: 22-25, 3: above 26)
2- Sex (1: female, 2: male)
3- Graduated high-school type: (1: private, 2: state, 3: other)
4- Scholarship type: (1: None, 2: 25%, 3: 50%, 4: 75%, 5: Full)
5- Additional work: (1: Yes, 2: No)
6- Regular artistic or sports activity: (1: Yes, 2: No)
7- Do you have a partner: (1: Yes, 2: No)
8- Total salary if available (1: USD 135-200, 2: USD 201-270, 3: USD 271-340, 4: USD 341-410, 5: above 410)
9- Transportation to the university: (1: Bus, 2: Private car/taxi, 3: bicycle, 4: Other)
10- Accommodation type in Cyprus: (1: rental, 2: dormitory, 3: with family, 4: Other)
11- Mother's education: (1: primary school, 2: secondary school, 3: high school, 4: university, 5: MSc., 6: Ph.D.)
12- Father's education: (1: primary school, 2: secondary school, 3: high school, 4: university, 5: MSc., 6: Ph.D.)
13- Number of sisters/brothers (if available): (1: 1, 2: 2, 3: 3, 4: 4, 5: 5 or above)
14- Parental status: (1: married, 2: divorced, 3: died - one of them or both)
15- Mother's occupation: (1: retired, 2: housewife, 3: government officer, 4: private sector employee, 5: self-employment, 6: other)
16- Father's occupation: (1: retired, 2: government officer, 3: private sector employee, 4: self-employment, 5: other)
17- Weekly study hours: (1: None, 2: <5 hours, 3: 6-10 hours, 4: 11-20 hours, 5: more than 20 hours)
18- Reading frequency (non-scientific books/journals): (1: None, 2: Sometimes, 3: Often)
19- Reading frequency (scientific books/journals): (1: None, 2: Sometimes, 3: Often)
20- Attendance to the seminars/conferences related to the department: (1: Yes, 2: No)
21- Impact of your projects/activities on your success: (1: positive, 2: negative, 3: neutral)
22- Attendance to classes (1: always, 2: sometimes, 3: never)
23- Preparation to midterm exams 1: (1: alone, 2: with friends, 3: not applicable)
24- Preparation to midterm exams 2: (1: closest date to the exam, 2: regularly during the semester, 3: never)
25- Taking notes in classes: (1: never, 2: sometimes, 3: always)
26- Listening in classes: (1: never, 2: sometimes, 3: always)
27- Discussion improves my interest and success in the course: (1: never, 2: sometimes, 3: always)
28- Flip-classroom: (1: not useful, 2: useful, 3: not applicable)
29- Cumulative grade point average in the last semester (/4.00): (1: <2.00, 2: 2.00-2.49, 3: 2.50-2.99, 4: 3.00-3.49, 5: above 3.49)
30- Expected Cumulative grade point average in the graduation (/4.00): (1: <2.00, 2: 2.00-2.49, 3: 2.50-2.99, 4: 3.00-3.49, 5: above 3.49)
31- Course ID
32- OUTPUT Grade (0: Fail, 1: DD, 2: DC, 3: CC, 4: CB, 5: BB, 6: BA, 7: AA)
Yılmaz N., Sekeroglu B.

It also prevents students from spending too much time building and submitting models. To load these files, we use the upload_file() method of the client object: In the end, you should be able to see those files in the AWS web console (in the bucket created earlier): To connect Dremio and AWS S3, first go to the section in the services list, select the Delete your root access keys tab, and then press the Manage Security Credentials button.