BSC Training Course: Introduction to Big Data Analytics

Date: 05/Feb/2024 Time: 09:30 - 09/Feb/2024 Time: 13:30

Place:
E101 Room, C6 Building - Campus Nord, UPC.

Target group: For trainees with some theoretical and practical knowledge.

Cost: There is no registration fee. The course is free of charge.

(This agenda may be subject to changes)

AGENDA

Day 1 (Feb 5th, 2024)

  • 9:30 – 13:00 Introduction to Big Data (Josep Lluis Berral, Computer Sciences - Data Centric Computing, BSC)

In this session we will introduce students to the technologies associated with Big Data: data challenges, cloud computing, data processing, and the Internet of Things. An overview of these technologies will be provided from both a technical and a business-model point of view.

11:00 - 11:30 Coffee break

13:00 – 14:00 Lunch Break

  • 14:00 – 15:45 Good practices for reproducible data science (part 1) (Miguel Ponce de León, Computational Biology, BSC)

Today, computers have become essential for data management. One of the main reasons is that all scientific disciplines share the need to deal with big volumes of data: to organise, pre-process, analyze and, finally, convert a collection of raw datasets into pieces of knowledge or data-supported actions. Given the central role data plays in research and industrial applications, it is also critical to follow guidelines and use standard analysis workflows that generate reproducible results. Although the concept of reproducibility is at the heart of data analysis and scientific research (think of a researcher carrying out an experiment in a chemistry lab), the majority of data practitioners, students and researchers have no formal training in reproducible scientific computing. In most cases, data scientists and researchers acquire their technical skills by doing, and they learn good practices the hard way, i.e. by making a lot of mistakes and spending a lot of infernal hours trying to find out why things no longer work, or why the surprising result found yesterday cannot be replicated when showing it to a colleague (or an advisor!). In this short tutorial, we will present a collection of good practices for reproducible data science, drawn from our own experience as well as from the experience shared by colleagues. These practices can (or should) be adopted by any data scientist or researcher, regardless of their current level of computational skills.

15:45 – 16:15 Coffee break

  • 16:15 – 18:00 Good practices for reproducible data science (part 2) (Patricio Reyes, Data Analytics and Visualization, BSC)
 

Day 2 (Feb 6th, 2024)

  • 9:30 – 13:00 Big Data Management (Albert Abelló, UPC, inLab FIB and Petar Jovanovic, UPC)

Big Data has many definitions and facets; we'll pay attention to the problems we have to face to store it and how we can process it. More specifically, we'll focus on the Apache Hadoop ecosystem and two of its basic components, namely HBase and the MapReduce engine.

11:00 - 11:30 Coffee break

  • 11:30 - 13:00 Hands-on exercise

13:00 – 14:00 Lunch Break

  • 14:00 - 16:00 NoSQL databases (Sergi Nadal, UPC-BarcelonaTech)

The relational model has dominated data storage systems since the mid-1970s. However, changing storage needs over the past decade have given rise to new models for storing data, collectively known as NoSQL. In this presentation, we will focus on two of the most common types of NoSQL databases, document-oriented databases and graph databases, and explain the use cases suitable for each of them.

Day 3 (Feb 7th, 2024)

  • 9:30 – 13:00 Data Analytics with Apache Spark. Part 1 (Josep Lluis Berral, Computer Sciences - Data Centric Computing, BSC)

Apache Spark has become an established technology for fast, general-purpose large-scale data processing, with “programmer-friendly” interfaces, official bindings for many of the most widely used languages (Java, Scala, Python and R), extensive documentation and development tools. This session introduces Apache Spark, as well as some of its core libraries for data manipulation, machine learning, data streams and graph analytics.

11:00 - 11:30 Coffee break

13:00 – 14:00 Lunch Break

  • 14:00 – 15:30 Data Analytics with Apache Spark. Part 2 (Josep Lluis Berral, Computer Sciences - Data Centric Computing, BSC)

Day 4 (Feb 8th, 2024)

  • 10:00 – 11:15 Bias in Science - Sex and Gender Perspective in Big Data Analytics. Part I and Q&A (Atia Cortés, BSC and Davide Cirillo, Machine Learning For Biomedical Research Recognised Researcher, LS)

This workshop will provide knowledge about the existing biases in Big Data Analytics and Artificial Intelligence (AI) from a multidisciplinary perspective. The main objective is to raise awareness and build a culture of responsible practices in AI research and development. To this end, the social challenges related to AI will be reviewed, analyzing the different types of bias in science. Next, the ethical aspects of AI, which are key to its scientific and social impact, will be addressed at the international level. Finally, the workshop will look more closely at sex and gender differences and their specific implications, analyzing them along intersectional axes. (Recommended reading will be provided before the session.)

11:15 - 11:45 Coffee break

  • 11:45 – 13:00 Bias in Science - Sex and Gender Perspective in Big Data Analytics. Part II and Q&A (Davide Cirillo, Machine Learning For Biomedical Research Recognised Researcher, LS)

13:00 – 14:00 Lunch Break

  • 16:30 – 18:00 Business Intelligence (Karina Gibert, Intelligent Data Science and Artificial Intelligence Research Center (IDEAI-UPC))

Data contains information. The session focuses on the relationship between concepts such as data mining, business intelligence, big data, data science and the old school of classical statistics. An overview of the data science process as a way to extract added value from data will be given, and real cases will be presented as examples of application.

Day 5 (Feb 9th, 2024)

  • 9:30 – 11:30 Intro to Data Visualization (Guillermo Marin, Designer for Scientific Visualization, BSC)

Theory
1. Basic concepts
2. Human perception
3. Design
4. Colour
5. Audience / Validation / Bad practices
6. Visualisation design process

11:30 - 12:00 Coffee break

  • 12:00 - 13:00 Computational Social Sciences (Mercè Crosas, Head of Computational Social Sciences, BSC)

END of COURSE