Review of last lecture

Just to let you know: the lecture is being recorded, so you will be visible on the recorded lecture at the end, if that doesn't bother you. Okay, great.

So hello everyone, welcome back to our lecture. Last time we had a special lecture, a guest lecture by one of our local data scientists, Fabian Rost. Fabian explained, from a hands-on perspective (because that's his job), how you can detect order in non-equilibrium systems. The non-equilibrium systems that Fabian works on are, of course, biological systems, and what he basically showed is how order in non-equilibrium systems manifests itself in low-dimensional structures within these high-dimensional data sets. He showed you some methods for reducing dimensionality, or for extracting hidden dimensions, in such high-dimensional data sets. The slides and the video will also appear on the website; I haven't gotten around to uploading them yet.

introduction & slide 1

So today I will start by giving you a little more introduction to data science and some of the things that we need for the next lecture; that's the first part of the lecture. In the second part, I will give you another hands-on experience: a practical example, from start to finish, of how to go through such a data science pipeline.

To start, we'll go back to our New York City flights data set. There's a little gap because we had to find dates with Fabian last time, so this lecture now connects to what I told you two lectures ago. Let me share the slides and give you a brief introduction to data visualization, just a short one, because the slides have already been on the website for two weeks, so maybe some of you have already looked at them. Can somebody confirm that you can see my slides? Yes? Okay, perfect.

So today we'll have a hands-on data science example, and next week we'll have a hands-on field theory combined with data science example. Today I still need to introduce you to some methods that we will need. Of course, in your work you are all visualizing data all the time, but if your data is high-dimensional and complex in structure, it actually matters what you use for visualization.

Two lectures ago, when we talked about this New York City data set with the flights departing from the New York City airports, I showed you all kinds of ways to do very efficient computations on these data sets. But all these computations didn't really give us any real insight, and the reason was that we were dealing with plain numbers: we never had anything to look at. So today, in the first part of this lecture, I'll quickly show you a plotting scheme, so to say, that is very powerful for visualizing high-dimensional data in general.

slide 2

Before that, just a quick reminder: before we do anything, we always want to make our data set tidy. Typically we collaborate with experimentalists; we obtain the data in a very messy format, and the first step is to tidy the data. That means we need to bring it into a form where every column is an observable or variable, and every row is an observation or a sample.

[Student:] Sorry to interrupt, did you change your first slide?
Yes. Oh, you can't see that?
[Student:] No, we cannot.
I don't know, this always happens in Zoom lately; there was a Zoom update. Maybe I have to share the entire screen. The problem is just that I have about a hundred windows on my desktop. Okay, you should be seeing my desktop now. Well, now you know what I've been working on. Can you see this messy-and-tidy slide, and when I change the slides now, can you see that?
[Student:] Right now it works.
Okay, perfect.

So, just a reminder: the first step we always take is to make the data tidy. Once we have the data in tidy format, we can perform column-wise operations, which in most programming languages are highly optimized and very efficient to run. I introduced you to a very simple R package, data.table, that allows you to implement all of these operations in data science; there are many other packages, of course, and other languages.
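To make the tidying step concrete, here is a minimal sketch using data.table; the wide table and its column names are made up for illustration:

```r
library(data.table)

# messy, wide format: one column per month
messy <- data.table(airport = c("JFK", "LGA"),
                    jan = c(10, 12), feb = c(8, 15))

# tidy, long format: every row is one observation
tidy <- melt(messy, id.vars = "airport",
             variable.name = "month", value.name = "delay")
```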

slide 3

Typically such an operation consists of three steps. You filter the data; in the data.table package that is the first argument. You group the rows by the values of some columns; that's the third argument. And then you perform an operation independently on each group; that's the argument in the middle. This is a typical step in such a data science calculation, and I showed you how you can then combine these steps along pipelines to perform complex operations.
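This is the `dt[i, j, by]` pattern; a minimal sketch with a made-up toy table:

```r
library(data.table)

dt <- data.table(origin    = c("JFK", "JFK", "LGA", "EWR"),
                 month     = c(1, 1, 2, 2),
                 dep_delay = c(10, NA, 25, 5))

dt[!is.na(dep_delay),                  # 1. filter the rows (i)
   .(mean_delay = mean(dep_delay)),    # 2. operation per group (j)
   by = origin]                        # 3. grouping condition (by)
```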

slide 4

Today I want to show you a way to interact with your data more intuitively, and the way we typically do that is plotting, which you are all familiar with, of course. Usually we have a plot in mind, and this plot has a name: a scatter plot, a surface plot, a histogram. Then you look for the function, in MATLAB or wherever, that gives you this specific kind of plot.

Another way to do it, introduced by Leland Wilkinson in a book, is that you don't give the plots names. Instead, you construct these plots with a grammar: a set of rules that allows you to build up, step by step, almost any visualization of your data. Once you have that, you don't have to remember long names for different kinds of plots. You just add bit by bit, like in a sentence, where you add word by word to make the sentence richer in information. The only thing you need to know is the grammar itself, and this allows you to create very different kinds of visualizations from a simple grammar.

This idea of a grammar of graphics, a set of rules that allows you to construct visualizations, is implemented in R in the ggplot2 package, which is quite famous. In Python there is a somewhat newer implementation built on the same idea, the plotnine package; I don't know how well it works. In R we just load ggplot2 with the library command, and then we are able to use all of the commands

slide 5

in this package. The basic idea is that we start with a tidy data table (a data frame in R), and then we assign different visual characteristics of our plot to different columns of this data table.

The first thing we have to do is say what to plot, a point or a line, and that's called the geometry: a point, line, bar, circle, whatever. Then we have the mapping that I mentioned, where we map different aesthetic properties of our plot to different columns of this table. For example, we could say that the position on the x coordinate should be what is in column a, the y coordinate should be what is in column b, the size of our dots should reflect whatever is in column c, and the color should be what is in column d. How these values are then translated to specific colors or specific sizes is a different question. So now we have the aesthetic properties of our plots, or our dots: where they are located and what they look like. Then we just have to define a coordinate system to define where they appear on the screen. With these things together, we have the simplest version of a plot, shown on the right-hand side.
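In code, this mapping from columns to aesthetics looks as follows; a minimal sketch with a made-up table:

```r
library(ggplot2)

df <- data.frame(a = 1:5, b = c(2, 4, 3, 6, 5),
                 c = c(1, 2, 3, 2, 1), d = c("u", "v", "u", "v", "u"))

# x from column a, y from column b, size from c, color from d
ggplot(df, aes(x = a, y = b, size = c, color = d)) +
  geom_point()
```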

slide 6

The way this works in practice is that you have these little building blocks that you just put together, line by line. In R it looks like this: first you create an object with the ggplot command, where the first argument tells the plot what data to use, and the second argument is the mapping of different columns of your table to different visual properties of your plot. Then you add a geometry, for example point or line, and you have your first plot already.

You can of course also add more properties. You can add more geometries, or more detailed properties of your plot: for example, if you are not happy with Cartesian coordinates, you can set your own coordinate system, say polar coordinates or whatever. You can have subplots by adding a further rule, called facets. You can change how values in the data table map to different properties, so which color represents which value in your table. You can change themes, for example, or the way lines and so on are drawn, and you can of course also save your plot to a file. These different building blocks of your plot are connected via plus signs: you put as many of these aspects after each other as you want, and by this you construct more and more complex plots. For everything you see below there are sensible defaults, for example Cartesian coordinates, that in many cases you don't need to touch.

Now let's have a little look at our New York City data set.
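The general skeleton, with each building block added via `+` (the data frame df and its columns are the placeholders from above):

```r
library(ggplot2)

g <- ggplot(df, aes(x = a, y = b)) +  # data and aesthetic mapping
  geom_point() +                      # geometry
  coord_polar() +                     # optional: another coordinate system
  facet_wrap(~ d) +                   # optional: subplots (facets)
  theme_minimal()                     # optional: a different theme

ggsave("my_plot.pdf", g)              # save the plot to a file
```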

slide 7

We already discussed this two weeks ago. In this New York City data set, we have information about flights departing from the New York City airports. For each flight we have different pieces of information: for example the time when the flight departed, the origin airport, the number of the plane that was used, the carrier, and so on, and the delay for this specific flight. As we already discussed, you can connect this table, which you can download from our GitHub via the website, to other sources of information: for example the weather information for a given point in time and a given location, airport information, information about the planes, and also information about the airlines if you want.

So that's what we do: we load again, just like last time, all of these different files using the function fread, and then we merge them together using merge commands. When we merge them together, we sometimes have to specify the key columns: for example, when we merge with the airports, we have to say that here the airport identifier is in the column origin, and there the airport identifier is in the column faa. So we merge all of these things together and get a huge data set, a data table containing all of this information, line by line, in a tidy format. We already did that two weeks ago.

Now let's have a simple look at these data; let's make a simple plot.
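A sketch of the loading and merging step; the file names here are assumptions, and the key columns follow the nycflights13 conventions:

```r
library(data.table)

flights  <- fread("flights.csv")
weather  <- fread("weather.csv")
airports <- fread("airports.csv")

# weather is matched on the origin airport and the time of the flight
flights <- merge(flights, weather, by = c("origin", "time_hour"))

# airport metadata: the identifier is 'origin' on one side, 'faa' on the other
flights <- merge(flights, airports, by.x = "origin", by.y = "faa")
```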

slide 8

The first thing we can do, just like last time, is calculate the average delay for each month: we group the data by month, and for each month we take the average over all departure delays and save that average in the column mean_delay. What we get is, for each month in the first column, a mean departure delay in the second column. Below you can see the simple plot: you tell the ggplot function to take this table, a very simple table, map the month to the x-axis and the delay to the y-axis, and then you just add a geometry, which is just a point. Then you get what you see on the right-hand side, and you see that something is happening in the summer months, and something is happening over Christmas, apparently. (Okay, there's someone in the waiting room.) So, something is happening over Christmas; now let's go on.
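A sketch of this first plot, with column names as in nycflights13:

```r
library(data.table)
library(ggplot2)

# group by month, average over all departure delays
delays <- flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)), by = month]

ggplot(delays, aes(x = month, y = mean_delay)) +
  geom_point()
```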

slide 9 & 10: different types

We can of course also add different geometries to a plot. So far we just used the geometry of a point; we can also use others. For the sake of simplicity, I'm using the tools that I introduced two weeks ago to do all of this in one line: we take the flights data set, calculate for each month the average delay, and send everything with the pipe operator to ggplot. In the ggplot we just need to define the aesthetic mapping: the x coordinate is the month and the y coordinate is the delay. And all of this we save in an object g on the left-hand side.

Now we can take this g and add different things to it. We can add different geometries: on the top left we have the point as before; we can add a line geometry, then we get a line; we can add a bar, that's called col; or we can add all of them together to the plot, and then we have all of them at once. The information about what happens with the data is not in the geometry; we defined that once at the beginning, and now we can just operate on this object, add different things, and change the plot the way we like it.

There are also geometries that involve analysis. For example, if you have a background in biology, you probably know your favorite box plot, on the right-hand side, which summarizes different properties of a statistical distribution. Here on the left-hand side I take the flights, all this combined information, use the carrier as the x coordinate and the logarithm of the departure delay as the y coordinate, and I add this box plot, where I automatically get the median, the interquartile range, and, I always forget what the whiskers mean, probably the range of the data without outliers. In some disciplines these box plots are the standard way to characterize distributions. Another way to characterize distributions is violin plots, which essentially give you a plot of the probability distribution, just in a vertical manner: the thicker the violin is, the higher the probability to find a data point there.
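A sketch of reusing one saved mapping with different geometries, plus the two distribution geometries; the pipe comes from the magrittr package, and the merged flights table from above is reused:

```r
library(magrittr)

g <- flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)), by = month] %>%
  ggplot(aes(x = month, y = mean_delay))

g + geom_point()                             # dots
g + geom_line()                              # a line
g + geom_col()                               # bars
g + geom_point() + geom_line() + geom_col()  # all of them together

# geometries that involve analysis: distribution summaries per carrier
ggplot(flights, aes(x = carrier, y = log10(dep_delay))) + geom_boxplot()
ggplot(flights, aes(x = carrier, y = log10(dep_delay))) + geom_violin()
```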

slide 11 & 12: aesthetic

Of course we can also play with how these plots look. For example, if you look at the red part, I'm doing the same operation: I'm calculating the average departure delay for each month, each airport, and each carrier, but for simplicity I only take the big three carriers: United Airlines, Delta, and American Airlines. Now I create this plot again. I have the aesthetic mapping, the month as the x coordinate and the delay as the y coordinate, and now I add another aesthetic, which is the color: I say the color should be the origin airport, and the line type should correspond to the carrier. Then I just add the line geometry, and I get the plot you see at the bottom.

You can see that all carriers have a problem in the summer months and also over Christmas, except for American; something is going on with American Airlines around March, no idea what that is. No, wait, that's not American Airlines, that's Newark. American Airlines has a problem in March. You see, the plot is not perfect yet.

Okay, so we can go on and change other aspects of the plot. For example, here I say that the fill should be the airport, and then I use a box plot, and I get an overview of how the different airports compare to each other for each carrier. What you can see is that JFK is doing well for some of them but not for all: for American Airlines and for United, and, where is it, for Delta it is doing well. But there is no clear trend here, of course.
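A sketch of these extra aesthetics; the big-three carrier codes (UA, DL, AA) are the standard IATA codes:

```r
big3 <- flights[carrier %in% c("UA", "DL", "AA"),
                .(mean_delay = mean(dep_delay, na.rm = TRUE)),
                by = .(month, origin, carrier)]

# color mapped to the origin airport, line type to the carrier
ggplot(big3, aes(x = month, y = mean_delay,
                 color = origin, linetype = carrier)) +
  geom_line()

# fill mapped to the airport, with a box plot per carrier
ggplot(flights[carrier %in% c("UA", "DL", "AA")],
       aes(x = carrier, y = log10(dep_delay), fill = origin)) +
  geom_boxplot()
```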

slide 13: subplots

Something more interesting is to plot these delays for the big three carriers as a function of the hour of the day. Here the x coordinate is the hour, which I turn from numeric into a factor, something discrete, just for plotting purposes. The fill color is the origin airport, and I added a subplot here, called a facet, by carrier. If you remember from two lectures ago, this is a formula: we can use a formula in R to specify how plots are distributed across different subplots. And you can see that for Delta and United Airlines you can nicely see how these delays add up during the day.
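A sketch of the facet; that the underlying geometry is a box plot is an assumption from the description:

```r
ggplot(flights[carrier %in% c("UA", "DL", "AA")],
       aes(x = factor(hour), y = log10(dep_delay), fill = origin)) +
  geom_boxplot() +
  facet_wrap(~ carrier)   # one subplot per carrier
```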

slide 14

Let's have a look at the next slide. We can also have more complicated subplots, for example a grid, by using a more complicated formula, where the y direction is the origin and the x direction in this grid is the carrier. Then we get these plots, and you can actually see how these delays add up during the day. It's a speculation, but because we have a logarithm on the y-axis and we see a linear increase of these delays over the course of the day, this suggests an exponential build-up of delays. That's quite interesting.
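The grid version only changes the facet rule:

```r
ggplot(flights[carrier %in% c("UA", "DL", "AA")],
       aes(x = factor(hour), y = log10(dep_delay), fill = origin)) +
  geom_boxplot() +
  facet_grid(origin ~ carrier)  # rows: origin airport, columns: carrier
```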

slide 15: plot of 3 variables

Okay, we can do all kinds of other fancy things if we have more than two variables. For example, when we calculate the average delay as a function of the month, the hour, and the origin airport, we have even more variables that we want to visualize. We can do that, for example, with something called a heat map. In this heat map, the fill (and also the color) of these tiles is given by the mean delay, while the month and the hour are plotted on the axes; then we add the geometry of the tile to get these heat maps. This lets you visualize relationships between two variables, namely month and hour, and it seems like this build-up of delays is especially drastic in the summer months, while it's not that evident in other months.
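A sketch of the heat map; which variable sits on which axis is an assumption:

```r
by_mh <- flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)),
                 by = .(month, hour, origin)]

ggplot(by_mh, aes(x = hour, y = month, fill = mean_delay)) +
  geom_tile() +               # the tile geometry gives the heat map
  facet_wrap(~ origin)        # one heat map per origin airport
```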

question on assigning facet

[Student:] Excuse me, I have a question about the syntax. Where you've written facet_wrap, does the first argument tell us the x argument, so that origin will be plotted along the x scale?

Yes. This facet_wrap just says: take one column of the data table, in this case origin, group the data according to this column, and then make one plot for each of these origin airports and put them next to each other, as many as fit on the screen; if they don't fit on the screen, go to the next line. That's all the "wrap" is: a one-dimensional line of plots, so to say, compared to the grid we had before. The grid, where was it, is basically the same thing, but there we have two directions: origin airport along the y direction and carrier along the x direction. This is the formula notation in R. It's a little counter-intuitive, but you give this package a formula to tell R how the plots should be distributed on your screen, or in the PDF file that you export.

The reason I used a formula here is that you could do something more complicated, for example carrier + month: a combination of the carrier and month columns along the x direction, and origin along the y direction. So you can construct more complicated grids of plots if you want to; very often that's not very useful, though.

[Student:] That was my question, because on the next slide the origins are plotted along the x scale, whereas in this particular case the origin airport is plotted along the y scale.

Exactly. So here, let me get my mouse cursor back and change the slide: here I just left out the first part of the formula, the left-hand side, which was originally the y direction, so only the x direction is left, from left to right. And I use wrap and not grid because if I had not three but, say, 15 different groups, I wouldn't want them all on the same line, because I wouldn't be able to see them on the screen. Wrap means: once the screen is full, go to the next line. Nothing more complicated than that; it just makes one plot for each origin.
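To summarize the two facet rules discussed here (g being a saved ggplot object as before):

```r
g + facet_wrap(~ origin)           # one panel per origin, wrapped into rows

g + facet_grid(origin ~ carrier)   # grid: y direction ~ x direction

# more complicated grids are possible, e.g. carrier and month combined on x
g + facet_grid(origin ~ carrier + month)
```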
[Student:] Excuse me, on this heat map the minimum value of the color bar is not zero. Does that mean there are flights that departed earlier than scheduled?

Yes, exactly, they departed earlier. It's not very funny for the passengers, but sometimes that happens, and it's also a question of how the data is recorded. Apparently this especially affects the early mornings and the very late times. These negative departure delays can happen; the question is what exactly is recorded. It's probably not the time when the gates close, but the time when the airplane starts moving, or something like that. And as you know, if boarding is completed early, the airplane sometimes leaves a little earlier than scheduled.

But as always, there's a good lesson here: it's always good to know how the data was actually collected. You'd think a delay is well defined, but you can measure it in different ways; that's always a very important aspect. And in this data set there are also a lot of missing numbers, for example when a flight started somewhere but didn't end up at its arrival location, landing at another airport instead. That's also possible.

slide 16: error bar

Okay, let's go on. We can also use ggplot to do statistical computations on the fly, and this is particularly useful for computing fancy error bars; that's what I use it for. For example, here on the top we have the hour on the x-axis, the departure delay on the y-axis, and color and fill mapped to the origin airport. Because on the left we take the raw data, for each of these combinations we have many different values, many different flights. We can then take a statistical summary function from ggplot2, tell it to calculate the mean, and use the geometry of a line; ggplot does the computation for us. For simple things like a mean calculation that's pretty good, but the nice thing is that we can also use summary functions that are more complicated. Here, for example, we have bootstrapped confidence intervals, basically a fancy way of calculating confidence intervals, and we use the geometry of a ribbon to visualize them. If you have dealt with confidence intervals, you know they are quite complicated to calculate, and then you somehow have to bring them into your plot. Here you don't have to worry about any of that: you get quite fancy methods in one line, and a nice visualization of the uncertainty of your data.

Another thing: here, in the upper case, our x-axis is discrete, because it's an hour from 5 to 23. But sometimes we have real-valued quantities, and then we need to tell these functions which values to put together; that means we need to bin the data. An example is the temperature in Fahrenheit on the x-axis. We cannot calculate a separate mean for every value of the temperature, because temperature is a real-valued quantity; we need to define bins to summarize values of the temperature. We can do that automatically with the stat_summary_bin functions: we just tell the function to bin the data, calculate the mean for each bin, and plot the result with the geometry of a line. And we can add the same fancy error bars and confidence-interval calculations as before, computations that are quite involved if you have to do them yourself.

Here the message is probably that it's not good if it's too cold or too hot. But there are correlations: when it's hot, on the right-hand side, that's also when the holidays take place, June and July, which is exactly where we saw the delays in the previous heat maps. So it's not quite clear whether it's the temperature that's bad for the engines or something like that, or whether it's just the number of people who go on holiday, block the airport, and lead to delays.
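A sketch of these on-the-fly summaries; mean_cl_boot is ggplot2's bootstrap confidence-interval helper (it requires the Hmisc package), and the temp column follows nycflights13:

```r
# mean delay per hour, computed on the fly
ggplot(flights, aes(x = hour, y = dep_delay, color = origin, fill = origin)) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun.data = mean_cl_boot, geom = "ribbon", alpha = 0.3)

# real-valued x: bin the temperature, then compute the mean per bin
ggplot(flights, aes(x = temp, y = dep_delay, color = origin)) +
  stat_summary_bin(fun = mean, geom = "line", bins = 20)
```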

slide 17: interpolation

Now we can do even fancier things: interpolation. On the left-hand side I have the same plot as before; we have the month, wait, okay, there's a little error here, it's the hour on the x-axis. Now we can add an interpolation line, just in one line, with this smoothing function, and if we wanted to, we could do non-linear interpolation, linear interpolation, or anything we want, just with an argument, and we get our usual nice error bands for free.

Of course, if we can do non-linear fits, we can also do linear fits, so we can fit linear models to check for correlations; that's what we do on the right-hand side. What is actually quite interesting there: on the x-axis we have the month, on the y-axis we have the number of seats in an airplane, and the color is the origin airport. What we find is an almost perfect linear relationship between the month and the number of seats: positive for Newark and JFK, and negative for LGA, which is, I think, LaGuardia. No idea where this comes from, but apparently in December you have a higher likelihood of sitting in a smaller plane if you are departing from LaGuardia, while the planes get linearly larger throughout the year for some reason. That's one of those things where you should be suspicious and check what is actually the underlying reason for this data. That's something I will also show you later: it's very important to check whether your statistical computations actually make sense or not. Big data gives you every result you want if you just look for it; just because you have so many dimensions and so many samples, you can find every hypothesis you want in these data sets if you just keep looking for it.
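A sketch of the two fits; geom_smooth uses a non-linear smoother by default, and method = "lm" switches to a linear model:

```r
# non-linear interpolation with a confidence band
ggplot(flights, aes(x = hour, y = dep_delay)) +
  geom_smooth()

# linear fit: number of seats per plane vs. month, per origin airport
ggplot(flights, aes(x = month, y = seats, color = origin)) +
  geom_smooth(method = "lm")
```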

slide 18: scales

We can also play around with scales; that means we can change how our plots look. For example, this plot here on the top is something we have seen before, the box plot, and we save it in a variable p. Now you see this weird arrow assignment operator in R; why the R community likes that you can assign in either direction, I don't know. Here I have the plot and assign the result to a variable p; it's an asymmetric assignment. So now we have our plot, and we can add different color scales: different ways in which our data values map to visual characteristics of the plot. For example, we can add a brewer color scale, then we get different blue tones; or we can add a manual mapping, where we say we want black, gray, and white as the colors for our airports.
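A sketch of swapping scales on a saved plot; the exact palette name is an assumption:

```r
p <- ggplot(flights, aes(x = carrier, y = log10(dep_delay), fill = origin)) +
  geom_boxplot()

# a brewer palette with blue tones
p + scale_fill_brewer(palette = "Blues")

# a manual mapping from values to colors
p + scale_fill_manual(values = c(EWR = "black", JFK = "gray", LGA = "white"))
```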

slide 19: positions

So we can change visual characteristics, and we can of course also change how things are positioned relative to each other. I'll go over this quickly, because it's a bit of a detail. For example, we can create this plot on the right-hand side, where for each month, origin, and carrier we calculate the average delay for the three carriers, and then make a plot where we assign the month to the x-axis, the delay to the y-axis, the fill color to the origin airport, and the transparency of this color to the carrier. We can plot all of this using a bar plot, and with a bar plot we can decide how to put these bars relative to each other. I'll just give you three examples. We can stack them on top of each other; that's on the right-hand side. We can dodge them, which means we put them next to each other; that's in the middle. And we can use a fill position, which means we always fill them up to one: we look at the fraction that a certain carrier and origin airport contribute to the total delays. You can see here, for example, that a large fraction of the delays in March actually comes from this Newark airport, while in other months, for example the summer months, the larger fraction of the delays actually comes from the other airports, JFK and LGA.
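A sketch of the three positions:

```r
bars <- flights[carrier %in% c("UA", "DL", "AA"),
                .(mean_delay = mean(dep_delay, na.rm = TRUE)),
                by = .(month, origin, carrier)]

b <- ggplot(bars, aes(x = month, y = mean_delay,
                      fill = origin, alpha = carrier))

b + geom_col(position = "stack")  # bars stacked on top of each other
b + geom_col(position = "dodge")  # bars next to each other
b + geom_col(position = "fill")   # fractions, always filled up to one
```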

slide 20: coordinate system

We can also change the coordinate system. So far we always assumed Cartesian coordinates, but we can of course have any other coordinate system. Here, for example, we plot the wind direction as the x coordinate and the departure delay as the y coordinate, and we calculate the average delay, again using the summary function, for certain intervals of the wind direction. We can then plot that in different ways: in Cartesian coordinates, of course, but when we talk about directions it's actually more instructive to use polar coordinates. You can see that I do that by adding just one line, one more rule, to the plot. And now I can add more aesthetic mappings: for example, I can separate, as before, these different wind-direction contributions by airport. That's what I've done here; I just added one more aesthetic mapping, said that these bars should be next to each other and not on top of each other, and kept the polar coordinates as before. Then you get the plot on the right-hand side.

In this plot you see that there is a relation between the wind direction and the departure delays, specifically when the wind comes from, what is that, south-west, a little south-west-west, for the two airports LGA and Newark.

[Student:] It's actually there for all of them.

Yes, it's actually there for all of them, but it's especially strong for LGA and Newark. And if you look at the location of New York, that's where the sea is; that's probably also where a lot of the strong winds come from. Okay, this was just playing around with the data, and you get some insights from just visualizing it, insights that are of course much harder to get if you just look at data tables on the console, as we did two weeks ago.

What you can also see as you create such plots: you can make more and more complicated plots, but the complexity of your code never blows up. It only increases linearly, because you're adding just one bit at a time, one layer at a time, to your plot, and you can make plots as complicated as you want without adding more and more complexity to your code, and without requiring more and more specialized functions. That is the advantage of having such a grammar of graphics: simple visual rules that allow you to add more and more components to a plot.
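Going back to the polar wind-direction plot, a sketch; the wind_dir column follows nycflights13, and the number of bins is an assumption:

```r
ggplot(flights, aes(x = wind_dir, y = dep_delay, fill = origin)) +
  stat_summary_bin(fun = mean, geom = "col",
                   bins = 16, position = "dodge") +
  coord_polar()   # one added rule switches the coordinate system
```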

slide 21

And then, of course, we can make these plots look nice. We can add things like axis labels for all of our columns. Typically you get a data table where some experimentalist has used their own notation for things, which most of the time doesn't make much sense; you want your own names for the axes, the colors, and the legends, specifically including units if you want to publish. You can do that easily with the labs command, and there's also a title command: you can add a title and annotate your plot as much as you want.
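For example, reusing the saved box plot p from above:

```r
p + labs(x = "Carrier", y = "Departure delay (log10 minutes)",
         fill = "Origin airport",
         title = "Departure delays at the New York City airports")
```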

slide 22: extensions

And then you can get as complicated as you want: you can download extensions, and some nice extensions add new geometries and new coordinate systems to these plots. Here, for example, are plots that are used in anatomy: they add the human body and the mouse body as coordinate systems, and then, without any more complexity than what I already showed you, you can visualize your data, imaging data for example, on the mouse or human body, or whatever you want to do. There are a ton of different extensions like this. So this is a very efficient way of visualizing data

slide 23: there is also a Python implementation

that relies on a grammar, a set of rules. I showed you the R implementation, but there is also a Python implementation, which is rather new; I don't know what its quality is. What I now want to do is show you how to use these tools that we've seen in the last couple of lectures in a specific

The following slides are not among the slides shared on the course website. The lecturer went through a Jupyter notebook about an RNA sequencing project.

data science project. For this we'll just go through the code of a real data science project, a project that Fabian actually did while he was in the group. The starting point of this project is a so-called sequencing experiment. I've already shown you this table: this is, so to say, the matrix that the experimentalists would send you. Here, every row is a different gene and every column is a different cell. We have maybe twenty thousand or thirty-seven thousand cells, and for each of these cells we have roughly ten thousand measurements. These measurements correspond to how strongly a certain gene, the one in the row here, is expressed in this particular cell; the numbers correspond to how many products of these genes the experimental technique found in a given cell. And these genes, as you may have heard, tell us a lot about what cells are doing, how they're behaving, and what kind of cells they are, so they are very important molecular measurements of what's going on inside cells. For example, this gene here, with this cryptic ID, in row four, is essentially not expressed: it has a little bit of signal in this particular cell but not in other cells. Other genes, like this one here, have very high expression values: very high counts of products from these genes were detected in these experiments.

What I have to tell you is that these experiments are extremely messy. In particular, there is a step where the data is exponentially amplified, and that exponentially amplifies errors in these data sets. So it's a big mess, and now we have to find some structure in this high-dimensional data set from these genomics experiments. To show you how this works, I'll share another screen and give you a hands-on look at how this actually works. I won't tell you too much about the biological background of this project, because it's not yet published. You should be able to see the browser now.

You can see that this is actually a combination of Python, the first block, and R: here he's loading some R packages, and here he's loading Python packages, and all of this is a Jupyter notebook, combining R and Python to take the best of both worlds. Then there's a lot of data loading going on; of course, we don't have to look in detail at how the data is loaded, plus some biological background information about what different genes are doing, and so on.

Now we start with the pre-processing of the data. As I told you, this data is messy, and something like 80 percent of it is
nonsensical information; in other words, it is dominated by technical noise. The technical noise is extremely strong, and it gives rise to very weird results. So the first step we always have to take, in this particular example in genomics but also in other data sets, is to look at the data and polish it so that we are actually, in principle, able to detect information in it.

For example, this plot here on the top shows you basically what percentage of all the information in a cell goes to certain genes. They have these weird names, which are completely irrelevant, but you see that this gene at the top, the one with an even weirder name, comprises eighty percent of the information in some cell. That does not make any biological sense: if you have thirty thousand genes in a cell, it can't be that the cell is almost completely packed with products from a single gene; that cannot happen in real life. And that's why we see that there are a lot of cells here, everything where we have more than maybe 30 percent, where we don't have any reasonable information. That means we need to do quality control: we need to keep the cells that actually carry meaningful information and filter out the cells that don't.

What we do is look at histograms like these: we calculate probability densities over all cells, all columns of this matrix, and on the x-axis here is the total amount of information that we have for a cell, that is, the total number of molecules that we detected for a single cell. You can see this follows a distribution, and the first thing you see is that there are two blocks: some cells are worse and some cells are better, so there is already some variance in the data just because the quality of our measurement differs between two groups of cells. But all of these are actually good values. We just take out some cells, marked by the vertical line, that are below one million of these counts; those we throw away.

We can also look at other counts, for example how many genes we detect, and here we also cut off cells that are low quality, in these tails; we just remove them from the data set. We know that if we kept them in the data set, in the long term we would have problems with dimensionality reduction: these cells would end up dominating anything based on machine learning, clustering, or dimensionality reduction. So we remove them from the data set.
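The notebook's own code isn't reproduced here; as a rough sketch of this kind of quality-control filtering, here is what it could look like with the Seurat package (the use of Seurat, the mitochondrial-gene pattern, and the exact thresholds are all assumptions, with the count cutoff taken from the lecture's histogram):

```r
library(Seurat)

# counts: a genes x cells matrix from the sequencing experiment
seu <- CreateSeuratObject(counts = counts)

# percentage of counts coming from mitochondrial genes, per cell
seu[["percent_mt"]] <- PercentageFeatureSet(seu, pattern = "^mt-")

# keep only cells above the count threshold and below the mitochondrial cutoff
seu <- subset(seu, nCount_RNA > 1e6 & percent_mt < 20)
```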
[Student:] Excuse me, is there any systematic way to set the threshold for filtering?

In this case it's a matter of experience; there's not a systematic way. Normally, in such a data set, you would set the threshold in the middle between the two peaks. But because the two peaks are both at reasonable values and of about the same height, we would lose 50 percent of the data, and that's a little too much. They're all reasonable values, but we have to check later that, if we find two groups of cells in the data, these two groups are not just representing these two peaks in the quality of the measurement. We now go on with the analysis, and if something is suspicious we go back to this stage, and we might have to be more rigorous with the cutoff. So in this particular case there's no rigorous way of doing it; it's a matter of how much you expect and of what a good measurement looks like. In terms of these counts, this is actually a pretty good example. This is zebrafish, by the way, so we have fewer genes in total than in other animals. Sometimes you have another peak here at very low values, and that we would then cut off completely.

Here, the bigger problem is this plot: we have a lot of cells with a high percentage of mitochondrial counts, that is, genes on the DNA that is in the mitochondria. This DNA does not produce many gene products, so it's suspicious if you have too much of that in these cells, and here we take out roughly 20 percent of the cells; that's what we lose in this step.

We can also plot both against each other: for example, here on the x-axis we have these different values that represent the quality of our data, and we just draw the vertical lines that we had in the histograms into this scatter plot, and see visually which cells we keep.

Okay, so now we've gotten rid of the bad stuff, the things that are totally crazy. The next thing we need to do is make cells comparable. Cells still have different measurement qualities; they still differ for technical reasons: for some cells we have a lot of information, a lot of detected counts, and for other cells we have less. But we want to make them comparable to each other, and that's why we have to normalize the data. There are fancy ways of doing this in genomics, and you can see we're doing all of that here; the point is that we have to normalize the data to make cells comparable, and that's something you always have to do.

What we also do here: these data counts, as you could see in the matrix that I showed you, contain very large numbers and very small numbers.
These counts live on an exponential scale: their distributions are very skewed, with a few cells, or a few genes, having a huge number of these counts. That's not something that works very well with dimensionality reduction or clustering methods, so we take the logarithm: we log-transform the data for any further processing. That's also something you do whenever your data is too spread out or comes from some exponential process: you log-transform it because you want something like a normal distribution, symmetric and rather compact.

Okay, let's go on. Here we do a variance-stabilizing transformation and some more processing of the data, and then we can start to understand the data. The first thing we need to check is whether what we're doing actually makes sense: are we actually looking at biological, real information, or are we just looking at technical aspects of the experiments? As a first step, we plot a PCA; Fabian showed you last week what a PCA is. On these principal component analysis plots the data looks like this. Here I plot, for example, the total number of counts in a cell; that's a technical quantity, just measuring the quality of the measurements. You can see there is some variability: some cells have a little more, some a little less, and some of this technical variability is captured by the principal component analysis. But here we're fine with it: it's not extreme, we don't have disconnected clusters, so the data is already in good shape. And we also know that cells can have these differences in the total number of molecules for biological reasons.

Then we can look at these plots here: what percentage of the variance is actually explained by certain principal components. The y-axis is actually something else, but we can order these principal components by how much of the total data they explain. And what we do, and this is, so to say, the professional way of doing things, is to do all further calculations not on the raw data but on the principal components of the data. That's an intermediate step we take just to get cleaner results in the end. So we take the first 20 or 25 or so principal components, which constitute something like 99 percent of the variance, and we say the rest is noise; it's a way of getting rid of the noise in the data.
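Continuing the hedged Seurat sketch from above (again an assumption about the tooling; the number of retained components follows the lecture's roughly 25):

```r
seu <- NormalizeData(seu)          # make cells comparable, then log-transform
seu <- FindVariableFeatures(seu)   # variance-stabilizing feature selection
seu <- ScaleData(seu)
seu <- RunPCA(seu)

ElbowPlot(seu)                     # variance explained per principal component

# keep the leading ~25 components and treat the rest as noise
seu <- RunUMAP(seu, dims = 1:25)
```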
Now we go on and do further dimensionality reduction. Let me make this larger; I hope you can see these plots. This is a UMAP, a non-linear way of reducing the dimensions, which Fabian showed you last week. You can see that once we do the non-linear dimensionality reduction, our data already looks much more structured. These cells here are actually from the brain; these are brain cells. And of course there are different kinds of cells in the brain, so we also expect a structure to pop up in these low-dimensional representations; typically these clusters correspond to different kinds of cells. But so far it's just a gray bunch of dots in a two-dimensional plane: we don't know what the axes are, and we don't know what these cells are. Now we have to dig a little deeper, and we do clustering.

So here is the clustering, one of these community-based clustering algorithms, and we did that for several resolutions of the clustering. In clustering, most of the time you have to tell the algorithm how many clusters you want; that's the kind of resolution that you give to these algorithms. What you see here are different clusterings with different resolutions: if you say, give me, what is it, 15 clusters, you get the plot on the bottom left; if you say, give me eight clusters, you get the clusters on the top left; and you can have more clusters if you want. These are different stages, but we don't know yet what makes sense; we don't know how many real clusters there are in the data. But we can take all of them and go one step further.

How do we know that such a cluster is a real biological cluster? We know it if the cells in a cluster all share some property that is not shared by the other cells; then we know that this cluster is something real that is really going on in the brain. And the way we do that is: we look at the literature. We look at papers, and in these papers we see, okay, there are different kinds of cells in the brain, and people have done experiments, genetic experiments for example, where they found, say, that stem cells express a certain gene, for example this gene here that's expressed by stem cells. Now we plot this UMAP with the color representing how much of the product of this gene we found in a certain cell. And now it makes a little bit of sense: here in this corner, on the top left, these are our stem cells. Then we can see, okay, there are also neurons in the brain and many other things; let's see what's going on. There's another cell type that expresses this gene, the more advanced cell type that comes from stem cells, the next step; that's expressed in these cells here.
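A sketch of the clustering at several resolutions and of coloring the UMAP by a known marker gene; gfap is a common radial glia / stem cell marker, used here purely as an illustrative stand-in for the gene named in the lecture:

```r
seu <- FindNeighbors(seu, dims = 1:25)

# community-based clustering at several resolutions
for (res in c(0.4, 0.8, 1.2)) seu <- FindClusters(seu, resolution = res)

DimPlot(seu, group.by = "RNA_snn_res.0.8", label = TRUE)

# color the UMAP by the expression of a literature marker gene
FeaturePlot(seu, features = "gfap")
```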
In this way you can go through all of these clusters step by step and identify what kinds of cells you have in the data. You can do that with even more genes: there are many genes, like these here, that identify neurons, and different kinds of neurons; here, for instance, a small feature of the plot is picked out by this one gene. And if you talk to biologists, all of these gene names are associated with particular shapes or functions of cells.

We can also do fancier things: look at gene scores for whole groups of genes and do statistical computations. Once we have done that, we decide that for this set of genes the condition is fulfilled: each of these clusters represents a biological function, because we found a gene in the literature that corresponds to a certain cell type in the body and that is expressed in one cluster but not in the others. That is very important. Then we can give these clusters names, for example radial glia cells, oligodendrocyte precursor cells, and neurons. And then you find that these orange ones here are the neurons, and here were the stem cells, and you can start thinking: these stem cells somehow turn into these neurons, they get more and more mature, and at the end they have become the neurons over here. There are other cell types in the brain, like the microglia, that we can also find here.

Remember that the UMAP is built to keep the global topology intact, which means that cells that are close in this plot are also very similar in the high-dimensional space. So it is tempting to think of these paths as the trajectory that cells take while they go from stem cells to neurons in the brain.

Now, there are a lot of consistency checks to do. You have to check all kinds of genes, a lot of them, and discuss a great deal with the people who do nothing else in their lives but look at these cells in the brain; they know all of these genes and all of the papers on them. You can do even fancier things on top. Comparing the different clusterings, we then arrive at an identification, and this one, with these eight clusters, is the one we can live with; in the higher-resolution clusterings several clusters represent the same cell type, and we can in principle come back to those later. Typically biologists want you to get as many clusters as you can.

So now we have these classes, and we can have some measure of how good these clusters actually are.
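A hedged sketch of the gene-set scoring and cluster naming just described, again in scanpy; the gene lists and the cluster-to-cell-type mapping are purely illustrative, not the actual assignments from the lecture.

```python
import scanpy as sc

# Score a literature-derived gene set per cell; the gene names are
# placeholders for real marker genes
neuron_markers = ["elavl3", "snap25a"]
sc.tl.score_genes(adata, neuron_markers, score_name="neuron_score")
sc.pl.umap(adata, color="neuron_score")

# Attach biological names once each cluster is matched to a cell type
cluster_names = {  # illustrative mapping for an eight-cluster solution
    "0": "radial glia", "1": "neurons", "2": "OPCs", "3": "microglia",
    "4": "oligodendrocytes", "5": "progenitors",
    "6": "interneurons", "7": "immature neurons",
}
adata.obs["cell_type"] = (
    adata.obs["leiden_r0.5"].map(cluster_names).astype("category")
)
sc.pl.umap(adata, color="cell_type")
```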
There are specific plots for this, for example these so-called dot plots. What these plots show is the following: on the x-axis are gene names, and on the y-axis are the cluster names. The color represents how strongly a gene is expressed, on average, in a cluster, and the size of the dots tells you the fraction of cells in that cluster in which this gene is switched on. What we want to see is that the genes we picked are on in one cluster but not in the others. A good example is this OC cluster, the oligodendrocytes: the genes we identified for this cluster, which are called marker genes (there are also computational tools to detect them), are found only in this cluster and not in the others. The same holds for this MG cluster, the microglia: we find these genes only in this one cluster and in no other. If we go to a messier one, the OPCs, or to these neurons here, then it is less clearly defined.

If we now go back and look at these plots: the microglia were very clean in the dot plot I just showed you, and that is also reflected in the UMAP. They are a different cell type, presumably not produced by the same stem cells as the neurons. The same holds for the OCs, the oligodendrocytes, I think; they are separated from the rest, with no overlap, so these are distinct cell types. For others we had an overlap, for example between these mature neurons and the cluster that is called six here: there we found an overlap in the markers, they express the same genes, and that probably means this cluster six is an artifact that we cannot take too seriously. On the right-hand side you can see that in the next step I went a little broader and merged this cluster six into the interneurons.

Of course, what I have shown you is essentially the finalized version. Normally you go back and forth between these steps and the earlier ones, again and again, back to the quality control, until your plots carry no trace of experimental or technical parameters. You also go back and forth between the clusterings, your experimental colleagues who do the measurements, and the literature, until you find something that really makes biological sense. These techniques do not automatically hand you something; there is no mathematical criterion that would tell you what makes sense, and you always have to decide that yourself. It is not that you push a button and everything works automatically, and that is why people who can do this kind of analysis are very much sought after on the job market.
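In scanpy, the automatic marker detection and the dot plot described above might look roughly like this; the marker dictionary is a placeholder, and the statistical test is one common default rather than the tool used in the lecture.

```python
import scanpy as sc

# Automatic marker detection: rank the genes that distinguish each
# cluster from all the others (one of the computational tools mentioned)
sc.tl.rank_genes_groups(adata, groupby="cell_type", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=10)

# Dot plot: genes on the x-axis, clusters on the y-axis; dot color is
# the mean expression, dot size the fraction of cells expressing the gene
markers = {  # placeholder marker genes per cell type
    "microglia": ["mpeg1.1", "apoeb"],
    "oligodendrocytes": ["mbpa"],
    "neurons": ["elavl3"],
}
sc.pl.dotplot(adata, var_names=markers, groupby="cell_type")
```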
Okay, let me just see if there is anything else interesting. I think you could go on and on forever: you can give the experimentalists lists of which genes are on and off in which cells, and once they have these lists they can do more experiments; they can create, for example, new animals that lack these genes. What I want to show you is further down. You can see that the analysis is very lengthy, and many aspects of it are relevant only for the biology: scrolling through, there is a lot of material that the biologists are interested in, all these genes, more calculations, more heat maps to check which genes are on where, and so on. These are all consistency checks.

One thing I do want to show you is something called trajectory inference. These are cells from a brain, and what cells do is divide and produce other cells, which then become more specialized over time: a cell starts as a stem cell and matures until, at some point, it is a neuron. What was done here, and what you do in many cases, is to say: I have snapshot data; these fish were killed for the measurement, so I do not have a time course, just a single measurement. But I do have different cells sitting at different stages of this dynamic process of cell maturation. So the question people ask is: can we get the temporal information back? That is trajectory inference: working out how these different clusters and cell types relate to each other. From it you can calculate rates, the flux that leads from one cell type to another, and in principle you can then build stochastic models on top of this and compare them to other experiments or to theoretical work.

That is what I wanted to show you at the end; let me just scroll through in case there is anything else worth showing. There is a lot more, for example comparisons to humans and other animals to see where the similarities are. The fish we are looking at here are very interesting because they can regenerate their brain, they can build new neurons, which is something we cannot do; so there is a lot we want to learn about the similarities and differences, and about why we as humans are not able to do that. And with that we are at the end of this very lengthy analysis. This is a typical data science project from start to finish, and you can see it is a mixture of R and Python.
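Coming back to the trajectory-inference step for a moment: one way to sketch it in code is scanpy's PAGA graph abstraction combined with diffusion pseudotime. This is one common approach, not necessarily the method used here, and the choice of root cell is an assumption.

```python
import numpy as np
import scanpy as sc

# Coarse-grained cluster graph (PAGA): edge weights suggest which
# cell types are connected, i.e. possible transitions between them
sc.tl.paga(adata, groups="cell_type")
sc.pl.paga(adata)

# Diffusion pseudotime: order cells along the maturation axis,
# rooted at an (assumed) stem-like cell
adata.uns["iroot"] = int(
    np.flatnonzero(adata.obs["cell_type"] == "radial glia")[0]
)
sc.tl.diffmap(adata)
sc.tl.dpt(adata)
sc.pl.umap(adata, color="dpt_pseudotime")
```

The inferred ordering is the starting point for estimating the transition rates and fluxes mentioned above, which can then feed into stochastic models.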
The important thing is that you cannot, as we saw in the last lectures, just go and take the data and throw a UMAP or some machine learning at it. A large part of this pipeline is actually cleaning up the data and thinking about which part of the data makes sense and which part does not. For example, think about the New York City flights data set: does it make sense to have a negative departure delay? That was a very good question, and it actually does make sense. That data set is used a lot for teaching, so it is already cleaned up a lot, but typically you should expect many nonsensical measurements in your data. The real-data equivalent would be a departure delay of ten billion years because somebody made a typo somewhere; then that sits in your data set, and you have to filter it out. This happens all the time, and if you do not look after it, then all of the nice plots I showed you will not work. They only work because we cleaned the data, normalized it, and transformed it statistically to make it nicely behaved, without huge outliers; only then do methods like UMAP work on the data. You always have to do these preprocessing steps first.
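The cleaning, normalization, and transformation steps just mentioned, which in the pipeline come before everything sketched so far, could look roughly like this for single-cell data; all thresholds are illustrative assumptions.

```python
import scanpy as sc

# Quality control first: drop nonsensical measurements, the analogue
# of filtering out a ten-billion-year departure delay
sc.pp.filter_cells(adata, min_genes=200)   # remove near-empty cells
sc.pp.filter_genes(adata, min_cells=3)     # remove genes seen almost nowhere
adata = adata[adata.obs["n_genes"] < 6000].copy()  # crude outlier cut

# Normalize and transform so the data is "nicely behaved"
sc.pp.normalize_total(adata, target_sum=1e4)  # per-cell depth normalization
sc.pp.log1p(adata)                            # log transform tames outliers
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata)  # only after these steps do UMAP and friends work well
```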
The next point is that you have these fancy methods, but taken alone they do not make sense of anything. You can see here how the trajectories are ordered: cells move along this line in time and turn into neurons here, and of course we have many different neurons in the brain. But to understand what is happening, to make sense of this data, you have to come up with hypotheses, connect these hypotheses with what is already out there in the literature, and then, step by step, you can construct an understanding of what the actual degrees of freedom in this data set are. It is an iterative process that you improve over time. So this is an example of a purely data science project; there is very little physics in it. Next time I will show you how all of this connects to something we actually did in the first part of this lecture, namely field theory, phase transitions, criticality, and so on.

Okay, great. I will stay online in case there are any questions; otherwise, see you next week. Bye.

Thank you, it was very interesting. So when are you...