intro and slide 1

Okay, I think there are no more people in the waiting room, so let's start. We have a slightly different setting this time, because today we start a new topic and I need my computer for that.

In the lectures before Christmas I gave you an introduction to the theoretical methods that we need in order to understand how order emerges in non-equilibrium systems, and we also discussed how order manifests in non-equilibrium systems. Now, in the new year, we will look at the other side: we will introduce methods that actually allow us to identify order in experimental data. And of course I am not talking about a few data points here; we are talking about large, high-dimensional data sets, as they have become common in many fields of science, in the social sciences but also in biology.

These slides are actually from a course that I gave last year together with Fabian Ross, the course "Data Science for Physicists". Today we will discuss the methods that allow us to identify order in high-dimensional data sets. So what do I mean by high-dimensional and large data sets?

slide 2

This is one of the data sets that I showed you at the very beginning of the lecture, and which partially motivated it. It was produced by collaborators in Cambridge. In this data set, on the x-axis you basically have positions along the DNA. The DNA has roughly three billion base pairs in mouse and in human, and in these experiments, for each of these three billion sites on the DNA, we can take a measurement and determine whether there is a chemical modification on the DNA or not. On the y-axis we have different cells; we can do these measurements in individual cells, and here we have roughly 600 cells from a mouse embryo. So for each of these cells we can take roughly three billion measurements along the sequence of the DNA.

That means we have a data set that is typically a few terabytes in size and that has three billion dimensions. Of course, not all of these dimensions are really meaningful, but to start with we have something that is huge in terms of size, and we now need some computational tools to process such data sets.

So what we want to do is to start with these measurements. That is a biological figure, and it is not so important to understand the details. On the x-axis we have these three billion base pairs, and for every base pair of the DNA we can take a measurement. Each row on the y-axis is a different cell, and for each of these cells, taken from a mouse embryo, we can take these hugely dimensional measurements, in this case from a living embryo. These are breakthroughs that happened in biology, but with similar detail we can nowadays also measure social systems: if you think about the social sciences, social networks and so on, we have huge amounts of data.

slide 3

That is data we would like to understand. In this particular example from biology, we would like to understand how we can transform the top part of this big picture, where we take measurements that have lots of different dimensions, in this case for many different cells (you can do that with millions of cells if you want, if you have the money), into a physical theory of what is generating these measurements. That is what we want to do.

slide 4

To do this, we need to identify order in these structures in order to build a hypothesis. And this is how such data sets arrive on our desk: you see tables that contain different bits of information. For example, on the top right you have a table that tells us where on the DNA we have chemical modifications; in this case it has 50 million rows. We have another table on the top left that tells us something about the same positions on the DNA, namely the topological structure of the DNA: is it compact, like a spaghetti ball, or is it more open? It tells us something about how this DNA lives in real space. Then we have other kinds of information, and we can ask, for this particular bit of the DNA, what else do we know about it: is there a gene in this region, is there some other interesting stuff going on in this region? We can ask what these genes are doing; that is the topic of yet other experiments. And then we have, for example, information about the cells: where are they located in the embryo, what were the culture conditions of these cells, and so on.

So now we have these high-dimensional data sets from different sources, and we need to combine them to create a big picture that allows us, as physicists, to generate a hypothesis.

slide 5

This is an overview of today's lecture, and I have to make a shocking confession: we will actually be using R in this lecture. That is something we typically do not use in physics, but in this case it makes sense; it is actually well suited, even though the syntax of R is a little bit different from what we are used to in physics. So I will give you a quick introduction to the R language. I assume that most of you have already learned some programming in some other language, like MATLAB or Python or C, so I will just give you a quick introduction to the syntax of R.

Then I will show you what tools we have available to deal with these large data sets that are coming up in science, and how we select the tools that we actually want to use. A very important step is then to bring the data, which has many dimensions, into a shape that we can deal with in a computationally efficient way. That is called tidying the data: bringing it into a specific format so that we can vectorize our computational or numerical operations on these data sets. Once we have done that, I will show you how we can then very easily perform computations on these data sets, and finally I will end by showing you how we can combine these steps to produce data processing pipelines. Of course, we want to do all of this in an elegant way that is not, so to say, heavy in terms of syntax: we want to focus on the structure of the data, not on writing code as in C, where you have hundreds of lines of code for a simple operation. That is the structure of the lecture, and if we have time at the end, I will show you something more on data visualization.

slide 6

Okay, so this is R. Most of you won't have used R before; we don't use it in physics very much. R is not better than Python or anything; it is very similar to Python in that it is an interpreted language. It is extremely flexible: nothing is fixed in R, and you can, for example, have a function that rewrites itself while you execute it. This is called extreme dynamism. R is very popular in statistics; you have probably heard of it in this context. And it is very easy to include C code in R, so if you worry about speed you can always write things in C, and that is also what we are usually doing. The main advantage of R is that there is a huge repository of packages available, particularly in data science, genomics, biology and so on. And one thing of the most practical relevance is that it has extremely convenient, high-quality plotting functions.

So these are the benefits of R. There are also some downsides. The syntax is very difficult for us physicists to swallow; I will show you later why. And typically, because nothing is fixed in R and it is very dynamic, it is slower than Python in general terms. So you would not use R to write a stochastic simulation or something like that. But for the tasks that we are doing today it is actually a very good choice. R itself is typically slow, but the core functions are written in C, so once you rely on core functions, on vectorization and things like that, it is very fast; you just have to know how to use it.

Taken together, there is no particular reason for not using Python for this. The only reason is that, for the tables and the data I showed you on the previous slides, R has been, so to say, the standard language in these genomics experiments, and there are a lot of packages and libraries developed for these genomics data, collected in huge repositories in R. Meanwhile, Python is catching up a little bit in this context. But that is why we are using R in our group: it is the context of genomics, these particular experiments that I showed you on the previous slide.

slide 7

Typically, when you use R, you use it in a development environment, an interactive environment that you can download from rstudio.com. There you have something that looks very familiar if you have used MATLAB or Python: you have a notebook, a little bit like a Jupyter notebook if you use Python, where you can write code and then directly see the output of your code, and from the same file you can export HTML documents if you want. You have a console to type in code, you have your workspace, like in MATLAB, where you see your variables, and you have a window made for looking at help pages, plots and other things. I will show you later, when things get a little more interactive, how this works in practice, but this is just how working in R looks; it looks exactly like working in any other language, actually.

slide 8, 9, 10, 11

So, this is some basic R, just to show you that R looks very familiar to other languages in many respects. Before I do that, let me just show you what these boxes on the slides mean, using the RStudio window. Here we have the code in a notebook; I can run a chunk of code by clicking on these little arrows, for example to load some data, and I have a console down here that allows me to type commands if I don't want to use the notebook. For instance, I can assign a variable: I type a <- 1, with this somewhat weird assignment operator, which sets a to one. If I then just type a and press enter, I get as output the value that is stored in a. I can also output more complicated values, like a table that I have already loaded, and look at it in the console. This console output is what you see on the slides, in a mock console that I made in PowerPoint.

Calling functions is basically the same as in many other languages like Python: I call the function, and the arguments are given in brackets. In this example I draw normally distributed random variables with rnorm, and I tell R to give me five of them. Then there are parameters that I can identify by name: the parameter mean is set to one, and the parameter sd, the standard deviation, is also set to one. Being able to pass parameters by name is very convenient if a function has a long list of parameters and you do not want to give them all.

I can also write my own functions, and that looks a little bit like MATLAB or C. Here I define a function called my_sum, and this function has three parameters, a, b and c, where c has a default value of 1. The function returns a value that is just the sum of the three arguments, a + b + c. If I call this function with a set to 1 and b set to 1, then c is 1 by default, so I do not have to state it; I only have to state it if I want c to be a different value. The function then returns the value 3, just like in any other programming language.

We also have loops in R, of course. There is a for loop, which can for example run from one to five and print something each time, and a while loop, which prints something while some condition is fulfilled. But typically, in R, you do not want to have loops: if you have a loop in R, it means that you are doing something wrong. These loops are slow, because R is slow, and a loop means that you did not vectorize your operation, that you are doing something bit by bit that you could also do in one step. I guess I have written many thousands of lines of R code in my life, and in those many thousands of lines and several years I have had exactly one loop, which I used for a stochastic simulation. The mistake was that you would never write a stochastic simulation in R, but I did it because I was very lazy. So if you use loops, something is usually wrong, because you can typically replace them with much more efficient operations. These are the standard constructs that you have in any programming language. We also have if clauses: if the value of i is smaller than 5, print "hello", and otherwise print "not hello". That is just like in any other language, and you use curly brackets, as in C, to define the scope of a certain block of code. All of this should be very familiar.

Where things get a little bit different are the data types of R. R has standard data types. For example, we can define vectors; in principle everything in R is a vector, that is the most simple data type. The vector a we define with this c here, which stands for combine; that is a little bit strange if you come from MATLAB. This vector has the elements one, two, three, and we can also put other things in this vector, like NA, which stands for a measurement that is not available, that did not work, for example. It is very convenient to have a representation on the computer for measurements that did not work.
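As a rough sketch of the basics just described (rnorm and my_sum are from the slides; the exact chunk layout and the final vectorized example are my reconstruction):

```r
# Assignment and function calls
a <- 1                      # assign 1 to a
a                           # typing a name prints its value: [1] 1
rnorm(5, mean = 1, sd = 1)  # five normally distributed random numbers

# Defining a function; c has a default value of 1
my_sum <- function(a, b, c = 1) {
  return(a + b + c)
}
my_sum(a = 1, b = 1)        # returns 3, since c defaults to 1

# Loops and conditionals exist, but are rarely the right tool in R
for (i in 1:5) {
  if (i < 5) print("hello") else print("not hello")
}

# The vectorized way: operate on whole vectors at once
x <- c(1, 2, 3, NA)         # c() combines elements; NA marks a missing value
x + 1                       # adds 1 to every element in one step
```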
We can also have a vector of other types: here, for example, a string, or character, vector of "m"s and "f"s, which we define in the very same way. We can access elements of such vectors with square brackets, like in C, but unlike in C we start counting at one, not at zero.

The next data type is a list, and now it gets more interesting. A list can contain several elements of any type; for example, the elements of a list can be vectors. I can make a list whose first element is our vector a and whose second element is our vector b. They have different types, but you can nevertheless put them together in a list. I can then access these list elements by name: I gave them the names number and gender, and if I want to access the first element, number, by name, I just use the dollar sign. I can also access this first element by its number; then I use the double square brackets. This is not so important, just to show you that these data types exist.

What is more important for statistics is that we also have categorical variables, and that is something you probably do not know from MATLAB or C; I don't know about Python. Suppose we have a vector that stores the gender of somebody, and suppose we do such a measurement, say, 100 million times. We have males and females, and in principle we could store the words "male" and "female" 100 million times in memory, but that would be very inefficient. What we can do instead is to say: I have a variable that can take only two values, so I need at most one byte to store which of these two values a given element has, and in addition to that I save labels for these two values. That is what a factor does: it is a categorical variable that can take only a limited number of values. For example, this vector b here can take only two values, and I tell R that by making it a factor. When I then look at this factor f, it gives me the output m, f, f, m, f, and it tells me these levels, the possible values these elements can take. And if I want to give these levels, the labels of these two values, different names, I can call the two of them "female" and "male" by changing the levels, and then the output is male, female, female, male, and so on.

What happens internally in R is that I still do not use any more memory, although the strings are longer. What I changed is a lookup table, in which R looks up what the two values my vector can take are called, for any plotting or printing purposes. That is a factor, and it is very efficient. Think about biology, about those measurements I showed you in the beginning: you have hundreds of millions or billions of measurements, say 200 million measurements for chromosome 1. You could either have a vector in memory that holds the string "chromosome 1" 200 million times, or you just save a number, an identifier for chromosome 1, and give it a label with the real name in a separate table. Then you do not have to store the string "chromosome 1" 200 million times, but only once, in the table where you look up the actual name of this factor level. So this is a very efficient way of storing variables that appear very often but can take only a limited number of values, and that is called a factor. This is the one data type that you are probably not familiar with.

Then we can move on to data types that can really store the data we are looking at, for example experimental data, and the simplest one in R is called a data frame. Python also has data frames, as far as I know. These data frames are nothing but lists of vectors. Remember the list: it can hold any kind of element you want, and if each element of the list has the same length, then we can interpret this list as a table. That is what we do in a data frame. Here we have three vectors x, y and z of different types: this is a numerical vector, this is a character vector, and this is a boolean, or logical, vector. We now create the data frame and say that the name of the first element, the first column, is x, the second column should be y, and the third one should be z. In the output of such a data frame you can see how it is then represented: the first vector is now the first column of our table, the second column is the character vector, and the third column is the boolean vector. So a data frame is essentially the R version of a table, and internally it is basically a list of vectors that have the same length. This is sufficiently general to allow us to store any kind of experimental data we might measure, in the same way that a table is a general way of storing data.
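A compact sketch of these data types (the variable names follow the slides where given; the rest is my reconstruction):

```r
# Character vector; indexing starts at 1
b <- c("m", "f", "f", "m", "f")
b[1]                              # "m"

# A list can mix types; access elements by name or by index
l <- list(number = c(1, 2, 3, NA), gender = b)
l$number                          # by name
l[[1]]                            # by position

# A factor stores each value once, plus a small integer code per element
f <- factor(b)
levels(f)                         # "f" "m"
levels(f) <- c("female", "male")  # relabel via the lookup table
f                                 # male female female male female

# A data frame is a list of equal-length vectors, i.e. a table
d <- data.frame(x = c(1, 2, 3),
                y = c("a", "b", "c"),
                z = c(TRUE, FALSE, TRUE))
d
```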

slide 12

Now, the problem is that we have this data frame, but as I told you in the beginning, the data we actually have is terabytes in size. So we need a way of performing operations on these tables in a very efficient way; we need the right methods. How important it is to pick the right method you can see here on the left-hand side. These bars are time measurements of how long it takes to perform certain operations on data of a certain size. Here we have 500 megabytes of data, so a pretty small data set, and the length of each bar denotes the duration of such an operation, for different packages that allow you to work with this data. For example, one very popular method in R is called dplyr; it is an extremely popular and very easy to learn way of managing these tables. Then pandas is basically the Python version of such a data frame. And then we have data.table at the top, the blue one: a very fast C implementation of these operations on data frames, very fast and memory efficient. There are also some other tools in this list. At this size it all seems very reasonable: we have 12 seconds, 20 seconds, 90 seconds; none of that stops us from getting results.

Now we increase the size of our data set to 5 gigabytes, or to 50 gigabytes, which is still very small compared to what I have been talking about. If you look at these 50 gigabytes, suddenly there is a huge difference. A lot of these packages, a lot of these methods, do not produce any results at all; they run out of memory, for example. And some of them, like the very popular dplyr, just take a huge amount of time, while others, especially this data.table method, perform very well. Here data.table takes, what is that, 30 minutes? No, those are seconds: 13 seconds. At the top it was 0.2 seconds, and now we are at 13.56 seconds, which sounds pretty reasonable; 13 seconds is something you can wait out in front of the computer and still have an interactive way of working. If you go further down, a lot of these methods just fail, but some of them, like this dplyr here, take so long that any reasonable way of working with the data is just not possible. That is why choosing the right method here is important, and what is also important is to look at how these methods scale with the size, the complexity and the dimensions of your data.

So what this tells us is that there is a huge difference between these methods. And actually, when I was a postdoc, like every physicist I was used to MATLAB, and very quickly, almost immediately, MATLAB failed on such operations. Then, when these genomics methods came up, I used the red one here, dplyr. It is extremely popular and very easy to get into, very flexible, very well documented and so on, and I used that. But then experimental progress moved on exponentially: when I started, I was looking at data a few hundred megabytes in size, and at the beginning of this talk I showed you data that is about a terabyte in size. Very quickly I ran into this problem here, that I was not able to get any results at all in meaningful time, and at some point I had to rewrite all my code. What then turned out to be a reasonable choice for really large data is this data.table package. That is the R version, and the Python version of it, as you see here, is not much slower than the R one. This data.table is written in C and is very fast, and if you stay strict to it, you can expect a performance that is not much worse than actual C, but with orders of magnitude less programming work.

So today I will use this data.table package, and I will also introduce some concepts that are applicable to any other of these methods, or to any other way of dealing with data.

slide 13

Okay, so let's use this data.table package. It is very fast and works for extremely large data sets. The downside is that the syntax is not very friendly for beginners: it is a little bit cryptic, very condensed and very compact. But it turned out, at least for me, to be the only way I could deal with data in an experimental field where the sizes of the data sets are exponentially increasing; it was the only way that let me work in an efficient way. We start with data.table by loading the package, just by calling library(data.table), and if we have a data frame, we can convert it to a data table with the command as.data.table. That is very simple; nothing else is necessary.
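In code, this amounts to just two calls (a minimal sketch; d here is the small data frame from before):

```r
library(data.table)   # load the package

d <- data.frame(x = c(1, 2, 3),
                y = c("a", "b", "c"),
                z = c(TRUE, FALSE, TRUE))

# Convert an existing data frame into a data.table
dt <- as.data.table(d)
class(dt)             # "data.table" "data.frame": it still behaves like a data frame
```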

slide 14

In this lecture I would like to look at some real data. I could now load, say, one terabyte of experimental data, but that would not be very efficient for this lecture, because it would be very slow and also very difficult to follow. So I will look at a simple data set here, and we will do some operations on this very simple data set. It is a data set that any of you can download, and I will also upload or share a link where you can download the code that I am using in this lecture and the data itself.

Okay, so this is the data set. It is called New York City flights, and it contains all flights that departed from New York City airports in 2013. The data set consists of several files, of several tables. One is the actual information about the flights: for each flight we have the year, which is 2013, but we also have the month, the day and the hour; we have the flight number, an identifier of the flight; we know the origin of the flight; we have a tail number, which means that we can identify the airplane that was used; we have the carrier, so is it United Airlines, American Airlines, Lufthansa; and then we have some information about the delays of these flights, the delay in the departure and the delay in the arrival, how long the airplane was in the air, and many other bits of information.

This is the information for all flights. Now we want to interpret this information, and that means we need to connect it to other data. If we want to understand, for example, where the delays in this data set come from, we want to connect it to additional information. One thing you might look at is the weather: if you ask why a flight is delayed, we can look at the weather. The weather is a different source of data, and for the weather we also have the time, the month and day and hour; we can look at the weather at certain airports, at certain locations; and we have temperature, humidity, wind speed and other information about the weather at a certain location at a certain time. Then we have information about the planes: we have this tail number, which identifies the planes, and we can download some information about these airplanes; we can look up, for example, which manufacturer the airplane was made by, the model, the year the airplane was built, the number of engines, the number of seats, the type of airplane and so on. We can get some information about the airports: an airport identifier, the name of the airport, the longitude, latitude and altitude of the airport and so on. And finally we can also get some information about the airline that corresponds to a certain flight.

slide 15

Now, one thing you need to notice is that these tables share information. For example, if we want to learn what the weather was for a given flight, we can look at the year, the month, the day and the hour, and at the origin airport of this flight, and compare those to the same columns in the other table, on the left-hand side, that contains the weather information. If we want to learn about the airplane that was used, we can take this column, the tail number, and compare it to the corresponding tail number in the planes data set, and look up which manufacturer this plane was made by, which year it was produced and so on. So these bits of information are connected by different variables, and we can use these overlaps between the data sets to bring them all together later.

But first let me show you how to load data, because loading data is actually also not a trivial task. Loading data can take hours or minutes, depending on which function you use for it. Within R, the fastest functions are the ones from the data.table package. They are called fread, and there is also an fwrite that allows you to write data; they are basically parallelized versions of reading huge amounts of text data, of ASCII files. It is very simple: you just use the command fread, which stands for fast read, to load a certain file, for example a text file, and fread will do all of the rest that needs to be done; it will identify the columns, it will identify the data types and so on. Typically you do not need to do anything.

For example, we can read in this flights data set, which contains information about the flights that departed from New York City. This is how the data set looks if we just print it once we have loaded it: we have these different columns, the year, 2013, then we have the different months, the different days of the month; we have the real departure times, the scheduled departure times and the delays; we have the arrival times and so on; we have the carrier, the flight numbers, the tail number of the airplane, origin, destination and so on; and we have a timestamp, a time signature, of when the flight actually took place.

Now let's go to RStudio and see how this looks in practice. Here we are loading all of these files, the same ones I had on the slides, so for each of these different bits of information we load a file into memory. I had already done that before, so here they are in the workspace. We can now either just type flights and get this huge table as output, where we see that we have information about more than 300,000 flights, or I can click on this file in the workspace and get a nice table view of the data. This is the flight information. It is also high-dimensional, not as extremely high-dimensional as the example I gave you, but it still has enough information that we cannot easily detect patterns in it. So this is the data: we have 300,000 flights, and now we can try to detect some structures in this data set.

Let's get back to the slides. We have loaded the data, so we now have all of these different data tables, or data frames, in memory.
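A minimal sketch of this loading step (the file names here are my assumption; the actual files come with the New York City flights data set):

```r
library(data.table)

# fread ("fast read") parses large text/CSV files in parallel and
# detects column names and column types automatically
flights  <- fread("flights.csv")
weather  <- fread("weather.csv")
planes   <- fread("planes.csv")
airports <- fread("airports.csv")
airlines <- fread("airlines.csv")

flights   # printing shows the first and last rows of the >300,000 flights
```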

slide 16

And now we are trying to make sense of these data sets. The first thing that we always need to do is to bring the data into a format that we can deal with, and this format is typically called tidy data, or long data, or long format. There are two simple rules that you can use to decide whether a data table is tidy. First, every column is a different observable, for example a different type of measurement. Second, every row is a different observation of these observables, or variables. Once you stick to these rules, your data is tidy.

And when your data is tidy, so every column is a different observable and every row is an observation, then we can use vectorized operations in R or Python to perform operations on entire columns of these data tables. That, first of all, is extremely fast, because the operations are performed vector-wise, one column at a time, and it also drastically reduces the complexity of the code.

The data that I showed you is one of the standard data sets in data science that you use for teaching, and this data is already tidy. Let's have a look at these flights: every column is a different observable, for example a departure time, an arrival time, a flight number, a time point; and every row is a different observation, namely a different flight. So this data table is already in a tidy format, in a format that we can deal with, and we have nothing to do here.

slide 17

A standard example of a messy, non-tidy data set, which we also often find in physics, but here in genomics, is when data is shared as a matrix. For example, this here is a typical matrix from one of these genomics measurements. Each row is a different gene, and the first column is the name of the gene; these genes have very cryptic names. So each row is a gene, and each column is a different cell. The numbers mean how many products of a given gene we measured in a given cell; it is not important here to understand the biological context.

This data set is not tidy, because the same observable, namely the number of these gene products, is repeated in different columns. And you can easily see that with this format we are not very flexible: if I now make another measurement for the same gene and the same cell, where do I put it in this matrix? I would then have to have a separate matrix, and live with separate matrices in parallel. So this data set is not tidy.

slide 18

It is not tidy, and I cannot easily perform parallelized operations on this data table, so now we want to make it tidy. There are special functions that allow us to do that, and they have similar names in basically all contexts. The function that makes such a data table tidy is called melt; you also say that you melt a data table. It brings us from this matrix format to a format where every column represents an observable, like a measurement, and every row corresponds to an observation. On the right-hand side you see what the tidy version of the data from the last slide would look like: the first column is the name of the gene, the second column is the name of the cell, and the third column is the value that I measured for the gene products. That is the tidy format: I now have one long column of numbers, and I do not have 200 columns as on the previous slide.

So how do we do that in this data.table package? The command is called melt. We give it the name of our data table, and then we identify the id variables: these are the variables that need to remain, so to say, intact once we reshape this matrix; in our case this is the id column with the gene names. The value name is how we want to call the values that are in these matrix elements; we just call this "expression" here. And the variable name, here the cell, tells us into which column the names of the old columns are distributed. If you look at the colors on the slide, you can see how this command takes what was distributed over several columns and reshapes the data table into something with a longer format; it is actually also called long format. We then have the cell name in this column and the corresponding measurements for the gene products in that column. You can look at this in detail afterwards; I will upload the slides of course. It is a little bit abstract, but this is, so to say, how we reshape data tables to make them tidy.
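As a sketch, with a small made-up expression matrix in place of the real one (the argument names id.vars, variable.name and value.name follow the slide; the gene names and numbers are invented for illustration):

```r
library(data.table)

# Messy, matrix-like table: one row per gene, one column per cell
dt <- data.table(id = c("GATA4", "SOX2"),
                 cell_1 = c(5, 0),
                 cell_2 = c(2, 7))

# melt: keep 'id' intact, gather the cell columns into key/value pairs
tidy <- melt(dt,
             id.vars       = "id",
             variable.name = "cell",        # old column names go here
             value.name    = "expression")  # matrix entries go here
tidy
#>       id   cell expression
#> 1: GATA4 cell_1          5
#> 2:  SOX2 cell_1          0
#> 3: GATA4 cell_2          2
#> 4:  SOX2 cell_2          7
```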

slide 19

We can also make the data messy again; I will go over this quickly, because it is not something we use much. But sometimes you just want your good old matrix back, because some function in some other package wants a matrix. The function for that is called dcast, and it just reverses the melt operation that I showed you on the last slide. This function takes the name of the data table variable as the first argument, and then a formula for how the rows and columns should be separated in the new matrix that we want to have. It is not so important, so I will not go into it; we rarely use it.

So now we have tidy data: every column is an observable or a variable, and every row is an observation.
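For completeness, a sketch of this dcast step (continuing the hypothetical tidy table from above):

```r
# dcast: rows ~ columns; one 'id' per row, one column per cell,
# filled with the 'expression' values, back to the matrix layout
wide <- dcast(tidy, id ~ cell, value.var = "expression")
wide
#>       id cell_1 cell_2
#> 1: GATA4      5      2
#> 2:  SOX2      0      7
```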

slide 20

Once we are in this format, we can use a very simple syntax, and this syntax captures ideas that also exist in other, maybe more popular, languages like SQL; it captures operations on these data tables that are quite generic. The general syntax that we use in this data.table package is: we take the data table d, and in square brackets we give three arguments. The way you read these three arguments is as follows. You take the data table d; the first argument operates on the rows, for example a condition that tells you which rows you want to take; the second argument operates on the columns, and here you can, for example, perform a calculation on the columns; and the third argument is a grouping variable. The first two are like the i and j in a typical matrix statement, and with the third one we can group rows together by certain conditions and perform these calculations group by group. So the way to read it is: take the data table d, subset the rows using i, which can be some expression, some logical statement; calculate what is in j; and do this calculation grouped by whatever is in the third argument. I will now show you step by step how this looks in detail. These three arguments are a very general way of, so to say, parametrizing operations on these large tables: with just three arguments we can do a lot of stuff already, and if you think about it, it is abstract, but very simple.
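A minimal runnable sketch of this d[i, j, by] pattern (the table here is invented for illustration):

```r
library(data.table)

# Tiny example table to illustrate d[i, j, by]
d <- data.table(group = c("a", "a", "b", "b"),
                x     = c(1, 2, 3, 4))

# i : keep the rows with x > 1
# j : compute the mean of x, name it avg
# by: do this once per value of 'group'
d[x > 1, .(avg = mean(x)), by = group]
#>    group avg
#> 1:     a 2.0
#> 2:     b 3.5
```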

slide 21

So first, let's have a look at the very first argument. Let's say we have a table and we just want to get a subset of rows from the table. For example, we want to look at all planes that have four engines. The way we do that is to write this logical statement as the first argument: engines == 4. If we do that, we get back a data table that only contains the planes that have four engines: we filtered the rows of the data table for those where the value of the engines column equals four.

We can also select a subset of columns, here on the right-hand side, and this we do with this notation where the dot is actually a shortcut for a list. So what we do is give a list of the names of the columns that we want to extract as the second argument. For example, we take the planes data set and want to get the tail number and the year the airplane was built, and then we get a smaller data table back that only has these two columns in it. This is all something that you could also do with any other package or programming language, of course.
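In code this looks roughly like the following (a sketch; planes is the table loaded earlier, and the column names engines, tailnum and year follow the planes table of this data set):

```r
# i : subset rows -- all planes with four engines
planes[engines == 4]

# j : subset columns -- .( ) is shorthand for list( )
planes[, .(tailnum, year)]

# Both at once: tail number and build year of the four-engine planes
planes[engines == 4, .(tailnum, year)]
```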

slide 22

Now we want to get a little bit more complicated: we want to do an operation on these data tables, on these large data sets, and these operations in general follow a quite general scheme. This general scheme is already built into the syntax notation that I showed you a few slides ago, with the i and the j and the grouping: the steps that we take are called the group-aggregate-combine scheme. A general operation on such a data table looks like this: we first group our observations in some meaningful way, for example we group all flights that happened in the same month. Then we take all flights from the same month, one group at a time, and aggregate them, which means that we perform some kind of summary function on these flights that departed in a certain month, for example we calculate the average temperature or the average delay. Finally we combine the results from all of these groups, calculating a single result for each group.

Let me now give you a specific example of this.

slide 23

What we can do, for example, is group by carrier. We might want to ask: are all carriers equally bad, or equally good, or is there one carrier that is worse than other carriers? The way we do that is as follows. The grouping is the last, third argument: we take the flights data table and group all flights together that have the same carrier. For each carrier we then calculate the average delay using the mean function: we take the mean of the departure delay, and this third argument of mean just tells it to remove the NAs, the entries where no proper measurement was taken, where this information is missing; we just ignore those. So for every carrier we take all its flights and calculate the average delay time.

That is what we do here: the grouping by carrier is the third argument, and the operation we perform is the second argument, because it operates on the columns. The operation is that we calculate the average of the delay and save it in a new column called mean_delay. What we get out of this is, for every carrier, one number, which is this average delay. And you can see that this basically confirms what you expected already: United Airlines is not performing very well.

We can also have more complicated procedures. We can, for example, combine this with subsetting on the rows: we do not want to take all flights, we only want to look at flights in the evening, very late flights, and then do the same thing. And we can be more fine-grained and look at different combinations of variables. In this example we take all flights that depart after 8 pm, and the remaining flights we group by month and origin airport, and we again calculate the average departure delay. Now, for every combination of month and airport, we get a value for the average delay, here at the bottom. For example, in January, Newark airport had an average delay of 14 minutes, and JFK was doing much better. So this is also the kind of thing that people expect to see here.

Okay, so this is how the group-aggregate-combine paradigm fits into very compact syntax on basically arbitrarily complex tables. What we had here was a summary statistic, where we summarized several rows, all flights that have the same carrier, or all flights in a certain month at a certain airport, into just one quantity, for example the average delay. So we performed a summary here.
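A sketch of these two queries (the column names dep_delay, dep_time, carrier, origin and month follow the flights table of this data set; using dep_time > 2000 to mean "after 8 pm" is my assumption, since dep_time is coded as HHMM):

```r
# Average departure delay per carrier (NA entries are ignored)
flights[, .(mean_delay = mean(dep_delay, na.rm = TRUE)), by = carrier]

# Only late flights (departing after 8 pm), grouped by month AND origin
flights[dep_time > 2000,
        .(mean_delay = mean(dep_delay, na.rm = TRUE)),
        by = .(month, origin)]
```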

slide 24

What we can also do is add a new column to our table that has one new value for each row of the original table. 68:09 So we don't perform a summary — this is also sometimes called windowing, for whatever reason — and we get one new value for each row that we had in our original data set. 68:23 For example, here in the top row: if I want to have the speed in kilometers per hour, I can calculate the distance over the air time in minutes — that's the speed in miles per minute — then multiply by 60 and convert to kilometers, and then I have the average speed in kilometers per hour for every flight. 68:57 I can also rescale, and I can combine these simple computations with the grouping. For example, I can rescale all distances by the average distance of a certain carrier; I can ask: is this flight much longer or much shorter than a typical flight conducted by the same carrier? 69:25 So I calculate a rescaled distance, which is just the distance divided by the average distance of this carrier, and again I get a new value for each row of my original table. 69:44 The way I write these operations is with this define symbol, the colon followed by an equals sign (:=). 69:52 The result looks like this: we still have our original 300,000 rows, but for each row we have now computed a new value that we save in a new column, speed_kmh, the speed in kilometers per hour, and we also have the rescaled distance. 70:19 For example, this flight here was about ten percent shorter than the average flight of its carrier. This, too, is saved into a new column with the name we have given it. 70:35 And all of this also happens very fast and memory-efficiently.

70:45 Now come more complicated things. What we did so far was in principle the easy part: we add more columns, we perform a summary, and this data.table package lets us write very simple syntax to do that — if you tried to do these groupings in C, you would write many, many lines of code.
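A minimal sketch of these two column operations with :=, under the same nycflights13 assumptions (distance is in miles and air_time in minutes there; the new column names are mine):

```r
# One new value per row, no summary: average speed in km/h
# miles per minute * 60 -> mph, * 1.61 -> km/h
flights[, speed_kmh := distance / air_time * 60 * 1.61]

# Combine := with grouping: rescale each flight's distance by the average
# distance of its carrier (values below 1 mean shorter than typical)
flights[, rescaled_distance := distance / mean(distance, na.rm = TRUE),
        by = carrier]
```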

slide 25, 26, 27, 28, 29

71:12 In the next step — to get some more understanding of these delays here in New York City — we want to make use of the fact that we have different bits of information, and that these different bits of information share columns, share information, that allow us to link them. 71:38 For example, we can link the weather to a certain flight by matching the year, the month, the day, the hour, and the airport; with this we can connect the flight to the weather that occurred when this flight departed. 72:05 We can also connect the airport information: this faa identifier has a different name in this table, but it is essentially the same as the origin airport, so we have the information to link these tables together. 72:21 The tail number is a unique identifier that allows us to look up the plane of our flight in this data set of all planes, and the same with the airlines: we can look up the carrier identifier in another data table and get the name of the carrier. 72:45 So these tables are interlinked, and now we need efficient ways to merge these different bits of information together. 72:58 This merging is often called a join — a general principle that appears in basically any kind of data-handling task, independent of the programming language you're using. 73:18 You can perform these joins, these mergings of data tables, in different ways, and the different ways differ in which information you keep if it is only found in one table but not in the other. 73:37 So let's begin with one kind of join, the left join. 73:47 In a left join you keep all rows that are in the left data table, here table A; we want to combine the information in A and B. 74:06 So here we have the key value a in column x1, and now we can look up the other columns, x2 and x3 — x2 is 1 and x3 is TRUE — and combine these columns so that they correspond to the key value a. 74:32 Same for b: we can look up b in both tables — the rows don't need to be in the same position in the two tables — get the values of the remaining columns, and sort them into place in the result. 74:48 Then we have c: c is in table A, the left one, but not in the right one. 74:59 A left join means that we keep c from the left table; we have no information from the right one, so x3 stays empty. And we disregard all rows that are in the right table but not in the left table. That's a left join. 75:21 In R, or in this data.table framework, this is done by the function merge, which takes two
tables. 75:31 The left join is then specified by saying all.x = TRUE: all entries in the first table, x, should be kept. 75:42 There is also a short notation for this join: you simply index one data table by another data table — a very compact syntax for performing a very complex operation. 75:56 These joins are also not a trivial computation; think of doing this with something that is terabytes in size. In principle you have to look up all rows in both data tables and find the corresponding matches, so you need very efficient algorithms, and depending on which package you choose to perform this operation, you can wait for days or you can wait for minutes. That makes a huge difference. 76:29 If there's a left join, there's also a right join, and a right join just means that we keep all rows from the right table but disregard rows from the left table if they're not in the right table. 76:45 Again this works via the merge command; we just give the argument all.y = TRUE, meaning that we keep all rows of the second (y) data set, and there is also a short notation for these right joins. 77:07 Then there's the so-called inner join, the most restrictive join. 77:16 The inner join only keeps rows where we have information in both tables; if we don't have information in both tables, then the row will not end up in the final data set. For example, c is only in the left table and d is only in the right table, so neither of them ends up in the final result. That's the inner join. 77:52 Again there's a merge command for that — we just say all = FALSE — and again a shorthand notation with an additional argument. 78:04 In all of these commands I didn't specify the column we should use for joining; I didn't specify that x1 is the column to use for matching the rows. 78:27 I didn't need to, because this column has the same name in both data sets: x1 appears both in A and B, and R then automatically uses the columns with identical names to match the tables. 78:48 If there's an inner join, there's also an outer join, or full join, and this full join retains all rows. 78:59 This is what happens if we merge these two tables that way: we get our column x1, which is shared between the tables; our column x2 from the first table, where we don't have a value for d; and the column x3 from the right table, where we don't have a value for c. 79:25 So this retains the most information.
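A minimal sketch of the four joins, using two toy tables with the column names from the slides (the values are assumed for illustration):

```r
library(data.table)

A <- data.table(x1 = c("a", "b", "c"), x2 = c(1, 2, 3))
B <- data.table(x1 = c("a", "b", "d"), x3 = c(TRUE, FALSE, TRUE))

merge(A, B, all.x = TRUE)  # left join:  keep all rows of A; c gets NA for x3
merge(A, B, all.y = TRUE)  # right join: keep all rows of B; d gets NA for x2
merge(A, B, all = FALSE)   # inner join: keep only keys found in both (a, b)
merge(A, B, all = TRUE)    # full join:  keep everything, padding with NA

# Shorthand: indexing one data.table by another performs a join
# (here: all rows of A are kept, with B's columns matched on x1)
B[A, on = "x1"]
```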

slide 30, 31

79:36 Okay, so now we can do some merging. We can, for example, ask: are these delays affected by bad weather? 79:45 The first thing we need to do is to merge the flights and the weather data sets. Because they share columns with the same names — for example time_hour, and also the origin airport, which I didn't draw here; the shared columns are the ones connected by these arrows — this works automatically. So I merge them, 80:14 and now I have a new table that contains the information about the flight, the carrier, and the delay, but at the same time also the columns from the weather data set, with information about wind speed, precipitation, and temperature. 80:37 Now we can go on and ask, for example: let's group this combined table into the rows with a wind speed larger than 30 on the one hand, and the rows with wind speed smaller than or equal to 30 on the other. 81:00 Then, for each of these two groups, we calculate the average departure delay as before. 81:11 If the wind speed is smaller than or equal to 30, the departure delay was 12 minutes; if the wind speed was larger than 30, the departure delay was 28 minutes — so much larger. 81:29 And where I couldn't evaluate this condition, for example because there is no information about the wind speed, I also get a value, for the NAs, the non-available data. 81:44 So that means: if it's very windy, there is typically a higher average delay for these flights in New York City.

81:57 We can also merge the flights table with the planes table. 82:09 — Oh, is anybody stuck at the join slide? Which slide are you seeing at the moment? "Are planes equally reliable?" — that's the right one. So Matthew is having a problem, but at least some of you are seeing the right slide. Okay. — 82:42 So we merge the flight information with the information about planes, and now I give this additional argument: I tell the merge command to use the column with the tail number to match the rows. 83:03 If I do that, I again get my flight information about the delay, the carrier and so on, but I now also get information about the manufacturer, the model number, and the year this airplane was built. 83:21 And because we already had a year column, namely the year of the flight, we now have two different columns, year.x and year.y: year.x is the year of the flight — when the flight took place — and year.y is the year the airplane was built. 83:41 Okay, and now we can just do our calculation. We can ask: are the planes of certain manufacturers more or less reliable? 83:56 We do the same thing as before: we group the rows by manufacturer, and then for each group we calculate the average departure delay and save it into a new column, mean_delay. 84:20 In this case we also calculate the standard error, which is just the standard deviation divided by the square root of the sample size — we can have two computations, or as many as we want, in the same call. 84:32 And here we see: we have Airbus Industrie — I forgot what the actual outcome of this was — Boeing is a little bit later than Airbus, and then Airbus appears twice; the problem is that Airbus apparently changed its name, and one of the two entries is older than the other, which is why they show different delays. 85:01 And these here, like Embraer and Bombardier, are smaller airplanes, and some of these smaller planes seem to have higher delays.
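A sketch of these two merges and groupings, under the same nycflights13 assumptions (the wind-speed threshold of 30 follows the slide; the grouping expression and output column names are mine):

```r
weather <- as.data.table(nycflights13::weather)
planes  <- as.data.table(nycflights13::planes)

# Merge on the shared columns (origin, year, month, day, hour, time_hour),
# then compare delays for windy vs. calm hours; NA forms its own group
fw <- merge(flights, weather)
fw[, .(mean_delay = mean(dep_delay, na.rm = TRUE)),
   by = .(windy = wind_speed > 30)]

# Merge on the tail number; the two 'year' columns are disambiguated
# automatically as year.x (flight) and year.y (plane)
fp <- merge(flights, planes, by = "tailnum")
fp[, .(mean_delay = mean(dep_delay, na.rm = TRUE),
       se_delay   = sd(dep_delay, na.rm = TRUE) /
                    sqrt(sum(!is.na(dep_delay)))),
   by = manufacturer]
```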

slide 32

85:25 And then there is one more trick. Typically, in the processing of such data sets, you do many, many operations one after the other. 85:40 What we usually do in programming is: we do one operation and save the result in a new variable, then we do the next operation and save the result in a new variable, then another operation, and so on. 85:55 This is very inefficient if you have long pipelines of maybe hundreds of different steps in your analysis in which you transform the data. 86:06 In R, and also in other languages, there are ways to chain operations one after the other, to build these pipelines in a very efficient and, in terms of syntax, condensed way. 86:28 One way is implemented in the data.table package itself: you just put the square brackets directly after each other. 86:40 For example, here we create a new column, the wind speed in kilometers per hour — we just multiply the wind speed in miles per hour by 1.61 — and then, directly in the next step, we group by month and calculate the average wind speed in kilometers per hour. These are two steps, and we don't have to save the result in an intermediate variable; we chain them directly after each other. 87:16 There is a more common way to build these pipelines, and that is in another package called magrittr. 87:26 We load this package at the beginning of our code, and it provides us with one thing: an operator, this one here (%>%). All this operator does is take whatever is on its left and pass it as the first argument to the function on its right. 87:55 So it's very, very simple: take what is on the left, pass it to the function on the right as the first argument. 88:05 And with this we can build longer chains if we want. For example: we take the weather table, calculate the wind speed in kilometers per hour, and then use this chaining operator, this pipe operator, to pass the updated data table to the next step of our analysis, where we calculate the average wind speed per month. We take the result and again pass it on to basically any function we want — here, for example, to the head function, which just shows the first few rows of our data table. 88:59 So with these operators you can build very, very long pipelines, and your code remains, in a very basic way, a list of steps that you perform one after the other — so your code stays readable.
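A minimal sketch of both chaining styles on the weather table (same nycflights13 assumptions; wind_speed there is in mph, and head() prints the first rows):

```r
library(magrittr)  # provides the %>% pipe operator

# data.table style: chain steps by appending [ ] directly
weather[, wind_speed_kmh := wind_speed * 1.61][
        , .(mean_wind_kmh = mean(wind_speed_kmh, na.rm = TRUE)), by = month]

# magrittr style: %>% passes the left-hand side into the next step;
# the dot '.' stands for the piped-in table
weather[, wind_speed_kmh := wind_speed * 1.61] %>%
  .[, .(mean_wind_kmh = mean(wind_speed_kmh, na.rm = TRUE)), by = month] %>%
  head()
```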

slide 33

89:22 So this is a more complex example of such a pipeline. We take the flights data set, merge it with the weather data set, merge the result with the planes data set, merge the result with the airports data set — here I have to specify the columns, because they have different names in the two data sets — and at the very end we merge it with the airlines data set. 90:00 Now we have a table that contains all the different pieces of source information. 90:08 We can then, for example, remove flights that have no information about the departure delay, and do the steps we did previously: we compute the speed of the plane, the average speed in kilometers per hour, and the rescaled speed — is the flight faster or slower than what is typical for a certain airplane model and a certain carrier? 90:38 And we can also calculate things like correlations; for example, here we calculate the correlation between the temperature and the rescaled speed of the airplane. 90:54 And what do we do here? Here we calculate the difference in the speed of the airplane between flights that have a delay larger than 20 minutes and flights where the delay is zero or smaller — the difference in speed between strongly delayed flights and flights with no or negative delay — and we calculate that per carrier; that's just an example. 91:32 What we see is that this correlation is typically positive: for some reason, the warmer the temperature, the higher the speed of the airplane — a positive correlation, whatever that means. 91:59 We also see that this difference in speed is very often positive — for American Airlines, for instance — meaning these airplanes fly faster when they are delayed than when they are not. 92:18 And you can compare the carriers: United Airlines flies a little bit faster than American Airlines. 92:29 You can extract all kinds of insights here. That's of course not extremely insightful in itself, but it illustrates how you can do simple operations on complex data sets in just a few lines of code — and I don't have to tell you that in C or Matlab you would spend a lot of time and many lines of code to implement this.
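A sketch of such a pipeline under the same assumptions (nycflights13 tables as data.tables; the 20-minute threshold follows the slide; the intermediate column names and the exact grouping are my reading of it):

```r
airports <- as.data.table(nycflights13::airports)
airlines <- as.data.table(nycflights13::airlines)

full <- flights %>%
  merge(weather) %>%                                  # shared names match automatically
  merge(planes,   by = "tailnum") %>%
  merge(airports, by.x = "origin", by.y = "faa") %>%  # different names in the two tables
  merge(airlines, by = "carrier") %>%
  .[!is.na(dep_delay)] %>%                            # drop flights without a delay value
  .[, speed_kmh := distance / air_time * 60 * 1.61] %>%
  .[, rescaled_speed := speed_kmh / mean(speed_kmh, na.rm = TRUE),
    by = .(model, carrier)]

# Correlation between temperature and rescaled speed
full[, cor(temp, rescaled_speed, use = "complete.obs")]

# Difference in speed between strongly delayed and on-time flights, per carrier
full[, .(speed_diff = mean(speed_kmh[dep_delay > 20], na.rm = TRUE) -
                      mean(speed_kmh[dep_delay <= 0], na.rm = TRUE)),
     by = carrier]
```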

slide 34

92:52 Just to summarize this part — if we have time I'll show you a little bit about plotting... no, we don't have time. 92:59 I showed you that these data frames are efficient ways of storing high-dimensional data of different kinds, 93:13 and that depending on where your data is stored and how large it is, you can choose different tools. If your data is stored locally, is very large, and you rely on speed, then data.table is a good way to go. If your data sits, for example, on a server on the internet and is not that large, then you would use other tools that are better at interacting with network resources, for example the dplyr package. 93:48 The key step is always to clean up the data and bring it to this tidy format. Once you've done that, we have this split-aggregate-combine paradigm: you group your data by basically any condition you want, then for each group you extract the corresponding rows of the data table and perform operations on them, for example summaries. 94:23 The package that I showed you allows you to do all of these steps in a single, very short line of code. 94:31 And finally, I showed you how pipelines help you to structure your code when it gets very complex, which is usually the case. 94:44 What I will do is upload some slides that show you how to visualize data — you all have your favorite plotting packages — slides that show how people developed a grammar of graphics, which allows you to build basically arbitrarily complex plots and data visualizations from a very simple grammatical construct. I'll upload this on the website. Let me know if you have any questions, and otherwise see you all next week.

95:25 Next week, I should say, we go into machine learning, and we'll have a guest speaker — Fabian, basically the chief data scientist at one of the institutes here, the Center for Regenerative Therapies — and he'll share some of his insights into how to use machine learning to detect low-dimensional structures in high-dimensional data sets. 95:59 As you probably realize by now, we're able to do fast computations on huge data sets, but what I showed you was a little bit tedious — I didn't actually show you how to come up with low-dimensional structures, with order, in the data. That is what we'll deal with next week, in the guest lecture here from Dresden. Okay, see you all next week. Bye. I'll stay online for a while in case there are any questions.

96:43 Question: I was wondering if you dealt with temporal data, or whether most of the data coming out of these
experiments is just kind of static information about the genomics and proteomics and all that? — 97:02 Yes, that's typically static data, but although it's static, it contains dynamic information, indirectly. 97:12 So you have measurements — of course, as very often in biology, you have to kill the cells that you actually want to measure, or the animals you want to look into — but you get some indirect dynamical information. 97:26 I'll actually share — you sent me an email about a paper — we just uploaded something that answers the question of how this field theory and these data science approaches work together, and it is actually an example from genomics. I'll share it with you in the chat. 97:58 Let me just see where it is... here we go. Okay, I'll share it in the chat. — All right, thank you. — There you go; that actually answers the question in your email. It just needed the Christmas holidays, some time, to finish and upload. 98:27 This is an example where you have static data — in genomics you can only have static data — but indirectly you have dynamic information. 98:40 Although in this specific example we have static data, because of course you have to kill the cells, you have to kill the embryo, you can still have time courses with different embryos or different cells: you kill one embryo at one stage — mouse, of course, not human — then the next embryo one day later, and the next one a day after that, so you get some implicit temporal information. 99:13 And actually, in this same paper we also conducted a time course with a very high temporal resolution, but at each time point you kill and therefore look at different cells, so the best you can hope for is something semi-static. 99:35 But nevertheless, yes, we still have dynamic information, indirectly, via the measurements we can make, for example, along the DNA. 99:48 So although we cannot take direct dynamic measurements, we can indirectly generate hypotheses that we can then, basically indirectly, test using static data. Have a look at the manuscript and you'll see how it works — and let me know if you have any questions; it's a little bit more biological, and this paper also has a supplement with all the field theory in it. — Okay, thank you. — Okay, great, perfect. 100:29 Okay, so if there are no more questions, then see you all next week. Bye.
