How to calculate difference on previous within set of grouped rows in a dataframe
By : Sundhar Sun S
Date : March 29 2020, 07:55 AM
I hope this is helpful for you. [Note: your data doesn't seem to match your desired output; there are no CONTRACT_REF Cs in the second table, and even in your output I don't see why the 5, B row is 1 and not 0. I'm assuming these are mistakes on your part. Since you didn't comment, I'm going to use the data from the output, because it leads to a more interesting column.] I might do something like this: code :
import pandas as pd

df["SUBMISSION_DATE"] = pd.to_datetime(df["SUBMISSION_DATE"], dayfirst=True)
gs = df.groupby(["USER_ID", "CONTRACT_REF"])["SUBMISSION_DATE"]
df["TIME_DIFF"] = gs.diff().fillna(0) / pd.Timedelta(hours=1)
>>> df
# USER_ID CONTRACT_REF SUBMISSION_DATE
0 1 1 A 20/6 01:00
1 2 1 A 20/6 02:00
2 3 1 B 20/6 03:00
3 4 4 A 20/6 04:00
4 5 5 A 20/6 05:00
5 6 5 B 20/6 06:00
6 7 7 A 20/6 07:00
7 8 7 A 20/6 08:00
8 9 7 A 20/6 09:30
9 10 7 B 20/6 10:00
[10 rows x 4 columns]
>>> df["SUBMISSION_DATE"] = pd.to_datetime(df["SUBMISSION_DATE"],dayfirst=True)
>>> df
# USER_ID CONTRACT_REF SUBMISSION_DATE
0 1 1 A 2014-06-20 01:00:00
1 2 1 A 2014-06-20 02:00:00
2 3 1 B 2014-06-20 03:00:00
3 4 4 A 2014-06-20 04:00:00
4 5 5 A 2014-06-20 05:00:00
5 6 5 B 2014-06-20 06:00:00
6 7 7 A 2014-06-20 07:00:00
7 8 7 A 2014-06-20 08:00:00
8 9 7 A 2014-06-20 09:30:00
9 10 7 B 2014-06-20 10:00:00
[10 rows x 4 columns]
>>> gs = df.groupby(["USER_ID", "CONTRACT_REF"])["SUBMISSION_DATE"]
>>> gs
<pandas.core.groupby.SeriesGroupBy object at 0xa7af08c>
>>> gs.diff()
0 NaT
1 01:00:00
2 NaT
3 NaT
4 NaT
5 NaT
6 NaT
7 01:00:00
8 01:30:00
9 NaT
dtype: timedelta64[ns]
>>> gs.diff().fillna(0)
0 00:00:00
1 01:00:00
2 00:00:00
3 00:00:00
4 00:00:00
5 00:00:00
6 00:00:00
7 01:00:00
8 01:30:00
9 00:00:00
dtype: timedelta64[ns]
>>> gs.diff().fillna(0) / pd.Timedelta(hours=1)
0 0.0
1 1.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 1.0
8 1.5
9 0.0
dtype: float64
>>> df["TIME_DIFF"] = gs.diff().fillna(0) / pd.Timedelta(hours=1)
>>> df
# USER_ID CONTRACT_REF SUBMISSION_DATE TIME_DIFF
0 1 1 A 2014-06-20 01:00:00 0.0
1 2 1 A 2014-06-20 02:00:00 1.0
2 3 1 B 2014-06-20 03:00:00 0.0
3 4 4 A 2014-06-20 04:00:00 0.0
4 5 5 A 2014-06-20 05:00:00 0.0
5 6 5 B 2014-06-20 06:00:00 0.0
6 7 7 A 2014-06-20 07:00:00 0.0
7 8 7 A 2014-06-20 08:00:00 1.0
8 9 7 A 2014-06-20 09:30:00 1.5
9 10 7 B 2014-06-20 10:00:00 0.0
[10 rows x 5 columns]
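For reference, here is a self-contained sketch of the same recipe on a recent pandas version, with sample data reconstructed from the table above. `fillna(pd.Timedelta(0))` is used because filling a timedelta column with the bare integer 0 is rejected by newer pandas releases:

```python
import pandas as pd

# Sample data reconstructed from the table above.
df = pd.DataFrame({
    "USER_ID": [1, 1, 1, 4, 5, 5, 7, 7, 7, 7],
    "CONTRACT_REF": ["A", "A", "B", "A", "A", "B", "A", "A", "A", "B"],
    "SUBMISSION_DATE": pd.to_datetime([
        "2014-06-20 01:00", "2014-06-20 02:00", "2014-06-20 03:00",
        "2014-06-20 04:00", "2014-06-20 05:00", "2014-06-20 06:00",
        "2014-06-20 07:00", "2014-06-20 08:00", "2014-06-20 09:30",
        "2014-06-20 10:00",
    ]),
})

# Per-group difference to the previous submission, expressed in hours.
gs = df.groupby(["USER_ID", "CONTRACT_REF"])["SUBMISSION_DATE"]
df["TIME_DIFF"] = gs.diff().fillna(pd.Timedelta(0)) / pd.Timedelta(hours=1)

print(df["TIME_DIFF"].tolist())
# [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.5, 0.0]
```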
|
How to calculate mean of grouped dataframe?
By : Shashank Tiwari
Date : March 29 2020, 07:55 AM
Does that help? For your specific case, you can just add the two columns together, take the mean, and then divide by two, since the two columns always have the same count: code :
library(dplyr)

df %>%
  group_by(participant_number) %>%
  mutate(emoMean = mean(Happiness + Joy) / 2)
Source: local data frame [5 x 5]
Groups: participant_number [2]
participant_number Happiness Joy Lolz emoMean
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 1 3 2.50
2 1 4 2 3 2.50
3 1 2 3 3 2.50
4 2 1 5 3 3.25
5 2 3 4 3 3.25
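For comparison, the same grouped mean can be sketched in pandas; the data frame below is a hypothetical reconstruction of the values shown in the R output:

```python
import pandas as pd

# Hypothetical data mirroring the R example above.
df = pd.DataFrame({
    "participant_number": [1, 1, 1, 2, 2],
    "Happiness": [3, 4, 2, 1, 3],
    "Joy": [1, 2, 3, 5, 4],
})

# Grouped mean of (Happiness + Joy) / 2, broadcast back to every row:
# the pandas analogue of dplyr's mutate() after group_by().
df["emoMean"] = (
    (df["Happiness"] + df["Joy"])
    .groupby(df["participant_number"])
    .transform("mean") / 2
)

print(df["emoMean"].tolist())
# [2.5, 2.5, 2.5, 3.25, 3.25]
```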
|
A simpler way to calculate grouped percentages in a Spark dataframe?
By : Yousaf Khalifa
Date : March 29 2020, 07:55 AM
Hope this one helps. Use mean / avg combined with when, or cast the boolean condition directly: code :
from pyspark.sql.functions import avg, col, when

# Variant 1: explicit 1/0 indicator via when/otherwise
df.groupBy("date").agg(avg(when(col("foo") == "a", 1).otherwise(0)))

# Variant 2: cast the boolean condition directly to an integer
df.groupBy("date").agg(avg((col("foo") == "a").cast("integer")))
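The same indicator-mean trick works in plain pandas as well; this sketch uses a small hypothetical dataframe with the same `date` and `foo` column names as the Spark snippet:

```python
import pandas as pd

# Hypothetical data: one row per observation, with a date and a category.
df = pd.DataFrame({
    "date": ["d1", "d1", "d1", "d2", "d2"],
    "foo":  ["a",  "a",  "b",  "a",  "b"],
})

# The mean of a boolean indicator is exactly the fraction of "a" rows,
# the same idea as avg(when(...)) in the Spark version.
pct = (df["foo"] == "a").groupby(df["date"]).mean()

print(pct.to_dict())  # d1 -> 2/3, d2 -> 0.5
```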
|
R: Calculate distance between first and current row of grouped dataframe
By : user2706672
Date : March 29 2020, 07:55 AM
This fixes the issue. Edit: added an alternative formulation using a join; I expect that approach will be much faster for a very wide data frame with many columns to compare. Approach 1: to get the Euclidean distance over a large number of columns, one way is to rearrange the data so each row shows one month, one student, and one original column (e.g. A or B in the OP), with two columns representing the current-month value and the first value. Then we can square the differences and group across all columns to get the Euclidean distance (the root of the summed squares, labelled RMS in the code) for each student-month. code :
library(tidyverse)

df %>%
  group_by(student) %>%
  mutate_all(list(first = first)) %>%
  ungroup() %>%
  # gather into long form; col shows the variant, col2 the original column
  gather(col, val, -c(student, month, month_first)) %>%
  mutate(col2 = col %>% str_remove("_first")) %>%
  mutate(col = if_else(col %>% str_ends("_first"),
                       "first",
                       "comparison")) %>%
  spread(col, val) %>%
  mutate(square_dif = (comparison - first)^2) %>%
  group_by(student, month) %>%
  summarize(RMS = sqrt(sum(square_dif)))
# A tibble: 6 x 3
# Groups: student [2]
student month RMS
<fct> <int> <dbl>
1 Amy 1 0
2 Amy 2 5
3 Amy 3 3.61
4 Bob 1 0
5 Bob 2 2.24
6 Bob 3 2.24
library(tidyverse)

df_long <- gather(df, col, val, -c(month, student))

df_long %>%
  left_join(df_long %>%
              group_by(student) %>%
              top_n(-1, wt = month) %>%
              rename(first_val = val) %>%
              select(-month),
            by = c("student", "col")) %>%
  mutate(square_dif = (val - first_val)^2) %>%
  group_by(student, month) %>%
  summarize(RMS = sqrt(sum(square_dif)))
# A tibble: 6 x 3
# Groups: student [2]
student month RMS
<fct> <int> <dbl>
1 Amy 1 0
2 Amy 2 5
3 Amy 3 3.61
4 Bob 1 0
5 Bob 2 2.24
6 Bob 3 2.24
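The same first-row distance can be computed quite directly in pandas with a grouped transform. The A/B values below are hypothetical (the question's data frame isn't shown), chosen so the result reproduces the RMS column above:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two numeric columns per student-month.
df = pd.DataFrame({
    "student": ["Amy", "Amy", "Amy", "Bob", "Bob", "Bob"],
    "month":   [1, 2, 3, 1, 2, 3],
    "A":       [0, 3, 2, 0, 1, 2],
    "B":       [0, 4, 3, 0, 2, 1],
})

# Broadcast each group's first row back to every row, then take the
# Euclidean distance: subtract, square, sum across columns, sqrt.
first = df.groupby("student")[["A", "B"]].transform("first")
df["RMS"] = np.sqrt(((df[["A", "B"]] - first) ** 2).sum(axis=1))

print(df[["student", "month", "RMS"]])
```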
|