
how to calculate cumsum with depreciation in a grouped dataframe?

By : user3099876
Date : January 11 2021, 05:14 PM
Your function would work if you passed it a vector instead of the whole data frame; apply it to the num column within each group:
code :
depre <- function(num) {
  rate <- 0.5               # fraction of the running total carried into the next period
  sl <- num
  if (length(num) > 1) {    # guard against length-1 groups (2:1 would loop backwards)
    for (i in 2:length(num)) {
      sl[i] <- sl[i - 1] * rate + num[i]   # depreciate the running total, then add the new value
    }
  }
  sl
}
library(dplyr)
df %>% group_by(id) %>% mutate(sl = depre(num))
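For illustration, here is a minimal sketch with made-up data (the id and num column names come from the call above; the values themselves are assumptions, since the question's data frame is not shown):

# assumed sample data
df <- data.frame(id  = c(1, 1, 1, 2, 2),
                 num = c(10, 10, 10, 20, 20))

df %>% group_by(id) %>% mutate(sl = depre(num))
# id 1: sl = 10, 15 (10*0.5 + 10), 17.5 (15*0.5 + 10)
# id 2: sl = 20, 30 (20*0.5 + 20)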


How to calculate difference on previous within set of grouped rows in a dataframe


By : Sundhar Sun S
Date : March 29 2020, 07:55 AM
I hope this helps. [Note: your data don't seem to match your desired output; there are no CONTRACT_REF C rows in the second table, and even in your output I don't see why the (5, B) row is 1 and not 0. I'm assuming these are mistakes on your part. Since you didn't comment, I'm going to use the data from the output, because it leads to a more interesting column.]
I might do something like
code :
df["SUBMISSION_DATE"] = pd.to_datetime(df["SUBMISSION_DATE"],dayfirst=True)

gs = df.groupby(["USER_ID", "CONTRACT_REF"])["SUBMISSION_DATE"]
df["TIME_DIFF"] = gs.diff().fillna(0) / pd.datetools.timedelta(hours=1)
>>> df
    #  USER_ID CONTRACT_REF     SUBMISSION_DATE  TIME_DIFF
0   1        1            A 2014-06-20 01:00:00        0.0
1   2        1            A 2014-06-20 02:00:00        1.0
2   3        1            B 2014-06-20 03:00:00        0.0
3   4        4            A 2014-06-20 04:00:00        0.0
4   5        5            A 2014-06-20 05:00:00        0.0
5   6        5            B 2014-06-20 06:00:00        0.0
6   7        7            A 2014-06-20 07:00:00        0.0
7   8        7            A 2014-06-20 08:00:00        1.0
8   9        7            A 2014-06-20 09:30:00        1.5
9  10        7            B 2014-06-20 10:00:00        0.0

[10 rows x 5 columns]
>>> df
    #  USER_ID CONTRACT_REF SUBMISSION_DATE
0   1        1            A      20/6 01:00
1   2        1            A      20/6 02:00
2   3        1            B      20/6 03:00
3   4        4            A      20/6 04:00
4   5        5            A      20/6 05:00
5   6        5            B      20/6 06:00
6   7        7            A      20/6 07:00
7   8        7            A      20/6 08:00
8   9        7            A      20/6 09:30
9  10        7            B      20/6 10:00

[10 rows x 4 columns]
>>> df["SUBMISSION_DATE"] = pd.to_datetime(df["SUBMISSION_DATE"],dayfirst=True)
>>> df
    #  USER_ID CONTRACT_REF     SUBMISSION_DATE
0   1        1            A 2014-06-20 01:00:00
1   2        1            A 2014-06-20 02:00:00
2   3        1            B 2014-06-20 03:00:00
3   4        4            A 2014-06-20 04:00:00
4   5        5            A 2014-06-20 05:00:00
5   6        5            B 2014-06-20 06:00:00
6   7        7            A 2014-06-20 07:00:00
7   8        7            A 2014-06-20 08:00:00
8   9        7            A 2014-06-20 09:30:00
9  10        7            B 2014-06-20 10:00:00

[10 rows x 4 columns]
>>> gs = df.groupby(["USER_ID", "CONTRACT_REF"])["SUBMISSION_DATE"]
>>> gs
<pandas.core.groupby.SeriesGroupBy object at 0xa7af08c>
>>> gs.diff()
0        NaT
1   01:00:00
2        NaT
3        NaT
4        NaT
5        NaT
6        NaT
7   01:00:00
8   01:30:00
9        NaT
dtype: timedelta64[ns]
>>> gs.diff().fillna(pd.Timedelta(0))
0   00:00:00
1   01:00:00
2   00:00:00
3   00:00:00
4   00:00:00
5   00:00:00
6   00:00:00
7   01:00:00
8   01:30:00
9   00:00:00
dtype: timedelta64[ns]
>>> gs.diff().fillna(pd.Timedelta(0)) / pd.Timedelta(hours=1)
0    0.0
1    1.0
2    0.0
3    0.0
4    0.0
5    0.0
6    0.0
7    1.0
8    1.5
9    0.0
dtype: float64
>>> df["TIME_DIFF"] = gs.diff().fillna(0) / pd.datetools.timedelta(hours=1)
>>> df
    #  USER_ID CONTRACT_REF     SUBMISSION_DATE  TIME_DIFF
0   1        1            A 2014-06-20 01:00:00        0.0
1   2        1            A 2014-06-20 02:00:00        1.0
2   3        1            B 2014-06-20 03:00:00        0.0
3   4        4            A 2014-06-20 04:00:00        0.0
4   5        5            A 2014-06-20 05:00:00        0.0
5   6        5            B 2014-06-20 06:00:00        0.0
6   7        7            A 2014-06-20 07:00:00        0.0
7   8        7            A 2014-06-20 08:00:00        1.0
8   9        7            A 2014-06-20 09:30:00        1.5
9  10        7            B 2014-06-20 10:00:00        0.0

[10 rows x 5 columns]
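For reference, here is a self-contained sketch of the same calculation that runs on current pandas (the sample rows below are assumptions reconstructed from the output above):

import pandas as pd

# assumed sample data mirroring the output above
df = pd.DataFrame({
    "USER_ID":      [1, 1, 1, 4, 5, 5, 7, 7, 7, 7],
    "CONTRACT_REF": list("AABAABAAAB"),
    "SUBMISSION_DATE": ["20/6 01:00", "20/6 02:00", "20/6 03:00", "20/6 04:00",
                        "20/6 05:00", "20/6 06:00", "20/6 07:00", "20/6 08:00",
                        "20/6 09:30", "20/6 10:00"],
})
df["SUBMISSION_DATE"] = pd.to_datetime(df["SUBMISSION_DATE"], dayfirst=True)

# hours since the previous submission within each (user, contract) group
diffs = df.groupby(["USER_ID", "CONTRACT_REF"])["SUBMISSION_DATE"].diff()
df["TIME_DIFF"] = diffs.dt.total_seconds().div(3600).fillna(0)
# TIME_DIFF: 0, 1, 0, 0, 0, 0, 0, 1, 1.5, 0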
How to calculate mean of grouped dataframe?


By : Shashank Tiwari
Date : March 29 2020, 07:55 AM
For your specific case, you can just add the two columns together, take the mean, and then divide by two, since the two columns always have the same number of values:
code :
library(dplyr)
df %>% group_by(participant_number) %>% mutate(emoMean = mean(Happiness + Joy)/2)

Source: local data frame [5 x 5]
Groups: participant_number [2]

  participant_number Happiness   Joy  Lolz emoMean
               <dbl>     <dbl> <dbl> <dbl>   <dbl>
1                  1         3     1     3    2.50
2                  1         4     2     3    2.50
3                  1         2     3     3    2.50
4                  2         1     5     3    3.25
5                  2         3     4     3    3.25
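An equivalent way to write this, which does not rely on the "add and divide by two" trick, is to average the two per-group means directly. A minimal sketch, assuming the same participant_number, Happiness, and Joy columns as above:

df %>%
  group_by(participant_number) %>%
  mutate(emoMean = (mean(Happiness) + mean(Joy)) / 2)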
A simpler way to calculate grouped percentages in a Spark dataframe?


By : Yousaf Khalifa
Date : March 29 2020, 07:55 AM
Given a dataframe like the one in the question, you can use avg combined with when (or cast the boolean condition directly):
code :
from pyspark.sql.functions import avg, col, when

df.groupBy("date").agg(avg(when(col("foo") == "a", 1).otherwise(0)))
df.groupBy("date").agg(avg((col("foo") == "a").cast("integer")))
Calculate medians of rows in a grouped dataframe


By : BOT Veronica
Date : March 29 2020, 07:55 AM
For a data frame containing multiple entries per week, we can use summarise_at to take the median of several columns within each group:
code :
library(dplyr)
colsToKeep <- c("t_10", "t_30")
df1 %>%
   group_by(Week) %>%
   summarise_at(vars(colsToKeep), median) 
# A tibble: 4 x 3
#   Week  t_10  t_30
#  <int> <dbl> <dbl>
#1     1 51.40  5.60
#2     2 52.15  6.15
#3     3 52.20  5.90
#4     4 52.05  5.90
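On dplyr 1.0 or later, summarise_at is superseded by across(); a sketch of the equivalent call, assuming the same df1 with Week, t_10, and t_30 columns:

df1 %>%
  group_by(Week) %>%
  summarise(across(all_of(colsToKeep), median))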
R: Calculate distance between first and current row of grouped dataframe


By : user2706672
Date : March 29 2020, 07:55 AM
Edit: added an alternative formulation using a join. I expect that approach will be much faster for a very wide data frame with many columns to compare.
Approach 1: To get the Euclidean distance across a large number of columns, one way is to rearrange the data so each row holds one month, one student, and one original column (e.g. A or B in the OP), with two value columns: the current month's value and the first month's value. Then we can square the differences and, grouping by student and month, take the square root of their sum to get the Euclidean distance from each student's first row (labelled RMS in the output below).
code :
library(tidyverse)

df %>%
  group_by(student) %>%
  mutate_all(list(first = first)) %>%
  ungroup() %>%
  # gather into long form; make col show the variant, col2 the original column
  gather(col, val, -c(student, month, month_first)) %>%
  mutate(col2 = col %>% str_remove("_first")) %>%
  mutate(col = if_else(col %>% str_ends("_first"),
                       "first",
                       "comparison")) %>%
  spread(col, val) %>%
  mutate(square_dif = (comparison - first)^2) %>%
  group_by(student, month) %>%
  summarize(RMS = sqrt(sum(square_dif)))

# A tibble: 6 x 3
# Groups:   student [2]
  student month   RMS
  <fct>   <int> <dbl>
1 Amy         1  0   
2 Amy         2  5   
3 Amy         3  3.61
4 Bob         1  0   
5 Bob         2  2.24
6 Bob         3  2.24
library(tidyverse)
df_long <- gather(df, col, val, -c(month, student))
df_long %>% left_join(df_long %>% 
              group_by(student) %>%
              top_n(-1, wt = month) %>%
              rename(first_val = val) %>% 
              select(-month),
            by = c("student", "col")) %>%
  mutate(square_dif = (val - first_val)^2) %>%
  group_by( student, month) %>%
  summarize(RMS = sqrt(sum(square_dif)))

# A tibble: 6 x 3
# Groups:   student [2]
  student month   RMS
  <fct>   <int> <dbl>
1 Amy         1  0   
2 Amy         2  5   
3 Amy         3  3.61
4 Bob         1  0   
5 Bob         2  2.24
6 Bob         3  2.24
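For comparison, a more compact sketch that stays in wide form, assuming the same df with student, month, and numeric columns A and B, rows already ordered by month within each student, and dplyr 1.0+ for across():

library(dplyr)

df %>%
  group_by(student) %>%
  # squared difference of each column from its first (month 1) value,
  # then the square root of the row-wise sum
  mutate(RMS = sqrt(rowSums(across(c(A, B), ~ (.x - first(.x))^2)))) %>%
  ungroup()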