dplyr: mutate / replace several columns on a subset of rows

yourdevel 2020. 11. 23. 20:20

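I'm trying out a dplyr-based workflow (rather than using mostly data.table, which I'm used to), and I've run into a problem for which I can't find an equivalent dplyr solution. I commonly face the scenario where I need to conditionally update/replace several columns based on a single condition. Here's some example code, with my data.table solution: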

library(data.table)

# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

# Replace the values of several columns for rows where measure is "exit"
dt <- dt[measure == 'exit', 
         `:=`(qty.exit = qty,
              cf = 0,
              delta.watts = 13)]

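Is there a simple dplyr solution to this same problem? I'd like to avoid using ifelse because I don't want to type the condition multiple times. This is a simplified example, but there are sometimes many assignments based on a single condition.

Thanks in advance for the help!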

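These solutions (1) maintain the pipeline, (2) do not overwrite the input, and (3) only require that the condition be specified once:

1a) mutate_cond: Create a simple function for data frames or data tables that can be incorporated into pipelines. This function is like mutate but only acts on the rows satisfying the condition: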

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)
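
To make the three properties concrete, here is a minimal sketch that runs mutate_cond on a tiny data frame. The data frame `toy` and its values are hypothetical and chosen only so the result can be checked by eye; the function body is copied from above.

```r
library(dplyr)

# mutate_cond() as defined above
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

# Hypothetical toy data (not from the original post)
toy <- data.frame(measure = c("exit", "led", "exit"),
                  qty = c(3, 8, 5),
                  qty.exit = 0,
                  cf = c(0.2, 0.4, 0.6),
                  stringsAsFactors = FALSE)

# The condition is written once; only the "exit" rows are changed
out <- toy %>% mutate_cond(measure == "exit", qty.exit = qty, cf = 0)

print(out)  # rows 1 and 3 updated, row 2 untouched
print(toy)  # the input is not overwritten
```

Because the function assigns into its own local copy of `.data`, the caller's `toy` stays as it was unless the result is assigned back.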

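1b) mutate_last: This is an alternative function for data frames or data tables which again is like mutate, but it is only used within group_by (as in the example below) and only operates on the last group rather than on every group. Note that TRUE > FALSE, so if group_by specifies a condition, mutate_last will only operate on the rows satisfying that condition. (The code reads the "indices" attribute of a grouped data frame, which is how dplyr versions before 0.8 stored group rows; from 0.8 on that attribute is gone and the row indices are available through group_rows() instead.)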

mutate_last <- function(.data, ...) {
  n <- n_groups(.data)
  indices <- attr(.data, "indices")[[n]] + 1
  .data[indices, ] <- .data[indices, ] %>% mutate(...)
  .data
}


DF %>% 
   group_by(is.exit = measure == 'exit') %>%
   mutate_last(qty.exit = qty, cf = 0, delta.watts = 13) %>%
   ungroup() %>%
   select(-is.exit)

2) Factor out the condition: Factor out the condition by making it an extra column that is later removed. Then use ifelse, replace, or arithmetic with logicals as illustrated. This also works for data tables.

library(dplyr)

DF %>% mutate(is.exit = measure == 'exit',
              qty.exit = ifelse(is.exit, qty, qty.exit),
              cf = (!is.exit) * cf,
              delta.watts = replace(delta.watts, is.exit, 13)) %>%
       select(-is.exit)

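3) sqldf: We can use SQL update via the sqldf package in the pipeline for data frames (but not for data tables unless we convert them; this may represent a bug in dplyr, see dplyr issue 1579). It may seem that this code undesirably modifies its input because of the update, but in fact the update acts on a copy of the input in the temporarily generated database and not on the actual input.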

library(sqldf)

DF %>% 
   do(sqldf(c("update '.' 
                 set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13 
                 where measure = 'exit'", 
              "select * from '.'")))

Note 1: We used this as DF:

set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

Note 2: The problem of how to easily specify updating a subset of rows is also discussed in dplyr issues 134, 631, 1518 and 1573, with 631 being the main thread and 1573 being a review of the answers here.

You can do this with magrittr's two-way pipe %<>%:

library(dplyr)
library(magrittr)

dt[dt$measure=="exit",] %<>% mutate(qty.exit = qty,
                                    cf = 0,  
                                    delta.watts = 13)

This reduces the amount of typing, but it is still much slower than data.table.

Here's a solution I like:

mutate_when <- function(data, ...) {
  dots <- eval(substitute(alist(...)))
  for (i in seq(1, length(dots), by = 2)) {
    condition <- eval(dots[[i]], envir = data)
    mutations <- eval(dots[[i + 1]], envir = data[condition, , drop = FALSE])
    data[condition, names(mutations)] <- mutations
  }
  data
}

For example:

mtcars %>% mutate_when(
  mpg > 22,    list(cyl = 100),
  disp == 160, list(cyl = 200)
)

It is quite readable, although it may not perform as well as it could.

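As eipi10 shows above, there is no simple way to do a subset replacement in dplyr, because DT uses pass-by-reference semantics whereas dplyr uses pass-by-value. dplyr requires ifelse() on the whole vector, whereas DT subsets the rows and updates them by reference (returning the whole DT). So, for this exercise, DT will be substantially faster.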
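
The difference in semantics can be seen on a two-row example. This is only an illustrative sketch: `dt_demo`, `df_demo` and their values are hypothetical and not part of the original benchmark.

```r
library(data.table)
library(dplyr)

# Hypothetical two-row inputs with identical contents
dt_demo <- data.table(measure = c("exit", "led"), cf = c(0.5, 0.7))
df_demo <- data.frame(measure = c("exit", "led"), cf = c(0.5, 0.7))

# data.table: := touches only the matching rows and modifies dt_demo in place
dt_demo[measure == "exit", cf := 0]

# dplyr: ifelse() builds a full-length vector and mutate() returns a new
# data frame; df_demo itself is left unchanged
df_new <- df_demo %>% mutate(cf = ifelse(measure == "exit", 0, cf))

print(dt_demo)  # already updated, no reassignment needed
print(df_demo)  # still holds the original values
print(df_new)   # holds the updated copy
```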

Alternatively, you could create the subset first, then update it, and finally recombine:

dt.sub <- dt[dt$measure == "exit",] %>%
  mutate(qty.exit= qty, cf= 0, delta.watts= 13)

dt.new <- rbind(dt.sub, dt[dt$measure != "exit",])

But DT is going to be substantially faster (edited to use eipi10's new answer):

library(data.table)
library(dplyr)
library(magrittr)  # provides %<>%, which dplyr does not export
library(microbenchmark)
microbenchmark(dt= {dt <- dt[measure == 'exit', 
                            `:=`(qty.exit = qty,
                                 cf = 0,
                                 delta.watts = 13)]},
               eipi10= {dt[dt$measure=="exit",] %<>% mutate(qty.exit = qty,
                                cf = 0,  
                                delta.watts = 13)},
               alex= {dt.sub <- dt[dt$measure == "exit",] %>%
                 mutate(qty.exit= qty, cf= 0, delta.watts= 13)

               dt.new <- rbind(dt.sub, dt[dt$measure != "exit",])})


Unit: microseconds
expr      min        lq      mean   median       uq      max neval cld
     dt  591.480  672.2565  747.0771  743.341  780.973 1837.539   100  a 
 eipi10 3481.212 3677.1685 4008.0314 3796.909 3936.796 6857.509   100   b
   alex 3412.029 3637.6350 3867.0649 3726.204 3936.985 5424.427   100   b

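I just stumbled across this and really like mutate_cond() by @G. Grothendieck, but thought it might be handy if it also handled new variables. So the version below has two additions:

Unrelated: the second-to-last line is made a bit more dplyr-like by using filter().

Three new lines at the beginning get the variable names used in mutate() and initialize any new variables in the data frame before mutate() runs. New variables are initialized for the rest of the data.frame using new_init, which defaults to missing (NA).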

# Needs library(dplyr) and library(magrittr); %<>% comes from magrittr
mutate_cond <- function(.data, condition, ..., new_init = NA, envir = parent.frame()) {
  # Initialize any new variables as new_init
  new_vars <- substitute(list(...))[-1]
  new_vars %<>% sapply(deparse) %>% names %>% setdiff(names(.data))
  .data[, new_vars] <- new_init

  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data %>% filter(condition) %>% mutate(...)
  .data
}

Here are some examples using the iris data:

Change Petal.Length to 88 where Species == "setosa". This works in the original function as well as in this new version.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88)

Same as above, but also create a new variable x (NA in rows not included in the condition). This was not possible before.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE)

Same as above, but rows not included in the condition have x set to FALSE.

iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE, new_init = FALSE)

This example shows how new_init can be set to a list to initialize multiple new variables with different values. Here, two new variables are created, and the excluded rows are initialized with different values (x as FALSE, y as NA).

iris %>% mutate_cond(Species == "setosa" & Sepal.Length < 5,
                  x = TRUE, y = Sepal.Length ^ 2,
                  new_init = list(FALSE, NA))

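mutate_cond is a great function, but it throws an error if there is an NA in the column(s) used to create the condition. I feel that a conditional mutate should simply leave such rows alone. This matches the behaviour of filter(), which returns rows where the condition is TRUE but omits rows where it is FALSE or NA.

With this small change the function works like a charm: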

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
    condition <- eval(substitute(condition), .data, envir)
    condition[is.na(condition)] = FALSE
    .data[condition, ] <- .data[condition, ] %>% mutate(...)
    .data
}
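
The failure mode and the fix can be reproduced in base R alone. The sketch below uses a hypothetical three-row data frame `df_na` and the same kind of row-subset assignment that mutate_cond performs internally.

```r
# Hypothetical data with a missing value in the column used for the condition
df_na <- data.frame(x = c(1, NA, 3), y = c(10, 20, 30))
cond <- df_na$x > 2   # FALSE, NA, TRUE

# A row assignment with an NA in the logical index is rejected for data frames
msg <- tryCatch({
  df_na[cond, "y"] <- 99
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Treating NA as FALSE leaves those rows alone, as filter() would
cond[is.na(cond)] <- FALSE
df_na[cond, "y"] <- 99
print(df_na)
```

After the replacement of NA by FALSE, only the third row is updated and the row with the missing x keeps its original y.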

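I don't actually see any changes to dplyr that would make this much easier. case_when is great when there are multiple different conditions and outcomes for one column, but it doesn't help when you want to change multiple columns based on one condition. Similarly, recode saves typing if you are replacing multiple different values in one column, but it doesn't help with doing so in multiple columns at once. Finally, mutate_at etc. only apply conditions to the column names, not to the rows of the data frame. You could potentially write a function for mutate_at that would do it, but I can't figure out how you would make it behave differently for different columns.

That said, here is how I would approach it using nest from tidyr and map from purrr.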

library(data.table)
library(dplyr)
library(tidyr)
library(purrr)

# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

dt2 <- dt %>% 
  nest(-measure) %>% 
  mutate(data = if_else(
    measure == "exit", 
    map(data, function(x) mutate(x, qty.exit = qty, cf = 0, delta.watts = 13)),
    data
  )) %>%
  unnest()

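With the creation of rlang, a slightly modified version of Grothendieck's 1a example is possible. It eliminates the need for the envir argument, because enquo() automatically captures the environment in which .p was created.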

mutate_rows <- function(.data, .p, ...) {
  .p <- rlang::enquo(.p)
  .p_lgl <- rlang::eval_tidy(.p, .data)
  .data[.p_lgl, ] <- .data[.p_lgl, ] %>% mutate(...)
  .data
}

dt %>% mutate_rows(measure == "exit", qty.exit = qty, cf = 0, delta.watts = 13)
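
A small sketch of what capturing the environment buys: the condition can refer to a variable that exists only in the caller's scope, with no envir argument to pass along. The wrapper `zero_cf_above`, its `threshold` argument and the data frame `toy` are hypothetical names introduced for illustration.

```r
library(dplyr)

# mutate_rows() as defined above
mutate_rows <- function(.data, .p, ...) {
  .p <- rlang::enquo(.p)
  .p_lgl <- rlang::eval_tidy(.p, .data)
  .data[.p_lgl, ] <- .data[.p_lgl, ] %>% mutate(...)
  .data
}

# Hypothetical toy data
toy <- data.frame(qty = c(3, 8, 5), cf = c(0.2, 0.4, 0.6))

# `threshold` lives only inside this wrapper; the quosure created by
# enquo() remembers that environment, so eval_tidy() can still find it
zero_cf_above <- function(df, threshold) {
  mutate_rows(df, qty > threshold, cf = 0)
}

res <- zero_cf_above(toy, threshold = 4)
print(res)  # cf is set to 0 where qty > 4
```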

You could split the dataset and do a regular mutate call on the TRUE part.

dplyr 0.8 features the function group_split, which splits by groups (and the groups can be defined directly in the call), so we use it here, but base::split works as well.

library(tidyverse)
df1 %>%
  group_split(measure == "exit", keep=FALSE) %>% # or `split(.$measure == "exit")`
  modify_at(2,~mutate(.,qty.exit = qty, cf = 0, delta.watts = 13)) %>%
  bind_rows()

#    site space measure qty qty.exit delta.watts          cf
# 1     1     4     led   1        0        73.5 0.246240409
# 2     2     3     cfl  25        0        56.5 0.360315879
# 3     5     4     cfl   3        0        38.5 0.279966850
# 4     5     3  linear  19        0        40.5 0.281439486
# 5     2     3  linear  18        0        82.5 0.007898384
# 6     5     1  linear  29        0        33.5 0.392412729
# 7     5     3  linear   6        0        46.5 0.970848817
# 8     4     1     led  10        0        89.5 0.404447182
# 9     4     1     led  18        0        96.5 0.115594622
# 10    6     3  linear  18        0        15.5 0.017919745
# 11    4     3     led  22        0        54.5 0.901829577
# 12    3     3     led  17        0        79.5 0.063949974
# 13    1     3     led  16        0        86.5 0.551321441
# 14    6     4     cfl   5        0        65.5 0.256845013
# 15    4     2     led  12        0        29.5 0.340603733
# 16    5     3  linear  27        0        63.5 0.895166931
# 17    1     4     led   0        0        47.5 0.173088800
# 18    5     3  linear  20        0        89.5 0.438504370
# 19    2     4     cfl  18        0        45.5 0.031725246
# 20    2     3     led  24        0        94.5 0.456653397
# 21    3     3     cfl  24        0        73.5 0.161274319
# 22    5     3     led   9        0        62.5 0.252212124
# 23    5     1     led  15        0        40.5 0.115608182
# 24    3     3     cfl   3        0        89.5 0.066147321
# 25    6     4     cfl   2        0        35.5 0.007888337
# 26    5     1  linear   7        0        51.5 0.835458916
# 27    2     3  linear  28        0        36.5 0.691483644
# 28    5     4     led   6        0        43.5 0.604847889
# 29    6     1  linear  12        0        59.5 0.918838163
# 30    3     3  linear   7        0        73.5 0.471644760
# 31    4     2     led   5        0        34.5 0.972078100
# 32    1     3     cfl  17        0        80.5 0.457241602
# 33    5     4  linear   3        0        16.5 0.492500255
# 34    3     2     cfl  12        0        44.5 0.804236607
# 35    2     2     cfl  21        0        50.5 0.845094268
# 36    3     2  linear  10        0        23.5 0.637194873
# 37    4     3     led   6        0        69.5 0.161431896
# 38    3     2    exit  19       19        13.0 0.000000000
# 39    6     3    exit   7        7        13.0 0.000000000
# 40    6     2    exit  20       20        13.0 0.000000000
# 41    3     2    exit   1        1        13.0 0.000000000
# 42    2     4    exit  19       19        13.0 0.000000000
# 43    3     1    exit  24       24        13.0 0.000000000
# 44    3     3    exit  16       16        13.0 0.000000000
# 45    5     3    exit   9        9        13.0 0.000000000
# 46    2     3    exit   6        6        13.0 0.000000000
# 47    4     1    exit   1        1        13.0 0.000000000
# 48    1     1    exit  14       14        13.0 0.000000000
# 49    6     3    exit   7        7        13.0 0.000000000
# 50    2     4    exit   3        3        13.0 0.000000000

If row order matters, use tibble::rowid_to_column first, then dplyr::arrange on rowid, and select it out at the end.
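
The order-preserving variant described above could look like the following sketch. It uses base::split rather than group_split, and a hypothetical four-row data frame `toy`, so the result is easy to verify.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical toy data with "exit" rows interleaved with other rows
toy <- data.frame(measure = c("exit", "led", "exit", "cfl"),
                  qty = c(5, 2, 7, 1),
                  qty.exit = 0,
                  stringsAsFactors = FALSE)

res <- toy %>%
  rowid_to_column() %>%                            # remember the original position
  split(.$measure == "exit") %>%                   # list with elements "FALSE" and "TRUE"
  modify_at("TRUE", ~ mutate(., qty.exit = qty)) %>%
  bind_rows() %>%
  arrange(rowid) %>%                               # restore the original order
  select(-rowid)                                   # drop the helper column

print(res)
```

Selecting the list element by the name "TRUE" instead of by position 2 is a small design choice in this sketch: it avoids depending on how many groups the split produced.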

Data

df1 <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                                  replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50),
                 stringsAsFactors = F)

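At the cost of departing from the usual dplyr syntax, you can use within from base. The assignments go inside a single braced block, because within evaluates only its one expr argument:

dt %>% within({
  qty.exit[measure == 'exit'] <- qty[measure == 'exit']
  delta.watts[measure == 'exit'] <- 13
})

It seems to integrate nicely with the pipe, and you can do pretty much anything you want inside it.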

I think this answer has not been mentioned before. It runs almost as fast as the 'default' data.table solution.

Use base::replace()

df %>% mutate(qty.exit = replace(qty.exit, measure == 'exit', qty[measure == 'exit']),
              cf = replace(cf, measure == 'exit', 0),
              delta.watts = replace(delta.watts, measure == 'exit', 13))

replace recycles the replacement value, so when you want the values of column qty entered into column qty.exit, you have to subset qty as well; hence the qty[measure == 'exit'] in the first replacement.
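
Why the subset on qty is needed can be shown with plain vectors. This is a minimal base R sketch with hypothetical values; the vector names mirror the columns used above.

```r
# Hypothetical column values
qty      <- c(10, 20, 30, 40)
qty.exit <- c(0, 0, 0, 0)
is_exit  <- c(FALSE, TRUE, FALSE, TRUE)

# Without subsetting, replacement values are consumed from the start of qty,
# not from the positions being replaced (R also warns about the length mismatch)
wrong <- suppressWarnings(replace(qty.exit, is_exit, qty))

# Subsetting qty with the same index lines the values up with their rows
right <- replace(qty.exit, is_exit, qty[is_exit])

print(wrong)
print(right)
```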

Now, you will probably not want to retype measure == 'exit' every time, so you can create an index vector containing that selection and use it in the calls above.

#build an index-vector matching the condition
index.v <- which( df$measure == 'exit' )

df %>% mutate( qty.exit = replace( qty.exit, index.v, qty[ index.v] ),
               cf = replace( cf, index.v, 0 ),
               delta.watts = replace( delta.watts, index.v, 13 ) )

Benchmarks

# Unit: milliseconds
#         expr      min       lq     mean   median       uq      max neval
# data.table   1.005018 1.053370 1.137456 1.112871 1.186228 1.690996   100
# wimpel       1.061052 1.079128 1.218183 1.105037 1.137272 7.390613   100
# wimpel.index 1.043881 1.064818 1.131675 1.085304 1.108502 4.192995   100

One concise solution would be to do the mutation on the filtered subset and then add back the non-exit rows of the table:

library(dplyr)

dt %>% 
    filter(measure == 'exit') %>%
    mutate(qty.exit = qty, cf = 0, delta.watts = 13) %>%
    rbind(dt %>% filter(measure != 'exit'))

Source: https://stackoverflow.com/questions/34096162/dplyr-mutate-replace-several-columns-on-a-subset-of-rows
