Shuffle에 대해서

shuffle이라는 말은 임의로 섞는다는 의미이다. mp3에서 임의로 노래의 순서를 섞어서 플레이할 때 사용되기도 한다. 더 많은 내용은 키워드 'random permutation'혹은 'the algorithm shuffles the sequence' 으로 살펴볼 수 있다. 아래는 그 이론에 대한 내용과 구현에 대해서 살펴본다.

shuffle 이론들

1. Fisher-Yates shuffle

def shuffle(a):
    n = len(a)
    for i in range(n - 1):  # i from 0 to n-2, inclusive.
        j = random.randrange(i, n)  # j from i to n-1, inclusive.
        a[i], a[j] = a[j], a[i]  # swap a[i] and a[j].

2. Sattolo algorism
Fisher-Yates의 변형으로 큰 틀은 비슷하지만 random의 범주가 줄어든다는 특징이 있다. 그래서 bias가 증가한다.

def sattolo(a):
    n = len(a)
    for i in range(n - 1):
        j = random.randrange(i+1, n)  # i+1 instead of i
        a[i], a[j] = a[j], a[i]

3. inside-out algorism

Fisher-Yates의 변형으로 Fisher-Yates는 원본 배열을 swap하는 형태라면 inside-out은 원본 배열은 변경하지 않고 새로운 배열을 만들어 결과를 저장하는 알고리즘이다.

1) 하나의 배열에서 새로운 랜덤 배열 만들기

To initialize an array a of n elements to a randomly shuffled copy of source, both 0-based:
  for i from 0 to n − 1 do
      j ← random integer such that 0 ≤ j ≤ i
      if j ≠ i
          a[i] ← a[j]
      a[j] ← source[i]

2) 여러개의 소스 배열에서 랜덤 배열 만들기

To initialize an empty array a to a randomly shuffled copy of source whose length is not known:
  while source.moreDataAvailable
      j ← random integer such that 0 ≤ j ≤ a.length
      if j = a.length
          a.append(source.next)
      else
          a.append(a[j])
          a[j] ← source.next

추가적인 개념

Cyclic permutation
Reservoir sampling

결론

스포티파이의 shuffle사례에서 보듯이 실제 사용자에게는 단순한 shuffle이 아니라 cluster분류가 균등한 shuffle이 필요하다. 또한 머신러닝 분야에서는 데이터 범위가 매우 큰 범위에서 임의의 sampling을 하는 부분에서도 활용되고 있다. 따라서 실제로 비즈니스에서 shuffle을 사용한다면 랜덤한 데이터를 사용할 클라이언트(사람 혹은 AI)가 원하는 목적에 따라서 별도의 로직을 추가해야겠다.

Reference

https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
https://danluu.com/sattolo/
Shuffle 활용 - https://yamalab.tistory.com/136
1. https://engineering.atspotify.com/2014/02/how-to-shuffle-songs/
Reservoir sampling - https://en.wikipedia.org/wiki/Reservoir_sampling

저작자표시 (새창열림)