I tried writing a routine that extracts specific columns from a large CSV file and removes the duplicates

The off-the-top-of-my-head version (uses a mutable Set)
It fetches the values of column C (1-based)

import java.io.{BufferedReader, FileInputStream, InputStreamReader}
val br = new BufferedReader(new InputStreamReader(new FileInputStream("hogehoge.csv")))

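import scala.collection.mutable

def distinctCsv(br: BufferedReader, c: Int): Set[String] = {
	// Read every line up front; readLine returns null at end of input
	val it = Iterator.continually(br.readLine).takeWhile(_ != null).toList
	// Mutable Set that collects the distinct values of column c
	val s = mutable.Set[String]()

	it.foreach { l => s += l.split(',')(c - 1) }
	// Convert to an immutable Set to match the declared return type
	s.toSet
}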

distinctCsv(br, 5)
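To try the idea without a real file, here is a minimal sketch that feeds the same kind of function from an in-memory reader. The name distinctCsvSketch and the sample data are made up for illustration; they simply stand in for distinctCsv and hogehoge.csv.

```scala
import java.io.{BufferedReader, StringReader}
import scala.collection.mutable

// Same approach as above: collect column c (1-based) into a mutable Set.
def distinctCsvSketch(br: BufferedReader, c: Int): Set[String] = {
  val seen = mutable.Set[String]()
  Iterator.continually(br.readLine()).takeWhile(_ != null)
    .foreach(l => seen += l.split(',')(c - 1))
  seen.toSet
}

// Hypothetical sample data standing in for hogehoge.csv.
val sample = "a,x,1\nb,y,1\na,z,2\n"
val result = distinctCsvSketch(new BufferedReader(new StringReader(sample)), 1)
// result contains "a" and "b": the duplicate "a" in the third row is dropped
```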

Hmm, this really doesn't look like Scala. What bothers me most is that foreach returns no value.

Using these posts as a reference (http://voidy21.hatenablog.jp/entry/20110508/1304789580 , http://yuroyoro.hatenablog.com/entry/20100317/1268819400 ), I tried foldLeft.

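def distinctCsv(br: BufferedReader, c: Int): Set[String] = {
	val it = Iterator.continually(br.readLine).takeWhile(_ != null)
	// Still a mutable Set: += adds the value and returns the same Set, which becomes the next accumulator
	it.foldLeft(scala.collection.mutable.Set[String]())((m, l) => m += l.split(',')(c - 1)).toSet
}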

All at once it looks a lot more like Scala. Getting carried away, let's make it immutable.

def distinctCsv(br: BufferedReader, c: Int): Set[String] = {
	val it = Iterator.continually(br.readLine).takeWhile(_ != null)
	// Immutable Set: + returns a new Set containing the added value
	it.foldLeft(Set[String]())((m, l) => m + l.split(',')(c - 1))
}

Leaning on type inference (and placeholder syntax) as far as it will go:

def distinctCsv(br: BufferedReader, c: Int): Set[String] = {
	val it = Iterator.continually(br.readLine).takeWhile(_ != null)
	// First _ is the accumulated Set, second _ is the current line
	it.foldLeft(Set[String]())(_ + _.split(',')(c - 1))
}

Hard to read... so I put it back. While I'm at it, let's fetch several columns at once. Tuple version.

def distinctCsv(br: BufferedReader, cs: List[Int]): List[(Int, Set[String])] = {
	// Materialize the lines: an Iterator can be traversed only once, but we fold once per column
	val lines = Iterator.continually(br.readLine).takeWhile(_ != null).toList
	cs.map { c =>
		(c, lines.foldLeft(Set[String]())((m, l) => m + l.split(',')(c - 1)))
	}
}
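One caveat when folding over the same lines once per column: a Scala Iterator is single-pass, so a second traversal sees nothing. A small sketch with made-up in-memory lines (not the real file) shows the effect and one possible workaround, which is to hold the lines in a List at the cost of memory.

```scala
// An Iterator is used up by the first full traversal.
val it = Iterator("a,1", "b,2", "a,3")
val first  = it.foldLeft(Set[String]())((s, l) => s + l.split(',')(0))
val second = it.foldLeft(Set[String]())((s, l) => s + l.split(',')(1))
// first holds the distinct values of column 1;
// second is empty because `it` was already exhausted by the first fold.

// A List can be traversed as many times as needed.
val lines = List("a,1", "b,2", "a,3")
val col2 = lines.foldLeft(Set[String]())((s, l) => s + l.split(',')(1))
```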

A list of tuples feels a bit off somehow. Let's try a Map.

def distinctCsv(br: BufferedReader, cs: List[Int]): Map[Int, Set[String]] = {
	// Materialize the lines: an Iterator can be traversed only once, but we fold once per column
	val lines = Iterator.continually(br.readLine).takeWhile(_ != null).toList
	cs.foldLeft(Map[Int, Set[String]]()) { (m, c) =>
		m + (c -> lines.foldLeft(Set[String]())((s, l) => s + l.split(',')(c - 1)))
	}
}

It's getting cluttered, so let's try /: in place of foldLeft.

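def distinctCsv(br: BufferedReader, cs: List[Int]): Map[Int, Set[String]] = {
	// Materialize the lines: an Iterator can be traversed only once, but we fold once per column
	val lines = Iterator.continually(br.readLine).takeWhile(_ != null).toList
	// (z /: xs)(op) is the same as xs.foldLeft(z)(op); note that /: is deprecated since Scala 2.13
	(Map.empty[Int, Set[String]] /: cs) { (m, c) =>
		m + { c -> (Set[String]() /: lines) { (s, l) => s + l.split(',')(c - 1) } }
	}
}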

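The disappointing result is that it looks hardly any different. I suspect Set[String] might be the culprit, but I can't think of a better simple way to remove duplicates either. For now let's change direction and generalize so that tab-separated files are handled too. They do say that generalizing often tidies things up.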

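def distinctText(br: BufferedReader, cs: List[Int], f: (String, Int) => String): Map[Int, Set[String]] = {
	// Materialize the lines: an Iterator can be traversed only once, but we fold once per column
	val lines = Iterator.continually(br.readLine).takeWhile(_ != null).toList
	(Map.empty[Int, Set[String]] /: cs) { (m, c) =>
		// f extracts column c from a line, so the delimiter is no longer hard-coded here
		m + { c -> (Set[String]() /: lines) { (s, l) => s + f(l, c) } }
	}
}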

// Column extractors (c is 1-based) for comma-separated and tab-separated lines
val f_csv: (String, Int) => String = (l, c) => l.split(',')(c - 1)
val f_tab: (String, Int) => String = (l, c) => l.split('\t')(c - 1)
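As a quick illustration of what the two extractors return, here is a small sketch that applies them to single lines. The sample lines are made up; f_csv and f_tab are repeated only so that the snippet runs on its own.

```scala
// Column extractors as defined above (c is 1-based).
val f_csv: (String, Int) => String = (l, c) => l.split(',')(c - 1)
val f_tab: (String, Int) => String = (l, c) => l.split('\t')(c - 1)

// Hypothetical sample lines.
val csvLine = "tokyo,13,kanto"
val tsvLine = "osaka\t27\tkansai"

val fromCsv = f_csv(csvLine, 3) // third comma-separated field
val fromTsv = f_tab(tsvLine, 1) // first tab-separated field
```

Since both extractors share the type (String, Int) => String, either one can be passed as the f argument of distinctText.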

Not much of a change. I still think the doubled foldLeft is what makes it ugly.
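One possible way around the doubled foldLeft, offered only as a sketch and not as the definitive answer, is to fold over the lines once and update every requested column's Set for each line. The name distinctTextOnePass and the sample data are hypothetical. It still maps over the columns inside the fold, but each line is read only once and the file is never held in memory as a whole.

```scala
import java.io.{BufferedReader, StringReader}

// Hypothetical single-pass variant: one fold over the lines,
// updating the Set of every requested column per line.
def distinctTextOnePass(br: BufferedReader, cs: List[Int],
                        f: (String, Int) => String): Map[Int, Set[String]] = {
  // Start with an empty Set for each requested column.
  val empty: Map[Int, Set[String]] = cs.map(c => c -> Set.empty[String]).toMap
  Iterator.continually(br.readLine()).takeWhile(_ != null).foldLeft(empty) { (acc, line) =>
    acc.map { case (c, seen) => c -> (seen + f(line, c)) }
  }
}

// Made-up sample data and a CSV extractor for the demonstration.
val byComma: (String, Int) => String = (l, c) => l.split(',')(c - 1)
val data = "a,x,1\nb,y,1\na,z,2\n"
val onePass = distinctTextOnePass(new BufferedReader(new StringReader(data)), List(1, 3), byComma)
// onePass maps column 1 to {a, b} and column 3 to {1, 2}
```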