Merging CSV files whose field separator also occurs inside quotes

Question

The find command line is made up of different kinds of options, which are combined into an expression.

The find option -delete is an action.
That means it is executed for every file that has matched so far.
Placed first after the paths, it comes before any test has run, so every file matches... oops!

This is dangerous, but at least the man page carries a big warning:

From man find:

ACTIONS
    -delete
           Delete  files; true if removal succeeded.  If the removal failed, an
           error message is issued.  If -delete fails, find's exit status  will
           be nonzero (when it eventually exits).  Use of -delete automatically
           turns on the -depth option.

           Warnings: Don't forget that the find command line is evaluated as an
           expression,  so  putting  -delete first will make find try to delete
           everything below the starting points you specified.  When testing  a
           find  command  line  that  you later intend to use with -delete, you
           should explicitly specify -depth in order to avoid later  surprises.
           Because  -delete  implies -depth, you cannot usefully use -prune and
           -delete together.

From further up in man find:

EXPRESSIONS
    The expression is made up of options (which affect overall operation rather
    than  the  processing  of  a  specific file, and always return true), tests
    (which return a true or false value), and actions (which have side  effects
    and  return  a  true  or false value), all separated by operators.  -and is
    assumed where the operator is omitted.

    If the expression contains no actions other than  -prune,  -print  is  per‐
    formed on all files for which the expression is true.
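
The left-to-right evaluation described in the quote can be imitated in a few lines of Python. This is only a toy model under stated assumptions: the names name_endswith, delete and run are invented for illustration, the "files" are plain strings, and nothing is removed from disk. It models the implicit -and only and ignores -depth, -o and the rest of find's grammar.

```python
# Toy model of a find expression: each primary is a function that
# receives a path and returns True or False. Actions also record a
# side effect in `log` instead of touching the file system.

def name_endswith(suffix):
    """Stand-in for a test such as -name '*ar'."""
    return lambda path, log: path.endswith(suffix)

def delete():
    """Stand-in for the -delete action: log the path, report success."""
    def action(path, log):
        log.append(path)
        return True
    return action

def run(paths, primaries):
    """Evaluate the primaries left to right for every path."""
    log = []
    for path in paths:
        # Implicit -and: all() stops at the first primary returning False.
        all(primary(path, log) for primary in primaries)
    return log

paths = ["./foo.tar", "./bar.txt", "./baz.jar"]

# Test first, action second: only matching paths reach the action.
safe = run(paths, [name_endswith("ar"), delete()])

# Action first: it runs before any test had a chance to filter.
dangerous = run(paths, [delete(), name_endswith("ar")])

print(safe)
print(dangerous)
```

In the second call the action sits in front of the test, which is the situation the man page warning is about.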

On trying out what a find command will do:

To see what a command such as

find . -name '*ar' -delete

would delete, first replace the -delete action with a more harmless one, for example -fls (which writes an -ls style listing to the file named after it) or -print:

find . -name '*ar' -print

This prints the files that the action would affect.
In this example, -print can even be left out. The expression then contains no action at all, so the most obvious one is added implicitly: -print. (See the second paragraph of the EXPRESSIONS section quoted above.)
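
The same preview-first habit carries over to scripts. Below is a small Python sketch that lists what a '*ar' pattern would match in a scratch directory before anything is removed. The helper name matching_paths is hypothetical, the directory and file names are made up, and the matching only approximates find -name; the sketch itself deletes nothing apart from its own temporary directory.

```python
import fnmatch
import os
import tempfile

def matching_paths(root, pattern):
    """Return every file or directory below root whose base name
    matches the shell-style pattern (case-sensitive)."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if fnmatch.fnmatchcase(name, pattern):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)

# Build a throwaway directory with three empty files.
with tempfile.TemporaryDirectory() as root:
    for name in ("a.tar", "b.txt", "c.jar"):
        open(os.path.join(root, name), "w").close()

    # Preview step: inspect this list before running any removal.
    preview = [os.path.basename(p) for p in matching_paths(root, "*ar")]

print(preview)
```

Only once the preview looks right would a removal step be added, mirroring the swap from -print back to -delete.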

awk quoting csv join

wass rubleff 08.08.2018, 00:58

2 answers

TXR language:

@(do
   (defun csv-parse (str)
     (let ((toks (tok-str str #/[^\s,][^,]+[^\s,]|"[^"]*"|[^\s,]/)))
       [mapcar (do let ((l (match-regex @1 #/".*"/)))
                     (if (eql l (length @1))
                       [@1 1..-1] @1)) toks]))

   (defun csv-format (list)
     (cat-str (mapcar (do if (find #\, @1) `"@1"` @1) list) ", "))

   (defun join-recs (recs-left recs-right)
     (append-each ((l recs-left))
       (collect-each ((r recs-right))
         (append l r))))

   (let ((hashes (collect-each ((arg *args*))
                   (let ((stream (open-file arg)))
                     [group-by first [mapcar csv-parse (gun (get-line stream))]
                               :equal-based]))))
     (when hashes
       (let ((joined (reduce-left (op hash-isec @1 @2 join-recs) hashes)))
         (dohash (key recs joined)
           (each ((rec recs))
             (put-line (csv-format rec))))))))

Sample data.

Note: key 3792318 occurs twice in the third file, so we expect two rows for that key in the join output.

Note: the data does not need to be sorted; the join is done by hashing.

$ for x in csv* ; do echo "File $x:" ; cat $x ; done
File csv1:
3792318, 2014-07-15 00:00:00, "A, B"
3792319, 2014-07-16 00:00:01, "B, C"
3792320, 2014-07-17 00:00:02, "D, E"
File csv2:
3792319, 2014-07-15 00:02:00, "X, Y"
3792320, 2014-07-11 00:03:00, "S, T"
3792318, 2014-07-16 00:02:01, "W, Z"
File csv3:
3792319, 2014-07-10 00:04:00, "M"
3792320, 2014-07-09 00:06:00, "N"
3792318, 2014-07-05 00:07:01, "P"
3792318, 2014-07-16 00:08:01, "Q"

Run:

$ txr join.txr csv1 csv2 csv3
3792319, 2014-07-16 00:00:01, "B, C", 3792319, 2014-07-15 00:02:00, "X, Y", 3792319, 2014-07-10 00:04:00, M
3792318, 2014-07-15 00:00:00, "A, B", 3792318, 2014-07-16 00:02:01, "W, Z", 3792318, 2014-07-05 00:07:01, P
3792318, 2014-07-15 00:00:00, "A, B", 3792318, 2014-07-16 00:02:01, "W, Z", 3792318, 2014-07-16 00:08:01, Q
3792320, 2014-07-17 00:00:02, "D, E", 3792320, 2014-07-11 00:03:00, "S, T", 3792320, 2014-07-09 00:06:00, N
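
For comparison, the same kind of hash join can be sketched in Python with the standard csv module. This is a minimal sketch rather than a translation of the TXR program: the function names read_groups, join_groups and join_csv are made up for illustration, and the sample files are inlined as strings. The skipinitialspace=True setting is what lets the reader accept a space between a comma and an opening quote, as in the sample data.

```python
import csv
import io
import sys
from collections import defaultdict
from functools import reduce

CSV1 = '''3792318, 2014-07-15 00:00:00, "A, B"
3792319, 2014-07-16 00:00:01, "B, C"
3792320, 2014-07-17 00:00:02, "D, E"
'''
CSV2 = '''3792319, 2014-07-15 00:02:00, "X, Y"
3792320, 2014-07-11 00:03:00, "S, T"
3792318, 2014-07-16 00:02:01, "W, Z"
'''
CSV3 = '''3792319, 2014-07-10 00:04:00, "M"
3792320, 2014-07-09 00:06:00, "N"
3792318, 2014-07-05 00:07:01, "P"
3792318, 2014-07-16 00:08:01, "Q"
'''

def read_groups(text):
    """Parse CSV text and group the rows by their first field (the key)."""
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if row:  # ignore blank lines
            groups[row[0]].append(row)
    return groups

def join_groups(left, right):
    """Keep keys present on both sides and pair every left row
    with every right row that shares the key."""
    return {key: [l + r for l in left[key] for r in right[key]]
            for key in left if key in right}

def join_csv(texts):
    """Fold the per-file hashes together from left to right."""
    return reduce(join_groups, [read_groups(t) for t in texts])

joined = join_csv([CSV1, CSV2, CSV3])

# csv.writer re-quotes any field that contains a comma.
writer = csv.writer(sys.stdout, lineterminator="\n")
for rows in joined.values():
    writer.writerows(rows)
```

As in the TXR version, no sorting is needed, and the key that appears twice in the third file produces two joined rows.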

A more "correct" csv-parse function:

   ;; Include the comma separators as tokens; then parse the token
   ;; list, recognizing consecutive comma tokens as an empty field,
   ;; and stripping leading/trailing whitespace and quotes.
   (defun csv-parse (str)
     (labels ((clean (str)
                (set str (trim-str str))
                (if (and (= [str 0] #\")
                         (= [str -1] #\"))
                  [str 1..-1]
                  str))
              (post-process (tokens)
                (tree-case tokens
                  ((tok sep . rest)
                   (if (equal tok ",")
                     ^("" ,*(post-process (cons sep rest)))
                     ^(,(clean tok) ,*(post-process rest))))
                  ((tok . rest)
                   (if (equal tok ",")
                     '("")
                     ^(,(clean tok)))))))
       (post-process (tok-str str #/[^,]+|"[^"]*"|,/))))
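
The idea in the comment above (keep the commas as tokens, treat two commas in a row as an empty field, then strip whitespace and quotes) can also be sketched in Python. This is a rough analogue, not a port: the names TOKEN, clean and csv_parse are chosen here for illustration, the quoted alternative is listed first because Python tries regex alternatives in order, and edge cases such as a trailing comma may be handled differently from the TXR function. Doubled quotes inside a field are not supported.

```python
import re

# A token is a quoted string (with optional surrounding blanks),
# a run of non-comma characters, or a single comma.
TOKEN = re.compile(r'\s*"[^"]*"\s*|[^,]+|,')

def clean(tok):
    """Strip surrounding whitespace and one pair of enclosing quotes."""
    tok = tok.strip()
    if len(tok) >= 2 and tok[0] == '"' and tok[-1] == '"':
        return tok[1:-1]
    return tok

def csv_parse(line):
    """Split one CSV line into a list of cleaned fields."""
    fields = []
    expect_field = True  # True at line start and right after a comma
    for tok in TOKEN.findall(line):
        if tok == ",":
            if expect_field:
                fields.append("")  # comma with no field before it
            expect_field = True
        else:
            fields.append(clean(tok))
            expect_field = False
    if expect_field:
        fields.append("")  # empty line or trailing comma
    return fields

print(csv_parse('3792318, 2014-07-15 00:00:00, "A, B"'))
print(csv_parse('a,,b'))
```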


27.01.2020, 23:38

score 1 · Accepted Answer · 27.01.2020, 23:38

Obviously, using a CSV parser would be better, but if we can safely assume that

1. The first field will never contain a comma;
2. You only want ids that are present in the first file (an id that appears in file2 or file3 but not in file1 is ignored);
3. The files are small enough to fit in RAM;

then this Perl approach should work:

#!/usr/bin/env perl
use strict;
use warnings;

my %f;
## Read the files
while (<>) {
    ## remove trailing newlines
    chomp;
    ## Replace any commas within quotes with '|'.
    ## The while loop repeats the substitution until no quoted comma is left.
    while (s/"([^"]*?),([^"]*?)"/"$1|$2"/) {}
    ## Match the id and the rest; skip lines that do not match
    ## (for example empty lines), since $1 and $2 would not be
    ## set from the current line.
    next unless /^(.+?)(,.+)/;
    ## The keys of the %f hash are the ids;
    ## each line with the same id is appended to
    ## the current value of the key in the hash.
    $f{$1} .= $2;
}
## Print the lines (in hash order, so the output order is not defined)
foreach my $id (keys(%f)) {
    print "$id$f{$id}\n";
}

Save the script above as foo.pl and run it as follows:

perl foo.pl file1.csv file2.csv file3.csv

The script above can also be written as a one-liner:

perl -lne 'while(s/\"([^"]*?),([^"]*)\"/"$1|$2"/){} /^(.+?)(,.+)/; $k{$1}.=$2; 
           END{print "$_$k{$_}" for keys(%k)}' file1 file2 file3

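
The Perl script keeps every id it reads and prints them in hash order, so the second assumption in the list is not enforced by the code itself. A possible variant that does enforce it is sketched below in Python. It is not the answer's method: the function name merge_on_first_field and the sample lines are invented for illustration. Because each line is split at its first comma only, commas inside quoted fields never need a placeholder here; the sketch still relies on the first field containing no comma.

```python
def merge_on_first_field(first_lines, *other_line_lists):
    """Append the remainder of every line to the entry of its id,
    keeping only ids that occur in the first input."""
    merged = {}  # dicts preserve insertion order, so output follows file 1
    for line in first_lines:
        key, sep, rest = line.rstrip("\n").partition(",")
        if sep:  # skip lines without any comma
            merged[key] = merged.get(key, "") + sep + rest
    for lines in other_line_lists:
        for line in lines:
            key, sep, rest = line.rstrip("\n").partition(",")
            if sep and key in merged:  # ids unknown to file 1 are ignored
                merged[key] += sep + rest
    return [key + rest for key, rest in merged.items()]

# Hypothetical sample input, standing in for two small files.
file1 = ['1, a, "x, y"\n', '2, b, "z"\n']
file2 = ['2, c, "p, q"\n', '3, d, "r"\n']

result = merge_on_first_field(file1, file2)
for line in result:
    print(line)
```

With real files, the arguments could be open file objects, since iterating over a file yields its lines.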
