From 86686314e037d0eb38cbc2e77f412ae96519a76e Mon Sep 17 00:00:00 2001 From: FourierTransformer Date: Sat, 4 Apr 2020 13:47:24 -0500 Subject: [PATCH] New Release 1.2.0 (#26) ## Features * Can now parse files line by line in a fixed-size reading mode * Now has an option to ignore quotes when parsing ## Improvements * Speed increases in vanilla Lua and LuaJIT (benchmarks updated!) * Refactored code for easier maintenance ## Bugfixes * Better handling of multiple escaped quotes in vanilla lua (thanks @fredrikj83 #25) --- ERRORS.md | 11 +- README.md | 101 ++-- ftcsv-1.1.6-1.rockspec | 30 -- ftcsv-1.2.0-1.rockspec | 35 ++ ftcsv.lua | 934 ++++++++++++++++++++------------- spec/dynamic_features_spec.lua | 28 + spec/error_spec.lua | 29 +- spec/feature_spec.lua | 32 +- spec/parseLine_spec.lua | 76 +++ spec/parse_encode_spec.lua | 18 +- 10 files changed, 866 insertions(+), 428 deletions(-) delete mode 100644 ftcsv-1.1.6-1.rockspec create mode 100644 ftcsv-1.2.0-1.rockspec create mode 100644 spec/parseLine_spec.lua diff --git a/ERRORS.md b/ERRORS.md index 7a136e4..1d08464 100644 --- a/ERRORS.md +++ b/ERRORS.md @@ -1,9 +1,9 @@ -#Error Handling +# Error Handling Below you can find a more detailed explanation of some of the errors that can be encountered while using ftcsv. For parsing, examples of these files can be found in /spec/bad_csvs/ -##Parsing +## Parsing Note: `[row_number]` indicates the row number of the parsed lua table. As such, it will be one off from the line number in the csv. However, for header-less files, the row returned *will* match the csv line number. | Error Message | Detailed Explanation | @@ -12,4 +12,9 @@ Note: `[row_number]` indicates the row number of the parsed lua table. As such, | ftcsv: Cannot parse a file which contains empty headers | If a header field contains no information, then it can't be parsed
(ex: `Name,City,,Zipcode`) | | ftcsv: too few columns in row [row_number] | The number of columns is less than the amount in the header after transformations (renaming, keeping certain fields, etc) | | ftcsv: too many columns in row [row_number] | The number of columns is greater than the amount in the header after transformations. It can't map the field's count with an existing header. | -| ftcsv: File not found at [path] | When loading, lua can't open the file at [path] | \ No newline at end of file +| ftcsv: File not found at [path] | When loading, lua can't open the file at [path] | +| ftcsv: fieldsToKeep only works with header-less files when using the 'rename' functionality | When dealing with header-less files, you can only use `fieldsToKeep` if you also use `rename`. The fields are limited after the renaming happens. | +| ftcsv: bufferSize needs to be larger to parse this file | The buffer size selected is too small to parse the file. It must be at least the length of the longest row (but, for performance, should probably be a bit larger). | +| ftcsv: parseLine currently doesn't support loading from string | `parseLine` relies on reading a file a few bytes at a time and currently doesn't work on strings. | +| ftcsv: bufferSize can only be specified using 'parseLine'. When using 'parse', the entire file is read into memory | `bufferSize` can't be specified for `parse`; it can only be specified for `parseLine`. | + diff --git a/README.md b/README.md index f5c16ef..dc46096 100644 --- a/README.md +++ b/README.md @@ -1,15 +1,9 @@ # ftcsv [![Build Status](https://travis-ci.org/FourierTransformer/ftcsv.svg?branch=master)](https://travis-ci.org/FourierTransformer/ftcsv) [![Coverage Status](https://coveralls.io/repos/github/FourierTransformer/ftcsv/badge.svg?branch=master)](https://coveralls.io/github/FourierTransformer/ftcsv?branch=master) -ftcsv is a fast pure lua csv library. 
- -It works well for CSVs that can easily be fully loaded into memory (easily up to a hundred MB) and correctly handles `\n` (LF), `\r` (CR) and `\r\n` (CRLF) line endings. It has UTF-8 support, and will strip out the BOM if it exists. ftcsv can also parse headerless csv-like files and supports column remapping, file or string based loading, and more! - -Currently, there isn't a "large" file mode with proper readers for ingesting large CSVs using a fixed amount of memory, but that is in the works in [another branch!](https://github.com/FourierTransformer/ftcsv/tree/parseLineIterator) - -It's been tested with LuaJIT 2.0/2.1 and Lua 5.1, 5.2, and 5.3 - +ftcsv is a fast csv library written in pure Lua. It's been tested with LuaJIT 2.0/2.1 and Lua 5.1, 5.2, and 5.3. +It features two parsing modes: one for CSVs that can easily be loaded into memory (up to a few hundred MBs depending on the system), and another for loading files using an iterator - useful for manipulating large files or processing during load. It correctly handles most csv (and csv-like) files found in the wild, including varying line endings (Windows, Linux, and OS9), a UTF-8 BOM, and odd delimiters. There are also various options that tweak how a file is loaded, such as grabbing only a few fields, renaming fields, and parsing header-less files! ## Installing You can either grab `ftcsv.lua` from here or install via luarocks: @@ -20,9 +14,11 @@ luarocks install ftcsv ## Parsing -### `ftcsv.parse(fileName, delimiter [, options])` +There are two main parsing methods: `ftcsv.parse` and `ftcsv.parseLine`. +`ftcsv.parse` loads the entire file and parses it, while `ftcsv.parseLine` is an iterator that parses one line at a time. -ftcsv will load the entire csv file into memory, then parse it in one go, returning a lua table with the parsed data and a lua table containing the column headers. It has only two required parameters - a file name and delimiter (limited to one character). 
A few optional parameters can be passed in via a table (examples below). +### `ftcsv.parse(fileName, delimiter [, options])` +`ftcsv.parse` will load the entire csv file into memory, then parse it in one go, returning a lua table with the parsed data and a lua table containing the column headers. It has only two required parameters - a file name and delimiter (limited to one character). A few optional parameters can be passed in via a table (examples below). Just loading a csv file: ```lua @@ -30,11 +26,28 @@ local ftcsv = require('ftcsv') local zipcodes, headers = ftcsv.parse("free-zipcode-database.csv", ",") ``` -### Options -The following are optional parameters passed in via the third argument as a table. For example if you wanted to `loadFromString` and not use `headers`, you could use the following: +### `ftcsv.parseLine(fileName, delimiter [, options])` +`ftcsv.parseLine` will open a file and read `options.bufferSize` bytes of the file. `bufferSize` defaults to 2^16 bytes (which provides the fastest parsing on most unix-based systems), but can be specified in the options. `ftcsv.parseLine` is an iterator and returns one line at a time. When all the lines in the buffer are read, it will read in another `bufferSize` bytes of the file and repeat the process until the entire file has been read. + +If specifying `bufferSize`, there are a few things to remember: + * `bufferSize` must be at least the length of the longest row. + * If `bufferSize` is too small, an error is returned. + * If `bufferSize` is the length of the entire file, all of it will be read and returned one line at a time (performance is roughly the same as `ftcsv.parse`). 
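The bullet points above can be sketched as follows. This is a hedged example: the 1 MB buffer value is purely illustrative (not a recommendation), and `free-zipcode-database.csv` is the sample file used elsewhere in this README:

```lua
local ftcsv = require("ftcsv")

-- read the file in 1 MB chunks instead of the 2^16-byte default;
-- the buffer must be at least as long as the longest row in the file
for zipcode in ftcsv.parseLine("free-zipcode-database.csv", ",", {bufferSize = 2^20}) do
    print(zipcode.Zipcode)
end
```

If `bufferSize` is smaller than the longest row, the iterator errors out with "ftcsv: bufferSize needs to be larger to parse this file".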
+ +Parsing through a csv file: ```lua -ftcsv.parse("apple,banana,carrot", ",", {loadFromString=true, headers=false}) +local ftcsv = require("ftcsv") +for zipcode in ftcsv.parseLine("free-zipcode-database.csv", ",") do + print(zipcode.Zipcode) + print(zipcode.State) +end ``` + + +### Options +The options are the same for `parseLine` and `parse`, with the exception of `loadFromString` and `bufferSize`. `loadFromString` only works with `parse` and `bufferSize` can only be specified for `parseLine`. + +The following are optional parameters passed in via the third argument as a table. - `loadFromString` If you want to load a csv from a string instead of a file, set `loadFromString` to `true` (default: `false`) @@ -64,6 +77,17 @@ ftcsv.parse("apple,banana,carrot", ",", {loadFromString=true, headers=false}) local actual = ftcsv.parse("a,b,c\r\napple,banana,carrot\r\n", ",", options) ``` + Also note: if you apply a function to the headers via `headerFunc` and also want to select fields with `fieldsToKeep`, then `fieldsToKeep` must contain the post-modified header names. + + - `ignoreQuotes` + + If `ignoreQuotes` is `true`, it will leave all quotes in the final parsed output. This is useful in situations where the fields aren't quoted but contain quotes, or if the CSV didn't handle quotes correctly and you're trying to parse it. + + ```lua + local options = {loadFromString=true, ignoreQuotes=true} + local actual = ftcsv.parse('a,b,c\n"apple,banana,carrot', ",", options) + ``` + - `headerFunc` Applies a function to every field in the header. If you are using `rename`, the function is applied after the rename. @@ -92,13 +116,17 @@ ftcsv.parse("apple,banana,carrot", ",", {loadFromString=true, headers=false}) In the above example, the first field becomes 'a', the second field becomes 'b' and so on. 
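Since `fieldsToKeep` operates on header names, a header-less file needs `rename` to give its columns names first (this is the constraint described in [ERRORS.md](ERRORS.md)). A hedged sketch of combining the two — the column names and data here are made up for illustration:

```lua
local ftcsv = require("ftcsv")

-- name the three columns of a header-less csv, then keep only two of them;
-- fieldsToKeep must use the post-rename names
local options = {
    loadFromString = true,
    headers = false,
    rename = {"fruit", "color", "price"},
    fieldsToKeep = {"fruit", "price"},
}
local rows = ftcsv.parse("apple,red,1.50\nbanana,yellow,0.25", ",", options)
-- each row now has only the "fruit" and "price" fields
```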
-For all tested examples, take a look in /spec/feature_spec.lua and /spec/dynamic_features_spec.lua +For all tested examples, take a look in /spec/feature_spec.lua +The options can be strung together. For example, if you wanted to `loadFromString` and not use `headers`, you could use the following: +```lua +ftcsv.parse("apple,banana,carrot", ",", {loadFromString=true, headers=false}) +``` ## Encoding ### `ftcsv.encode(inputTable, delimiter[, options])` -ftcsv can also take a lua table and turn it into a text string to be written to a file. It has two required parameters, an inputTable and a delimiter. You can use it to write out a file like this: +`ftcsv.encode` takes in a lua table and turns it into a text string that can be written to a file. It has two required parameters, an inputTable and a delimiter. You can use it to write out a file like this: ```lua local fileOutput = ftcsv.encode(users, ",") local file = assert(io.open("ALLUSERS.csv", "w")) @@ -116,54 +144,53 @@ file:close() ``` -## Error Handling -ftcsv returns a bunch of errors when passed a bad csv file or incorrect parameters. You can find a more detailed explanation of the more cryptic errors in [ERRORS.md](ERRORS.md) +## Error Handling +ftcsv returns a litany of errors when passed a bad csv file or incorrect parameters. 
You can find a more detailed explanation of the more cryptic errors in [ERRORS.md](ERRORS.md) ## Benchmarks We ran ftcsv against a few different csv parsers ([PIL](http://www.lua.org/pil/20.4.html)/[csvutils](http://lua-users.org/wiki/CsvUtils), [lua_csv](https://github.com/geoffleyland/lua-csv), and [lpeg_josh](http://lua-users.org/lists/lua-l/2009-08/msg00020.html)) for lua and here is what we found: -### 20 MB file, every field is double quoted (ftcsv optimal lua case\*) +### 20 MB file, every field is double quoted | Parser | Lua | LuaJIT | | --------- | ------------------ | ------------------ | -| PIL/csvutils | 3.939 +/- 0.565 SD | 1.429 +/- 0.175 SD | -| lua_csv | 8.487 +/- 0.156 SD | 3.095 +/- 0.206 SD | -| lpeg_josh | **1.350 +/- 0.191 SD** | 0.826 +/- 0.176 SD | -| ftcsv | 3.101 +/- 0.152 SD | **0.499 +/- 0.133 SD** | +| PIL/csvutils | 1.754 +/- 0.136 SD | 1.012 +/- 0.112 SD | +| lua_csv | 4.191 +/- 0.128 SD | 2.382 +/- 0.133 SD | +| lpeg_josh | **0.996 +/- 0.149 SD** | 0.725 +/- 0.083 SD | +| ftcsv | 1.342 +/- 0.130 SD | **0.301 +/- 0.099 SD** | -\* see Performance section below for an explanation ### 12 MB file, some fields are double quoted | Parser | Lua | LuaJIT | | --------- | ------------------ | ------------------ | -| PIL/csvutils | 2.868 +/- 0.101 SD | 1.244 +/- 0.129 SD | -| lua_csv | 7.773 +/- 0.083 SD | 3.495 +/- 0.172 SD | -| lpeg_josh | **1.146 +/- 0.191 SD** | 0.564 +/- 0.121 SD | -| ftcsv | 3.401 +/- 0.109 SD | **0.441 +/- 0.124 SD** | +| PIL/csvutils | 1.456 +/- 0.083 SD | 0.691 +/- 0.071 SD | +| lua_csv | 3.738 +/- 0.072 SD | 1.997 +/- 0.075 SD | +| lpeg_josh | **0.638 +/- 0.070 SD** | 0.475 +/- 0.042 SD | +| ftcsv | 1.307 +/- 0.071 SD | **0.213 +/- 0.062 SD** | [LuaCSV](http://lua-users.org/lists/lua-l/2009-08/msg00012.html) was also tried, but usually errored out at odd places during parsing. NOTE: times are measured using `os.clock()`, so they are in CPU seconds. Each test was run 30 times in a randomized order. 
The file was pre-loaded, and only the csv decoding time was measured. -Benchmarks were run under ftcsv 1.1.6 +Benchmarks were run under ftcsv 1.2.0 ## Performance -We did some basic testing and found that in lua, if you want to iterate over a string character-by-character and look for single chars, `string.byte` performs faster than `string.sub`. This is especially true for LuaJIT. As such, in LuaJIT, ftcsv iterates over the whole file and does byte compares to find quotes and delimiters. However, for pure lua, `string.find` is used to find quotes but `string.byte` is used everywhere else as the CSV format in its proper form will have quotes around fields. If you have thoughts on how to improve performance (either big picture or specifically within the code), create a GitHub issue - I'd love to hear about it! +I did some basic testing and found that in lua, if you want to iterate over a string character-by-character and compare chars, `string.byte` performs faster than `string.sub`. As such, under LuaJIT, ftcsv iterates over the whole file and does byte compares to find quotes and delimiters, generating a table as it goes. Under vanilla lua, it proved faster to use `string.find` than to iterate character by character, so ftcsv accounts for that and will use the fastest option that is available. If you have thoughts on how to improve performance (either big picture or specifically within the code), create a GitHub issue - I'd love to hear about it! ## Contributing Feel free to create a new issue for any bugs you've found or help you need. If you want to contribute back to the project, please do the following: - 0. 
If it's a major change (aka more than a quick bugfix), please create an issue so we can discuss it! + 2. Fork the repo + 3. Create a new branch + 4. Push your changes to the branch + 5. Run the test suite and make sure it still works + 6. Submit a pull request + 7. Wait for review + 8. Enjoy the changes made! diff --git a/ftcsv-1.1.6-1.rockspec b/ftcsv-1.1.6-1.rockspec deleted file mode 100644 index 0e73004..0000000 --- a/ftcsv-1.1.6-1.rockspec +++ /dev/null @@ -1,30 +0,0 @@ -package = "ftcsv" -version = "1.1.6-1" - -source = { - url = "git://github.com/FourierTransformer/ftcsv.git", - tag = "1.1.6" -} - -description = { - summary = "A fast pure lua csv library (parser and encoder)", - detailed = [[ - ftcsv works well for CSVs that can easily be fully loaded into memory (easily up to a hundred MB) and correctly handles `\n` (LF), `\r` (CR) and `\r\n` (CRLF) line endings. It has UTF-8 support, and will strip out the BOM if it exists. ftcsv can also parse headerless csv-like files and supports column remapping, file or string based loading, and more! - - Note: Currently it cannot load CSV files where the file can't fit in memory. - ]], - homepage = "https://github.com/FourierTransformer/ftcsv", - maintainer = "Shakil Thakur ", - license = "MIT" -} - -dependencies = { - "lua >= 5.1, <5.4", -} - -build = { - type = "builtin", - modules = { - ["ftcsv"] = "ftcsv.lua" - }, -} \ No newline at end of file diff --git a/ftcsv-1.2.0-1.rockspec b/ftcsv-1.2.0-1.rockspec new file mode 100644 index 0000000..8319536 --- /dev/null +++ b/ftcsv-1.2.0-1.rockspec @@ -0,0 +1,35 @@ +package = "ftcsv" +version = "1.2.0-1" + +source = { + url = "git://github.com/FourierTransformer/ftcsv.git", + tag = "1.2.0" +} + +description = { + summary = "A fast pure lua csv library (parser and encoder)", + detailed = [[ + ftcsv is a fast and easy to use csv library for lua. It can read in CSV files, + do some basic transformations (rename fields) and can create the csv format. 
It supports UTF-8, header-less CSVs, and maintaining correct line endings for + multi-line fields. + + It supports loading an entire CSV file into memory and parsing it as well as + buffered reading of a CSV file. + ]], + homepage = "https://github.com/FourierTransformer/ftcsv", + maintainer = "Shakil Thakur ", + license = "MIT" +} + +dependencies = { + "lua >= 5.1, <5.4", +} + +build = { + type = "builtin", + modules = { + ["ftcsv"] = "ftcsv.lua" + }, +} + diff --git a/ftcsv.lua b/ftcsv.lua index a923614..74b455c 100644 --- a/ftcsv.lua +++ b/ftcsv.lua @@ -1,11 +1,11 @@ local ftcsv = { - _VERSION = 'ftcsv 1.1.5', + _VERSION = 'ftcsv 1.2.0', _DESCRIPTION = 'CSV library for Lua', _URL = 'https://github.com/FourierTransformer/ftcsv', _LICENSE = [[ The MIT License (MIT) - Copyright (c) 2016-2018 Shakil Thakur + Copyright (c) 2016-2020 Shakil Thakur Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal @@ -27,27 +27,29 @@ local ftcsv = { ]] } --- lua 5.1 load compat -local M = {} -if type(jit) == 'table' or _ENV then - M.load = _G.load -else - M.load = loadstring -end - -- perf local sbyte = string.byte local ssub = string.sub +-- luajit/lua compatibility layer +local luaCompatibility = {} +if type(jit) == 'table' or _ENV then + -- luajit and lua 5.2+ + luaCompatibility.load = _G.load +else + -- lua 5.1 + luaCompatibility.load = loadstring +end + -- luajit specific speedups -- luajit performs faster with iterating over string.byte, -- whereas vanilla lua performs faster with string.find if type(jit) == 'table' then + luaCompatibility.LuaJIT = true -- finds the end of an escape sequence - function M.findClosingQuote(i, inputLength, inputString, quote, doubleQuoteEscape) + function luaCompatibility.findClosingQuote(i, inputLength, inputString, quote, doubleQuoteEscape) local currentChar, nextChar = sbyte(inputString, i), nil while i <= inputLength do - -- print(i) 
nextChar = sbyte(inputString, i+1) -- this one deals with " double quotes that are escaped "" within single quotes " @@ -59,7 +61,6 @@ if type(jit) == 'table' then -- identifies the escape toggle elseif currentChar == quote and nextChar ~= quote then - -- print("exiting", i-1) return i-1, doubleQuoteEscape else i = i + 1 @@ -69,231 +70,24 @@ if type(jit) == 'table' then end else + luaCompatibility.LuaJIT = false + -- vanilla lua closing quote finder - function M.findClosingQuote(i, inputLength, inputString, quote, doubleQuoteEscape) + function luaCompatibility.findClosingQuote(i, inputLength, inputString, quote, doubleQuoteEscape) local j, difference i, j = inputString:find('"+', i) - if j == nil then return end - if i == nil then - return inputLength-1, doubleQuoteEscape + if j == nil then + return nil end difference = j - i - -- print("difference", difference, "I", i, "J", j) if difference >= 1 then doubleQuoteEscape = true end - if difference == 1 then - return M.findClosingQuote(j+1, inputLength, inputString, quote, doubleQuoteEscape) + if difference % 2 == 1 then + return luaCompatibility.findClosingQuote(j+1, inputLength, inputString, quote, doubleQuoteEscape) end return j-1, doubleQuoteEscape end - end --- load an entire file into memory -local function loadFile(textFile) - local file = io.open(textFile, "r") - if not file then error("ftcsv: File not found at " .. 
textFile) end - local allLines = file:read("*all") - file:close() - return allLines -end - --- creates a new field -local function createField(inputString, quote, fieldStart, i, doubleQuoteEscape) - local field - -- so, if we just recently de-escaped, we don't want the trailing " - if sbyte(inputString, i-1) == quote then - -- print("Skipping last \"") - field = ssub(inputString, fieldStart, i-2) - else - field = ssub(inputString, fieldStart, i-1) - end - if doubleQuoteEscape then - -- print("QUOTE REPLACE") - -- print(line[fieldNum]) - field = field:gsub('""', '"') - end - return field -end - --- main function used to parse -local function parseString(inputString, inputLength, delimiter, i, headerField, fieldsToKeep) - - -- keep track of my chars! - local currentChar, nextChar = sbyte(inputString, i), nil - local skipChar = 0 - local field - local fieldStart = i - local fieldNum = 1 - local lineNum = 1 - local doubleQuoteEscape, emptyIdentified = false, false - local exit = false - - --bytes - local CR = sbyte("\r") - local LF = sbyte("\n") - local quote = sbyte('"') - local delimiterByte = sbyte(delimiter) - - local assignValue - local outResults - -- outResults[1] = {} - -- the headers haven't been set yet. - -- aka this is the first run! - if headerField == nil then - headerField = {} - assignValue = function() - headerField[fieldNum] = field - emptyIdentified = false - return true - end - else - outResults = {} - outResults[1] = {} - assignValue = function() - emptyIdentified = false - if headerField[fieldNum] then - outResults[lineNum][headerField[fieldNum]] = field - else - error('ftcsv: too many columns in row ' .. 
lineNum) - end - end - end - - -- calculate the initial line count (note: this can include duplicates) - local headerFieldsExist = {} - local initialLineCount = 0 - for _, value in pairs(headerField) do - if not headerFieldsExist[value] and (fieldsToKeep == nil or fieldsToKeep[value]) then - headerFieldsExist[value] = true - initialLineCount = initialLineCount + 1 - end - end - - while i <= inputLength do - -- go by two chars at a time! currentChar is set at the bottom. - -- currentChar = string.byte(inputString, i) - nextChar = sbyte(inputString, i+1) - -- print(i, string.char(currentChar), string.char(nextChar)) - - -- empty string - if currentChar == quote and nextChar == quote then - skipChar = 1 - fieldStart = i + 2 - emptyIdentified = true - -- print("fs+2:", fieldStart) - - -- identifies the escape toggle. - -- This can only happen if fields have quotes around them - -- so the current "start" has to be where a quote character is. - elseif currentChar == quote and nextChar ~= quote and fieldStart == i then - -- print("New Quoted Field", i) - fieldStart = i + 1 - - -- if an empty field was identified before assignment, it means - -- that this is a quoted field that starts with escaped quotes - -- ex: """a""" - if emptyIdentified then - fieldStart = fieldStart - 2 - emptyIdentified = false - end - - i, doubleQuoteEscape = M.findClosingQuote(i+1, inputLength, inputString, quote, doubleQuoteEscape) - -- print("I VALUE", i, doubleQuoteEscape) - skipChar = 1 - - -- create some fields if we can! 
- elseif currentChar == delimiterByte then - -- create the new field - -- print(headerField[fieldNum]) - if fieldsToKeep == nil or fieldsToKeep[headerField[fieldNum]] then - field = createField(inputString, quote, fieldStart, i, doubleQuoteEscape) - -- print("FIELD", field, "FIELDEND", headerField[fieldNum], lineNum) - -- outResults[headerField[fieldNum]][lineNum] = field - assignValue() - end - doubleQuoteEscape = false - - fieldNum = fieldNum + 1 - fieldStart = i + 1 - -- print("fs+1:", fieldStart) - - -- newline?! - elseif (currentChar == CR or currentChar == LF) then - if fieldsToKeep == nil or fieldsToKeep[headerField[fieldNum]] then - -- create the new field - field = createField(inputString, quote, fieldStart, i, doubleQuoteEscape) - - exit = assignValue() - if exit then - if (currentChar == CR and nextChar == LF) then - return headerField, i + 1 - else - return headerField, i - end - end - end - doubleQuoteEscape = false - - -- determine how line ends - if (currentChar == CR and nextChar == LF) then - -- print("CRLF DETECTED") - skipChar = 1 - end - - -- incrememnt for new line - if fieldNum < initialLineCount then - error('ftcsv: too few columns in row ' .. lineNum) - end - lineNum = lineNum + 1 - outResults[lineNum] = {} - fieldNum = 1 - fieldStart = i + 1 + skipChar - -- print("fs:", fieldStart) - - end - - i = i + 1 + skipChar - if (skipChar > 0) then - currentChar = sbyte(inputString, i) - else - currentChar = nextChar - end - skipChar = 0 - end - - -- create last new field - if fieldsToKeep == nil or fieldsToKeep[headerField[fieldNum]] then - field = createField(inputString, quote, fieldStart, i, doubleQuoteEscape) - assignValue() - end - - -- if there's no newline, the parser doesn't return headers correctly... 
- -- ex: a,b,c - if outResults == nil then - return headerField, i-1 - end - - -- clean up last line if it's weird (this happens when there is a CRLF newline at end of file) - -- doing a count gets it to pick up the oddballs - local finalLineCount = 0 - local lastValue = nil - for _, v in pairs(outResults[lineNum]) do - finalLineCount = finalLineCount + 1 - lastValue = v - end - - -- this indicates a CRLF - -- print("Final/Initial", finalLineCount, initialLineCount) - if finalLineCount == 1 and lastValue == "" then - outResults[lineNum] = nil - - -- otherwise there might not be enough line - elseif finalLineCount < initialLineCount then - error('ftcsv: too few columns in row ' .. lineNum) - end - - return outResults -end -- determine the real headers as opposed to the header mapping local function determineRealHeaders(headerField, fieldsToKeep) @@ -313,25 +107,307 @@ local function determineRealHeaders(headerField, fieldsToKeep) return realHeaders end --- runs the show! -function ftcsv.parse(inputFile, delimiter, options) + +local function determineTotalColumnCount(headerField, fieldsToKeep) + local totalColumnCount = 0 + local headerFieldSet = {} + for _, header in pairs(headerField) do + -- count unique columns and + -- also figure out if it's a field to keep + if not headerFieldSet[header] and + (fieldsToKeep == nil or fieldsToKeep[header]) then + headerFieldSet[header] = true + totalColumnCount = totalColumnCount + 1 + end + end + return totalColumnCount +end + +local function generateHeadersMetamethod(finalHeaders) + -- if a header field tries to escape, we will simply return nil + -- the parser will still parse, but wont get the performance benefit of + -- having headers predefined + for _, headers in ipairs(finalHeaders) do + if headers:find("]") then + return nil + end + end + local rawSetup = "local t, k, _ = ... 
\ + rawset(t, k, {[ [[%s]] ]=true})" + rawSetup = rawSetup:format(table.concat(finalHeaders, "]] ]=true, [ [[")) + return luaCompatibility.load(rawSetup) +end + +-- main function used to parse +local function parseString(inputString, i, options) + + -- keep track of my chars! + local inputLength = options.inputLength or #inputString + local currentChar, nextChar = sbyte(inputString, i), nil + local skipChar = 0 + local field + local fieldStart = i + local fieldNum = 1 + local lineNum = 1 + local lineStart = i + local doubleQuoteEscape, emptyIdentified = false, false + + local skipIndex + local charPatternToSkip = "[" .. options.delimiter .. "\r\n]" + + --bytes + local CR = sbyte("\r") + local LF = sbyte("\n") + local quote = sbyte('"') + local delimiterByte = sbyte(options.delimiter) + + -- explode most used options + local headersMetamethod = options.headersMetamethod + local fieldsToKeep = options.fieldsToKeep + local ignoreQuotes = options.ignoreQuotes + local headerField = options.headerField + local endOfFile = options.endOfFile + local buffered = options.buffered + + local outResults = {} + + -- in the first run, the headers haven't been set yet. 
+ if headerField == nil then + headerField = {} + -- setup a metatable to simply return the key that's passed in + local headerMeta = {__index = function(_, key) return key end} + setmetatable(headerField, headerMeta) + end + + if headersMetamethod then + setmetatable(outResults, {__newindex = headersMetamethod}) + end + outResults[1] = {} + + -- totalColumnCount based on unique headers and fieldsToKeep + local totalColumnCount = options.totalColumnCount or determineTotalColumnCount(headerField, fieldsToKeep) + + local function assignValueToField() + if fieldsToKeep == nil or fieldsToKeep[headerField[fieldNum]] then + + -- create new field + if ignoreQuotes == false and sbyte(inputString, i-1) == quote then + field = ssub(inputString, fieldStart, i-2) + else + field = ssub(inputString, fieldStart, i-1) + end + if doubleQuoteEscape then + field = field:gsub('""', '"') + end + + -- reset flags + doubleQuoteEscape = false + emptyIdentified = false + + -- assign field in output + if headerField[fieldNum] ~= nil then + outResults[lineNum][headerField[fieldNum]] = field + else + error('ftcsv: too many columns in row ' .. options.rowOffset + lineNum) + end + end + end + + while i <= inputLength do + -- go by two chars at a time, + -- currentChar is set at the bottom. + nextChar = sbyte(inputString, i+1) + + -- empty string + if ignoreQuotes == false and currentChar == quote and nextChar == quote then + skipChar = 1 + fieldStart = i + 2 + emptyIdentified = true + + -- escape toggle. + -- This can only happen if fields have quotes around them + -- so the current "start" has to be where a quote character is. 
+ elseif ignoreQuotes == false and currentChar == quote and nextChar ~= quote and fieldStart == i then + fieldStart = i + 1 + -- if an empty field was identified before assignment, it means + -- that this is a quoted field that starts with escaped quotes + -- ex: """a""" + if emptyIdentified then + fieldStart = fieldStart - 2 + emptyIdentified = false + end + skipChar = 1 + i, doubleQuoteEscape = luaCompatibility.findClosingQuote(i+1, inputLength, inputString, quote, doubleQuoteEscape) + + -- create some fields + elseif currentChar == delimiterByte then + assignValueToField() + + -- increaseFieldIndices + fieldNum = fieldNum + 1 + fieldStart = i + 1 + + -- newline + elseif (currentChar == LF or currentChar == CR) then + assignValueToField() + + -- handle CRLF + if (currentChar == CR and nextChar == LF) then + skipChar = 1 + fieldStart = fieldStart + 1 + end + + -- increment for new line + if fieldNum < totalColumnCount then + -- sometimes in buffered mode, the buffer starts with a newline + -- this skips the newline and lets the parsing continue. + if buffered and lineNum == 1 and fieldNum == 1 and field == "" then + fieldStart = i + 1 + skipChar + lineStart = fieldStart + else + error('ftcsv: too few columns in row ' .. options.rowOffset + lineNum) + end + else + lineNum = lineNum + 1 + outResults[lineNum] = {} + fieldNum = 1 + fieldStart = i + 1 + skipChar + lineStart = fieldStart + end + + elseif luaCompatibility.LuaJIT == false then + skipIndex = inputString:find(charPatternToSkip, i) + if skipIndex then + skipChar = skipIndex - i - 1 + end + + end + + -- if in buffered mode and the closing quote can't be found, + -- we're likely in the middle of a buffer and need to backtrack + if i == nil then + if buffered then + outResults[lineNum] = nil + return outResults, lineStart + else + error("ftcsv: can't find closing quote in row " .. options.rowOffset + lineNum .. + ". 
Try running with the option ignoreQuotes=true if the source incorrectly uses quotes.") + end + end + + -- Increment Counter + i = i + 1 + skipChar + if (skipChar > 0) then + currentChar = sbyte(inputString, i) + else + currentChar = nextChar + end + skipChar = 0 + end + + if buffered and not endOfFile then + outResults[lineNum] = nil + return outResults, lineStart + end + + -- create last new field + assignValueToField() + + -- remove last field if empty + if fieldNum < totalColumnCount then + + -- indicates last field was really just a CRLF, + -- so, it can be removed + if fieldNum == 1 and field == "" then + outResults[lineNum] = nil + else + error('ftcsv: too few columns in row ' .. options.rowOffset + lineNum) + end + end + + return outResults, i, totalColumnCount +end + +local function handleHeaders(headerField, options) + -- make sure a header isn't empty + for _, headerName in ipairs(headerField) do + if #headerName == 0 then + error('ftcsv: Cannot parse a file which contains empty headers') + end + end + + -- for files where there aren't headers! + if options.headers == false then + for j = 1, #headerField do + headerField[j] = j + end + end + + -- rename fields as needed! + if options.rename then + -- basic rename (["a" = "apple"]) + for j = 1, #headerField do + if options.rename[headerField[j]] then + headerField[j] = options.rename[headerField[j]] + end + end + -- files without headers, but with a options.rename need to be handled too! + if #options.rename > 0 then + for j = 1, #options.rename do + headerField[j] = options.rename[j] + end + end + end + + -- apply some sweet header manipulation + if options.headerFunc then + for j = 1, #headerField do + headerField[j] = options.headerFunc(headerField[j]) + end + end + + return headerField +end + +-- load an entire file into memory +local function loadFile(textFile, amount) + local file = io.open(textFile, "r") + if not file then error("ftcsv: File not found at " .. 
textFile) end + local lines = file:read(amount) + if amount == "*all" then + file:close() + end + return lines, file +end + +local function initializeInputFromStringOrFile(inputFile, options, amount) + -- handle input via string or file! + local inputString, file + if options.loadFromString then + inputString = inputFile + else + inputString, file = loadFile(inputFile, amount) + end + + -- if they sent in an empty file... + if inputString == "" then + error('ftcsv: Cannot parse an empty file') + end + return inputString, file +end + +local function parseOptions(delimiter, options, fromParseLine) -- delimiter MUST be one character assert(#delimiter == 1 and type(delimiter) == "string", "the delimiter must be of string type and exactly one character") - -- OPTIONS yo - local header = true - local rename local fieldsToKeep = nil - local loadFromString = false - local headerFunc + if options then if options.headers ~= nil then assert(type(options.headers) == "boolean", "ftcsv only takes the boolean 'true' or 'false' for the optional parameter 'headers' (default 'true'). You passed in '" .. tostring(options.headers) .. "' of type '" .. type(options.headers) .. "'.") - header = options.headers end if options.rename ~= nil then assert(type(options.rename) == "table", "ftcsv only takes in a key-value table for the optional parameter 'rename'. You passed in '" .. tostring(options.rename) .. "' of type '" .. type(options.rename) .. "'.") - rename = options.rename end if options.fieldsToKeep ~= nil then assert(type(options.fieldsToKeep) == "table", "ftcsv only takes in a list (as a table) for the optional parameter 'fieldsToKeep'. You passed in '" .. tostring(options.fieldsToKeep) .. "' of type '" .. type(options.fieldsToKeep) .. 
"'.") @@ -342,91 +418,214 @@ function ftcsv.parse(inputFile, delimiter, options) fieldsToKeep[ofieldsToKeep[j]] = true end end - if header == false and options.rename == nil then + if options.headers == false and options.rename == nil then error("ftcsv: fieldsToKeep only works with header-less files when using the 'rename' functionality") end end if options.loadFromString ~= nil then assert(type(options.loadFromString) == "boolean", "ftcsv only takes a boolean value for optional parameter 'loadFromString'. You passed in '" .. tostring(options.loadFromString) .. "' of type '" .. type(options.loadFromString) .. "'.") - loadFromString = options.loadFromString end if options.headerFunc ~= nil then assert(type(options.headerFunc) == "function", "ftcsv only takes a function value for optional parameter 'headerFunc'. You passed in '" .. tostring(options.headerFunc) .. "' of type '" .. type(options.headerFunc) .. "'.") - headerFunc = options.headerFunc end - end - - -- handle input via string or file! - local inputString - if loadFromString then - inputString = inputFile + if options.ignoreQuotes == nil then + options.ignoreQuotes = false + else + assert(type(options.ignoreQuotes) == "boolean", "ftcsv only takes a boolean value for optional parameter 'ignoreQuotes'. You passed in '" .. tostring(options.ignoreQuotes) .. "' of type '" .. type(options.ignoreQuotes) .. "'.") + end + if options.bufferSize ~= nil then + assert(type(options.bufferSize) == "number", "ftcsv only takes a number value for optional parameter 'bufferSize'. You passed in '" .. tostring(options.bufferSize) .. "' of type '" .. type(options.bufferSize) .. "'.") + if fromParseLine == false then + error("ftcsv: bufferSize can only be specified using 'parseLine'. When using 'parse', the entire file is read into memory") + end + end else - inputString = loadFile(inputFile) - end - local inputLength = #inputString - - -- if they sent in an empty file... 
- if inputLength == 0 then - error('ftcsv: Cannot parse an empty file') + options = { + ["headers"] = true, + ["loadFromString"] = false, + ["ignoreQuotes"] = false, + ["bufferSize"] = 2^16 + } end - -- parse through the headers! - local startLine = 1 + return options, fieldsToKeep - -- check for BOM - if string.byte(inputString, 1) == 239 - and string.byte(inputString, 2) == 187 - and string.byte(inputString, 3) == 191 then - startLine = 4 - end - local headerField, i = parseString(inputString, inputLength, delimiter, startLine) - i = i + 1 -- start at the next char - - -- make sure a header isn't empty - for _, header in ipairs(headerField) do - if #header == 0 then - error('ftcsv: Cannot parse a file which contains empty headers') - end - end - - -- for files where there aren't headers! - if header == false then - i = startLine - for j = 1, #headerField do - headerField[j] = j - end - end - - -- rename fields as needed! - if rename then - -- basic rename (["a" = "apple"]) - for j = 1, #headerField do - if rename[headerField[j]] then - -- print("RENAMING", headerField[j], rename[headerField[j]]) - headerField[j] = rename[headerField[j]] - end - end - -- files without headers, but with a rename need to be handled too! 
- if #rename > 0 then - for j = 1, #rename do - headerField[j] = rename[j] - end - end - end - - -- apply some sweet header manipulation - if headerFunc then - for j = 1, #headerField do - headerField[j] = headerFunc(headerField[j]) - end - end - - local output = parseString(inputString, inputLength, delimiter, i, headerField, fieldsToKeep) - local realHeaders = determineRealHeaders(headerField, fieldsToKeep) - return output, realHeaders end --- a function that delimits " to "", used by the writer +local function findEndOfHeaders(str, entireFile) + local i = 1 + local quote = sbyte('"') + local newlines = { + [sbyte("\n")] = true, + [sbyte("\r")] = true + } + local quoted = false + local char = sbyte(str, i) + repeat + -- this should still work for escaped quotes + -- ex: " a "" b \r\n " -- there is always a pair around the newline + if char == quote then + quoted = not quoted + end + i = i + 1 + char = sbyte(str, i) + until (newlines[char] and not quoted) or char == nil + + if not entireFile and char == nil then + error("ftcsv: bufferSize needs to be larger to parse this file") + end + + local nextChar = sbyte(str, i+1) + if nextChar == sbyte("\n") and char == sbyte("\r") then + i = i + 1 + end + return i +end + +local function determineBOMOffset(inputString) + -- BOM files start with bytes 239, 187, 191 + if sbyte(inputString, 1) == 239 + and sbyte(inputString, 2) == 187 + and sbyte(inputString, 3) == 191 then + return 4 + else + return 1 + end +end + +local function parseHeadersAndSetupArgs(inputString, delimiter, options, fieldsToKeep, entireFile) + local startLine = determineBOMOffset(inputString) + + local endOfHeaderRow = findEndOfHeaders(inputString, entireFile) + + local parserArgs = { + delimiter = delimiter, + headerField = nil, + fieldsToKeep = nil, + inputLength = endOfHeaderRow, + buffered = false, + ignoreQuotes = options.ignoreQuotes, + rowOffset = 0 + } + + local rawHeaders, endOfHeaders = parseString(inputString, startLine, parserArgs) + + -- 
manipulate the headers as per the options + local modifiedHeaders = handleHeaders(rawHeaders[1], options) + parserArgs.headerField = modifiedHeaders + parserArgs.fieldsToKeep = fieldsToKeep + parserArgs.inputLength = nil + + if options.headers == false then endOfHeaders = startLine end + + local finalHeaders = determineRealHeaders(modifiedHeaders, fieldsToKeep) + if options.headers ~= false then + local headersMetamethod = generateHeadersMetamethod(finalHeaders) + parserArgs.headersMetamethod = headersMetamethod + end + + return endOfHeaders, parserArgs, finalHeaders +end + +-- runs the show! +function ftcsv.parse(inputFile, delimiter, options) + local options, fieldsToKeep = parseOptions(delimiter, options, false) + + local inputString = initializeInputFromStringOrFile(inputFile, options, "*all") + + local endOfHeaders, parserArgs, finalHeaders = parseHeadersAndSetupArgs(inputString, delimiter, options, fieldsToKeep, true) + + local output = parseString(inputString, endOfHeaders, parserArgs) + + return output, finalHeaders +end + +local function getFileSize (file) + local current = file:seek() + local size = file:seek("end") + file:seek("set", current) + return size +end + +local function determineAtEndOfFile(file, fileSize) + if file:seek() >= fileSize then + return true + else + return false + end +end + +local function initializeInputFile(inputString, options) + if options.loadFromString == true then + error("ftcsv: parseLine currently doesn't support loading from string") + end + return initializeInputFromStringOrFile(inputString, options, options.bufferSize) +end + +function ftcsv.parseLine(inputFile, delimiter, userOptions) + local options, fieldsToKeep = parseOptions(delimiter, userOptions, true) + local inputString, file = initializeInputFile(inputFile, options) + + + local fileSize, atEndOfFile = 0, false + fileSize = getFileSize(file) + atEndOfFile = determineAtEndOfFile(file, fileSize) + + local endOfHeaders, parserArgs, _ = 
parseHeadersAndSetupArgs(inputString, delimiter, options, fieldsToKeep, atEndOfFile) + parserArgs.buffered = true + parserArgs.endOfFile = atEndOfFile + + local parsedBuffer, endOfParsedInput, totalColumnCount = parseString(inputString, endOfHeaders, parserArgs) + parserArgs.totalColumnCount = totalColumnCount + + inputString = ssub(inputString, endOfParsedInput) + local bufferIndex, returnedRowsCount = 0, 0 + local currentRow, buffer + + return function() + -- check parsed buffer for value + bufferIndex = bufferIndex + 1 + currentRow = parsedBuffer[bufferIndex] + if currentRow then + returnedRowsCount = returnedRowsCount + 1 + return returnedRowsCount, currentRow + end + + -- read more of the input + buffer = file:read(options.bufferSize) + if not buffer then + file:close() + return nil + else + parserArgs.endOfFile = determineAtEndOfFile(file, fileSize) + end + + -- appends the new input to what was left over + inputString = inputString .. buffer + + -- re-analyze and load buffer + parserArgs.rowOffset = returnedRowsCount + parsedBuffer, endOfParsedInput = parseString(inputString, 1, parserArgs) + bufferIndex = 1 + + -- cut the input string down + inputString = ssub(inputString, endOfParsedInput) + + if #parsedBuffer == 0 then + error("ftcsv: bufferSize needs to be larger to parse this file") + end + + returnedRowsCount = returnedRowsCount + 1 + return returnedRowsCount, parsedBuffer[bufferIndex] + end +end + + + +-- The ENCODER code is below here +-- This could be broken out, but is kept here for portability + + local function delimitField(field) field = tostring(field) if field:find('"') then @@ -436,39 +635,64 @@ local function delimitField(field) end end +local function escapeHeadersForLuaGenerator(headers) + local escapedHeaders = {} + for i = 1, #headers do + if headers[i]:find('"') then + escapedHeaders[i] = headers[i]:gsub('"', '\\"') + else + escapedHeaders[i] = headers[i] + end + end + return escapedHeaders +end + -- a function that compiles some lua 
code to quickly print out the csv -local function writer(inputTable, dilimeter, headers) - -- they get re-created here if they need to be escaped so lua understands it based on how - -- they came in +local function csvLineGenerator(inputTable, delimiter, headers) + local escapedHeaders = escapeHeadersForLuaGenerator(headers) + + local outputFunc = [[ + local args, i = ... + i = i + 1; + if i > ]] .. #inputTable .. [[ then return nil end; + return i, '"' .. args.delimitField(args.t[i]["]] .. + table.concat(escapedHeaders, [["]) .. '"]] .. + delimiter .. [["' .. args.delimitField(args.t[i]["]]) .. + [["]) .. '"\r\n']] + + local arguments = {} + arguments.t = inputTable + -- we want to use the same delimitField throughout, + -- so we're just going to pass it in + arguments.delimitField = delimitField + + return luaCompatibility.load(outputFunc), arguments, 0 + +end + +local function validateHeaders(headers, inputTable) for i = 1, #headers do if inputTable[1][headers[i]] == nil then error("ftcsv: the field '" .. headers[i] .. "' doesn't exist in the inputTable") end - if headers[i]:find('"') then - headers[i] = headers[i]:gsub('"', '\\"') - end end - - local outputFunc = [[ - local state, i = ... - local d = state.delimitField - i = i + 1; - if i > state.tableSize then return nil end; - return i, '"' .. d(state.t[i]["]] .. table.concat(headers, [["]) .. '"]] .. dilimeter .. [["' .. d(state.t[i]["]]) .. [["]) .. '"\r\n']] - - -- print(outputFunc) - - local state = {} - state.t = inputTable - state.tableSize = #inputTable - state.delimitField = delimitField - - return M.load(outputFunc), state, 0 - end --- takes the values from the headers in the first row of the input table -local function extractHeaders(inputTable) +local function initializeOutputWithEscapedHeaders(escapedHeaders, delimiter) + local output = {} + output[1] = '"' .. table.concat(escapedHeaders, '"' .. delimiter .. '"') .. 
'"\r\n' + return output +end + +local function escapeHeadersForOutput(headers) + local escapedHeaders = {} + for i = 1, #headers do + escapedHeaders[i] = delimitField(headers[i]) + end + return escapedHeaders +end + +local function extractHeadersFromTable(inputTable) local headers = {} for key, _ in pairs(inputTable[1]) do headers[#headers+1] = key @@ -480,42 +704,42 @@ local function extractHeaders(inputTable) return headers end --- turns a lua table into a csv --- works really quickly with luajit-2.1, because table.concat life -function ftcsv.encode(inputTable, delimiter, options) - local output = {} - - -- dilimeter MUST be one character - assert(#delimiter == 1 and type(delimiter) == "string", "the delimiter must be of string type and exactly one character") - - -- grab the headers from the options if they are there +local function getHeadersFromOptions(options) local headers = nil if options then if options.fieldsToKeep ~= nil then - assert(type(options.fieldsToKeep) == "table", "ftcsv only takes in a list (as a table) for the optional parameter 'fieldsToKeep'. You passed in '" .. tostring(options.headers) .. "' of type '" .. type(options.headers) .. "'.") + assert( + type(options.fieldsToKeep) == "table", "ftcsv only takes in a list (as a table) for the optional parameter 'fieldsToKeep'. You passed in '" .. tostring(options.headers) .. "' of type '" .. type(options.headers) .. 
"'.") headers = options.fieldsToKeep end end + return headers +end + +local function initializeGenerator(inputTable, delimiter, options) + -- delimiter MUST be one character + assert(#delimiter == 1 and type(delimiter) == "string", "the delimiter must be of string type and exactly one character") + + local headers = getHeadersFromOptions(options) if headers == nil then - headers = extractHeaders(inputTable) + headers = extractHeadersFromTable(inputTable) end + validateHeaders(headers, inputTable) - -- newHeaders are needed if there are quotes within the header - -- because they need to be escaped - local newHeaders = {} - for i = 1, #headers do - if headers[i]:find('"') then - newHeaders[i] = headers[i]:gsub('"', '""') - else - newHeaders[i] = headers[i] - end - end - output[1] = '"' .. table.concat(newHeaders, '"' .. delimiter .. '"') .. '"\r\n' + local escapedHeaders = escapeHeadersForOutput(headers) + local output = initializeOutputWithEscapedHeaders(escapedHeaders, delimiter) + return output, headers +end - -- add each line by line. 
- for i, line in writer(inputTable, delimiter, headers) do +-- works really quickly with luajit-2.1, because table.concat life +function ftcsv.encode(inputTable, delimiter, options) + local output, headers = initializeGenerator(inputTable, delimiter, options) + + for i, line in csvLineGenerator(inputTable, delimiter, headers) do output[i+1] = line end + + -- combine and return final string return table.concat(output) end diff --git a/spec/dynamic_features_spec.lua b/spec/dynamic_features_spec.lua index 2b4b197..29bf291 100644 --- a/spec/dynamic_features_spec.lua +++ b/spec/dynamic_features_spec.lua @@ -460,4 +460,32 @@ describe("csv features", function() end end + for bom, i in pairs(BOM) do + for newline, j in pairs(newlines) do + for _, endline in ipairs(endlines) do + local name = "should handle ignoring quotes (%s + %s) EOF: %s" + it(name:format(bom, newline, endline), function() + local expectedHeaders = {"a", "b", "c"} + local expected = {} + expected[1] = {} + expected[1].a = '"apple"' + expected[1].b = '"banana"' + expected[1].c = '"carrot"' + + local defaultString = '%sa,b,c%s"apple","banana","carrot"%s' + + if endline == "NONE" then + defaultString = defaultString:format(i, j, "") + else + defaultString = defaultString:format(i, j, j) + end + + local options = {loadFromString=true, ignoreQuotes=true} + local actual, actualHeaders = ftcsv.parse(defaultString, ",", options) + assert.are.same(expected, actual) + assert.are.same(expectedHeaders, actualHeaders) + end) + end + end + end end) \ No newline at end of file diff --git a/spec/error_spec.lua b/spec/error_spec.lua index 5d8d412..d2b0863 100644 --- a/spec/error_spec.lua +++ b/spec/error_spec.lua @@ -41,4 +41,31 @@ it("should error out when you want to encode a table and specify a field that do end assert.has_error(test, "ftcsv: the field 'c' doesn't exist in the inputTable") -end) \ No newline at end of file +end) + +describe("parseLine features small, nonworking buffer size", function() + it("should 
error out when trying to load from string", function() + local test = function() + local parse = {} + for i, line in ftcsv.parseLine("a,b,c\n1,2,3", ",", {loadFromString=true}) do + parse[i] = line + end + return parse + end + assert.has_error(test, "ftcsv: parseLine currently doesn't support loading from string") + end) +end) + +it("should error when dealing with quotes", function() + local test = function() + local actual = ftcsv.parse('a,b,c\n"apple,banana,carrot', ",", {loadFromString=true}) + end + assert.has_error(test, "ftcsv: can't find closing quote in row 1. Try running with the option ignoreQuotes=true if the source incorrectly uses quotes.") +end) + +it("should error if bufferSize is set when parsing entire files", function() + local test = function() + local actual = ftcsv.parse('a,b,c\n"apple,banana,carrot', ",", {loadFromString=true, bufferSize=34}) + end + assert.has_error(test, "ftcsv: bufferSize can only be specified using 'parseLine'. When using 'parse', the entire file is read into memory") +end) diff --git a/spec/feature_spec.lua b/spec/feature_spec.lua index 62355d4..e2f0ba6 100644 --- a/spec/feature_spec.lua +++ b/spec/feature_spec.lua @@ -61,6 +61,16 @@ describe("csv features", function() assert.are.same(expected, actual) end) + it("should handle escaped doublequotes", function() + local expected = {} + expected[1] = {} + expected[1].a = 'A"B""C' + expected[1].b = 'A""B"C' + expected[1].c = 'A"""B""C' + local actual = ftcsv.parse('a;b;c\n"A""B""""C";"A""""B""C";"A""""""B""""C"', ";", {loadFromString=true}) + assert.are.same(expected, actual) + end) + it("should handle renaming a field", function() local expected = {} expected[1] = {} @@ -308,4 +318,24 @@ describe("csv features", function() assert.are.same(expected, actual) end) -end) \ No newline at end of file + it("should handle headers attempting to escape", function() + local expected = {} + expected[1] = {} + expected[1]["]] print('hello')"] = "apple" + expected[1].b = "banana" + 
expected[1].c = "carrot" + local actual = ftcsv.parse("]] print('hello'),b,c\napple,banana,carrot", ",", {loadFromString=true}) + assert.are.same(expected, actual) + end) + + it("should handle ignoring the single quote", function() + local expected = {} + expected[1] = {} + expected[1].a = '"apple' + expected[1].b = "banana" + expected[1].c = "carrot" + local actual = ftcsv.parse('a,b,c\n"apple,banana,carrot', ",", {loadFromString=true, ignoreQuotes=true}) + assert.are.same(expected, actual) + end) + +end) diff --git a/spec/parseLine_spec.lua b/spec/parseLine_spec.lua new file mode 100644 index 0000000..1710ff6 --- /dev/null +++ b/spec/parseLine_spec.lua @@ -0,0 +1,76 @@ +local ftcsv = require('ftcsv') +local cjson = require('cjson') + +local function loadFile(textFile) + local file = io.open(textFile, "r") + if not file then error("File not found at " .. textFile) end + local allLines = file:read("*all") + file:close() + return allLines +end + +describe("parseLine features small, working buffer size", function() + it("should handle correctness", function() + local json = loadFile("spec/json/correctness.json") + json = cjson.decode(json) + local parse = {} + for i, line in ftcsv.parseLine("spec/csvs/correctness.csv", ",", {bufferSize=52}) do + assert.are.same(json[i], line) + parse[i] = line + end + assert.are.same(#json, #parse) + assert.are.same(json, parse) + end) +end) + +describe("parseLine features small, nonworking buffer size", function() + it("should handle correctness", function() + local test = function() + local parse = {} + for i, line in ftcsv.parseLine("spec/csvs/correctness.csv", ",", {bufferSize=63}) do + parse[i] = line + end + return parse + end + assert.has_error(test, "ftcsv: bufferSize needs to be larger to parse this file") + end) +end) + +describe("parseLine features smaller, nonworking buffer size", function() + it("should handle correctness", function() + local test = function() + local parse = {} + for i, line in 
ftcsv.parseLine("spec/csvs/correctness.csv", ",", {bufferSize=50}) do + parse[i] = line + end + return parse + end + assert.has_error(test, "ftcsv: bufferSize needs to be larger to parse this file") + end) +end) + +describe("smaller bufferSize than header and incorrect number of fields", function() + it("should handle correctness", function() + local test = function() + local parse = {} + for i, line in ftcsv.parseLine("spec/csvs/correctness.csv", ",", {bufferSize=23}) do + parse[i] = line + end + return parse + end + assert.has_error(test, "ftcsv: bufferSize needs to be larger to parse this file") + end) +end) + +describe("smaller bufferSize than header, but with correct field numbers", function() + it("should handle correctness", function() + local test = function() + local parse = {} + for i, line in ftcsv.parseLine("spec/csvs/correctness.csv", ",", {bufferSize=30}) do + parse[i] = line + end + return parse + end + assert.has_error(test, "ftcsv: bufferSize needs to be larger to parse this file") + end) +end) diff --git a/spec/parse_encode_spec.lua b/spec/parse_encode_spec.lua index b7e1366..0ae250f 100644 --- a/spec/parse_encode_spec.lua +++ b/spec/parse_encode_spec.lua @@ -43,6 +43,22 @@ describe("csv decode", function() end end) +describe("csv parseLine decode", function() + for _, value in ipairs(files) do + it("should handle " .. value, function() + local json = loadFile("spec/json/" .. value .. ".json") + json = cjson.decode(json) + local parse = {} + for i, v in ftcsv.parseLine("spec/csvs/" .. value .. ".csv", ",") do + parse[i] = v + assert.are.same(json[i], v) + end + assert.are.same(#json, #parse) + assert.are.same(json, parse) + end) + end +end) + describe("csv decode from string", function() for _, value in ipairs(files) do it("should handle " .. value, function() @@ -70,4 +86,4 @@ describe("csv encode", function() assert.are.same(jsonDecode, reEncoded) end) end -end) \ No newline at end of file +end)
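As a usage sketch of the new line-by-line mode introduced in this release: `ftcsv.parseLine` reads the file from disk in `bufferSize`-byte chunks and returns an iterator, as exercised in `spec/parseLine_spec.lua` above. The sample file name and contents below are illustrative only, and the snippet assumes ftcsv 1.2.0 is installed and on the package path.

```lua
local ftcsv = require("ftcsv")

-- parseLine reads from disk (loadFromString is not supported),
-- so write a small illustrative sample file first
local f = assert(io.open("sample.csv", "w"))
f:write("a,b,c\r\n1,2,3\r\n4,5,6\r\n")
f:close()

-- bufferSize must be at least as long as the longest row;
-- it defaults to 2^16 when unspecified
for rowNum, row in ftcsv.parseLine("sample.csv", ",", {bufferSize = 2^16}) do
    print(rowNum, row.a, row.b, row.c)
end

os.remove("sample.csv")
```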
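Similarly, a minimal sketch of the new `ignoreQuotes` option, mirroring the behavior pinned down in `spec/feature_spec.lua`: quotes are treated as ordinary field content rather than delimiters. Assumes ftcsv 1.2.0 is on the package path.

```lua
local ftcsv = require("ftcsv")

-- with default quote handling this input errors out, because the
-- opening quote before "apple" is never closed
local rows = ftcsv.parse('a,b,c\n"apple,banana,carrot', ",",
                         {loadFromString = true, ignoreQuotes = true})

-- the stray quote is kept as literal field content
print(rows[1].a)  -- "apple
```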
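For reference, the encoder's quoting rule (the `delimitField` helper in the diff above) can be exercised in isolation. This standalone copy doubles embedded double quotes, per RFC 4180, and the caller then wraps the result in quotes:

```lua
-- standalone copy of the encoder's delimitField helper from the patch;
-- note that gsub also returns a substitution count, so callers should
-- take only the first return value
local function delimitField(field)
    field = tostring(field)
    if field:find('"') then
        return field:gsub('"', '""')
    else
        return field
    end
end

local escaped = delimitField('he said "hi"')
print('"' .. escaped .. '"')  -- "he said ""hi"""
```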