Amazon S3 Data Dictionary
  • 27 Jul 2021
  • 2 Minutes to read
  • Dark
    Light
  • PDF

Amazon S3 Data Dictionary

  • Dark
    Light
  • PDF

Article Summary

Because Amazon S3 data comes from a database system, Panoply cannot provide a data dictionary. But Panoply does automate the data schema for the collected data. This section includes useful information about the Panoply automations. You can adjust these settings in your data source under Advanced Settings.

File Types

Several of the supported file types have special requirements or handling.

Excel (.xslx) Files

Panoply collects the file into the destination table you entered. To have each sheet (tab) within the file into its own table, users must enter additional parameters.

Text (.txt) Files

Text files txt files can be tab delimited, which Panoply will automatically use to mark the columns. Other delimiters may also be used, but the user will have to specify that delimeter on the Advanced Settings tab.

Archive Files (.tar .gzip .zip)

Collection will fail if there are unsupported file types in the archive file.

Header Handling

The first row of the spreadsheet or file with values will be considered the header for the table. If a column has data but nothing in the header row, Panoply will add a header consisting of the word column + the column letter, such as column A.

For duplicate headers, Panoply will rename the second instance of the header by appending the header with column and the column letter, such as Name Column A. Columns with a header and no data will not be collected.

A column with a header but without values will be ignored.

Wildcards:

The path pattern is the way to decide whether or not to sync certain files. It applies to everything under the S3 bucket. The path pattern supports wildcards for better flexibility in finding specific files and/or folders:

  • * matches everything
  • ? matches any single character
  • [seq] matches any character in seq
  • [!seq] matches any character not in seq

For example, if the bucket orders has three folders named: 2021Q1, 2021Q2, and 2021Summary, using the pattern [0-9][0-9][0-9][0-9]Q[1-4]/*, you could include only the folders for each quarter (will exclude the 2021Summary folder). These are just a few of the use cases for using wildcards.

  • Use case: list all .csv files in my_folder
    • Pattern: my_folder/*.csv
    • Includes: my_folder/test.csv, my_folder/testtest.csv
    • Excludes: my_folder/test.txt, my_folder/test.json
  • Use case: list all files in specific subfolder
    • Pattern: my_folder/*/logs/*
    • Includes: my_folder/2020/logs/test1.txt, my_folder/2020/05/logs/test2.txt
    • Excludes: my_folder/test3.txt, my_folder/2020/errors/test4.txt
  • Use case: list all files which end with number
    • Pattern: my_folder/*[0-9].*
    • Includes: my_folder/test1.txt, my_folder/subfolder/test2.json
    • Excludes: my_folder/test.json, my_folder/subfolder/test.json

Similarly [a-z] will match any single letter

Metadata

  • id: If the user does not enter a primary key, and no id column exists in the source, Panoply will insert an id. Formatted as a GUID, such as 2cd570d1-a11d-4593-9d29-9e2488f0ccc2z.
  • __s3bucket: The source S3 bucket for the record.
  • __s3key: The source file name in S3 for the record.
  • __filename: The name of the source file.
  • __senttime: Formatted as a datetime, such as 2020-04-26T01:26:14.695Z.
  • __updatetime: Formatted as a datetime, such as 2020-04-26T01:26:14.695Z.
  • __sheetname: The name of the sheet (Excel files.
  • __xml_root: The name of the XML root element.
  • __xml_item: The name of a child element (XML)

Was this article helpful?