My last post, on scraping data about FINRA registrants for a particular financial institution, got me thinking about getting a list of all the firms that FINRA regulates.
FINRA does provide a site with a “listing” of all the firms they regulate, but I wanted all the raw data for use in a spreadsheet (or wherever), and that was the challenge.
This was a great opportunity to use the Ruby Mechanize gem alongside the Nokogiri gem for parsing the output. Together these two gems are very powerful, crawling and mining data with beautiful efficiency: Mechanize fetches the pages, Nokogiri parses them, and I get my “true” list of all FINRA members.
require 'nokogiri'
require 'mechanize'
require 'open-uri'
ary_of_members = Array.new
rows = Array.new
agent = Mechanize.new
### Defines Array Class for HTML Table output ###
class Array
  # Wrap each element in the given HTML tag and join the results
  def to_cells(tag)
    self.map { |c| "<#{tag}>#{c}</#{tag}>" }.join
  end
end
file = File.open('./finra_members.html', "w")
page = agent.get('http://www.finra.org/AboutFINRA/MemberFirms/ListOfMembers/p012909')
doc = Nokogiri::HTML(page.body)
doc.css('.FNRW_Alphabetical_DL-result').each do |linkz|
  page = agent.get('http://www.finra.org' + linkz['href'])
  doc = Nokogiri::HTML(page.body)
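  ##### Search for nodes by css
  doc.css('#col2cont span').each do |firm|
    # Skip the "Mailing Address" labels and the styled wrapper spans
    next if firm.to_s =~ /Mailing Address|10pt'>.*?<\/p><\/span>/
    # Keep only spans that contain actual text
    ary_of_members << firm.text if firm.text =~ /\w+/
  end
end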
# Names and addresses alternate in the array, so take them two at a time
ary_of_members.each_slice(2) do |name, address|
  rows << {"FINRA Member" => name, "FINRA Member Address" => address}
end
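### Rolls HTML Table output ###
# Wrap each row in <tr> and the whole thing in <table>
headers = "<tr>#{rows[0].keys.to_cells('th')}</tr>"
cells = rows.map do |row|
  "<tr>#{row.values.to_cells('td')}</tr>"
end.join("\n  ")
table = "<table>\n  #{headers}\n  #{cells}\n</table>"
file.puts table
file.close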
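Since the whole point was to get raw data into a spreadsheet, here is a minimal sketch of how the same alternating name/address array could be written out as CSV instead of an HTML table. It uses Ruby's standard csv library. The sample firms and the members_to_csv helper are made up for illustration and are not part of the script above.

```ruby
require 'csv'

# Hypothetical sample data standing in for the scraped array:
# names and addresses alternate, as in ary_of_members.
sample_members = [
  'Example Securities LLC', '1 Main Street, Springfield, IL',
  'Sample Brokerage Inc.',  '200 Market Ave, Suite 5, Columbus, OH'
]

# Pair each name with the address that follows it and render CSV text.
# (members_to_csv is an illustrative helper, not part of the original script.)
def members_to_csv(flat_list)
  CSV.generate do |csv|
    csv << ['FINRA Member', 'FINRA Member Address']
    flat_list.each_slice(2) do |name, address|
      csv << [name, address]
    end
  end
end

csv_text = members_to_csv(sample_members)
puts csv_text
```

The CSV library quotes any field that contains a comma, which matters here because most addresses do. Writing csv_text to a file with File.write would give something a spreadsheet can open directly.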